Document Processing Using Machine Learning

Document Processing Using Machine Learning presents a set of resources for students and researchers who apply machine learning in the document image analysis (DIA) domain, covering multiple document processing problems. Starting with an explanation of how Artificial Intelligence (AI) plays an important role in this domain, the book then discusses how different machine learning algorithms can be applied to classification/recognition and clustering problems regardless of the type of input data: images or text.



In brief, the book offers comprehensive coverage of the most essential topics, including:



- The role of AI for document image analysis

- Optical character recognition

- Machine learning algorithms for document analysis

- Extreme learning machines and their applications

- Mathematical foundation for Web text document analysis

- Social media data analysis

- Modalities for document dataset generation



This book serves both undergraduate and graduate scholars in Computer Science, Information Technology, and Electrical and Computer Engineering. It is also a good fit for early-career research scientists and industrialists in the domain.

Author(s): Sk Md Obaidullah; KC Santosh; Teresa Gonçalves; Nibaran Das; Kaushik Roy
Publisher: CRC Press
Year: 2020

Language: English
Pages: xiv+168

Cover
Half Title
Title Page
Copyright Page
Table of Contents
Preface
Editors
Contributors
1: Artificial Intelligence for Document Image Analysis
1.1 Introduction
1.2 Optical Character Recognition
1.2.1 Dealing with Noise
1.2.2 Segmentation
1.2.3 Applications
1.2.3.1 Legal Industry
1.2.3.2 Banking
1.2.3.3 Healthcare
1.2.3.4 CAPTCHA
1.2.3.5 Automatic Number Recognition
1.2.3.6 Handwriting Recognition
1.3 Natural Language Processing
1.3.1 Tokenization
1.3.2 Stop Word Removal
1.3.3 Stemming
1.3.4 Part of Speech Tagging
1.3.5 Parsing
1.3.6 Applications
1.3.6.1 Text Summarization
1.3.6.2 Question Answering
1.3.6.3 Text Categorization
1.3.6.4 Sentiment Analysis
1.3.6.5 Word Sense Disambiguation
1.4 Conclusion
References
2: An Approach toward Character Recognition of Bangla Handwritten Isolated Characters
2.1 Introduction
2.2 Proposed Framework
2.2.1 Database
2.2.2 Feature Extraction
2.2.3 Attribute Selection and Classification
2.3 Results and Discussion
2.3.1 Comparative Study
2.4 Conclusion
References
3: Artistic Multi-Character Script Identification
3.1 Introduction
3.2 Literature Review
3.3 Data Collection and Preprocessing
3.4 Feature Extraction
3.4.1 Topology-Based Features
3.4.2 Texture Feature
3.5 Experiments
3.5.1 Estimation Procedure
3.5.2 Results and Analysis
3.6 Conclusion
References
4: A Study on the Extreme Learning Machine and Its Applications
4.1 Introduction
4.2 Preliminaries
4.3 Activation Functions of ELM
4.3.1 Sigmoid Function
4.3.2 Hardlimit Function (‘Hardlim’)
4.3.3 Radial Basis Function (‘Radbas’)
4.3.4 Sine Function
4.3.5 Triangular Basis Function (‘Tribas’)
4.4 Metamorphosis of an ELM
4.5 Applications of ELMs
4.5.1 ELMs in Document Analysis
4.5.2 ELMs in Medicine
4.5.3 ELM in Audio Signal Processing
4.5.4 ELM in Other Pattern Recognition Problems
4.6 Challenges of ELM
4.7 Conclusion
References
5: A Graph-Based Text Classification Model for Web Text Documents
5.1 Introduction
5.2 Related Works
5.2.1 English
5.2.2 Chinese, Japanese and Persian
5.2.3 Arabic and Urdu
5.2.4 Indian Languages except Bangla
5.2.5 Bangla
5.3 Proposed Methodology
5.3.1 Data Collection
5.3.2 Pre-Processing
5.3.3 Graph-Based Representation
5.3.4 Classifier
5.4 Results and Analysis
5.4.1 Comparison with Existing Methods
5.5 Conclusion
Acknowledgment
References
6: A Study of Distance Metrics in Document Classification
6.1 Introduction
6.2 Literature Survey
6.2.1 Indo–European
6.2.2 Sino–Tibetan
6.2.3 Japonic
6.2.4 Afro–Asiatic
6.2.5 Dravidian
6.2.6 Indo–Aryan
6.3 Proposed Methodology
6.3.1 Data Collection
6.3.2 Pre-Processing
6.3.3 Feature Extraction and Selection
6.3.4 Distance Measurement
6.3.4.1 Squared Euclidean Distance
6.3.4.2 Manhattan Distance
6.3.4.3 Mahalanobis Distance
6.3.4.4 Minkowski Distance
6.3.4.5 Chebyshev Distance
6.3.4.6 Canberra Distance
6.4 Results and Discussion
6.4.1 Comparison with Existing Methods
6.5 Conclusion
Acknowledgment
References
7: A Study of Proximity of Domains for Text Categorization
7.1 Introduction
7.2 Existing Work
7.3 Proposed Methodology
7.3.1 Data Collection
7.3.2 Pre-Processing
7.3.3 Feature Extraction and Selection
7.3.4 Classifiers
7.4 Results and Analysis
7.5 Conclusion
Acknowledgment
References
8: Supervised Learning for Aggression Identification and Author Profiling over Twitter Dataset
8.1 Introduction
8.2 Overview of Aggression Identification
8.2.1 Dataset
8.2.2 Data Characteristics
8.2.3 Data Preprocessing
8.2.4 Feature Extraction
8.2.5 Experimental Setup
8.2.6 System Modeling
8.2.7 Results
8.3 Overview of Author Profiling
8.3.1 Datasets
8.3.2 Preprocessing
8.3.3 Feature Extraction
8.3.4 Experimental Setup
8.3.5 Algorithm and Fine-Tuning the Model
8.3.6 Results
8.4 Conclusion and Future Work
References
9: The Effect of Using Features Computed from Generated Offline Images for Online Bangla Handwritten Character Recognition
9.1 Introduction
9.2 Literature Review
9.2.1 Direction Code-Based Feature [50]
9.2.2 Area and Local Features
9.2.2.1 Area Feature
9.2.2.2 Local Feature
9.2.3 Point-Based Feature [2, 48]
9.2.4 Transition Count Feature [61]
9.2.5 Topological Feature [61]
9.2.5.1 Crossing Point
9.3 Database Preparation and Pre-processing
9.3.1 Design of the Data Collection Form
9.4 Feature Extraction
9.4.1 Directed Hausdorff Distance (DHD)-Based Features
9.5 Experimental Results and Analysis
9.6 Conclusion
References
10: Handwritten Character Recognition for Palm-Leaf Manuscripts
10.1 Introduction
10.2 Palm-Leaf Manuscripts
10.3 Challenges in OHCR for Palm-Leaf Manuscripts
10.4 Document Processing and Recognition for Palm-Leaf Manuscripts
10.4.1 Preprocessing
10.4.1.1 Binarization
10.4.1.2 Noise Reduction
10.4.1.3 Skew Correction
10.4.2 Segmentation
10.4.2.1 Text Line Segmentation
10.4.2.2 Character Segmentation
10.4.3 Recognition
10.4.3.1 Segmentation-Based Approach
10.4.3.2 Segmentation-Free Approach