This book offers a perspective on the application of machine learning-based methods to knowledge discovery from natural language texts. Analysing such data sets can surface conclusions that are not otherwise evident and that can be used for various purposes and applications. The book explains the principles of time-proven machine learning algorithms applied in text mining, together with step-by-step demonstrations of how to reveal the semantic content of real-world datasets using the popular R language and its implemented machine learning algorithms. The book is aimed not only at IT specialists but at a wider audience that needs to process large sets of text documents and has basic knowledge of the subject, such as e-mail service providers, online shoppers, and librarians.
The book starts with an introduction to text-based natural language data processing, its goals, and its problems. It focuses on machine learning, presenting various algorithms with their uses and possibilities and reviewing their strengths and weaknesses. Beginning with the initial data pre-processing, readers can follow the steps carried out in the R language, including the incorporation of various available packages into the resulting software tool. A big advantage is that R already provides many libraries implementing machine learning algorithms, so readers can concentrate on the principal task without having to implement the algorithmic details themselves. To help make sense of the results, the book also explains the algorithms, which supports the final evaluation and interpretation of the results. The examples are demonstrated using real-world data from commonly accessible Internet sources.
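As a taste of the workflow the book walks through, the sketch below builds a bag-of-words matrix and trains a naive Bayes classifier in a few lines of R. It is only an illustration, not an excerpt from the book; the tm and e1071 packages and the toy spam/ham documents are assumptions chosen for brevity.

# A minimal sketch, assuming the tm and e1071 packages are installed;
# the toy documents and spam/ham labels are invented for illustration.
library(tm)     # corpus handling and document-term matrices
library(e1071)  # provides the naiveBayes() learner

texts  <- c("cheap pills buy now", "meeting agenda attached",
            "win money fast", "project schedule update")
labels <- factor(c("spam", "ham", "spam", "ham"))

corpus <- VCorpus(VectorSource(texts))   # wrap the raw texts
dtm    <- DocumentTermMatrix(corpus)     # bag-of-words term counts

# Recode counts as presence/absence factors so naiveBayes() builds
# smoothed categorical tables rather than Gaussian estimates, which
# behave poorly on tiny count data.
train   <- as.data.frame(as.matrix(dtm) > 0)
train[] <- lapply(train, factor, levels = c(FALSE, TRUE))

model <- naiveBayes(train, labels, laplace = 1)  # training is one call
predict(model, train)                            # resubstitution check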
Authors: Jan Žižka; František Dařena; Arnoš Svoboda
Publisher: CRC Press
Year: 2020
Language: English
Pages: xiv+351
Cover
Title Page
Copyright Page
Dedication
Preface
Contents
Authors’ Biographies
1. Introduction to Text Mining with Machine Learning
1.1 Introduction
1.2 Relation of Text Mining to Data Mining
1.3 The Text Mining Process
1.4 Machine Learning for Text Mining
1.4.1 Inductive Machine Learning
1.5 Three Fundamental Learning Directions
1.5.1 Supervised Machine Learning
1.5.2 Unsupervised Machine Learning
1.5.3 Semi-supervised Machine Learning
1.6 Big Data
1.7 About This Book
2. Introduction to R
2.1 Installing R
2.2 Running R
2.3 RStudio
2.3.1 Projects
2.3.2 Getting Help
2.4 Writing and Executing Commands
2.5 Variables and Data Types
2.6 Objects in R
2.6.1 Assignment
2.6.2 Logical Values
2.6.3 Numbers
2.6.4 Character Strings
2.6.5 Special Values
2.7 Functions
2.8 Operators
2.9 Vectors
2.9.1 Creating Vectors
2.9.2 Naming Vector Elements
2.9.3 Operations with Vectors
2.9.4 Accessing Vector Elements
2.10 Matrices and Arrays
2.11 Lists
2.12 Factors
2.13 Data Frames
2.14 Functions Useful in Machine Learning
2.15 Flow Control Structures
2.15.1 Conditional Statement
2.15.2 Loops
2.16 Packages
2.16.1 Installing Packages
2.16.2 Loading Packages
2.17 Graphics
3. Structured Text Representations
3.1 Introduction
3.2 The Bag-of-Words Model
3.3 The Limitations of the Bag-of-Words Model
3.4 Document Features
3.5 Standardization
3.6 Texts in Different Encodings
3.7 Language Identification
3.8 Tokenization
3.9 Sentence Detection
3.10 Filtering Stop Words, Common, and Rare Terms
3.11 Removing Diacritics
3.12 Normalization
3.12.1 Case Folding
3.12.2 Stemming and Lemmatization
3.12.3 Spelling Correction
3.13 Annotation
3.13.1 Part of Speech Tagging
3.13.2 Parsing
3.14 Calculating the Weights in the Bag-of-Words Model
3.14.1 Local Weights
3.14.2 Global Weights
3.14.3 Normalization Factor
3.15 Common Formats for Storing Structured Data
3.15.1 Attribute-Relation File Format (ARFF)
3.15.2 Comma-Separated Values (CSV)
3.15.3 C5 Format
3.15.4 Matrix Files for CLUTO
3.15.5 SVMlight Format
3.15.6 Reading Data in R
3.16 A Complex Example
4. Classification
4.1 Sample Data
4.2 Selected Algorithms
4.3 Classifier Quality Measurement
5. Bayes Classifier
5.1 Introduction
5.2 Bayes’ Theorem
5.3 Optimal Bayes Classifier
5.4 Naïve Bayes Classifier
5.5 Illustrative Example of Naïve Bayes
5.6 Naïve Bayes Classifier in R
5.6.1 Running Naïve Bayes Classifier in RStudio
5.6.2 Testing with an External Dataset
5.6.3 Testing with 10-Fold Cross-Validation
6. Nearest Neighbors
6.1 Introduction
6.2 Similarity as Distance
6.3 Illustrative Example of k-NN
6.4 k-NN in R
7. Decision Trees
7.1 Introduction
7.2 Entropy Minimization-Based C5 Algorithm
7.2.1 The Principle of Generating Trees
7.2.2 Pruning
7.3 C5 Tree Generator in R
7.3.1 Generating a Tree
7.3.2 Information Acquired from C5-Tree
7.3.3 Using Testing Samples to Assess Tree Accuracy
7.3.4 Using Cross-Validation to Assess Tree Accuracy
7.3.5 Generating Decision Rules
8. Random Forest
8.1 Introduction
8.1.1 Bootstrap
8.1.2 Stability and Robustness
8.1.3 Which Tree Algorithm?
8.2 Random Forest in R
9. AdaBoost
9.1 Introduction
9.2 Boosting Principle
9.3 AdaBoost Principle
9.4 Weak Learners
9.5 AdaBoost in R
10. Support Vector Machines
10.1 Introduction
10.2 Support Vector Machines Principles
10.2.1 Finding Optimal Separation Hyperplane
10.2.2 Nonlinear Classification and Kernel Functions
10.2.3 Multiclass SVM Classification
10.2.4 SVM Summary
10.3 SVM in R
11. Deep Learning
11.1 Introduction
11.2 Artificial Neural Networks
11.3 Deep Learning in R
12. Clustering
12.1 Introduction to Clustering
12.2 Difficulties of Clustering
12.3 Similarity Measures
12.3.1 Cosine Similarity
12.3.2 Euclidean Distance
12.3.3 Manhattan Distance
12.3.4 Chebyshev Distance
12.3.5 Minkowski Distance
12.3.6 Jaccard Coefficient
12.4 Types of Clustering Algorithms
12.4.1 Partitional (Flat) Clustering
12.4.2 Hierarchical Clustering
12.4.3 Graph-Based Clustering
12.5 Clustering Criterion Functions
12.5.1 Internal Criterion Functions
12.5.2 External Criterion Function
12.5.3 Hybrid Criterion Functions
12.5.4 Graph-Based Criterion Functions
12.6 Deciding on the Number of Clusters
12.7 K-Means
12.8 K-Medoids
12.9 Criterion Function Optimization
12.10 Agglomerative Hierarchical Clustering
12.11 Scatter-Gather Algorithm
12.12 Divisive Hierarchical Clustering
12.13 Constrained Clustering
12.14 Evaluating Clustering Results
12.14.1 Metrics Based on Counting Pairs
12.14.2 Purity
12.14.3 Entropy
12.14.4 F-Measure
12.14.5 Normalized Mutual Information
12.14.6 Silhouette
12.14.7 Evaluation Based on Expert Opinion
12.15 Cluster Labeling
12.16 A Few Examples
13. Word Embeddings
13.1 Introduction
13.2 Determining the Context and Word Similarity
13.3 Context Windows
13.4 Computing Word Embeddings
13.5 Aggregation of Word Vectors
13.6 An Example
14. Feature Selection
14.1 Introduction
14.2 Feature Selection as State Space Search
14.3 Feature Selection Methods
14.3.1 Chi Squared (χ²)
14.3.2 Mutual Information
14.3.3 Information Gain
14.4 Term Elimination Based on Frequency
14.5 Term Strength
14.6 Term Contribution
14.7 Entropy-Based Ranking
14.8 Term Variance
14.9 An Example
References
Index