This book examines the methods and processes used to annotate and process language corpora in advanced, semi-advanced, and non-advanced languages. It provides the background information and empirical data needed to understand the nature and depth of the problems in corpus annotation and text processing, and it shows how the linguistic elements found in texts are analyzed and applied in developing language technology systems and devices. It offers valuable insights for researchers, educators, and students of linguistics and language technology.
Author(s): Niladri Sekhar Dash
Edition: 1
Publisher: Springer Singapore
Year: 2021
Language: English
Pages: 302
Preface
Acknowledgments
Introduction
Why This Book
Text Annotation Versus Text Processing
Summary of the Chapters
Value of this Book
References
Contents
About the Author
Abbreviations
1 Corpus Text Annotation
1.1 Introduction: What Is Text Annotation?
1.2 Characteristics of Annotation
1.3 Kinds of Text Annotation
1.4 Criteria of Text Annotation
1.5 Maxims of Text Annotation
1.6 Justification of Text Annotation
1.7 Annotation Schemas and Models
1.8 Types of Text Annotation
1.8.1 Intralinguistic Annotation
1.8.2 Extralinguistic Annotation
1.9 Present State of Text Annotation
1.10 Utilization of Annotated Texts
References
2 Principles and Rules of Part-of-Speech Annotation
2.1 Introduction
2.2 Principles of POS Annotation
2.3 Rules for POS Annotation
2.4 Conclusion
References
3 Part-of-Speech Annotation
3.1 Introduction
3.2 Concept of POS Annotation
3.3 Morphological Analysis Versus POS Annotation
3.4 Levels of POS Annotation
3.5 Stages of POS Annotation
3.5.1 Pre-editing Stage
3.5.2 POS Assignment Stage
3.5.3 Postediting Stage
3.6 Earlier Methods of POS Annotation
3.7 POS Annotation: Indian Scenario
3.8 The BIS POS Tagset
3.9 The BIS Tagset and Bengali
3.9.1 Utility of a POS Annotated Text
References
4 Extratextual Annotation
4.1 Introduction
4.2 Definition of Extratextual Annotation
4.3 Intratextual and Extratextual Annotation
4.4 Relevance of Extratextual Annotation
4.5 Extratextual Annotation: Some Early Works
4.6 Extratextual Annotation Types
4.6.1 File Name: A Gateway
4.6.2 Annotation of Text Categories
4.6.3 Subject Category Annotation
4.6.4 Annotating Title of a Text
4.6.5 Header Part: Metadata Depository
4.7 Conclusion
References
5 Etymological Annotation
5.1 Introduction
5.2 Lexical Borrowing: Some Scenarios
5.3 Vocabulary Classification
5.4 Defining Etymological Annotation Tagset
5.5 Defining Annotation Strategy
5.5.1 Annotating Borrowed Words
5.5.2 Annotating Portmanteau Words
5.5.3 Annotating Affixed Words
5.5.4 Annotating Inflected Words
5.6 Process of Etymological Annotation
5.7 Findings of an Etymologically Annotated Text
5.8 Frequently Used English Words in Bengali
5.9 System Adoption at Lexical Level
5.10 Conclusion
References
6 More Types of Corpus Annotation
6.1 Introduction
6.2 Orthographic Annotation
6.3 Prosodic Annotation
6.4 Semantic Annotation
6.5 Discourse Annotation
6.6 Rhetoric Annotation
6.7 Conclusion
References
7 Morphological Processing of Words
7.1 Introduction
7.2 Models and Approaches
7.2.1 Two-Level Morphology-Based Approach
7.2.2 Paradigm-Based Approach
7.2.3 Stemmer-Based Approach
7.2.4 Acyclic Graph-Based Approach
7.2.5 Morph-Based Approach
7.2.6 Corpus-Based Approach
7.2.7 Suffix Stripping-Based Approach
7.3 Issues in Morphological Processing
7.4 Method of Information Storage
7.5 Method of Morphological Processing
7.6 Processing Detached Words
7.7 Results of Morphological Processing
7.7.1 Rightly Processed Words
7.7.2 Wrongly Processed Words
7.7.3 Double Processed Words
7.7.4 Non-Processed Words
7.8 Ambiguity in Morphological Processing
7.9 Conclusion
References
8 Lemmatization of Inflected Nouns
8.1 Introduction
8.2 Lemma(-tization)
8.3 Lemmatization and Stemming
8.3.1 Lemmatization is Similar to Stemming
8.3.2 Lemmatization is Different from Stemming
8.4 Lemmatization in English and Other Languages
8.5 Surface Structure of Bengali Nouns
8.6 Stages for Noun Lemmatization
8.6.1 Stage 1: POS Annotation
8.6.2 Stage 2: Noun Identification and Isolation
8.6.3 Stage 3: Alphabetical Sorting of Nouns
8.6.4 Stage 4: Noun Classification
8.6.5 Stage 5: Tokenization
8.7 Operation of Lemmatization Process
8.8 Conclusion
Appendix
References
9 Decomposition of Inflected Verbs
9.1 Introduction
9.2 Lexical Decomposition
9.3 Some Early Works
9.4 Morpheme Structure of Bengali Verbs
9.4.1 Root Part (dhātu)
9.4.2 Suffix Part (Bibhakti)
9.5 Conjugation of Bengali Verbs
9.6 Categorization of Bengali Verb Roots
9.6.1 Verb Root Categorization
9.6.2 Verb Suffix Categorization
9.7 Issues in Lexical Decomposition
9.8 Method of Information Storage
9.9 Decomposing Non-conjugated Verbs
9.10 Decomposing Conjugated Verbs
9.11 Data Creation and Storage
9.11.1 Suffix and Root Detection
9.11.2 Suffix–Root Mapping
9.12 Some Special Cases
9.13 The Resultant Output
9.14 Performance of the System
9.15 Conclusion
References
10 Syntactic Annotation
10.1 Introduction
10.2 Ambiguity of the Term
10.3 Transition of the Concept of Syntactic Annotation
10.4 Goals of Syntactic Annotation
10.5 Challenges Involved in Syntactic Annotation
10.6 What is a Syntactic Annotator (Parser)?
10.6.1 Detection of End of a Sentence
10.6.2 Tokenization of Words in Sentences
10.6.3 Grammatical Annotation
10.6.4 Chunking Larger Lexical Blocks
10.6.5 Finding Matrix Verb in Sentence
10.6.6 Identifying Non-Matrix Clauses
10.6.7 Identification of Phrases
10.7 Types of Syntactic Annotation
10.8 Treebank
10.9 Utility of Syntactic Annotated Texts
10.10 Conclusion
References
Author Index
Subject Index