Linguistic annotation and text analytics are active areas of research and development, with academic conferences and industry events such as the Linguistic Annotation Workshops and the annual Text Analytics Summits. This book provides a basic introduction to both fields, and aims to show that good linguistic annotations are the essential foundation for good text analytics. After briefly reviewing the basics of XML, with practical exercises illustrating in-line and stand-off annotations, a chapter is devoted to explaining the different levels of linguistic annotations. The reader is encouraged to create example annotations using the WordFreak linguistic annotation tool. The next chapter shows how annotations can be created automatically using statistical NLP tools, and compares two sets of tools, the OpenNLP and Stanford NLP tools. The second half of the book describes different annotation formats and gives practical examples of how to interchange annotations between different formats using XSLT transformations. The two main text analytics architectures, GATE and UIMA, are then described and compared, with practical exercises showing how to configure and customize them. The final chapter is an introduction to text analytics, describing the main applications and functions including named entity recognition, coreference resolution and information extraction, with practical examples using both open source and commercial tools.

Copies of the example files, scripts, and stylesheets used in the book are available from the companion website, located at http://sites.morganclaypool.com/wilcock.

Table of Contents: Working with XML / Linguistic Annotation / Using Statistical NLP Tools / Annotation Interchange / Annotation Architectures / Text Analytics
Author(s): Graham Wilcock
Edition: 1
Year: 2009
Language: English
Pages: 160
Tags: Computer Science and Computer Engineering;Artificial Intelligence;Computational Linguistics;
Preface
Introduction
XML Basics
XML Parsing and Validation
XML Transformations
In-Line Annotations
Stand-Off Annotations
Annotation Standards
Further Reading
Levels of Linguistic Annotation
WordFreak Annotation Tool
Sentence Boundaries
Tokenization
Part-of-Speech Tagging
Syntactic Parsing
Semantics and Discourse
WordFreak with OpenNLP
Further Reading
Statistical Models
Sentences and Tokenization
Statistical Tagging
Chunking and Parsing
Named Entity Recognition
Coreference Resolution
Further Reading
XSLT Transformations
WordFreak-OpenNLP Transformation
GATE XML Format
GATE-WordFreak Transformation
XML Metadata Interchange: XMI
WordFreak-XMI Transformation
Towards Interoperability
Further Reading
GATE
GATE Information Extraction Tools
Annotations with JAPE Rules
Customizing GATE Gazetteers
UIMA
UIMA Wrappers for OpenNLP Tools
Annotations with Regular Expressions
Customizing UIMA Dictionaries
Further Reading
Text Analytics Tools
Named Entity Recognition
Training Statistical Models
Coreference Resolution
Information Extraction
Text Mining and Searching
New Directions
Further Reading
Bibliography