Python 2.6 Text Processing: Beginner's Guide

Author(s): Jeff McNeil
Publisher: Packt Publishing
Year: 2010

Language: English
Pages: 380

Cover
Copyright
Credits
About the Author
About the Reviewer
Table of Contents
Preface
Chapter 1: Getting Started
Categorizing types of text data
Providing information through markup
Meaning through structured formats
Understanding freeform content
Ensuring you have Python installed
Providing support for Python 3
Implementing a simple cipher
Time for action – implementing a ROT13 encoder
Processing structured markup with a filter
Time for action – processing as a filter
Time for action – skipping over markup tags
State machines
Supporting third-party modules
Packaging in a nutshell
Time for action – installing SetupTools
Running a virtual environment
Configuring virtualenv
Time for action – configuring a virtual environment
Where to get help?
Summary
Chapter 2: Working with the IO System
Parsing web server logs
Time for action – generating transfer statistics
Using objects interchangeably
Time for action – introducing a new log format
Accessing files directly
Time for action – accessing files directly
Context managers
Handling other file types
Time for action – handling compressed files
Implementing file-like objects
File object methods
Enabling universal newlines
Accessing multiple files
Time for action – spell-checking HTML content
Simplifying multiple file access
Inplace filtering
Accessing remote files
Time for action – spell-checking live HTML pages
Error handling
Time for action – handling urllib2 errors
Handling string IO instances
Understanding IO in Python 3
Summary
Chapter 3: Python String Services
Understanding the basics of string object
Defining strings
Time for action – employee management
Building non-literal strings
String formatting
Time for action – customizing log processor output
Percent (modulo) formatting
Mapping key
Conversion flags
Minimum width
Precision
Width
Conversion type
Using the format method approach
Time for action – adding status code data
Making use of conversion specifiers
Creating templates
Time for action – displaying warnings on malformed lines
Template syntax
Rendering a template
Calling string object methods
Time for action – simple manipulation with string methods
Aligning text
Detecting character classes
Casing
Searching strings
Dealing with lists of strings
Treating strings as sequences
Summary
Chapter 4: Text Processing Using the Standard Library
Reading CSV data
Time for action – processing Excel formats
Time for action – CSV and formulas
Reading non-Excel data
Time for action – processing custom CSV formats
Writing CSV data
Time for action – creating a spreadsheet of UNIX users
Modifying application configuration files
Time for action – adding basic configuration read support
Using value interpolation
Time for action – relying on configuration value interpolation
Handling default options
Time for action – configuration defaults
Writing configuration data
Time for action – generating a configuration file
Reconfiguring our source
A note on Python 3
Time for action – creating an egg-based package
Understanding the setup.py file
Working with JSON
Time for action – writing JSON data
Encoding data
Decoding data
Summary
Chapter 5: Regular Expressions
Simple string matching
Time for action – testing an HTTP URL
Understanding the match function
Learning basic syntax
Detecting repetition
Specifying character sets and classes
Applying anchors to restrict matches
Wrapping it up
Advanced pattern matching
Grouping
Time for action – regular expression grouping
Using greedy versus non-greedy operators
Assertions
Performing an 'or' operation
Implementing Python-specific elements
Other search functions
search
findall and finditer
split
sub
Compiled expression objects
Dealing with performance issues
Parser flags
Unicode regular expressions
The match object
Processing BIND zone files
Time for action – reading DNS records
Summary
Chapter 6: Structured Markup
XML data
SAX processing
Time for action – event-driven processing
Incremental processing
Time for action – driving incremental processing
Building an application
Time for action – creating a dungeon adventure game
The Document Object Model
xml.dom.minidom
Time for action – updating our game to use DOM processing
Creating and modifying documents programmatically
XPath
Accessing XML data using ElementTree
Time for action – using XPath in our adventure
Reading HTML
Time for action – displaying links in an HTML page
BeautifulSoup
Summary
Chapter 7: Creating Templates
Time for action – installing Mako
Basic Mako usage
Time for action – loading a simple Mako template
Generating a template context
Managing execution with control structures
Including Python code
Time for action – reformatting the date with Python code
Adding functionality with tags
Rendering files with %include
Generating multiline comments with %doc
Documenting Mako with %text
Defining functions with %def
Time for action – defining Mako def tags
Importing %def sections using %namespace
Time for action – converting mail message to use namespaces
Filtering output
Expression filters
Filtering the output of %def blocks
Setting default filters
Inheriting from base templates
Time for action – updating base template
Growing the inheritance chain
Time for action – adding another inheritance layer
Inheriting attributes
Customizing
Custom tags
Time for action – creating custom Mako tags
Customizing filters
Overviewing alternative approaches
Summary
Chapter 8: Understanding Encodings and i18n
Understanding basic character encodings
ASCII
Limitations of ASCII
KOI8-R
Unicode
Using Unicode with Python 3
Understanding Unicode
Design goals
Organizational structure
Backwards compatibility
Encoding
UTF-32
UTF-8
Encodings in Python
Time for action – manually decoding
Reading Unicode
Writing Unicode strings
Time for action – copying Unicode data
Time for action – fixing our copy application
The codecs module
Time for action – changing encodings
Adopting good practices
Internationalization and Localization
Preparing an application for translation
Time for action – preparing for multiple languages
Time for action – providing translations
Looking for more information on internationalization
Summary
Chapter 9: Advanced Output Formats
Dealing with PDF files using PLATYPUS
Time for action – installing ReportLab
Generating PDF documents
Time for action – writing PDF with basic layout and style
Writing native Excel data
Time for action – installing xlwt
Building XLS documents
Time for action – generating XLS data
Working with OpenDocument files
Time for action – installing ODFPy
Building an ODT generator
Time for action – generating ODT data
Summary
Chapter 10: Advanced Parsing and Grammars
Defining a language syntax
Specifying grammar with Backus-Naur Form
Grammar-driven parsing
PyParsing
Time for action – installing PyParsing
Time for action – implementing a calculator
Parse actions
Time for action – handling type translations
Suppressing parts of a match
Time for action – suppressing portions of a match
Processing data using the Natural Language Toolkit
Time for action – installing NLTK
NLTK processing examples
Removing stems
Discovering collocations
Summary
Chapter 11: Searching and Indexing
Understanding search complexity
Time for action – implementing a linear search
Text indexing
Time for action – installing Nucular
An introduction to Nucular
Time for action – full text indexing
Time for action – measuring index benefit
Scripts provided by Nucular
Using XML files
Advanced Nucular features
Time for action – field-qualified indexes
Performing an enhanced search
Time for action – performing advanced Nucular queries
Indexing and searching other data
Time for action – indexing Open Office documents
Other index systems
Apache Lucene
ZODB and zc.catalog
SQL text indexing
Summary
Appendix A: Looking for Additional Resources
Python resources
Unofficial documentation
Python enhancement proposals
Self-documenting
Using other documentation tools
Community resources
Following groups and mailing lists
Finding a users' group
Attending a local Python conference
Honorable mention
Lucene and Solr
Generating C-based parsers with GNU Bison
Apache Tika
Getting started with Python 3
Major language changes
Print is now a function
Catching exceptions
Using metaclasses
New reserved words
Major library changes
Changes to list comprehensions
Migrating to Python 3
Time for action – using 2to3 to move to Python 3
Summary
Appendix B: Pop Quiz Answers
Chapter 1: Getting Started
ROT13 Processing Answers
Chapter 2: Working with the IO System
File-like objects
Chapter 3: Python String Services
String literals
String formatting
Chapter 4: Text Processing Using the Standard Library
CSV handling
JSON formatting
Chapter 5: Regular Expressions
Regular expressions
Understanding the Pythonisms
Chapter 6: Structured Markup
SAX processing
Chapter 7: Creating Templates
Template inheritance
Chapter 8: Understanding Encodings and i18n
Character encodings
Python encodings
Internationalization
Chapter 9: Advanced Output Formats
Creating XLS documents
Chapter 11: Searching and Indexing
Introduction to Nucular
Index