Mastering Regular Expressions, 3rd Edition

Author(s): Jeffrey E. F. Friedl
Edition: 3
Publisher: O'Reilly Media
Year: 2006

Language: English
Pages: 544

Table of Contents
Preface
The Need for This Book
Intended Audience
How to Read This Book
Organization
The Details
Tool-Specific Information
Typographical Conventions
Exercises
Links, Code, Errata, and Contacts
Safari® Enabled
Personal Comments and
Introduction to Regular Expressions
Solving Real Problems
Regular Expressions as a Language
The Filename Analogy
The Language Analogy
The goal of this book
The Regular-Expression Frame of Mind
If You Have Some Regular-Expression Experience
Searching Text Files: Egrep
Egrep Metacharacters
Start and End of the Line
Character Classes
Matching any one of several characters
Negated character classes
Matching Any Character with Dot
Alternation
Matching any one of several subexpressions
Ignoring Differences in Capitalization
Word Boundaries
In a Nutshell
Optional Items
Other Quantifiers: Repetition
Defined range of matches: intervals
Parentheses and Backreferences
The Great Escape
Expanding the Foundation
Linguistic Diversification
The Goal of a Regular Expression
A Few More Examples
Variable names
A string within double quotes
Dollar amount (with optional cents)
An HTTP/HTML URL
An HTML tag
Regular Expression Nomenclature
Regex
Matching
Metacharacter
Flavor
Subexpression
Character
Improving on the Status Quo
Summary
Personal Glimpses
Extended Introductory Examples
About the Examples
A Short Introduction to Perl
Matching Text with Regular Expressions
Toward a More Real-World Example
Side Effects of a Successful Match
Intertwined Regular Expressions
A short aside--metacharacters galore
Generic "whitespace" with \s
Intermission
Modifying Text with Regular Expressions
Example: Form Letter
Example: Prettifying a Stock Price
Automated Editing
A Small Mail Utility
Real-world problems, real-world solutions
The "real" real world
Adding Commas to a Number with Lookaround
Lookaround doesn't "consume" text
A few more lookahead examples
Back to the comma example . . .
Word boundaries and negative lookaround
Commafication without lookbehind
Text-to-HTML Conversion
Cooking special characters
Separating paragraphs
"Linkizing" an email address
Matching the username and hostname
Putting it together
"Linkizing" an HTTP URL
Building a regex library
Why `$' and `@' sometimes need to be escaped
That Doubled-Word Thing
Double-word example in Perl
Moving bits around: operators, functions, and objects
Double-word example in Java
Overview of Regular Expressions Features and Flavors
Regular Expressions and Cars
In This Chapter
A Casual Stroll Across the Regex Landscape
The Origins of Regular Expressions
Grep's metacharacters
Grep evolves
Egrep evolves
Other species evolve
POSIX--An attempt at standardization
Henry Spencer's regex package
Perl evolves
A partial consolidation of flavors
Versions as of this book
At a Glance
Care and Handling of Regular Expressions
Integrated Handling
Procedural and Object-Oriented Handling
Regex handling in Java
Regex handling in VB and other .NET languages
Regex handling in PHP
Regex handling in Python
Why do approaches differ?
A Search-and-Replace Example
Search and replace in Java
Search and replace in VB.NET
Search and replace in PHP
Search and Replace in Other Languages
Awk
Tcl
GNU Emacs
Care and Handling: Summary
Strings, Character Encodings, and Modes
Strings as Regular Expressions
Strings in Java
Strings in VB.NET
Strings in C#
Strings in PHP
Strings in Python
Strings in Tcl
Regex literals in Perl
Character-Encoding Issues
Richness of encoding-related support
Unicode
Characters versus combining-character sequences
Multiple code points for the same character
Unicode 3.1+ and code points beyond U+FFFF
Unicode line terminator
Regex Modes and Match Modes
Case-insensitive match mode
Free-spacing and comments regex mode
Dot-matches-all match mode (a.k.a., "single-line mode")
An unfortunate name.
Enhanced line-anchor match mode (a.k.a., "multiline mode")
Literal-text regex mode
Common Metacharacters and Features
Constructs Covered in This Section
Character Representations
Character shorthands
These are machine dependent?
Octal escape: \num
Hex and Unicode escapes: \xnum, \x{num}, \unum, \Unum, ...
Control characters: \cchar
Character Classes and Class-Like Constructs
Normal classes: [a-z] and [^a-z]
Almost any character: dot
Dot versus a negated character class
Exactly one byte
Unicode combining character sequence: \X
Class shorthands: \w, \d, \s, \W, \D, \S
Unicode properties, scripts, and blocks: \p{Prop}, \P{Prop}
Scripts.
Blocks.
Other properties/qualities.
Simple class subtraction:
Full class set operations:
Class subtraction with set operators.
Mimicking class set operations with lookaround.
POSIX bracket-expression "character class": [[:alpha:]]
POSIX bracket-expression "collating sequences": [[.span-ll.]]
POSIX bracket-expression "character equivalents": [[=n=]]
Emacs syntax classes
Anchors and Other "Zero-Width Assertions"
Start of line/string: ^, \A
End of line/string: $, \Z, \z
Start of match (or end of previous match): \G
End of previous match, or start of the current match?
Word boundaries: \b, \B, \<, \>, ...
Lookahead (?=...), (?!...); Lookbehind, (?<=...), (?<!...)
Comments and Mode Modifiers
Mode modifier: (?modifier), such as (?i) or (?-i)
Mode-modified span: (?modifier:...), such as (?i:...)
Comments: (?#...) and #...
Literal-text span: \Q...\E
Grouping, Capturing, Conditionals, and Control
Capturing/Grouping Parentheses: (...) and \1, \2,
Grouping-only parentheses: (?:...)
Named capture: (?<Name>...)
Atomic grouping: (?>...)
Alternation: ...|...|...
Conditional: (?if then|else)
Using lookaround as the test.
Other tests for the conditional.
Greedy quantifiers: *, +, ?, {num,num}
Intervals: {min,max} or \{min,max\}
Lazy quantifiers: *?, +?, ??, {num,num}?
Possessive quantifiers: *+, ++, ?+, {num,num}+
Guide to the Advanced Chapters
The Mechanics of Expression Processing
Start Your Engines!
Two Kinds of Engines
New Standards
The impact of standards
Regex Engine Types
From the Department of Redundancy Department
Testing the Engine Type
Traditional NFA or not?
DFA or POSIX NFA?
Match Basics
About the Examples
Rule 1: The Match That Begins Earliest Wins
The "transmission" and the bump-along
The transmission's main work: the bump-along
Engine Pieces and Parts
No "electric" parentheses, backreferences, or lazy quantifiers
Rule 2: The Standard Quantifiers Are Greedy
A subjective example
Being too greedy
First come, first served
Getting down to the details
Regex-Directed Versus Text-Directed
NFA Engine: Regex-Directed
The control benefits of an NFA engine
DFA Engine: Text-Directed
First Thoughts: NFA and DFA in Comparison
Consequences to us as users
Backtracking
A Really Crummy Analogy
A crummy little example
Two Important Points on Backtracking
Saved States
A match without backtracking
A match after backtracking
A non-match
A lazy match
Backtracking and Greediness
Star, plus, and their backtracking
Revisiting a fuller example
More About Greediness
Problems of Greediness
Multi-Character "Quotes"
Using Lazy Quantifiers
Greediness and Laziness Always Favor a Match
The Essence of Greediness, Laziness, and Backtracking
Possessive Quantifiers and Atomic Grouping
Atomic grouping with (?>...)
The essence of atomic grouping
Some states may remain.
Faster failures with atomic grouping.
Possessive Quantifiers, ?+, *+, ++, and {m,n}+
The Backtracking of Lookaround
Mimicking atomic grouping with positive lookahead
Is Alternation Greedy?
Taking Advantage of Ordered Alternation
Ordered alternation pitfalls
NFA, DFA, and POSIX
"The Longest-Leftmost"
Really, the longest
POSIX and the Longest-Leftmost Rule
Speed and Efficiency
DFA efficiency
Summary: NFA and DFA in Comparison
DFA versus NFA: Differences in the pre-use compile
DFA versus NFA: Differences in match speed
DFA versus NFA: Differences in what is matched
DFA versus NFA: Differences in capabilities
DFA versus NFA: Differences in ease of implementation
Summary
Practical Regex Techniques
Regex Balancing Act
A Few Short Examples
Continuing with Continuation Lines
Matching an IP Address
Know your context
Working with Filenames
Removing the leading path from a filename
Accessing the filename from a path
Both leading path and filename
Matching Balanced Sets of Parentheses
Watching Out for Unwanted Matches
Matching Delimited Text
Allowing escaped quotes in double-quoted strings
Knowing Your Data and Making Assumptions
Stripping Leading and Trailing Whitespace
HTML-Related Examples
Matching an HTML Tag
Matching an HTML Link
Examining an HTTP URL
Validating a Hostname
Plucking Out a URL in the Real World
Extended Examples
Keeping in Sync with Your Data
Keeping the match in sync with expectations
Maintaining sync after a non-match as well
Maintaining sync with \G
This example in perspective
Parsing CSV Files
Distrusting the bump-along
Another approach.
One change for the sake of efficiency
Other CSV formats
Crafting an Efficient Expression
Tests and Backtracks
Traditional NFA versus POSIX NFA
A Sobering Example
A Simple Change--Placing Your Best Foot Forward
Efficiency Versus Correctness
Advancing Further--Localizing the Greediness
Reality Check
"Exponential" matches
A Global View of Backtracking
More Work for a POSIX NFA
Work Required During a Non-Match
Being More Specific
Alternation Can Be Expensive
Benchmarking
Know What You're Measuring
Benchmarking with PHP
Benchmarking with Java
Benchmarking with VB.NET
Benchmarking with Ruby
Benchmarking with Python
Benchmarking with Tcl
Common Optimizations
No Free Lunch
Everyone's Lunch is Different
The Mechanics of Regex Application
Pre-Application Optimizations
Compile caching
Compile caching in the integrated approach
Compile caching in the procedural approach
Compile caching in the object-oriented approach
Pre-check of required character/substring optimization
Length-cognizance optimization
Optimizations with the Transmission
Start of string/line anchor optimization
Implicit-anchor optimization
End of string/line anchor optimization
Initial character/class/substring discrimination optimization
Embedded literal string check optimization
Length-cognizance transmission optimization
Optimizations of the Regex Itself
Literal string concatenation optimization
Simple quantifier optimization
Needless parentheses elimination
Character following lazy quantifier optimization
"Excessive" backtracking detection
Exponential (a.k.a., super-linear) short-circuiting
State-suppression with possessive quantifiers
Small quantifier equivalence
Need cognizance
Techniques for Faster Expressions
Common Sense Techniques
Avoid recompiling
Use non-capturing parentheses
Don't add superfluous parentheses
Don't use superfluous character classes
Use leading anchors
Expose Literal Text
"Factor out" required components from quantifiers
"Factor out" required components from the front of alternation
Expose Anchors
Expose ^ and \G at the front of expressions
Expose $ at the end of expressions
Lazy Versus Greedy: Be Specific
Split Into Multiple Regular Expressions
Mimic Initial-Character Discrimination
Don't do this with Tcl
Don't do this with PHP
Use Atomic Grouping and Possessive Quantifiers
Lead the Engine to a Match
Put the most likely alternative first
Distribute into the end of alternation
This optimization can be dangerous.
Unrolling the Loop
Method 1: Building a Regex From Past Experiences
Constructing a general "unrolling-the-loop" pattern
The Real "Unrolling-the-Loop" Pattern
Avoiding the neverending match
1) The start of special and normal must never intersect.
2) Special must not match nothingness.
3) Special must be atomic.
General things to look out for
Method 2: A Top-Down View
Method 3: An Internet Hostname
Observations
Using Atomic Grouping and Possessive Quantifiers
Making a neverending match safe with possessive quantifiers
Making a neverending match safe with atomic grouping
Short Unrolling Examples
Unrolling "multi-character" quotes
Unrolling the continuation-line example
Unrolling the CSV regex
Unrolling C Comments
To unroll or to not unroll . . .
Avoiding regex headaches
A direct approach
Making it work
Unrolling the C loop
Return to reality
The Freeflowing Regex
A Helping Hand to Guide the Match
A Well-Guided Regex is a Fast Regex
Wrapup
In Summary: Think!
Perl
In This Chapter
Perl in Earlier Chapters
Regular Expressions as a Language
Perl's Greatest Strength
Perl's Greatest Weakness
Perl's Regex Flavor
Regex Operands and Regex Literals
Features supported by regex literals
Picking your own regex delimiters
How Regex Literals Are Parsed
Regex Modifiers
Regex-Related Perlisms
Dynamic Scope and Regex Match Effects
Global and private variables
Dynamically scoped values
A better analogy: clear transparencies
Regex side effects and dynamic scoping
Dynamic scoping versus lexical scoping
Expression Context
Contorting an expression
Special Variables Modified by a Match
Using $1 within a regex?
The qr/.../ Operator and Regex Objects
Building and Using Regex Objects
Match modes (or lack thereof) are very sticky
Viewing Regex Objects
Using Regex Objects for Efficiency
The Match Operator
Match's Regex Operand
Using a regex literal
Using a regex object
The default regex
Special match-once ?...?
Specifying the Match Target Operand
The default target
Negating the sense of the match
Different Uses of the Match Operator
Normal "does this match?"--scalar context without /g
Normal "pluck data from a string"--list context, without /g
"Pluck all matches"--list context, with the /g modifier
Iterative Matching: Scalar Context, with /g
The "current match location" and the pos() function
Pre-setting a string's pos
Using \G
"Tag-team" matching with /gc
Pos-related summary
The Match Operator's Environmental Relations
The match operator's side effects
Outside influences on the match operator
Keeping your mind in context (and context in mind)
The Substitution Operator
The Replacement Operand
The /e Modifier
Multiple uses of /e
Context and Return Value
The Split Operator
Basic Split
Basic match operand
Target string operand
Basic chunk-limit operand
Advanced split
Returning Empty Elements
Trailing empty elements
The chunk-limit operand's second job
Special matches at the ends of the string
Split's Special Regex Operands
Split has no side effects
Split's Match Operand with Capturing Parentheses
Fun with Perl Enhancements
Using a Dynamic Regex to Match Nested Pairs
Using the Embedded-Code Construct
Using embedded code to display match-time information
Using embedded code to see all matches
Finding the longest match
Finding the longest-leftmost match
Using embedded code in a conditional
Using local in an Embedded-Code Construct
A Warning About Embedded Code and my Variables
Matching Nested Constructs with Embedded Code
Overloading Regex Literals
Adding start- and end-of-word metacharacters
Adding support for possessive quantifiers
Problems with Regex-Literal Overloading
Mimicking Named Capture
Perl Efficiency Issues
"There's More Than One Way to Do It"
Regex Compilation, the /o Modifier, qr/.../,
The internal mechanics of preparing a regex
Perl steps to reduce regex compilation
Unconditional caching
On-demand recompilation
The "compile once" /o modifier
Potential "gotchas" of /o
Using regex objects for efficiency
Using /o with qr/.../
Using the default regex for efficiency
Understanding the "Pre-Match" Copy
Pre-match copy supports $1, $&, $', $+, . . .
The pre-match copy is not always needed
The variables $`, $&, and $' are naughty
How expensive is the pre-match copy?
Avoiding the pre-match copy
Don't use naughty modules.
The Study Function
When not to use study
When study can help
Benchmarking
Regex Debugging Information
Run-time debugging information
Other ways to invoke debugging messages
Final Comments
Java
Java's Regex Flavor
Java Support for \p{...} and \P{...}
Unicode properties
Unicode blocks
Special Java character properties
Unicode Line Terminators
Using java.util.regex
The Pattern.compile() Factory
Pattern's matcher method
The Matcher Object
Applying the Regex
Querying Match Results
Match-result example
Simple Search and Replace
Simple search and replace examples
The replacement argument
Advanced Search and Replace
Search-and-replace examples
In-Place Search and Replace
Using a different-sized replacement
The Matcher's Region
Points to keep in mind
Setting and inspecting region bounds
Looking outside the current region
Transparent bounds
Anchoring bounds
Method Chaining
Methods for Building a Scanner
Examples illustrating hitEnd and requireEnd
The hitEnd bug and its workaround
Other Matcher Methods
Querying a matcher's target text
Other Pattern Methods
Pattern's split Method, with One Argument
Empty elements with adjacent matches
Pattern's split Method, with Two Arguments
Split with a limit less than zero
Split with a limit of zero
Split with a limit greater than zero
Additional Examples
Adding Width and Height Attributes to Image Tags
Validating HTML with Multiple Patterns Per Matcher
Parsing Comma-Separated Values (CSV) Text
Java Version Differences
Differences Between 1.4.2 and 1.5.0
New methods in Java 1.5.0
Unicode-support differences between 1.4.2 and 1.5.0
Differences Between 1.5.0 and 1.6
.NET
.NET's Regex Flavor
Additional Comments on the Flavor
Named capture
An unfortunate consequence
Conditional tests
"Compiled" expressions
Right-to-left matching
Backslash-digit ambiguities
ECMAScript mode
Using .NET Regular Expressions
Regex Quickstart
Quickstart: Checking a string for match
Quickstart: Matching and getting the text matched
Quickstart: Matching and getting captured text
Quickstart: Search and replace
Package Overview
Importing the regex namespace
Core Object Overview
Regex objects
Match objects
Group objects
Capture objects
All results are computed at match time
Core Object Details
Creating Regex Objects
Catching exceptions
Regex options
Using Regex Objects
Using a replacement delegate
Using Split with capturing parentheses
Using Match Objects
Using Group Objects
Static "Convenience" Functions
Regex Caching
Support Functions
Regex.Escape(string)
Regex.Unescape(string)
Match.Empty
Regex.CompileToAssembly(...)
Advanced .NET
Regex Assemblies
Matching Nested Constructs
Capture Objects
PHP
PHP's Regex Flavor
The Preg Function Interface
"Pattern" Arguments
PHP single-quoted strings
Delimiters
Pattern modifiers
Mode modifiers outside the regex
PHP-specific modifiers
The Preg Functions
preg_match
Capturing match data
Trailing "non-participatory" elements stripped
Named capture
Getting more details on the match: PREG_OFFSET_CAPTURE
The offset argument
preg_match_all
Collecting match data
The default PREG_PATTERN_ORDER arrangement
The PREG_SET_ORDER arrangement
preg_match_all and the PREG_OFFSET_CAPTURE flag
preg_match_all with named capture
preg_replace
Basic one-string, one-pattern, one-replacement preg_replace
Multiple subjects, patterns, and replacements
Ordering of array arguments
preg_replace_callback
A callback versus the e pattern modifier
preg_split
preg_split's limit argument
preg_split's flag arguments
preg_grep
preg_quote
"Missing" Preg Functions
preg_regex_to_pattern
The problem
The solution
Syntax-Checking an Unknown Pattern Argument
Syntax-Checking an Unknown Regex
Recursive Expressions
Matching Text with Nested Parentheses
Recursive reference to a set of capturing parentheses
Recursive reference via named capture
More on possessive quantifiers
No Backtracking Into Recursion
Matching a Set of Nested Parentheses
PHP Efficiency Issues
The S Pattern Modifier: "Study"
Standard optimizations, without the S pattern modifier
Enhancing the optimization with the S pattern modifier
When the S pattern modifier can't help
Suggested use
Extended Examples
CSV Parsing with PHP
Checking Tagged Data for Proper Nesting
The main body of this expression
Possessive quantifiers
Real-world XML
HTML?
Index