Regular expressions pattern options
In regular expressions there are some constructs that influence the way a pattern is matched:
- Non-greedy ?
- Non-capturing (?:pattern)
- Positive lookahead (?=pattern)
- Negative lookahead (?!pattern)
- Positive lookbehind (?<=pattern)
- Negative lookbehind (?<!pattern)
- Testing regular expressions
Non-greedy ?
The quantifiers *, +, ?, {n} and {n,m} are greedy: they match as many characters as possible, stopping at the first character that does not fit the pattern.
Example: in the string example12345, the pattern [a-z]+ matches example.
The non-greedy quantifiers *?, +?, ??, {n}? and {n,m}? match as few characters as possible.
Example: in the string example12345, the pattern [a-z]+? finds 7 separate matches: e, x, a, m, p, l, e.
Example: in the string www<aaaa>xxxx<bbb>yyyy<cccc>zzzz, the pattern <[a-z><]+> has only one match: <aaaa>xxxx<bbb>yyyy<cccc>.
Example: in the same string www<aaaa>xxxx<bbb>yyyy<cccc>zzzz, the pattern <[a-z><]+?> has 3 matches: <aaaa>, <bbb> and <cccc>.
Non-capturing (?:pattern)
A pattern enclosed in parentheses is captured and can be referenced later with \number (for instance to check if the matching pattern is repeated).
Example: in the string test1 test2 test1, the pattern ([a-z0-9]+)\s([a-z0-9]+)\s\1 matches because \1 evaluates to the first captured pattern, test1.
When ?: comes immediately after the left parenthesis, the pattern is not captured.
Example: in the string test1 test2 test1, the pattern (?:[a-z0-9]+)\s([a-z0-9]+)\s\1 does not match because the first captured match (\1) is now test2.
Since I became acquainted with this construct, I tend to use it whenever there is no need to store the pattern in memory (almost always).
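
The greedy, non-greedy, backreference and non-capturing examples above can be reproduced with a short script. This is only a minimal sketch: it assumes Python's re module rather than the VBScript engine, and the variable names are illustrative, not part of the original examples.

```python
import re

# Greedy vs. non-greedy quantifier on a simple string.
greedy = re.findall(r"[a-z]+", "example12345")
lazy = re.findall(r"[a-z]+?", "example12345")

# Greedy vs. non-greedy on the tagged string from the text.
tagged = "www<aaaa>xxxx<bbb>yyyy<cccc>zzzz"
greedy_tags = re.findall(r"<[a-z><]+>", tagged)
lazy_tags = re.findall(r"<[a-z><]+?>", tagged)

# Backreference \1 with a capturing and with a non-capturing first group.
words = "test1 test2 test1"
captured = re.search(r"([a-z0-9]+)\s([a-z0-9]+)\s\1", words)
non_captured = re.search(r"(?:[a-z0-9]+)\s([a-z0-9]+)\s\1", words)

print(greedy)             # ['example']
print(lazy)               # ['e', 'x', 'a', 'm', 'p', 'l', 'e']
print(greedy_tags)        # ['<aaaa>xxxx<bbb>yyyy<cccc>']
print(lazy_tags)          # ['<aaaa>', '<bbb>', '<cccc>']
print(captured.groups())  # ('test1', 'test2')
print(non_captured)       # None: \1 now refers to test2
```
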
The Microsoft VBScript Regular Expressions 5.5 library also provides a Match object for every match and a submatch for every captured pattern, so understanding (?:pattern) is very useful when working with it.
Positive lookahead (?=pattern)
?= matches the pattern before the ?= construct only if that pattern is followed by a match for (?=pattern).
Example: in the string BMW Z3, the pattern BMW (?=Z[0-9]) matches BMW .
Example: in the string BMW 325, the pattern BMW (?=Z[0-9]) doesn't match.
Moreover, the lookahead ?= just looks ahead, without consuming what it looks at. Usually, when a match is found for a pattern, the search for the next match resumes from the character after the last match. When ?= is used, the part of the string that matches (?=pattern) is not included in the match, and the search still resumes from the character after the last match, that is, the character after the match of the pattern before the ?=.
Example: by default, in the string x=a2*b*3+7*50;, the pattern [*/+-=]([a-z0-9]+)[*/+-;] matches =a2*, *3+ and *50; because after the first match (=a2*) is found, the next match is searched in the remaining string b*3+7*50;.
Example: using the ?= construct, in the string x=a2*b*3+7*50;, the pattern [*/+-=]([a-z0-9]+)(?=[*/+-;]) matches =a2, *b, *3, +7 and *50 because after the first match (=a2) is found, the next match is searched in the remaining string *b*3+7*50;.
Note that inside a character class +-= and +-; are ranges, which here also cover the digits; to match a literal hyphen, place it last, as in [*/+=-] and [*/+;-]. The matches listed above are the same either way.
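
The difference between consuming the trailing operator and only looking ahead at it can be checked with a few lines of code. This is a sketch that assumes Python's re module (not the VBScript engine) and uses the patterns exactly as written above; the variable names are illustrative.

```python
import re

# The lookahead must be satisfied, but it is not part of the match.
bmw = re.compile(r"BMW (?=Z[0-9])")
z3 = bmw.search("BMW Z3")
b325 = bmw.search("BMW 325")

expr = "x=a2*b*3+7*50;"

# The trailing operator is consumed, so it cannot start the next match.
consuming = [m.group(0) for m in re.finditer(r"[*/+-=]([a-z0-9]+)[*/+-;]", expr)]

# The trailing operator is only looked at, so it is still available.
lookahead = [m.group(0) for m in re.finditer(r"[*/+-=]([a-z0-9]+)(?=[*/+-;])", expr)]

print(repr(z3.group(0)))  # 'BMW '
print(b325)               # None
print(consuming)          # ['=a2*', '*3+', '*50;']
print(lookahead)          # ['=a2', '*b', '*3', '+7', '*50']
```
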
Negative lookahead (?!pattern)
?! matches the pattern before the ?! construct only if that pattern is NOT followed by a match for (?!pattern).
Example: in the string BMW 325, the pattern BMW (?!Z[0-9]) matches BMW .
As with the positive lookahead, the lookahead pattern is not included in the match, and after a match is found the search resumes from the character after that match.
Example: in the string Error contacting <aa@xx>. The mail address <aa@xx.com> is wrong., the pattern [a-z0-9_.-]+@[a-z0-9_.-]+(?!.*(expired|invalid|deleted|unknown)) matches both aa@xx and aa@xx.com, because neither address is followed later in the string by one of the listed words. The first match is searched in the entire string and the second match is searched in the remainder, >. The mail address <aa@xx.com> is wrong..
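
The negative lookahead examples can be tried in the same way. This sketch assumes Python's re module with the IGNORECASE flag (mirroring IgnoreCase in the VB code further down). The string in rejected_text is a made-up example, not taken from the text, added to show the lookahead rejecting a match.

```python
import re

bmw = re.compile(r"BMW (?!Z[0-9])")
b325 = bmw.search("BMW 325")
z3 = bmw.search("BMW Z3")

mail = re.compile(
    r"[a-z0-9_.-]+@[a-z0-9_.-]+(?!.*(expired|invalid|deleted|unknown))",
    re.IGNORECASE,
)

text = "Error contacting <aa@xx>. The mail address <aa@xx.com> is wrong."
accepted = [m.group(0) for m in mail.finditer(text)]

# Hypothetical string: "invalid" appears after the address, so nothing matches.
rejected_text = "The mail address aa@xx.com is invalid."
rejected = [m.group(0) for m in mail.finditer(rejected_text)]

print(repr(b325.group(0)))  # 'BMW '
print(z3)                   # None
print(accepted)             # ['aa@xx', 'aa@xx.com']
print(rejected)             # []
```
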
Positive lookbehind (?<=pattern)
?<= matches the pattern after the ?<= construct only if that pattern is preceded by a match for (?<=pattern).
Example: in the string BMW 325, the pattern (?<=BMW )[0-9]{3} matches 325.
This construct is not supported by VBScript Regular Expressions 5.5.
Negative lookbehind (?<!pattern)
?<! matches the pattern after the ?<! construct only if that pattern is NOT preceded by a match for (?<!pattern).
Example: in the string Fiat 500, the pattern (?<!BMW )[0-9]{3} matches 500.
This construct is not supported by VBScript Regular Expressions 5.5.
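
Since the VBScript library cannot run the lookbehind examples, here is a minimal sketch in another engine. It assumes Python's re module, which accepts lookbehind only for fixed-width patterns such as the one used here. The two cross-checks (each pattern against the other string) are my own additions to show the opposite outcome.

```python
import re

positive = re.compile(r"(?<=BMW )[0-9]{3}")
negative = re.compile(r"(?<!BMW )[0-9]{3}")

pos_bmw = positive.findall("BMW 325")
pos_fiat = positive.findall("Fiat 500")
neg_fiat = negative.findall("Fiat 500")
neg_bmw = negative.findall("BMW 325")

print(pos_bmw)   # ['325']
print(pos_fiat)  # []
print(neg_fiat)  # ['500']
print(neg_bmw)   # []
```
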
Testing regular expressions
There are many regex testing tools online, but I found most of them don't support the most advanced constructs I described above. This one instead supports all of the above constructs.
The following Visual Basic 6 code shows the behaviour of VBScript Regular Expressions. To run it unchanged, add Microsoft Rich Textbox Control to the Controls and Microsoft VBScript Regular Expressions 5.5 to the References, then place these controls on a form:
- txtRE, a textbox for the regular expression pattern
- txtString, a textbox containing the string that will be searched with the txtRE pattern
- rtfResult, a RichTextbox where the program will display the matches and submatches
- a command button that calls ApplyRegExp
Private Sub ApplyRegExp()
    On Error GoTo ErrHandler
    Dim RE As RegExp
    Dim Matches As MatchCollection
    Dim Match As Match
    Dim SubMatches As SubMatches
    Dim m As Integer, sm As Integer
    Dim TextString As String
    ' Configure the regular expression from the txtRE textbox
    Set RE = New RegExp
    RE.IgnoreCase = True
    RE.Global = True
    RE.MultiLine = True
    RE.Pattern = txtRE.Text
    ' Replace CR+LF line breaks with a single CR in the searched string
    TextString = Replace(txtString.Text, vbCrLf, vbCr)
    Set Matches = RE.Execute(TextString)
    rtfResult.Text = "Matches: " & Matches.Count & vbCrLf
    For m = 0 To Matches.Count - 1
        Set Match = Matches.Item(m)
        ' Append the match, then select its value and color it red
        rtfResult.SelStart = Len(rtfResult.Text)
        rtfResult.SelLength = 0
        rtfResult.SelText = "Match " & m & ": " & CStr(Match.Value) & vbCrLf
        rtfResult.SelStart = Len(rtfResult.Text) - Len(CStr(Match.Value)) - 2
        rtfResult.SelLength = Len(CStr(Match.Value))
        rtfResult.SelColor = vbRed
        Set SubMatches = Match.SubMatches
        rtfResult.SelStart = Len(rtfResult.Text)
        rtfResult.SelLength = 0
        rtfResult.SelText = "----> Submatches: " & SubMatches.Count & vbCrLf
        For sm = 0 To SubMatches.Count - 1
            ' Append the submatch, then select its value and color it blue
            rtfResult.SelStart = Len(rtfResult.Text)
            rtfResult.SelLength = 0
            rtfResult.SelText = "--------> Submatch " & sm & ": " & CStr(SubMatches.Item(sm)) & vbCrLf
            rtfResult.SelStart = Len(rtfResult.Text) - Len(CStr(SubMatches.Item(sm))) - 2
            rtfResult.SelLength = Len(CStr(SubMatches.Item(sm)))
            rtfResult.SelColor = vbBlue
        Next sm
        Set SubMatches = Nothing
        Set Match = Nothing
    Next m
    Set Matches = Nothing
    ' Move the caret back to the top of the result box
    rtfResult.SelStart = 0
    rtfResult.SelLength = 0
    Set RE = Nothing
    Exit Sub
ErrHandler:
    MsgBox "Error: " & Err.Description
    Err.Clear
End Sub
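
For readers without Visual Basic 6, the same report of matches and submatches can be approximated on the command line. This is a rough sketch and not a port of the form above: it assumes Python's re module, and describe_matches is a hypothetical helper name. It sets IGNORECASE and MULTILINE to mirror IgnoreCase and MultiLine in the VB code, and finditer plays the role of Global = True.

```python
import re


def describe_matches(pattern, text):
    """Return the report lines: one per match and one per submatch."""
    regex = re.compile(pattern, re.IGNORECASE | re.MULTILINE)
    matches = list(regex.finditer(text))
    lines = ["Matches: %d" % len(matches)]
    for m_index, match in enumerate(matches):
        lines.append("Match %d: %s" % (m_index, match.group(0)))
        groups = match.groups()  # captured groups only, like SubMatches
        lines.append("----> Submatches: %d" % len(groups))
        for g_index, group in enumerate(groups):
            lines.append("--------> Submatch %d: %s" % (g_index, group))
    return lines


report = describe_matches(r"([a-z0-9]+)\s([a-z0-9]+)\s\1", "test1 test2 test1")
for line in report:
    print(line)
```

Unlike the VBScript library, this engine also accepts the lookbehind patterns described above.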
Posted by: Z24 | Sat, Dec 22 2007 | Category: /programming
Tagged as: perl, programming, regex, visual basic


