Regular expressions pattern options
In regular expressions there are some constructs that influence the way a pattern is matched:
? (non-greedy quantifiers)
(?:pattern) (non-capturing group)
(?=pattern) (positive lookahead)
(?!pattern) (negative lookahead)
(?<=pattern) (positive lookbehind)
(?<!pattern) (negative lookbehind)
There are many regex testing tools online, but I found that most of them don't support the more advanced constructs in this list. This one instead supports all of them. The Visual Basic 6 code in the "Testing regular expressions" section at the end of this page shows how the VBScript Regular Expressions library behaves. To make it work without changes, add Microsoft Rich Textbox Control to the Controls, add Microsoft VBScript Regular Expressions 5.5 to the References, and place on a form the controls that the code refers to: txtRE and txtString (text boxes, judging by their names) and rtfResult (a rich text box).
The quantifiers *, +, ?, {n} and {n,m} are greedy: they match as many characters as possible, up to the first character that does not fit the pattern.
Example: in the string example12345, the pattern [a-z]+ matches example.
The non-greedy quantifiers *?, +?, ??, {n}? and {n,m}? match as few characters as possible.
Example: in the string example12345, the pattern [a-z]+? finds 7 different matches: e, x, a, m, p, l, e.
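The two quantifier behaviours can be checked with a few lines of code. This is a minimal sketch that assumes Python's re module rather than the VBScript library discussed in this article; the variable names are only illustrative. It runs the two patterns above and also applies the same idea to the angle-bracket string used in the examples that follow.

```python
import re

text = "example12345"

# Greedy: [a-z]+ takes the whole run of letters in a single match.
greedy = re.findall(r"[a-z]+", text)

# Non-greedy: [a-z]+? stops as soon as one letter has been matched,
# so the same run comes back one character at a time.
lazy = re.findall(r"[a-z]+?", text)

print(greedy)  # ['example']
print(lazy)    # ['e', 'x', 'a', 'm', 'p', 'l', 'e']

# Same comparison on a string with angle-bracket tags.
tags = "www<aaaa>xxxx<bbb>yyyy<cccc>zzzz"
greedy_tags = re.findall(r"<[a-z><]+>", tags)   # runs to the last '>'
lazy_tags = re.findall(r"<[a-z><]+?>", tags)    # stops at the nearest '>'

print(greedy_tags)
print(lazy_tags)
```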
Example: in the string www<aaaa>xxxx<bbb>yyyy<cccc>zzzz, the pattern <[a-z><]+> has only one match: <aaaa>xxxx<bbb>yyyy<cccc>.
Example: in the same string www<aaaa>xxxx<bbb>yyyy<cccc>zzzz, the pattern <[a-z><]+?> has 3 matches: <aaaa>, <bbb> and <cccc>.
A pattern enclosed in parentheses is captured, and the captured text can be referenced later in the same pattern with \number (for instance to check if the matching pattern is repeated).
Example: in the string test1 test2 test1, the pattern ([a-z0-9]+)\s([a-z0-9]+)\s\1 matches because \1 refers to the text captured by the first group, test1.
When ?: comes immediately after the left parenthesis, the pattern is not captured.
Example: in the string test1 test2 test1, the pattern (?:[a-z0-9]+)\s([a-z0-9]+)\s\1 does not match because the first captured group (\1) is now test2, and the string does not end with a second test2.
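The effect of ?: on backreferences and on the list of captured groups can be seen in a short sketch. It assumes Python's re module as a stand-in for the VBScript library; groups() plays roughly the role of the submatches, and the third pattern (without \1) is added here only for illustration.

```python
import re

text = "test1 test2 test1"

# Both words are captured, so \1 refers to "test1" and the pattern matches.
with_capture = re.search(r"([a-z0-9]+)\s([a-z0-9]+)\s\1", text)

# The first word is no longer captured, so \1 now refers to "test2"
# and the final "test1" cannot satisfy the backreference.
without_capture = re.search(r"(?:[a-z0-9]+)\s([a-z0-9]+)\s\1", text)

# Illustration only: without the backreference, the non-capturing group
# still takes part in the match but leaves no captured group behind.
only_second = re.search(r"(?:[a-z0-9]+)\s([a-z0-9]+)", text)

print(with_capture.group(0), with_capture.groups())
print(without_capture)
print(only_second.group(0), only_second.groups())
```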
Since I became acquainted with this construct, I tend to use it whenever there is no need to store the captured text in memory (almost always).
Moreover, the Microsoft VBScript Regular Expressions 5.5 library provides a match for every match and a submatch for every captured pattern, so understanding (?:pattern) is very useful when working with it.
?= matches the pattern before the ?= construct only if after that pattern there is a match for the (?=pattern).
Example: in the string BMW Z3, the pattern BMW (?=Z[0-9]) matches BMW .
Example: in the string BMW 325, the pattern BMW (?=Z[0-9]) doesn't match.
Moreover, the lookahead ?= only looks ahead, without consuming what it looks at. Usually, when a match is found, the search for the next match begins at the character after the end of that match. When ?= is used, the part of the string that matches (?=pattern) is not included in the match, so the next search begins at the character after the match of the pattern before the ?=, and the text seen by the lookahead can be matched again.
Example: by default, in the string x=a2*b*3+7*50;, the pattern [*/+-=]([a-z0-9]+)[*/+-;] matches =a2*, *3+ and *50; because once the first match (=a2*) is found, the next match is searched in the remaining string b*3+7*50;.
Example: using the ?= construct, in the string x=a2*b*3+7*50;, the pattern [*/+-=]([a-z0-9]+)(?=[*/+-;]) matches =a2, *b, *3, +7 and *50 because once the first match (=a2) is found, the next match is searched in the remaining string *b*3+7*50;.
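The lookahead examples above can be reproduced with a small sketch. It assumes Python's re module instead of the VBScript library (the two engines agree on these patterns as far as I can tell, but that is an assumption), and the variable names are illustrative.

```python
import re

# The lookahead is checked but is not part of the match itself.
bmw_z3 = re.search(r"BMW (?=Z[0-9])", "BMW Z3")
bmw_325 = re.search(r"BMW (?=Z[0-9])", "BMW 325")

expr = "x=a2*b*3+7*50;"

# Note: inside these character classes the hyphen in +-= and +-; forms a
# range, so the classes also contain the digits. The matches listed in the
# text are unaffected by this.

# The trailing operator is consumed, so it cannot start the next match.
consuming = [m.group(0)
             for m in re.finditer(r"[*/+-=]([a-z0-9]+)[*/+-;]", expr)]

# The trailing operator is only looked at, so the next match can start on it.
lookahead = [m.group(0)
             for m in re.finditer(r"[*/+-=]([a-z0-9]+)(?=[*/+-;])", expr)]

print(repr(bmw_z3.group(0)))  # 'BMW '
print(bmw_325)                # None
print(consuming)
print(lookahead)
```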
?! matches the pattern before the ?! only if after that pattern there is NOT a match for the (?!pattern).
Example: in the string BMW 325, the pattern BMW (?!Z[0-9]) matches BMW .
As with the positive lookahead, the lookahead pattern is not included in the match, and after a match is found the search continues from the character after that match.
Example: in the string Error contacting <aa@xx>. The mail address <aa@xx.com> is wrong., the pattern [a-z0-9_.-]+@[a-z0-9_.-]+(?!.*(expired|invalid|deleted|unknown)) matches both aa@xx and aa@xx.com: the first match is searched in the entire string, the second in the remainder >. The mail address <aa@xx.com> is wrong., and none of the listed words follows either address.
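A minimal sketch of the negative lookahead, again assuming Python's re module rather than the VBScript library. The second input string, ending in "is invalid.", is a hypothetical variation added here to show the lookahead rejecting a match; it does not come from the examples above.

```python
import re

# "BMW " matches only when it is NOT followed by Z and a digit.
bmw_325 = re.search(r"BMW (?!Z[0-9])", "BMW 325")
bmw_z3 = re.search(r"BMW (?!Z[0-9])", "BMW Z3")

pattern = r"[a-z0-9_.-]+@[a-z0-9_.-]+(?!.*(expired|invalid|deleted|unknown))"

text = "Error contacting <aa@xx>. The mail address <aa@xx.com> is wrong."
# finditer + group(0) returns the whole match, not the inner group.
addresses = [m.group(0) for m in re.finditer(pattern, text)]

# Hypothetical variation: one of the listed words follows the address,
# so the lookahead fails wherever the address could end.
rejected_text = "The mail address <aa@xx.com> is invalid."
rejected = [m.group(0) for m in re.finditer(pattern, rejected_text)]

print(repr(bmw_325.group(0)))  # 'BMW '
print(bmw_z3)                  # None
print(addresses)
print(rejected)
```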
?<= matches the pattern after the ?<= construct if before that pattern there is a match for the (?<=pattern).
Example: in the string BMW 325, the pattern (?<=BMW )[0-9]{3} matches 325.
This construct is not supported by Microsoft VBScript Regular Expressions 5.5.
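Since the VBScript library cannot run this construct, here is a minimal sketch using Python's re module, which does support lookbehind (for fixed-width patterns such as this one). The choice of Python and the second input string are assumptions made for illustration.

```python
import re

# The digits match only when "BMW " comes immediately before them;
# the lookbehind text is not part of the match.
after_bmw = re.search(r"(?<=BMW )[0-9]{3}", "BMW 325")
after_fiat = re.search(r"(?<=BMW )[0-9]{3}", "Fiat 500")

print(after_bmw.group(0), after_bmw.start())  # 325 4
print(after_fiat)                             # None
```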
?<! matches the pattern after the ?<! construct if before that pattern there is NOT a match for the (?<!pattern).
Example: in the string Fiat 500, the pattern (?<!BMW )[0-9]{3} matches 500.
This construct is not supported by Microsoft VBScript Regular Expressions 5.5.
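The negative lookbehind can be tried the same way. This sketch assumes Python's re module, since the VBScript library does not support the construct. The inputs BMW 325 and BMW 3250 are added here for illustration: the second shows that the check applies only to the position where the match starts.

```python
import re

pattern = r"(?<!BMW )[0-9]{3}"

fiat = re.search(pattern, "Fiat 500")
bmw = re.search(pattern, "BMW 325")

# With four digits, the match can start one character later, where the
# preceding text is no longer "BMW ", so the last three digits match.
bmw_four_digits = re.search(pattern, "BMW 3250")

print(fiat.group(0))             # 500
print(bmw)                       # None
print(bmw_four_digits.group(0))  # 250
```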
Testing regular expressions
Private Sub ApplyRegExp()
    ' Runs the pattern in txtRE against the text in txtString and lists
    ' every match (in red) and every submatch (in blue) in rtfResult.
    On Error GoTo ErrHandler
    Dim RE As RegExp
    Dim Matches As MatchCollection
    Dim Match As Match
    Dim SubMatches As SubMatches
    Dim m As Integer, sm As Integer
    Dim TextString As String

    Set RE = New RegExp
    RE.IgnoreCase = True
    RE.Global = True        ' find all matches, not just the first one
    RE.MultiLine = True     ' ^ and $ also match at line breaks
    RE.Pattern = txtRE.Text

    ' Replace each CR+LF pair with a single CR before matching
    TextString = Replace(txtString.Text, vbCrLf, vbCr)
    Set Matches = RE.Execute(TextString)
    rtfResult.Text = "Matches: " & Matches.Count & vbCrLf

    For m = 0 To Matches.Count - 1
        Set Match = Matches.Item(m)

        ' Append the match at the end of the result box
        rtfResult.SelStart = Len(rtfResult.Text)
        rtfResult.SelLength = 0
        rtfResult.SelText = "Match " & m & ": " & CStr(Match.Value) & vbCrLf

        ' Select the matched text just written (before the CR+LF) and colour it
        rtfResult.SelStart = Len(rtfResult.Text) - Len(CStr(Match.Value)) - 2
        rtfResult.SelLength = Len(CStr(Match.Value))
        rtfResult.SelColor = vbRed

        ' One submatch for every captured pattern
        Set SubMatches = Match.SubMatches
        rtfResult.SelStart = Len(rtfResult.Text)
        rtfResult.SelLength = 0
        rtfResult.SelText = "----> Submatches: " & SubMatches.Count & vbCrLf

        For sm = 0 To SubMatches.Count - 1
            rtfResult.SelStart = Len(rtfResult.Text)
            rtfResult.SelLength = 0
            rtfResult.SelText = "--------> Submatch " & sm & ": " & CStr(SubMatches.Item(sm)) & vbCrLf
            rtfResult.SelStart = Len(rtfResult.Text) - Len(CStr(SubMatches.Item(sm))) - 2
            rtfResult.SelLength = Len(CStr(SubMatches.Item(sm)))
            rtfResult.SelColor = vbBlue
        Next sm

        Set SubMatches = Nothing
        Set Match = Nothing
    Next m

    Set Matches = Nothing
    rtfResult.SelStart = 0
    rtfResult.SelLength = 0
    Set RE = Nothing
    Exit Sub

ErrHandler:
    ' For example, an invalid or unsupported pattern ends up here
    MsgBox "Error: " & Err.Description
    Err.Clear
End Sub
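For readers without Visual Basic 6, the same kind of report can be sketched in a few lines. This is a rough analogue that assumes Python's re module, not a port of the code above: the function name describe_matches and the sample pattern are hypothetical, and Python's engine differs from the VBScript one (for instance, it accepts the lookbehind constructs).

```python
import re


def describe_matches(pattern, text):
    """Return one (match_text, submatches) pair per match found in text."""
    # re.IGNORECASE and re.MULTILINE mirror IgnoreCase and MultiLine above;
    # finditer plays the role of Global = True.
    regex = re.compile(pattern, re.IGNORECASE | re.MULTILINE)
    return [(m.group(0), m.groups()) for m in regex.finditer(text)]


# The digits sit in a non-capturing group, so they are part of each match
# but do not show up among the submatches.
report = describe_matches(r"([a-z]+)(?:[0-9]+)", "abc123 DEF45")

print(f"Matches: {len(report)}")
for index, (value, submatches) in enumerate(report):
    print(f"Match {index}: {value}")
    print(f"----> Submatches: {len(submatches)}")
    for sub_index, sub in enumerate(submatches):
        print(f"--------> Submatch {sub_index}: {sub}")
```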