MyComputingArt

Regular expressions pattern options

In regular expressions there are some constructs that influence the way a pattern is matched:


?
By default, the quantifiers *, +, ?, {n} and {n,m} are greedy: each one matches as many characters as possible while still letting the rest of the pattern match.
Example: in the string example12345, the pattern [a-z]+ matches example.
The non-greedy quantifiers *?, +?, ??, {n}? and {n,m}? match as few characters as possible.
Example: in the string example12345, the pattern [a-z]+? finds 7 different matches: e, x, a, m, p, l, e.
Example: in the string www<aaaa>xxxx<bbb>yyyy<cccc>zzzz, the pattern <[a-z><]+> has only one match: <aaaa>xxxx<bbb>yyyy<cccc>.
Example: in the same string www<aaaa>xxxx<bbb>yyyy<cccc>zzzz, the pattern <[a-z><]+?> has 3 matches: <aaaa>, <bbb> and <cccc>.
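The examples above can be reproduced with a few lines of code. This is a minimal sketch in Python, chosen here only because its re module supports every construct on this page; the variable names are mine.

```python
import re

text = "www<aaaa>xxxx<bbb>yyyy<cccc>zzzz"

# Greedy: + grabs as much as it can, so one match spans all three tags.
greedy = re.findall(r"<[a-z><]+>", text)

# Non-greedy: +? stops at the first ">" that lets the pattern succeed.
lazy = re.findall(r"<[a-z><]+?>", text)

print(greedy)  # ['<aaaa>xxxx<bbb>yyyy<cccc>']
print(lazy)    # ['<aaaa>', '<bbb>', '<cccc>']

# On a plain word, the non-greedy form matches one letter at a time.
letters = re.findall(r"[a-z]+?", "example12345")
print(letters)  # ['e', 'x', 'a', 'm', 'p', 'l', 'e']
```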
(?:pattern)
By default, when a pattern is enclosed in parentheses, the text it matches is captured in memory and can be accessed using the backreference \number (for instance, to check whether the matched text is repeated).
Example: in the string test1 test2 test1, the pattern ([a-z0-9]+)\s([a-z0-9]+)\s\1 matches because \1 refers to the first captured text, test1.
When ?: comes immediately after the left parenthesis, the pattern is not captured.
Example: in the string test1 test2 test1, the pattern (?:[a-z0-9]+)\s([a-z0-9]+)\s\1 does not match because \1 now refers to the only captured text, test2.
Since I became acquainted with this construct, I tend to use it whenever there is no need to store the matched text in memory (almost always).
Moreover, the Microsoft VBScript Regular Expressions 5.5 library returns a Match for every match and a SubMatch for every captured group, so (?:pattern) is a handy way to keep the SubMatches list limited to the groups that are actually needed.
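The difference between capturing and non-capturing groups is easy to see by printing the captured groups. A small Python sketch (Python is my choice for illustration; groups() plays roughly the role of the SubMatches collection):

```python
import re

text = "test1 test2 test1"

# Both words are captured; \1 refers back to the first one (test1).
both = re.search(r"([a-z0-9]+)\s([a-z0-9]+)\s\1", text)
print(both.group(0))  # test1 test2 test1
print(both.groups())  # ('test1', 'test2')

# The first group no longer captures, so \1 now means test2: no match.
no_match = re.search(r"(?:[a-z0-9]+)\s([a-z0-9]+)\s\1", text)
print(no_match)  # None

# Without the backreference, only the second word shows up as a submatch.
only_second = re.search(r"(?:[a-z0-9]+)\s([a-z0-9]+)", text)
print(only_second.groups())  # ('test2',)
```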
(?=pattern)
The positive lookahead (?=pattern) lets the part of the expression before it match only if it is immediately followed by text that matches pattern.
Example: in the string BMW Z3, the pattern BMW (?=Z[0-9]) matches BMW and the space after it.
Example: in the string BMW 325, the pattern BMW (?=Z[0-9]) doesn't match.
Moreover, the lookahead just looks ahead, without consuming what it looks at. Usually, when a match is found, the search for the next match starts at the character after the end of that match. The text that satisfies (?=pattern) is not included in the match, so the search resumes at the character after the part matched before the ?=, and the text the lookahead inspected is still available to the next match.
Example: without a lookahead, in the string x=a2*b*3+7*50;, the pattern [*/+=-]([a-z0-9]+)[*/+;-] matches =a2*, *3+ and *50; because once the first match is found (=a2*), the next match is searched in the string b*3+7*50;. (The hyphen is written last in each character class so that it is taken literally; in the middle, as in +-=, it would define a range that also includes the digits.)
Example: using the ?= construct, in the string x=a2*b*3+7*50;, the pattern [*/+=-]([a-z0-9]+)(?=[*/+;-]) matches =a2, *b, *3, +7 and *50 because once the first match is found (=a2), the next match is searched in the string *b*3+7*50;.
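To see that the lookahead leaves the delimiter available for the next match, the two patterns can be run side by side. This is a sketch in Python (my choice of language); the hyphen is placed last in each character class so that it is taken literally rather than as a range, which gives the same results for this string.

```python
import re

expr = "x=a2*b*3+7*50;"

# The trailing delimiter is consumed, so it cannot start the next match.
consuming = re.findall(r"[*/+=-]([a-z0-9]+)[*/+;-]", expr)
print(consuming)  # ['a2', '3', '50']

# The trailing delimiter is only looked at, so it is still available
# as the leading delimiter of the next match.
lookahead = re.findall(r"[*/+=-]([a-z0-9]+)(?=[*/+;-])", expr)
print(lookahead)  # ['a2', 'b', '3', '7', '50']

# The full matches do not include the delimiter that was looked ahead at.
full = [m.group(0) for m in re.finditer(r"[*/+=-]([a-z0-9]+)(?=[*/+;-])", expr)]
print(full)  # ['=a2', '*b', '*3', '+7', '*50']
```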
(?!pattern)
The negative lookahead (?!pattern) lets the part of the expression before it match only if it is NOT immediately followed by text that matches pattern.
Example: in the string BMW 325, the pattern BMW (?!Z[0-9]) matches BMW and the space after it.
As with the positive lookahead, the text inspected by the lookahead is not included in the match, and after a match is found the search resumes at the character after that match.
Example: in the string Error contacting <aa@xx>. The mail address <aa@xx.com> is wrong., the pattern [a-z0-9_.-]+@[a-z0-9_.-]+(?!.*(expired|invalid|deleted|unknown)) matches both aa@xx and aa@xx.com. None of the listed words appears after either address, so the lookahead succeeds both times; the first match is searched in the entire string and the second in what follows the first match.
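The negative lookahead examples can be checked the same way. A minimal Python sketch (the language and the variable names are my own choices):

```python
import re

# "BMW " matches only when it is NOT followed by Z and a digit.
bmw = re.compile(r"BMW (?!Z[0-9])")
m_325 = bmw.search("BMW 325")
m_z3 = bmw.search("BMW Z3")
print(repr(m_325.group(0)))  # 'BMW '
print(m_z3)                  # None

# Neither address is followed by one of the listed words, so both match.
mail = "Error contacting <aa@xx>. The mail address <aa@xx.com> is wrong."
address = re.compile(
    r"[a-z0-9_.-]+@[a-z0-9_.-]+(?!.*(expired|invalid|deleted|unknown))"
)
found = [m.group(0) for m in address.finditer(mail)]
print(found)  # ['aa@xx', 'aa@xx.com']
```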
(?<=pattern)
The positive lookbehind (?<=pattern) lets the part of the expression after it match only if it is immediately preceded by text that matches pattern.
Example: in the string BMW 325, the pattern (?<=BMW )[0-9]{3} matches 325.
This construct is not supported by VBScript Regular Expressions 5.5.
(?<!pattern)
The negative lookbehind (?<!pattern) lets the part of the expression after it match only if it is NOT immediately preceded by text that matches pattern.
Example: in the string Fiat 500, the pattern (?<!BMW )[0-9]{3} matches 500.
This construct is not supported by VBScript Regular Expressions 5.5.
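Since the VBScript library cannot run the two lookbehind examples, here is a short sketch in Python, whose re module accepts lookbehind as long as the pattern inside it has a fixed width (as BMW followed by a space does):

```python
import re

# Positive lookbehind: three digits preceded by "BMW ".
pos = re.search(r"(?<=BMW )[0-9]{3}", "BMW 325")
print(pos.group(0))  # 325

# Negative lookbehind: three digits NOT preceded by "BMW ".
neg = re.search(r"(?<!BMW )[0-9]{3}", "Fiat 500")
print(neg.group(0))  # 500

# The negative form rejects the digits that come right after "BMW ".
neg_bmw = re.search(r"(?<!BMW )[0-9]{3}", "BMW 325")
print(neg_bmw)  # None
```

Note that the lookbehind is checked only at the position where the digits start: with a longer number such as BMW 3250, the negative form would still match 250, because those three digits are preceded by 3 rather than by BMW and a space.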

Testing regular expressions

There are many regex testing tools online, but I found most of them don't support the most advanced constructs I described above. This one instead supports all of the above constructs.

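The following Visual Basic 6 code shows the behaviour of VBScript Regular Expressions. To make it work without changes, add Microsoft Rich Textbox Control to the Controls, add Microsoft VBScript Regular Expressions 5.5 to the References, and place the controls the code refers to on a form: a text box named txtRE for the pattern, a text box named txtString for the text to search, and a rich text box named rtfResult for the result.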

Private Sub ApplyRegExp()
    On Error GoTo ErrHandler

    Dim RE As RegExp
    Dim Matches As MatchCollection
    Dim Match As Match
    Dim SubMatches As SubMatches
    Dim m As Long, sm As Long
    Dim TextString As String

    ' Configure the engine: ignore case, find all matches, multi-line mode
    Set RE = New RegExp
    RE.IgnoreCase = True
    RE.Global = True
    RE.MultiLine = True
    RE.Pattern = txtRE.Text

    ' Replace each CR+LF pair with a single CR before matching
    TextString = Replace(txtString.Text, vbCrLf, vbCr)
    Set Matches = RE.Execute(TextString)
    rtfResult.Text = "Matches: " & Matches.Count & vbCrLf
    For m = 0 To Matches.Count - 1
        Set Match = Matches.Item(m)
        ' Append the match at the end of the result
        rtfResult.SelStart = Len(rtfResult.Text)
        rtfResult.SelLength = 0
        rtfResult.SelText = "Match " & m & ": " & CStr(Match.Value) & vbCrLf
        ' Select the matched text (the - 2 skips the trailing vbCrLf) and colour it red
        rtfResult.SelStart = Len(rtfResult.Text) - Len(CStr(Match.Value)) - 2
        rtfResult.SelLength = Len(CStr(Match.Value))
        rtfResult.SelColor = vbRed
        ' List the submatches, one for every captured group
        Set SubMatches = Match.SubMatches
        rtfResult.SelStart = Len(rtfResult.Text)
        rtfResult.SelLength = 0
        rtfResult.SelText = "----> Submatches: " & SubMatches.Count & vbCrLf
        For sm = 0 To SubMatches.Count - 1
            rtfResult.SelStart = Len(rtfResult.Text)
            rtfResult.SelLength = 0
            rtfResult.SelText = "--------> Submatch " & sm & ": " & CStr(SubMatches.Item(sm)) & vbCrLf
            ' Select the submatch text and colour it blue
            rtfResult.SelStart = Len(rtfResult.Text) - Len(CStr(SubMatches.Item(sm))) - 2
            rtfResult.SelLength = Len(CStr(SubMatches.Item(sm)))
            rtfResult.SelColor = vbBlue
        Next sm
        Set SubMatches = Nothing
        Set Match = Nothing
    Next m
    Set Matches = Nothing
    ' Move the caret back to the top of the result
    rtfResult.SelStart = 0
    rtfResult.SelLength = 0

    Set RE = Nothing

    Exit Sub

ErrHandler:
    MsgBox "Error: " & Err.Description
    Err.Clear

End Sub
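
For readers without Visual Basic 6, the same kind of report can be sketched in a few lines of Python. This is only a rough counterpart, not a port: the function name describe_matches is hypothetical, the flags are my approximation of the RegExp properties set above, and Python's regex engine differs from the VBScript one (for instance, it supports lookbehind).

```python
import re

def describe_matches(pattern, text):
    """Return a report listing every match and its captured submatches."""
    # IGNORECASE and MULTILINE stand in for RE.IgnoreCase and RE.MultiLine;
    # finditer returns all matches, as RE.Global = True does.
    regex = re.compile(pattern, re.IGNORECASE | re.MULTILINE)
    matches = list(regex.finditer(text))
    lines = [f"Matches: {len(matches)}"]
    for m_index, match in enumerate(matches):
        lines.append(f"Match {m_index}: {match.group(0)}")
        submatches = match.groups()  # one entry per capturing group
        lines.append(f"----> Submatches: {len(submatches)}")
        for sm_index, sub in enumerate(submatches):
            lines.append(f"--------> Submatch {sm_index}: {sub}")
    return "\n".join(lines)

report = describe_matches(r"(?:[a-z0-9]+)\s([a-z0-9]+)", "test1 test2 test1")
print(report)
```

With the pattern and string from the (?:pattern) section, the report shows one match (test1 test2) with a single submatch (test2), since the first group does not capture.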


Posted by: Z24 | Sat, Dec 22 2007 | Category: /programming




http://www.mycomputingart.com/

To contact the webmaster and author write to: info<at>mycomputingart<dot>com