MyComputingArt




Regular expressions pattern options

In regular expressions there are some constructs that influence the way a pattern is matched:


By default, the quantifiers *, +, ?, {n} and {n,m} are greedy: they match as many characters as possible while still allowing the rest of the pattern to match.
Example: in the string example12345, the pattern [a-z]+ matches example.
The non-greedy quantifiers *?, +?, ??, {n}? and {n,m}? match as few characters as possible.
Example: in the string example12345, the pattern [a-z]+? finds 7 different matches: e, x, a, m, p, l, e.
Example: in the string www<aaaa>xxxx<bbb>yyyy<cccc>zzzz, the pattern <[a-z><]+> has only one match: <aaaa>xxxx<bbb>yyyy<cccc>.
Example: in the same string www<aaaa>xxxx<bbb>yyyy<cccc>zzzz, the pattern <[a-z><]+?> has 3 matches: <aaaa>, <bbb> and <cccc>.
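The greedy and non-greedy examples above can be reproduced with a short script. This is only a sketch that uses Python's re module rather than the VBScript library discussed in this article; the variable names are mine, and re.findall plays the role of a global search.

```python
import re

text = "example12345"
tags = "www<aaaa>xxxx<bbb>yyyy<cccc>zzzz"

# Greedy: + takes as many letters as it can in one go.
greedy = re.findall(r"[a-z]+", text)
# Non-greedy: +? stops after a single letter, so each letter is its own match.
lazy = re.findall(r"[a-z]+?", text)

# Greedy: runs to the last > in the string.
greedy_tags = re.findall(r"<[a-z><]+>", tags)
# Non-greedy: stops at the first > after each <.
lazy_tags = re.findall(r"<[a-z><]+?>", tags)

print(greedy)       # ['example']
print(lazy)         # ['e', 'x', 'a', 'm', 'p', 'l', 'e']
print(greedy_tags)  # ['<aaaa>xxxx<bbb>yyyy<cccc>']
print(lazy_tags)    # ['<aaaa>', '<bbb>', '<cccc>']
```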
By default, when a pattern is enclosed in parentheses, the text it matches is captured in memory and can be accessed using the backreference \number (for instance, to check whether the matched text is repeated).
Example: in the string test1 test2 test1, the pattern ([a-z0-9]+)\s([a-z0-9]+)\s\1 matches because \1 refers to the first captured text, test1.
When ?: comes immediately after the left parenthesis, the pattern is not captured.
Example: in the string test1 test2 test1, the pattern (?:[a-z0-9]+)\s([a-z0-9]+)\s\1 does not match because the first captured match (\1) is test2.
Since I became acquainted with this construct, I tend to use it whenever there is no need to store the matched text in memory (almost always).
Moreover, the Microsoft VBScript Regular Expressions 5.5 library returns a match for every match found and a submatch for every captured group, so understanding (?:pattern) is very useful for keeping the submatches limited to the groups that are actually needed.
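The difference between capturing and non-capturing groups can be checked with the following sketch. It assumes Python's re module (not the VBScript library), where groups() is roughly the counterpart of the submatches; the variable names are only illustrative.

```python
import re

text = "test1 test2 test1"

# Two capturing groups: \1 is "test1", so the backreference succeeds.
capturing = re.search(r"([a-z0-9]+)\s([a-z0-9]+)\s\1", text)

# The first group no longer captures: \1 is now "test2", so nothing matches.
non_capturing = re.search(r"(?:[a-z0-9]+)\s([a-z0-9]+)\s\1", text)

# Without the backreference, only one submatch is stored.
one_group = re.search(r"(?:[a-z0-9]+)\s([a-z0-9]+)", text)

print(capturing.group(0))  # test1 test2 test1
print(capturing.groups())  # ('test1', 'test2')
print(non_capturing)       # None
print(one_group.groups())  # ('test2',)
```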
The positive lookahead (?=pattern) lets the part of the expression before it match only if that part is immediately followed by text matching pattern.
Example: in the string BMW Z3, the pattern BMW (?=Z[0-9]) matches BMW followed by the space (Z3 is checked but is not part of the match).
Example: in the string BMW 325, the pattern BMW (?=Z[0-9]) doesn't match.
Moreover, the lookahead ?= only looks ahead, without consuming what it inspects. Usually, when a match is found, the string is searched for the next match beginning from the character after the end of that match. When ?= is used, the part of the string that matches (?=pattern) is not included in the match, so the search resumes from the character after the text matched by the part of the expression before the ?=, and the looked-ahead text can take part in the next match.
Example: without lookahead, in the string x=a2*b*3+7*50;, the pattern [*/+-=]([a-z0-9]+)[*/+-;] matches =a2*, *3+ and *50; because once the first match is found (=a2*, capturing a2), the next match is searched for in the remaining string b*3+7*50;, where the * before b has already been consumed.
Example: using the ?= construct, in the string x=a2*b*3+7*50;, the pattern [*/+-=]([a-z0-9]+)(?=[*/+-;]) matches =a2, *b, *3, +7 and *50 because once the first match is found (=a2, capturing a2), the next match is searched for in the remaining string *b*3+7*50;.
Note that in these character classes the unescaped - forms the ranges +-= and +-;, which also include the digits; the matches listed here are the same when the - is written last in the class, where it is a literal hyphen.
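The lookahead examples above can be tried with the sketch below. It assumes Python's re module instead of the VBScript library; re.finditer stands in for a global search, and group(0) is the whole match. The variable names are mine.

```python
import re

# The lookahead checks for Z followed by a digit but does not include it.
z3 = re.search(r"BMW (?=Z[0-9])", "BMW Z3")
no_z = re.search(r"BMW (?=Z[0-9])", "BMW 325")

expr = "x=a2*b*3+7*50;"

# The trailing operator is consumed, so it cannot start the next match.
consuming = [m.group(0)
             for m in re.finditer(r"[*/+-=]([a-z0-9]+)[*/+-;]", expr)]

# The trailing operator is only looked at, so the next match can start on it.
lookahead = [m.group(0)
             for m in re.finditer(r"[*/+-=]([a-z0-9]+)(?=[*/+-;])", expr)]

print(repr(z3.group(0)))  # 'BMW '
print(no_z)               # None
print(consuming)          # ['=a2*', '*3+', '*50;']
print(lookahead)          # ['=a2', '*b', '*3', '+7', '*50']
```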
The negative lookahead (?!pattern) lets the part of the expression before it match only if that part is NOT immediately followed by text matching pattern.
Example: in the string BMW 325, the pattern BMW (?!Z[0-9]) matches BMW followed by the space.
As with the positive lookahead, the lookahead pattern is not included in the match, and after a match is found the pattern is searched again beginning from the character after that match.
Example: in the string Error contacting <aa@xx>. The mail address <aa@xx.com> is wrong., the pattern [a-z0-9_.-]+@[a-z0-9_.-]+(?!.*(expired|invalid|deleted|unknown)) matches both aa@xx and aa@xx.com, since none of the listed words follows either address: the first match is searched for in the entire string and the second one in the remainder >. The mail address <aa@xx.com> is wrong..
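A sketch of the negative lookahead examples, again assuming Python's re module rather than the VBScript library. The second message (the one containing the word invalid) is a made-up string added only to show the lookahead rejecting a match.

```python
import re

# No Z<digit> after "BMW ", so the negative lookahead lets it match.
bmw_325 = re.search(r"BMW (?!Z[0-9])", "BMW 325")
# Here Z3 follows, so the match is rejected.
bmw_z3 = re.search(r"BMW (?!Z[0-9])", "BMW Z3")

pattern = r"[a-z0-9_.-]+@[a-z0-9_.-]+(?!.*(expired|invalid|deleted|unknown))"

message = "Error contacting <aa@xx>. The mail address <aa@xx.com> is wrong."
addresses = [m.group(0) for m in re.finditer(pattern, message)]

# Hypothetical message: one of the listed words follows the address.
rejected_message = "The mail address <aa@xx.com> is invalid."
rejected = [m.group(0) for m in re.finditer(pattern, rejected_message)]

print(repr(bmw_325.group(0)))  # 'BMW '
print(bmw_z3)                  # None
print(addresses)               # ['aa@xx', 'aa@xx.com']
print(rejected)                # []
```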
The positive lookbehind (?<=pattern) lets the part of the expression after it match only if that part is immediately preceded by text matching pattern.
Example: in the string BMW 325, the pattern (?<=BMW )[0-9]{3} matches 325.
This construct is not supported by VBScript Regular Expressions 5.5.
The negative lookbehind (?<!pattern) lets the part of the expression after it match only if that part is NOT immediately preceded by text matching pattern.
Example: in the string Fiat 500, the pattern (?<!BMW )[0-9]{3} matches 500.
This construct is not supported by VBScript Regular Expressions 5.5.
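Since VBScript Regular Expressions 5.5 lacks lookbehind, the two examples above have to be tried in another engine. The sketch below assumes Python's re module, which supports lookbehind as long as the lookbehind pattern has a fixed length (as BMW plus a space does); the variable names are mine.

```python
import re

# Positive lookbehind: three digits only when "BMW " comes right before them.
after_bmw = re.search(r"(?<=BMW )[0-9]{3}", "BMW 325")
after_fiat = re.search(r"(?<=BMW )[0-9]{3}", "Fiat 500")

# Negative lookbehind: three digits only when "BMW " does NOT come before them.
not_after_bmw = re.search(r"(?<!BMW )[0-9]{3}", "Fiat 500")
blocked = re.search(r"(?<!BMW )[0-9]{3}", "BMW 325")

print(after_bmw.group(0))      # 325
print(after_fiat)              # None
print(not_after_bmw.group(0))  # 500
print(blocked)                 # None
```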

Testing regular expressions

There are many regex testing tools online, but I found that most of them don't support the more advanced constructs described above. This one, instead, supports all of them.

The following Visual Basic 6 code shows the behaviour of VBScript Regular Expressions. To make it work without changes, add Microsoft Rich Textbox Control to the Components and Microsoft VBScript Regular Expressions 5.5 to the References, then place on a form the three controls the code refers to: txtRE (the pattern), txtString (the text to search) and the rich text box rtfResult (the output).

Private Sub ApplyRegExp()
    On Error GoTo ErrHandler

    Dim RE As RegExp
    Dim Matches As MatchCollection
    Dim Match As Match
    Dim SubMatches As SubMatches
    Dim m As Integer, sm As Integer
    Dim TextString As String

    ' Case-insensitive search that returns every match, not just the first
    Set RE = New RegExp
    RE.IgnoreCase = True
    RE.Global = True
    RE.MultiLine = True
    RE.Pattern = txtRE.Text

    ' Search the text with CR+LF line breaks reduced to a single CR
    TextString = Replace(txtString.Text, vbCrLf, vbCr)
    Set Matches = RE.Execute(TextString)
    rtfResult.Text = "Matches: " & Matches.Count & vbCrLf
    For m = 0 To Matches.Count - 1
        Set Match = Matches.Item(m)
        ' Append the match to the output, then colour its value red
        rtfResult.SelStart = Len(rtfResult.Text)
        rtfResult.SelLength = 0
        rtfResult.SelText = "Match " & m & ": " & CStr(Match.Value) & vbCrLf
        rtfResult.SelStart = Len(rtfResult.Text) - Len(CStr(Match.Value)) - 2
        rtfResult.SelLength = Len(CStr(Match.Value))
        rtfResult.SelColor = vbRed
        ' List the submatches: one for every capturing group of the pattern
        Set SubMatches = Match.SubMatches
        rtfResult.SelStart = Len(rtfResult.Text)
        rtfResult.SelLength = 0
        rtfResult.SelText = "----> Submatches: " & SubMatches.Count & vbCrLf
        For sm = 0 To SubMatches.Count - 1
            ' Append the submatch, then colour its value blue
            rtfResult.SelStart = Len(rtfResult.Text)
            rtfResult.SelLength = 0
            rtfResult.SelText = "--------> Submatch " & sm & ": " & CStr(SubMatches.Item(sm)) & vbCrLf
            rtfResult.SelStart = Len(rtfResult.Text) - Len(CStr(SubMatches.Item(sm))) - 2
            rtfResult.SelLength = Len(CStr(SubMatches.Item(sm)))
            rtfResult.SelColor = vbBlue
        Next sm
        Set SubMatches = Nothing
        Set Match = Nothing
    Next m
    Set Matches = Nothing

    ' Move the cursor back to the top of the output
    rtfResult.SelStart = 0
    rtfResult.SelLength = 0

    Set RE = Nothing

    Exit Sub

ErrHandler:
    MsgBox "Error: " & Err.Description
    Err.Clear

End Sub



Posted by: Z24 | Sat, Dec 22 2007 | Category: /programming


To contact the webmaster and author write to: info<at>mycomputingart<dot>com