sommergyll.software(c#);
"Guide to designing less basic regular expressions in C#"
 

Matching a Complex Pattern

Now that we have shown the basic C# classes needed for pattern matching, we will present a source string containing "interesting" information that we want to find.

Looking at the Pattern

Pattern:
\s+(\w{2}\s+)+((([\w\s])\.?)+)
012345678901234567890123456789
          1         2         

The purpose of this pattern is to extract the text part of the line while skipping the initial white space, the hexadecimal codes and the dots (see the source string below). Let's take a closer look at this pattern before discussing the result.

The first two characters, "\s", tell the regex parser to match a white-space character in the source string.

Quantifiers

The "+" at index 2 is called a quantifier. It means that the character or Capture item (see below) to its left, white space in our case, must match at least once. Other quantifiers are "?", which matches 0 or 1 times, and "*", which matches 0 or more times. By default these quantifiers are "greedy": they try to match as much of the input string as possible. That behaviour is not always what we want. To match as few characters as possible, add a "?" just to the right of the quantifier. E.g. instead of the pattern "\w+", we could write "\w+?"; instead of "\w*", "\w*?", and so on.
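The difference between greedy and lazy quantifiers can be sketched with a few calls to Regex.Match. The sample input "AB DD" and the helper name FirstMatch are assumptions made for this illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text of the first match of `pattern` in `input`
    // (an empty string if the match is empty or nothing matches).
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string input = "AB DD"; // hypothetical sample input

        // Greedy: takes as many word characters as possible.
        Console.WriteLine("[" + FirstMatch(input, @"\w+") + "]");  // prints [AB]

        // Lazy: stops after the minimum of one word character.
        Console.WriteLine("[" + FirstMatch(input, @"\w+?") + "]"); // prints [A]

        // Lazy with a minimum of zero: an empty match is enough.
        Console.WriteLine("[" + FirstMatch(input, @"\w*?") + "]"); // prints []
    }
}
```

A lazy quantifier only takes more characters when the rest of the pattern would otherwise fail, which is why it is mostly useful when something follows it in the pattern.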

 

Capture Items

At index 3 and 12, there are an opening and a closing parenthesis respectively. The content inside these parentheses defines a Capture item. Parentheses here mean that we either want to capture whatever is matched in the source string, or need to group characters into a unit to be matched (i.e. it is not required to use the captured result).

Inside the parentheses is another escape code: "\w", which matches any alphanumeric character (letters, digits and the underscore). The "{2}" at index 6 means that this character must match exactly two times. Other variants are "{2,5}", between 2 and 5 times, and "{2,}", at least 2 times.
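The counted quantifiers can be tried out with Regex.IsMatch. This is a minimal sketch: the helper name Fits and the sample inputs are hypothetical, and the anchors "^" and "$" are added here (they are not part of the guide's pattern) so that the whole input has to fit the quantifier.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CountedQuantifierDemo
{
    // True when the entire input consists of word characters repeated
    // as many times as `quantifier` allows. "^" and "$" are assumptions
    // of this sketch; they force the whole input to take part in the match.
    public static bool Fits(string input, string quantifier)
    {
        return Regex.IsMatch(input, @"^\w" + quantifier + "$");
    }

    public static void Main()
    {
        Console.WriteLine(Fits("AB", "{2}"));        // True: exactly two
        Console.WriteLine(Fits("ABC", "{2}"));       // False: three is too many
        Console.WriteLine(Fits("ABC", "{2,5}"));     // True: between 2 and 5
        Console.WriteLine(Fits("ABCDEF", "{2,5}"));  // False: six is too many
        Console.WriteLine(Fits("ABCDEF", "{2,}"));   // True: at least two
        Console.WriteLine(Fits("A", "{2,}"));        // False: only one
    }
}
```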

The escape code at index 9 again matches white space. Notice the "+" at index 13, which tells us that the content inside the parentheses (index 3 and 12) can match 1 or more times.

At index 14, a new Capture item starts. It extends to the last index, 29.

A new, but slightly smaller Capture item starts at index 15. Someone might wonder why we need all these parentheses, but that question will be answered shortly.

Finally, at index 16, the last Capture item of this pattern starts. It contains square brackets (index 17 and 22), which allow any one of the enclosed characters to match a single source character. As mentioned before, "\w" matches an alphanumeric character and "\s" any white-space character.

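At index 25, we have a dot. Using a "." (dot) usually means we want to match ANY character. However, as we have placed an escape character "\" just before it, it will be treated as an "ordinary" dot. The "?" at index 26 makes this dot optional, and the "+" at index 28 repeats the Capture item that starts at index 15.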
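The effect of escaping the dot can be checked by counting matches. The following is only a sketch; the helper name Count and the sample input "T.h.I" (a fragment of the source string shown further down) are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // Counts how many times `pattern` matches in `input`.
    public static int Count(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string input = "T.h.I"; // hypothetical sample input

        // Unescaped dot: every one of the five characters matches.
        Console.WriteLine(Count(input, "."));    // prints 5

        // Escaped dot: only the two literal dots match.
        Console.WriteLine(Count(input, @"\."));  // prints 2

        // "\.?" accepts a character with or without a trailing dot.
        Console.WriteLine(Regex.IsMatch("S.", @"^\w\.?$")); // prints True
        Console.WriteLine(Regex.IsMatch("S", @"^\w\.?$"));  // prints True
    }
}
```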

Looking at the Source String and Results listing

Source:
   AB DD D0 A8 F3      T.h.I.S. .I.S. .A. .T.E.SiT
01234567890123456789012345678901234567890123456789
          1         2         3         4

The source contains, as mentioned before, some hexadecimal codes before the string of characters we would like to view. To make it a bit more challenging, we also need to "filter" the dots placed in between the characters.

Notice how the Result string on row 2 matches the whole source string, which means that our pattern covers the complete line. As we want to extract the user-friendly text, we need parentheses in strategic places.

The first part of the pattern, "\s+", matches the leading spaces in the source string (index 0-2). The pattern within the first set of parentheses (index 3-12) is then matched, and the matching part of the source is captured and displayed on row 6 in the Results box. As there is a "+" in the pattern just to the right of these parentheses, the pattern continues matching the hexadecimal codes, see rows 6-10. As this is a repetition of the pattern, each captured string is put on a separate row. We do not care about these captured strings in this implementation; they exist only to consume the correct characters.

Notice how the Group string on row 5 has the same value as the last Capture in that Group's Capture collection, on row 10.

The question of why there are so many parentheses at the end of the pattern is answered by realising that if we remove the ones at index 14 and 29, rows 11 and 12 disappear. That would actually be fine here; those particular parentheses were included only to give more insight into the Capture item. On row 12, we capture the whole text part of the source string. The parentheses at index 15 and 27, on the other hand, are necessary, as we need to group the "[\w\s]" with the "\.?" in order to quantify both with the "+". Finally, by capturing only the "[\w\s]" part, we filter out the dots and keep the "interesting" part, which in this case is the alphanumeric and space characters. These are put in separate captured strings, see rows 30-44, and we can easily concatenate them to form the desired output string.
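The concatenation step described above can be sketched as follows. The pattern and source string are copied from this guide; the class and method names are hypothetical, and the code is one possible way to join the captures rather than the implementation behind the Results box.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class ExtractText
{
    // Pattern and source string as given in this guide.
    public const string Pattern = @"\s+(\w{2}\s+)+((([\w\s])\.?)+)";
    public const string Source = "   AB DD D0 A8 F3      T.h.I.S. .I.S. .A. .T.E.SiT";

    // Joins every capture of the innermost group, "([\w\s])",
    // into one string. Returns an empty string if nothing matches.
    public static string Extract(string source)
    {
        Match match = Regex.Match(source, Pattern);
        if (!match.Success)
        {
            return string.Empty;
        }

        StringBuilder text = new StringBuilder();
        // Groups are numbered by their opening parenthesis, so the
        // group that opens at pattern index 16 is group number 4.
        foreach (Capture capture in match.Groups[4].Captures)
        {
            text.Append(capture.Value);
        }
        return text.ToString();
    }

    public static void Main()
    {
        Console.WriteLine("[" + Extract(Source) + "]"); // prints [ThIS IS A TESiT]
    }
}
```

Reading match.Groups[4].Value alone would not be enough here, because a Group's value only holds its last capture.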

Results:                                
1  Results:                                
2  Match: [   AB DD D0 A8 F3      T.h.I.S. .I.S. .A. .T.E.SiT]
3   Group: [   AB DD D0 A8 F3      T.h.I.S. .I.S. .A. .T.E.SiT]
4    Capture: [   AB DD D0 A8 F3      T.h.I.S. .I.S. .A. .T.E.SiT] at pos: 0
5   Group: [F3      ]
6    Capture: [AB ] at pos: 3
7    Capture: [DD ] at pos: 6
8    Capture: [D0 ] at pos: 9
9    Capture: [A8 ] at pos: 12
10   Capture: [F3      ] at pos: 15
11  Group: [T.h.I.S. .I.S. .A. .T.E.SiT]
12   Capture: [T.h.I.S. .I.S. .A. .T.E.SiT] at pos: 23
13  Group: [T]
14   Capture: [T.] at pos: 23
15   Capture: [h.] at pos: 25
16   Capture: [I.] at pos: 27
17   Capture: [S.] at pos: 29
18   Capture: [ .] at pos: 31
19   Capture: [I.] at pos: 33
20   Capture: [S.] at pos: 35
21   Capture: [ .] at pos: 37
22   Capture: [A.] at pos: 39
23   Capture: [ .] at pos: 41
24   Capture: [T.] at pos: 43
25   Capture: [E.] at pos: 45
26   Capture: [S] at pos: 47
27   Capture: [i] at pos: 48
28   Capture: [T] at pos: 49
29  Group: [T]
30   Capture: [T] at pos: 23
31   Capture: [h] at pos: 25
32   Capture: [I] at pos: 27
33   Capture: [S] at pos: 29
34   Capture: [ ] at pos: 31
35   Capture: [I] at pos: 33
36   Capture: [S] at pos: 35
37   Capture: [ ] at pos: 37
38   Capture: [A] at pos: 39
39   Capture: [ ] at pos: 41
40   Capture: [T] at pos: 43
41   Capture: [E] at pos: 45
42   Capture: [S] at pos: 47
43   Capture: [i] at pos: 48
44   Capture: [T] at pos: 49 
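A listing like the one above can be produced by walking the Match, its Groups and each Group's Captures. This is a minimal sketch under assumptions: the class and method names are hypothetical, the row numbers of the listing are left out, and the indentation only approximates that of the Results box.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ResultsDump
{
    // Pattern and source string as given in this guide.
    public const string Pattern = @"\s+(\w{2}\s+)+((([\w\s])\.?)+)";
    public const string Source = "   AB DD D0 A8 F3      T.h.I.S. .I.S. .A. .T.E.SiT";

    // Builds one line for the match, one per group and one per capture.
    public static List<string> Dump(string source, string pattern)
    {
        List<string> lines = new List<string>();
        Match match = Regex.Match(source, pattern);
        if (!match.Success)
        {
            return lines;
        }

        lines.Add("Match: [" + match.Value + "]");
        // Group 0 is the whole match; groups 1-4 follow the order
        // of the opening parentheses in the pattern.
        foreach (Group group in match.Groups)
        {
            // Group.Value is the text of the group's last capture.
            lines.Add(" Group: [" + group.Value + "]");
            foreach (Capture capture in group.Captures)
            {
                lines.Add("  Capture: [" + capture.Value + "] at pos: " + capture.Index);
            }
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Dump(Source, Pattern))
        {
            Console.WriteLine(line);
        }
    }
}
```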
 
 

© Copyright 2003-2010 Sommergyll Software. All Rights Reserved.
