Matching a Complex Pattern
Now that we have shown the basic C# classes needed for pattern matching, we will
present a source string that contains some "interesting" information which we are
keen on finding.
Looking at the Pattern
Pattern:
\s+(\w{2}\s+)+((([\w\s])\.?)+)
012345678901234567890123456789
          1         2
The purpose of this pattern is to extract the part of the line that contains the
readable characters, skipping the initial white space, the hexadecimal codes and
the dots (see the source string below). Let's take a closer look at this pattern
before discussing the result.
The first two characters, "\s", tell the regex parser to match a white-space
character in the source string.
Quantifiers
The "+" at index 2 is called a quantifier. It means that the character or Capture
item (see below) to its left, white space in our case, must match at least once
and may match repeatedly. Other quantifiers are "?", which matches 0 or 1 time,
and "*", which matches 0 or more times. Unfortunately, these quantifiers are "greedy"!
This means that they try to match as much of the input string as possible. That
behaviour is not always what we want, and we can adjust the pattern by adding
a "?" just to the right of the quantifier, so that it matches as few characters as
possible. E.g. instead of the pattern "\w+", we could write "\w+?", instead of "\w*"
we write "\w*?", and so on.
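To make the difference between greedy and lazy quantifiers concrete, here is a
minimal sketch. The class name QuantifierSketch, the helper FirstMatch and the sample
input are made up for this illustration; only Regex.Match and Match.Value come from
the .NET library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierSketch
{
    // Hypothetical helper: returns the first match of the pattern, or "" if none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string input = "AB DD D0"; // sample input, assumed for illustration

        // Greedy: "\w+" takes as many word characters as it can.
        Console.WriteLine("[" + FirstMatch(input, @"\w+") + "]");   // prints [AB]

        // Lazy: "\w+?" stops after the minimum of one character.
        Console.WriteLine("[" + FirstMatch(input, @"\w+?") + "]");  // prints [A]

        // Lazy "*": zero characters is enough, so the match is empty.
        Console.WriteLine("[" + FirstMatch(input, @"\w*?") + "]");  // prints []
    }
}
```

A lazy quantifier only pays off when something follows it in the pattern; on its own
it simply settles for the shortest possible match, as the output shows.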
Capture Items
At index 3 and 12, there is an opening and a closing parenthesis respectively. The
content inside these parentheses defines a Capture item. Having parentheses here
means either that we want to capture whatever is matched against the source string,
or that we need to group characters to form a unit to be matched (in which case it
is not required to use the captured result).
Inside the parentheses is another escape code: "\w", which matches any alphanumeric
character (or an underscore). The "{2}" at index 6 means that this character must
match exactly two times. Other variants are "{2,5}", between 2 and 5 times, and
"{2,}", at least 2 times.
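The counted quantifiers can be tried out with a small sketch like the following. The
class name CountedQuantifierSketch, the helper FirstMatch and the input "ABCDEFG" are
assumptions made for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CountedQuantifierSketch
{
    // Hypothetical helper: returns the first match of the pattern, or "" if none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string input = "ABCDEFG"; // sample input, assumed for illustration

        // Exactly two word characters.
        Console.WriteLine(FirstMatch(input, @"\w{2}"));    // prints AB

        // Between two and five; greedy, so it takes five.
        Console.WriteLine(FirstMatch(input, @"\w{2,5}"));  // prints ABCDE

        // At least two; greedy, so it takes all seven.
        Console.WriteLine(FirstMatch(input, @"\w{2,}"));   // prints ABCDEFG
    }
}
```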
The escape code at index 9 again matches white space, and the "+" at index 11 lets
it match 1 or more times. Notice the "+" at index 13, which tells us that the content
inside the parentheses (index 3 and 12) can match 1 or more times.
At index 14, a new Capture item starts. It extends to the last index, 29.
A new, but slightly smaller Capture item starts at index 15. Someone might wonder why
we need all these parentheses, but that question will be answered shortly.
At index 16, we have the final Capture item for this pattern. This Capture item
contains square brackets (see index 17 and 22), which allow any one of the enclosed
characters to be matched with one source character. As mentioned before, "\w" stands
for an alphanumeric character and "\s" for any white-space character.
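At index 25, we have a dot. Using a "." (dot) usually means we want to match ANY
character. However, as we have placed an escape character "\" just before it, it
will be treated as an "ordinary" dot. The "?" at index 26 makes this dot optional,
and the "+" at index 28 lets the Capture item that starts at index 15 match 1 or
more times.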
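The effect of escaping the dot, and of combining the character class with an optional
dot, can be checked with a short sketch. The class name DotSketch, the helper
CountMatches and the input "T.h" are assumptions for this example; Regex.Matches is
the library call doing the work.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotSketch
{
    // Hypothetical helper: counts how many times the pattern matches in the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string input = "T.h"; // sample input, assumed for illustration

        // Unescaped dot: matches every character, so 'T', '.' and 'h'.
        Console.WriteLine(CountMatches(input, "."));            // prints 3

        // Escaped dot: matches only the literal '.'.
        Console.WriteLine(CountMatches(input, @"\."));          // prints 1

        // Character class plus optional dot: matches "T." and then "h".
        Console.WriteLine(CountMatches(input, @"[\w\s]\.?"));   // prints 2
    }
}
```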
Looking at the Source String and Results listing
Source:
   AB DD D0 A8 F3      T.h.I.S. .I.S. .A. .T.E.SiT
01234567890123456789012345678901234567890123456789
          1         2         3         4
The source contains, as mentioned before, some hexadecimal codes before the string
of characters we would like to view. To make it a bit more challenging, we also
need to "filter" the dots placed in between the characters.
Notice how the Result string on row 2 matches the whole source string, which means
that we have managed to write a complete pattern. As we want to extract the
user-friendly text, we need to have parentheses in strategic places.
The first part of the pattern, "\s+", will match the spaces in the source string (index
0-2). The pattern within the first set of parentheses (index 3 - 12) is matched next,
and the matching part of the source is captured and displayed on row 6 in the Results
Box. As there is a "+" character in the pattern just to the right of the parentheses,
the pattern will continue matching the remaining hexadecimal codes, see row 6-10. As
this is a repetition of the pattern, those captured strings are put into separate
rows. We do not care about these captured strings in this implementation; they exist
only to match the correct characters.
Notice how the string on row 5 contains the same value as the last Captured value
in that Group Collection, on row 10.
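The relation between a Group's value and its last Capture can be verified with a
sketch like this one. The class and method names are hypothetical, and the spacing of
the source string (3 leading spaces, 6 spaces after "F3") is an assumption
reconstructed from the positions shown in the Results listing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastCaptureSketch
{
    // The pattern discussed in this article.
    public const string Pattern = @"\s+(\w{2}\s+)+((([\w\s])\.?)+)";

    // Assumed spacing: 3 leading spaces and 6 spaces after "F3".
    public static string SampleSource()
    {
        return new string(' ', 3) + "AB DD D0 A8 F3"
             + new string(' ', 6) + "T.h.I.S. .I.S. .A. .T.E.SiT";
    }

    // True when the group's Value equals the Value of its last Capture.
    public static bool GroupValueIsLastCapture(string source, int groupNumber)
    {
        Group group = Regex.Match(source, Pattern).Groups[groupNumber];
        CaptureCollection captures = group.Captures;
        return captures.Count > 0
            && group.Value == captures[captures.Count - 1].Value;
    }

    public static void Main()
    {
        Match match = Regex.Match(SampleSource(), Pattern);

        // Group 1 is the first set of parentheses: one capture per hex code.
        Console.WriteLine(match.Groups[1].Captures.Count);             // prints 5
        Console.WriteLine(GroupValueIsLastCapture(SampleSource(), 1)); // prints True
    }
}
```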
The question of why there are so many parentheses at the end of the pattern is answered
by realising that if we remove the ones at index 14 and 29, we will remove row 11
and 12. This is actually fine here; the purpose of including those particular
parentheses was just to give more insight into the Capture item. On row 12, we capture
the whole character part of the source string. The parentheses at index 15 and 27
are, on the other hand, necessary, as we need to group the "[\w\s]" with the "\.?" in
order to quantify the pair with the "+" character. Finally, by capturing only the
"[\w\s]" part, we filter out the dots and keep the "interesting" part, which in this
case is the alphanumeric and space characters. These are put in separate captured
strings, see row 30-44, and we can easily concatenate them to form the desired output
string.
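Concatenating the captures of the innermost Capture item could look like the sketch
below. It is one possible way to do it, not necessarily the one used to produce the
listing. The names CaptureJoinSketch and ExtractText are hypothetical, and the spacing
of the source string (3 leading spaces, 6 spaces after "F3") is an assumption
reconstructed from the positions in the Results listing.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CaptureJoinSketch
{
    // The pattern from this article; group 4 is the innermost ([\w\s]).
    public const string Pattern = @"\s+(\w{2}\s+)+((([\w\s])\.?)+)";

    // Joins every capture of group 4 into one string; "" if there is no match.
    public static string ExtractText(string source)
    {
        Match match = Regex.Match(source, Pattern);
        if (!match.Success)
        {
            return string.Empty;
        }

        StringBuilder text = new StringBuilder();
        foreach (Capture capture in match.Groups[4].Captures)
        {
            text.Append(capture.Value);
        }
        return text.ToString();
    }

    public static void Main()
    {
        // Assumed spacing: 3 leading spaces and 6 spaces after "F3".
        string source = new string(' ', 3) + "AB DD D0 A8 F3"
                      + new string(' ', 6) + "T.h.I.S. .I.S. .A. .T.E.SiT";

        Console.WriteLine("[" + ExtractText(source) + "]"); // prints [ThIS IS A TESiT]
    }
}
```

Groups are numbered by the position of their opening parenthesis, which is why the
innermost Capture item at index 16 is group 4.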
Results:
1 Results:
2 Match: [   AB DD D0 A8 F3      T.h.I.S. .I.S. .A. .T.E.SiT]
3 Group: [   AB DD D0 A8 F3      T.h.I.S. .I.S. .A. .T.E.SiT]
4 Capture: [   AB DD D0 A8 F3      T.h.I.S. .I.S. .A. .T.E.SiT] at pos: 0
5 Group: [F3      ]
6 Capture: [AB ] at pos: 3
7 Capture: [DD ] at pos: 6
8 Capture: [D0 ] at pos: 9
9 Capture: [A8 ] at pos: 12
10 Capture: [F3      ] at pos: 15
11 Group: [T.h.I.S. .I.S. .A. .T.E.SiT]
12 Capture: [T.h.I.S. .I.S. .A. .T.E.SiT] at pos: 23
13 Group: [T]
14 Capture: [T.] at pos: 23
15 Capture: [h.] at pos: 25
16 Capture: [I.] at pos: 27
17 Capture: [S.] at pos: 29
18 Capture: [ .] at pos: 31
19 Capture: [I.] at pos: 33
20 Capture: [S.] at pos: 35
21 Capture: [ .] at pos: 37
22 Capture: [A.] at pos: 39
23 Capture: [ .] at pos: 41
24 Capture: [T.] at pos: 43
25 Capture: [E.] at pos: 45
26 Capture: [S] at pos: 47
27 Capture: [i] at pos: 48
28 Capture: [T] at pos: 49
29 Group: [T]
30 Capture: [T] at pos: 23
31 Capture: [h] at pos: 25
32 Capture: [I] at pos: 27
33 Capture: [S] at pos: 29
34 Capture: [ ] at pos: 31
35 Capture: [I] at pos: 33
36 Capture: [S] at pos: 35
37 Capture: [ ] at pos: 37
38 Capture: [A] at pos: 39
39 Capture: [ ] at pos: 41
40 Capture: [T] at pos: 43
41 Capture: [E] at pos: 45
42 Capture: [S] at pos: 47
43 Capture: [i] at pos: 48
44 Capture: [T] at pos: 49
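A listing like the one above can be produced by walking the Groups of the Match and
the Captures of each Group. The following is only a sketch of that idea, not the
program that generated the listing: the names ResultListingSketch and BuildListing
are hypothetical, row numbers are left out, and the spacing of the source string is
reconstructed from the positions shown.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ResultListingSketch
{
    // Builds one row per Match, Group and Capture, in the order .NET reports them.
    public static List<string> BuildListing(string source, string pattern)
    {
        List<string> rows = new List<string>();
        Match match = Regex.Match(source, pattern);
        if (!match.Success)
        {
            return rows;
        }

        rows.Add("Match: [" + match.Value + "]");

        // Groups[0] is the whole match; Groups[1..] are the parenthesised items.
        foreach (Group group in match.Groups)
        {
            rows.Add("Group: [" + group.Value + "]");
            foreach (Capture capture in group.Captures)
            {
                rows.Add("Capture: [" + capture.Value + "] at pos: " + capture.Index);
            }
        }
        return rows;
    }

    public static void Main()
    {
        // Assumed spacing: 3 leading spaces and 6 spaces after "F3".
        string source = new string(' ', 3) + "AB DD D0 A8 F3"
                      + new string(' ', 6) + "T.h.I.S. .I.S. .A. .T.E.SiT";
        string pattern = @"\s+(\w{2}\s+)+((([\w\s])\.?)+)";

        foreach (string row in BuildListing(source, pattern))
        {
            Console.WriteLine(row);
        }
    }
}
```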