Matching lines in multiline regular expressions
I needed to parse some text that looked like
    fieldname: value
    anotherfieldname: anothervalue
    etc: somemore
    ----------
    repeatedfieldname: newvalue
    mayalsobeanotherfieldname: withanothervalue
    ==========
The regex previously used was
^(.*?)[=-]{10}$
with the singleline and multiline flags enabled (using the Perl5-compatible Jakarta ORO library).
It turned out, however, that some customers think their middle name is ' ----------' (even though no form anywhere on the website lets them select that). This breaks the code that splits the text above, because a line that merely ends in ten dashes also matches the old regex. There is a simple fix: to match ten dashes or equals signs only when they make up an entire line, while still capturing the groups, you can use
(.*?^)[=-]{10}$
I think moving a single character four places is quite an elegant bugfix.
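The difference between the two patterns can be reproduced with a small sketch. It uses java.util.regex rather than Jakarta ORO, on the assumption that Pattern.DOTALL and Pattern.MULTILINE correspond to the singleline and multiline flags mentioned above. The class name, the helper method, and the sample data are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex stands in for Jakarta ORO here.
class SeparatorRegexDemo {
    // DOTALL: '.' also matches newlines (assumed equivalent of "singleline").
    // MULTILINE: '^' and '$' also match at line boundaries.
    static final int FLAGS = Pattern.DOTALL | Pattern.MULTILINE;

    static final Pattern OLD = Pattern.compile("^(.*?)[=-]{10}$", FLAGS);
    static final Pattern NEW = Pattern.compile("(.*?^)[=-]{10}$", FLAGS);

    // Made-up input: the middle name value is ' ----------', appended to
    // 'middlename: ', so that line ends in ten dashes.
    static final String SAMPLE =
        "firstname: John\n" +
        "middlename:  ----------\n" +
        "lastname: Doe\n" +
        "----------\n" +
        "firstname: Jane\n" +
        "lastname: Roe\n" +
        "==========\n";

    // Collects the captured group of every match, trimmed of surrounding whitespace.
    static List<String> records(Pattern pattern, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("old regex: " + records(OLD, SAMPLE).size() + " records");
        System.out.println("new regex: " + records(NEW, SAMPLE).size() + " records");
    }
}
```

With this sample, the old pattern splits the text into three records because the line ending in ten dashes is treated as a separator, while the new pattern finds the two intended records.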
Comments
If a user can enter "----------" in one of the fields, what prevents them from entering exactly 10 dashes? This only seems to reduce the chance of encountering the bug.
I modified the example code, as I now realize it was unclear: the data the user enters does not constitute an entire line; it is appended to a 'fieldname: ' string.
BTW, the previous regex already only matched the case where a user entered exactly ten dashes, because one of the lines would then end in ten dashes, which constituted a match. The new regex only matches lines that have only ten dashes on them. No user can cause that to happen.
Why is the beginning-of-line '^' character inside the parentheses? Not that it really matters, but I'm curious whether it is needed or whether it 'just happens to work, so why not'.
In other words, does it matter if you replace
(.*?^)[=-]{10}$
... by
(.*?)^[=-]{10}$
?
The second option was actually my first solution, but then my colleague remarked that in the original regex, an end-of-line character before the [=-]{10} would be included in the captured group. In the second form, that wouldn't be the case. There weren't any situations in which that would matter, as far as we could see, but it's usually best to change as little as possible in an otherwise working system.
After I posted this, we actually modified it to
^(.*?^)[=-]{10}$
Again, we didn't think it would work out differently; we aren't even sure there are cases in which they are different.
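For anyone who wants to check, here is a rough comparison of the three variants discussed in this thread. It is only a sketch: it runs on java.util.regex with Pattern.DOTALL and Pattern.MULTILINE (not Jakarta ORO, so the results may not carry over exactly), it assumes the text is scanned with repeated find() calls, and the class name, helper, and sample data are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison harness; java.util.regex stands in for Jakarta ORO.
class CaptureGroupComparison {
    // Made-up sample with two records and two separator lines.
    static final String SAMPLE =
        "a: 1\n" +
        "b: 2\n" +
        "----------\n" +
        "c: 3\n" +
        "==========\n";

    // Returns the untrimmed text of group 1 for every match of the regex.
    static List<String> groups(String regex, String text) {
        Pattern p = Pattern.compile(regex, Pattern.DOTALL | Pattern.MULTILINE);
        Matcher m = p.matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] variants = {
            "(.*?^)[=-]{10}$",
            "(.*?)^[=-]{10}$",
            "^(.*?^)[=-]{10}$"
        };
        for (String regex : variants) {
            List<String> shown = new ArrayList<>();
            for (String g : groups(regex, SAMPLE)) {
                // Make newlines visible in the printed output.
                shown.add(g.replace("\n", "\\n"));
            }
            System.out.println(regex + " -> " + shown);
        }
    }
}
```

In this sketch the first two variants capture identical text, since '^' is zero-width and the newline before the separator is consumed by the group either way. The variant with the leading '^' differs only in the second group, which no longer starts with the newline left over from the previous separator line.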