Matching lines in multiline regular expressions

By Confusion on Monday 20 October 2008 10:41 - Comments (4)
Category: Software engineering, Views: 4.813

I needed to parse some text that looked like

code:
fieldname: value
anotherfieldname: anothervalue
etc: somemore
----------
repeatedfieldname: newvalue
mayalsobeanotherfieldname: withanothervalue
==========


The regex previously used was
^(.*?)[=-]{10}$

with the singleline and multiline flags enabled (using the Jakarta Oro PCRE-compatible library).

It turned out, however, that some customers think their middle name is ' ----------' (even though no form anywhere on the website lets them select that), which breaks the code that splits the text above. A simple solution exists: to match ten dashes or equals signs only when they form a complete line, while still capturing the groups, you can use
(.*?^)[=-]{10}$

I think moving a single character four places is quite an elegant bugfix :P
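
To see the effect of the change, here is a minimal sketch in Python, whose re module stands in for Jakarta Oro (an assumption: the two libraries are not identical, but ^, $, lazy quantifiers and the two flags should behave the same way for these patterns). The input text and the 'middlename' field are hypothetical; the value is the ' ----------' mentioned above, appended to a 'middlename: ' prefix.

```python
import re

# Perl's "singleline" and "multiline" flags are DOTALL and MULTILINE in Python.
FLAGS = re.DOTALL | re.MULTILINE

OLD = re.compile(r"^(.*?)[=-]{10}$", FLAGS)
NEW = re.compile(r"(.*?^)[=-]{10}$", FLAGS)

# Hypothetical input: one field value ends in ten dashes.
TEXT = (
    "fieldname: value\n"
    "middlename:  ----------\n"
    "etc: somemore\n"
    "----------\n"
    "repeatedfieldname: newvalue\n"
    "=========="
)


def first_record(pattern, text):
    """Return the text captured before the first separator the pattern finds."""
    match = pattern.search(text)
    return match.group(1) if match else None


old_first = first_record(OLD, TEXT)
new_first = first_record(NEW, TEXT)

# The old pattern stops inside the 'middlename' line.
print(repr(old_first))
# The new pattern only stops at the line that consists of ten dashes.
print(repr(new_first))
```

With the old pattern the first record is cut off in the middle of the 'middlename' line; with the new one it runs up to the real separator line.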


Comments


By Tweakers user Floort, Monday 20 October 2008 11:51

If a user can enter "----------" in one of the fields, what prevents them from entering exactly 10 dashes? This only seems to reduce the chance of encountering the bug.

By Tweakers user Confusion, Monday 20 October 2008 11:58

I modified the example code, as I now realize it was unclear: the data the user enters does not constitute an entire line; it is appended to a 'fieldname: ' string.

BTW, the previous regex already only matched the case where a user entered exactly ten dashes, because one of the lines would then end in ten dashes, which constituted a match. The new regex only matches lines that have only ten dashes on them. No user can cause that to happen.
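
The claim above can be checked with a short sketch, again using Python's re module as a stand-in for Jakarta Oro. Two things are assumptions here: the 'middlename: ' prefix is a hypothetical field name, and user-entered values are assumed not to contain line breaks (a value with an embedded newline could still produce a line of ten dashes).

```python
import re

FLAGS = re.DOTALL | re.MULTILINE
OLD = re.compile(r"^(.*?)[=-]{10}$", FLAGS)
NEW = re.compile(r"(.*?^)[=-]{10}$", FLAGS)


def line_for(value):
    """Build the stored line: a fixed prefix (hypothetical) plus the user's input."""
    return "middlename: " + value


# Values a user might enter, all ending in ten dashes or equals signs.
values = ["----------", " ----------", "==========", "-" * 11, "x=-=-=-=-=-"]

old_hits = [v for v in values if OLD.search(line_for(v))]
new_hits = [v for v in values if NEW.search(line_for(v))]

print(len(old_hits), "of", len(values), "values trip the old pattern")
print(len(new_hits), "of", len(values), "values trip the new pattern")
```

Because every stored line starts with the prefix, none of these values can make the whole line consist of ten separator characters, which is what the new pattern requires.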

By Tweakers user vanaalten, Monday 20 October 2008 21:45

Why is the beginning-of-line '^' character within the round braces? Not that it really matters, but I'm curious if it is needed or if 'it just happens to work so why not'?

In other words, does it matter if you replace
(.*?^)[=-]{10}$
... by
(.*?)^[=-]{10}$

?

By Tweakers user Confusion, Monday 20 October 2008 22:46

The second option was actually my first solution, but then my colleague remarked that in the original regex, an end-of-line character before the [=-]{10} would be included in the captured group. In the second form, that wouldn't be the case. There weren't any situations in which that would matter, as far as we could see, but it's usually best to change as little as possible in an otherwise working system :P.

After I posted this, we actually modified it to
^(.*?^)[=-]{10}$

Again, we didn't think it would work out differently; we aren't even sure there are cases in which they are different.
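
Whether the variants differ can be tried out. The sketch below compares the three patterns from this thread using Python's re module as a stand-in for Jakarta Oro. How the original code iterated over the matches is not shown in the post, so the use of finditer to scan for successive matches is an assumption, as is the sample text.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE

INSIDE = re.compile(r"(.*?^)[=-]{10}$", FLAGS)     # caret inside the group
OUTSIDE = re.compile(r"(.*?)^[=-]{10}$", FLAGS)    # caret after the group
ANCHORED = re.compile(r"^(.*?^)[=-]{10}$", FLAGS)  # later revision

TEXT = (
    "fieldname: value\n"
    "----------\n"
    "repeatedfieldname: newvalue\n"
    "=========="
)


def records(pattern, text):
    """Collect the captured group of every successive match."""
    return [m.group(1) for m in pattern.finditer(text)]


inside = records(INSIDE, TEXT)
outside = records(OUTSIDE, TEXT)
anchored = records(ANCHORED, TEXT)

for name, result in (("inside", inside), ("outside", outside), ("anchored", anchored)):
    print(name, result)
```

In this stand-in, at least, moving ^ in or out of the group does not change the captured text: ^ consumes no characters, so the newline before the separator ends up in the group either way. The variant with a leading ^ does differ when scanning for successive matches: without it, the second match starts directly after the previous separator, so its group begins with that line's newline; with it, the match can only start at the beginning of the next line.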
