Regular Expressions and replaceAll()

Regular expressions are not a new technique, yet they still solve many problems with a minimal amount of code. As my primary language for software development is Java, the modest replaceAll() method of the stock String class will be the playground for these examples. replaceAll() is common enough that you'll probably find it somewhere in any Java project of sufficient size, yet most uses simply find a character or sequence of characters and swap it for another. The method is built upon regular expressions, though, and has untapped power lying in its simple interface. Ponder the simplicity of the replaceAll() definition:

public String replaceAll(String regex, String replacement)

Some things are right not when there is nothing left to add, but rather when there is nothing left to take away.

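The functionality that's hidden in plain sight is that the second parameter, replacement, is not just a literal string. It has a small syntax of its own, in which $1, $2 and so on refer to the groups captured by the first parameter, regex. The reason this is amazing is that you can drop pieces of the matched text straight back into the result!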
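To make the idea concrete, here is a minimal sketch of a replacement string that refers back to captured groups. The class name, the helper methods and the sample strings are made up for illustration. The second helper shows the flip side: because $ has a special meaning in the replacement, a value that should be inserted literally can be passed through Matcher.quoteReplacement() first.

```java
import java.util.regex.Matcher;

final class BackReferenceDemo {
    // Hypothetical helper: turns "Last, First" into "First Last".
    static String swapNames(String input) {
        // Group 1 captures the last name, group 2 the first name;
        // $2 and $1 in the replacement refer back to those captures.
        return input.replaceAll("(\\w+), (\\w+)", "$2 $1");
    }

    // Hypothetical helper: inserts a value literally, even if it contains '$'.
    static String insertLiteral(String template, String value) {
        // Without quoteReplacement, "$5" would be read as a reference to group 5.
        return template.replaceAll("\\{amount\\}", Matcher.quoteReplacement(value));
    }

    public static void main(String[] args) {
        System.out.println(swapNames("Lovelace, Ada"));              // Ada Lovelace
        System.out.println(insertLiteral("Total: {amount}", "$5")); // Total: $5
    }
}
```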

But why would you want to extract a bit of text from the match and drop it back in as the replacement? Consider processing XML: maybe you need to strip out namespace information. Sure, you could build a tree of objects with JDOM, then iterate through it and change the element names, but why do that when two replaceAll() calls could modify the entire document?

Take this XML document snippet:

<foo:bar someAttr="value">
   Bunch of text
</foo:bar>
<foo:fubar anotherAttr="value">
   Another string
</foo:fubar>

Which you want to transform to:

<bar someAttr="value">
   Bunch of text
</bar>
<fubar anotherAttr="value">
   Another string
</fubar>

I'm not going to comment on whether removing namespace information is a bad idea, but hey, if you have to do it, you have to. No judgment here.

So how can two replaceAll() calls do the job? Like this:

String xml = "<foo:bar someAttr=\"value\">Bunch of text</foo:bar>" +
    "<foo:fubar anotherAttr=\"value\">Another string</foo:fubar>";
// First call strips the prefix from opening tags, second call from closing tags.
xml = xml.replaceAll("(<)(\\w+:)(.*?>)", "$1$3")
    .replaceAll("(</)(\\w+:)(.*?>)", "$1$3");

Which is a lot of \\, +, ? and *! So here's the breakdown:

First replaceAll, regex parameter:

(<)       Match a < character, and capture it as group 1.
(\\w+:)   Match one or more word characters [a-zA-Z_0-9] followed by a :, and capture it as group 2.
(.*?>)    Match anything up to the first > character (the shortest possible match), and capture it as group 3.
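The three groups can also be inspected directly, which is a handy way to check a pattern before trusting it in a replacement. The following is only a sketch: the class and method names are hypothetical, and it uses the standard Pattern and Matcher classes rather than replaceAll() so that each captured group can be printed on its own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupInspector {
    // Returns groups 1 to 3 for the first prefixed opening tag found,
    // or an empty array when the pattern does not match at all.
    static String[] groupsOfFirstTag(String xml) {
        Matcher m = Pattern.compile("(<)(\\w+:)(.*?>)").matcher(xml);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String xml = "<foo:bar someAttr=\"value\">Bunch of text</foo:bar>";
        String[] groups = groupsOfFirstTag(xml);
        for (int i = 0; i < groups.length; i++) {
            System.out.println("group " + (i + 1) + ": " + groups[i]);
        }
        // group 1: <
        // group 2: foo:
        // group 3: bar someAttr="value">
    }
}
```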

Then the replacement parameter:

$1   Replace with the first matched group
$3   Replace with the third matched group

And bingo, the start of each element is stripped of its namespace prefix, because group 2 is simply left out of the replacement.

The second replaceAll() follows the same pattern, matching the </ as the first group, so it takes care of the closing tags.
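For anyone who wants to watch the two passes happen one at a time, here is a small runnable sketch. The class and method names are invented for this example; the two patterns are the ones discussed above.

```java
final class NamespaceStripper {
    // First pass: drop the prefix from opening tags such as <foo:bar ...>.
    static String stripOpeningPrefixes(String xml) {
        return xml.replaceAll("(<)(\\w+:)(.*?>)", "$1$3");
    }

    // Second pass: drop the prefix from closing tags such as </foo:bar>.
    static String stripClosingPrefixes(String xml) {
        return xml.replaceAll("(</)(\\w+:)(.*?>)", "$1$3");
    }

    public static void main(String[] args) {
        String xml = "<foo:bar someAttr=\"value\">Bunch of text</foo:bar>";

        String afterFirst = stripOpeningPrefixes(xml);
        System.out.println(afterFirst);
        // <bar someAttr="value">Bunch of text</foo:bar>

        String afterSecond = stripClosingPrefixes(afterFirst);
        System.out.println(afterSecond);
        // <bar someAttr="value">Bunch of text</bar>
    }
}
```

Note that both patterns only fire when the prefix comes directly after < or </, so namespace declarations and prefixed attributes inside a tag are left alone by this sketch.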