Refining your Regex Ninja Skills

For a project I’m working on, I wanted to create a list of the currently featured apps in the Android Marketplace. I’m documenting them in Markdown, so I put some regular expressions (regex) to work and thought I’d share.

To retrieve the list, I just copied and pasted the HTML source from http://www.android.com/market. It looked something like this:

<div id="com.fourtechnologies.mynetdiary.ad" class="active"><span><img src="data/icons/com.fourtechnologies.mynetdiary.ad.png"></span><a>Calorie Counter PRO MyNetDiary</a></div><div id="basesign.alltie"><span><img src="data/icons/basesign.alltie.png"></span><a>Easy Tie</a></div><div id="com.oanda.fxtrade"><span><img src="data/icons/com.oanda.fxtrade.png"></span><a>OANDA fxTrade for Android</a></div>
etc...

What I wanted in Markdown was this:

##Calorie Counter PRO MyNetDiary
![Calorie Counter PRO MyNetDiary](http://www.android.com/market/data/icons/com.fourtechnologies.mynetdiary.ad.png)
com.fourtechnologies.mynetdiary.ad 

##Easy Tie
![Easy Tie](http://www.android.com/market/data/icons/basesign.alltie.png)
basesign.alltie 

##OANDA fxTrade for Android
![OANDA fxTrade for Android](http://www.android.com/market/data/icons/com.oanda.fxtrade.png)
com.oanda.fxtrade 

etc...

I could have spent fifteen minutes manually editing each line, but that’s no fun. Instead, I fired up BBEdit and used a grep search with a regular expression to do it for me. I simply searched for this:

<div id="([^"]+)"[^>]*><span><img src="([^"]+)"></span><a>([^<]+)</a></div>

and replaced it with this:

##\3\n![\3](http://www.android.com/market/\2)\n\1\n\n

What this does is pick out the bits of each match I want (the app name is \3, the image path is \2 and the app id is \1) and replace them with my Markdown format. Perfect, and only about thirty seconds’ worth of work!
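If you don't have BBEdit handy, the same search and replace can be sketched in a few lines of Python. This is only an illustration, not what I actually ran: the `html` sample is trimmed to two entries, and the names `PATTERN`, `TEMPLATE` and `to_markdown` are made up for the example. Python's `re.sub` happens to understand the same `\1`, `\2`, `\3` and `\n` escapes in the replacement text, so the pattern and replacement carry over unchanged.

```python
import re

# Sample input: the first two entries from the HTML shown earlier.
html = (
    '<div id="com.fourtechnologies.mynetdiary.ad" class="active"><span>'
    '<img src="data/icons/com.fourtechnologies.mynetdiary.ad.png"></span>'
    '<a>Calorie Counter PRO MyNetDiary</a></div>'
    '<div id="basesign.alltie"><span><img src="data/icons/basesign.alltie.png">'
    '</span><a>Easy Tie</a></div>'
)

# Same search pattern as above: \1 = app id, \2 = image path, \3 = app name.
PATTERN = re.compile(
    r'<div id="([^"]+)"[^>]*><span><img src="([^"]+)"></span><a>([^<]+)</a></div>'
)

# Same replacement as above; re.sub expands \n and the numbered groups.
TEMPLATE = r'##\3\n![\3](http://www.android.com/market/\2)\n\1\n\n'


def to_markdown(source):
    """Rewrite every matching <div> entry as a Markdown block."""
    return PATTERN.sub(TEMPLATE, source)


markdown = to_markdown(html)
print(markdown)
```

Compiling the pattern once and keeping the replacement in a raw string means neither has to be re-escaped for Python.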

Using regular expressions to search through HTML/XML is relatively easy. For simple cases, you can get the content between two quotes using "([^"]+)", which means “search for a quote, then capture anything that is not a quote until you find another quote”. You can do the same for the text inside a single HTML element using >([^<]+)<.
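To see those two small patterns on their own, here is a quick sketch in Python, run against one entry from the HTML above. The variable names `snippet`, `quoted` and `text` are just for illustration.

```python
import re

# One entry from the HTML shown earlier.
snippet = (
    '<div id="com.oanda.fxtrade"><span>'
    '<img src="data/icons/com.oanda.fxtrade.png"></span>'
    '<a>OANDA fxTrade for Android</a></div>'
)

# A quote, then one or more non-quote characters (captured), then a quote.
quoted = re.findall(r'"([^"]+)"', snippet)

# A ">", then one or more characters that are not "<" (captured), then a "<".
text = re.findall(r'>([^<]+)<', snippet)

print(quoted)  # ['com.oanda.fxtrade', 'data/icons/com.oanda.fxtrade.png']
print(text)    # ['OANDA fxTrade for Android']
```

Because `[^<]+` needs at least one character, adjacent tags such as `</span><a>` produce no match, so only the real text content comes back.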

Regular expressions can do a lot more and can get a lot more complicated. Consider this URL-matching regular expression by John Gruber.

To keep my skills fresh I try to use regular expressions whenever I can. If you don’t, I suggest checking out a few resources to see what you’re missing:

Regular expressions may look daunting, but once you understand them they can be a lifesaver, saving you hours of manual parsing.