Friday, September 29, 2006

Regular Expression Tools

Kelly has been developing some regular expressions for modifying VB6 code in preparation for migration to VB.net, because he needs some things to end up differently in VB.net than the built-in upgrade wizard would leave them. That put regex on my mind, so I decided to listen to this podcast about them.

If you've never used them before, regular expressions are basically just strings with special characters and placeholders for manipulating (not just finding) text. This is one I've used a few times to search through code. I need regex because I want to find all occurrences where any element of SomeArray is assigned, but the identifier within the brackets varies.

SomeArray\[\w+\]\s+:=
  • The text between the array index brackets can be any identifier. '\w' matches any character that can be part of a language identifier (a letter, digit, or underscore).
  • The brackets themselves are escaped because they normally delimit character sets.
  • The '\s' matches a whitespace character. There might be more than one space between the right bracket and the assignment.
  • The '+' indicates one or more of the preceding token.
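
To see what the expression does and does not catch, here is a minimal sketch in Python. Python's `re` module is swapped in here for convenience (I use a Java utility in practice), and the sample lines are invented for illustration.

```python
import re

# The expression from the post, written as a raw string so the backslashes
# reach the regex engine unchanged.
ASSIGNMENT = re.compile(r"SomeArray\[\w+\]\s+:=")

# Hypothetical lines of code, invented for this illustration.
sample_lines = [
    "SomeArray[i] := 0;",
    "SomeArray[RowCount]   := Total;",
    "x := SomeArray[i];",
    "SomeArray[i]:= 1;",
    "OtherArray[i] := 2;",
]

# Keep only the lines where an element of SomeArray is assigned.
matching_lines = [line for line in sample_lines if ASSIGNMENT.search(line)]
for line in matching_lines:
    print(line)
```

Only the first two lines are printed. Note that `SomeArray[i]:= 1;` is skipped because '\s+' demands at least one whitespace character before the assignment; changing it to '\s*' would catch that line as well.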

This is a simple expression that makes for a much better search. I didn't have it until now, but Regulazy is a graphical tool that is perfect for creating such expressions quickly. Regulazy creates .Net regular expressions, and something new I learned on the podcast is that there is no "standard" implementation for regex. The tool I use to search through files with the above expression is a simple utility done in Java, and this expression works for Java too. The basic tokens are the same (kind of a standard, I guess), but each implementation varies; for example, .Net allows you to name groups and refer to them in other areas of your expression. Perl allows you to use variables in the expression.
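
Named groups are a good example of how the dialects differ. The sketch below uses Python rather than .Net, so the spelling is an assumption about that one dialect only: Python writes a named group as `(?P<name>...)` and refers back to it with `(?P=name)`, where .Net writes `(?<name>...)`. The helper name `self_assigned_index` is made up for the example.

```python
import re

# Capture the index identifier in a group named "index", then require the
# same text again on the right-hand side of the assignment.
# The trailing \b stops "i" from matching the start of a longer name.
INDEXED_ASSIGNMENT = re.compile(
    r"SomeArray\[(?P<index>\w+)\]\s+:=\s+(?P=index)\b"
)


def self_assigned_index(line):
    """Return the index name if the element is assigned from its own index."""
    match = INDEXED_ASSIGNMENT.search(line)
    return match.group("index") if match else None


print(self_assigned_index("SomeArray[i] := i;"))
print(self_assigned_index("SomeArray[i] := index;"))
```

The first call finds the index name and the second finds nothing, because the right-hand side is a different identifier that merely starts with the same letter.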

The podcast gave great resources for creating the expressions, but not for using them. I haven't come across many good utilities for searching files with regular expressions (although it wouldn't be much work to write one with the .Net System.Text.RegularExpressions namespace). Neither Notepad++ nor the grep search tool that comes with the GExperts suite that we use would run my expression, even though it uses only simple tokens that should be the same across all implementations.
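
To back up the claim that such a utility isn't much work, here is a rough sketch of one. It is written in Python instead of .Net, and the function name `search_files` and the default `*.pas` file filter are my own assumptions, not part of any existing tool.

```python
import re
from pathlib import Path


def search_files(root, pattern, glob="*.pas"):
    """Yield (path, line_number, line) for each matching line under root."""
    regex = re.compile(pattern)
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        # errors="replace" keeps the search going on files with odd encodings.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path, line_number, line.rstrip("\n")
```

Calling `search_files(".", r"SomeArray\[\w+\]\s+:=")` walks the current directory and yields one tuple per hit, which is enough to print a grep-style listing.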

Here's the thing about regex: they can be very powerful. A single expression can be used as a substitute for a class with tons of conditional code. But then, expressions like this are not very maintainable since they end up looking like a cartoon character's long outburst of profanity - that's the trade-off. One thing that helps a bit is that you can insert comments in .Net expressions like this: (?# Look Mom! There's comments in my expression! ).
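
Python happens to accept the same `(?# ... )` comment syntax, so here is a small sketch of it there (not .Net). It also shows Python's `re.VERBOSE` flag, which lets you spread the expression over several lines with `#` comments; both patterns below are meant to be equivalent to my array search expression.

```python
import re

# Inline comments: everything inside (?# ... ) is ignored by the engine.
INLINE = re.compile(
    r"SomeArray(?# the array name)\[\w+\](?# any identifier as index)\s+:="
)

# Verbose mode: whitespace in the pattern is ignored and # starts a comment.
VERBOSE = re.compile(
    r"""
    SomeArray   # the array being assigned to
    \[ \w+ \]   # an identifier inside literal brackets
    \s+         # one or more whitespace characters
    :=          # the assignment operator
    """,
    re.VERBOSE,
)

line = "SomeArray[i] := 0;"
print(bool(INLINE.search(line)), bool(VERBOSE.search(line)))
```

The verbose form is longer, but it is the version I would rather meet again in six months.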

Despite the complexity, some validation requirements are a no-brainer match (no pun intended) for regex - like validating email addresses. I wrote an email address validation routine/object once - not easy after reading through the actual RFC for email addresses. The .Net expressions for this and other very common validations are available on MSDN. There are many more available on regexlib.com (along with an online expression tester). I'm sure the expression that Kelly developed, no simple expression that, would be a good addition to regexlib.com.
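
To give a feel for what such a validation looks like, here is a deliberately simplified sketch in Python. This is not the MSDN expression and it is nowhere near RFC-complete; the pattern and the function name `looks_like_email` are my own assumptions, good only for catching obvious typos.

```python
import re

# Simplified on purpose: one or more allowed characters, an @ sign, then a
# domain made of at least two dot-separated labels. Not RFC-compliant.
SIMPLE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")


def looks_like_email(text):
    """Return True if the whole string has a plausible address shape."""
    return SIMPLE_EMAIL.fullmatch(text) is not None


print(looks_like_email("kelly@example.com"))
print(looks_like_email("kelly@example"))
```

Using `fullmatch` matters here: it requires the entire string to fit the pattern, so trailing junk after a valid-looking address is rejected.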

Even the regex guru guest of the podcast recommended that they be used sparingly. If you do want to create one, use his Regulazy tool. That will get simple ones created for you very easily. Then, if you need to create a sophisticated expression (replacements, named groups, etc.), use his Regulator tool. This app includes things like intellisense for expression creation. It will also interface with regexlib.com to help you find a pre-existing expression that may suit your needs. Both are free!

Another very useful resource that isn't regex, but is related, is logparser.com. This site maintains utilities that allow SQL-like querying of commonly used log files (like IIS or SQLServer).

One more note for .Net regex: if you are using the Regex object to match repeatedly over a lot of text, you can pass an option (RegexOptions.Compiled) to the Regex constructor that will cause the runtime to create an assembly (a class hard coded to meet your expression's specs) and run that assembly for speed. Cool, huh?
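
Python has no direct counterpart to generating an assembly, so the sketch below shows a different and more modest technique: compiling the pattern once with `re.compile` and reusing the resulting object for every line, instead of handing the pattern string to the engine each time. The function name `count_assignments` is made up for the example.

```python
import re

# Compiled once at import time and reused for every line searched.
ASSIGNMENT = re.compile(r"SomeArray\[\w+\]\s+:=")


def count_assignments(lines):
    """Count the lines that assign to an element of SomeArray."""
    return sum(1 for line in lines if ASSIGNMENT.search(line))


# Hypothetical input: the same two lines repeated many times.
many_lines = ["SomeArray[i] := 0;", "x := 1;"] * 1000
print(count_assignments(many_lines))
```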
