First Look

Genetic programming meets regular expressions

Creating regular expressions for complex matching tasks can be incredibly difficult; genetic programming makes it easy

evolving code
Credit: Mark Gibbs

 If you’ve ever done any serious programming you’ll have run into something called regular expressions:

 ... (abbreviated regex or regexp and sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. … Regular expressions are so useful in computing that the various systems to specify regular expressions have evolved to provide both a basic and extended standard for the grammar and syntax; modern regular expressions heavily augment the standard.

Here’s an example in Perl:

$string1 = "Hello World\n";
if ($string1 =~ m/...../) {
     print "$string1 has length >= 5\n";

}

In the above example, $string1 is the string we’re searching and “m/…../“ is the regex we’re searching for. The regex matches any character except a newline. Each “.” stands for any character and as there are five of them, the characters in the input string are matched in sequence. Now, as there are more than five characters in the input sting than can be matched by the regex (which only tries to match five) the result is true so the print statement is executed.

In that example you get a sense of how powerful regexs are but there’s a downside: The specification of a regex for anything beyond simple searches gets extremely complicated. For example, the regex: 

\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b

… will match any IP address just fine, but will also match 999.999.999.999 as if it were a valid IP address. To restrict all 4 numbers in the IP address to 0..255, you can use the following regex. …

\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b 

But even that regex has problems (it rejects IP addresses with following port numbers (such as ":21") because it expects the IP address to be surrounded by white space.  If you wanted to allow for that as well as following punctuation such as "?" and "." but not "-" or "&"then you're going to have to work a little harder. Oh, and what about preceding punctuation such as ":"?

perl problems xkcd

A better way would be to gather examples and then have a process that searches for solution ... which leads us to genetic programming.

 Genetic programming is a fascinating technique that the Web site genetic-programming.org explains creates:

... a working computer program from a high-level problem statement of a problem. Genetic programming starts from a high-level statement of “what needs to be done” and automatically creates a computer program to solve the problem.

There are now 36 instances where genetic programming has automatically produced a result that is competitive with human performance, including  15 instances where genetic programming has created an entity that either infringes or duplicates the functionality of a previously patented 20th-century invention, 6 instances where genetic programming has done the same with respect to a 21st-centry invention, and 2 instances where genetic programming has created a patentable new invention.

Not surprisingly, genetic programming has been used to "evolve" complex regular expressions from data and the Web site Regex Generator++ demonstrates the technique. This service, created by Prof. Alberto Bartoli, Giorgio Davanzo, Andrea De Lorenzo, Prof. Eric Medvet, and Enrico Sorio, researchers at the University of Trieste, “evolves” regexes that have the best fitness in terms of effectively matching strings based on examples. They explain the methodology used in their paper Automatic Synthesis of Regular Expressions from Examples

To demonstrate their technique the team has created a service you can experiment with: To begin the process you provide training examples that contains the target strings which you identify by highlighting (make sure you use plain text otherwise it will barf). 

screen shot 2015 08 02 at 2.29.45 pm

You need at least two training examples with at least 25 matches identified in each (my test training sets for IP addresses are below). You then click “Evolve!” and about 15 to 20 minutes later, a JavaScript-compatible regex is displayed (regexes come in different formats for different languages and platforms).

screen shot 2015 08 02 at 2.27.00 pm

Here’s the regex I got from Regex Generator ++:

[11]\d*+\.\d\d++\.\d++\.[12]\d*+|\d++\.[11]\.\d[\d\.]++|(?:\d++\.\d++\.\d++\.\d++(?= yyDDyu iiuio uioui u8 987 86 578yyyyy678 \d\.\d\.\d\.\d:34 kl;kl; 97 [^ :][^<]))++

Those of you with good regex-fu will immediately notice that it isn’t optimized, it’s really hard to understand, and, when tested at Regular Expressions 101, it doesn’t find all the IP addresses in the first training example.  

screen shot 2015 08 02 at 2.56.35 pm

The evolved regex found only 16 of the 25 matches so more evolving is needed or, quite possibly, better examples (I kludged up my examples and for serious applications you’d be advised to consider your examples very carefully as flawed data will cause the resulting regexes to be less fit in real world contexts). I tried the example data with the first IP matching regex above; it scored 46 matches but 16 weren’t valid (such as 8.8.888.9 and 899.0.0.1). Using the second regex, 28 matches were made so it still didn't capture all of the matches,

What this technique demonstrates (although rather better through the examples on the Regex Generator++ site than my example) is that complex algorithms that humans would find very hard to create can be quickly and easily generated and then evolved to perform better.

If you want to dig deeper, check out the team’s publications:

  1. Bartoli, Davanzo, De Lorenzo, Mauri, Medvet, Sorio, Automatic Generation of Regular Expressions from Examples with Genetic ProgrammingACM Genetic and Evolutionary Computation Conference (GECCO), 2012, Philadelphia (US)
  2. De Lorenzo, Medvet, Bartoli, Automatic String Replace by ExamplesACM Genetic and Evolutionary Computation Conference (GECCO), 2013, Amsterdam (Netherlands)—the string replace functionality described in this paper is based on an extension of the work showcased on this web app; it is currently not exposed on the web.
  3. Bartoli, Davanzo, De Lorenzo, Medvet, Sorio, Automatic Synthesis of Regular Expressions from ExamplesIEEE Computer, 2014

Note: The Regex Generator ++ service can be very slow to respond at times; it’s running on hardware that appears to often be seriously overloaded. Also note that all input MUST be plain text; the service doesn’t check for illegal content and will throw unexplained errors. 

Training Example 1:

germany-italy 1-2 172.30.40.254 is a server ip It’s nine o'clock on a Saturday  The SSN is 632-78-9871 Call 555-1212 now! 999.888.999 try 1.1.1.999 or 678.1.0.1 ping  12.23.0.254  this is a valid ip 127.0.0.1 \12.3 is just a number How about 1.1.0.1? ping from 192.168.0.1 today is 7/11/2012 msg to 66.231.55.67 sent telnet 17.23.133.22:8080 this is old plain text 127.0.0.1 \12.3 is just a number How about 1.1.1.1? ping from 192.168.0.1 today is 7/11/2012 msg to \66.231.55.67 sent telnet 17.23.133.22:8080 this is old plain text germany-italy 1-2 172.30.40.254 is a server ip It’s nine o'clock on a Saturday  The SSN is 632-78-9871 Call 555-1212 now! 999.888.999 try 1.1.1.999 or 678.1.0.1 ping  12.23.0.254 and 78.78.78.256 time is 10:34:36 \ 100.0.00.101 or otherwise 1000.0.00.00 sometimes  55.0.0.55 combined with the likes of;99.99.99.999 plus 09.09.09.009 and  \1.00.1.0 Kihyi iouio uio Miopiop iop 209.209.209.209.3 10.10.10.11 ioKKiopiopio iopEEipiop opiopiip 10.1010.10- 10.2.3.4 .99 ioi ipiop ip 4324566.98098 90890 8888 uo iojuio iuo uu iouo 89.98.89.98:100 Eiouoiuoip hiy uhi uy 7897 8yi hyiuui  8.88.88.9:8.88.888.9 ouo uu uu ohiuoy8uy899 oioioioi uiouiouoiu 3.3.003.3 hDjhuk-4.5.6.7-4.8 0.0.0.255 0.0.0.256 090-990-90-9 90.90.90.90.90 ok;lk;k 90.11.11.00 khkllj 899.0.0.01 11.34.45.67 kjllkjkjl 11.34.45.67kjkl 209.209.209.209 yyDDyu iiuio uioui u8 987 86 578678 1.0.0.0:34 kl;kl; 97 77 u89 unnu 98y7 bn uo 66.66.66.66 .3.4.99 ioi ipiop ip 4324566.98098 90 7.8.9.0 -9.0.0.0 iouuiouououio mkjnl 1.23.45.67 mkljnl lkjl IIIUILKJH +89.98.89.98:100 8.1.0.0 JHKLJOOooo uiop 97 67.8768-.897897

Training Example 2:

1.2.3.4 thidqfqee s islid ip 17.22.0.1 \12.3 is just a number Hoeeeew aborrrut 12.1.1.1? ping from 192.168.02.1 today is 7/11/2012 msg to 6.231.55.67 sent telnet 17.23.133.22:8080 this is old plain text this is a valid ip 127.01.0.1 \12.3 is just a number How about 1.11.11.1? ping from 192.168.20.1 dasds is 7/11/2012 msg to 66.231.5.67 sent telnet 17.23.133.22:8080 this is old plain text 127.2.0.1 \12.223 is just a number How about 1.1.1.1? ping from 192.168.0.1 today is 7/11/2012 msg to \66.231.55.67 sent telnet 17.213.133.212:8080 this is old plain text germany-italy 1-2 172.30.40.254 is a serwwwver ip It’s nine o'clock on a Saturday  The yysard is 63-78-9871 Call 555-1212 now! 999.888.999 try 1.1.1.99 or 78.1.0.1 ping  12.23.0.254 and 78.78.78.26 time is 10:34:36 \100.0.00.101 or otherwise 1000.0.00.00 sometimes  55.ayrae 0.0.55.1 combined with the likes of;99.99.99.99 plus 09.09.09.09 and  \1.00.1.0 Kihyi iouio uio Miopiop iop 209.209.209.209.3 10.10.10.11 ioKKiopiopio iopEEipiop opiopiip 10.1010.10-10 2.3.4.99 ioi ipiop ip 4324566.98098 90890 8888 uo iojuio iuo uu iouo 8.98.9.98:100 Eiouoiuoip hiy uhi uy 7897 8yi hyiuui  8.88.88.9:8.88.888.9 ouo uu uu ohiuoy8uy899 oioioioi uiouiouoiu 3.3.003.3 hDjhuk-4.5.6.7-4.8  0.0.0.25 0.0.0.256 090-990-90-9 90.90.90.90.90 ok;lk;k 90.11.11.00 khkllj 899.0.0.01 11.34.4.67 kjllkjkjl 11.34.45.67kjkl 209.209.209.209 yyDDyu iiuio uioui u8 987 86 578yyyyy678 1.0.0.0:34 kl;kl; 97 77 u89 unnu 98y7 bn uo 66.6.66.6 .3.4.99 ioi ipiop ip 432456trr yryeyt6.98098 90 7.8.9.0 -9.0.0.0 iouuiouououio mkjnl 1.23.45.67 mkljnl lkjl IIIUILKJH +89.98.89.98:100 8.1.20.0 JHKLJOOooo uiop 97 67.8768-.897897

Evolved JavaScript Regex:

[11]\d*+\.\d\d++\.\d++\.[12]\d*+|\d++\.[11]\.\d[\d\.]++|(?:\d++\.\d++\.\d++\.\d++(?= yyDDyu iiuio uioui u8 987 86 578yyyyy678 \d\.\d\.\d\.\d:34 kl;kl; 97 [^ :][^<]))++

Join the Network World communities on Facebook and LinkedIn to comment on topics that are top of mind.
Related:
Must read: Hidden Cause of Slow Internet and how to fix it
Notice to our Readers
We're now using social media to take your comments and feedback. Learn more about this here.