Date: 2017-01-03

We’ve all been there: You’ve got tons of unstructured or semi-structured data to sift through and, like any savvy nerd, you know that doing so handraulically is against all that is good and holy [cue “Mission Impossible” music]. Your problem, which you have to accept, is how to efficiently and effectively automate the extraction process. As an old hand on the digital ship you’ll probably turn to a tool you know well such as grep, that good ol’ workhorse that implements regular expressions. Say you want to find all of the IPv4 addresses in a text file. You might resort to using:

grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" datafile.txt

This regex isn’t too hard to understand but it’s not perfect as it will print both valid and invalid IPv4 addresses because it rather simplistically recognizes strings from 0.0.0.0 to 999.999.999.999. So, let’s tighten this regex up so that only valid IPv4 addresses are printed:
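To see the looseness of that first pattern concretely, here is a minimal sketch that runs the same expression through Python's re module rather than grep. The sample strings are hypothetical and the helper name naive_matches is mine, not part of any tool mentioned here.

```python
import re

# The same pattern as the first grep command, written as a Python raw string.
NAIVE_IPV4 = re.compile(r"\b([0-9]{1,3}\.){3}[0-9]{1,3}\b")

# Hypothetical sample strings, not taken from any real data file.
SAMPLES = ["192.168.0.180", "999.999.999.999", "256.1.1.1"]


def naive_matches(text):
    """Return every substring that the naive pattern accepts."""
    return [m.group(0) for m in NAIVE_IPV4.finditer(text)]


for sample in SAMPLES:
    # All three samples are accepted in full, although only the first is a valid address.
    print(sample, "->", naive_matches(sample))
```

The pattern only counts digits, so anything shaped like four dotted groups of one to three digits gets through.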

grep -E -o "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" datafile.txt
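One caveat is worth testing before trusting the tighter expression: it has no boundaries around it, so a search can settle for a valid-looking fragment of an out-of-range address. The sketch below checks this with Python's re module (grep should pick the same leftmost matches, but that is my assumption, not something I ran). The BOUNDED variant with digit lookarounds is my own suggested refinement, not part of the original command.

```python
import re

# One octet, 0 to 255, exactly as in the tightened grep expression.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Four octets with no boundaries, mirroring the grep command.
STRICT = re.compile(OCTET + r"(?:\." + OCTET + r"){3}")

# Assumed refinement: refuse to start or stop in the middle of a run of digits.
BOUNDED = re.compile(r"(?<![0-9])" + OCTET + r"(?:\." + OCTET + r"){3}(?![0-9])")

# Hypothetical test strings: one valid address and two out-of-range ones.
for text in ["192.168.0.180", "1.2.3.999", "300.1.2.3"]:
    print(text, "| strict:", STRICT.findall(text), "| bounded:", BOUNDED.findall(text))
```

In this sketch the unbounded pattern reports 1.2.3.99 for the input 1.2.3.999 and 00.1.2.3 for 300.1.2.3, while the bounded one reports nothing for either. Whether a fragment counts as a false positive depends on your data, but it is the sort of surprise that makes regexes hard to trust.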

Voila! Mission accomplished but that is one ugly regex and even more horrifying is the regex for IPv6 addresses (props to mij on RegExLib.com):

grep -oE "(::|(([a-fA-F0-9]{1,4}):){7}(([a-fA-F0-9]{1,4}))|(:(:([a-fA-F0-9]{1,4})){1,6})|((([a-fA-F0-9]{1,4}):){1,6}:)|((([a-fA-F0-9]{1,4}):)(:([a-fA-F0-9]{1,4})){1,6})|((([a-fA-F0-9]{1,4}):){2}(:([a-fA-F0-9]{1,4})){1,5})|((([a-fA-F0-9]{1,4}):){3}(:([a-fA-F0-9]{1,4})){1,4})|((([a-fA-F0-9]{1,4}):){4}(:([a-fA-F0-9]{1,4})){1,3})|((([a-fA-F0-9]{1,4}):){5}(:([a-fA-F0-9]{1,4})){1,2}))" datafile.txt

As with all useful regexes, modifying a complex one when the input format changes is a recipe for gloom, doom, and sleepless nights, because reconstructing the logic that led to the regex in question is usually non-trivial (by which I mean it is a job no one in their right mind would want to tackle).

But wait! That’s not all that’s problematic! Regexes can easily become computationally hideous because it’s easy to create one that descends into catastrophic backtracking, whereby it wantonly consumes processor cycles (see the excellent article Runaway Regular Expressions: Catastrophic Backtracking). In other words, predicting and optimizing regex performance is difficult and really, really aggravating.
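If you have never watched a regex go off the rails, here is a small sketch using Python's backtracking re engine. The pattern (a+)+$ is a textbook pathological case that I picked for illustration; it does not come from the article cited above, and the helper name seconds_to_fail is hypothetical. Timings will vary from machine to machine.

```python
import re
import time

# Nested quantifiers: the engine can split a run of "a"s between the inner
# and outer "+" in exponentially many ways before it gives up.
PATHOLOGICAL = re.compile(r"(a+)+$")


def seconds_to_fail(n):
    """Time a failing match against n 'a' characters followed by one 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = PATHOLOGICAL.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing "b" means the pattern can never match
    return elapsed


for n in (8, 12, 16, 20):
    print(f"n={n:2d}  {seconds_to_fail(n):.4f} s")
```

Each extra character roughly doubles the work, so keep n small unless you want to hear your fans spin up.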

So, what’s a nerd to do? Suffer the slings and arrows of outrageous regexes or take up other tools against a sea of outrageous patterns and by encoding end them: to code, to test no more! Yes! But I digress …

There is a better way, my friends: It’s called the Rosie Pattern Language and it’s elegant, powerful and might just allow us to dump regexes forever.

Rosie was developed by Jamie A. Jennings, who describes herself as “Former academic now working in industry (they had cookies).” Ms. Jennings works as a Senior Technical Staff Member in IBM, and is part of an Advanced Technology team in IBM's Cloud division. Her bio adds “In her spare time she plays ice hockey and writes compilers.”

Be that as it may, the Rosie Pattern Language and Rosie Pattern Engine were released in early 2016 and in its Git repo, the system is explained thusly:

Rosie is a supercharged alternative to Regular Expressions (regex), matching patterns against any input text. Rosie ships with hundreds of sample patterns for timestamps, network addresses, email addresses, CSV files, and many more. Unlike most regex tools, Rosie can generate structured (JSON) output. And, Rosie has an interactive pattern development mode to help write and debug patterns. The Rosie Pattern Engine takes less than 400KB (yes, kilobytes) of disk space, and around 20MB of memory. Typical log files are parsed at around 40,000 lines/second on my 4-year old MacBook Pro, where other (popular) solutions do not achieve 10,000 lines/second. Rosie Pattern Language is ideal for big data analytics, because Rosie is fast, has predictable performance (unlike most regex engines), and generates json output for downstream analysis.

For more insight into the “why” of Rosie, check out Ms. Jennings’ article Why Rosie Pattern Language.

Installing Rosie is fairly easy (on macOS via Homebrew it’s a piece of cake) and there are also tips for installing on RHEL 7, Ubuntu 16, Windows 10 Anniversary Edition, and Docker. Once installed, you can check on Rosie’s configuration like this:

RedQueen:rosie mgibbs$ rosie -info
Local installation information:
    ROSIE_HOME = /usr/local/Cellar/rosie/current/share/rosie
    ROSIE_VERSION = 0.99i
    HOSTNAME = RedQueen.local
    HOSTTYPE = x86_64
    OSTYPE = darwin16
Current invocation:
    current working directory = /Users/mgibbs
    invocation command = bash /usr/local/bin/rosie -info
    script value of Rosie home = /usr/local/Cellar/rosie/current/share/rosie
    environment variable $ROSIE_HOME is not set
RedQueen:rosie mgibbs$

Rosie has the usual help output:

RedQueen:rosie mgibbs$ rosie -help
This is Rosie v0.99i
Help:
Usage: bash /usr/local/bin/rosie -help *
Valid are: -help -patterns -verbose -all -repl -grep -eval -wholefile -info -manifest -f -e -encode
-help prints this message
-patterns print list of available patterns
-verbose output warnings and other informational messages
-all write matches to stdout and non-matching lines to stderr
-repl start Rosie in the interactive mode (read-eval-print loop)
-grep emulate grep (weakly), but with RPL, by searching for all occurrences of in the input
-eval output a detailed "trace" evaluation of how the pattern processed the input; this feature generates LOTS of output, so best to use it on ONE LINE OF INPUT
-wholefile read the whole input file into memory as a single string, instead of line by line
-info prints information about the local rosie installation
-manifest load the manifest file instead of MANIFEST from $sys (the Rosie install directory); use a single dash '-' to load no manifest file
-f load the RPL file , after manifest (if any) is loaded
-e compile the RPL statements in , after manifest and RPL file (if any) are loaded
-encode encode output in format: color (default), nocolor, fulltext, or json
RPL expression, which may be the name of a defined pattern, against which each line will be matched
+ one or more file names to process, the last of which may be a dash "-" to read from standard input
Notes:
(1) lines from the input file for which the pattern does NOT match are written to stderr so they can be redirected, e.g. to /dev/null
(2) the -eval option currently does not work with the -grep option
RedQueen:rosie mgibbs$

To show how Rosie works, consider the output of the ifconfig command (this is from macOS Sierra):

RedQueen:rosie mgibbs$ ifconfig
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
    options=1203<RXCSUM,TXCSUM,TXSTATUS,SW_TIMESTAMP>
    inet 127.0.0.1 netmask 0xff000000
    inet6 ::1 prefixlen 128
    inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
    nd6 options=201<PERFORMNUD,DAD>
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
    options=10b<RXCSUM,TXCSUM,VLAN_HWTAGGING,AV>
    ether ac:87:a3:31:87:8b
    inet6 fe80::425:850:43a6:5863%en0 prefixlen 64 secured scopeid 0x4
    inet6 2605:e000:6a0b:2500:86d:8286:ca88:9f8 prefixlen 64 autoconf secured
    inet 192.168.0.180 netmask 0xffffff00 broadcast 192.168.0.255
    inet6 2605:e000:6a0b:2500:18f1:f5df:7ef9:2fa4 prefixlen 64 deprecated autoconf temporary
    inet6 2605:e000:6a0b:2500:38b4:ea13:fff3:9698 prefixlen 64 deprecated autoconf temporary
    inet6 2605:e000:6a0b:2500:b49e:ab54:bc4:5717 prefixlen 64 deprecated autoconf temporary
    inet6 2605:e000:6a0b:2500:9831:439f:f75:7a40 prefixlen 64 deprecated autoconf temporary
    inet6 2605:e000:6a0b:2500:18a4:3d38:7395:5c5a prefixlen 64 deprecated autoconf temporary
    inet6 2605:e000:6a0b:2500:3188:37ce:a627:b276 prefixlen 64 deprecated autoconf temporary
    inet6 2605:e000:6a0b:2500:dd77:14e:2011:634 prefixlen 64 autoconf temporary
    nd6 options=201<PERFORMNUD,DAD>
    media: autoselect (1000baseT <full-duplex,flow-control,energy-efficient-ethernet>)
    status: active
en1: flags=8823<UP,BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 1500
    ether b8:09:8a:cf:83:8f
    nd6 options=201<PERFORMNUD,DAD>
    media: autoselect ()
    status: inactive
en2: flags=963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX> mtu 1500
    options=60<TSO4,TSO6>
    ether 0a:00:00:78:3d:10
    media: autoselect
    status: active
en3: flags=963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX> mtu 1500
    options=60<TSO4,TSO6>
    ether 0a:00:00:78:3d:11
    media: autoselect
    status: inactive
bridge0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    options=63<RXCSUM,TXCSUM,TSO4,TSO6>
    ether 0a:00:00:78:3d:11
    inet6 fe80::1493:92de:58ee:96eb%bridge0 prefixlen 64 secured scopeid 0x8
    inet 169.254.239.110 netmask 0xffff0000 broadcast 169.254.255.255
    Configuration:
        id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0
        maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200
        root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0
        ipfilter disabled flags 0x2
    member: en3 flags=3<LEARNING,DISCOVER>
            ifmaxaddr 0 port 7 priority 0 path cost 0
    member: en2 flags=3<LEARNING,DISCOVER>
            ifmaxaddr 0 port 6 priority 0 path cost 0
    nd6 options=201<PERFORMNUD,DAD>
    media: autoselect
    status: active
p2p0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 2304
    ether 0a:09:8a:cf:83:8f
    media: autoselect
    status: inactive
awdl0: flags=8902<BROADCAST,PROMISC,SIMPLEX,MULTICAST> mtu 1484
    ether 1e:bb:87:15:37:c8
    nd6 options=201<PERFORMNUD,DAD>
    media: autoselect
    status: inactive
utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 2000
    inet6 fe80::42fb:52c5:cff7:807e%utun0 prefixlen 64 scopeid 0xb
    nd6 options=201<PERFORMNUD,DAD>
vboxnet0: flags=8842<BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    ether 0a:00:27:00:00:00
vboxnet1: flags=8842<BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    ether 0a:00:27:00:00:01
vboxnet2: flags=8842<BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    ether 0a:00:27:00:00:02
utun1: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1380
    inet6 fe80::37b0:cdae:5ad4:4f0e%utun1 prefixlen 64 scopeid 0xf
    nd6 options=201<PERFORMNUD,DAD>
utun2: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1380
    inet6 fe80::8e14:3930:f5f9:ae9c%utun2 prefixlen 64 scopeid 0x10
    nd6 options=201<PERFORMNUD,DAD>
RedQueen:rosie mgibbs$

Let’s save that into a file:

RedQueen:rosie mgibbs$ ifconfig > ifconfig.txt
RedQueen:rosie mgibbs$

We’ll analyze this data using the default collection of pre-defined patterns. These patterns are defined using the Rosie Pattern Language, which you can explore interactively using the read-eval-print loop (REPL). For now, let’s just have Rosie analyze our test data:

RedQueen:rosie mgibbs$ rosie basic.matchall ifconfig.txt

The command line, rosie basic.matchall ifconfig.txt, instructed Rosie to use the pattern basic.matchall, which is specified in the RPL file basic.rpl, like this:

basic.matchall = ( basic.datetime_patterns / basic.network_patterns / basic.element / basic.element_quoted / basic.element_bracketed / [:space:]+ / basic.punctuation / basic.unmatched )+
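To get a feel for how a definition shaped like that behaves, here is a rough Python sketch. It assumes that "/" means ordered choice (try each alternative in turn and take the first that matches) and that the trailing "+" repeats the whole group along the line. The sub-patterns below are hypothetical stand-ins of my own, not Rosie's actual definitions, and the output shape is made up for illustration rather than being Rosie's JSON format.

```python
import json
import re

# Hypothetical stand-ins for a few alternatives; these are NOT Rosie's definitions.
# Order matters: earlier entries win, which is what ordered choice means here.
ALTERNATIVES = [
    ("ipv4", re.compile(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}")),
    ("hex", re.compile(r"0x[0-9a-fA-F]+")),
    ("word", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("space", re.compile(r"\s+")),
    ("unmatched", re.compile(r"\S")),  # catch-all: consume one character
]


def match_all(line):
    """Walk the line left to right, taking the first alternative that matches."""
    pos = 0
    found = []
    while pos < len(line):
        for name, pattern in ALTERNATIVES:
            m = pattern.match(line, pos)
            if m:
                if name != "space":  # this sketch chooses to drop whitespace
                    found.append({"type": name, "text": m.group(0)})
                pos = m.end()
                break
        else:
            break  # nothing matched; cannot happen while the catch-all is present
    return found


print(json.dumps(match_all("inet 127.0.0.1 netmask 0xff000000"), indent=2))
```

Because the alternatives are tried in a fixed order and each position is consumed exactly once, the work done per line is easy to reason about, which is the flavor of predictability the Rosie description claims.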