The Rosie Pattern Language, a better way to mine your data

Hello RPL, goodbye regex! Rosie makes finding that data needle in the data haystack a lot easier.

Mark Gibbs

We’ve all been there: You’ve got tons of unstructured or semi-structured data to sift through and, like any savvy nerd, you know that doing so handraulically is against all that is good and holy [cue “Mission Impossible” music]. Your problem, which you have to accept, is how to efficiently and effectively automate the extraction process. As an old hand on the digital ship, you’ll probably turn to a tool you know well such as grep, that good ol’ workhorse that implements regular expressions. Say you want to find all of the IPv4 addresses in a text file. You might resort to using:

grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" datafile.txt

This regex isn’t too hard to understand, but it’s not perfect: it prints both valid and invalid IPv4 addresses because it rather simplistically recognizes any string from 0.0.0.0 to 999.999.999.999. So, let’s tighten this regex up so that each octet is limited to the range 0 to 255:

grep -E -o "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" datafile.txt
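To see the difference between the two patterns without hunting for a suitable log file, here is a minimal Python sketch that runs both through Python’s re module rather than grep. The sample string is made up, and re is a different engine from grep’s, so treat this as an illustration rather than a transcript of what grep prints.

```python
import re

# The two patterns from the grep commands above, unchanged.
LOOSE = re.compile(r"\b([0-9]{1,3}\.){3}[0-9]{1,3}\b")
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
TIGHT = re.compile(r"\.".join([OCTET] * 4))


def find_all(pattern, text):
    """Return every full match, much as grep -o prints only the matched part."""
    return [m.group(0) for m in pattern.finditer(text)]


# Made-up sample text, not taken from a real log.
sample = "good 192.168.0.180 bad 999.999.999.999 tricky 1.2.3.999"

print(find_all(LOOSE, sample))
# ['192.168.0.180', '999.999.999.999', '1.2.3.999']
print(find_all(TIGHT, sample))
# ['192.168.0.180', '1.2.3.99']
```

Note the last sample: the tightened pattern has no word boundaries, so while it rejects 999.999.999.999 outright, it still reports the valid-looking front end (1.2.3.99) of the invalid 1.2.3.999.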

Voilà! Mission accomplished, but that is one ugly regex, and even more horrifying is the regex for IPv6 addresses (props to mij on RegExLib.com):

grep -oE "(::|(([a-fA-F0-9]{1,4}):){7}(([a-fA-F0-9]{1,4}))|(:(:([a-fA-F0-9]{1,4})){1,6})|((([a-fA-F0-9]{1,4}):){1,6}:)|((([a-fA-F0-9]{1,4}):)(:([a-fA-F0-9]{1,4})){1,6})|((([a-fA-F0-9]{1,4}):){2}(:([a-fA-F0-9]{1,4})){1,5})|((([a-fA-F0-9]{1,4}):){3}(:([a-fA-F0-9]{1,4})){1,4})|((([a-fA-F0-9]{1,4}):){4}(:([a-fA-F0-9]{1,4})){1,3})|((([a-fA-F0-9]{1,4}):){5}(:([a-fA-F0-9]{1,4})){1,2}))"  datafile.txt

As with all useful regexes, modifying a complex regex when the input format changes is a recipe for gloom, doom, and sleepless nights, because reconstructing the logic that led to the regex in question is usually non-trivial (by which I mean it is a job no one in their right mind would want to tackle).
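Part of the pain is that a pattern like the IPv4 one repeats itself four times over. As a hedged sketch of one way to soften that in an ordinary regex library (Python’s re here, not grep), you can spell the octet out once, give it a name, and assemble the full pattern from it. The names OCTET and IPV4 and the lookaround guards are my own additions, not part of the grep commands above.

```python
import re

# The octet alternation from the tightened grep pattern, written once.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Four octets joined by dots. The lookbehind and lookahead (an assumption
# about what "valid" should mean here) refuse matches that start or end
# in the middle of a longer run of digits and dots.
IPV4 = re.compile(rf"(?<![0-9.]){OCTET}(?:\.{OCTET}){{3}}(?!\.?[0-9])")


def ipv4_addresses(text):
    """Return every dotted quad in text whose octets all fit the OCTET pattern."""
    return IPV4.findall(text)


print(ipv4_addresses("inet 192.168.0.180 broadcast 192.168.0.255"))
# ['192.168.0.180', '192.168.0.255']
print(ipv4_addresses("bogus 1.2.3.999 and 256.1.1.1"))
# []
```

If the definition of an octet ever changes, there is now exactly one line to edit, which is a small taste of the build-patterns-from-patterns idea.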

But wait! That’s not all that’s problematic! Regexes can easily become computationally hideous because it’s easy to create a regex that descends into catastrophic backtracking, whereby it wantonly consumes processor cycles (see the excellent article “Runaway Regular Expressions: Catastrophic Backtracking”). In other words, predicting and optimizing the performance of regexes is difficult and really, really aggravating.
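To make catastrophic backtracking concrete, here is a small Python experiment. The pattern (a+)+$ is a standard textbook troublemaker, not one taken from the article cited above, and the timings depend entirely on your machine, so I have not listed any.

```python
import re
import time

# Nested quantifiers: a run of "a" can be split between the inner and the
# outer "+" in exponentially many ways, and a backtracking engine tries
# them all before it gives up.
RISKY = re.compile(r"(a+)+$")
# Accepts exactly the same strings, without the nesting.
SAFE = re.compile(r"a+$")


def seconds_to_fail(pattern, n):
    """Time one failed match against n letters 'a' followed by a 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = pattern.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing "b" means neither pattern can match
    return elapsed


for n in (10, 14, 18, 20):
    risky = seconds_to_fail(RISKY, n)
    safe = seconds_to_fail(SAFE, n)
    print(f"n={n:2d}  risky={risky:.5f}s  safe={safe:.6f}s")
```

On a backtracking engine such as Python’s, the risky column should roughly double with every extra character while the safe column barely moves. Keep n small; somewhere past 25 or so you will be waiting a long time.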

So, what’s a nerd to do? Suffer the slings and arrows of outrageous regexes or take up other tools against a sea of outrageous patterns and by encoding end them: to code, to test no more! Yes! But I digress …

There is a better way, my friends: It’s called the Rosie Pattern Language and it’s elegant, powerful and might just allow us to dump regexes forever.

[Image credit: Jamie Jennings / IBM]

Rosie was developed by Jamie A. Jennings, who describes herself as “Former academic now working in industry (they had cookies).” Ms. Jennings works as a Senior Technical Staff Member at IBM and is part of an Advanced Technology team in IBM’s Cloud division. Her bio adds, “In her spare time she plays ice hockey and writes compilers.”

Be that as it may, the Rosie Pattern Language and Rosie Pattern Engine were released in early 2016, and in its Git repo the system is explained thusly:

Rosie is a supercharged alternative to Regular Expressions (regex), matching patterns against any input text. Rosie ships with hundreds of sample patterns for timestamps, network addresses, email addresses, CSV files, and many more.

Unlike most regex tools, Rosie can generate structured (JSON) output. And, Rosie has an interactive pattern development mode to help write and debug patterns.

The Rosie Pattern Engine takes less than 400KB (yes, kilobytes) of disk space, and around 20MB of memory. Typical log files are parsed at around 40,000 lines/second on my 4-year old MacBook Pro, where other (popular) solutions do not achieve 10,000 lines/second.

Rosie Pattern Language is ideal for big data analytics, because Rosie is fast, has predictable performance (unlike most regex engines), and generates json output for downstream analysis.

For more insight into the “why” of Rosie, check out Ms. Jennings’ article “Why Rosie Pattern Language.”

Installing Rosie is fairly easy (on macOS via Homebrew it’s a piece of cake) and there are also tips for installing on RHEL 7, Ubuntu 16, Windows 10 Anniversary Edition, and Docker. Once installed, you can check on Rosie’s configuration like this:

RedQueen:rosie mgibbs$ rosie -info
Local installation information:
  ROSIE_HOME = /usr/local/Cellar/rosie/current/share/rosie
  ROSIE_VERSION = 0.99i
  HOSTNAME = RedQueen.local
  HOSTTYPE = x86_64
  OSTYPE = darwin16
Current invocation: 
  current working directory = /Users/mgibbs
  invocation command = bash /usr/local/bin/rosie -info
  script value of Rosie home = /usr/local/Cellar/rosie/current/share/rosie
  environment variable $ROSIE_HOME is not set
RedQueen:rosie mgibbs$

Rosie has the usual help output:

RedQueen:rosie mgibbs$ rosie -help
This is Rosie v0.99i
Help:
Usage: bash /usr/local/bin/rosie -help   *
Valid  are: -help -patterns -verbose -all -repl -grep -eval -wholefile -info -manifest -f -e -encode

-help              prints this message
-patterns          print list of available patterns
-verbose           output warnings and other informational messages
-all               write matches to stdout and non-matching lines to stderr
-repl              start Rosie in the interactive mode (read-eval-print loop)
-grep              emulate grep (weakly), but with RPL, by searching for all
                   occurrences of  in the input
-eval              output a detailed "trace" evaluation of how the pattern
                   processed the input; this feature generates LOTS of output,
                   so best to use it on ONE LINE OF INPUT
-wholefile         read the whole input file into memory as a single string,
                   instead of line by line
-info              prints information about the local rosie installation
-manifest     load the manifest file  instead of MANIFEST from $sys
                   (the Rosie install directory); use a single dash '-' to
                   load no manifest file
-f            load the RPL file , after manifest (if any) is loaded
-e            compile the RPL statements in , after manifest and
                   RPL file (if any) are loaded
-encode       encode output in  format: color (default), nocolor,
                   fulltext, or json

            RPL expression, which may be the name of a defined pattern,
                     against which each line will be matched
+          one or more file names to process, the last of which may be
                     a dash "-" to read from standard input

Notes: 
(1) lines from the input file for which the pattern does NOT match are written
    to stderr so they can be redirected, e.g. to /dev/null
(2) the -eval option currently does not work with the -grep option

RedQueen:rosie mgibbs$

To show how Rosie works, consider the output of the ifconfig command (this is from macOS Sierra):

RedQueen:rosie mgibbs$ ifconfig
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
	options=1203<RXCSUM,TXCSUM,TXSTATUS,SW_TIMESTAMP>
	inet 127.0.0.1 netmask 0xff000000 
	inet6 ::1 prefixlen 128 
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1 
	nd6 options=201<PERFORMNUD,DAD>
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
	options=10b<RXCSUM,TXCSUM,VLAN_HWTAGGING,AV>
	ether ac:87:a3:31:87:8b 
	inet6 fe80::425:850:43a6:5863%en0 prefixlen 64 secured scopeid 0x4 
	inet6 2605:e000:6a0b:2500:86d:8286:ca88:9f8 prefixlen 64 autoconf secured 
	inet 192.168.0.180 netmask 0xffffff00 broadcast 192.168.0.255
	inet6 2605:e000:6a0b:2500:18f1:f5df:7ef9:2fa4 prefixlen 64 deprecated autoconf temporary 
	inet6 2605:e000:6a0b:2500:38b4:ea13:fff3:9698 prefixlen 64 deprecated autoconf temporary 
	inet6 2605:e000:6a0b:2500:b49e:ab54:bc4:5717 prefixlen 64 deprecated autoconf temporary 
	inet6 2605:e000:6a0b:2500:9831:439f:f75:7a40 prefixlen 64 deprecated autoconf temporary 
	inet6 2605:e000:6a0b:2500:18a4:3d38:7395:5c5a prefixlen 64 deprecated autoconf temporary 
	inet6 2605:e000:6a0b:2500:3188:37ce:a627:b276 prefixlen 64 deprecated autoconf temporary 
	inet6 2605:e000:6a0b:2500:dd77:14e:2011:634 prefixlen 64 autoconf temporary 
	nd6 options=201<PERFORMNUD,DAD>
	media: autoselect (1000baseT <full-duplex,flow-control,energy-efficient-ethernet>)
	status: active
en1: flags=8823<UP,BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 1500
	ether b8:09:8a:cf:83:8f 
	nd6 options=201<PERFORMNUD,DAD>
	media: autoselect ()
	status: inactive
en2: flags=963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX> mtu 1500
	options=60<TSO4,TSO6>
	ether 0a:00:00:78:3d:10 
	media: autoselect 
	status: active
en3: flags=963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX> mtu 1500
	options=60<TSO4,TSO6>
	ether 0a:00:00:78:3d:11 
	media: autoselect 
	status: inactive
bridge0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	options=63<RXCSUM,TXCSUM,TSO4,TSO6>
	ether 0a:00:00:78:3d:11 
	inet6 fe80::1493:92de:58ee:96eb%bridge0 prefixlen 64 secured scopeid 0x8 
	inet 169.254.239.110 netmask 0xffff0000 broadcast 169.254.255.255
	Configuration:
		id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0
		maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200
		root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0
		ipfilter disabled flags 0x2
	member: en3 flags=3<LEARNING,DISCOVER>
	        ifmaxaddr 0 port 7 priority 0 path cost 0
	member: en2 flags=3<LEARNING,DISCOVER>
	        ifmaxaddr 0 port 6 priority 0 path cost 0
	nd6 options=201<PERFORMNUD,DAD>
	media: autoselect
	status: active
p2p0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 2304
	ether 0a:09:8a:cf:83:8f 
	media: autoselect
	status: inactive
awdl0: flags=8902<BROADCAST,PROMISC,SIMPLEX,MULTICAST> mtu 1484
	ether 1e:bb:87:15:37:c8 
	nd6 options=201<PERFORMNUD,DAD>
	media: autoselect
	status: inactive
utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 2000
	inet6 fe80::42fb:52c5:cff7:807e%utun0 prefixlen 64 scopeid 0xb 
	nd6 options=201<PERFORMNUD,DAD>
vboxnet0: flags=8842<BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	ether 0a:00:27:00:00:00 
vboxnet1: flags=8842<BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	ether 0a:00:27:00:00:01 
vboxnet2: flags=8842<BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	ether 0a:00:27:00:00:02 
utun1: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1380
	inet6 fe80::37b0:cdae:5ad4:4f0e%utun1 prefixlen 64 scopeid 0xf 
	nd6 options=201<PERFORMNUD,DAD>
utun2: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1380
	inet6 fe80::8e14:3930:f5f9:ae9c%utun2 prefixlen 64 scopeid 0x10 
	nd6 options=201<PERFORMNUD,DAD>
RedQueen:rosie mgibbs$

Let’s save that into a file:

RedQueen:rosie mgibbs$ ifconfig > ifconfig.txt
RedQueen:rosie mgibbs$
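Before letting any pattern loose on the file, it helps to have an independent list of the addresses that are really in there. The sketch below is one hedged way to get that in Python: it skips regexes altogether and asks the standard library’s ipaddress module whether each whitespace-separated token parses as an address. The function name valid_ips and the zone-stripping step are my own choices, and the sample lines are copied from the ifconfig output above.

```python
import ipaddress


def valid_ips(line):
    """Return the tokens on a line that parse as IPv4 or IPv6 addresses."""
    found = []
    for token in line.split():
        candidate = token.split("%", 1)[0]  # drop a zone suffix such as %en0
        try:
            found.append(str(ipaddress.ip_address(candidate)))
        except ValueError:
            pass  # not an address (a keyword, a MAC address, a hex mask, ...)
    return found


print(valid_ips("\tinet 192.168.0.180 netmask 0xffffff00 broadcast 192.168.0.255"))
# ['192.168.0.180', '192.168.0.255']
print(valid_ips("\tinet6 fe80::425:850:43a6:5863%en0 prefixlen 64 secured scopeid 0x4 "))
# ['fe80::425:850:43a6:5863']
print(valid_ips("\tether ac:87:a3:31:87:8b "))
# []
```

To run it over the saved file, loop over open("ifconfig.txt") and call valid_ips on each line.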

We’ll analyze this data using the default collection of pre-defined patterns. These patterns are defined using the Rosie Pattern Language, which you can explore interactively using the read-eval-print loop (REPL). For now, let’s just have Rosie analyze our test data:

[Screenshot: output of rosie basic.matchall ifconfig.txt. Credit: Mark Gibbs]

The command line, rosie basic.matchall ifconfig.txt, instructed Rosie to use the pattern basic.matchall, which is specified in the RPL file basic.rpl like this:

basic.matchall = ( basic.datetime_patterns / basic.network_patterns /
                   basic.element / basic.element_quoted / basic.element_bracketed /
                   [:space:]+ /
                   basic.punctuation /
                   basic.unmatched
                 )+

As you can see, the basic.matchall pattern is built from a number of other patterns which are, in turn, built from other patterns … it’s patterns all the way down, which makes it easy to build your own patterns from existing ones as well as add new patterns to perform whatever specialized data extraction task you need.
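If the shape of basic.matchall is new to you, the following Python sketch may help. It assumes that the “/” in the definition is an ordered choice (try each alternative in turn and take the first that matches, as in parsing expression grammars) and that the trailing “+” repeats that choice along the line. Everything in it is hypothetical: the names, the crude regexes standing in for Rosie’s sub-patterns, and the JSON shape, which is not Rosie’s actual output format.

```python
import json
import re

# Hypothetical stand-ins for a few of the alternatives, tried in this order
# at every position. These regexes are far cruder than anything in basic.rpl.
ALTERNATIVES = [
    ("network.ipv4", re.compile(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}")),
    ("element", re.compile(r"[A-Za-z0-9_]+")),
    ("space", re.compile(r"\s+")),
    ("punctuation", re.compile(r"[^\w\s]")),
]
# Catch-all for any character the alternatives above reject.
ANY_CHAR = re.compile(r".", re.DOTALL)


def match_all(line):
    """Walk the line left to right, taking the first alternative that matches."""
    pos, found = 0, []
    while pos < len(line):
        for name, pattern in ALTERNATIVES:
            m = pattern.match(line, pos)
            if m:
                break
        else:
            name, m = "unmatched", ANY_CHAR.match(line, pos)
        if name != "space":  # whitespace is consumed but not reported
            found.append({"type": name, "text": m.group(0), "pos": pos})
        pos = m.end()
    return found


print(json.dumps(match_all("inet 127.0.0.1 netmask 0xff000000"), indent=2))
```

The order of the alternatives carries meaning. With the list as written, 127.0.0.1 is reported as a single network.ipv4 match; move the element entry to the top and the same text comes back as a string of elements and punctuation, because 127 satisfies the element regex first.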
