State machines are an important tool in computer programming, and Ragel is a wonderful tool for creating them. Come learn how to use Ragel to compose simple state machines into much more complicated versions useful for parsing and processing all manner of input. We'll progress from simple regex-like machines to full-blown context-sensitive scanners capable of ripping-fast processing of protocols and language grammars.
https://www.youtube.com/watch?v=Tr83XxNRg3k
Github: https://github.com/ijcd
Twitter: @ijcd
San Francisco, California
SPEAKER NOTES:
-------------
Most programmers should be familiar with the standard regexp form.
Ragel excels at ripping-fast parsing and processing of textual data.
These can be used to wire machines together. Composition. Computer science.
They are, more or less. However, Ragel has a few more tricks up its sleeve. We're going to compose these simple machines. And execute arbitrary actions at any point inside them. That is when things get really cool.
At a high level, it helps you build regular expressions, but with an important difference. With Ragel, you can stop at any point in the regexp parse and execute code in the host language. This is incredibly powerful. It means that rather than having a large program with lots of regexps, loops, and conditionals, we can have one EBNF-ish looking definition that instead calls out to our code when we need it to. You can do some really cool tricks with this.
The machine definition statement associates an FSM expression with a name. Machine expressions assigned to names can later be referenced in other expressions. A definition statement on its own does not cause any states to be generated. It is simply a description of a machine to be used later. States are generated only when a definition is instantiated, which happens when a definition is referenced in an instantiated expression.
This is the part of Ragel that I found fascinating when I first grokked it. This compositional technique is what gives Ragel its extreme simplicity and combinational power. I can't stress this point enough.
The operation first creates a new start state. Epsilon transitions are drawn from the new start state to the start states of both input machines. Nondeterminism. If there are strings, or prefixes of strings that are matched by both machines then the new machine will follow both parts of the alternation at once. The union operation is shown below.
On the surface, Ragel scanners are similar to those defined by Lex.
Though there is a key distinguishing feature: patterns may be arbitrary Ragel expressions and can therefore contain embedded code.
With a Ragel-based scanner the user need not wait until the end of a pattern before user code can be executed.
Scanners can be used to process sub-languages, as well as for tokenizing programming languages.
You can use fcall and fret to jump around in the parser, like function calls.
2. GOALS FOR THIS TALK
1. Convince you that Ragel is worth trying.
2. Give you some intuition about how it works.
3. Show you how to set up a basic parser.
13. SOFTWARE! FTW!
I'm a software dude. I code things. I code the internets and
the googles. I'm also a recovering technology entrepreneur.
I've been in and out of startup institutions my entire life.
22. BUT REGULAR EXPRESSIONS ARE EASY!
Regular expressions consist of constants and operator
symbols that denote sets of strings and operations over
these sets, respectively. (from Wikipedia)
23. RUBY HAS GREAT TOOLS FOR REGULAR
EXPRESSIONS
You can get by with them. You can especially get by with
them in Ruby which draws its heritage from Perl, Sed, and
Awk which made wonderful use of regexps.
25. SOMETIMES YOU WANT MORE CONTROL
I posit that this might be some sort of automaton.
26. FINITE AUTOMATA
• Have states and transitions.
• Change state based on sequence of inputs.
• DFA can be in only one state at a time.
• NFA can be in more than one state at a time.
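To make the DFA bullets concrete, here is a minimal hand-written sketch in plain Ruby. It is not Ragel output; the state names, the DFA_TRANSITIONS table, and the dfa_accepts? helper are all invented for illustration. The machine accepts an optional sign followed by one or more digits.

```ruby
# A hand-rolled DFA: optional sign, then one or more digits.
# Illustration only; names and layout are assumptions, not what Ragel generates.
DFA_TRANSITIONS = {
  start:  { sign: :signed, digit: :number },
  signed: { digit: :number },
  number: { digit: :number }
}.freeze

DFA_FINAL_STATES = [:number].freeze

# Map one character to the input symbol used as a key in the table.
def dfa_classify(char)
  case char
  when "+", "-" then :sign
  when "0".."9" then :digit
  else :other
  end
end

# Run the machine. There is exactly one current state at any time.
def dfa_accepts?(input)
  state = :start
  input.each_char do |char|
    state = DFA_TRANSITIONS.fetch(state, {})[dfa_classify(char)]
    return false if state.nil? # no transition for this input: reject
  end
  DFA_FINAL_STATES.include?(state)
end
```

With this table, dfa_accepts?("-42") is true, while dfa_accepts?("+") is false because the machine stops in a non-final state.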
27. EQUIVALENCE OF REGULAR EXPRESSIONS,
NFAS, AND DFAS
It is possible to convert freely between
regular expressions, deterministic finite
automata, and nondeterministic finite
automata. Given one, we can convert it to
any of the other forms.
http://faculty.ycp.edu/~dhovemey/fall2008/cs340/notes/lecture3.html
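One way to get a feel for the NFA-to-DFA direction is to simulate an NFA by tracking the set of states it could be in. Each distinct set then plays the role of a single DFA state, which is the textbook subset construction. The sketch below is plain Ruby with invented names (NFA_TRANSITIONS, nfa_accepts?, reachable_state_sets); the example NFA accepts strings over a and b that end in "ab".

```ruby
require "set"

# NFA for strings over {a, b} ending in "ab". State 0 is nondeterministic on "a".
NFA_TRANSITIONS = {
  0 => { "a" => [0, 1], "b" => [0] },
  1 => { "b" => [2] },
  2 => {}
}.freeze
NFA_START = 0
NFA_FINAL = 2

# Advance every active state at once and collect the states reached.
def nfa_step(states, char)
  states.flat_map { |s| NFA_TRANSITIONS[s].fetch(char, []) }.to_set
end

# The NFA accepts if any of the states it ends in is final.
def nfa_accepts?(input)
  states = Set[NFA_START]
  input.each_char { |char| states = nfa_step(states, char) }
  states.include?(NFA_FINAL)
end

# Breadth-first walk over sets of NFA states. Every set found here
# corresponds to one state of an equivalent DFA.
def reachable_state_sets(alphabet = %w[a b])
  seen = [Set[NFA_START]]
  queue = [Set[NFA_START]]
  until queue.empty?
    current = queue.shift
    alphabet.each do |char|
      nxt = nfa_step(current, char)
      next if nxt.empty? || seen.include?(nxt)
      seen << nxt
      queue << nxt
    end
  end
  seen
end
```

For this NFA the walk finds three sets, {0}, {0, 1}, and {0, 2}, so an equivalent DFA needs three states.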
28. THESE ARE ALL STATE MACHINES
State machines are an important tool in computer
programming, and Ragel is a wonderful tool for creating
them.
32. THERE ARE EXAMPLES EVERYWHERE
• watch with timer
• vending machine
• traffic light
• bar code scanner
• gas pumps
• number classification
33. THE CAT'S MEOW?
State machines are great for many reasons. They are simple
to understand, and there has been a great deal of research
around finite automata and state machines. With the right
approach they can also produce code that is faster, easier to
maintain, and more correct and thus, more secure.
34. STILL NOT CONVINCED?
Rather than me trying to convince you that they're useful,
let's just talk about them for a bit and see where we end up.
42. WHAT’S THE BIG DEAL?
They just look like regular expressions.
43. WHAT IS RAGEL EXACTLY?
Ragel is a finite-state machine compiler with
output support for C, C++, C#, Objective-C, D,
Java, OCaml, Go, and Ruby source code. It
supports the generation of table or control flow
driven state machines from regular expressions
and/or state charts and can also build lexical
analysers via the longest-match method. Ragel
specifically targets text parsing and input
validation.
https://en.wikipedia.org/wiki/Ragel
44. STATE MACHINE GENERATION
Ragel supports the generation of table or control flow
driven state machines from regular expressions and/or
state charts and can also build lexical analysers via the
longest-match method. A unique feature of Ragel is that
user actions can be associated with arbitrary state
machine transitions using operators that are integrated
into the regular expressions. Ragel also supports
visualization of the generated machine via Graphviz.
https://en.wikipedia.org/wiki/Ragel
45. HOW DO YOU PRONOUNCE IT?
• RAY-gull?
• RAY-jul?
• RAH-gull?
• RAH-jul?
51. GENERAL STRUCTURE OF A RAGEL FILE
• Mostly in the host language
• has a .rl extension (simple.rl)
• %% is used for inline statements
• %%{ is used for multiline statements }%%
52. NAMING A MACHINE
With named machines, you can spread a
machine's statements across several files or
include common sections.
54. MACHINE INSTANTIATION
This causes the actual generation of the referenced set of states.
Each instantiation generates a distinct set of states.
55. FILE INCLUSION AND IMPORT
You can include and import definitions from other files.
These can help you keep things organized. See the manual
for the specific semantics of each.
71. BUILDING BLOCKS
We have simple machines now.
Like levers, wedges, wheels, and pulleys.
But let's not stop here.
From simple machines we can make complex machines.
75. COMPOSITION
Ragel's DSL allows you to take these simple machines, and
through some basic operators, combine those into bigger
machines, and then combine those into BIGGER machines.
89. ONE OR MORE REPETITION
Produces the concatenation of the machine with the Kleene
star of itself. The result will match one or more repetitions of
the machine.
Equivalent to: expr . expr*
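The same identity holds for ordinary Ruby regexps, which makes it easy to check by hand. This is only an analogy in plain Ruby, not Ragel syntax; the pattern, the sample strings, and the repetition_mismatches helper are made up for the sketch.

```ruby
# "One or more" written two ways: with +, and as the pattern followed by its star.
PLUS_FORM = /\A(?:ab)+\z/
STAR_FORM = /\A(?:ab)(?:ab)*\z/

REPETITION_SAMPLES = ["", "ab", "abab", "aba", "ba", "ababab"].freeze

# Return the inputs on which the two forms disagree (none are expected).
def repetition_mismatches(samples)
  samples.reject { |s| PLUS_FORM.match?(s) == STAR_FORM.match?(s) }
end
```

Calling repetition_mismatches(REPETITION_SAMPLES) returns an empty array, since both forms accept and reject the same strings.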
96. STATE MACHINE MINIMIZATION
• Reduces the number of states through optimization
• Merges equivalent states
• On by default (can be disabled with -n)
97. USER ACTIONS
Composition is definitely cool and useful. But on top of that,
Ragel gives you embedded actions. This is where you take all
the composition and really make it sing, on key.
104. EMBEDDING OPERATORS CAN GET FANCY
See the manual for more information on these:
• To-State Actions
• From-State Actions
• EOF Actions
• Global Error Actions (for error recovery)
• Local Error Actions (for error recovery)
105. NONDETERMINISM
One of the problems you will run into is when the trailing
match of one machine is the same as the leading match of
the next machine. In these cases, the state will be stuck in
the first machine and never transition to the next machine.
106. NONDETERMINISM EXAMPLE
The \n in ws will prevent the final \n from matching.
The solution here is simple: exclude the newline character
from the ws expression.
107. AMBIGUITY PROBLEMS
Here's an incorrect way to parse C language comments:
The any will prevent the trailing */ from ever matching.
108. THIS WORKS BUT IT’S UGLY
We have to carefully exclude things to get it to match.
109. THIS IS GETTING COMPLICATED!
But there’s a solution.
Ragel lets you embed priorities into transitions to deal with
ambiguity.
112. GUARDED OPERATIONS
Thinking in priorities is hard.
Fortunately, Ragel provides some better mechanisms for us
to use.
These are called “guarded concatenations”.
113. FINISH-GUARDED CONCATENATION
A higher priority is then embedded into the transitions of the
second machine that enter into a final state.
This is much simpler to visualize and reason about.
119. PROTOCOL PARSING
Ragel is well suited for protocol parsing.
Mapping an RFC onto a Ragel specification is pretty
straightforward.
Puma has a good example of this (heritage is the original
mongrel parser by Zed Shaw)
https://github.com/puma/puma/blob/master/ext/puma_http11/http11_parser_common.rl
120. STATE CHARTS
Ragel allows you to specify states and transitions directly if
you desire extreme customization.
This is like programming in the "assembly" of Ragel.
There are a few new operators for this.
123. PARSING RECURSIVE STRUCTURES
The general trick is to store some context about where
you are in your recursive structure, say in a stack called
@nestings, and push/pop to it as appropriate. When it
comes time to call fret, you can examine your @nestings
and steer the parser as deemed appropriate.
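The stack idea can be tried out in plain Ruby before wiring it into a grammar. The sketch below is not Ragel code: a hand-written loop stands in for the machine, and the NestingTracker class, its PAIRS table, and the balanced? method are invented here. In a real Ragel file the push would sit in an action next to fcall and the pop in an action next to fret.

```ruby
# Plain-Ruby model of "flat machine plus explicit context stack".
class NestingTracker
  PAIRS = { "(" => ")", "[" => "]", "{" => "}" }.freeze

  attr_reader :max_depth

  def initialize
    @nestings = []  # context stack, named after the @nestings on the slide
    @max_depth = 0
  end

  # Returns true when every opener is closed by its matching closer.
  def balanced?(input)
    @nestings.clear
    @max_depth = 0
    input.each_char do |char|
      if PAIRS.key?(char)
        @nestings.push(char) # entering a nested structure
        @max_depth = [@max_depth, @nestings.size].max
      elsif PAIRS.value?(char)
        opener = @nestings.pop # leaving: check what we were inside
        return false unless opener && PAIRS[opener] == char
      end
    end
    @nestings.empty?
  end
end
```

The state machine itself has no memory of depth; the stack carries that context, which is why the top of the stack is what you inspect when deciding where to return.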
124. IMPLEMENTING LOOKAHEAD
This is possible. The trick here is to match deeper than you
need, then use fhold to walk the parser back a few
characters.
125. RAGEL INTERNALS
Ragel uses several variables for state. You can twiddle
them in actions.
Those are the major ones. See the manual for more details.
126. RAGEL OPERATION (ROUGHLY)
1. Starts in the machine's start state (cs is set by %% write init)
2. Feed it data, updating p and pe as appropriate
3. Run the %%exec loop
4. Characters move it through a state
5. It consumes p -> pe from data
6. If cs is >= first_final_state (final states are
last) then you have “admitted” the string
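The steps above can be modeled with a tiny table-driven runner in plain Ruby. This is a sketch of the idea only, not the code Ragel generates: the TOY_* constants, the toy_exec and toy_accepts? helpers, and the state numbering are all assumptions of the example. The toy machine matches 'a', then any number of 'b', then 'c'.

```ruby
# Toy runner for the pattern: 'a' 'b'* 'c'
# State numbers are arbitrary to this sketch; -1 stands in for the error state.
TOY_START = 0
TOY_ERROR = -1
TOY_FIRST_FINAL = 2 # final states are numbered last

TOY_TABLE = {
  0 => { "a" => 1 },
  1 => { "b" => 1, "c" => 2 },
  2 => {}
}.freeze

# Consume data[p...pe] starting from state cs. Returns the new [cs, p].
def toy_exec(data, cs, p, pe)
  while p < pe && cs != TOY_ERROR
    cs = TOY_TABLE[cs].fetch(data[p], TOY_ERROR)
    p += 1
  end
  [cs, p]
end

# Feed the input in chunks, carrying cs from one chunk to the next.
def toy_accepts?(*chunks)
  cs = TOY_START
  chunks.each do |chunk|
    cs, _p = toy_exec(chunk, cs, 0, chunk.length)
  end
  cs >= TOY_FIRST_FINAL
end
```

Because all the state lives in cs, toy_accepts?("ab", "bc") gives the same answer as toy_accepts?("abbc"): the input can arrive in pieces as long as cs is kept between calls.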
127. RAGEL OPERATION (SCANNERS)
Scanners are a bit more involved, but not that much more.
1. Use a stack to track states
2. Use ts -> te to track where they are in a match
3. Use the stack to backtrack when necessary
4. Keep matching repeatedly until we are done
5. Longest match wins
6. It's useful to create helper methods (emit,
current_buffer, current_match(start, end))
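The longest-match loop can be sketched in plain Ruby to show how ts and te move. This is not how Ragel implements scanners (it compiles all patterns into one machine and backtracks); here ordinary regexps stand in for the patterns, and the rule set, token names, and scan_tokens helper are invented for illustration. In this sketch, an earlier rule wins when two rules match the same length.

```ruby
# Each rule is [token_name, regexp anchored at the start of the remaining input].
SCAN_RULES = [
  [:float, /\A[0-9]+\.[0-9]+/],
  [:int,   /\A[0-9]+/],
  [:ident, /\A[a-z_][a-z0-9_]*/],
  [:op,    /\A(?:==|=|\+)/],
  [:space, /\A\s+/]
].freeze

def scan_tokens(data)
  tokens = []
  ts = 0 # token start
  while ts < data.length
    rest = data[ts..-1]
    best_name = nil
    te = ts # token end (exclusive)
    SCAN_RULES.each do |name, pattern|
      m = pattern.match(rest)
      next unless m && ts + m.end(0) > te # keep only a strictly longer match
      best_name = name
      te = ts + m.end(0)
    end
    raise ArgumentError, "no rule matches at offset #{ts}" if best_name.nil?
    tokens << [best_name, data[ts...te]] unless best_name == :space # emit
    ts = te # restart matching right after the token
  end
  tokens
end
```

On "3.14" both :float and :int match at the same position, and :float wins because its match is longer.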
128. RAGEL STRING EXTRACTION
To pull out the data you care about, while you are parsing, you will do
something like this:
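A common shape for this is "mark, then slice": remember the position where the interesting text starts, and slice from the mark to p when it ends. The sketch below models that in plain Ruby with a hand-written loop in place of Ragel actions. The extract_words and ascii_letter? helpers and the mark variable are names chosen for this example, and working on a byte array is an assumption borrowed from typical Ragel Ruby examples.

```ruby
# True for the bytes A-Z and a-z.
def ascii_letter?(byte)
  (65..90).cover?(byte) || (97..122).cover?(byte)
end

# Pull every run of ASCII letters out of the input.
def extract_words(input)
  data = input.unpack("C*") # work on bytes, as Ragel's Ruby examples often do
  words = []
  mark = nil # index where the current word began
  p = 0
  pe = data.length
  while p < pe
    if ascii_letter?(data[p])
      mark ||= p # "entering" step: remember the start
    elsif mark
      words << data[mark...p].pack("C*") # "leaving" step: slice mark...p
      mark = nil
    end
    p += 1
  end
  words << data[mark...pe].pack("C*") if mark # word that runs to end of input
  words
end
```

In a Ragel grammar the two steps would typically be separate actions attached to the start and end of the sub-machine you care about.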
130. CODE STYLES
Ragel uses your .rl code to compute the set of states and transitions.
From that, it can generate code in a number of different styles.
131. CODE STYLES PERFORMANCE
Each of these has different visual organization and
performance characteristics. In languages like C, this can boil
down to heavily-optimized GOTO statements in a single
while loop. It's fast and cpu-cache friendly.
132. MULTI-LANGUAGE
It's possible to have a single Ragel definition that uses import
semantics to allow implementing the actions in different
languages using the same parent Ragel file. See the http11
parser in puma for details (C and Java)
https://github.com/puma/puma/tree/master/ext/puma_http11
133. RAGEL IN C
It's also possible to prototype in Ruby, then convert it to a C
module for super-speed. Ragel supports several output
formats so you can do this port rather easily.
Again, see mongrel or puma for ideas.
134. RAGEL DIRECTIVES - INIT
Initializes the data buffer and sets the current state:
135. RAGEL DIRECTIVES - DATA
Writes out definitions of the state and transition data:
136. RAGEL DIRECTIVES - EXEC
Writes out the code that processes the data buffer
using the state and transition data:
144. RAGEL PLAYGROUND
I created a tool in Volt to do some basic visualization.
It's definitely a work in progress, but feel free to try it out.
https://github.com/ijcd/ragel_playground