Walking with regular expressions

§ Gist We develop a generic, NFA-based regular expression engine which lets us inspect a matcher's state and choose its next piece of input. We then take advantage of these features to lazily traverse directed graphs, only visiting parts that match arbitrary patterns.

§ Introduction According to Wikipedia, "a regular expression […] is a sequence of characters that specifies a match pattern in text". Although text is by far their most common target, nothing prevents them from being applied to other kinds of data, as in spec, seqexp, and malli. Irrespective of domain, however, RE (Regular Expression) libraries typically provide only opaque (black-box) matchers.

 Here, we'll follow an alternative approach, one where the matcher is turned inside out, being externally driven and inspectable. To this end, we'll compile REs to NFAs (Non-deterministic Finite Automata), which will process successive pieces of input. Matchers based on automata explore all possible paths simultaneously, meaning they never have to backtrack. Thus, despite lacking features such as backreferences and lookaround, they have O(m*n) worst-case time complexity, where m and n are proportional to the RE and input size. Moreover, we can examine the state of the NFA between transitions and decide if and what input should be processed next. Finally, our engine will accept arbitrary matching predicates, thus being generic in both its logic and input datatypes.

§ NFAe compilation There are quite a few techniques for implementing RE engines,1 but one of the oldest and simplest is the simulation of equivalent NFAs. Russ Cox provides a comprehensive review of it, which describes Thompson's construction, i.e., the incremental buildup of an NFAe (NFA with epsilon transitions, which happen spontaneously without consuming any input) from a parsed RE. Thompson's construction relies on the recursive structure of the parsed RE to construct an NFAe from smaller ones, but requires the ability to form cycles while doing so. This is somewhat challenging to do idiomatically in languages emphasizing immutability, such as Clojure.2 However, Clojure being extremely data-centric would make the kind of REs we're about to build especially useful, so we'll stick with it.

 Our engine will support the following common set of operators:

:cat concatenation of operands
:alt exclusive choice among operands
:? at most one repetition
:* zero or more repetitions
:+ one or more repetitions

To overcome the cyclic structure issue, we'll represent the NFAe by its transition table rather than its state diagram, which is essentially indirection, since states will be referenced by name in this structure. For a concrete example, consider the RE [:cat :a [:alt :b [:+ :c]]], equivalent to the NFAe with the following state diagram:


in which the hollow kite arrows indicate starting and accepting (final) states, here S-0 and S-accept, respectively. The corresponding transition table may be encoded as:

{S-0 {:a      #{S-1}}
 S-1 {epsilon #{S-4 S-2}}
 S-2 {:b      #{S-accept}}
 S-4 {:c      #{S-3}}
 S-3 {epsilon #{S-accept S-4}}}

The only thing missing to completely specify an NFAe is the starting state, so we'll represent an NFAe as a map of two keys, namely :s for the starting state and :tt for the transition table. In the next paragraphs, we'll write a few procedures to help us construct such NFAe representations given equivalent REs.

 We'll start with a couple of utilities for working with transition tables. The first one updates the destinations of a transition table tt via a procedure f:

(defn update-dests [tt f]
  (update-vals tt #(update-vals % f)))

The other appends a state to an NFAe by replacing accepting states in a transition table tt with a given state s:

(def s-accept 'S-accept)

(defn join-tails [tt s]
  (let [join-tail (fn [ss] (-> ss (disj s-accept) (conj s)))]
    (update-dests tt #(cond-> % (contains? % s-accept) join-tail))))

 These suffice to implement the operations of our engine, so, before delving into their implementations, we'll provide an overview of what we're about to build. The following table shows the state diagram corresponding to each operation:

[:cat A-0 ... A-n] [:alt A-0 ... A-n] [:? A] [:* A] [:+ A]
cat.svg alt.svg question.svg star.svg plus.svg

As before, ingoing/outgoing hollow arrows indicate starting/accepting states. Operands are themselves automata (their names start with A) and any state introduced by the operation itself is named S. Finally, unlabeled filled arrows are transitions patched by join-tails, that is, the set of transitions in the arrow's origin that were leading to the accepting state, but now lead to the destination depicted in the diagram.

 To implement :cat, we only have to chain the given NFAes together. For a pair of NFAes, this is done by merging their transition tables after joining the tails of the first to the starting state of the second:

(defn chain [{s1 :s tt1 :tt} {s2 :s tt2 :tt}]
  {:s s1 :tt (merge (join-tails tt1 s2) tt2)})

For :alt, we also need a state s to act as a choice point between a collection of alternatives. Following the diagram, we merge all transition tables and add epsilon transitions from s to the starting states of all alternatives:

(def epsilon 'epsilon)

(defn split [s-new alternatives]
  (let [ss (set (map :s alternatives))]
    {:s s-new :tt (reduce merge
                          {s-new {epsilon ss}}
                          (map :tt alternatives))}))

Finally, the repetition operators (:?, :*, and :+) can all be implemented by a single procedure that takes advantage of the similarities between their connectivities:

(defn repetition [op s-new [{:keys [s tt]} & rst]]
  (assert (empty? rst))
  {:s  (if (#{:+} op) s s-new)
   :tt (-> (cond-> tt
             (#{:* :+} op) (join-tails s-new))
           (assoc-in [s-new epsilon] #{s s-accept}))})

 Piecing these operations together, we can write the RE to NFAe compiler:

(defn >NFAe [re & {:keys [make-state make-pass?]
                   :or   {make-state (fn [_] (gensym "S-"))
                          make-pass? #(if (or (fn? %) (set? %)) % #{%})}
                   :as   options}]
  (if-not (vector? re)
    (let [s-new (make-state re)]
      {:s s-new :tt {s-new {(make-pass? re) #{s-accept}}}})
    (let [[op opnds] [(first re) (map #(>NFAe % options) (rest re))]]
      (case op
        :cat       (reduce chain opnds)
        :alt       (split (make-state re) opnds)
        (:? :* :+) (repetition op (make-state re) opnds)))))

Our compiler accepts make-state and make-pass? procedural parameters for constructing state names and transition predicates, respectively. Literals (i.e., anything but vectors) are translated to an NFA with a starting and an accepting state connected by a single transition, triggered iff input satisfies the predicate. Let's compile the example RE from above with it:

> (>NFAe [:cat :a [:alt :b [:+ :c]]]
         :make-pass? identity
         :make-state (let [c (volatile! -1)]
                       (fn [_] (symbol (str "S-" (vswap! c inc))))))
{:s  S-0
 :tt {S-0 {:a      #{S-1}}
      S-1 {epsilon #{S-4 S-2}}
      S-2 {:b      #{S-accept}}
      S-4 {:c      #{S-3}}
      S-3 {epsilon #{S-accept S-4}}}}

§ Epsilon elimination The compiler simplicity we got by using epsilon transitions comes with a trade-off: every time a state is activated, its epsilon closure (i.e., all states reachable from it via epsilon transitions) must be activated as well. Fortunately, every NFAe can be converted to an NFA by computing all its epsilon closures and embedding them into it as explicit transitions.3

 To compute epsilon closures of an NFAe, we'll start with a map of each state to a set containing the state itself along with its direct epsilon transition destinations. Using the map itself, we'll then iteratively augment the state sets with states reachable by epsilon transitions until we reach a fixed point. Finally, for completeness, we add the epsilon closure of the accepting state:

(require '[clojure.set :as set])

(defn map-union [f coll]
  (transduce (map f) set/union coll))

(defn epsilon-closures [{:keys [tt]}]
  (let [init (fn [[s ts]] [s (conj (get ts epsilon #{}) s)])]
    (loop [old (into {} (map init) tt)]
      (let [new (update-vals old (partial map-union old))]
        (if (not= old new) (recur new)
            (assoc new s-accept #{s-accept}))))))

Let's compute the epsilon closures of our example NFAe:

> (epsilon-closures '{:s  S-0
                      :tt {S-0 {:a      #{S-1}}
                           S-1 {epsilon #{S-4 S-2}}
                           S-2 {:b      #{S-accept}}
                           S-4 {:c      #{S-3}}
                           S-3 {epsilon #{S-accept S-4}}}})
{S-0      #{S-0}
 S-1      #{S-4 S-2 S-1}
 S-2      #{S-2}
 S-4      #{S-4}
 S-3      #{S-accept S-4 S-3}
 S-accept #{S-accept}}

 With the epsilon closures at our disposal, we can finally eliminate epsilon transitions by following these steps:

In Clojure:4

(defn eliminate-epsilon [{:keys [tt s] :as NFAe}]
  (let [ecs   (epsilon-closures NFAe)
        tt    (as-> (update-dests tt (partial map-union ecs)) $
                (update-vals $ #(dissoc % epsilon))
                (into {} (keep (fn [[s ts]] (when (seq ts) [s ts]))) $))
        shake (partial set/select (some-fn tt (partial = s-accept)))]
    {:ss (shake (get ecs s))
     :tt (update-dests tt shake)}))

(def >NFA (comp eliminate-epsilon >NFAe))

So, our example without epsilon transitions becomes:

> (>NFA [:cat :a [:alt :b [:+ :c]]]
        :make-pass? identity
        :make-state (let [c (volatile! -1)]
                      (fn [_] (symbol (str "S-" (vswap! c inc))))))
{:ss #{S-0}
 :tt {S-0 {:a #{S-4 S-2}}
      S-2 {:b #{S-accept}}
      S-4 {:c #{S-accept S-4}}}}

and looks like this:


§ Matching All that remains now is to write a procedure to transition the NFA for a given input:

(defn transition [{:keys [ss tt]} input]
  (when (seq ss)
    (let [propagate       (fn [[pass? to]] (if (pass? input) to #{}))
          propagate-state #(map-union propagate (get tt %))]
      {:ss (map-union propagate-state ss) :tt tt})))

Inside it, propagate implements the transition predicate check for a single transition, which is in turn used by propagate-state to carry out all transitions for a given state. The NFA is returned with a starting state set equal to the union of propagate-state results for all starting (active) states.

 At this point, it's trivial to write the following procedures to check an NFA's state and process an input sequence with it:

(def on? (comp seq :ss))
(def off? (complement on?))
(def accept? (comp #(contains? % s-accept) :ss))
(def process (partial reduce (comp #(cond-> % (off? %) reduced) transition)))

which allow us to finally test our NFA:

> (let [NFA      (>NFA '[:cat a [:alt b [:+ c]]])
        matches? (comp accept? (partial process NFA))]
    (and (every? matches? '[[a b] [a c] [a c c] [a c c c]])
         (not-any? matches? '[[] [a] [a b c] [a c b]])))

A slightly more complex RE:

> (let [NFA      (>NFA '[:cat [:* a] [:? [:alt [:+ b] c]]])
        matches? (comp accept? (partial process NFA))
        matching '[[] [b] [b b] [b b b] [c]]]
    (and (every? matches? matching)
         (every? matches? (map (partial concat '[a]) matching))
         (every? matches? (map (partial concat '[a a]) matching))
         (not-any? matches? '[[a b c] [b b c] [d]])))

§ RE-directed traversal A use case for our engine is traversal of large/infinite directed graphs. The ability to externally provide input piecemeal and check whether the matching process is still active allows us to lazily visit only matching parts of it. Let's see how that unfolds, starting with a breadth-first node serialization:

(defn bfs [node direct-successors extract]
  (letfn [(bfs [q]
            (when-let [node (peek q)]
              (lazy-seq (let [out (extract node)]
                          (cond->> (bfs (pop (into q (direct-successors node))))
                            out (cons out))))))]
    (bfs (conj clojure.lang.PersistentQueue/EMPTY node))))

 A uniform interface for accessing direct successors would facilitate the traversal of heterogeneous data structures. We'll therefore define a protocol for Addressable datatypes,5 filling in implementations needed for our examples, namely for maps, vectors, nil, and Object (default). In graph terminology, addresses returns a node's outgoing edges, while fetch returns an edge's destination node:

(defprotocol Addressable
  (addresses [this])
  (fetch [this address]))

(extend-protocol Addressable
  (addresses [this] (keys this))
  (fetch [this k] (get this k))

  (addresses [this] (range (count this)))
  (fetch [this n] (nth this n))

  (addresses [this] nil)
  (fetch [this _] nil)

  (addresses [this] nil)
  (fetch [this _] nil))

To get a feel for bfs usage, let's write a procedure to return all vector/map address paths in a structure:

(defn bfs-paths [data]
  (bfs {:data data :path []}
       (fn [{:keys [data path]}]
         (map (fn [addr]
                {:data (fetch data addr)
                 :path (conj path addr)})
              (addresses data)))

which works as expected:

> (bfs-paths {:a [:b {:c :d} :e]
              :f [:g [:h {:i :j}]]})
 [:a 0]
 [:a 1]
 [:a 2]
 [:f 0]
 [:f 1]
 [:a 1 :c]
 [:f 1 0]
 [:f 1 1]
 [:f 1 1 :i])

 We can also constrain the traversal only to paths that match a set of NFAs. To achieve that, we'll:

(defn bfs-match [NFAs data]
  (bfs {:data data :path [] :NFAs NFAs}
       (fn [{:keys [data path NFAs]}]
         (keep (fn [addr]
                 (let [succ (fetch data addr)
                       NFAs (map #(transition % [addr succ]) NFAs)]
                   (when (every? on? NFAs)
                     {:data succ :path (conj path addr) :NFAs NFAs})))
               (addresses data)))
       (fn [{:keys [path NFAs]}]
         (when (every? accept? NFAs)

Note that, along with the address (addr), we also provide the successor (succ) pointed to by the address as part of the input, in order to grant or deny matching based on both. This is indicative of the way that the genericness of our engine comes in handy when designing bespoke matchers of this kind.

 To close out this section, let's see how we might use bfs-match on x86-64 instruction set data in JSON format (source). Using clojure.java.io and clojure.data.json, we can parse the dataset into EDN:

(require '[clojure.java.io :as io]
         '[clojure.data.json :as json])

(def x86-64-isa
  (with-open [rdr (io/reader "x86-64.json")]
    (json/read rdr :key-fn keyword)))

Now, to find, e.g., which instructions have 30 forms, we can write:6

> (bfs-match [(>NFA [:cat [:* any?] #(and (-> % first #{:forms})
                                          (-> % second count #{30}))])]
([:instructions :ADD :forms]
 [:instructions :AND :forms]
 [:instructions :XOR :forms]
 [:instructions :OR :forms]
 [:instructions :SUB :forms]
 [:instructions :SBB :forms]
 [:instructions :ADC :forms]
 [:instructions :CMP :forms])

The RE is quite natural to write, all it says is that we'd like to find paths ending with a :forms key associated with a collection of 30 elements. For every found path above, the traversal stops at the next step, because the RE accepts no further input. To search deeper, we have to allow extra input in it. Suppose, for example, that we'd like to find which of the above forms have an opcode equal to 00:

> (bfs-match [(>NFA [:cat
                     [:* any?]
                     #(and (-> % first #{:forms})
                           (-> % second count #{30}))
                     [:* any?]])
              (>NFA [:cat
                     [:* any?]
                     #(-> % first #{:opcode})
                     #{[:byte "00"]}])]
([:instructions :ADD :forms 2 :encodings 0 :opcode :byte]
 [:instructions :ADD :forms 20 :encodings 0 :opcode :byte])

Of course, this is exactly equivalent to a single RE combining the conditions:

> (bfs-match [(>NFA [:cat
                     [:* any?]
                     #(and (-> % first #{:forms})
                           (-> % second count #{30}))
                     [:* any?]
                     #(-> % first #{:opcode})
                     #{[:byte "00"]}])]
([:instructions :ADD :forms 2 :encodings 0 :opcode :byte]
 [:instructions :ADD :forms 20 :encodings 0 :opcode :byte])

§ Closing thoughts Our engine provides us with the power of regular expressions in finding patterns in any kind of data. Even though this expository, unoptimized implementation is not as slow as it might seem,7 its real strength comes from being able to incrementally and transparently process input, which allows us to prune non-matching structures as soon as we encounter them. For large enough graphs, it will eventually outscale any engines not supporting partial matching, provided that a certain portion of the graph is pruned.



If interested, Wikipedia provides a nice overview. For more in-depth expositions, see Implementing Regular Expressions, Regex engine internals as a library, and perlreguts documentation.


The emphasis here is on "idiomatically". It's always possible to use, e.g., deftype with mutable fields, indirection via array addresses, or plain Java interop to achieve the intended effect.


Glushkov's construction algorithm produces the same NFA we're going to obtain after removing epsilon transitions from an NFAe, directly from the RE. Understandably, though, it's a bit harder to follow and implement than our two-phase approach, so the latter was chosen here.


The check for s-accept in shake is done via (partial = s-accept) instead of #{s-accept} to maintain correctness in case s-accept is defined as nil or false.


The protocol can indirectly support arbitrary object types via datafy​/​nav.


This and later REs can be written much more concisely by parameterizing make-pass?, but, in the interest of clarity, we'll just use inline procedures here.


Criterium microbenchmarks of matching 10 integers against [:* int?] (same as in malli's README) showed a 17.8× slowdown in mean execution time for spec and a 4.5× one for our engine (both relative to malli).

Creative Commons License This work by Aris Spathis is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Views expressed are solely the author's and do not represent the views of any other entity.