Pages

Equations for data transformation: Term Rewriting

For a programmer, functions are the fundamental form of expression. Programmers write functions that transform inputs into outputs.

\[f(x) = x + 1\]

The inputs and outputs are complex data that represent something in the real world. I think about the inputs and outputs as abstract data shapes. I might start with a nested map that needs to be transformed into a sequence of sub-selections. Data comes in one shape and goes out another. Squares in, circles out.

Figure 1. The square-to-circle machine ™

For a mathematician, equations are the fundamental form of expression. Equations define valid rules for transforming one expression into another. Mathematicians solve problems through symbolic manipulation.

\[\begin{align*} (a + b)^2 &= (a + b)(a + b)\\ &= aa + ab + ba + bb\\ &= a^2 + 2ab + b^2 \end{align*}\]

Functions and equations are both fine ways to specify transformations. So why only use functions for programming? What does it look like to program data transformation like equations instead of functions? Today we will look at some examples of an equation base approach to specifying data transformations. Before diving into the examples, let’s first establish some definitions

Definitions

Term rewriting

Replacing terms with other terms

Term

A variable or function application: \(a\), \((a + b)\), \((a + b)^2\)

Variable

\(a,b\) can be assigned a value

Function

add, multiply, square

Rule

A pair of matching terms and substitution terms: \((a + b)^2 \Rightarrow a^2 + 2ab + b^2\)

Matching

\((1 + 3)^2\) can be made equal to \((a + b)^2\) when \(a=1,b=3\)

Substitution

Creation of new terms from old ones \((1 + 3)^2 \Rightarrow 1^2 + 2\cdot1\cdot3 + 2^2\)

Math equations are rules that strictly preserve equivalence. Term rewriting is a more general concept that permits any transformation.

Introducing Meander

Meander is a term rewriting system packaged as a Clojure library for data transformation.

[meander/epsilon "0.0.650"]
(ns meander.examples
  (:require [meander.epsilon :as m]))

I prefer to call it a structural pattern syntax for data transformation.

Rewrite

The m/rewrite macro is the swiss army knife of Meander.

Let’s do our first transformation from a vector to a map:

\[[a, b, c, d] \Rightarrow \{a, d\}\]
; Meander!
(m/rewrite [:a 1 :c 2]  ; input
  [?a ?b ?c ?d]         ; a match pattern
  {?a ?d})              ; a substitution pattern

;=> {:a 2 :c 1}         ; output
  • The match pattern is a data literal with the shape of the input

  • The substitution pattern is a data literals with the shape of the output

  • ?a is called a logic variable

  • _ is for ignoring unused terms

?x is a valid symbol in Clojure. Macros manipulate symbols rather than the values associated with those symbols. m/rewrite is a macro; it defines how to interpret the patterns. I think of Meander as a syntax because I specify structure with data literals to pattern match and substitute. Hence why I prefer to call it a "structural pattern syntax".

Logic

(m/rewrite [1 2] ; input
  [?a ?a]        ; a match pattern
  {?a ?a})       ; a substitute pattern

;=> nil

No match.

?a cannot match both 1 and 2.

(m/rewrite [1 1] ; input
  [?a ?a]        ; a match
  {?a ?a})       ; a substitution pattern

;=> {1 1}

Matched ?a with 1

Unification means resolving logical matches.

Repeating terms

(m/rewrite [:a 1 :c 2 :d 3]
  [!a ...]
  [!a ...])

;=> [:a 1 :c 2 :d 3]

!a is a memory variable: an array that can collect many values
…​ indicates 0 or more repeated occurrences

The match pattern is symmetrical with the substitution pattern. We got out exactly what we put in.

Repeat scope

(m/rewrite [1 :a 2 :b 3 :c]
  [!x !y ...]      ; match pairs
  [!y !x ...])     ; substitute flipped pairs

;=> [:a 1 :b 2 :c 3]

All terms preceeding …​ repeat.

Separator

Limits the scope of repeats.

(m/rewrite [:hello 1 2 3]
  [:hello . !v ...]
  [:world . !v ...])

;=> [:world 1 2 3]

Only the terms between . and …​ repeat.

Structurally explicit

Most Clojure functions will coerce their inputs to seqs. Meander does not do this.

[!x …​]

will only match vectors

(!x …​)

will only match lists and seqs

{:k ?v}

will only match maps

#{1}

will only match sets

(m/seqable !x …​)

will match maps, vectors, strings, etc…​

I like that it will only match the same types unless I choose to be more permissive with m/seqable.

Nesting

As you might expect, match and substitute patterns may be nested.

(m/rewrite {:xs [1 2 3 4 5]
            :ys [6 7 8 9 10]}
  {:xs [!xs ...]
   :ys [!ys ...]}
  [!xs ... . !ys ...])

;=> [1 2 3 4 5 6 7 8 9 10]

And Meander does the rearranging for you.

Pop Quiz

(m/rewrite [:hello 1 2 3]
  [?k . !v ...]
  [?k !v ...])

;=> [:hello 1 :hello 2 :hello 3]

Why?

When matching, we expect a single value, then repeated values. The substituting pattern has no separator, so we are substituting pairs.

Congratulations! Enjoy these macarons as your reward.

Figure 2. Delicious macarons

Meander has a rich feature set. I have only shown you the core syntax. We are now going to shift gears and look at some examples in the wild. I’ll introduce a few bits of unexplained but fairly obvious syntax as we go.

For a comprehensive treatment of Meander features, see Meander on GitHub which links to documentation, a cookbook, blog posts and a Strange Loop talk.

Examples

Rearranging logic expressions

This example is taken from Justice (an alternative syntax to Datalog queries).

(def rearrange-logic
  (bottom-up
   ;; nested commutative logic is raised
   ((m/pred #{'and 'or} ?op) . !before ... (?op . !clauses ...) . !after ...)
   (?op . !before ... !clauses ... !after ...)

   ;; moves `or` to the outside,
   ;; and `and` to the inside to match Datalog rule convention
   (and . !before ... (or . !clauses ...) . !after ...)
   (or . (and . ~@!before . !clauses . ~@!after) ...)

   ;; identity logic expressions are flattened
   ((m/pred #{'and 'or} ?op) ?body)
   ?body

   ;; double negatives are removed
   (not (not ?body))
   ?body

   ;; moves `not` inside to match Datalog rule convention
   (not (or . !clauses ...))
   (and . (not !clauses) ...)

   (not (and . !clauses ...))
   (or . (not !clauses) ...)))

Web scraping

This example comes from the Chartit (an ETL/analytics utility for tracking work)

(def extract-employees
  (s/search
   (m/$
    [:div {:class "directory-tables"} &
     (m/scan
      ;; heading/table pairs
      [:h3 {} ?department & _]
      _
      [:table &
       (m/scan
        [:tbody &
         (m/scan
          [:tr &
           (m/separated
            [:td & (m/scan [:a {} ?name & _])]
            [:td {} ?title & _]
            [:td & (m/scan [:a {:href ?mailto} & _])])])])])])
   ;;=>
   {:department (str/trim ?department)
    :name ?name
    :title ?title
    :email (subs ?mailto 7)}))

Parsing defn like forms

Here is another example from Justice. where it was desirable to define a macro that behaves like defn. This example can be directly compared with the Clojure.spec approach.

(defn wrap-defn
  "Returns a function that will parse a form according to `defn` semantics.
  Takes a function which will convert fn-spec forms."
  [rewrite-fn-spec]
  (m/rewrite
   (m/and ((pred simple-symbol? ?name) .
           (pred string? !?docstring) ...
           (pred map? !?attr-map) ...
           !tail ...)
        (m/guard (<= (count !?docstring) 1))
        (m/guard (<= (count !?attr-map) 1))
        (let
         (or (([(m/pred simple-symbol? !params) ... :as !param-list]
               . !forms ... :as !fn-specs) ..1)
             ([(m/pred simple-symbol? !params) ... :as !param-list]
               . !forms ... :as !fn-specs))
          (list* !tail))
        (m/guard (apply distinct? (map count !param-list))))
   ;;>
   (defn ?name . !?docstring ... !?attr-map ...
     ~@(map rewrite-fn-spec !fn-specs))))

Schema to code

This example shows the core functionality of HappyGAPI (a library that exposes Google APIs by generating code from schemas).

(defn summarize-schema [schema request depth]
  "Given a json-schema of type definitions,
  and a request that is a $ref to one of those types,
  resolves $ref(s) to a depth of 3,
  discards the distracting information,
  and returns a pattern for constructing the required input."
  (m/rewrite request
    {:type                 "object"
     :id                   ?id
     :additionalProperties ?ap
     :properties           (m/seqable [!property !item] ...)}
    ;;>
    {& ([!property (m/app #(summarize-schema schema % depth) !item)] ...)}

    {:type  "array"
     :items ?item}
    ;;>
    [~(summarize-schema schema ?item depth)]

    {:type (m/pred string? ?type)}
    ;;>
    (m/app symbol ?type)

    {:$ref (m/pred string? ?ref)}
    ;;>
    ~(if (> depth 2)
       (symbol ?ref)
       (summarize-schema schema (get schema (keyword ?ref)) (inc depth)))))

AST manipulation

This example comes from Kalai, a transpiler from Clojure to Java and Rust.

(def propagate-types-from-bindings-to-locals
  "We propagate type information which is stored in metadata
  from the the place where they are declared on a symbol
  to all future usages of that symbol in scope."
  (s/rewrite
    {:op   :local
     :form ?symbol
     :env  {:locals {?symbol {:form ?symbol-with-meta
                              :init ?init}}
            :as     ?env}
     &     ?more
     :as   ?ast}
    ;;>
    {:op   :local
     :form ~(propagate-ast-type ?init ?symbol-with-meta ?ast)
     :env  ?env
     &     ?more}

    ;; otherwise leave the ast as is
    ?else
    ?else))

Reading patterns vs functions

Those were some pretty complicated transformations.
Could you follow them?
I bet it was easier than most code.
Is that normal?
I claim that term rewriting expressions are much easier to read than functional transformations.

Next let’s compare what is involved in writing functions vs patterns.

Writing functions vs patterns

Functions are like recipes

Functions are all about what to do with inputs.

Clearly the big difference between functions and rewrite rules is what to do with the inputs. To write a function is to specify the tasks in order to convert the ingredients into the desired meal. A function is a verbal description of how to transform inputs to outputs in written form.

Figure 3. A recipe for making beef stew

Recipe for a function

1. Prepare meat and potatoes

(example inputs)

2. Mix ingredients in a pot

(function definition)
- A little destructuring
- a dash of let binding
- some sequence seasoning for flavor
- a teaspoon of conj, update, and assoc
- a splash of threading

3. Boil

(iterate, make it work)

4. Garnish with documentation

(docstring or types)

5. Simmer

(tests)

Easy to forget why it tastes good
Easy to skip documentation and testing

Patterns are more like before and after pictures

Rewrite rules are only the inputs and outputs; the what to do part is completely missing.

Figure 4. A hearty pot of beef stew

Depicting a pattern

1. Prepare example inputs

[{:k "Meat"} {:k "Potatoes"}]

2. Parameterize

[{:k !ingredient} …​]

3. Recombine as outputs

[!ingredient …​ "Stew"]

4. Create a test

(is (= [] (stew example)))

5. Verify the output

["Meat" "Potatoes" "Stew"]

Examples are the best kind of documentation. Writing Meander feels like writing examples. It gives me confidence I solved the right problem, and it’s easy to see what it does coming back to it later.

I like pictures. I especially like diagrams. So much so that I built an app for making diagrams: Hummi.app and I’ve written an article about why diagrams are important. The thing I like most about Meander is that transformation patterns depict the shape of the data that is coming in and going out.

When should we use term rewriting, when a function?

Term rewriting is suitable for largish data reshaping where you would otherwise do destructuring, updates, restructuring, and pipelines. We have seen some compelling examples, but not every problem is such a clear-cut case of re-shaping.

Why isn’t this the default way of writing code everywhere?

  • Term rewriting systems are hard to implement, and rarely integrated with programming languages.

  • Functions are more familiar in the programming community.

  • Not a silver bullet.

Problems that require aggregation, reduction, and variables tend to be well expressed with a loop or recursive function. Problems that are extract and reconstitute tend to be well expressed as patterns.

Limitations of term rewriting

Syntax

The syntax available in data literals is limited by the host language. Meander syntax is partially extensible through defsyntax to define custom operators. Execution extension is available by function application (see m/app and ~). IDEs and linting tools have trouble error checking pattern expressions because they are an entirely new set of expressions with a new grammar.

Less open to extension

Multi-methods and Protocols allow for the extension of behavior by including a new function and associating it with the output of the dispatch function. Multi-method and Protocol dispatch is somewhat similar to a case statement in that we expect distinct values to be matched. Multi-methods address the possibility of multiple matches by the introduction of hierarchies and preferences: “prefer-method creates an ordering between methods when they would otherwise be ambiguous” (see Clojure - Multi-methods and Hierarchies).

Pattern matching by contrast often defines many overlapping matching patterns, and is thus sensitive to ordering. The order that patterns are defined matters because the first value match will be used. This makes extension more challenging because extensions need to be coordinated.

One possible approach for extending patterns would be to specify them relative to existing clauses. You could add a pattern at the end, add a pattern before an existing pattern, or add a pattern after an existing pattern. This could get complicated if several libraries participated in an extensible dispatch. The order of extension might change the outcome.

Failure to match is opaque

Similar to regular expressions, it can be difficult to figure out why a match isn’t made for a given input. Simplifying and decomposing can help. Function application can be used (m/app #(doto % prn) ?x) to spy on terms.

Reduction

Meander has recursion, so you can reduce. But recursion has no syntactic advantage. The next branch of Meander (zeta) contains a neater way to express reduction.

Memory variable correspondence

Nested memory variables in a single expression can get confusing, and often don’t behave as you’d expect. There is potential for simplification issue#129. The next branch of Meander (zeta) has flexibility in how variables behave, which may solve this.

Performance

On par with hand rolled functions. Meander is a high quality library: fast and reliable.

Decomposition

The current options for creating sub-patterns are m/with and m/app, Which are equivalent to functional decomposition.

Other approaches to data transformation

Here is a short list of other data transformation approaches in Clojure so that we can examine where they differ from Meander:

Clojure.core

Clojure is concise and powerful. Many functions defined on few primary data structures. We operate on data but do not specify what the data is. So it is common to be looking at a function that operates on x, and have no context about what x aught to be. What is x?

Destructuring

Does one job well. No logic expressions or substitution.

Core.match, Core.logic

Independently work great. There is no convenient syntax for substitution.

Clojure.spec

Defines the shape of x as s-expression regexes. Specs do not look like the data they describe.

Specter

Takes a navigator/action approach. Improves data manipulation. Very compact, linguistic.

Each of these approaches have benefits and drawbacks. The purpose of this article was to show where Meander sits relative to these methods, and hopefully inspire you to explore this approach.

Conclusion

The mathematical style term rewriting approach to data transformation has several merits:

  • When reading expressions, inputs and outputs are instantly recognizable

  • When writing expressions, examples can be converted to patterns

  • The declarative style is symmetrically pleasing

  • Visual expression is more effective than verbal expression

Meander provides term rewriting as a convenient library for Clojure. I hope you will give it a try and find it as useful as I have.