Using Haskell at SQream

By Jake Wheat

9.25.2013 twitter linkedin facebook

At SQream, we use Haskell for a large part of our code. We use Haskell for the compiler, which takes SQL statements and turns them into low level instructions for the high performance CUDA runtime. We also use Haskell for rapid prototyping and for many auxiliary utilities.

Why Haskell?

Haskell is an excellent general purpose programming language, and also has particular strengths in compiler implementation, and in rapid development.
In this post I will show some demonstration code based on the SQream compiler. This should give some idea of how easy it is to develop compilers with Haskell.
Let’s say we decide to implement the compiler by first converting SQL to relational algebra, then running a number of rewrites on the relational algebra before converting to the next stages. This will make the task much clearer, since relational algebra is much simpler and more direct than SQL.
We have a relational algebra AST which looks like this (I have only included the constructors which concern us right now).

data QueryExpr = RelVar String
               | Restrict ScalarExpr QueryExpr
               | Select [(String,ScalarExpr)] QueryExpr
               | SummarizeBy [String] [(String,ScalarExpr)] QueryExpr
                 deriving (Show,Data,Typeable)
data ScalarExpr = Iden String
                | App String [ScalarExpr]
                | NumLit Integer
                 deriving (Show,Data,Typeable)

For the QueryExpr type, which represents a ‘relation valued expression’:

  • the RelVar constructor corresponds to a table or view in the database
  • the Restrict constructor corresponds to the where clause in SQL
  • the Select constructor is a combined replacement for Project, Rename and Extend operators from relational algebra (roughly the same as select lists in SQL, but strictly less powerful – for instance, we don’t support group by/aggregates, or correlated subqueries in select unlike SQL select lists)
  • the SummarizeBy constructor implements a more limited version SQL’s group by aggregates. The first list of Strings are the names of the group columns. The second list is the set of aggregates. We will look at some code with SummarizeBy later on in this post.

For the ScalarExpr, which represents a ‘scalar valued expression’:

  • Iden represents an identifier, this is a column reference
  • App represents a function call. The first parameter is the function name, and the list of exprs are the arguments
  • NumLit represents an integral numeric literal

Examples of relational algebra ASTs

Here is a simple SQL query:

select a from t;

This is the equivalent in relational algebra (in the Haskell AST):

Select [("a",Iden "a")] (RelVar "t")

Here is another example of a SQL query:

select a+b as c from t

The relational algebra equivalent:

Select [("c", App "+" [Iden "a", Iden "b"])] (RelVar "t")

An example with where:

select a,b from t where c > 100

A relational algebra equivalent:

Select [("a", Iden "a"), ("b", Iden "b")]
 (Restrict (App ">" [Iden "c", NumLit 100])
  (RelVar "t"))

Transformations

The two transforms I will demonstrate are:

  • combining nested selects into one select
  • a limited version of ‘pushing restricts closer to the storage’.

Consider this relational algebra expression:

Select [("a", App "/" [Iden "a", NumLit 3])]
$ Select [("a", App "+" [Iden "a", NumLit 5])]
$ RelVar "t"

We want to simplify this to the following expression:

Select [("a", App "/" [App "+" [Iden "a", NumLit 5], NumLit 3])]
$ RelVar "t"

It isn’t immediately obvious that this is an optimisation, but I hope you at least agree this is arguably a simpler representation of the same query.
Here is the second example expression to simplify:

Select [("b", Iden "b"), ("c", Iden "c")]
$ Restrict (App ">" [Iden "a", NumLit 3])
$ Select [("c", App "*" [Iden "c", NumLit 5])]
$ RelVar "t"

We want to convert it to this:

Select [("b", Iden "b"), ("c", App "*" [Iden "c", NumLit 5])]
$ Restrict (App ">" [Iden "a", NumLit 3])
$ RelVar "t"

Here is some simple code to do the transform.

transformIdentifiers :: Data a => [(String,ScalarExpr)] -> a -> a
transformIdentifiers env = transformBi $ e -> case e of
    Iden i | Just e' <- lookup i env -> e'
    _ -> e
simplifyTwoSelects :: QueryExpr -> QueryExpr
simplifyTwoSelects = transform $ x -> case x of
    Select sl0 (Select sl1 r) ->
        Select (transformIdentifiers sl1 sl0) r
    _ -> x
moveRestrictInsideSelect :: QueryExpr -> QueryExpr
moveRestrictInsideSelect = transform $ x -> case x of
    Restrict e (Select es r) ->
        Select es (Restrict (transformIdentifiers es e) r)
    _ -> x

I hope we can agree that this code is extremely concise. I think if you are familiar with basic Haskell syntax, it is also extremely readable.

Notes – using Uniplate and Biplate

Two interesting things about this code:

  • the transform/transformBi functions (from a package called Uniplate and Biplate), removes the need to write any code to apply our function recursively over the whole tree
  • We only need to explicitly mention the constructors we are using (just Identifier for the auxiliary function, just Select for the first transform, and Select and Restrict for the second transform). If any other constructors are added or removed from the two data types
  • QueryExpr and ScalarExpr – these functions here will still compile and work the same way.

Example use:

simplifyQueryExpr :: QueryExpr -> QueryExpr
simplifyQueryExpr = simplifyTwoSelects . moveRestrictInsideSelect
test1 :: QueryExpr
test1 =
    simplifyQueryExpr
    (Select [("a", App "/" [Iden "a", NumLit 3])]
     (Select [("a", App "+" [Iden "a", NumLit 5])]
      (RelVar "t")))

The value of test1 is:

Select [("a",App "/" [App "+" [Iden "a",NumLit 5],NumLit 3])] (RelVar "t")
test2 :: QueryExpr
test2 =
    simplifyQueryExpr
    (Select [("b", Iden "b"), ("c", Iden "c")]
     (Restrict (App ">" [Iden "a", NumLit 3])
      (Select [("c", App "*" [Iden "c", NumLit 5])]
       (RelVar "t"))))

The value of test2 is:

Select [("b",Iden "b"),("c",App "*" [Iden "c",NumLit 5])]
    (Restrict (App ">" [Iden "a",NumLit 3]) (RelVar "t"))

In our real compiler, at the equivalent stage to this, we have about 200 transforms, many of which are as simple as this. These transforms all compose really nicely as well.
Now I will try to justify why these are useful transforms, with a little detour.

Representing TPC-H 1

If we consider TPC-H query 1:

select
        l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
        sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(*) as count_order
from
        lineitem
where
        l_shipdate <= dateadd(day,63,'1998-12-01')
group by
        l_returnflag,
        l_linestatus
order by
        l_returnflag,
        l_linestatus;

In this query, we want to desugar the aggregates from something like this (written in relational algebra pseudocode):

summarize by (l_returnflag,
               l_linestatus)
    (sum(l_quantity) as sum_qty,
     sum(l_extendedprice) as sum_base_price,
     sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
     sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
     avg(l_quantity) as avg_qty,
     avg(l_extendedprice) as avg_price,
     avg(l_discount) as avg_disc,
     count as count_order)

to

select (sum_qty,
        sum_base_price,
        sum_disc_price,
        sum_charge,
        sum_qty / count_order as avg_qty,
        sum_base_price / count_order as avg_price,
        sum_disc_price / count_order as avg_disc,
        count_order)
summarize by (l_returnflag,
               l_linestatus)
    (sum(l_quantity) as sum_qty,
     sum(l_extendedprice) as sum_base_price,
     sum(a) as sum_disc_price,
     sum(b) as sum_charge,
     count as count_order)
select l_returnflag,
       l_linestatus
       l_quantity,
       l_extendedprice,
       l_extendedprice*(1-l_discount) as a
       l_extendedprice*(1-l_discount)*(1+l_tax) as b

Here you can see when we apply the desugaring to the summarize operator, we create new selects, which can then possibly be simplified further. Because the transformations compose well, we can have each transformation just focus on its small job, and leave a simpler mess behind that one or more other transformations can take care of. This has a big effect on the complexity of the code and the number of lines of code to write and maintain.
In order to implement this transform, we can split it into three separate passes:

  • for aggregate expressions which are an aggregate with a ‘non-simple’ argument, move the non-simple argument out into a select
  • for avg aggregates, split them into sum and count
  • combine duplicate columns – so we can calculate the minimum number of aggregates. Having this transform separately means that the avg aggregate split transform can be simpler

I will just show the first transform in this post.

simplifyAggregateParameters :: QueryExpr -> QueryExpr
simplifyAggregateParameters = transform $ x -> case x of
    SummarizeBy gs as r

Check that the summarize by has aggregates which need altering, if not, we can avoid some extra work.

        | not (all simple as)

Now create a list of new summarize entries, and a select list for the newly created nested select.

        , (as',sl) <- unzip $ map splitNonSimpleAgg as

Return the transformed QueryExpr.

        -> SummarizeBy gs as' $ Select sl r
    _ -> x
  where

Simple tells us if the function is just an aggregate call on a identifier or constant or something more complex.

    simple (_,App _ [Iden {}]) = True
    simple (_,App _ [NumLit {}]) = True
    simple _ = False

Split non simple agg takes an aggregate expression e.g.

sum(EXPR) as n

and splits it into two like this:

sum(n) as n
EXPR as n

The first case handles simple aggregates which we need if we have a mixture of simple and non-simple aggregates.

    splitNonSimpleAgg a@(n,App _ [Iden i]) = (a, (i,Iden i))

This is the non-simple case:

    splitNonSimpleAgg a@(n,App agg [e]) = ((n,App agg [Iden n]), (n,e))

I hope you can still agree that this code is quite simple.

Some more examples

Here are some examples:

select a,sum(b) from t

This should be unchanged.

test3 :: QueryExpr
test3 = simplifyAggregateParameters $
     SummarizeBy ["a"]
                 [("sum", App "sum" [Iden "b"])]
                 (RelVar "t")

output:

SummarizeBy [("a",Iden "a")] [("sum",App "sum" [Iden "b"])] (Select [("b",Iden "b")] (RelVar "t"))

One of the expressions from TPC-H 1:


select l_returnflag,
       l_linestatus,
       sum(l_extendedprice*(1-l_discount)) as sum_disc_price
from lineitem
group by l_returnflag,
         l_linestatus;
test4 :: QueryExpr
test4 = simplifyAggregateParameters $
     SummarizeBy ["l_returnflag","l_linestatus"]
                 [("sum_disc_price"
                  ,App "sum"
                   [App "*" [Iden "l_extendedprice"
                            ,App "-" [NumLit 1, Iden "l_discount"]]])]
                 (RelVar "lineitem")

Here is the result:


SummarizeBy ["l_returnflag","l_linestatus"]
            [("sum_disc_price",App "sum" [Iden "sum_disc_price"])]
    (Select [("sum_disc_price",App "*" [Iden "l_extendedprice",App "-" [NumLit 1,Iden "l_discount"]])]
     (RelVar "lineitem"))

For the transform which moves restricts, a more complex example might be TPC-H Q5.

select
        n_name,
        sum(l_extendedprice * (1 - l_discount)) as revenue
from
        customer,
        orders,
        lineitem,
        supplier,
        nation,
        region
where
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and l_suppkey = s_suppkey
        and c_nationkey = s_nationkey
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE'
        and o_orderdate >=  '1997-01-01'
        and o_orderdate < dateadd(year,1,'1997-01-01')
group by
        n_name
order by
        revenue desc;

In the where clause, we have a number of join predicates, and some simple restricts which would be more optimally applied before joining the tables together rather than after. If we convert this query straightforwardly to relational algebra, we could get something like this:

order by revenue desc
summarize by (n_name)
    (sum(l_extendedprice * (1 - l_discount)) as revenue)
restrict
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and l_suppkey = s_suppkey
        and c_nationkey = s_nationkey
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE'
        and o_orderdate >=  '1997-01-01'
        and o_orderdate < dateadd(year,1,'1997-01-01')
join
        customer,
        orders,
        lineitem,
        supplier,
        nation,
        region

We want to move the non-join-predicate parts of the restrict to produce something like this:

order by revenue desc
summarize by (n_name)
    (sum(l_extendedprice * (1 - l_discount)) as revenue)
restrict
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and l_suppkey = s_suppkey
        and c_nationkey = s_nationkey
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
join
        customer,
        (restrict (o_orderdate >=  '1997-01-01'
                   and o_orderdate < dateadd(year,1,'1997-01-01')) orders),
        lineitem,
        supplier,
        nation,
        (restrict (r_name = 'EUROPE') region)

The code above for restrict is a simple version of this idea. Unfortunately, the implementation of this transform is a little too complex for this blog post.

Summary

The relative simplicity of the Haskell implementation of these transformations means that we can spend most of our time trying to come up with and describe the transformations we want, and not on the code in the compiler which implements the transformations.
This is the attractiveness of Haskell, and why I think it is worth taking the risk to use a relatively unusual and unproven technology. There are huge amounts of code in our system which I think would be much more difficult to implement and maintain in a more mainstream language (C++, Java or Python for instance).

Want to know more about SQream? Contact us for more information about our software and technology, or view our SQL documentation.