SQream Platform
GPU Powered Data & Analytics Acceleration
Enterprise (Private Deployment) SQL on GPU for Large & Complex Queries
Public Cloud (GCP, AWS) GPU Powered Data Lakehouse
No Code Data Solution for Small & Medium Business
Scale your ML and AI with Production-Sized Models
By Jake Wheat
At SQream, we use Haskell for a large part of our code. We use Haskell for the compiler, which takes SQL statements and turns them into low level instructions for the high performance CUDA runtime. We also use Haskell for rapid prototyping and for many auxiliary utilities.
Haskell is an excellent general purpose programming language, and also has particular strengths in compiler implementation, and in rapid development. In this post I will show some demonstration code based on the SQream compiler. This should give some idea of how easy it is to develop compilers with Haskell. Let’s say we decide to implement the compiler by first converting SQL to relational algebra, then running a number of rewrites on the relational algebra before converting to the next stages. This will make the task much clearer, since relational algebra is much simpler and more direct than SQL. We have a relational algebra AST which looks like this (I have only included the constructors which concern us right now).
data QueryExpr = RelVar String | Restrict ScalarExpr QueryExpr | Select [(String,ScalarExpr)] QueryExpr | SummarizeBy [String] [(String,ScalarExpr)] QueryExpr deriving (Show,Data,Typeable)
data ScalarExpr = Iden String | App String [ScalarExpr] | NumLit Integer deriving (Show,Data,Typeable)
For the QueryExpr type, which represents a ‘relation valued expression’:
For the ScalarExpr, which represents a ‘scalar valued expression’:
Here is a simple SQL query:
select a from t;
This is the equivalent in relational algebra (in the Haskell AST):
Select [("a",Iden "a")] (RelVar "t")
Here is another example of a SQL query:
select a+b as c from t
The relational algebra equivalent:
Select [("c", App "+" [Iden "a", Iden "b"])] (RelVar "t")
An example with where:
select a,b from t where c > 100
A relational algebra equivalent:
Select [("a", Iden "a"), ("b", Iden "b")] (Restrict (App ">" [Iden "c", NumLit 100]) (RelVar "t"))
The two transforms I will demonstrate are:
Consider this relational algebra expression:
Select [("a", App "/" [Iden "a", NumLit 3])] $ Select [("a", App "+" [Iden "a", NumLit 5])] $ RelVar "t"
We want to simplify this to the following expression:
Select [("a", App "/" [App "+" [Iden "a", NumLit 5], NumLit 3])] $ RelVar "t"
It isn’t immediately obvious that this is an optimisation, but I hope you at least agree this is arguably a simpler representation of the same query. Here is the second example expression to simplify:
Select [("b", Iden "b"), ("c", Iden "c")] $ Restrict (App ">" [Iden "a", NumLit 3]) $ Select [("c", App "*" [Iden "c", NumLit 5])] $ RelVar "t"
We want to convert it to this:
Select [("b", Iden "b"), ("c", App "*" [Iden "c", NumLit 5])] $ Restrict (App ">" [Iden "a", NumLit 3]) $ RelVar "t"
Here is some simple code to do the transform.
transformIdentifiers :: Data a => [(String,ScalarExpr)] -> a -> a transformIdentifiers env = transformBi $ e -> case e of Iden i | Just e' <- lookup i env -> e' _ -> e
simplifyTwoSelects :: QueryExpr -> QueryExpr simplifyTwoSelects = transform $ x -> case x of Select sl0 (Select sl1 r) -> Select (transformIdentifiers sl1 sl0) r _ -> x
moveRestrictInsideSelect :: QueryExpr -> QueryExpr moveRestrictInsideSelect = transform $ x -> case x of Restrict e (Select es r) -> Select es (Restrict (transformIdentifiers es e) r) _ -> x
I hope we can agree that this code is extremely concise. I think if you are familiar with basic Haskell syntax, it is also extremely readable.
Two interesting things about this code:
Example use:
simplifyQueryExpr :: QueryExpr -> QueryExpr simplifyQueryExpr = simplifyTwoSelects . moveRestrictInsideSelect
test1 :: QueryExpr test1 = simplifyQueryExpr (Select [("a", App "/" [Iden "a", NumLit 3])] (Select [("a", App "+" [Iden "a", NumLit 5])] (RelVar "t")))
The value of test1 is:
Select [("a",App "/" [App "+" [Iden "a",NumLit 5],NumLit 3])] (RelVar "t")
test2 :: QueryExpr test2 = simplifyQueryExpr (Select [("b", Iden "b"), ("c", Iden "c")] (Restrict (App ">" [Iden "a", NumLit 3]) (Select [("c", App "*" [Iden "c", NumLit 5])] (RelVar "t"))))
The value of test2 is:
Select [("b",Iden "b"),("c",App "*" [Iden "c",NumLit 5])] (Restrict (App ">" [Iden "a",NumLit 3]) (RelVar "t"))
In our real compiler, at the equivalent stage to this, we have about 200 transforms, many of which are as simple as this. These transforms all compose really nicely as well. Now I will try to justify why these are useful transforms, with a little detour.
If we consider TPC-H query 1:
select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice*(1-l_discount)) as sum_disc_price, sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate <= dateadd(day,63,'1998-12-01') group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus;
In this query, we want to desugar the aggregates from something like this (written in relational algebra pseudocode):
summarize by (l_returnflag, l_linestatus) (sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice*(1-l_discount)) as sum_disc_price, sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count as count_order)
to
select (sum_qty, sum_base_price, sum_disc_price, sum_charge, sum_qty / count_order as avg_qty, sum_base_price / count_order as avg_price, sum_disc_price / count_order as avg_disc, count_order) summarize by (l_returnflag, l_linestatus) (sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(a) as sum_disc_price, sum(b) as sum_charge, count as count_order) select l_returnflag, l_linestatus l_quantity, l_extendedprice, l_extendedprice*(1-l_discount) as a l_extendedprice*(1-l_discount)*(1+l_tax) as b
Here you can see when we apply the desugaring to the summarize operator, we create new selects, which can then possibly be simplified further. Because the transformations compose well, we can have each transformation just focus on its small job, and leave a simpler mess behind that one or more other transformations can take care of. This has a big effect on the complexity of the code and the number of lines of code to write and maintain. In order to implement this transform, we can split it into three separate passes:
I will just show the first transform in this post.
simplifyAggregateParameters :: QueryExpr -> QueryExpr simplifyAggregateParameters = transform $ x -> case x of SummarizeBy gs as r
Check that the summarize by has aggregates which need altering, if not, we can avoid some extra work.
| not (all simple as)
Now create a list of new summarize entries, and a select list for the newly created nested select.
, (as',sl) <- unzip $ map splitNonSimpleAgg as
Return the transformed QueryExpr.
-> SummarizeBy gs as' $ Select sl r _ -> x where
Simple tells us if the function is just an aggregate call on a identifier or constant or something more complex.
simple (_,App _ [Iden {}]) = True simple (_,App _ [NumLit {}]) = True simple _ = False
Split non simple agg takes an aggregate expression e.g.
sum(EXPR) as n
and splits it into two like this:
sum(n) as n
EXPR as n
The first case handles simple aggregates which we need if we have a mixture of simple and non-simple aggregates.
splitNonSimpleAgg a@(n,App _ [Iden i]) = (a, (i,Iden i))
This is the non-simple case:
splitNonSimpleAgg a@(n,App agg [e]) = ((n,App agg [Iden n]), (n,e))
I hope you can still agree that this code is quite simple.
Here are some examples:
select a,sum(b) from t
This should be unchanged.
test3 :: QueryExpr test3 = simplifyAggregateParameters $ SummarizeBy ["a"] [("sum", App "sum" [Iden "b"])] (RelVar "t")
output:
SummarizeBy [("a",Iden "a")] [("sum",App "sum" [Iden "b"])] (Select [("b",Iden "b")] (RelVar "t"))
One of the expressions from TPC-H 1:
select l_returnflag, l_linestatus, sum(l_extendedprice*(1-l_discount)) as sum_disc_price from lineitem group by l_returnflag, l_linestatus;
test4 :: QueryExpr test4 = simplifyAggregateParameters $ SummarizeBy ["l_returnflag","l_linestatus"] [("sum_disc_price" ,App "sum" [App "*" [Iden "l_extendedprice" ,App "-" [NumLit 1, Iden "l_discount"]]])] (RelVar "lineitem")
Here is the result:
SummarizeBy ["l_returnflag","l_linestatus"] [("sum_disc_price",App "sum" [Iden "sum_disc_price"])] (Select [("sum_disc_price",App "*" [Iden "l_extendedprice",App "-" [NumLit 1,Iden "l_discount"]])] (RelVar "lineitem"))
For the transform which moves restricts, a more complex example might be TPC-H Q5.
select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue from customer, orders, lineitem, supplier, nation, region where c_custkey = o_custkey and l_orderkey = o_orderkey and l_suppkey = s_suppkey and c_nationkey = s_nationkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE' and o_orderdate >= '1997-01-01' and o_orderdate < dateadd(year,1,'1997-01-01') group by n_name order by revenue desc;
In the where clause, we have a number of join predicates, and some simple restricts which would be more optimally applied before joining the tables together rather than after. If we convert this query straightforwardly to relational algebra, we could get something like this:
order by revenue desc summarize by (n_name) (sum(l_extendedprice * (1 - l_discount)) as revenue) restrict c_custkey = o_custkey and l_orderkey = o_orderkey and l_suppkey = s_suppkey and c_nationkey = s_nationkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE' and o_orderdate >= '1997-01-01' and o_orderdate < dateadd(year,1,'1997-01-01') join customer, orders, lineitem, supplier, nation, region
We want to move the non-join-predicate parts of the restrict to produce something like this:
order by revenue desc summarize by (n_name) (sum(l_extendedprice * (1 - l_discount)) as revenue) restrict c_custkey = o_custkey and l_orderkey = o_orderkey and l_suppkey = s_suppkey and c_nationkey = s_nationkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey join customer, (restrict (o_orderdate >= '1997-01-01' and o_orderdate < dateadd(year,1,'1997-01-01')) orders), lineitem, supplier, nation, (restrict (r_name = 'EUROPE') region)
The code above for restrict is a simple version of this idea. Unfortunately, the implementation of this transform is a little too complex for this blog post.
The relative simplicity of the Haskell implementation of these transformations means that we can spend most of our time trying to come up with and describe the transformations we want, and not on the code in the compiler which implements the transformations. This is the attractiveness of Haskell, and why I think it is worth taking the risk to use a relatively unusual and unproven technology. There are huge amounts of code in our system which I think would be much more difficult to implement and maintain in a more mainstream language (C++, Java or Python for instance).
Want to know more about SQream? Contact us for more information about our software and technology, or view our SQL documentation.