Aggregate queries in Morel
Last week I wrote about
solving the WordCount problem in Morel.
The solution used a user-defined function, split
,
unnested the resulting array of words into a relation, and then used
the group
operator to count the occurrences of each word.
In this post, I want to describe in detail how that group
operator works, and the strategy that went into it.
I think that what we came up with is elegant, powerful and concise, and I hope that you will agree. But getting the right design wasn’t straightforward. To see why, we’ll take a quick tour through the history of aggregate functions in databases and functional programming languages.
Aggregate functions and relational algebra
When Tedd Codd introduced the relational algebra in 1970, there was no support for aggregation or aggregate functions. The operators were project, select, join and semijoin (called ‘restriction’ in Codd’s paper).
It was not until 1982 that Anthony Klug added aggregates to relational algebra. He remarked
Previous treatments of aggregate functions in relational languages have not been general and have not been well defined. Two examples are System R and INGRES. The formulations of aggregate functions in these systems do not apply to more general languages, for example, to languages having explicit quantifiers. Their definitions of aggregate functions also rely unnecessarily on “sets” of tuples having duplicate members (a contradiction).
Aggregation does not fit easily into relational algebra because
relational algebra works in terms of sets. If you project the deptno
column from the Emp
relation, you don’t get 5 records for
department 10 (one for each employee), you get just one. If you try to
compute the average salary for employees in each department, the most
natural formulation would compute the set of salary values in
department 10, and therefore if two employees have the same salary
value, they would be counted only once.
The commercial systems Klug referred to, System R and INGRES, had
fewer problems because they were based on multisets (bags) rather than
sets. Still, they arrived at an uneasy truce. SQL’s GROUP BY
works
by sleight of hand, simultaneously performing a project (of the key
columns) and aggregation. The bag of values being passed to the
aggregate function exists only fleetingly; if you’re squeamish, it’s
best to look away.
Shoe-horning aggregate behavior into the existing SELECT
expression
was another compromise, and it left its mark on SQL’s
semantics. Consider the following statement:
SELECT deptno, age, SUM(salary), MIN(age)
FROM Emp
GROUP BY deptno
There are two references to the age
column. The first reference is
illegal (because age
is not a GROUP BY
key), but the second
reference (inside the MIN
function) is legal. The semantic context,
determining what columns are available, is very different if you are
inside or outside an aggregate function, and both of those contexts
exist in just the first line of that query.
SQL had to invent a new clause, HAVING
, that does the same as
WHERE
but in the post-aggregation context. (As we shall see, Morel
does not have that problem. The semantic rules after a group
are the
same as before it, so you can intermix group
and where
any way you like.)
Aggregate functions create other semantic problems. Assuming that the
Emp
table contains 100 rows, how many rows does the following
statement return?
SELECT foo(age)
FROM Emp
Unless you know whether foo
is an aggregate function, it’s
impossible to say. If foo
is an aggregate function, the query
returns 1 row (implicitly totaling all Emp
records), but if foo
is
an ordinary scalar function the query returns 100 rows.
The underlying problem is that aggregate functions in SQL have the same syntax as scalar functions, but very different semantics. This makes the language fragile, and makes a query difficult to understand unless you are familiar with all of the library functions that it is using.
Nested collections
Having incorporated aggregate functions into relational algebra, by the early 1990s, the database research community was turning its attention to another sacred cow: nested collections, formally known as Non First Normal Form (NF2) relations.
Nested collections are a major extension to the relational model (though purists would say a corruption) that allow rich representations of data, but we are interested in them here because they allow a new approach to aggregation. With nested collections, you can aggregate in two steps: first create a set (or multiset) of rows that share the same key, then apply a function to collapse those rows to a single value.
As we saw in our discussion of WordCount, one of the languages that embraced nested collections is Apache Pig.
Consider the following aggregate query in Pig Latin:
emps = LOAD '/data/emps' using PigStorage()
as (empno: int, name: chararray, deptno: int, salary: int);
by_deptno = GROUP emps BY deptno;
dept_stats = FOREACH by_deptno
GENERATE group as deptno, COUNT(emps), AVG(emps.salary);
The by_deptno
relation is the intermediate step, after GROUP
has
created a collection for each deptno
value, and before GENERATE
has applied COUNT
and AVG
aggregate functions to collapse the
collections into scalar values. by_deptno
looks like this:
group | emps |
---|---|
10 | {[100, ‘Shaggy’, 10, 1500], [120, ‘Scooby’, 10, 1250]} |
20 | {[110, ‘Velma’, 20, 2000]} |
30 | {[130, ‘Fred’, 30, 1700]}, {[140, ‘Daphne’, 30, 1700]} |
The two-stage process makes it easy to write powerful aggregate
functions. For example, to implement MEDIAN
, you need to look at all
of the values, sort them, and take the value that occurs in the
middle.
But many other aggregate functions, including SUM
, COUNT
, MIN
and MAX
, can be computed iteratively, adding each row to a small
accumulator; materializing the whole multiset is a waste of memory and
effort.
The accumulator approach is how aggregate functions are usually implemented in functional programming languages.
Functional languages: foldl
In Standard ML, the
language from which Morel is derived,
the List
structure has a
higher-order function foldl
(meaning ‘fold left’). foldl
starts
with an initial accumulator value, then uses a “combiner” function
supplied by the caller to combine the initial value with the first
element of the list to form a new accumulator value, then combines
that accumulator value with the second element of the list, and so on.
(There is also a foldr
function that starts at the end of the list
and works forward.)
Thus sum
is defined as follows:
- val sum = foldl (fn (x, y) => x + y) 0;
val sum = fn : int list -> int
Here is sum
applied to a list:
- sum [1, 2, 3, 5, 8, 13];
val it = 32 : int
Choices, choices…
To recap, we have seen how grouping and aggregation are implemented in
three languages: SQL, Pig and Standard ML. SQL calculates aggregates
as it groups, whereas Pig forms collections first. Pig’s aggregate
functions work on entire collections, whereas ML’s foldl
higher-order function reduces collections by incrementally adding to
an accumulator.
In Morel, we had to choose whether our group
operation would
generate lists (like Pig) that would be reduced in a subsequent step,
or whether it would apply aggregate functions. And if it applied
aggregate functions, would these be defined using an accumulator (like
foldl
) or would we allow a more general mechanism?
Morel’s aggregate functions proceed in three steps, all of which occur
within the group
clause:
- First, Morel applies a key extractor expression to each row, and gathers rows together by key value;
- Second, for all of the rows in a group, Morel applies an argument extractor expression to get the argument to which the aggregate function will be applied, and collects the argument values into a list;
- Last, Morel applies the aggregate function to the list, emitting the result as a field in the output record.
In functional programming terms, it’s difficult to express this
concisely. If group
were a higher-order function, its signature
would look something like this:
- val group: ('r -> 'k) -> ('r -> 'a) -> ('a list -> 'b) -> ('k * 'b) list
where:
'r
is the input row'k
is the key'r -> 'k
is the key extractor function'a
is the type of the argument to the aggregate function'r -> 'a
is the argument extractor function'b
is the result of the aggregate function'a list -> 'b
is the aggregate function that converts a list of arguments to a result('k * 'b) list
is the output, a list of (key, result) pairs
Yes, this is complicated! And this variant only allows one aggregate function.
But Morel provides syntactic sugar so that group
is simple to use in
practice, while retaining strongly typing and type inference. For example:
- from e in emps
group e.deptno compute sumSal = sum of e.sal;
val it =
[{deptno=20,sumSal=10875.0},{deptno=10,sumSal=8750.0},
{deptno=30,sumSal=9400.0}] : {deptno:int, sumSal:real} list
(Examples in this post use the new =
syntax for renaming group keys
and aggregate functions that was introduced in
[MOREL-24] but has
not yet been released, rather than the old as
syntax used in
morel-0.2, and also assume that sum
and +
have overloads for both
int
and real
, introduced in
[MOREL-28] and
[MOREL-29].)
The key extractor (e.deptno
) and argument extractor (e.sal
) are
not functions but expressions that are evaluated in the environment of
the current row (the variable e
holds current member of the emps
relation). Expressions are as powerful as functions but are more
concise.
The aggregate function is a function – or, more precisely, an
expression that yields a function. In this case, we use the constant
sum
, which is a built-in function value of type int list -> int
.
Here is another example:
- from e in emps,
d in depts
where e.deptno = d.deptno
group e.deptno, d.dname, e.job
compute sumSal = sum of e.sal,
minRemuneration = min of e.sal + e.commission;
val it =
[{deptno=30,dname="SALES",job="MANAGER",minRemuneration=2850.0,sumSal=2850.0},
{deptno=20,dname="RESEARCH",job="CLERK",minRemuneration=800.0,sumSal=1900.0},
{deptno=10,dname="ACCOUNTING",job="PRESIDENT",minRemuneration=5000.0,sumSal=5000.0},
{deptno=10,dname="ACCOUNTING",job="MANAGER",minRemuneration=2450.0,sumSal=2450.0},
{deptno=20,dname="RESEARCH",job="ANALYST",minRemuneration=3000.0,sumSal=6000.0},
{deptno=30,dname="SALES",job="CLERK",minRemuneration=950.0,sumSal=950.0},
{deptno=10,dname="ACCOUNTING",job="CLERK",minRemuneration=1300.0,sumSal=1300.0},
{deptno=20,dname="RESEARCH",job="MANAGER",minRemuneration=2975.0,sumSal=2975.0},
{deptno=30,dname="SALES",job="SALESMAN",minRemuneration=1500.0,sumSal=5600.0}]
: {deptno:int, dname:string, job:string, minRemuneration:real, sumSal:real} list
Several things are more advanced than the previous example. The key is
composite, there is more than one aggregate function, and the argument
to the aggregate function may be a complex expression. Also, the input
is a join, so there are two variables (e
and d
) available for use
in expressions.
The output format is more straightforward than Pig. Pig uses a field
called group
for keys, which is a record if the key is
composite. Morel just uses the input field names (and allows you to
rename fields using =
). In this case, Pig’s output record is
{group = {deptno, dname}, sumSal, minRemuneration}
, and Morel’s is
{deptno, dname, job, sumSal, minRemuneration}
.
Pig’s intermediate format (after GROUP
and before FOREACH
) would
have a list-valued field called emps
, whereas Morel’s intermediate
list is seen only by the argument extractor, and does not need to be
named for the user’s benefit.
The output of group
is simply an iteration context with a number of
variables available (deptno
, dname
, job
, sumSal
,
minRemuneration
). That’s what you’d expect whenever you are inside a
from
expression. You can therefore follow group
with any clause
allowable in from
– where
, order
or group
– or terminate the
from
expression with a yield
clause.
User-defined aggregate functions
Morel’s goal is to be a simple, concise query language which allows you to escape into a Turing-complete programming language when you need to.
So of course you can define your own aggregate functions inside a
query. In this example, we define our own version of the sum
function:
- let
fun my_sum [] = 0
| my_sum (head :: tail) = head + (my_sum tail)
in
from e in emps
group e.deptno
compute sumEmpno = my_sum of e.empno
end;
val it =
[{deptno=20,sumEmpno=38501},{deptno=10,sumEmpno=23555},
{deptno=30,sumEmpno=46116}] : {deptno:int, sumEmpno:int} list
Aggregate functions are invoked on a collection of values formed by
applying their argument expression to all of the records in the
current group. SQL’s
COLLECT
aggregate function, which creates a collection of its arguments, is
therefore trivial in Morel: we just use the identity operator (fn x
=> x
) as the aggregate function, and it returns its argument:
- from e in emps
group e.deptno
compute names = (fn x => x) of e.ename;
val it =
[{deptno=20,names=["SMITH","JONES","SCOTT","ADAMS","FORD"]},
{deptno=10,names=["CLARK","KING","MILLER"]},
{deptno=30,names=["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]}]
: {deptno:int, names:string list} list
Computing aggregate functions efficiently
Not all aggregate functions need to operate on the full list of their arguments. It might seem like overkill to form the list of arguments when not all functions need it.
But in designing Morel, we favor expressive power and concise, readable syntax over efficiency. We think we can achieve efficiency (in most cases) by recognizing expressions that can be rewritten to something simpler.
In the case of aggregate functions, many functions have algebraic
properties that allow them to be computed more efficiently. For
example, you can compute sum
by dividing the rows into any subsets
you like, summing those subsets, and summing those sums, in any order
you like. For instance, sum [1, 2, 3, 5, 8, 13]
is the same as sum
[1, 3, 5, 13] + sum [2, 8]
, and therefore partitioning the input into
odd and even partitions would be a viable strategy.
In mathematical terms, sum
is a
commutative monoid.
In computational terms, that means that it is very easy to parallelize.
Morel will, at some point, allow you to declare the algebraic properties of built-in and user-defined aggregate functions (such as whether they are monoids). Then it will be able to choose more efficient plans.
Summary
Morel’s group
operator is elegant and powerful. The syntax is simple
and concise when used with built-in aggregate functions, but you can
easily write user-defined aggregate functions.
Aggregate functions behave as if they are acting on collections, but in practice they can frequently be computed more efficiently, using accumulators or by composing sub-totals.
Unlike the complicated semantics of aggregation in SQL, group
composes easily with other Morel relational operators such as where
and order
.
If you have comments, please reply on Twitter:
Next, let's add GROUP BY to Morel. Should be pretty straightforward, right? https://t.co/CKV9n73otd
— Julian Hyde (@julianhyde) April 9, 2020
This article has been updated.