Last week I wrote about solving the WordCount problem in Morel. The solution used a user-defined function, `split`, unnested the resulting array of words into a relation, and then used the `group` operator to count the occurrences of each word.

In this post, I want to describe in detail how that `group` operator works, and the strategy that went into it.

I think that what we came up with is elegant, powerful and concise, and I hope that you will agree. But getting the right design wasn’t straightforward. To see why, we’ll take a quick tour through the history of aggregate functions in databases and functional programming languages.

# Aggregate functions and relational algebra

When Tedd Codd introduced the relational algebra in 1970, there was no support for aggregation or aggregate functions. The operators were project, select, join and semijoin (called ‘restriction’ in Codd’s paper).

It was not until 1982 that Anthony Klug added aggregates to relational algebra. He remarked

Previous treatments of aggregate functions in relational languages have not been general and have not been well defined. Two examples are System R and INGRES. The formulations of aggregate functions in these systems do not apply to more general languages, for example, to languages having explicit quantifiers. Their definitions of aggregate functions also rely unnecessarily on “sets” of tuples having duplicate members (a contradiction).

Aggregation does not fit easily into relational algebra because relational algebra works in terms of sets. If you project the `deptno` column from the `Emp` relation, you don’t get 5 records for department 10 (one for each employee), you get just one. If you try to compute the average salary for employees in each department, the most natural formulation would compute the set of salary values in department 10, and therefore if two employees have the same salary value, they would be counted only once.

The commercial systems Klug referred to, System R and INGRES, had fewer problems because they were based on multisets (bags) rather than sets. Still, they arrived at an uneasy truce. SQL’s `GROUP BY` works by sleight of hand, simultaneously performing a project (of the key columns) and aggregation. The bag of values being passed to the aggregate function exists only fleetingly; if you’re squeamish, it’s best to look away.

Shoe-horning aggregate behavior into the existing `SELECT` expression was another compromise, and it left its mark on SQL’s semantics. Consider the following statement:

``````SELECT deptno, age, SUM(salary), MIN(age)
FROM Emp
GROUP BY deptno
``````

There are two references to the `age` column. The first reference is illegal (because `age` is not a `GROUP BY` key), but the second reference (inside the `MIN` function) is legal. The semantic context, determining what columns are available, is very different if you are inside or outside an aggregate function, and both of those contexts exist in just the first line of that query.

SQL had to invent a new clause, `HAVING`, that does the same as `WHERE` but in the post-aggregation context. (As we shall see, Morel does not have that problem. The semantic rules after a `group` are the same as before it, so you can intermix `group` and `where` any way you like.)

Aggregate functions create other semantic problems. Assuming that the `Emp` table contains 100 rows, how many rows does the following statement return?

``````SELECT foo(age)
FROM Emp
``````

Unless you know whether `foo` is an aggregate function, it’s impossible to say. If `foo` is an aggregate function, the query returns 1 row (implicitly totaling all `Emp` records), but if `foo` is an ordinary scalar function the query returns 100 rows.

The underlying problem is that aggregate functions in SQL have the same syntax as scalar functions, but very different semantics. This makes the language fragile, and makes a query difficult to understand unless you are familiar with all of the library functions that it is using.

# Nested collections

Having incorporated aggregate functions into relational algebra, by the early 1990s, the database research community was turning its attention to another sacred cow: nested collections, formally known as Non First Normal Form (NF2) relations.

Nested collections are a major extension to the relational model (though purists would say a corruption) that allow rich representations of data, but we are interested in them here because they allow a new approach to aggregation. With nested collections, you can aggregate in two steps: first create a set (or multiset) of rows that share the same key, then apply a function to collapse those rows to a single value.

As we saw in our discussion of WordCount, one of the languages that embraced nested collections is Apache Pig.

Consider the following aggregate query in Pig Latin:

``````emps = LOAD '/data/emps' using PigStorage()
as (empno: int, name: chararray, deptno: int, salary: int);
by_deptno = GROUP emps BY deptno;
dept_stats = FOREACH by_deptno
GENERATE group as deptno, COUNT(emps), AVG(emps.salary);
``````

The `by_deptno` relation is the intermediate step, after `GROUP` has created a collection for each `deptno` value, and before `GENERATE` has applied `COUNT` and `AVG` aggregate functions to collapse the collections into scalar values. `by_deptno` looks like this:

group emps
10 {[100, ‘Shaggy’, 10, 1500], [120, ‘Scooby’, 10, 1250]}
20 {[110, ‘Velma’, 20, 2000]}
30 {[130, ‘Fred’, 30, 1700]}, {[140, ‘Daphne’, 30, 1700]}

The two-stage process makes it easy to write powerful aggregate functions. For example, to implement `MEDIAN`, you need to look at all of the values, sort them, and take the value that occurs in the middle.

But many other aggregate functions, including `SUM`, `COUNT`, `MIN` and `MAX`, can be computed iteratively, adding each row to a small accumulator; materializing the whole multiset is a waste of memory and effort.

The accumulator approach is how aggregate functions are usually implemented in functional programming languages.

# Functional languages: foldl

In Standard ML, the language from which Morel is derived, the `List` structure has a higher-order function `foldl` (meaning ‘fold left’). `foldl` starts with an initial accumulator value, then uses a “combiner” function supplied by the caller to combine the initial value with the first element of the list to form a new accumulator value, then combines that accumulator value with the second element of the list, and so on.

(There is also a `foldr` function that starts at the end of the list and works forward.)

Thus `sum` is defined as follows:

``````- val sum = foldl (fn (x, y) => x + y) 0;
val sum = fn : int list -> int
``````

Here is `sum` applied to a list:

``````- sum [1, 2, 3, 5, 8, 13];
val it = 32 : int
``````

# Choices, choices…

To recap, we have seen how grouping and aggregation are implemented in three languages: SQL, Pig and Standard ML. SQL calculates aggregates as it groups, whereas Pig forms collections first. Pig’s aggregate functions work on entire collections, whereas ML’s `foldl` higher-order function reduces collections by incrementally adding to an accumulator.

In Morel, we had to choose whether our `group` operation would generate lists (like Pig) that would be reduced in a subsequent step, or whether it would apply aggregate functions. And if it applied aggregate functions, would these be defined using an accumulator (like `foldl`) or would we allow a more general mechanism?

Morel’s aggregate functions proceed in three steps, all of which occur within the `group` clause:

• First, Morel applies a key extractor expression to each row, and gathers rows together by key value;
• Second, for all of the rows in a group, Morel applies an argument extractor expression to get the argument to which the aggregate function will be applied, and collects the argument values into a list;
• Last, Morel applies the aggregate function to the list, emitting the result as a field in the output record.

In functional programming terms, it’s difficult to express this concisely. If `group` were a higher-order function, its signature would look something like this:

``````- val group: ('r -> 'k) -> ('r -> 'a) -> ('a list -> 'b) -> ('k * 'b) list
``````

where:

• `'r` is the input row
• `'k` is the key
• `'r -> 'k` is the key extractor function
• `'a` is the type of the argument to the aggregate function
• `'r -> 'a` is the argument extractor function
• `'b` is the result of the aggregate function
• `'a list -> 'b` is the aggregate function that converts a list of arguments to a result
• `('k * 'b) list` is the output, a list of (key, result) pairs

Yes, this is complicated! And this variant only allows one aggregate function.

But Morel provides syntactic sugar so that `group` is simple to use in practice, while retaining strongly typing and type inference. For example:

``````- from e in emps
group e.deptno compute sumSal = sum of e.sal;
val it =
[{deptno=20,sumSal=10875.0},{deptno=10,sumSal=8750.0},
{deptno=30,sumSal=9400.0}] : {deptno:int, sumSal:real} list
``````

(Examples in this post use the new `=` syntax for renaming group keys and aggregate functions that was introduced in [MOREL-24] but has not yet been released, rather than the old `as` syntax used in morel-0.2, and also assume that `sum` and `+` have overloads for both `int` and `real`, introduced in [MOREL-28] and [MOREL-29].)

The key extractor (`e.deptno`) and argument extractor (`e.sal`) are not functions but expressions that are evaluated in the environment of the current row (the variable `e` holds current member of the `emps` relation). Expressions are as powerful as functions but are more concise.

The aggregate function is a function – or, more precisely, an expression that yields a function. In this case, we use the constant `sum`, which is a built-in function value of type `int list -> int`.

Here is another example:

``````- from e in emps,
d in depts
where e.deptno = d.deptno
group e.deptno, d.dname, e.job
compute sumSal = sum of e.sal,
minRemuneration = min of e.sal + e.commission;
val it =
[{deptno=30,dname="SALES",job="MANAGER",minRemuneration=2850.0,sumSal=2850.0},
{deptno=20,dname="RESEARCH",job="CLERK",minRemuneration=800.0,sumSal=1900.0},
{deptno=10,dname="ACCOUNTING",job="PRESIDENT",minRemuneration=5000.0,sumSal=5000.0},
{deptno=10,dname="ACCOUNTING",job="MANAGER",minRemuneration=2450.0,sumSal=2450.0},
{deptno=20,dname="RESEARCH",job="ANALYST",minRemuneration=3000.0,sumSal=6000.0},
{deptno=30,dname="SALES",job="CLERK",minRemuneration=950.0,sumSal=950.0},
{deptno=10,dname="ACCOUNTING",job="CLERK",minRemuneration=1300.0,sumSal=1300.0},
{deptno=20,dname="RESEARCH",job="MANAGER",minRemuneration=2975.0,sumSal=2975.0},
{deptno=30,dname="SALES",job="SALESMAN",minRemuneration=1500.0,sumSal=5600.0}]
: {deptno:int, dname:string, job:string, minRemuneration:real, sumSal:real} list
``````

Several things are more advanced than the previous example. The key is composite, there is more than one aggregate function, and the argument to the aggregate function may be a complex expression. Also, the input is a join, so there are two variables (`e` and `d`) available for use in expressions.

The output format is more straightforward than Pig. Pig uses a field called `group` for keys, which is a record if the key is composite. Morel just uses the input field names (and allows you to rename fields using `=`). In this case, Pig’s output record is `{group = {deptno, dname}, sumSal, minRemuneration}`, and Morel’s is `{deptno, dname, job, sumSal, minRemuneration}`.

Pig’s intermediate format (after `GROUP` and before `FOREACH`) would have a list-valued field called `emps`, whereas Morel’s intermediate list is seen only by the argument extractor, and does not need to be named for the user’s benefit.

The output of `group` is simply an iteration context with a number of variables available (`deptno`, `dname`, `job`, `sumSal`, `minRemuneration`). That’s what you’d expect whenever you are inside a `from` expression. You can therefore follow `group` with any clause allowable in `from``where`, `order` or `group` – or terminate the `from` expression with a `yield` clause.

# User-defined aggregate functions

Morel’s goal is to be a simple, concise query language which allows you to escape into a Turing-complete programming language when you need to.

So of course you can define your own aggregate functions inside a query. In this example, we define our own version of the `sum` function:

``````- let
fun my_sum [] = 0
in
from e in emps
group e.deptno
compute sumEmpno = my_sum of e.empno
end;
val it =
[{deptno=20,sumEmpno=38501},{deptno=10,sumEmpno=23555},
{deptno=30,sumEmpno=46116}] : {deptno:int, sumEmpno:int} list
``````

Aggregate functions are invoked on a collection of values formed by applying their argument expression to all of the records in the current group. SQL’s COLLECT aggregate function, which creates a collection of its arguments, is therefore trivial in Morel: we just use the identity operator (```fn x => x```) as the aggregate function, and it returns its argument:

``````- from e in emps
group e.deptno
compute names = (fn x => x) of e.ename;
val it =
{deptno=10,names=["CLARK","KING","MILLER"]},
{deptno=30,names=["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]}]
: {deptno:int, names:string list} list
``````

# Computing aggregate functions efficiently

Not all aggregate functions need to operate on the full list of their arguments. It might seem like overkill to form the list of arguments when not all functions need it.

But in designing Morel, we favor expressive power and concise, readable syntax over efficiency. We think we can achieve efficiency (in most cases) by recognizing expressions that can be rewritten to something simpler.

In the case of aggregate functions, many functions have algebraic properties that allow them to be computed more efficiently. For example, you can compute `sum` by dividing the rows into any subsets you like, summing those subsets, and summing those sums, in any order you like. For instance, `sum [1, 2, 3, 5, 8, 13]` is the same as ```sum [1, 3, 5, 13] + sum [2, 8]```, and therefore partitioning the input into odd and even partitions would be a viable strategy.

In mathematical terms, `sum` is a commutative monoid. In computational terms, that means that it is very easy to parallelize.

Morel will, at some point, allow you to declare the algebraic properties of built-in and user-defined aggregate functions (such as whether they are monoids). Then it will be able to choose more efficient plans.

# Summary

Morel’s `group` operator is elegant and powerful. The syntax is simple and concise when used with built-in aggregate functions, but you can easily write user-defined aggregate functions.

Aggregate functions behave as if they are acting on collections, but in practice they can frequently be computed more efficiently, using accumulators or by composing sub-totals.

Unlike the complicated semantics of aggregation in SQL, `group` composes easily with other Morel relational operators such as `where` and `order`.