Last week I wrote about solving the WordCount problem in Morel. The solution used a user-defined function, split, unnested the resulting array of words into a relation, and then used the group operator to count the occurrences of each word.

In this post, I want to describe in detail how that group operator works, and the strategy that went into it.

I think that what we came up with is elegant, powerful and concise, and I hope that you will agree. But getting the right design wasn’t straightforward. To see why, we’ll take a quick tour through the history of aggregate functions in databases and functional programming languages.

Aggregate functions and relational algebra

When Tedd Codd introduced the relational algebra in 1970, there was no support for aggregation or aggregate functions. The operators were project, select, join and semijoin (called ‘restriction’ in Codd’s paper).

It was not until 1982 that Anthony Klug added aggregates to relational algebra. He remarked

Previous treatments of aggregate functions in relational languages have not been general and have not been well defined. Two examples are System R and INGRES. The formulations of aggregate functions in these systems do not apply to more general languages, for example, to languages having explicit quantifiers. Their definitions of aggregate functions also rely unnecessarily on “sets” of tuples having duplicate members (a contradiction).

Aggregation does not fit easily into relational algebra because relational algebra works in terms of sets. If you project the deptno column from the Emp relation, you don’t get 5 records for department 10 (one for each employee), you get just one. If you try to compute the average salary for employees in each department, the most natural formulation would compute the set of salary values in department 10, and therefore if two employees have the same salary value, they would be counted only once.

The commercial systems Klug referred to, System R and INGRES, had fewer problems because they were based on multisets (bags) rather than sets. Still, they arrived at an uneasy truce. SQL’s GROUP BY works by sleight of hand, simultaneously performing a project (of the key columns) and aggregation. The bag of values being passed to the aggregate function exists only fleetingly; if you’re squeamish, it’s best to look away.

Shoe-horning aggregate behavior into the existing SELECT expression was another compromise, and it left its mark on SQL’s semantics. Consider the following statement:

SELECT deptno, age, SUM(salary), MIN(age)
FROM Emp
GROUP BY deptno

There are two references to the age column. The first reference is illegal (because age is not a GROUP BY key), but the second reference (inside the MIN function) is legal. The semantic context, determining what columns are available, is very different if you are inside or outside an aggregate function, and both of those contexts exist in just the first line of that query.

SQL had to invent a new clause, HAVING, that does the same as WHERE but in the post-aggregation context. (As we shall see, Morel does not have that problem. The semantic rules after a group are the same as before it, so you can intermix group and where any way you like.)

Aggregate functions create other semantic problems. Assuming that the Emp table contains 100 rows, how many rows does the following statement return?

SELECT foo(age)
FROM Emp

Unless you know whether foo is an aggregate function, it’s impossible to say. If foo is an aggregate function, the query returns 1 row (implicitly totaling all Emp records), but if foo is an ordinary scalar function the query returns 100 rows.

The underlying problem is that aggregate functions in SQL have the same syntax as scalar functions, but very different semantics. This makes the language fragile, and makes a query difficult to understand unless you are familiar with all of the library functions that it is using.

Nested collections

Having incorporated aggregate functions into relational algebra, by the early 1990s, the database research community was turning its attention to another sacred cow: nested collections, formally known as Non First Normal Form (NF2) relations.

Nested collections are a major extension to the relational model (though purists would say a corruption) that allow rich representations of data, but we are interested in them here because they allow a new approach to aggregation. With nested collections, you can aggregate in two steps: first create a set (or multiset) of rows that share the same key, then apply a function to collapse those rows to a single value.

As we saw in our discussion of WordCount, one of the languages that embraced nested collections is Apache Pig.

Consider the following aggregate query in Pig Latin:

emps = LOAD '/data/emps' using PigStorage()
  as (empno: int, name: chararray, deptno: int, salary: int);
by_deptno = GROUP emps BY deptno;
dept_stats = FOREACH by_deptno
  GENERATE group as deptno, COUNT(emps), AVG(emps.salary);

The by_deptno relation is the intermediate step, after GROUP has created a collection for each deptno value, and before GENERATE has applied COUNT and AVG aggregate functions to collapse the collections into scalar values. by_deptno looks like this:

group emps
10 {[100, ‘Shaggy’, 10, 1500], [120, ‘Scooby’, 10, 1250]}
20 {[110, ‘Velma’, 20, 2000]}
30 {[130, ‘Fred’, 30, 1700]}, {[140, ‘Daphne’, 30, 1700]}

The two-stage process makes it easy to write powerful aggregate functions. For example, to implement MEDIAN, you need to look at all of the values, sort them, and take the value that occurs in the middle.

But many other aggregate functions, including SUM, COUNT, MIN and MAX, can be computed iteratively, adding each row to a small accumulator; materializing the whole multiset is a waste of memory and effort.

The accumulator approach is how aggregate functions are usually implemented in functional programming languages.

Functional languages: foldl

In Standard ML, the language from which Morel is derived, the List structure has a higher-order function foldl (meaning ‘fold left’). foldl starts with an initial accumulator value, then uses a “combiner” function supplied by the caller to combine the initial value with the first element of the list to form a new accumulator value, then combines that accumulator value with the second element of the list, and so on.

(There is also a foldr function that starts at the end of the list and works forward.)

Thus sum is defined as follows:

- val sum = foldl (fn (x, y) => x + y) 0;
val sum = fn : int list -> int

Here is sum applied to a list:

- sum [1, 2, 3, 5, 8, 13];
val it = 32 : int

Choices, choices…

To recap, we have seen how grouping and aggregation are implemented in three languages: SQL, Pig and Standard ML. SQL calculates aggregates as it groups, whereas Pig forms collections first. Pig’s aggregate functions work on entire collections, whereas ML’s foldl higher-order function reduces collections by incrementally adding to an accumulator.

In Morel, we had to choose whether our group operation would generate lists (like Pig) that would be reduced in a subsequent step, or whether it would apply aggregate functions. And if it applied aggregate functions, would these be defined using an accumulator (like foldl) or would we allow a more general mechanism?

Morel’s aggregate functions proceed in three steps, all of which occur within the group clause:

  • First, Morel applies a key extractor expression to each row, and gathers rows together by key value;
  • Second, for all of the rows in a group, Morel applies an argument extractor expression to get the argument to which the aggregate function will be applied, and collects the argument values into a list;
  • Last, Morel applies the aggregate function to the list, emitting the result as a field in the output record.

In functional programming terms, it’s difficult to express this concisely. If group were a higher-order function, its signature would look something like this:

- val group: ('r -> 'k) -> ('r -> 'a) -> ('a list -> 'b) -> ('k * 'b) list

where:

  • 'r is the input row
  • 'k is the key
  • 'r -> 'k is the key extractor function
  • 'a is the type of the argument to the aggregate function
  • 'r -> 'a is the argument extractor function
  • 'b is the result of the aggregate function
  • 'a list -> 'b is the aggregate function that converts a list of arguments to a result
  • ('k * 'b) list is the output, a list of (key, result) pairs

Yes, this is complicated! And this variant only allows one aggregate function.

But Morel provides syntactic sugar so that group is simple to use in practice, while retaining strongly typing and type inference. For example:

- from e in emps
    group e.deptno compute sumSal = sum of e.sal;
val it =
  [{deptno=20,sumSal=10875.0},{deptno=10,sumSal=8750.0},
   {deptno=30,sumSal=9400.0}] : {deptno:int, sumSal:real} list

(Examples in this post use the new = syntax for renaming group keys and aggregate functions that was introduced in [MOREL-24] but has not yet been released, rather than the old as syntax used in morel-0.2, and also assume that sum and + have overloads for both int and real, introduced in [MOREL-28] and [MOREL-29].)

The key extractor (e.deptno) and argument extractor (e.sal) are not functions but expressions that are evaluated in the environment of the current row (the variable e holds current member of the emps relation). Expressions are as powerful as functions but are more concise.

The aggregate function is a function – or, more precisely, an expression that yields a function. In this case, we use the constant sum, which is a built-in function value of type int list -> int.

Here is another example:

- from e in emps,
      d in depts
    where e.deptno = d.deptno
    group e.deptno, d.dname, e.job
      compute sumSal = sum of e.sal,
        minRemuneration = min of e.sal + e.commission;
val it =
  [{deptno=30,dname="SALES",job="MANAGER",minRemuneration=2850.0,sumSal=2850.0},
   {deptno=20,dname="RESEARCH",job="CLERK",minRemuneration=800.0,sumSal=1900.0},
   {deptno=10,dname="ACCOUNTING",job="PRESIDENT",minRemuneration=5000.0,sumSal=5000.0},
   {deptno=10,dname="ACCOUNTING",job="MANAGER",minRemuneration=2450.0,sumSal=2450.0},
   {deptno=20,dname="RESEARCH",job="ANALYST",minRemuneration=3000.0,sumSal=6000.0},
   {deptno=30,dname="SALES",job="CLERK",minRemuneration=950.0,sumSal=950.0},
   {deptno=10,dname="ACCOUNTING",job="CLERK",minRemuneration=1300.0,sumSal=1300.0},
   {deptno=20,dname="RESEARCH",job="MANAGER",minRemuneration=2975.0,sumSal=2975.0},
   {deptno=30,dname="SALES",job="SALESMAN",minRemuneration=1500.0,sumSal=5600.0}]
  : {deptno:int, dname:string, job:string, minRemuneration:real, sumSal:real} list

Several things are more advanced than the previous example. The key is composite, there is more than one aggregate function, and the argument to the aggregate function may be a complex expression. Also, the input is a join, so there are two variables (e and d) available for use in expressions.

The output format is more straightforward than Pig. Pig uses a field called group for keys, which is a record if the key is composite. Morel just uses the input field names (and allows you to rename fields using =). In this case, Pig’s output record is {group = {deptno, dname}, sumSal, minRemuneration}, and Morel’s is {deptno, dname, job, sumSal, minRemuneration}.

Pig’s intermediate format (after GROUP and before FOREACH) would have a list-valued field called emps, whereas Morel’s intermediate list is seen only by the argument extractor, and does not need to be named for the user’s benefit.

The output of group is simply an iteration context with a number of variables available (deptno, dname, job, sumSal, minRemuneration). That’s what you’d expect whenever you are inside a from expression. You can therefore follow group with any clause allowable in fromwhere, order or group – or terminate the from expression with a yield clause.

User-defined aggregate functions

Morel’s goal is to be a simple, concise query language which allows you to escape into a Turing-complete programming language when you need to.

So of course you can define your own aggregate functions inside a query. In this example, we define our own version of the sum function:

- let
    fun my_sum [] = 0
      | my_sum (head :: tail) = head + (my_sum tail)
  in
    from e in emps
      group e.deptno
      compute sumEmpno = my_sum of e.empno
  end;
val it =
  [{deptno=20,sumEmpno=38501},{deptno=10,sumEmpno=23555},
   {deptno=30,sumEmpno=46116}] : {deptno:int, sumEmpno:int} list

Aggregate functions are invoked on a collection of values formed by applying their argument expression to all of the records in the current group. SQL’s COLLECT aggregate function, which creates a collection of its arguments, is therefore trivial in Morel: we just use the identity operator (fn x => x) as the aggregate function, and it returns its argument:

- from e in emps
    group e.deptno
    compute names = (fn x => x) of e.ename;
val it =
  [{deptno=20,names=["SMITH","JONES","SCOTT","ADAMS","FORD"]},
   {deptno=10,names=["CLARK","KING","MILLER"]},
   {deptno=30,names=["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]}]
  : {deptno:int, names:string list} list

Computing aggregate functions efficiently

Not all aggregate functions need to operate on the full list of their arguments. It might seem like overkill to form the list of arguments when not all functions need it.

But in designing Morel, we favor expressive power and concise, readable syntax over efficiency. We think we can achieve efficiency (in most cases) by recognizing expressions that can be rewritten to something simpler.

In the case of aggregate functions, many functions have algebraic properties that allow them to be computed more efficiently. For example, you can compute sum by dividing the rows into any subsets you like, summing those subsets, and summing those sums, in any order you like. For instance, sum [1, 2, 3, 5, 8, 13] is the same as sum [1, 3, 5, 13] + sum [2, 8], and therefore partitioning the input into odd and even partitions would be a viable strategy.

In mathematical terms, sum is a commutative monoid. In computational terms, that means that it is very easy to parallelize.

Morel will, at some point, allow you to declare the algebraic properties of built-in and user-defined aggregate functions (such as whether they are monoids). Then it will be able to choose more efficient plans.

Summary

Morel’s group operator is elegant and powerful. The syntax is simple and concise when used with built-in aggregate functions, but you can easily write user-defined aggregate functions.

Aggregate functions behave as if they are acting on collections, but in practice they can frequently be computed more efficiently, using accumulators or by composing sub-totals.

Unlike the complicated semantics of aggregation in SQL, group composes easily with other Morel relational operators such as where and order.

If you have comments, please reply on Twitter:

This article has been updated.