Julian Hyde on Streaming Data, Open Source OLAP. And stuff.

Aggregate queries in Morel

2020-04-09T13:00:00-07:00

Last week I wrote about solving the WordCount problem in Morel. The solution used a user-defined function, split, unnested the resulting array of words into a relation, and then used the group operator to count the occurrences of each word.

In this post, I want to describe in detail how that group operator works, and the strategy that went into it.

I think that what we came up with is elegant, powerful and concise, and I hope that you will agree. But getting the right design wasn’t straightforward. To see why, we’ll take a quick tour through the history of aggregate functions in databases and functional programming languages.

Aggregate functions and relational algebra

When Ted Codd introduced the relational algebra in 1970, there was no support for aggregation or aggregate functions. The operators were project, select, join and semijoin (called ‘restriction’ in Codd’s paper).

It was not until 1982 that Anthony Klug added aggregates to relational algebra. He remarked:

Previous treatments of aggregate functions in relational languages have not been general and have not been well defined. Two examples are System R and INGRES. The formulations of aggregate functions in these systems do not apply to more general languages, for example, to languages having explicit quantifiers. Their definitions of aggregate functions also rely unnecessarily on “sets” of tuples having duplicate members (a contradiction).

Aggregation does not fit easily into relational algebra because relational algebra works in terms of sets. If you project the deptno column from the Emp relation, you don’t get 5 records for department 10 (one for each employee), you get just one. If you try to compute the average salary for employees in each department, the most natural formulation would compute the set of salary values in department 10, and therefore if two employees have the same salary value, they would be counted only once.
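To make the duplicate problem concrete, here is a small Python sketch. The salary figures are invented for illustration and are not taken from the Emp table. Averaging a set of salaries silently drops the repeated value, whereas averaging a list (a bag) counts every employee.

```python
# Hypothetical salaries for department 10; two employees earn 1500.
salaries = [1500, 1500, 3000]

# Set semantics: duplicates collapse, so one of the 1500s disappears.
distinct = set(salaries)
avg_over_set = sum(distinct) / len(distinct)

# Bag (multiset) semantics: every employee is counted.
avg_over_bag = sum(salaries) / len(salaries)

print(avg_over_set)  # 2250.0
print(avg_over_bag)  # 2000.0
```

Only the second figure is the average salary that a user would expect.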

The commercial systems Klug referred to, System R and INGRES, had fewer problems because they were based on multisets (bags) rather than sets. Still, they arrived at an uneasy truce. SQL’s GROUP BY works by sleight of hand, simultaneously performing a project (of the key columns) and aggregation. The bag of values being passed to the aggregate function exists only fleetingly; if you’re squeamish, it’s best to look away.

Shoe-horning aggregate behavior into the existing SELECT expression was another compromise, and it left its mark on SQL’s semantics. Consider the following statement:

SELECT deptno, age, SUM(salary), MIN(age)
FROM Emp
GROUP BY deptno

There are two references to the age column. The first reference is illegal (because age is not a GROUP BY key), but the second reference (inside the MIN function) is legal. The semantic context, determining what columns are available, is very different if you are inside or outside an aggregate function, and both of those contexts exist in just the first line of that query.

SQL had to invent a new clause, HAVING, that does the same as WHERE but in the post-aggregation context. (As we shall see, Morel does not have that problem. The semantic rules after a group are the same as before it, so you can intermix group and where any way you like.)

Aggregate functions create other semantic problems. Assuming that the Emp table contains 100 rows, how many rows does the following statement return?

SELECT foo(age)
FROM Emp

Unless you know whether foo is an aggregate function, it’s impossible to say. If foo is an aggregate function, the query returns 1 row (implicitly totaling all Emp records), but if foo is an ordinary scalar function the query returns 100 rows.

The underlying problem is that aggregate functions in SQL have the same syntax as scalar functions, but very different semantics. This makes the language fragile, and makes a query difficult to understand unless you are familiar with all of the library functions that it is using.
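The two possible readings of foo can be sketched in Python. The ages and the functions double and total are hypothetical stand-ins, invented for this illustration. In Python the two call shapes look different, which is exactly the visual cue that SQL's syntax does not give you.

```python
# Hypothetical values of Emp.age; a real table would have 100 of them.
ages = [25, 31, 47]

def double(age):
    """Scalar function: produces one result per input row."""
    return age * 2

def total(all_ages):
    """Aggregate function: produces one result for the whole relation."""
    return sum(all_ages)

# If foo is scalar, the query yields as many rows as the input.
scalar_rows = [double(a) for a in ages]

# If foo is an aggregate, the query yields exactly one row.
aggregate_rows = [total(ages)]

print(len(scalar_rows), len(aggregate_rows))  # 3 1
```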

Nested collections

Having incorporated aggregate functions into relational algebra, by the early 1990s, the database research community was turning its attention to another sacred cow: nested collections, formally known as Non First Normal Form (NF²) relations.

Nested collections are a major extension to the relational model (though purists would say a corruption) that allow rich representations of data, but we are interested in them here because they allow a new approach to aggregation. With nested collections, you can aggregate in two steps: first create a set (or multiset) of rows that share the same key, then apply a function to collapse those rows to a single value.

As we saw in our discussion of WordCount, one of the languages that embraced nested collections is Apache Pig.

Consider the following aggregate query in Pig Latin:

emps = LOAD '/data/emps' using PigStorage()
  as (empno: int, name: chararray, deptno: int, salary: int);
by_deptno = GROUP emps BY deptno;
dept_stats = FOREACH by_deptno
  GENERATE group as deptno, COUNT(emps), AVG(emps.salary);

The by_deptno relation is the intermediate step, after GROUP has created a collection for each deptno value, and before GENERATE has applied COUNT and AVG aggregate functions to collapse the collections into scalar values. by_deptno looks like this:

group	emps
10	{[100, ‘Shaggy’, 10, 1500], [120, ‘Scooby’, 10, 1250]}
20	{[110, ‘Velma’, 20, 2000]}
30	{[130, ‘Fred’, 30, 1700], [140, ‘Daphne’, 30, 1700]}

The two-stage process makes it easy to write powerful aggregate functions. For example, to implement MEDIAN, you need to look at all of the values, sort them, and take the value that occurs in the middle.

But many other aggregate functions, including SUM, COUNT, MIN and MAX, can be computed iteratively, adding each row to a small accumulator; materializing the whole multiset is a waste of memory and effort.
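The contrast between the two styles can be sketched in Python, using the (deptno, salary) pairs from the by_deptno table above. This is only an illustration of the idea, not how Pig evaluates the query. The median needs each group's full list of values, while the sum needs only one running total per key.

```python
from collections import defaultdict
from statistics import median

# (deptno, salary) pairs taken from the by_deptno example.
rows = [(10, 1500), (10, 1250), (20, 2000), (30, 1700), (30, 1700)]

# Two-stage style: materialize the values of each group, then collapse them.
groups = defaultdict(list)
for deptno, salary in rows:
    groups[deptno].append(salary)
median_by_dept = {deptno: median(values) for deptno, values in groups.items()}

# Accumulator style: keep a single running total per key, with no lists.
sum_by_dept = defaultdict(int)
for deptno, salary in rows:
    sum_by_dept[deptno] += salary

print(median_by_dept[10])  # 1375.0
print(sum_by_dept[10])     # 2750
```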

The accumulator approach is how aggregate functions are usually implemented in functional programming languages.

Functional languages: foldl

In Standard ML, the language from which Morel is derived, the List structure has a higher-order function foldl (meaning ‘fold left’). foldl starts with an initial accumulator value, then uses a “combiner” function supplied by the caller to combine the initial value with the first element of the list to form a new accumulator value, then combines that accumulator value with the second element of the list, and so on.

(There is also a foldr function that starts at the end of the list and works backward toward the start.)

Thus sum is defined as follows:

- val sum = foldl (fn (x, y) => x + y) 0;
val sum = fn : int list -> int

Here is sum applied to a list:

- sum [1, 2, 3, 5, 8, 13];
val it = 32 : int
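For readers who are more at home in Python, here is a rough equivalent built on functools.reduce. The helper names foldl and foldr are my own; Python has no built-ins with those names. The combiner takes (element, accumulator), following the Standard ML argument order. Using a combiner that builds a list shows the difference in direction between the two folds.

```python
from functools import reduce

def foldl(combine, initial, items):
    """Fold from the left: combine is called as combine(element, accumulator)."""
    return reduce(lambda acc, x: combine(x, acc), items, initial)

def foldr(combine, initial, items):
    """Fold from the right: start with the last element and work backward."""
    return reduce(lambda acc, x: combine(x, acc), reversed(items), initial)

total = foldl(lambda x, acc: x + acc, 0, [1, 2, 3, 5, 8, 13])

# Prepending each element to the accumulator makes the direction visible.
left = foldl(lambda x, acc: [x] + acc, [], [1, 2, 3])
right = foldr(lambda x, acc: [x] + acc, [], [1, 2, 3])

print(total)  # 32
print(left)   # [3, 2, 1]
print(right)  # [1, 2, 3]
```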

Choices, choices…

To recap, we have seen how grouping and aggregation are implemented in three languages: SQL, Pig and Standard ML. SQL calculates aggregates as it groups, whereas Pig forms collections first. Pig’s aggregate functions work on entire collections, whereas ML’s foldl higher-order function reduces collections by incrementally adding to an accumulator.

In Morel, we had to choose whether our group operation would generate lists (like Pig) that would be reduced in a subsequent step, or whether it would apply aggregate functions. And if it applied aggregate functions, would these be defined using an accumulator (like foldl) or would we allow a more general mechanism?

Morel’s aggregate functions proceed in three steps, all of which occur within the group clause:

First, Morel applies a key extractor expression to each row, and gathers rows together by key value;
Second, for all of the rows in a group, Morel applies an argument extractor expression to get the argument to which the aggregate function will be applied, and collects the argument values into a list;
Last, Morel applies the aggregate function to the list, emitting the result as a field in the output record.

In functional programming terms, it’s difficult to express this concisely. If group were a higher-order function, its signature would look something like this:

- val group: ('r -> 'k) -> ('r -> 'a) -> ('a list -> 'b) -> 'r list -> ('k * 'b) list

where:

'r is the input row
'k is the key
'r -> 'k is the key extractor function
'a is the type of the argument to the aggregate function
'r -> 'a is the argument extractor function
'b is the result of the aggregate function
'a list -> 'b is the aggregate function that converts a list of arguments to a result
('k * 'b) list is the output, a list of (key, result) pairs

Yes, this is complicated! And this variant only allows one aggregate function.
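To see what such a higher-order function would do, here is a minimal Python sketch of the three steps. It is not Morel's implementation. The parameter names key_of, arg_of and aggregate are hypothetical, the input rows are passed as a final argument, and the sample employee records are invented.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple, TypeVar

R = TypeVar("R")                  # input row
K = TypeVar("K", bound=Hashable)  # key
A = TypeVar("A")                  # argument to the aggregate function
B = TypeVar("B")                  # result of the aggregate function

def group(key_of: Callable[[R], K],
          arg_of: Callable[[R], A],
          aggregate: Callable[[List[A]], B],
          rows: Iterable[R]) -> List[Tuple[K, B]]:
    # Steps 1 and 2: extract the key and the argument from each row,
    # and collect the argument values into one list per key.
    args_by_key: Dict[K, List[A]] = {}
    for row in rows:
        args_by_key.setdefault(key_of(row), []).append(arg_of(row))
    # Step 3: apply the aggregate function to each list.
    return [(key, aggregate(args)) for key, args in args_by_key.items()]

# Hypothetical employee records.
emps = [
    {"deptno": 10, "sal": 1500},
    {"deptno": 10, "sal": 1250},
    {"deptno": 20, "sal": 2000},
]

result = group(lambda e: e["deptno"], lambda e: e["sal"], sum, emps)
print(result)  # [(10, 2750), (20, 2000)]
```

Because the aggregate function receives a plain list, any function from a list to a value (sum, max, len, or one you write yourself) can be passed in.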

But Morel provides syntactic sugar so that group is simple to use in practice, while retaining strong typing and type inference. For example:

- from e in emps
    group e.deptno compute sumSal = sum of e.sal;
val it =
  [{deptno=20,sumSal=10875.0},{deptno=10,sumSal=8750.0},
   {deptno=30,sumSal=9400.0}] : {deptno:int, sumSal:real} list

(Examples in this post use the new = syntax for renaming group keys and aggregate functions that was introduced in [MOREL-24] but has not yet been released, rather than the old as syntax used in morel-0.2, and also assume that sum and + have overloads for both int and real, introduced in [MOREL-28] and [MOREL-29].)

The key extractor (e.deptno) and argument extractor (e.sal) are not functions but expressions that are evaluated in the environment of the current row (the variable e holds the current member of the emps relation). Expressions are as powerful as functions but are more concise.

The aggregate function is a function – or, more precisely, an expression that yields a function. In this case, we use the constant sum, which is a built-in function value of type int list -> int.

Here is another example:

- from e in emps,
      d in depts
    where e.deptno = d.deptno
    group e.deptno, d.dname, e.job
      compute sumSal = sum of e.sal,
        minRemuneration = min of e.sal + e.commission;
val it =
  [{deptno=30,dname="SALES",job="MANAGER",minRemuneration=2850.0,sumSal=2850.0},
   {deptno=20,dname="RESEARCH",job="CLERK",minRemuneration=800.0,sumSal=1900.0},
   {deptno=10,dname="ACCOUNTING",job="PRESIDENT",minRemuneration=5000.0,sumSal=5000.0},
   {deptno=10,dname="ACCOUNTING",job="MANAGER",minRemuneration=2450.0,sumSal=2450.0},
   {deptno=20,dname="RESEARCH",job="ANALYST",minRemuneration=3000.0,sumSal=6000.0},
   {deptno=30,dname="SALES",job="CLERK",minRemuneration=950.0,sumSal=950.0},
   {deptno=10,dname="ACCOUNTING",job="CLERK",minRemuneration=1300.0,sumSal=1300.0},
   {deptno=20,dname="RESEARCH",job="MANAGER",minRemuneration=2975.0,sumSal=2975.0},
   {deptno=30,dname="SALES",job="SALESMAN",minRemuneration=1500.0,sumSal=5600.0}]
  : {deptno:int, dname:string, job:string, minRemuneration:real, sumSal:real} list

Several things are more advanced than the previous example. The key is composite, there is more than one aggregate function, and the argument to the aggregate function may be a complex expression. Also, the input is a join, so there are two variables (e and d) available for use in expressions.

The output format is more straightforward than Pig. Pig uses a field called group for keys, which is a record if the key is composite. Morel just uses the input field names (and allows you to rename fields using =). In this case, Pig’s output record is {group = {deptno, dname, job}, sumSal, minRemuneration}, and Morel’s is {deptno, dname, job, sumSal, minRemuneration}.

Pig’s intermediate format (after GROUP and before FOREACH) would have a list-valued field called emps, whereas Morel’s intermediate list is seen only by the aggregate function, and does not need to be named for the user’s benefit.

The output of group is simply an iteration context with a number of variables available (deptno, dname, job, sumSal, minRemuneration). That’s what you’d expect whenever you are inside a from expression. You can therefore follow group with any clause allowable in from – where, order or group – or terminate the from expression with a yield clause.

User-defined aggregate functions

Morel’s goal is to be a simple, concise query language which allows you to escape into a Turing-complete programming language when you need to.

So of course you can define your own aggregate functions inside a query. In this example, we define our own version of the sum function:

- let
    fun my_sum [] = 0
      | my_sum (head :: tail) = head + (my_sum tail)
  in
    from e in emps
      group e.deptno
      compute sumEmpno = my_sum of e.empno
  end;
val it =
  [{deptno=20,sumEmpno=38501},{deptno=10,sumEmpno=23555},
   {deptno=30,sumEmpno=46116}] : {deptno:int, sumEmpno:int} list

Aggregate functions are invoked on a collection of values formed by applying their argument expression to all of the records in the current group. SQL’s COLLECT aggregate function, which creates a collection of its arguments, is therefore trivial in Morel: we just use the identity operator (fn x => x) as the aggregate function, and it returns its argument:

- from e in emps
    group e.deptno
    compute names = (fn x => x) of e.ename;
val it =
  [{deptno=20,names=["SMITH","JONES","SCOTT","ADAMS","FORD"]},
   {deptno=10,names=["CLARK","KING","MILLER"]},
   {deptno=30,names=["ALLEN","WARD","MARTIN","BLAKE","TURNER","JAMES"]}]
  : {deptno:int, names:string list} list

Computing aggregate functions efficiently

Not all aggregate functions need to operate on the full list of their arguments. It might seem like overkill to form the list of arguments when not all functions need it.

But in designing Morel, we favor expressive power and concise, readable syntax over efficiency. We think we can achieve efficiency (in most cases) by recognizing expressions that can be rewritten to something simpler.

In the case of aggregate functions, many functions have algebraic properties that allow them to be computed more efficiently. For example, you can compute sum by dividing the rows into any subsets you like, summing those subsets, and summing those sums, in any order you like. For instance, sum [1, 2, 3, 5, 8, 13] is the same as sum [1, 3, 5, 13] + sum [2, 8], and therefore partitioning the input into odd and even partitions would be a viable strategy.

In mathematical terms, sum is a commutative monoid. In computational terms, that means that it is very easy to parallelize.
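The following Python sketch illustrates the property. The helper partitioned_sum is a hypothetical name and the partitioning scheme is arbitrary. Because addition is associative and commutative, any way of splitting the input gives the same total, so each part could be summed by a different worker.

```python
values = [1, 2, 3, 5, 8, 13]

# Partition into odd and even values, as in the text.
odds = [v for v in values if v % 2 == 1]
evens = [v for v in values if v % 2 == 0]

def partitioned_sum(items, n_parts):
    """Split items into n_parts slices, sum each slice, then sum the sub-totals."""
    parts = [items[i::n_parts] for i in range(n_parts)]
    partials = [sum(part) for part in parts]  # each could run on its own worker
    return sum(partials)

print(sum(odds) + sum(evens))      # 32
print(partitioned_sum(values, 3))  # 32
```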

Morel will, at some point, allow you to declare the algebraic properties of built-in and user-defined aggregate functions (such as whether they are monoids). Then it will be able to choose more efficient plans.

Summary

Morel’s group operator is elegant and powerful. The syntax is simple and concise when used with built-in aggregate functions, but you can easily write user-defined aggregate functions.

Aggregate functions behave as if they are acting on collections, but in practice they can frequently be computed more efficiently, using accumulators or by composing sub-totals.

Unlike the complicated semantics of aggregation in SQL, group composes easily with other Morel relational operators such as where and order.

If you have comments, please reply on Twitter:

Next, let's add GROUP BY to Morel. Should be pretty straightforward, right? https://t.co/CKV9n73otd
— Julian Hyde (@julianhyde) April 9, 2020

This article has been updated.

Word Count revisited

2020-03-31T13:00:00-07:00

WordCount is a problem that has been used to showcase several generations of data engines. It was introduced in MapReduce, and followed by many others, including Pig, Hive and Spark.

The problem is simple to state: Given a collection of documents, find the set of words that occur in those documents, and the number of occurrences of each.

The solution is not so straightforward. It requires functions on scalar values (to tokenize a string into words), handling nested collections (because one line or document becomes a set of words), data parallelism (in case there are millions of documents and thousands of words), and reading from and writing to files. Mike Stonebraker’s protestations notwithstanding, these are not things that an RDBMS does well.

The deficiencies of RDBMS were the impetus for new data processing languages and frameworks, starting with MapReduce in 2004.

In this post, we shall look at implementations of WordCount in various languages and engines. Each implementation blends general-purpose programming languages, functional programming, and relational algebra in varying proportions.

WordCount in MapReduce

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

(From MapReduce: Simplified Data Processing on Large Clusters, by Jeff Dean and Sanjay Ghemawat, 2004.)

MapReduce was ground-breaking because it framed data parallelism in functional programming terms. It demonstrated that a large, complex, distributed problem could be expressed in terms of two simple functions.

Functional programming is often thought of as good for solving only ‘small’ problems, but because functions are pure and stateless they are an excellent building block for large-scale distributed programs.

Also, you can implement the functions in a powerful general-purpose programming language, so you can solve the whole problem in one language. In most dialects of SQL you cannot solve WordCount because there is no built-in split function to split a document into words. You would have to jump into another language to implement split as a user-defined function, and then import that function into your SQL session.

My only quibble is that they use confusing names. To functional programmers, map and reduce are well-known higher-order functions that are built into the system; Dean and Ghemawat’s map and reduce functions are just the arguments to those.

In the following example in Standard ML, a functional programming language, you’ll see that I rename their map and reduce functions to wc_mapper and wc_reducer, and pass them as arguments to a higher-order function called mapReduce. It illustrates the connection between MapReduce and functional programming. The only difference is that in Dean and Ghemawat’s MapReduce, and in other implementations of MapReduce such as Apache Hadoop, the mapReduce function is a powerful distributed system and not 17 lines of Standard ML.

Word Count in Standard ML

First, let’s define the mapReduce function. It is a higher-order function that takes two other functions mapper and reducer as arguments, and also the list of input values.

mapReduce is a framework. The user can make it do WordCount or a hundred other tasks by providing different implementations of mapper and reducer.

This particular implementation is very inefficient – the update and dedup functions that build a multimap have O(n²) running time, and the program runs in a single thread – but the point is that the framework could be made more efficient without the user having to rewrite their mapper and reducer functions.

- fun mapReduce mapper reducer list =
    let
      fun update (key, value, []) = [(key, [value])]
        | update (key, value, ((key2, values) :: tail)) =
            if key = key2 then
              (key, (value :: values)) :: tail
            else
              (key2, values) :: (update (key, value, tail))
      fun dedup ([], dict) = dict
        | dedup ((key, value) :: tail, dict) =
            dedup (tail, update (key, value, dict))
      fun flatMap f list = List.foldl (op @) [] (List.map f list)
      val keyValueList = flatMap mapper list
      val keyValuesList = dedup (keyValueList, [])
    in
      List.map (fn (key, values) => (key, reducer (key, values))) keyValuesList
    end;
val mapReduce = fn
  : ('a -> (''b * 'c) list)
    -> (''b * 'c list -> 'd) -> 'a list -> (''b * 'd) list

Now let’s define the wc_mapper and wc_reducer functions that will power the WordCount algorithm.

- fun wc_mapper line =
    let
      fun split0 [] word words = word :: words
        | split0 (#" " :: s) word words = split0 s "" (word :: words)
        | split0 (c :: s) word words = split0 s (word ^ (String.str c)) words
      fun split s = List.rev (split0 (String.explode s) "" [])
    in
      List.map (fn w => (w, 1)) (split line)
    end;
val wc_mapper = fn : string -> (string * int) list
- fun wc_reducer (key, values) = foldl (op +) 0 values;
val wc_reducer = fn : 'a * int list -> int

Check that they work on discrete values:

- wc_mapper "a skunk sat on a stump";
val it = [("a",1),("skunk",1),("sat",1),("on",1),("a",1),("stump",1)]
  : (string * int) list
- wc_reducer ("hello", [1, 4, 2]);
val it = 7 : int

Bind them to mapReduce to create a function tailored to the WordCount problem:

- fun wordCount lines = mapReduce wc_mapper wc_reducer lines;
val wordCount = fn : string list -> (string * int) list

And check that our wordCount function works:

- wordCount ["a skunk sat on a stump",
    "and thunk the stump stunk",
    "but the stump thunk the skunk stunk"];
val it =
  [("but",1),("the",3),("stump",3),("thunk",2),("skunk",2),("stunk",2),
   ("and",1),("a",2),("sat",1),("on",1)] : (string * int) list

WordCount in Pig

Apache Pig was one of the first high-level languages for Apache Hadoop. Pig has its trotters firmly planted in relational algebra, but makes extensive use of nested collections.

input = load 'mary' as (line);
words = foreach input generate flatten(TOKENIZE(line)) as word;
grpd = group words by word;
cntd = foreach grpd generate group, COUNT(words);
dump cntd;

(From Programming Pig by Alan Gates, O’Reilly 2011.)

Each line of the program is one relational operation. In line 2, a user-defined function (TOKENIZE) generates a collection, which is then flattened. Line 3 groups occurrences of words, and line 4 generates a count of each collection.

WordCount in Apache Hive SQL

SELECT word, COUNT(*)
FROM input
  LATERAL VIEW explode(split(text, ' ')) lTable AS word
GROUP BY word

(From Stack Overflow.)

This is not a typical SQL statement, and the interesting stuff all happens on line 3:

First, the split function converts the text column from the input table into an array of strings.
Next, the explode table-valued function converts an array of strings into a relation with one string column.
Last, the LATERAL VIEW keywords work around oddities in SQL semantics. VIEW tells Hive to treat the result of a table function as a relation (without it, the only things you can include in the FROM clause are tables and sub-queries), and LATERAL makes previous entries in the FROM clause (in this case the text column of the input relation) visible inside the function.

(LATERAL VIEW is Hive-specific syntax. Standard SQL would use CROSS JOIN LATERAL TABLE; in Oracle, Microsoft SQL Server, and Apache Calcite you can also use CROSS APPLY.)

The overall effect is nested ‘for’ loops: first over the rows in the input relation, then over the array yielded by split(input.text, ' ') for each row. The syntax is different from the Pig solution, but the semantics are almost identical. The resulting list of words is then handled by GROUP BY and COUNT(*) in the usual way.
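Spelled out as explicit loops in Python, the query's semantics look roughly like this. It is a sketch of the meaning, not of how Hive executes the query, and the two rows of the input table are invented for illustration.

```python
from collections import Counter

# Hypothetical contents of the input table, which has one column, text.
input_rows = [
    {"text": "a skunk sat on a stump"},
    {"text": "and thunk the stump stunk"},
]

counts = Counter()
for row in input_rows:                   # outer loop: rows of the input relation
    for word in row["text"].split(" "):  # inner loop: the array yielded by split;
                                         # row is visible here, as LATERAL allows
        counts[word] += 1                # GROUP BY word, COUNT(*)

print(counts["a"], counts["stump"])  # 2 2
```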

WordCount in Apache Spark

Apache Spark is both an extension to the MapReduce paradigm and a successor to the Hadoop engine. Its binding to the Scala language makes for concise programs, as the following example shows. It also has a distributed processing engine that is more efficient than Hadoop, especially for shorter-lived jobs.

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

(From Apache Spark Examples.)

Spark has many more operations than just map and reduce, but this example clearly shows the same map-reduce structure.

Spark is a platform rather than a language: the calls to methods flatMap, map and reduceByKey do not actually process data but build an expression in Spark’s algebra. The arguments to those methods are Scala functions. When saveAsTextFile is called, the algebra is planned and executed.

I call this a ‘builder’ model, and you can see earlier examples in DryadLINQ, FlumeJava, and Cascading. While the primary interface to Apache Calcite is SQL, its builder API RelBuilder is popular with people writing query optimizers.

A builder system inevitably has two languages: the host language in which you write the programs (Scala in the case of Spark) and the engine’s own algebra. For small expressions (for example a filter condition) some builders have an expression algebra, while others use fragments of the host language (such as the Scala fragment word => (word, 1) above).
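A toy builder in Python makes the two-language split visible. The Plan class and its method names are hypothetical and invented for this sketch; they are not Spark's API. Each method call only records an operator, the lambdas are opaque fragments of the host language, and nothing runs until execute is called.

```python
class Plan:
    """Hypothetical builder: records operators instead of running them."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)

    def _with(self, name, f):
        return Plan(self.source, self.ops + ((name, f),))

    def flat_map(self, f):
        return self._with("flat_map", f)

    def map(self, f):
        return self._with("map", f)

    def reduce_by_key(self, f):
        return self._with("reduce_by_key", f)

    def execute(self):
        """Interpret the recorded operators; an engine could optimize them first."""
        data = list(self.source)
        for name, f in self.ops:
            if name == "flat_map":
                data = [y for x in data for y in f(x)]
            elif name == "map":
                data = [f(x) for x in data]
            elif name == "reduce_by_key":
                acc = {}
                for key, value in data:
                    acc[key] = f(acc[key], value) if key in acc else value
                data = list(acc.items())
        return data

calls = []  # records each time the host-language function actually runs

def split_line(line):
    calls.append(line)
    return line.split(" ")

plan = (Plan(["a skunk sat on a stump"])
        .flat_map(split_line)
        .map(lambda word: (word, 1))
        .reduce_by_key(lambda x, y: x + y))

calls_before_execute = len(calls)  # still 0: the plan has only been built
result = dict(plan.execute())
print(calls_before_execute, result["a"])  # 0 2
```

The operator names in plan.ops are the algebra that an optimizer could inspect and rearrange, but the functions attached to them can only be called, not analyzed.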

A builder has an underlying algebra, which means that the large-scale program can be optimized by re-organizing the algebraic operators. The mix of languages means that you can use the power of the host language to write user-defined functions without stepping out of the environment (the way you would have to, say, leave SQL in order to write a UDF in Java).

But the seams between the algebra and the host language are always apparent. They have different type systems, for instance, and if the program type-checks in the host language it still may not type-check in the algebra. And those fragments of host language may be opaque to the optimizer and prevent advanced optimizations.

Reflecting on these problems, I came up with Morel, a language that has the power of a general-purpose language (due to its SML ancestry) but with support for relational expressions in the language, so that you naturally express data-oriented problems in relational algebra.

Unlike a builder, Morel is one language. The algebra is the parse tree of the program, and the query optimizer is built into the language parser.

WordCount in Morel

The solution to the WordCount problem in Morel is very concise:

from line in lines,
    word in split line
  group word compute count

So concise that it needs some explanation. The from keyword (a feature of Morel that is not present in Standard ML) creates a list comprehension. You can think of it as a ‘for’ loop, but declarative rather than imperative. Inside the loop are not actions but expressions. The whole from is an expression whose value is a list, and the elements of that list are defined by those inner expressions.

One difference from SQL is that collections can be composed of any value, not just records; lines is a list of strings, and therefore at any moment during the iteration line is a string. (The Hive SQL example is confusing because it has two single-column relations, input and lTable, but it has to use column names, text and word, in expressions.)

SQL makes a big distinction between relations (which may appear in a FROM clause) and collections such as arrays (which may only appear where expressions can appear, such as in the SELECT and WHERE clauses). Morel makes no such distinction. from works on any collection-valued expression, which may be a list of strings, a list of records, or a table stored in a relational database.

As a result, we don’t tend to use the term ‘query’ in Morel. In other languages, a ‘query’ is an expression that operates on relations, but in Morel we just call it an expression.

Is from a query operator? It reminds us of SELECT because it uses relational operations – scan, join and aggregate in this example, and also filter and sort – but it’s just one of many ways that you can operate on lists in Morel.

The solution – all 3 lines of it – is a single from expression:

The first line, from line in lines, assigns each element of lines in turn to a variable line of type string.
The second line, word in split line, applies the split function to line to yield an array of strings, and assigns each element of the array in turn to a variable word. (We don't need the equivalent of SQL's LATERAL, because line is implicitly visible in the inner loop.)
The third line, group word compute count, gathers records into groups that have the same word value, then applies the built-in count aggregate function to those groups. The result is a list of records with fields word and count.
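If you know Python's list comprehensions, the three lines map onto them fairly directly. The sketch below is an analogy rather than a description of how Morel evaluates the expression. The split helper is a stand-in that splits on single spaces, and the input lines are the tongue-twister used elsewhere in this post.

```python
from collections import Counter

def split(line):
    """Stand-in for the split function: break a line at single spaces."""
    return line.split(" ")

lines = [
    "a skunk sat on a stump",
    "and thunk the stump stunk",
    "but the stump thunk the skunk stunk",
]

# from line in lines, word in split line
words = [word for line in lines for word in split(line)]

# group word compute count
result = [{"word": w, "count": c} for w, c in Counter(words).items()]

print(len(result))  # 10
```

As in Morel, the inner clause can refer to line directly, and the result is a list of records with the fields word and count.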

A more complete solution

The above Morel solution works, but it assumes that a split function is available. (The other solutions in other languages have the same problem; this is especially onerous in Pig and Hive, where someone would have to write a UDF in a language such as Java, compile it, package it in a JAR file, add the JAR file to the classpath, and restart the runtime.)

A better solution would solve the problem all in the one language, and ideally in the same block of code, without requiring an extra compilation step. Because Morel is a general-purpose language, we can declare the split function inline:

fun wordCount lines =
  let
    fun split0 [] word words = word :: words
      | split0 (#" " :: s) word words = split0 s "" (word :: words)
      | split0 (c :: s) word words = split0 s (word ^ (String.str c)) words
    fun split s = List.rev (split0 (String.explode s) "" [])
  in
    from line in lines,
        word in split line
    group word compute count
  end;

gives signature

val wordCount = fn : string list -> {count:int, word:string} list

Another improvement is that the new solution is not an expression, but a function. The previous solution was an expression that assumed that there is a list called lines in the environment, but the function can easily be applied to any value.

Now let’s run it:

wordCount ["a skunk sat on a stump",
    "and thunk the stump stunk",
    "but the stump thunk the skunk stunk"];
val it =
  [{count=2,word="a"},{count=3,word="the"},{count=1,word="but"},
   {count=1,word="sat"},{count=1,word="and"},{count=2,word="stunk"},
   {count=3,word="stump"},{count=1,word="on"},{count=2,word="thunk"},
   {count=2,word="skunk"}] : {count:int, word:string} list

Conclusion

We have seen the solutions to the WordCount problem in 6 languages: MapReduce, Standard ML, Pig, Hive SQL, Spark, and Morel. Pig and Hive have powerful high-level query languages but rely on UDFs written in another language. MapReduce and Spark use the power of their native language but rely on an external framework (whose real language is a relational algebra created by a builder) to carry out the processing.

Only Morel brings high-level query operators into a language that can also solve general-purpose problems.

Morel lies at the intersection of functional programming and query languages, taking the best of both worlds. Over the next few weeks, this blog will drill deeper into both of those aspects. We shall look at how to express SQL concepts such as GROUP BY and ORDER BY in Morel, and also what it means to have functions as first-class values in a query language.

If you have comments, please reply on Twitter:

WordCount is the "Hello, world!" problem for data languages. MapReduce, #ApachePig, @ApacheHive, and @ApacheSpark all have solutions with varying degrees of elegance and efficiency.

Can Morel do better? https://t.co/hsEhBeq4Xk
— Julian Hyde (@julianhyde) April 1, 2020

This article has been updated.

Morel release 0.2

2020-03-10T02:00:00-07:00

I am pleased to announce Morel release 0.2. Since release 0.1 we have renamed the project from ‘smlj’ to ‘Morel’, made major improvements to the type system, and continued our relational extensions.

Some highlights of release 0.2:

Functions and values can have polymorphic types, inferred as part of a Hindley-Milner type system;
Relational expressions may now include a group clause, so you can evaluate aggregate queries (similar to SQL GROUP BY);
Foreign values allow external data, such as the contents of a JDBC database, to be handled as if it is in memory;
Add built-in functions based on the String and List structures in the Standard ML basis library;
Postfix field reference syntax makes Morel more familiar to SQL users;
Add Morel language reference.

For more information, see the release notes.

If you have comments, please reply on Twitter:

Announcing Morel release 0.2! https://t.co/6nelUtJAmU
— Julian Hyde (@julianhyde) March 10, 2020

Morel: The basic language

2020-03-03T12:00:00-08:00

Last week I wrote about my goals for Morel, extending a simple functional language (Standard ML) with relational operations so that it can be used as a query language.

This week I’d like to go over the basics of the language. Much of this is the same as ML, and that’s a good thing. If you want to learn more about ML, there are plenty of good resources.

The shell

The easiest way to start Morel is in its shell. The following example starts Morel from macOS and evaluates a string literal.

$ ./morel
morel version 0.1.0 (java version "13", JRE null (build 13+33), JLine terminal, xterm-256color)
= "hello, world!";
val it = "hello, world!" : string
=

(To build Morel and start the shell for yourself, follow the instructions on GitHub. To exit the shell, type Ctrl+D.)

Primitive types and simple expressions

Morel is a functional language, so everything in it is an expression. The basic types are bool, int, real, string, char, and unit. Here are literals in each.

= false;
val it = false : bool
= 10;
val it = 10 : int
= ~4.5;
val it = ~4.5 : real
= "morel";
val it = "morel" : string
= #"a";
val it = #"a" : char
= ();
val it = () : unit

As you’d expect, there are built-in operators for each data type. Here are a few examples:

= true andalso false;
val it = false : bool
= true orelse false;
val it = true : bool
= not false;
val it = true : bool
= 1 + 2;
val it = 3 : int
= ~(5 - 2);
val it = ~3 : int
= 10 mod 3;
val it = 1 : int
= "mo" ^ "rel";
val it = "morel" : string

Variables

You can assign values to variables.

= val x = 7;
val x = 7 : int
= val y = x mod 3;
val y = 1 : int
= x + y;
val it = 8 : int

(Morel, following Standard ML, actually calls them “value bindings” rather than “variables” because you cannot change their value. It’s not much of a hardship, because you can create a new variable with the same name, and it will obscure the old variable and its value.)

There is a special variable called it used by the shell. Each time you evaluate an expression, the shell assigns the value to a variable called it, and prints the value and its type. You can use it in the next expression.

= "morel";
val it = "morel" : string
= String.size it;
val it = 5 : int
= it + 4;
val it = 9 : int

A let expression binds one or more values and evaluates an expression.

= let
-   val x = 3
-   val y = 2
- in
-   x + y
- end;
val it = 5 : int

Lists, records and tuples

In addition to primitive types, there are list, record, and tuple types.

= [1, 2, 3];
val it = [1,2,3] : int list
= {id = 10, name = "Scooby"};
val it = {id=10,name="Scooby"} : {id:int, name:string}
= (1, true, "yes");
val it = (1,true,"yes") : int * bool * string

Tuples are actually just records with fields named “1”, “2”, and so on. The following example shows that the values are identical, and have the same type, whether you use tuple or record syntax.

= (1, true, "yes");
val it = (1,true,"yes") : int * bool * string
= {1 = 1, 2 = true, 3 = "yes"};
val it = (1,true,"yes") : int * bool * string
= (1, true, "yes") = {1 = 1, 2 = true, 3 = "yes"};
val it = true : bool

The empty record and empty tuple are equal, and are the only value of the type unit. Morel outputs it as ().

= {};
val it = () : unit
= ();
val it = () : unit
= {} = ();
val it = true : bool

Functions

Functions are expressions, too. fn makes a lambda expression. After we have bound the lambda value to plusOne, we can use plusOne as a function.

= val plusOne = fn x => x + 1;
val plusOne = fn : int -> int
= plusOne 2;
val it = 3 : int

Function declarations are common, so the fun keyword provides a shorthand: “fun f arg = exp” is short for “val f = fn arg => exp”.

= fun plusOne x = x + 1;
val plusOne = fn : int -> int
= plusOne 2;
val it = 3 : int

Functions can have multiple arguments, separated by spaces.

= fun plus x y = x + y;
val plus = fn : int -> int -> int
= plus 3 4;
val it = 7 : int

If we supply too few arguments, we get a closure that captures the argument value and can be applied later.

= val plusTen = plus 10;
val plusTen = fn : int -> int
= plusTen 2;
val it = 12 : int
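
For readers more at home in Python than in ML, the same idea can be sketched with nested functions. This is only an analogy, not how Morel implements currying; the names plus and plus_ten simply mirror the example above.

```python
def plus(x):
    """Curried addition: takes x and returns a function that takes y."""
    def add_y(y):
        # x is captured from the enclosing call, forming a closure.
        return x + y
    return add_y


# Supplying only the first argument yields a function we can apply later.
plus_ten = plus(10)

print(plus_ten(2))
print(plus(3)(4))
```

In ML every multi-argument function written this way is curried automatically; in Python the nesting has to be spelled out.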

Functions can be recursive. Here, the factorial function evaluates by calling itself, using the mathematical identity that n! = n * (n-1)!.

= fun factorial n =
-   if n = 1 then
-     1
-   else
-     n * factorial (n - 1);
val factorial = fn : int -> int
= factorial 1;
val it = 1 : int
= factorial 5;
val it = 120 : int

Higher-order functions and type inference

A higher-order function is a function that operates on other functions. Here are a couple of examples.

The map function applies a given function f to each element of a list, returning a list, as follows:

= fun map f [] = []
-   | map f (head :: tail) = (f head) :: (map f tail);
val map = fn : ('a -> 'b) -> 'a list -> 'b list
= fun double n = n * 2;
val double = fn : int -> int
= map double [1, 2, 3, 4];
val it = [2,4,6,8] : int list

The type of the map function, above, is ('a -> 'b) -> 'a list -> 'b list. Morel’s type system (based, like that of ML, on the Hindley-Milner type system) has deduced that map has a polymorphic type, and 'a and 'b are type variables. This means that if f has type 'a -> 'b, for any types 'a and 'b, then map f will transform a list of 'a to a list of 'b.

For example, if f is the built-in function String.size of type string -> int, then 'a is string and 'b is int, and map String.size will convert a string list to an int list.

Notice that we did not declare any types; the type system deduced everything for us. Type inference is perhaps ML’s greatest feature. In Morel, it helps us achieve our goal of writing powerful queries concisely. We don’t need to specify types, and furthermore, we can include temporary values and functions in the query whenever we need them.

The filter function keeps only those elements of a list for which a predicate p evaluates to true, as follows:

= fun filter p [] = []
-   | filter p (head :: tail) =
-     if (p head) then
-       (head :: (filter p tail))
-     else
-       (filter p tail);
val filter = fn : ('a -> bool) -> 'a list -> 'a list
= fun even n = n mod 2 = 0;
val even = fn : int -> bool
= filter even [1, 2, 3, 4];
val it = [2,4] : int list

You may notice that map and filter are very similar to the SELECT and WHERE clauses of a SQL statement. This is no surprise: relational algebra, which underlies SQL, is basically a collection of higher-order functions applied to lists of records (relations).
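
To make the correspondence concrete, here is a minimal Python sketch (an analogy only, not Morel) in which filter plays the role of WHERE and map plays the role of SELECT. The rows are the same employees used in the examples later in this post; the variable names are just for illustration.

```python
# Sample rows: the employees used in the examples later in this post.
emps = [
    {"id": 100, "name": "Fred", "deptno": 10},
    {"id": 101, "name": "Velma", "deptno": 20},
    {"id": 102, "name": "Shaggy", "deptno": 30},
    {"id": 103, "name": "Scooby", "deptno": 30},
]

# WHERE deptno = 30: filter keeps the rows for which the predicate holds.
in_dept_30 = filter(lambda e: e["deptno"] == 30, emps)

# SELECT id: map builds a new record from each remaining row.
ids = list(map(lambda e: {"id": e["id"]}, in_dept_30))

print(ids)
```

The predicate and the projection are ordinary function values passed to higher-order functions, which is the sense in which relational operators are higher-order functions over lists of records.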

Can we extend ML syntax to make it easier to write relational algebra expressions? You bet!

Relational expressions

from is a Morel extension that iterates over one or more lists, applies relational operations, and returns a list.

It has a similar purpose to SQL’s SELECT. But unlike SELECT, its inputs and outputs can be collections of any type (not just records). Also, Morel makes no distinction between relations and expressions; therefore Morel does not need operations like SQL’s UNNEST to deal with nested collections, and we can operate on lists in memory just like tables in a database.

Let’s start by defining emps and depts relations as lists of records.

= val emps =
-   [{id = 100, name = "Fred", deptno = 10},
-    {id = 101, name = "Velma", deptno = 20},
-    {id = 102, name = "Shaggy", deptno = 30},
-    {id = 103, name = "Scooby", deptno = 30}];
val emps =
  [{deptno=10,id=100,name="Fred"},{deptno=20,id=101,name="Velma"},
     {deptno=30,id=102,name="Shaggy"},{deptno=30,id=103,name="Scooby"}]
  : {deptno:int, id:int, name:string} list
= val depts =
-   [{deptno = 10, name = "Sales"},
-    {deptno = 20, name = "HR"},
-    {deptno = 30, name = "Engineering"},
-    {deptno = 40, name = "Support"}];
val depts =
  [{deptno=10,name="Sales"},{deptno=20,name="HR"},
     {deptno=30,name="Engineering"},{deptno=40,name="Support"}]
  : {deptno:int, name:string} list

Now let’s run our first query:

= from e in emps yield e;
val it =
  [{deptno=10,id=100,name="Fred"},{deptno=20,id=101,name="Velma"},
     {deptno=30,id=102,name="Shaggy"},{deptno=30,id=103,name="Scooby"}]
  : {deptno:int, id:int, name:string} list

The equivalent in SQL would be

SELECT *
FROM emps AS e

In Morel there is no difference between a query, a table, and a list-valued expression, so we could have instead written just emps.

= emps;
val it =
  [{deptno=10,id=100,name="Fred"},{deptno=20,id=101,name="Velma"},
     {deptno=30,id=102,name="Shaggy"},{deptno=30,id=103,name="Scooby"}]
  : {deptno:int, id:int, name:string} list

A where clause filters out rows, and a yield clause controls which fields are returned.

= from e in emps
-   where #deptno e = 30
-   yield {id = #id e};
val it = [{id=102},{id=103}] : {id:int} list

The SQL equivalent is as follows:

SELECT e.id
FROM emps AS e
WHERE e.deptno = 30

If you omit yield, you get the raw values of the loop variable e.

= from e in emps
-   where #deptno e = 30;
val it =
  [{deptno=30,id=102,name="Shaggy"},
     {deptno=30,id=103,name="Scooby"}]
  : {deptno:int, id:int, name:string} list

Shorthand

In ML, the usual way to access a field is via an accessor function that starts with ‘#’. For example, #id e returns the id field of record e. But Morel has an alternative syntax, e.id, which is more familiar for SQL users.

Also, when you are constructing a record, ML requires each field to be named, e.g. id = #id e, but in Morel you can omit the field name if it is the same as the name of the field or variable that supplies the value.

Thus the following 3 queries are equivalent:

= from e in emps
-   yield {id = #id e};
val it = [{id=100},{id=101},{id=102},{id=103}] : {id:int} list
= from e in emps
-   yield {id = e.id};
val it = [{id=100},{id=101},{id=102},{id=103}] : {id:int} list
= from e in emps
-   yield {e.id};
val it = [{id=100},{id=101},{id=102},{id=103}] : {id:int} list

I’ll use the abbreviated forms from now on.

Joins and sub-queries

The following query joins the employees and departments relations on department number.

= from e in emps,
-     d in depts
-   where e.deptno = d.deptno
-   yield {e.id, e.deptno, ename = e.name, dname = d.name};
val it =
  [{deptno=10,dname="Sales",ename="Fred",id=100},
   {deptno=20,dname="HR",ename="Velma",id=101},
   {deptno=30,dname="Engineering",ename="Shaggy",id=102},
   {deptno=30,dname="Engineering",ename="Scooby",id=103}]
  : {deptno:int, dname:string, ename:string, id:int} list

The following query returns the names of employees in the Engineering department.

= let
-   fun exists [] = false
-     | exists (head :: tail) = true
- in
-   from e in emps
-     where exists (from d in depts
-                   where d.deptno = e.deptno
-                   andalso d.name = "Engineering")
-     yield e.name
- end;
val it = ["Shaggy","Scooby"] : string list

This query shows how much can be accomplished in Morel with just functions, without extending the language. In SQL, the equivalent query would have EXISTS and a correlated sub-query, but in Morel exists is an ordinary function that we have defined in the query, and a correlated sub-query is just an expression that happens to return a list and reference variables in an enclosing scope.
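
The same shape can be sketched in Python, as an analogy rather than a description of how Morel evaluates the query. Here exists is an ordinary function, and the inner list comprehension is the "correlated sub-query": it refers to e from the enclosing loop. The data repeats the emps and depts relations defined above.

```python
emps = [
    {"id": 100, "name": "Fred", "deptno": 10},
    {"id": 101, "name": "Velma", "deptno": 20},
    {"id": 102, "name": "Shaggy", "deptno": 30},
    {"id": 103, "name": "Scooby", "deptno": 30},
]
depts = [
    {"deptno": 10, "name": "Sales"},
    {"deptno": 20, "name": "HR"},
    {"deptno": 30, "name": "Engineering"},
    {"deptno": 40, "name": "Support"},
]


def exists(rows):
    """Return True if the list has at least one element."""
    return len(rows) > 0


# The inner comprehension references e from the outer loop, just as a
# correlated sub-query references a column of the enclosing query.
names = [
    e["name"]
    for e in emps
    if exists([d for d in depts
               if d["deptno"] == e["deptno"] and d["name"] == "Engineering"])
]

print(names)
```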

Summary

To recap, Morel has:

primitive types bool, int, real, string, char, unit;
also list, tuple, record, and function types;
lambda expressions and recursive functions;
polymorphism and type-inference;
relational from expressions (a Morel extension to Standard ML).

This was just a quick introduction to Morel and its ancestor Standard ML, and I had to skip over many topics. There wasn’t time to cover algebraic types and pattern-matching, variations to from expressions such as the group clause, and how Morel accesses external data and optimizes programs. I hope to cover these topics in upcoming posts.

If you have comments, please reply on Twitter:

Last week, the grand vision; this week we're down to brass tacks. https://t.co/2MihRqUTjs
— Julian Hyde (@julianhyde) March 3, 2020

This article has been updated.

Morel: A functional language for data

2020-02-25T23:24:00-08:00

For the past few months, I have been working on an experimental functional/data language called Morel.

SQL has several deficiencies relating to nested collections, higher-order functions and its type system. After several months trying to figure out how to add these features to SQL, I noticed that they were basically the defining characteristics of a functional programming language.

I figured, rather than fixing SQL, why not dig the tunnel from the other end? Start with a functional programming language, then add what makes SQL a good query language.

Morel’s design goals:

Seamless integration of collections (relations) and relational operators into a functional programming language
As concise as SQL
Access external data
Retain the computational power and sophisticated type system of the functional programming language
Allow query planning via relational algebra, even in hybrid programs that are a mixture of relational algebra and general-purpose code
Suitable for interactive queries from a REPL (read-eval-print loop) and also larger scale programs

Conciseness

SQL is concise. Many useful queries are only a few lines long.

Functional programming languages also tend to be concise, for similar reasons to SQL: they are strongly typed, and the language has good type inference. Type inference means that you don’t need to explicitly specify types often, if ever.

When you write a query in Morel, you are writing a short expression in a functional programming language, but its structure looks very similar to the equivalent SQL query.

For example, here is a query in Morel:

from e in hr.emps,
    d in hr.depts
where e.deptno = d.deptno
yield {e.id, e.deptno, ename = e.name, dname = d.name};

The equivalent query in SQL looks very similar:

SELECT e.id, e.deptno, e.name AS ename, d.name AS dname
FROM hr.emps AS e,
    hr.depts AS d
WHERE e.deptno = d.deptno;

External data

In the above example, the hr data structure looks like a record in memory (with fields emps and depts that are collections of records) but actually maps to a schema in a DBMS.

Virtualized data is essential to bringing large-scale data sets into the programming model, and is implemented by calling out to Apache Calcite schema adapters. I expect to make more use of Calcite for query planning.

Interactivity

People tend to write SQL queries in a REPL (read-eval-print loop). Small functional programs can be written in the same way, so there is a good fit there.

A modest project

I chose to extend Standard ML because it is a small, simple language, well known in the academic community.

This made it feasible for me to write a parser and interpreter as a solo project. (By the way, the interpreter is written in Java, and is quite a nice implementation of Standard ML for the JVM, even if you don’t care for the relational extensions.)

If the Morel experiment is successful, the ideas can be carried into more complex and powerful languages such as Haskell and Scala.

Relationally complete vs. Turing complete

Query languages are, by design, powerful but not too powerful.

One reason for this is that, if we add extra power (for example, function values and arrays) then we have to add extra syntax for these features. The extra syntax makes the language harder to use for simple tasks, and also harder to learn.

More important, query languages rely on a query planner. Many details can be left out of the program (such as whether to use a hash-join or sort-merge-join algorithm to perform a join) because the planner can make these decisions for us. But if we give the language too much power, we make the planner’s job difficult or impossible.

Why is this? As soon as a language has sufficient power – if it can loop, or call functions recursively – it becomes Turing complete, and not all programs in such a language can be reasoned about. (See, for example, the Halting Problem.)

SQL is not Turing complete (if you ignore the WITH RECURSIVE construct), as evidenced by the fact that any query with finite input relations eventually terminates. It is equivalent in power to relational algebra and relational calculus, which Edgar F. Codd called relationally complete.

Morel, on the other hand, crosses that line. This is necessary, because general-purpose functional languages are Turing complete, but are we giving the planner an impossible task?

Limiting the power

I believe that we can solve the problem by separating the “query” parts of a program, that consist only of relational operators, from the “looping” parts of a program. This is unproven at this point, but is bolstered by the observation that many data-oriented programs fall into one of the following patterns:

Pure queries consist only of relational operators. Such programs are often small, are structurally similar to the equivalent SQL, and can be planned in a similar fashion.
Queries with locally defined scalar functions can be separated into a pure query that makes calls into user-defined functions. The user-defined functions do not invoke relational operators and therefore the query can be planned as normal.
Queries in loops can be converted into a parameterized query that is executed under the control of a functional program.
Iterative queries, for example queries that add to a set until it reaches a fixed point, can be planned and executed using techniques such as stratified recursion.
User-defined table functions connected by relational operators. This is essentially the MapReduce pattern. What happens in the table functions is beyond our control, but if we know something about their inputs and outputs (for example, that an aggregate function can be computed by examining only rows with the same key) then the framework can assist in running the query reliably and in parallel.

In all of these patterns, if we can recognize the ‘query’ parts we can optimize them using conventional techniques. If we cannot recognize any ‘query’ parts, nothing is lost; we can still execute the whole as a functional program.
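
As an illustration of the "iterative queries" pattern, here is a minimal Python sketch (not Morel, and not how any particular planner would execute it) that applies a join-like step repeatedly until the result set stops growing. The edges relation and the reachable function are made up for this example.

```python
def reachable(edges, start):
    """Return the set of nodes reachable from start by following edges."""
    result = {start}
    while True:
        # One "query" step: join the current result with the edges relation.
        step = {dst for (src, dst) in edges if src in result}
        new_result = result | step
        if new_result == result:
            return result  # nothing new was added: a fixed point
        result = new_result


# A small made-up graph, stored as a relation of (source, target) pairs.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]

print(sorted(reachable(edges, "a")))
```

The loop is ordinary control flow, while the body of each iteration is a pure relational expression. That body is the part a planner could recognize and optimize.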

Conclusion

Morel is an exciting experimental language that combines the best aspects of database query languages and functional programming.

In this brief introduction, I have not gone into the details of Morel’s syntax, semantics or implementation, but examples can be found on the Morel site and in Morel’s test suite, and I plan to write more blog posts over the following months.

If you have comments, please reply on Twitter:

Is it easier to add functions and polymorphic type system to SQL, or to add relational operations to a functional language? https://t.co/WEf9LbChWX
— Julian Hyde (@julianhyde) February 26, 2020

Blog reboot

2020-02-18T10:51:00-08:00

Welcome back!

Almost six years since my last blog post, I’m sharpening my electronic pencil to start writing again.

From Blogger to Jekyll

Previously at Blogger, the blog is now self-hosted at blog.hydromatic.net. I figured that the readers wouldn’t miss the cheesy ads, I won’t miss the spam comments, and the conversations started by the posts can all happen on Twitter or in other people’s blogs.

I’ve ported all of the previous posts to Jekyll and Markdown, and stored them in GitHub. They’re available in the index. I must say, having full control over my content is quite a relief.

Update

A few things have changed since my last post, but many things have stayed the same. Around 2012, I was writing a lot about the Mondrian OLAP engine and olap4j API, but my work on Mondrian has since tailed off. I had a promising SQL parsing/planning project called Optiq that has since moved to the Apache Software Foundation and is thriving as Apache Calcite.

My work on streaming SQL, which started at SQLstream, has continued in Apache Calcite, found collaborators in Apache Beam and Flink, and resulted in a well-received paper at SIGMOD 2019. With luck, a future version of the SQL standard will have extensions for streaming queries.

At that time I worked for SQLstream and Pentaho; I have since worked on Apache Calcite, Hive and Hadoop at Hortonworks, and I now work at Looker (which last week completed its merger with Google).

Why blog?

Why did I stop blogging? At that time, Calcite was starting to grow really fast and my energies went into Calcite features, releases and conference talks. Oh, and I also had a newborn and a 3-year-old.

Twitter played its part. As the leading microblogging service, it allowed me to vent my passion and bounce those ideas off an audience. But by the time I had blown off steam in 140 characters, I no longer had the passion or outrage to rework the idea into a blog post.

But I’ve come to realize that ideas need more room to breathe. In technology, why you are doing something is often more important than what you are doing. You can develop an idea over several posts, and bring your audience along. After the product is complete, the blog will show the thinking that went into it (and perhaps a few wrong turns along the way). That’s what I hope to do here.

The cool stuff

I am interested in databases and business intelligence (BI), especially from the perspective of relational algebra and query optimization. I want to extend the database paradigm to areas such as streaming queries and geospatial data, and to work on language design, to make database technology more useful. Lastly, I want to bring the technology to a wide audience via open source software.

I’ll be writing about those things here. I’m especially keen to introduce you to Morel, a language that I am developing. It is a small functional language derived from Standard ML that is also an elegant and powerful database query language, and I think it has a bright future.

Watch this space.

Comments

If you have comments, please reply on Twitter:

Why I stopped blogging - and why I started again https://t.co/xDFbdGWEoA
— Julian Hyde (@julianhyde) February 18, 2020

Table macros

2014-05-05T15:04:00-07:00

Table macros are a new Optiq feature (since release 0.6) that combines the efficiency of tables with the flexibility of functions.

Optiq offers a convenient model for presenting data from multiple external sources via a single, efficient SQL interface. Using adapters, you create a schema for each external source, and a table for each data set within a source.

But sometimes the external data source does not consist of a fixed number of data sets, known ahead of time. Consider, for example, Optiq’s web adapter, optiq-web, which makes any HTML table in any web page appear as a SQL table. Today you can create an Optiq model and define within it several tables.

Optiq-web’s home page shows an example where you can create a schema with tables “Cities” and “States” (based on the Wikipedia pages List of states and territories of the United States and List of United States cities by population) and execute a query to find out the proportion of California’s population that lives in cities:

SELECT COUNT(*) "City Count",
  SUM(100 * c."Population" / s."Population") "Pct State Population"
FROM "Cities" c, "States" s
WHERE c."State" = s."State" AND s."State" = 'California';

But what if you want to query a URL that isn’t in the schema? A table macro will allow you to do this:

SELECT * FROM TABLE(
  web('https://en.wikipedia.org/wiki/List_of_countries_by_population'));

web is a function that returns a table. That is, a Table object, which is the definition of a table. In Optiq, a table definition doesn’t need to be assigned a name and put inside a schema, although most do; this is a free-floating table. A table just needs to be able to describe its columns, and to be able to convert itself to relational algebra. Optiq invokes it while the query is being planned.

Here is the WebTableMacro class:

public class WebTableMacro {
  public Table eval(String url) {
    Map<String, Object> operands = new HashMap<String, Object>();
    operands.put("url", url);
    return new WebTable(operands, null);
  }
}

And here is how you define a WEB function based upon it in your JSON model:

{
  "version": "1.0",
  "defaultSchema": "ADHOC",
  "schemas": [
    {
      "name": "ADHOC",
      "functions": [
        {
          "name": "WEB",
          "className": "com.example.WebTableMacro"
        }
      ]
    }
  ]
}

Table macros are a special kind of table function. They are defined in the same way in the model, and invoked in the same way from a SQL statement. A table function can be used at prepare time if (a) its arguments are constants, and (b) the table it returns implements TranslatableTable. If it fails either of those tests, it will be invoked at runtime; it will still produce results, but will have missed out on the advantages of being part of the query optimization process.

What kind of advantages can the optimization process bring? Suppose a web page that produces a table supports URL parameters to filter on a particular column and sort on another. We could write planner rules that take a FilterRel or SortRel on top of a WebTableScan and convert them into a scan with extra URL parameters. A table that came from the web function would be able to participate in that process.
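
To give a feel for what such a rule does, here is a small Python sketch. It is not Optiq's API: Scan, Filter and push_filter_into_scan are hypothetical stand-ins for a web table scan, a filter operator and a planner rule, and the sketch assumes that the web page accepts a URL parameter named after the column being filtered.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class Scan:
    """Hypothetical stand-in for a scan of an HTML table at a URL."""
    url: str
    params: tuple = ()  # extra URL parameters as (name, value) pairs

    def full_url(self):
        if not self.params:
            return self.url
        return self.url + "?" + urlencode(self.params)


@dataclass(frozen=True)
class Filter:
    """Hypothetical stand-in for an equality filter on one column."""
    input: object
    column: str
    value: str


def push_filter_into_scan(node):
    """Rewrite Filter(Scan) into a Scan with one more URL parameter."""
    if isinstance(node, Filter) and isinstance(node.input, Scan):
        scan = node.input
        return Scan(scan.url, scan.params + ((node.column, node.value),))
    return node  # the rule does not match; leave the plan unchanged


plan = Filter(Scan("http://example.com/table"), "state", "CA")
optimized = push_filter_into_scan(plan)
print(optimized.full_url())
```

After the rewrite, the filter no longer appears in the plan; the remote page does the filtering and fewer rows come back over the network.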

The name ‘table macros’ is inspired by Lisp macros – functions that are invoked at compile time rather than run time. Macros are an extremely powerful feature in Lisp and I hope they will prove to be a powerful addition to SQL. But to SQL users, a more familiar name might be ‘parameterized views’.

Views and table macros are both expanded to relational algebra before the query is optimized. Views are specified in SQL, whereas table macros invoke user code (it takes some logic to handle those parameters). Under the covers, Optiq’s views are implemented using table macros. (They always have been – we’ve only just got around to making table macros a public feature.)

To sum up: table macros are a powerful new Optiq feature that extends the reach of Optiq to data sources that have not been pre-configured into an Optiq model. They are a generalization of SQL views, and share with views the efficiency of expanding relational expressions at query compilation time, where they can be optimized. Table macros will help bring a SQL interface to yet more forms of data.

Improvements to Optiq’s MongoDB adapter

2014-03-19T13:39:00-07:00

It’s been a while since I posted to this blog, but I haven’t been idle. Quite the opposite; I’ve been so busy writing code that I haven’t had time to write blog posts. A few months ago I joined Hortonworks, and I’ve been improving Optiq on several fronts, including several releases, adding a cost-based optimizer to Hive and some other initiatives to make Hadoop faster and smarter.

More about those other initiatives shortly. But Optiq’s mission is to improve access to all data, so here I want to talk about improvements to how Optiq accesses data in MongoDB. Optiq can now translate SQL queries to extremely efficient operations inside MongoDB.

MongoDB 2.2 introduced the aggregation framework, which allows you to compose queries as pipelines of operations. They have basically implemented relational algebra, and we wanted to take advantage of this.

As the following table shows, most of those operations map onto Optiq’s relational operators. We can exploit that fact to push SQL query logic down into MongoDB.

MongoDB operator	Optiq operator
$project	ProjectRel
$match	FilterRel
$limit	SortRel.limit
$skip	SortRel.offset
$unwind	-
$group	AggregateRel
$sort	SortRel
$geoNear	-

A bug report pointed out that it would be more efficient if we evaluated $match before $project. As I fixed that bug yesterday, I decided to push down limit and offset operations. (In Optiq, these are just attributes of a SortRel; a SortRel sorting on 0 columns can be created if you wish to apply limit or offset without sorting.)

That went well, so I decided to go for the prize: pushing down aggregations. This is a big performance win because the output of a GROUP BY query is often a lot smaller than its input. It is much more efficient for MongoDB to aggregate the data in memory, returning a small result, than to return a large amount of raw data to be aggregated by Optiq.

Now queries involving SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, OFFSET, FETCH (or LIMIT if you prefer the PostgreSQL-style syntax), not to mention sub-queries, can be evaluated in MongoDB. (JOIN, UNION, INTERSECT, MINUS cannot be pushed down because MongoDB does not support those relational operators; Optiq will still evaluate those queries, pushing down as much as it can.)

Let’s see some examples of push-down in action.

Given the query:

SELECT state, COUNT(*) AS c
FROM zips
GROUP BY state

Optiq evaluates:

db.zips.aggregate(
   {$project: {STATE: '$state'}},
   {$group: {_id: '$STATE', C: {$sum: 1}}},
   {$project: {STATE: '$_id', C: '$C'}})

and returns

STATE=WV; C=659
STATE=WA; C=484

Now let’s add a HAVING clause to find out which states have more than 1,500 zip codes:

SELECT state, COUNT(*) AS c
FROM zips
GROUP BY state
HAVING COUNT(*) > 1500

Optiq adds a $match operator to the previous query’s pipeline:

db.zips.aggregate(
   {$project: {STATE: '$state'}},
   {$group: {_id: '$STATE', C: {$sum: 1}}},
   {$project: {STATE: '$_id', C: '$C'}},
   {$match: {C: {$gt: 1500}}})

and returns

STATE=NY; C=1596
STATE=TX; C=1676
STATE=CA; C=1523

Now the pièce de résistance. The following query finds the top 5 states in terms of number of cities (and remember that each city can have many zip-codes).

SELECT state, COUNT(DISTINCT city) AS cdc
FROM zips
GROUP BY state
ORDER BY cdc DESC
LIMIT 5

COUNT(DISTINCT {column}) is difficult to implement because it requires the data to be aggregated twice – once to compute the set of distinct values, and once to count them within each group. For this reason, MongoDB doesn’t implement distinct aggregations. But Optiq translates the query into a pipeline with two $group operators. For good measure, we throw in ORDER BY and LIMIT clauses.

The result is an awe-inspiring pipeline that includes two $group operators (implementing the two phases of aggregation for distinct-count), and finishes with $sort and $limit.

db.zips.aggregate(
  {$project: {STATE: '$state', CITY: '$city'}},
  {$group: {_id: {STATE: '$STATE', CITY: '$CITY'}}},
  {$project: {_id: 0, STATE: '$_id.STATE', CITY: '$_id.CITY'}},
  {$group: {_id: '$STATE', CDC: {$sum: {$cond: [ {$eq: ['CITY', null]}, 0, 1]}}}},
  {$project: {STATE: '$_id', CDC: '$CDC'}},
  {$sort: {CDC: -1}},
  {$limit: 5})
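
The two phases are perhaps easier to see outside of MongoDB's syntax. The following Python sketch is an illustration of the technique, not code generated by Optiq; the function top_states and the sample rows are made up for the example.

```python
from collections import Counter


def top_states(zips, limit):
    """Count distinct cities per state and return the top `limit` states."""
    # Phase 1: group by (state, city), leaving one entry per distinct pair.
    distinct_pairs = {(z["state"], z["city"]) for z in zips}

    # Phase 2: group by state, counting the non-null cities in each group.
    counts = Counter()
    for state, city in distinct_pairs:
        counts[state] += 0 if city is None else 1

    # ORDER BY cdc DESC LIMIT n
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return ordered[:limit]


# Made-up sample rows; a city can appear under several zip codes.
zips = [
    {"state": "CA", "city": "LOS ANGELES", "zip": "90001"},
    {"state": "CA", "city": "LOS ANGELES", "zip": "90002"},
    {"state": "CA", "city": "SAN DIEGO", "zip": "92101"},
    {"state": "CA", "city": "FRESNO", "zip": "93650"},
    {"state": "NY", "city": "NEW YORK", "zip": "10001"},
    {"state": "NY", "city": "NEW YORK", "zip": "10002"},
    {"state": "NY", "city": "ALBANY", "zip": "12201"},
    {"state": "WV", "city": "CHARLESTON", "zip": "25301"},
]

print(top_states(zips, 2))
```

The first grouping removes duplicate (state, city) pairs, so the second grouping can use a plain count, which is the role played by the two $group stages in the pipeline above.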

I had to jump through some hoops to get this far, because MongoDB’s expression language can be baroque. In one case I had to generate

{$ifNull: [null, 0]}

in order to include the constant 0 in a $project operator. And I was foiled by MongoDB bug SERVER-4589 when trying to access the values inside the zips table’s loc column, which contains (latitude, longitude) pairs represented as an array.

In conclusion, Optiq on MongoDB now does a lot of really smart stuff. It can evaluate any SQL query, and push down a lot of that evaluation to be executed efficiently inside MongoDB.

I encourage you to download Optiq and try running some sophisticated SQL queries (including those generated by the OLAP engine I authored, Mondrian).

Efficient SQL queries on MongoDB

2013-06-17T17:15:00-07:00

How do you integrate MongoDB with other data in your organization? MongoDB is great for building applications, and it has its own powerful query API, but it’s difficult to mash up data between MongoDB and other tools, or to make tools that speak SQL, such as Pentaho Analysis (Mondrian), connect to MongoDB.

Building a SQL interface isn’t easy, because MongoDB’s data model is such a long way from SQL’s model. Here are some of the challenges:

MongoDB doesn’t have a schema. Each database has a number of named ‘collections’, which are the nearest thing to a SQL table, but each row in a collection can have a completely different set of columns.
In MongoDB, data can be nested. Each row consists of a number of fields, and each field can be a scalar value, null, a record, or an array of records.
MongoDB supports a number of relational operations, but doesn’t use the same terminology as SQL: the find method supports the equivalent of SELECT and WHERE, while the aggregate method supports the equivalent of SELECT, WHERE, GROUP BY, HAVING and ORDER BY.
For efficiency, it’s really important to push as much of the processing as possible down to MongoDB’s query engine, without the user having to re-write their SQL.
But MongoDB doesn’t support anything equivalent to JOIN.
MongoDB can’t access external data.
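The first two challenges can be illustrated with a small sketch in plain Python. This is an assumption about one possible approach, not a description of how Optiq does it: treat each document as a map, project a fixed list of columns, and let missing fields become NULL. The to_row function and the second document are hypothetical.

```python
def to_row(doc):
    """Project a schema-less document onto a fixed set of columns."""
    # Nested array: spread its two elements into positional columns.
    loc = doc.get("loc") or [None, None]
    # Missing fields come back as None, playing the role of SQL NULL.
    return (doc.get("city"), doc.get("state"), doc.get("pop"), loc[0], loc[1])

docs = [
    {"city": "ACMAR", "loc": [-86.51557, 33.584132], "pop": 6055,
     "state": "AL", "_id": "35004"},
    {"city": "NOWHERE", "_id": "00000"},  # hypothetical document with missing fields
]

rows = [to_row(d) for d in docs]
```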

I decided to tackle this using Optiq. Optiq already has a SQL parser and a powerful query optimizer that is powered by rewrite rules. Building on Optiq’s core rules, I can add rules that map tables onto MongoDB collections, and relational operations onto MongoDB’s find and aggregate operators.

What I produced is effectively a JDBC driver for MongoDB. Behind it is a hybrid query-processing engine that pushes as much of the query processing as possible down to MongoDB, and does whatever is left (such as joins) in the client.
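The division of labor can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Optiq's implementation: pushed_down_scan stands in for the filter and projection that would be sent to MongoDB, client_hash_join is the part left to the client, and the state_names table is a hypothetical external data source.

```python
def pushed_down_scan(collection, predicate, fields):
    """Stand-in for work sent to MongoDB: filter and project at the source."""
    return [{f: doc.get(f) for f in fields}
            for doc in collection if predicate(doc)]

def client_hash_join(left, right, key):
    """Join performed in the client, because the source has no JOIN."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

zips = [
    {"_id": "35004", "city": "ACMAR", "state": "AL", "pop": 6055},
    {"_id": "35005", "city": "ADAMSVILLE", "state": "AL", "pop": 10616},
    {"_id": "35006", "city": "ADGER", "state": "AL", "pop": 3205},
]
# Hypothetical external table that MongoDB cannot see.
state_names = [{"state": "AL", "name": "Alabama"}]

# Only the matching rows and the needed fields leave the "server".
big = pushed_down_scan(zips, lambda d: d["pop"] > 5000, ["city", "state"])
joined = client_hash_join(big, state_names, "state")
```

The more of the filter and projection that runs at the source, the fewer rows the client has to pull across the network before joining.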

Let’s give it a try. First, install MongoDB, and import MongoDB’s zipcode data set:

$ curl -o /tmp/zips.json https://media.mongodb.org/zips.json
$ mongoimport --db test --collection zips --file /tmp/zips.json
Tue Jun  4 16:24:14.190 check 9 29470
Tue Jun  4 16:24:14.469 imported 29470 objects

Log into MongoDB to check it’s there:

$ mongo
MongoDB shell version: 2.4.3
connecting to: test
> db.zips.find().limit(3)
{ "city" : "ACMAR", "loc" : [ -86.51557, 33.584132 ], "pop" : 6055, "state" : "AL", "_id" : "35004" }
{ "city" : "ADAMSVILLE", "loc" : [ -86.959727, 33.588437 ], "pop" : 10616, "state" : "AL", "_id" : "35005" }
{ "city" : "ADGER", "loc" : [ -87.167455, 33.434277 ], "pop" : 3205, "state" : "AL", "_id" : "35006" }
> exit
bye

Now let’s see the same data via SQL. Download and install Optiq:

$ git clone https://github.com/julianhyde/optiq.git
$ cd optiq
$ mvn install

Optiq comes with a sample model in JSON format, and the sqlline SQL shell. Connect using the mongo-zips-model.json Optiq model, and use sqlline’s !tables command to list the available tables.

$ ./sqlline
sqlline> !connect jdbc:optiq:model=mongodb/target/test-classes/mongo-zips-model.json admin admin
Connecting to jdbc:optiq:model=mongodb/target/test-classes/mongo-zips-model.json
Connected to: Optiq (version 0.4.13)
Driver: Optiq JDBC Driver (version 0.4.13)
Autocommit status: true
Transaction isolation: TRANSACTION_REPEATABLE_READ
sqlline> !tables
+------------+--------------+-----------------+---------------+
| TABLE_CAT  | TABLE_SCHEM  |   TABLE_NAME    |  TABLE_TYPE   |
+------------+--------------+-----------------+---------------+
| null       | mongo_raw    | zips            | TABLE         |
| null       | mongo_raw    | system.indexes  | TABLE         |
| null       | mongo        | ZIPS            | VIEW          |
| null       | metadata     | COLUMNS         | SYSTEM_TABLE  |
| null       | metadata     | TABLES          | SYSTEM_TABLE  |
+------------+--------------+-----------------+---------------+

Each collection in MongoDB appears here as a table. There are also the COLUMNS and TABLES system tables provided by Optiq, and a view called ZIPS defined in mongo-zips-model.json.

Let’s try a simple query. How many zip codes in America?

sqlline> SELECT count(*) FROM zips;
+---------+
| EXPR$0  |
+---------+
| 29467   |
+---------+
1 row selected (0.746 seconds)

Now a more complex one. How many states have a city called Springfield?

sqlline> SELECT count(DISTINCT state) AS c FROM zips WHERE city = 'SPRINGFIELD';
+-----+
|   C |
+-----+
| 20  |
+-----+
1 row selected (0.549 seconds)

Let’s use the SQL EXPLAIN command to see how the query is implemented.

sqlline> !set outputformat csv
sqlline> EXPLAIN PLAN FOR
. . . .> SELECT count(DISTINCT state) AS c FROM zips WHERE city = 'SPRINGFIELD';

'PLAN'
'EnumerableAggregateRel(group=[{}], C=[COUNT($0)])
  EnumerableAggregateRel(group=[{0}])
    EnumerableCalcRel(expr#0..4=[{inputs}], expr#5=['SPRINGFIELD'], expr#6=[=($t0, $t5)], STATE=[$t3], $condition=[$t6])
      MongoToEnumerableConverter
        MongoTableScan(table=[[mongo_raw, zips]], ops=[[<{city: 1, state: 1, _id: 1}, {$project ...}>]])
'
1 row selected (0.115 seconds)

The last line of the plan shows that Optiq calls MongoDB’s find operator asking for the city, state and _id fields. The first three lines of the plan show that the filter and aggregation are implemented using Optiq’s built-in operators, but we’re working on pushing them down to MongoDB.

Finally, quit sqlline.

sqlline> !quit
Closing: net.hydromatic.optiq.jdbc.FactoryJdbc41$OptiqConnectionJdbc41

Optiq and its MongoDB adapter shown here are available on GitHub. If you are interested in writing your own adapter, check out optiq-csv, a sample adapter for Optiq that makes CSV files appear as tables. It has its own tutorial on writing adapters.

Check back at this blog over the next few months, and I’ll show how to write views and advanced queries using Optiq, and how to use Optiq’s other adapters.

Gathering requirements for olap4j 2.0

2013-06-03T10:35:00-07:00

It’s time to start thinking about olap4j version 2.0.

My initial goal for olap4j version 1.0 was to decouple application developers from Mondrian’s legacy API. We’ve far surpassed that goal. Many applications are using olap4j to connect to OLAP servers like Microsoft SQL Server Analysis Services, Palo and SAP BW. And projects are leveraging the olap4j-xmlaserver sister project to provide an XMLA interface on their own OLAP server. The need is greater than ever to comply with the latest standards.

The difference between products and APIs is that you can’t change APIs without pissing people off. Even if you improve the API, you force the developers of the drivers to implement the improvements, and the users of the API get upset because they don’t have their new drivers yet. There are plenty of improvements to make to olap4j, so let’s try to do it without pissing too many people off!

Since olap4j version 1.0, there has been a new release of Mondrian (well, 4.0 is not released officially yet, but the metamodel and API are very nearly fully baked) and a new release of SQL Server Analysis Services, the home of the de facto XMLA standard.

Also, the Mondrian team have spun out their XMLA server as a separate project (olap4j-xmlaserver) that can run against any olap4j driver. If this server is to implement the latest XMLA specification, it needs the underlying olap4j driver to give it all the metadata it needs.

Here’s an example of the kind of issue that we’d like to fix. In olap4j 1.x, you can’t tell whether a hierarchy is a parent-child hierarchy. People have asked for a method

boolean isParentChild();

Inspired by the STRUCTURE attribute of the MDSCHEMA_HIERARCHIES XMLA request, we instead propose to add

enum Structure {
  FULLYBALANCED,
  RAGGEDBALANCED,
  UNBALANCED,
  NETWORK
}
Structure getStructure();

We can’t add this without requiring a new revision of all drivers, but let’s be careful to gather all the requirements so we can do it just this once.

Here are my goals for olap4j 2.0:

Support Analysis Services 2012 metamodel and XMLA as of Analysis Services 2012.
Create an enum for each XMLA enum. (Structure, above, is an example.)
Support Mondrian 4.0 metamodel. Many of the new Mondrian features, such as measure groups and attributes, are already in SSAS and XMLA.
Allow user-specified metadata, such as those specified in Mondrian’s schema as annotations, to be passed through the olap4j API and XMLA driver.
We’ll know that we’ve done the right thing if we can remove MondrianOlap4jExtra.

I’d also like to maintain backwards compatibility. As I already said, drivers will need to be changed. But any application that worked against olap4j 1.1 should work against olap4j 2.0, and any driver for olap4j 2.0 should also function as an olap4j 1.x driver. That should simplify things for the users.

I’ll be gathering a detailed list of API improvements in the olap4j 2.0 specification. If you have ideas for what should be in olap4j version 2.0, now is the time to get involved!