Ordered and unordered data
Despite what the relational model says, some data is ordered.
I’m not talking about sorted data. If you sort a collection, applying a comparator function to its elements, then you have no more information than you had before.
No, the integer list
[3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
and the string list
["Shall I compare thee to a summer's day?",
"Thou art more lovely and more temperate",
"Rough winds do shake the darling buds of May",
"And summer's lease hath all too short a date"]
depend on the order of their elements for their meaning.
But of course, some data is unordered, for good reason. A relational database would be foolish to guarantee that if you write rows into a table in a particular order, they will be read back in the same order. Such a guarantee would seriously limit the database’s scalability.
This post is about how we allow ordered and unordered data to coexist in Morel.
We achieved this with a collection of new features, including
adding a bag type,
the
ordered relational operators,
the
ordinal keyword,
and a new
unorder step.
All of these features will appear shortly in Morel release 0.7.
List and bag types
As a functional query language, Morel spans the worlds of database and functional programming.
Databases’ fundamental type, the relation, is an unordered collection
of records. (Though curiously, modern SQL allows columns to contain
“nested tables”, which can be either of the ordered ARRAY type or
the unordered MULTISET type.)
Functional programming languages’ fundamental type is the list, an ordered type. Functional programs are often defined by structural induction on lists. For example, the function
fun allPositive [] = true
| allPositive (x::xs) = x > 0 andalso allPositive xs;
> val allPositive = fn : int list -> bool
inductively defines that a list of numbers is “all-positive” if it is empty, or if its first element is positive and the rest of the list is “all-positive”. This kind of inductive definition requires a firm distinction between the first element of a list and the rest of the list, a distinction that is not present in an unordered collection.
So, Morel needs to support both ordered and unordered collections.
Earlier versions of Morel papered over the difference. All collections
had type list, even the unordered collections backed by database
tables. Morel’s relational operators produced results in deterministic
order if you applied them to in-memory collections using the
in-process interpreter, but order was not guaranteed when Morel
converted the query to SQL for execution in a DBMS.
To fix the problem, the first step was to add a bag type. (Bag is a
synonym for multiset,
implying a given element may occur more than once, but iteration order
is not defined.) bag is the unordered counterpart to the ordered
list type, and has similar operations.
val b = bag [3, 1, 4, 1, 5];
> val b = [3,1,4,1,5] : int bag
Bag.length b;
> val it = 5 : int
Bag.toList b;
> val it = [3,1,4,1,5] : int list
Bag.fromList [3, 1, 4, 1, 5];
> val it = [3,1,4,1,5] : int bag
Order-dependent operations from the list type, such as hd and
drop, are defined for bag instances, but they are not guaranteed
to return the same result every time you call them.
Bag.hd b;
> val it = 3 : int
Bag.drop (b, 2);
> val it = [4,1,5] : int bag
Collections backed by database tables now have type bag:
from e in scott.depts;
> deptno dname loc
> ------ ---------- --------
> 10 ACCOUNTING NEW YORK
> 20 RESEARCH DALLAS
> 30 SALES CHICAGO
> 40 OPERATIONS BOSTON
>
> val it : {deptno:int, dname:string, loc:string} bag
(You may notice that scott.depts collection, backed by the DEPT
table of the SCOTT JDBC data source, has changed its name as well
as its type. It used to be called scott.dept. Morel collection names
should be plural and lower-case, and improvements to the
name mapping system
make it easier to derive proper collection names.)
Next, we provide relational operators to convert between list and
bag.
Converting between ordered and unordered
Now that queries can reference list and bag collections, we need
operators to convert from one to the other. To do this, we use the
existing order step, and add an unorder step and an ordinal
expression.
In previous versions of Morel, the order step converted a list to a
list with a different ordering; now its input can be a list or a
bag:
from i in [3, 1, 4, 1, 5]
order DESC i;
> val it = [5,4,3,1,1] : int list
from i in bag [3, 1, 4, 1, 5]
order DESC i;
> val it = [5,4,3,1,1] : int list
If the sort key does not create a total ordering, the results will be nondeterministic but still a list. For example, we can sort integers so that even numbers occur before odd numbers
from i in bag [3, 1, 4, 1, 5]
order i mod 2;
> val it = [4, 1, 5, 1, 3]
or convert a bag to a list in arbitrary order.
from i in bag [3, 1, 4, 1, 5]
order ();
> val it = [5, 4, 1, 1, 3]
To go the opposite direction, the new unorder step converts a list
to a bag:
from i in [3, 1, 4, 1, 5]
unorder;
> val it = [3,1,4,1,5] : int bag
(You are also free to apply unorder to a bag; it will have no
effect.)
As we said above, a bag contains less information than its
corresponding list. If you plan to convert the bag to a list
at a later stage, you need to store the ordering in an extra field.
The new ordinal expression lets us do this:
from i in [3, 1, 4, 1, 5]
yield {i, j = ordinal}
unorder
order j
yield i;
> val it = [3,1,4,1,5] : int list
The ordinal expression can be used in an expression in a
step whose input is ordered (except the steps whose expressions are
evaluated before the query starts: except, intersect, skip,
take, and union). ordinal evaluates to 0 for the first element,
1 for the next element, and so on. But as we shall see, the optimizer
avoids evaluating ordinal if it can.
Here is a query that computes the salary rank of each employee, then returns only the poorly-paid clerks.
from e in scott.emps
order e.sal
yield {e, rank = 1 + ordinal}
where e.job = "CLERK";
> ename rank
> ------ ----
> MILLER 9
> ADAMS 12
> JAMES 13
> SMITH 14
>
> val it : {ename:string, rank:int} list
The main reason to apply order and unorder in a query is
to control the target collection type. But there is a more subtle
reason which relates to performance. The ordered and unordered
versions of the relational operators may produce the same results
(modulo ordering) but ordered execution may be less efficient (such
as running with reduced parallelism). If a query contains an order
or unorder, the order of the input to that step is irrelevant, and
the optimizer can use a more efficient execution plan.
This, by the way, is why the specification of the order step does
not guarantee stability. If order was stable, the optimizer would
have to use ordered execution of upstream steps if the sort key is
not exhaustive.
If you want order to be stable, you can add ordinal to the
trailing edge of the sort key:
from e in scott.emps
order DESC e.sal
where e.deptno <> 20
yield {e.ename, e.job, e.sal}
order (e.job, ordinal);
> val it =
> [{ename="MILLER",job="CLERK",sal=1300.0},
> {ename="JAMES",job="CLERK",sal=950.0},
> {ename="BLAKE",job="MANAGER",sal=2850.0},
> {ename="CLARK",job="MANAGER",sal=2450.0},
> {ename="KING",job="PRESIDENT",sal=5000.0},
> {ename="ALLEN",job="SALESMAN",sal=1600.0},
> {ename="TURNER",job="SALESMAN",sal=1500.0},
> {ename="WARD",job="SALESMAN",sal=1250.0},
> {ename="MARTIN",job="SALESMAN",sal=1250.0}]
> : {ename:string, job:string, sal:real} list
Materializing ordinal as a 1-based, contiguous sequence of integers
is expensive because it forces sequential execution, and the
optimizer will avoid this if it can. In this case, because ordinal
is used for sorting but is not returned, the optimizer downgrades
ordinal to a virtual expression. The plan might use an ordered
implementation of the where and yield steps followed by a stable
sort, or it might replace ordinal with the previous sort key
(DESC e.sal).
Ordered relational operators
We need to define the semantics of the relational operators over all types of collection.
Part of the job has been done already:
- The relational model defines the semantics of operators over sets (unordered collections without duplicates).
- The SQL standard specifies the relational operators over tables (unordered collections with duplicates).
- Previous versions of Morel defined semantics for (and implemented)
relational operators over multisets (unordered collections with
duplicates). While the collection type was at the time called
list, we were actually defining the currentbagtype. Unlike SQL, elements need not be records.
What remains is to define the semantics of queries over lists (ordered collections with duplicates), and for hybrid queries that combine lists and multisets. (We define hybrid semantics in the next section.)
Because a
query
consists of a sequence of steps, each corresponding to a relational
operator, we define the semantics of each step over input that is a
list:
- The first step in a query –
from pat in exp,forall pat in exp, orexists pat in exp– returns elements in the same order that they are emitted from exp. join pat in exp [ on condition ]for each element from its input evaluates exp, then, in order of those elements, emits a record consisting of fields of the two elements, skipping records where condition is false.- If a
from,forall,existsorjoinstep has more than one scan, each subsequent scan behaves as if it were a separatejoinstep. yield exppreserves order.where conditionpreserves order, dropping rows for which condition is false.skip countandtake countpreserve order (respectively dropping the first count rows, or taking the first count rows).distinctpreserves order, emitting only the first occurrence of each element.group groupKey1, ..., groupKeyg [ compute agg1, ..., agga ]preserves order, emitting groups in the order that the first element in the group was seen; each aggregate functionaggiis invoked with a list of the input elements that belong to that group, in arrival order.compute agg1, ..., aggabehaves as agroupstep where all input elements are in the same group.union [ distinct ] exp1, ..., expnoutputs the elements of the input in order, followed by the elements of each expi argument in order (just like the UNIXcatcommand). Ifdistinctis specified, outputs only the first occurrence of each element.intersect [ distinct ] exp1, ..., expnoutputs the elements of the input in order, provided that every expi argument contains at least the number of occurrences of this element so far. Ifdistinctis specified, outputs only the first occurrence of each element.except [ distinct ] exp1, ..., expnoutputs the elements of the input in order, provided that the number of occurrences of that element so far is less than the number of occurrences of that element in all the expi arguments. Ifdistinctis specified, outputs only the first occurrence of each element.require condition(which can only occur in aforallquery) has the same behavior as the unordered case.orderandunorder, as discussed earlier, have the same semantics as in the unordered case.
The rules for from and join produce the same familiar ordering as
a nested “for” loop in a language such as C, Python or Java:
from hundreds in [100, 200, 300],
tens in [10, 20, 30]
join units in [1, 2, 3]
yield hundreds + tens + units;
> val it =
> [111,112,113,121,122,123,131,132,133,211,212,213,221,222,
> 223,231,232,233,311,312,313,321,322,323,331,332,333]
The rules for union, intersect and except are rather subtle, and
are best illustrated by example:
from i in [3, 1, 4, 1, 5, 9, 2, 6]
union [2, 7, 1, 8, 2, 8, 1, 8];
> val it = [3,1,4,1,5,9,2,6,2,7,1,8,2,8,1,8] : int list
from i in [3, 1, 4, 1, 5, 9, 2, 6]
intersect [2, 7, 1, 8, 2, 8, 1, 8];
> val it = [1,1,2] : int list
from i in [3, 1, 4, 1, 5, 9, 2, 6]
except [2, 7, 1, 8, 2, 8, 1, 8];
> val it = [3,4,5,6,9] : int list
Hybrid relational operators
We have specified the behavior of queries where input collections are all lists or all bags. But what if a query has a mix of list and bag inputs?
The mixing can occur if the first step of the query (from, exists,
or forall) has more than one scan, or in steps that introduce
another collection (join, union, intersect, or except). In all
cases, unordered wins: if any input is a bag, the step becomes
unordered, and unordered semantics apply from then on.
For example, if we join a list of department numbers (ordered) to a
table of employees (unordered), selecting only the clerks and
managers, the result is a bag (unordered):
from deptno in [10, 20, 30]
join e in scott.emps on e.deptno = deptno
where e.job elem ["CLERK", "MANAGER"]
yield {deptno, e.ename};
> deptno ename
> ------ ------
> 30 JAMES
> 10 CLARK
> 20 ADAMS
> 10 MILLER
> 20 SMITH
> 30 BLAKE
> 20 JONES
>
> val it : {deptno:int, ename:string} bag
Type inference challenges
This feature was challenging to implement because it required major changes to Morel’s type inference algorithm. (We mention this only in the spirit of sharing war-stories, and for the interest of those who understand the internal workings of Morel’s compiler. Hopefully, the changes to type-inference algorithm will be invisible to the casual user.)
The problem is evident in a program such as
let
fun f (xs, ys) =
from i in xs
intersect ys
in
f ((from e in scott.emps yield e.empno), [7521, 7782, 8000])
end;
> val it = [7521,7782] : int bag
While resolving the type of function f and its embedded query, the
types of the arguments xs and ys have not yet been
determined. Morel’s previous type inference algorithm allowed us to
say “xs and ys must be lists with the same element type” or
“xs and ys must be bags with the same element type”. It was
based on
Hindley-Milner’s Algorithm W
and unification, which basically means finding an assignment of
logical variables so that two trees are structurally identical.
But the type inference rules for queries with a mixture of ordered and
unordered collections require conditions that contain the word
‘or’. For example, resolving the intersect expression above requires
that we say “we can allow xs and ys to both be bags, or both be
lists, or one to be a bag and the other a list, but they must have
same element type”. Furthermore, we need to derive the result
type, saying “the result of the query is a list if both arguments
are lists, otherwise a bag, with the same element type as the
arguments”.
We needed a system where we can place a number of constraints on type
variables, and then solve for those constraints. The new type
inference algorithm extends Hindley-Milner with constraints, using the
approach described in
“A Second Look at Overloading” by Odersky, Wadler & Wehr (1995).
As the title of that paper suggests, we have added a kind of
overloading to Morel;
it is as if the intersect operator now has four forms:
intersect: α bag * α bag → α bagintersect: α bag * α list → α bagintersect: α list * α bag → α bagintersect: α list * α list → α list
(and similar overloads for the other relational operators) and the type inference algorithm solves the constraints to land on one valid assignment of types.
The algorithm took several months of hard work to implement, but the results are pleasing. Morel retains the key benefits of a Hindley-Milner type system: strong static typing, runtime efficiency, and type inference without the need for type annotations.
Like any other major change in architecture, constraint-based type inference will take a while to mature; [MOREL-270] and [MOREL-271] describe some of the remaining issues.
Conclusion
The ability to combine ordered and unordered data sets, and process both using relational operators, is a major new feature in Morel. It allows Morel to handle, with equal ease, data from files and relational databases, and data that is generated programmatically.
This feature will be available in Morel release 0.7.
To find out more about Morel, read about its goals and basic language, peruse the query reference or language reference, or download it from GitHub and give it a try.
If you have comments, please reply on Bluesky @julianhyde.bsky.social or Twitter:
Database tables are unordered, but functional programming languages work best over ordered lists. Which should @morel_lang prefer? Both! We now have "list" and "bag" types, and full relational algebra over both. https://t.co/n8vZx0pUmG
— Julian Hyde (@julianhyde) June 7, 2025