Until last week, I hadn’t thought much about DML in Morel. I knew we would support DML, of course, but I had assumed that we would just add commands analogous to SQL’s INSERT, UPDATE, DELETE commands.

But then, over a beer, my brother asked me whether Morel would have a MERGE command. I almost choked on my pale ale. I realized that MERGE (also known as UPSERT) and the other SQL DML commands brought a complexity that could damage the fragile design of a new language. I started thinking about the whole DML problem. DML is about mutation, Morel is a functional programming language, and mutation and FP have always been uneasy bedfellows. So I devised some language extensions that I think fit both Morel and modern data engineering workflow.

In this post, I describe those DML extensions, and related ideas about transactions, incremental processing, and view maintenance.

I would love to hear feedback from people who do data engineering, ELT and ETL, and love or hate how current tools and query languages solve the problem. My extensions to Morel are at an early stage, so there is plenty of time to change course. I will be speaking next week at Data Council 2025, so if you are in Oakland, please tell me what you think!

Translate SQL DML commands to Morel

The obvious thing would be to directly translate SQL’s DML commands into Morel. To see how that would look, let’s consider a simple SQL script that executes the three DML commands on the emps table of the scott data set, and then commits the changes.

-- Delete employees who earn more than 1,000.
DELETE FROM scott.emps
WHERE sal > 1000;

-- Add one employee.
INSERT INTO scott.emps (empno, deptno, ename, job, sal)
VALUES (100, 20, 'HYDE', 'ANALYST', 1150);

-- Double the salary of all managers.
UPDATE scott.emps
SET sal = sal * 2
WHERE job = 'MANAGER';

-- Commit.
COMMIT;

If we added insert, update, and delete operators to Morel, this could become the following code.

(* Delete employees who earn more than 1,000. *)
delete e in scott.emps
  where e.sal > 1000;

(* Add one employee. *)
insert scott.emps
  [{empno = 100, deptno = 20, ename = "HYDE", job = "ANALYST", sal = 1150}];

(* Double the salary of all managers. *)
update e in scott.emps
  where e.job = 'MANAGER'
  assign (e, {e with sal = e.sal * 2});

(* Commit. *)
commit;

Note that update has an assign clause that updates the current record, and we borrow the with functional update notation from OCaml.

But this doesn’t look right to me. The insert, update, and delete operators all mutate their target data set – which conflicts with functional programming’s immutability ethos – and their syntax only uses parts of Morel’s elegant from expression.

Also, let’s remind ourselves what we are trying to achieve.

Requirements for a data engineering language

Things a data engineering language needs to do well:

  • Refer to any previous version of a data set (committed or uncommitted).
  • Work in terms of deltas between data sets, or versions of a data set, if that is the most natural way to frame the problem.
  • Support data sets other than tables: local files, data frames, cloud object storage, non-SQL databases.
  • Allow you to commit atomically when you are ready, and not force you to commit before you are ready.
  • Create intermediate data sets (temporary tables), and materialize them if it improves performance.
  • Give the compiler freedom to re-order commands, parallelize commands, and create (or suggest) intermediate data sets.
  • Static typing, so that you can find and refactor all uses of a particular table or field across all queries, scripts, programs, and jobs.
  • Parameterize scripts.
  • Refer to previous types of a table, e.g. when the table did not have a bonus column, and when the hiredate column was a string.

SQL DML commands fail most of these requirements. (For example, they can refer to only one previous version of a data set, namely the contents of the table at the start of the current command.) DML looks better when considered through the lens of functional programming.

DML as functional programming

Functional programming languages really don’t like assignment: they would rather create a new variable than mutate an existing variable.

This makes sense for a functional language’s compiler – if a program has no mutations, the compiler has more freedom to re-order and parallelize the program – but the controller running data engineering script would also like to re-order and parallelize wherever possible.

SQL’s DML commands are weird. (Though many of us have been writing SQL for so long that we’ve gotten used to the semantics.) In a command that modifies the emps table, you can also read from a table called emps, but it is a different table! It is the before image, the state of the table before this command started inserting, modifying or deleting rows.

It is reasonable, and useful, to refer to the before image, but it is only the before image of the current command. Suppose your script has three commands – a DELETE, an INSERT and an UPDATE, as in the previous example – and the UPDATE command wishes to reference the state of the emps table before the INSERT command was run. In SQL, you’re out of luck. In the DML syntax I’m proposing for Morel, it is straightforward:

(* Delete employees who earn more than 1,000. *)
val emps2 =
  from e in scott.emps
    where not (e.sal > 1000);

(* Add one employee. *)
val emps3 = emps2 union
  [{empno = 100, deptno = 20, ename = "HYDE", job = "ANALYST", sal = 1150}];

(* Double the salary of all managers. *)
val emps4 =
  from e in emps3
    yield if e.job = "MANAGER" then {e with sal = e.sal * 2} else e;

(* Commit. *)
commit {scott with emps = emps4};

The insert, update, and delete commands are no more, but we use the new with operator during update and commit.

The commit command has changed considerably, and deserves at least one blog posts all to itself. It is now the only command that mutates (although what mutatation means in a language without side-effects has yet to be nailed down). Its argument is the value of a database – in this case the scott database – which is a record with a field for each table – and therefore allows us to commit changes to all tables atomically; if we have changed only a few tables, with provides a convenient shorthand. Representing a database as a record will (with the forall quantifier just added, and the proposed check keyword on type specifications) allow us to define complex constraints such as foreign keys.

Because we use a new variable for each version of the emps table, the previous versions are still available.

If you’re worried about the performance and storage requirements of all of these copies of the emps table, don’t be. These copies are virtual. For example, if your program asks for emps2, this might expand to the contents of emps3 minus the “delta” records inserted. Or if you are using a table format such as Apache Iceberg or Hudi, each version is probably just a different subset of the underlying files. The proposed Morel syntax is closer to how databases implement transactions internally (using snapshots and journals).

Incremental computation

Incremental computation is essential when you are maintaining large data sets on disk. If you have a table with a billion rows and you want to insert, update or delete 1% of those rows, it makes no sense to write the 99% of rows that are not effected. The SQL INSERT, UPDATE and DELETE commands are clearly designed with that in mind.

But modifying 1% of a collection is mutation, which is antithetical to the functional programming ethos. It is more complicated to say “here are the rows added, modified and removed” than to say “here is the new relation”.

This proposal offers the best of both. A Morel modification command returns the new set but is implemented by sending incremental commands to the database. For example, if the “Delete employees who earn more than 1,000” command above deletes 1% of employees, then emps2 will contain the remaining 99% of employees, but Morel will update the database efficiently by sending a DELETE command.

Incremental computation also arises during incremental view maintenance (IVM), or more generally, when maintaining some materialized data set that has been created (explicitly by the user, or implicitly by Morel’s planner) earlier this script run or in a previous run. It is notoriously hard to correctly write the incremental logic; it is preferable to declare that the data set is always equivalent to running a particular query, and have the planner compute the deltas.

For both kinds of incremental computation, we believe that people should work in the “declarative space,” saying “this is the relation that I want,” and Morel should figure out the incremental operations to compute and store that relation efficiently.

Conclusion

This article outlines how DML commands could be implemented in Morel, in a way that we believe is superior to SQL. Rather than creating direct analogs of the SQL commands, our approach is to use queries to create named intermediate results: INSERT becomes a query with union, DELETE becomes a query with minus or where not, and UPDATE becomes a query with a conditional yield. The only new command, commit, has the magical ability to update all tables in a database atomically.

This is consistent with Morel’s ethos as a functional relational language, and makes some of data engineering’s challenges more tractable.

If you have comments, come see me next week in Oakland or reply on Bluesky @julianhyde.bsky.social or Twitter: