A Relational Model of Data for Large Shared Data Banks – Codd

30 September 2009

Summary
Limitations of existing data storage systems, especially their tendency to conflate data storage representation with the mechanism by which data is accessed. Specifically: ordering dependence, indexing dependence, access path dependence. Introduction of the relational model, primary keys, normal form and a normalization process. Description of a universal data sublanguage for querying, updating and deleting data. No specific syntax proposed. Definition of expressible, named and stored relations. Hints at the complexity of translating a query on a named relation into efficient operations on a stored relation (cue decades of research on query planning). Defines operations on relations: permutation, projection, join (cyclic and linear; natural), composition, restriction. Redundancy in relations, strong and weak. Consistency of the data set, and rudimentary strategies for checking and enforcing it.

Comments
Given that this was the paper that introduced the relational database, it clearly had some impact. Reading a paper that calls collections of data a data bank and still puts a space between data and base fills one with a sense of anachronism and reverence. That said, it is interesting to consider what stuck and what didn’t.

He makes a big deal about relations containing no duplicates, and about removing duplicate rows in projections and joins. As far as I know, modern databases don't do this. They assume that if you have duplicate rows, you probably care about them, or that you'll supply a DISTINCT clause.
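As a concrete illustration, a minimal sketch in standard SQL (the employees table and its dept column are invented for the example; SQL descends from, but doesn't strictly follow, Codd's model):

    -- A projection in SQL keeps duplicate rows unless you ask otherwise:
    SELECT dept FROM employees;           -- one row per employee, duplicates and all
    SELECT DISTINCT dept FROM employees;  -- a relation in Codd's sense: no duplicates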

Also interesting is the frequent reference to inferential databases (or rather, the repeated reminders that his proposal is only appropriate for non-inferential databases). The term doesn't seem to be widely used today, but I can only assume he was referring to databases that actually drew inferences from their data and presumably contained logical propositions in addition to facts.

When defining his universal data sublanguage, he mentions a host language and the data sublanguage. This made me immediately think that he had postulated something like LINQ way back in 1970. However, that turned out not to be the case. I'm not 100% clear on exactly what he did mean, but it appears to be more like the distinction between PL/SQL and ANSI SQL. The host language might contain hints on how to physically store the data, and it would provide a library of functions that could be called by the data sublanguage (his example is arithmetic operators).
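To make that reading concrete, here's a hypothetical PL/SQL sketch (the monthly function, the employees table, and its columns are all invented for illustration): the host language defines the arithmetic, and the data sublanguage merely invokes it:

    -- The host language (PL/SQL) supplies a function...
    CREATE OR REPLACE FUNCTION monthly (salary IN NUMBER) RETURN NUMBER IS
    BEGIN
      RETURN salary / 12;  -- the arithmetic lives here, not in the query language
    END monthly;
    /

    -- ...which the data sublanguage (plain SQL) can then call:
    SELECT name, monthly(salary) FROM employees;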

He discusses compositions at length, a composition being basically a join with the join columns removed (I could be wrong about this). Given that one almost never wants to do that, it wasn't clear why he spent time discussing it, but I think it's just a useful operation when proving other things about relations. He specifically mentions it later when talking about validating consistency. Similarly, restriction looks an awful lot like a natural join against a table whose columns are precisely the join columns and nothing more. Again, probably just useful for proofs.
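Here's how I'd render both in SQL, under my reading of his definitions (R and S are hypothetical relations sharing a column B):

    -- Composition of R(A, B) and S(B, C): join on B, then project B away.
    SELECT DISTINCT R.A, S.C
      FROM R JOIN S ON R.B = S.B;

    -- Restriction of R(A, B) by S(B): keep only the rows of R whose B
    -- value appears in S (what we'd now call a semijoin).
    SELECT A, B FROM R
     WHERE B IN (SELECT B FROM S);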

His discussion of redundancy also seems not to have borne fruit. Strong redundancy would indicate that, under the hood, you could avoid storing some data. An explanatory example: you declare a table T1 with columns (A, B), T2 with (B, C), and T3 with (A, C). If T1.A and T1.B are primary keys and T1.B is a foreign key (or for whatever reason you are guaranteed that there is an exact one-to-one correspondence between T1.B and T2.B), then T3 can be completely derived from a join on (A, B) and (B, C); it is in fact a composition as I mentioned above: join and drop the join columns. So a smart database could avoid storing anything on the file system for the (A, C) mapping. However, I don't know that any database actually does this. By his own admission, you can never actually do anything about weak redundancy (it would be like strong redundancy but without the guarantee that T1.B and T2.B are in one-to-one correspondence), so it's not super interesting.
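No database I know of detects this on its own, but you can get the storage savings by hand: declare T3 as a view over the composition rather than as a table. A sketch, assuming the guarantees above actually hold:

    -- T3(A, C) derived on demand; nothing stored for the (A, C) mapping.
    CREATE VIEW T3 AS
      SELECT DISTINCT T1.A, T2.C
        FROM T1 JOIN T2 ON T1.B = T2.B;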

Finally, his coverage of consistency seems pretty loosey-goosey in this world of ACID-compliant databases. But I have to remember that transactions had not yet been invented. Also, his examples all seemed to portray a world where the database was a giant mainframe with numerous terminals, where people were running queries against the database and manually updating and deleting data therein, so preventing all those people from introducing inconsistencies must have seemed like a major challenge.
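For contrast, here's what the modern answer looks like, as a sketch in PostgreSQL-flavored SQL (reusing the hypothetical T1/T2 tables from above): consistency rules are declared once, and the database rejects any change that would violate them, atomically:

    CREATE TABLE T2 (
      B INTEGER PRIMARY KEY,
      C INTEGER NOT NULL
    );
    CREATE TABLE T1 (
      A INTEGER PRIMARY KEY,
      B INTEGER NOT NULL REFERENCES T2 (B)  -- no dangling B values, ever
    );

    BEGIN;  -- related changes succeed or fail together
    INSERT INTO T2 (B, C) VALUES (1, 10);
    INSERT INTO T1 (A, B) VALUES (100, 1);  -- fine: B = 1 exists in T2
    COMMIT;
    -- INSERT INTO T1 (A, B) VALUES (101, 2); -- rejected: no such B in T2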

I doubt all of my paper reviews will be this thorough, but I’ll aim to maintain my current enthusiasm level for as long as possible.
