samskivert: What Goes Around Comes Around – Stonebraker, et al.

02 October 2009

“Those who do not understand history are condemned to repeat it.” Summarizes database research over the last 30 years, organized into nine eras:

1. Hierarchical era: epitomized by IMS. Limitations include data duplication (tree-structure not ideal for all data sets), path-dependence, physical data dependence, manual query implementation.

2. Network: epitomized by CODASYL. Network structure more flexible than tree, but also more complex. Binary relationships still not ideal (e.g. marriage ceremony with bride, groom, minister), manual query implementation now even more complex, fails to encourage data factoring (end up with spaghetti-like network).

3. Relational era: inaugurated by Codd in 1970, simple logical data model (tables), accessed through set-at-a-time query language, physical and logical data independence. Triggered “Great Debate”, opponents balked at complexity of query language. RDBMSs incubated on VAX systems, debate resolved (by fiat) when IBM announced DB2 (and kingmade SQL).

4. Entity-relationship: recast relational model as entities and their relationships. Turned out to be easily mappable to relational model and provided no performance or functionality benefits. Little impact other than as a useful conceptual model when creating schemas.

5. Post-relational (R++): myriad small extensions to relational model proposed. Authors’ favorite was Gem, which added set-valued attributes, aggregation (direct record-to-record links), and generalization (inheritance). None of the extensions caught on.

6. Semantic Data Model: focused on classes, inheritance (networks, allowing multiple inheritance). Little marketplace success, easy to map manually to relational model, no significant power/performance improvements.

7. OODBMS: mostly “persistent C++”, supported in-process persistent object hierarchies, lacked query languages, transactions, focused on not degrading performance. O2 attempted to make “proper” DBMS with object-oriented data model and query language, etc. but failed in marketplace.

8. Object-relational: born from a need to handle GIS data, chiefly introduced user-defined datatypes, user-defined functions (stored procedures), user-defined access methods (custom indexing, etc.). All of these brought performance improvements and, not surprisingly, have survived in the marketplace.

9. Semi-structured data: schema-last, network-oriented data model. Real problem is semantic heterogeneity which is not helped by SSD, most systems attempt to migrate from schema-last to schema-first. XML: “everything we’ve done before and then some”, supports hierarchical and networked data models, set-based attributes, inheritance, xpath query language, and adds union types, and more. Will probably either: fail outright, be trimmed to a data-oriented subset, or become popular and cause a repeat of the complexity-inspired pain of era 2 (CODASYL, etc.).
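
To make the contrast between eras 1 and 3 a little more concrete, here is a minimal sketch of the same question ("which suppliers supply a red part?") answered two ways: by hand-written navigation over a tree of records, and by a set-at-a-time query. The supplier/part data, the table names, and the function names are all illustrative assumptions, not anything taken from the paper, and SQLite simply stands in for "a relational system".

```python
import sqlite3

# Hypothetical hierarchical data: parts are stored underneath each supplier,
# so part P1 is duplicated (one of the limitations noted for era 1).
SUPPLIER_TREE = {
    "S1": {"name": "Acme", "parts": [{"pno": "P1", "color": "red"},
                                     {"pno": "P2", "color": "blue"}]},
    "S2": {"name": "Bolt Co", "parts": [{"pno": "P1", "color": "red"}]},
}


def red_part_suppliers_navigational(tree):
    """Record-at-a-time: the program itself walks the hierarchy."""
    found = []
    for supplier in tree.values():
        for part in supplier["parts"]:
            if part["color"] == "red":
                found.append(supplier["name"])
                break  # one red part is enough for this supplier
    return sorted(found)


def build_relational_db():
    """Same facts as three tables; each part is stored exactly once."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE supplier (sno TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE part (pno TEXT PRIMARY KEY, color TEXT);
        CREATE TABLE supply (sno TEXT, pno TEXT);
        INSERT INTO supplier VALUES ('S1', 'Acme'), ('S2', 'Bolt Co');
        INSERT INTO part VALUES ('P1', 'red'), ('P2', 'blue');
        INSERT INTO supply VALUES ('S1', 'P1'), ('S1', 'P2'), ('S2', 'P1');
    """)
    return conn


def red_part_suppliers_relational(conn):
    """Set-at-a-time: say what is wanted, the engine decides how to get it."""
    rows = conn.execute("""
        SELECT DISTINCT s.name
        FROM supplier s
        JOIN supply sp ON sp.sno = s.sno
        JOIN part p ON p.pno = sp.pno
        WHERE p.color = 'red'
        ORDER BY s.name
    """)
    return [name for (name,) in rows]
```

The navigational version bakes the storage layout into the program, so reorganizing the tree means rewriting the loop. The relational version would keep working if the engine changed its indexes or storage, which is the data independence argument in miniature.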

We get an enjoyable and informative overview of the last 30 years of database research, by authors who are not afraid to inject a little humor and opinion along the way. The pre-relational summary echoes many of the pain points outlined by Codd in his paper introducing the relational approach. The outline of the “great debate” is fascinating, both as an introduction to the history of the field and in the ways it so closely mirrors the debate surrounding the transition from hand-coded assembly to higher-level languages and basically every major raising of the level of abstraction in computer science (“normal programmers can’t understand it”, “it will never be fast enough”, etc.). Also interesting is the object lesson in technology adoption with RDBMSs piggy-backing on the success of the VAX and then being thrust into the limelight by heavyweight IBM (along with their choice of SQL as the de facto query language). E-R succeeding as a modeling tool seems obvious in retrospect and is interestingly paralleled by UML’s success in design and documentation and failure as a model from which code is generated.

The post-relational (R++) era seems characteristic of a technology where there are no major pain points and thus nothing on which to focus research. Researchers languish, noodling with incremental improvements that don’t provide sufficient increase in expressive power or performance to justify the effort and complexity of adoption. One need only look to the flurry of activity in the last few years (unfortunately post-dating this paper) to see how the major pain of developing massively scalable systems has generated work that already looks like it will have longer lasting impact than much of what was done in the 80s and 90s. The semantic data model era seems very similar to R++ except that the motivation is the abstract “semantic impoverishment” of RDBMSs rather than some specific end use like 2D data sets or text management.

Having lived through the OO era, I am intimately familiar with the colossal “missing of the point” that took place during that time. The authors diplomatically hint at this state of affairs. OODB systems were fragile, only applicable to narrow application domains and generally more trouble than they were worth. It is heartening to see that new research like LINQ is addressing the actual problem — the pain of bridging between a normal programming language and the query language used by the database, as well as marshaling data between the two. Interestingly the authors hint at the value of this in their summary calling for “code and data [to be made] equal class citizens”.
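
The bridging pain mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Employee` class, the `employee` table, and the function names are hypothetical, and the second function is a plain Python comprehension over in-memory objects standing in for the shape of a language-integrated query, not LINQ itself.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Employee:  # hypothetical application type
    name: str
    salary: int


def load_high_earners(conn, threshold):
    """The usual bridge: the query is an opaque string to the host language,
    and every row has to be marshaled into an object by hand."""
    rows = conn.execute(
        "SELECT name, salary FROM employee WHERE salary > ? ORDER BY name",
        (threshold,),
    )
    return [Employee(name=name, salary=salary) for (name, salary) in rows]


def high_earners_in_language(employees, threshold):
    """The same question phrased in the host language, where field names
    and types are visible to the language's own tooling."""
    matches = (e for e in employees if e.salary > threshold)
    return sorted(matches, key=lambda e: e.name)


def demo():
    """Build a tiny in-memory table and answer the question both ways."""
    people = [Employee("Ann", 120), Employee("Bob", 80), Employee("Cy", 150)]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?)",
        [(p.name, p.salary) for p in people],
    )
    return load_high_earners(conn, 100), high_earners_in_language(people, 100)
```

In the first function a misspelled column name only shows up at run time, and the tuple-to-object mapping has to be kept in sync with the schema by hand. That gap between the two halves is the problem in question.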

As the authors point out, it is instructive to see that the major contribution of the O-R era was stored procedures. User defined data types and user defined data access methods are certainly very useful for niche markets, but with dates, currency and full text search now first-class citizens of most relational systems the vast majority of the market has no need to implement custom data types or indexing methods. Stored procedures stick with us mostly due to their performance benefit — they certainly don’t make a system more portable or easier to maintain. One could also point to the rearchitecture of database engines to support extensible data types and access methods as a benefit of this era if only from an engineering standpoint.

Finally we get to the semi-structured data and XML eras. With an additional few years of hindsight, it is clear that the authors’ assessment of the inutility of these approaches has been validated. XML databases have not taken over the world, and XPath remains a niche technology hidden behind a wall of complexity, to say nothing of XSL and XSLT. As the authors also predicted, XML as an over-the-wire format for RPC has taken over the world. Amusingly, it may someday be usurped by JSON, a technology born of — you guessed it — a desire for efficiency.

I would be very interested to hear what the authors have to say about developments in the last five years in the database field. Their opinions on LINQ and related technologies as well as on the menagerie of scalability-focused non-relational databases (CouchDB, SimpleDB, et al.) would likely be a refreshing counterpoint to the recent resurgence of reports of the death of the RDBMS.

©1999–2015 Michael Bayne