samskivert: Access Path Selection in a Relational Database Management System – Selinger, et. al.

21 October 2009

Summary
System R is introduced. SQL statement processing summarized: parsing, optimizing, code generation, execution. Research Storage System is introduced, supported scans summarized: segment and index scan. Costs for single relation access paths outlined along with general formula for estimating cost. Selectivity factor introduced, approximations provided for supported predicates (c = v, c > or < v, c1 < v < c2, c in l, c in subquery, or, and, not). Search algorithm plus heuristics for join access path selection summarized. Method for calculating costs is provided along with full walk-through of example join. Handling of nested queries is explained.

Comments
If you’ve ever wondered about the gory details of how a relational database decides precisely how to process a query, this is a great place to start. They give good formulas for estimating the cost (in disk reads) of various query predicates based on limited statistics about the indexes and relations, and a robust method for combining that information together as you consider different potential join orders. The whole gamut of query optimization has produced many PhDs over the last thirty years and is accordingly pretty complex, but this is the solid foundation on which it all rests.

I also discovered correlation subqueries (select NAME from EMPLOYEE X where SALARY > (select SALARY from EMPLOYEE where EMPLOYEE_NUMBER = X.MANAGER)), which I never knew about (or never had a query that needed such a thing, I suppose). My mind balks at how expensive they are to compute (they talk about some possible optimizations), but they definitely increase the expressive power of SQL.

A couple of amusing anecdotes:

Source: PDF ACM

©1999–2022 Michael Bayne