samskivert: Generalized Search Trees for Database Systems – Hellerstein, et. al.

14 October 2009

Introduces existing state of the art: specialized search trees and search trees for extensible data types. Introduces Generalized Search Tree (GiST). Explains general search tree abstraction, shows how this maps nicely to GiST. Explains structure, invariants and key methods (Consistent, Union, Compress, Decompress, Penalty, PickSplit) of GiST. Defines Search, elaborates on optimizations for search on linearly ordered domains. Outlines how Insert works, omits explanation of Delete. Instantiates GiST as B+-tree, R-tree and (novel) RD-tree (set-valued attribute index), summarizes key methods in each case. Discusses performance implications of key overlap and compression loss, provides analysis of performance of RD-tree. Sketches implementation issues: in-memory efficiency, concurrency-related, variable-length keys, bulk loading, optimizer integration and actual coding. Proposes as future work: theory of indexability, indexing of non-standard domains, query optimization and cost estimation, lossy key compression techniques and algorithmic improvements (c.f. R*-tree improvements over R-tree).

The GiST is clearly a useful abstraction over pre-existing search trees. The key contribution of this paper would seem to be identifying the generalization of those trees as successive partitions of a data set and defining algorithms based around that abstraction. In addition to simply saving time for the implementer of a new index, it provides a conceptual starting point for considering whether and how your data can be indexed at all. Will it partition into a nicely balanced tree with good fanout? If not, some other approach may be desirable.

Of the six methods needed to implement a GiST, four really capture the essence of the process: Consistent indicates whether the subtree identified by a particular edge may contain data that matches your query. If so, you traverse it. If not, it can be skipped. Union is needed when restructuring the tree after an insert. Every edge must be made Consistent with the Union of the keys in the tree below that edge. Penalty guides the algorithm toward the optimal subtree in which to insert new data and PickSplit guides the algorithm in splitting a tree node in two when it becomes too large. In retrospect this seems like a perfectly natural way to capture the construction and use of a search tree but it no doubt took keen insight on the part of the authors to tease out exactly what was important and what was not.

Also interesting is the identification of limitations of the GiST. Not every data set nicely partitions into a perfectly non-overlapping hierarchy and though the GiST handles overlapping keys, performance degrades as overlap increases. Further, the union of the keys in a subtree must provide some sort of compressed representation that does not make the size of the index prohibitively large. For B+-trees one need only store the lower and upper bound of the data contained in each subtree (in fact, just the lower bound) and this partitions the data with no overlapping. Other index types are more problematic: with R-trees overlap becomes harder to avoid due to the use of bounding boxes to define Consistent, but using more fine-grained data would result in too much data at each index node. With RD-trees and set valued data, overlap becomes more problematic still. It is possible that the illustrations of the constraints on tree indexing is a greater contribution than the generalized framework for implementing tree indexes itself.

Source: PDF ACM

©1999–2015 Michael Bayne