Four algorithms for joining larger than main memory tables are considered: sort-merge, simple hash, GRACE hash and hybrid hash. Sort-merge partitions the tables using an ordering established by the join columns, GRACE and hybrid hash partition the tables via a hashing function and the simple hash algorithm does not explicitly partition the tables.
The performance of the four algorithms is analyzed, their minimum memory requirement (for non-pathological performance) is established, and circumstances in which the algorithms prove equivalent or to dominate one another is also discussed. In most circumstances, the hybrid algorithm is shown to provide the best performance.
Strategies for dealing with partition overflow are discussed, but not quantified. The benefits and performance implications of using the algorithms with a virtual memory system are also discussed with an toward achieving good performance and memory utilization where multiple queries may be run simultaneously. Finally three join-efficiency-improving tools are shown to be compatible with the algorithms: database filters, Babb arrays and semijoins.
Nothing like constrained resources to inspire clever algorithms. The hybrid algorithm proposed nicely degrades to the one that performs well in low memory when memory is low and upgrades to the one that performs well with copious memory when memory abounds. There is some serious hand waving over how the algorithm might cope with non-uniform distribution of the join keys which would likely result in it generating partitions too large to fit into memory. Some vague fallbacks are provided but they mostly involve adjusting the hashing algorithm after the fact, which sounds non-trivial. One presumes that in the intervening two decades these issues have been hammered out.
In these days of copious memory, a paper that talks about data sets requiring a whopping five megabytes of RAM might sound a bit quaint, but fortunately the amount of data we’re trying to process these days nicely outstrips the amount of RAM on our machines by the same few orders of magnitude. These algorithms may well prove useful on projects like Pig, where the tendency is probably to say “We scale out, so we don’t have to worry about performance on individual machines; just do things the naive way.” Maybe we’ll see these same algorithms crop up in another five years, except instead of being motivated by a reduction in query time, they’ll be motivated by the reduction in power consumption and disk drive wear that is achieved by doing fewer disk reads and memory compares/copies as we grind through our petabytes.