Parallel database systems are making a comeback on commodity hardware. Pipelined and partitioned parallelism both provide benefits, but the latter gives the bigger gains. The general consensus favors shared-nothing, partitioned-data systems. Challenges to performance: startup cost, interference, and skew. Data partitioning methods: round-robin, hash, and range. Parallelizing relational operators generally means introducing split and merge operators. Certain traditional operators fare better in parallel environments than others, e.g. hash join beats sort-merge join. Extant systems: Teradata, Tandem, Gamma, etc. Future directions: mixed batch+OLTP workloads, parallel query optimization, client-side parallelism, physical design, and online data reorganization.
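The three partitioning schemes are easy to see in miniature. Here is a sketch of each, assuming rows keyed by some comparable value; the function names, partition count, and range bounds are illustrative, not from the paper:

```python
def round_robin_partition(rows, n):
    """Row i goes to partition i mod n: perfectly even spread, but no way to locate a row by key."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, n, key):
    """Hash the key: supports exact-match lookups and feeds parallel hash joins."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts

def range_partition(rows, bounds, key):
    """Split on sorted key boundaries: supports range scans, but a hot key range causes skew."""
    parts = [[] for _ in range(len(bounds) + 1)]
    for row in rows:
        k = key(row)
        idx = sum(1 for b in bounds if k >= b)  # count boundaries at or below k
        parts[idx].append(row)
    return parts
```

For example, `range_partition(range(10), [4, 8], key=lambda r: r)` yields the three buckets `[0..3]`, `[4..7]`, `[8, 9]` — and if most keys fell below 4, that first node would do most of the work, which is exactly the skew problem the paper calls out.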
I find it tragic that data-partitioned parallel databases were old hat by 1992 (Teradata was apparently doing this stuff as far back as 1978), and yet numerous large-scale websites are still doing horizontal data partitioning manually. To be fair, data partitioning has historically meant partitioning data across separate disk drives inside an SMP machine, but the trajectory has definitely been toward totally separate servers.
Oracle has provided “Real Application Clusters” since 2001 (though many might suggest replacing the final s with a popular four-letter word), but very few large-scale websites use Oracle (I think eBay does; I know of no others). Open-source engines have failed to follow suit with cluster support. Is that because it’s hard? Certainly, but solutions to other hard problems make their way into FOSS. I think the major malfunction has been the assumption that ACID properties must be preserved in the clustered system. That is both technically very difficult and a meaningful drag on scalability. So it seems that the most important contribution of the parade of NoSQL projects is the realization that sometimes one will happily sacrifice some consistency for scalability and simplicity.