Motivation: RDBMSs not ideal for analytics on “Internet-scale” data sets, programmers prefer procedural, map/reduce too low-level. Proposal: Pig Latin, a parallelizable procedural data processing language that compiles down to map/reduce tasks. Notable differences from RDBMSs: schema-optional, nested data model, pervasive UDF support, parallelism required, debugging support. Pig Latin: data model (atom, tuple, bag, map), LOAD/USING/AS, FOREACH, FLATTEN, FILTER, COGROUP, GROUP, JOIN, etc. Nested operations, output via STORE. Implementation: logical plan creation (laziness), map-reduce physical plan creation, nested-bag efficiency when using algebraic grouping functions. Debugging: Pig Pen, sandbox data set generation. Usage scenarios: rollup aggregation, temporal analysis, session analysis. Related work: Dryad, Sawzall, data-parallel languages. Future work: “safe” optimizer, better UI, non-Java UDFs, LINQness.
Pig looks very useful, but I can’t help but feel that a big part of it is a reinvention of the internals of (admittedly not widely hyped) parallel database systems. There are useful differences: no need for predefined schemas, the non-flat data model, separating join into cogroup and cross-product, allowing the developer to specify the query plan rather than being completely declarative. Some would call this last benefit a burden since the developer now has to understand how the system works sufficiently well to structure their query efficiently, but Pig targets situations where the database system doesn’t really know anything about the underlying data and thus is not in a good position to make those decisions on the developer’s behalf.
However, even with those additions, it takes a few big steps backwards. Having no information on key distribution means you always process your entire data set. This quickly becomes untenable as we discovered when developing Panopticon at OOO. A concrete example is: if you know that a file contains data that only spans a certain date range, you can completely avoid reading that file when doing a processing job that doesn’t intersect that date range. When you have 25,000 files and 24,500 of them don’t intersect your date range, the speedup can be quite substantial.
Less fundamental but still problematic is that they materialize intermediate data on their data-duplicating distributed file system rather than stream it directly to the reduce jobs. This is good for fault tolerance, but bad for performance. It’s not entirely clear that Hadoop Streaming is meant to address this problem.
It seems like (for analytics anyway) a relaxation or streamlining of some of the pesky requirements of parallel databases would give one the same convenience without requiring 3.1 to 6.5 times as much processing power.