samskivert
samskivert » 2009 » November

November 28, 2009

Broken Flowers

I have enjoyed Jarmusch films in the past and I have thoroughly enjoyed Bill Murray films in the past. Unfortunately, the chocolate and peanut butter strategy didn’t work for me in this case. I thought the writing was pretty weak and Bill Murray’s supposed “Academy Award deserving” performance was just his same old shtick with a bit more asshole plus contemplative midlife crisis added into the mix.

Posted to films by mdb at 5:17 pm | Comments (0)       

November 27, 2009

Fantastic Mr. Fox

Fantastic indeed! Wes Anderson is definitely back on the upswing. He had already regained my trust with The Darjeeling Limited, but this is above and beyond the call of duty. Clooney was excellent, and the quirky animation and storytelling really worked well with Anderson’s particular brand of kookiness.

Posted to films by mdb at 5:21 pm | Comments (0)       

November 21, 2009

Bytecodes meet Combinators: invokedynamic on the JVM – Rose

Summary
Details on existing invoke bytecodes. Details on potential method call needs of JVM languages, drawbacks of existing reification approach in 13 “pain points”. Introduction of invokedynamic: bootstrapping dynamic calls, invocation, type checking, changing call target, complete relinking. Details on method handles: direct, adapter (arg conversion, combinators), bound (currying, closures), Java method handles. Method handle security considerations (create-time access checking). Inline caches via invokedynamic. Pros and cons of bytecodes vs. ASTs.

Comments
Concise yet detailed coverage of invokedynamic and what it brings to the JVM. Rose points out the much broader potential of invokedynamic and method handle graphs than simply accommodating the dynamic linking needs of non-Java languages. I find this pretty exciting. People have done some amazing stuff with bytecode rewriting, but the possibilities with method handles seem even greater (even in plain-old Java). Fancy metaprogramming becomes possible in a way that still allows the VM to achieve good optimization. He also mentioned a project to emulate invokedynamic on older JVMs (clearly with loss of performance, but not any worse than what dynamic languages on the JVM are doing today). This is also great in that it may enable earlier adoption of the new bytecode on projects (like Scala) that are unlikely to abandon support for 1.5 and 1.6 JVMs any time soon.

Source: PDF

Posted to papers by mdb at 12:14 pm | Comments (0)       

November 20, 2009

Static Extraction and Conformance Analysis of Hierarchical Runtime Architectural Structure using Annotations – Abi-Antoun, et al.

Summary
Describes a system, Scholia, for specifying object ownership in Java programs via an annotation-based type system, extracting hierarchical object-ownership graphs from said code, abstracting away architecturally insignificant details while preserving architecturally significant communications via edge lifting and summarizing, visualizing the extracted architecture, comparing the extracted architecture to a target architecture, and tracing differences of commission back to offending source lines. Details of the type system, details of the extraction and abstraction processes, case studies, and conformance measurement are also provided.

Comments
This is a bit of a tour de force paper, being a summary of the author’s PhD work. The general idea is one in which I have keen interest. The basic problem is that many systems start with a rough architectural sketch (or less), and then proceed full speed into implementation with no particular facility for ensuring conformance with the original architecture or easy way to see how that architecture evolves as the rubber hits the road.

Various approaches have been developed involving architecture description languages and even language extensions that enforce architectural constraints. These are all noble ideas and have achieved surprisingly little currency in industry. I can’t claim to definitively know why this is, but I suspect it involves a number of factors.

The desired architecture of a system is hard to determine up front and must usually evolve with the system. Yet once the system begins to evolve, the focus shifts to the actual code and the features of the product, and architecture takes a back seat. It only comes back up when some feature is likely to have major architectural impact. This is probably natural and efficient, but it means that the likelihood that the architectural modeling tool will be pulled out and used to update the designed architecture to account for system evolution is low.

Language support for layering architectural information on top of actual code (i.e. annotations) is relatively new (and not yet supported by many popular languages). As a long history of out of date documentation can tell us, if something is not specified right next to the code, or by the code itself, it’s likely to be neglected. Thus maintaining a separate architectural description is probably doomed to failure. If the compiler were enforcing the architectural restrictions, the architecture would likely remain more up to date, but the compiler (or a compiler plugin) would likely only be doing that job if the architectural requirements were specified as annotations on the code. Otherwise it would be handled by some separate model checker which requires yet more effort to integrate into a project’s build system and maintain.

This work seems to me to be close to tipping the scales in favor of a first-class representation of architectural concerns directly in the code base, enforced by the compiler and amenable to extraction by tools to generate architectural visualizations of a system as it is actually implemented. Whether or not a target architecture is maintained, having the ability to see the architecture as it evolves allows the natural spaghetti tendencies of any evolving system to be observed and kept in check.

Without having used Scholia myself on a system of reasonable size, I am probably underestimating the pain in adding and maintaining object-ownership annotations. The authors even indicate in the paper that it’s somewhat cumbersome. It remains to be seen whether some combination of ownership inference and a sensible choice of ownership defaults could be used to reduce the annotation burden to levels tolerable by real developers. Ideally one could make some use of partial annotations, but if a partially annotated system presented only a partial object ownership graph, architecture violating interactions might be hidden.

Though I’m not giving up on the idea of a total architectural description, I wonder if value could not be obtained through descriptions of separate aspects of a system’s architecture. Certainly there are many ways of looking at a system, and focusing on one aspect may allow useful and sound information to be obtained without huge effort by the programmer.

Take one architectural concern: separation of client and server. Often one wants to separate a code base into client code, server code and shared code. The client may make totally different assumptions about the runtime environment, use different dependency management techniques, have different performance considerations. As such, one may wish to restrict the client to use only client and shared code and the server to use only server and shared code, and possibly place stronger restrictions on the shared code, like no implicit dependencies. This is only one architectural decomposition of a system, and you can imagine others that might slice the system by feature-set or some other architectural aspect of interest.

I also think that more clearly defined modularity concepts in the source language could help to bridge the gap between a giant pile of interacting objects and a high-level view of inter-module communication. There are clearly many ways to skin this cat.

Source: PDF ACM

Posted to papers by mdb at 3:09 pm | Comments (0)       

November 9, 2009

Pig Latin: A Not-So-Foreign Language for Data Processing – Olston, et al.

Summary
Motivation: RDBMSs not ideal for analytics on “Internet-scale” data sets, programmers prefer procedural, map/reduce too low-level. Proposal: Pig Latin, a parallelizable procedural data processing language that compiles down to map/reduce tasks. Notable differences from RDBMSs: schema-optional, nested data model, pervasive UDF support, parallelism required, debugging support. Pig Latin: data model (atom, tuple, bag, map), LOAD/USING/AS, FOREACH, FLATTEN, FILTER, COGROUP, GROUP, JOIN, etc. Nested operations, output via STORE. Implementation: logical plan creation (laziness), map-reduce physical plan creation, nested-bag efficiency when using algebraic grouping functions. Debugging: Pig Pen, sandbox data set generation. Usage scenarios: rollup aggregation, temporal analysis, session analysis. Related work: Dryad, Sawzall, data-parallel languages. Future work: “safe” optimizer, better UI, non-Java UDFs, LINQness.

Comments
Pig looks very useful, but I can’t help but feel that a big part of it is a reinvention of the internals of (admittedly not widely hyped) parallel database systems. There are useful differences: no need for predefined schemas, the non-flat data model, separating join into cogroup and cross-product, allowing the developer to specify the query plan rather than being completely declarative. Some would call this last benefit a burden since the developer now has to understand how the system works sufficiently well to structure their query efficiently, but Pig targets situations where the database system doesn’t really know anything about the underlying data and thus is not in a good position to make those decisions on the developer’s behalf.

However, even with those additions, it takes a few big steps backwards. Having no information on key distribution means you always process your entire data set. This quickly becomes untenable as we discovered when developing Panopticon at OOO. A concrete example is: if you know that a file contains data that only spans a certain date range, you can completely avoid reading that file when doing a processing job that doesn’t intersect that date range. When you have 25,000 files and 24,500 of them don’t intersect your date range, the speedup can be quite substantial.

Less fundamental but still problematic is that they materialize intermediate data on their data-duplicating distributed file system rather than stream it directly to the reduce jobs. This is good for fault tolerance, but bad for performance. It’s not entirely clear that Hadoop Streaming is meant to address this problem.

It seems like (for analytics anyway) a relaxation or streamlining of some of the pesky requirements of parallel databases would give one the same convenience without requiring 3.1 to 6.5 times as much processing power.

Source:: PDF ACM

Posted to papers by mdb at 3:33 pm | Comments (0)       

MapReduce: Simplified Data Processing on Large Clusters – Dean & Ghemawat

Summary
Overview of map/reduce algorithm. Examples: dgrep, url count, reverse web-link graph, term-vector per host, inverted index, dsort. Overview of implementation in Google data center environment. Execution overview, fault tolerance approach, importance of locality, impact of task granularity, cost/benefit of backup task technique. Algorithm refinements: manual data partitioning, ordering guarantees, post-map combiner, I/O types, side-effects support, bad-record skipping, debuggable local execution, status reporting, counter support. Performance summary for grep and sort on terabyte input data set, effects of backup tasks and machine failures. Stats on adoption inside Google.

Comments
I have nothing to say about map/reduce that hasn’t surely already been said before. I will say that this paper does a great job of going “by the book”. They introduce their technique, give a number of concrete examples of its use, explain its implementation, cover some extensions, show its adoption by real engineers, give lots of hard data on how it performs, provide related work and a rousing conclusion. About the only thing they left out was a section on future work. We’ll assume that since they are Google engineers, the implied future work is “use map/reduce to take over the world.”

Source: PDF ACM

Posted to papers by mdb at 1:06 pm | Comments (0)       

Parallel Database Systems: The Future of High Performance Database Processing – DeWitt, et al.

Summary
Parallel database systems making a comeback on commodity hardware. Pipelined parallelism and partitioned parallelism both provide benefit but the latter gives bigger gains. Shared-nothing, partitioned data systems are general consensus. Challenges to performance: startup, interference, skew. Data partitioning methods: round-robin, hash, range. Parallelizing operators generally means introducing split and merge. Certain traditional relational operators are better in parallelized environments than others, e.g. hash-join beats sort-merge join. Extant systems: Teradata, Tandem, Gamma, etc. Future directions: batch+OLTP, parallel query optimization, client parallelism, physical design, online data reorganization.

Comments
I find it tragic that data-partitioned parallel databases were old hat in 1992 (Teradata was apparently doing this stuff as far back as 1978) and yet numerous, large-scale, websites are still doing horizontal data partitioning manually. To be fair, data partitioning has historically meant partitioning data across separate disk drives inside an SMP machine, but the trajectory has definitely been toward totally separate servers.

Oracle provided “Real Application Clusters” as far back as 2001 (though many might suggest the name replace the final s with a popular four-letter word), but very few large-scale websites use Oracle (I think eBay does, I know of no others). Open-source engines have failed to follow suit with cluster support. Is that because it’s hard? Certainly, but solutions to other hard problems make their way into FOSS. I think the major malfunction has been the assumption that ACID properties must be preserved in the clustered system. This is both technically very difficult and meaningfully reduces scalability. Thus it seems that the most important contribution of the parade of NoSQL projects is the realization that sometimes one will happily sacrifice some consistency for scalability and simplicity.

Source: PDF CiteSeerX

Posted to papers by mdb at 12:56 pm | Comments (0)       

November 8, 2009

100 Essential Things You Didn’t Know You didn’t know

A collection of 100 essays on vaguely mathematics-related things. Makes for good train or bus reading as each essay is only a few pages and self-contained. On the other hand, the brevity of the essay tends to prevent them from getting into super interesting details, so it’s definitely light reading. A final note: don’t buy this on the Kindle. It lacks the illustrations, which for some of the essays is a rather critical omission. It’s apparently coming out in paperback in 2010, so kill some trees and get a version with pictures.

Posted to books by mdb at 3:10 pm | Comments (0)       

November 6, 2009

The Men Who Stare at Goats

Ranged from pretty funny to falling out of my chair hilarious. Jeff Bridges and George Clooney were awesome. I could have done without the protracted “everyone lived happily ever after” ending, but perhaps they were having so much fun they just didn’t want to stop making the movie.

Posted to films by mdb at 7:03 pm | Comments (0)       

November 4, 2009

Thank Gosling

The following is not legal Java:

    public static void main (String[] args) {
        int[] foo = new int[1];
        int[] bar = new int[1];
        ((args.length == 0) ? foo[0] : bar[0]) = 1;
    }

That makes my life easier.

Posted to code by mdb at 5:25 pm | Comments (2)       

Column-Stores vs. Row-Stores: How Different Are They Really? – Abadi, et al.

Summary
Column-store DBMSs claim dramatically improved performance over row-store systems. Comparisons generally fail to compare against “column-optimized” row-store configuration. Authors ask whether row-store DBMS can achieve improved performance on OLAP queries, investigate: vertically partitioned row-store, index-only plans, materialized views. Authors investigate relative contribution to c-store performance of: late materialization, block iteration, column compression and propose invisible joins as additional c-store optimization.

Explanation of star schema benchmark (SSMB). Explanation of motivation for row-store physical layout changes (vertical partitioning, index-only plans, materialized views). Review of c-store optimizations: compression, late materialization, block iteration. Explanation of proposed invisible join optimization (hash, join, rejoin) and between-predicate rewriting.

Experiment goals, motivation and setup defined. Column-store simulation in row-store results: bad across the board. Reasons: per-tuple overheads, early column joins, system insistence on merge over hash joins. Also vertical partitioning prevents use of table partitioning. Column-store performance (on average): compression improves by 2x, late materialization by 3x, block-processing 5-50%, invisible join 50-75%. Denormalization in c-store not a huge benefit due to invisible joins.

Comments
This paper gives a great breakdown of how and why column-store databases massively outperform (by an order of magnitude) row-store databases for analytics queries. They also offhandedly mention some pretty substantial space savings due to better compressibility of columns and less per row overhead: 4 GB compressed for row-store and 2.3 GB compressed for the same data in a column-store. That’s just run-length encoded data which the c-store database actually processes in that form without decompressing.

This all inclines me to believe that a lot of medium-sized organizations that are going to substantial trouble to set up Hadoop plus Pig/Sawzall analysis engines (or, ahem, their own custom system) are probably barking up the wrong tree. You can get a terabyte or more of data on a single machine, and you can stick different tables on different machines if you really have a working set larger than that. The c-store database is going to process that data way faster than any sort of divide and conquer approach until you get to ludicrous quantities of data. The c-store is “mmap() a giant array of ints and address it like an array” fast. Compare that to the dozens of layers of indirection your data passes through from the disk drive to a Pig query, for example.

Source: PDF ACM

Posted to papers by mdb at 12:08 am | Comments (0)       

November 2, 2009

On Understanding Data Abstraction, Revisited: Cook

Summary
It is widely assumed that Objects == ADTs (or even Objects <= ADTs). They are both forms of data abstraction, but fundamentally different. OO programming languages conflate the two and obscure the difference. Conflation useful, obscurity unfortunate. ADTs: public name, hidden representation, operations to create, combine and observe. Provide representation independence, optimizability, formal reasonability. Require unique implementation (module jockeying helps but does not allow interoperation). Objects: a record of higher-order functions (an interface). Classes are functions that create instances. Objects are autognostic: know only their own representation. Adds flexibility, reduces optimizability, drastically reduces formal reasonability, inclines programmer to expose internals to regain performance. Relationships: ADTs require static types, objects don't. Both are equivalent for non-complex operations. Extensibility problem. Imperative state and polymorphism actually orthogonal to this duality. Reality: OOP in Java, type classes in Haskell, Smalltalk. Understanding these distinctions informs the choice of which to use when.

Comments
Wow, this is a great paper! Maybe everyone in academia already understands this like the back of their hand, but as someone who was introduced to these concepts through real-world programming languages, this really clarifies a lot of fuzzy notions I’ve had about what fundamental abstractions I use when I’m programming. I now have a much better idea of the extent to which Java was influenced by Smalltalk and why Scala’s reintroduction of ADT constructs as elegant language features is so compelling (even if they’re implemented under the hood using OO mechanisms). One wants both approaches and one wants to avoid the more-typing/loss-of-clarity burden of emulating one approach in the other. There are numerous other great insights in here: OOP being fundamentally higher-order programming because objects are actually records of HOFs, OO being hard to analyze as a result, OO being less popular in academia (than ADTs) as a result of that result. I am now convinced that one of the first things that should be done in CS education (maybe not in 101) is to implement an object system in an ADT-based language (like C) and implement a proper ADT in an OO language (like Java or Smalltalk). I had done the former and it was a mind-expanding experience, I had not (explicitly) done the latter.

Source: PDF

Posted to papers by mdb at 3:28 pm | Comments (1)       

Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals: Gray, et al.

Summary
Introduction of common statistical analysis pattern: formulate, extract, visualize, analyze. Explanation of need to reduce dimensions by summarizing along certain dimensions. Sample analysis problem, limitations of analysis via existing SQL constructs (mainly GROUP BY). GROUP BY extensions provided in some databases to overcome these limitations in certain cases. CUBE and ROLLUP operators introduced along with ALL value. Algebraic combinations of CUBE, ROLLUP and GROUP. Syntax proposed for new operators. Implications of addition of ALL value. Alternative to use of ALL value. Decoration columns, their utility in GROUP BY queries and their interaction with ALL. Star and snowflake schema/queries and use in capturing hierarchical attribute dimensions. Proposal for extracting values from a CUBE result. An approach to computing CUBE. Aggregate functions classified as distributed, algebraic or holistic; each type’s computability via proposed approach discussed. Maintaining CUBE and ROLLUP results as data changes inspired by SQL Server customer behavior.

Comments
The CUBE operator seems very handy for certain data analysis tasks, but would also seem to be on the fuzzy edge of what is broadly useful and obviously belongs in a relational database and what should probably be left to special purpose data analysis tools. Not surprisingly, it only seems to be implemented in commercial databases where they ran out of ways to differentiate themselves and had to start tackling specialized markets. These days one would probably start with Hadoop and Sawzall rather than an RDBMS. With this sort of data processing, the strengths of an RDBMS — transactions, indexes, sophisticated query optimizers — either don’t help at all or just get in the way.

Source: PDF CiteSeerX

Posted to papers by mdb at 9:32 am | Comments (0)       

MDB

Bits
Ludography
Reviewlets
Camera Wrestling
Archives:

samskivert.com ©2001 - 2009 Michael Bayne <mdb@samskivert.com>