samskivert
samskivert » 2009 » October

October 30, 2009

Constrained Types for Object-Oriented Languages – Nystrom, et al.

Summary
Constrained types introduced as extension to standard OO type systems and form of dependent type. Implemented as part of X10 language. Classes have typed properties (reified as public final fields), boolean expressions on those properties can be expressed as a constraint on the type of the class. Constraints can appear on class declarations (providing class invariants), method arguments (providing preconditions), return types (providing postconditions), properties, fields and local variables. Pluggable constraint systems are used to evaluate constraints, multiple systems may be combined in a given constraint declaration. Checking of separately compiled classes also supported. Syntax and semantics explained, implemented constraint systems (equality, presburger, set constraints) described, implementation described (translation to Java), formal semantics provided (CFJ extension of FJ) along with soundness proof. Related work: constraint-based type systems, dependent types, pluggable types.

Comments
This seems like a very powerful technique for concisely specifying the sort of thing that one tends to resort to increasingly incomprehensible machinations to express using parameterized types. If you’ve ever seen an encoding of the naturals in a type using Church numerals, you can really appreciate how nice it would be to just say C{i > 0}, not to mention something like, say:

class List<T>(length: int{self >= 0}) {
    def get (idx :int{self >= 0 && self < this.length}) :T = { ... }
}

I am also intrigued by the idea of expressing pre- and postconditions as constraints on the types of method arguments, but I wonder how well that would work out in practice. The authors specifically mention the utility of using constraints to enforce constructor preconditions rather than the common pattern of runtime assertion checking.

One concern is how quickly does the constraint solving part of the compilation process become horribly slow and how frequently does one bump into divergence in checking types? At least the separation of the constraint solvers from the compiler implementation means that they can be (and already are) highly optimized by people other than the compiler authors.

Source: PDF ACM

Posted to papers by mdb at 10:47 pm | Comments (0)       

Concurrency Control and Recovery – Franklin

Summary
Concurrency, recovery, transactions and ACID introduced. Serializability section: conflict and view serializability, transaction schedules and precedence graph. Recovery section: types of failure (transaction, system, media), undo and redo, steal/no-steal and force/no-force, logging, checkpointing, write-ahead logging. Best practices in concurrency control: two-phase locking, deadlock detection, isolation levels (read uncommitted, read committed, repeatable read, serializable), hierarchical locking, optimistic and multiversion concurrency control. Best practices in recovery: phsiological logging, ARIES protocol (analysis, redo and undo). Extensions and limitations: two-phase commit (distributed dbs), systems not suited to ACID, lack of app knowledge exploitation.

Comments
This is a great overview of the basic ideas that underlie transaction support in RDBMSs. Understanding what exactly has to take place to implement correct transaction semantics in the face of arbitrary failure gives a pretty helpful perspective on when and why a database is going to write to disk. The ARIES protocol is a fine piece of work. The notion of repeatability is another interesting challenge. Say you do a query and then do the same query again later in that same transaction. If repeatability is desired, one must not see new rows inserted by any other transaction. Accomplishing this without write-locking the entire table requires complex coordination. Fortunately, I believe most databases don’t provide that isolation level as the default. It’s way less expensive to just not worry about it.

Source: PDF CiteSeerX

Posted to papers by mdb at 10:15 pm | Comments (0)       

October 23, 2009

A Serious Man

High quality, hard hitting entertainment from the Cohen brothers. Wouldn’t have minded a bit more plot resolution, though. It’s one thing to leave some room for interpretation but this was more like yanking the rug out from under the audience and leaving them dazed and confused on the floor.

Posted to films by mdb at 6:17 pm | Comments (0)       

October 21, 2009

Reveal your sources

Walter expressed a desire for easy access to the papers covered here, so I have gone back and added a link to a PDF for all the papers I’ve reviewed and also a link to the ACM Portal site where possible. I’ll aim to do that going forward as well, though if there aren’t freely available versions I may only provide the ACM link and those of you with a membership can at least download them from there.

Posted to papers by mdb at 3:30 pm | Comments (0)       

Access Path Selection in a Relational Database Management System – Selinger, et al.

Summary
System R is introduced. SQL statement processing summarized: parsing, optimizing, code generation, execution. Research Storage System is introduced, supported scans summarized: segment and index scan. Costs for single relation access paths outlined along with general formula for estimating cost. Selectivity factor introduced, approximations provided for supported predicates (c = v, c > or < v, c1 < v < c2, c in l, c in subquery, or, and, not). Search algorithm plus heuristics for join access path selection summarized. Method for calculating costs is provided along with full walk-through of example join. Handling of nested queries is explained.

Comments
If you’ve ever wondered about the gory details of how a relational database decides precisely how to process a query, this is a great place to start. They give good formulas for estimating the cost (in disk reads) of various query predicates based on limited statistics about the indexes and relations, and a robust method for combining that information together as you consider different potential join orders. The whole gamut of query optimization has produced many PhDs over the last thirty years and is accordingly pretty complex, but this is the solid foundation on which it all rests.

I also discovered correlation subqueries (select NAME from EMPLOYEE X where SALARY > (select SALARY from EMPLOYEE where EMPLOYEE_NUMBER = X.MANAGER)), which I never knew about (or never had a query that needed such a thing, I suppose). My mind balks at how expensive they are to compute (they talk about some possible optimizations), but they definitely increase the expressive power of SQL.

A couple of amusing anecdotes:

  • These guys were talking about subqueries in 1979. How long did it take MySQL to support them?

  • Even these authors, which include one of the inventors of SQL, can’t decide whether to say “a SQL statement” or “an SQL statement”.

Source: PDF ACM

Posted to papers by mdb at 11:49 am | Comments (0)       

October 19, 2009

Join Processing in Database Systems with Large Main Memories – Shapiro

Summary
Four algorithms for joining larger than main memory tables are considered: sort-merge, simple hash, GRACE hash and hybrid hash. Sort-merge partitions the tables using an ordering established by the join columns, GRACE and hybrid hash partition the tables via a hashing function and the simple hash algorithm does not explicitly partition the tables.

The performance of the four algorithms is analyzed, their minimum memory requirement (for non-pathological performance) is established, and circumstances in which the algorithms prove equivalent or to dominate one another is also discussed. In most circumstances, the hybrid algorithm is shown to provide the best performance.

Strategies for dealing with partition overflow are discussed, but not quantified. The benefits and performance implications of using the algorithms with a virtual memory system are also discussed with an toward achieving good performance and memory utilization where multiple queries may be run simultaneously. Finally three join-efficiency-improving tools are shown to be compatible with the algorithms: database filters, Babb arrays and semijoins.

Comments
Nothing like constrained resources to inspire clever algorithms. The hybrid algorithm proposed nicely degrades to the one that performs well in low memory when memory is low and upgrades to the one that performs well with copious memory when memory abounds. There is some serious hand waving over how the algorithm might cope with non-uniform distribution of the join keys which would likely result in it generating partitions too large to fit into memory. Some vague fallbacks are provided but they mostly involve adjusting the hashing algorithm after the fact, which sounds non-trivial. One presumes that in the intervening two decades these issues have been hammered out.

In these days of copious memory, a paper that talks about data sets requiring a whopping five megabytes of RAM might sound a bit quaint, but fortunately the amount of data we’re trying to process these days nicely outstrips the amount of RAM on our machines by the same few orders of magnitude. These algorithms may well prove useful on projects like Pig, where the tendency is probably to say “We scale out, so we don’t have to worry about performance on individual machines; just do things the naive way.” Maybe we’ll see these same algorithms crop up in another five years, except instead of being motivated by a reduction in query time, they’ll be motivated by the reduction in power consumption and disk drive wear that is achieved by doing fewer disk reads and memory compares/copies as we grind through our petabytes.

Source: PDF ACM

Posted to papers by mdb at 10:32 am | Comments (0)       

October 18, 2009

Where the Wild Things Are

Very pretty, very whimsical, and very sad. Another triumph of honesty over feel-good moralizing. Max was kind of fucked up. The wild things were kind of fucked up. Life usually isn’t peaches and cream.

Posted to films by mdb at 6:22 pm | Comments (0)       

October 14, 2009

Generalized Search Trees for Database Systems – Hellerstein, et al.

Summary
Introduces existing state of the art: specialized search trees and search trees for extensible data types. Introduces Generalized Search Tree (GiST). Explains general search tree abstraction, shows how this maps nicely to GiST. Explains structure, invariants and key methods (Consistent, Union, Compress, Decompress, Penalty, PickSplit) of GiST. Defines Search, elaborates on optimizations for search on linearly ordered domains. Outlines how Insert works, omits explanation of Delete. Instantiates GiST as B+-tree, R-tree and (novel) RD-tree (set-valued attribute index), summarizes key methods in each case. Discusses performance implications of key overlap and compression loss, provides analysis of performance of RD-tree. Sketches implementation issues: in-memory efficiency, concurrency-related, variable-length keys, bulk loading, optimizer integration and actual coding. Proposes as future work: theory of indexability, indexing of non-standard domains, query optimization and cost estimation, lossy key compression techniques and algorithmic improvements (c.f. R*-tree improvements over R-tree).

Comments
The GiST is clearly a useful abstraction over pre-existing search trees. The key contribution of this paper would seem to be identifying the generalization of those trees as successive partitions of a data set and defining algorithms based around that abstraction. In addition to simply saving time for the implementer of a new index, it provides a conceptual starting point for considering whether and how your data can be indexed at all. Will it partition into a nicely balanced tree with good fanout? If not, some other approach may be desirable.

Of the six methods needed to implement a GiST, four really capture the essence of the process: Consistent indicates whether the subtree identified by a particular edge may contain data that matches your query. If so, you traverse it. If not, it can be skipped. Union is needed when restructuring the tree after an insert. Every edge must be made Consistent with the Union of the keys in the tree below that edge. Penalty guides the algorithm toward the optimal subtree in which to insert new data and PickSplit guides the algorithm in splitting a tree node in two when it becomes too large. In retrospect this seems like a perfectly natural way to capture the construction and use of a search tree but it no doubt took keen insight on the part of the authors to tease out exactly what was important and what was not.

Also interesting is the identification of limitations of the GiST. Not every data set nicely partitions into a perfectly non-overlapping hierarchy and though the GiST handles overlapping keys, performance degrades as overlap increases. Further, the union of the keys in a subtree must provide some sort of compressed representation that does not make the size of the index prohibitively large. For B+-trees one need only store the lower and upper bound of the data contained in each subtree (in fact, just the lower bound) and this partitions the data with no overlapping. Other index types are more problematic: with R-trees overlap becomes harder to avoid due to the use of bounding boxes to define Consistent, but using more fine-grained data would result in too much data at each index node. With RD-trees and set valued data, overlap becomes more problematic still. It is possible that the illustrations of the constraints on tree indexing is a greater contribution than the generalized framework for implementing tree indexes itself.

Source: PDF ACM

Posted to papers by mdb at 11:16 am | Comments (0)       

October 12, 2009

Operating System Support for Database Management – Stonebraker

Summary
Various operating system services correspond to services needed by DBMSs and yet lack just enough functionality or performance that they end up being reimplemented by DBMS vendors. The paper outlines those features and the mismatches in question.

  • Buffer pool management suffers from poor performance, an inability for the DBMS to provide hints to the replacement and prefetch mechanisms and a lack of control over flushing to disk needed for correct crash recovery.
  • The file system provides a variable length character array abstraction which is not ideal for databases and utilizes two tree structures (i-nodes and directory trees) on top of the trees used by a DBMS index.
  • Process scheduling is suboptimal in that context switches are expensive and a DBMS can often make intelligent more decisions about context switching like not switching out a process that holds an important buffer pool lock as all other processes will likely simply queue up behind that process if they are scheduled for execution.
  • IPC message sends have poor performance which limits the architectural choices that a DBMS might make to achieve higher throughput with multiple disks and better support a multiuser DBMS.

Limitations on having the operating system provide transaction exist in relation to crash recovery. Paged virtual memory systems also suffer some of the same problems as buffer pool management, and have additional page table memory overhead.

Comments
This paper dates from 1981 and is talking mainly about early versions of UNIX on PDP-11s and VAXen, thus its appeal is mainly historical. Many of the issues have simply disappeared thanks to Moore’s law applied to CPU speed and memory size. Having a 100K page table when mmap-ing a 100M file would seem no longer to be an issue. Similarly context switching overhead and IPC message sends have taken a back seat from a performance standpoint to efficiently utilizing multiple processors and optimally making use of multi-tier memory architectures.

One problem that has not gone away is dealing with disk subsystems. For a time DBMS vendors tried using raw disk I/O to achieve maximum performance, but advances in file system technology coupled with an explosion in disk drive technologies (RAID, drives with huge memory caches and complex controllers, SAN, NAS, SSD, etc.) have reduced the viability of that approach. It remains the purview of familiar-with-the-state-of-the-art DBAs to know how to best configure a database and a disk subsystem to achieve good performance.

Source: PDF ACM

Posted to papers by mdb at 11:12 am | Comments (0)       

Loft Remodel

Shortly before I departed sunny San Francisco, I had my loft remodeled. I finally got some pictures from my contractor as I didn’t have time to take any of my own before I took off. They are here for your viewing pleasure.

Posted to ramble by mdb at 10:29 am | Comments (0)       

October 9, 2009

GWT AsyncGen

The back-breaking straw landed on this camel earlier today when adding some methods to a GWT RemoteService interface and then remembering that, “Oh yeah, I have to go add those manually to the async mirror interface that goes with my service.”

First I poked around on the web. I found some sort of Maven plugin that in theory automatically generates these extra services, but I don’t use Maven and I’m not about to start now. Plus it had some bugs filed against it that made me think it was written by a barrel full of monkeys. There was some sort of IDE plugin, but I’m definitely not going to start using some random IDE. I also found some crazy old command line tool that harkened back to the @gwt.typeArgs days. No good.

So I did what I should have done a couple of years ago and wrote an ant task to DWIFN and to do it properly. So if you find yourself using GWT presently or in the future, run, don’t walk, to the following URL and spend the two minutes it will take to save yourself a bunch of annoying typing:

http://code.google.com/p/gwt-asyncgen/

Posted to code by mdb at 6:57 pm | Comments (0)       

October 8, 2009

Of Objects and Databases: A Decade of Turmoil – Carey, DeWitt

Summary
Objects & Databases in 1986: extended relational databases, persistent programming languages, OO databases, DBMS toolkits. O&D in 1996, what happened: DBMS toolkits and persistent programming languages never took off, OODBs also lackluster, object-relational “wrappers” (ORM) emerged, extended relational renamed to Object-Relational (ugh) and gained traction. Observations on why DB toolkits and OO databases didn’t take off, why OR databases and object-relational wrappers did. Brief aside on CORBA, OLE, COM/DCOM, Java. Predictions for O&D in 2006: OR matures, subsumes OODB. Challenges to get there: OR and OO query performance, more sophisticated clients (caching, splitting queries across cached client-side data and server data), parallelizing OR and OO, facades for legacy data, standardizing all of this.

Comments
Nothing too controversial in their review of the state of databases in 1986 and then 1996 (when the paper was written). I find it interesting to note that ORM was around in 1996. It feels like it took another ten years for Java to get any decent ORM but maybe the ORM they had in 1996 wasn’t quite Hibernate for C++. Points to the authors for predicting the demise of CORBA and for good reason: “years of database research has not been able to separate the majority of these services (e.g., indexing and transactions, or collections and queries) in a manner capable of providing anything approaching reasonable performance. … We expect such approaches will perform poorly and will ultimately fail as well.” Funny that they see Java mainly as an interesting way to support ADTs in the database server (as a database extension language). Though Oracle did that at some point, I don’t believe it is engendering comparisons to sliced bread. I think at the same time they decided they needed to tack an ‘i’ on their version number to make things more Internety.

“Prediction is very hard, especially of the future,” so I won’t dwell much on their predictions. They seemed to think OO would mature as an extension inside the RDBMS but it seems to have mostly died. Not many people are putting direct links between rows in their databases (versus using foreign keys), and we’re not writing SQL that involves object path expressions. Instead that stuff is all getting handled in more sophisticated ORM libraries (like Hibernate and Active Record) which handle the mapping to the relational model on the client side. Perhaps this is because a thousand client library writers can experiment and refine more rapidly than a handful of commercial DB vendors and a couple of schizophrenic open source projects. Also some of the need for OO has been replaced by XML, which albeit very different, does allow you to store a hierarchy of “objects” and extract attributes via path expressions.

Source: PDF CiteSeerX

Posted to papers by mdb at 8:37 am | Comments (0)       

October 7, 2009

Zombieland

It’s not going to unseat Shaun of the Dead as the funniest zombie movie ever, but it’s pretty dang funny. The movie does a bit of the taking itself too seriously, and there’s a little gratuitous romance, but by and large it sticks to the zombies and humor. Woody Harrelson even manages to bring a tender side to a role not too unlike the one he played in Natural Born Killers.

Posted to films by mdb at 7:09 pm | Comments (0)       

October 6, 2009

Finding Bugs in Dynamic Web Applications – Artzi, et al.

Summary
A system for automatically generating test inputs for and testing the output of PHP web applications is described. Input that maximizes coverage is generated via concolic (concrete and symbolic) execution of the PHP code, tracking user input as it travels through the program and accumulating execution path constraints when variables tagged as input are involved in conditionals. New inputs are generated by successively negating elements of each path constraint and using a constraint solver in conjunction with a test input generator to generate new inputs that satisfy the mutated path constraints.

This process is repeated on successive sets of path constraints. PHP errors as well as HTML validity errors are considered bugs and are tracked in a database along with the path constraints and particular choice of inputs that triggered them. The architecture of the system is described as well as each individual component: executor (shadow interpreter, database manager), bug finder (oracle, bug report repository, input minimizer) and input generator (symbolic driver, constraint solver, value generator).

The results of executing the tool on four open source PHP projects is provided along with comparisons versus random test input generation and a static analysis tool that uses a substantially different approach.

Comments
The process of tracking which variables are associated with inputs and constructing a set of execution path constraints that identify a particular unique execution path through the program seems to be a powerful technique, though it apparently did not originate in this paper. Their main contribution seems to be being crazy enough to do this for PHP and ironing out some of the resultant wrinkles. They also claim that using an HTML validator to identify a larger class of program “failures” is a novel contribution. Fair enough, but I don’t think it’s as big a deal as they make it out to be.

They also had to punt somewhat on simulating user input and handling session state. For scripts that output form buttons or other means by which a user would provide input and cause a new script to be executed, they manually added some code to make that input appear as if it were an input to the original page and then added a switch statement that executed the subsequent page inline, tacking its output onto the output of the first page. That’s clever and expedient, but it also side-steps one of the hard and possibly interesting challenges of testing web applications. To be fair, they clearly point out this limitation and claim to be working on it.

I applaud the extra effort put into making the results actually useful to a developer using them to find and fix bugs in their application. They go to extra trouble to track the particular line of PHP code that triggers the script failure or that emits the malformed HTML and they include that information in their bug reports. They also take the trouble to merge reports that generate essentially the same failures and make efforts to minimize the path constraints needed to trigger a failure such that they usually provide the minimum set of inputs needed to reproduce a bug.

Source: PDF ACM

Posted to papers by mdb at 10:11 pm | Comments (0)       

Hazard Reduction in Scala Actors – Stiegler

Summary
Stiegler decries the existence of four hazards of using actors in Scala: global mutable state, the ability to pass a mutable object into an actor’s constructor, the ability to pass a mutable object as a message to an actor and the ability to capture mutable state when defining an actor as an anonymous closure. He then proposes a mechanism to mitigate these hazards which comes in the form of Actress, a wrapper around the Actor class, and Sendable, a root type that identifies legal (ostensibly immutable) arguments to be provided to the Actress constructor and which must be sent as messages. He also proposes that the protected Actress.make() method, which is used to define the body of the actual Actor wrapped by an Actress, encourages the programmer to avoid constructs that might capture mutable state.

Comments
The fundamental problem that the proposal attempts to address is no doubt valid: Scala has mutable state, this introduces dangers when writing concurrent programs. Unfortunately, the proposed technique is neither elegant, nor an especial improvement on the current state of affairs.

In a language which allows mutability we have two approaches when it comes to providing mechanisms for concurrent programming (I’m going to ignore the third approach of “do nothing to discourage the programmer from hurting themselves,” one that has proved all too popular in the past):

1. Provide a type system that can represent propositions about mutability and immutability and design your concurrency mechanisms to make use of that type system to ensure that only immutable data is shared between threads. This is the difficult solution and incidentally one of my research interests. I conjecture, in fact, that immutability is a stronger requirement than is needed and a truly useful system will allow some mutability in provably safe circumstances.

2. Design your concurrency mechanisms such that they provide a clear indication to writers and readers of code that We Are Now Doing Concurrent Things: Be Careful! One benefit of the actor model is that it does exactly that. When you are creating an actor you are reminded that you are in the dangerous realm of concurrent programming, and thus motivated to heed the lessons of the past and use mutable state carefully and sparingly. You know it is a bad idea to send mutable state between actors in a message, so you don’t.

I believe that approach number one is worth pursuing, but it will have to provide a meaningful improvement over approach number two which, though only recently adopted over the “do nothing” approach, seems to be helping quite a bit.

Stiegler’s proposal provides no such improvement. It is a thin veil of annoyance standing between the programmer and doing the wrong thing which is altogether too easy to rip through.

  • He claims that by requiring that Actress take a Sendable in its constructor, it encourages the programmer not to provide mutable state to the constructor of their derived Actresses. I disagree.

    The right way to pass three integers to your actor (please try not to shudder at the verbosity):

    class MyActor (args: Sendable) extends Actress(args) {
        protected def make () :Actor = actor {
            val (one, two, three) = args match {
                case ListS(IntS(aone) :: IntS(atwo) :: IntS(athree)) => (aone, atwo, athree)
                case _ => throw new IllegalArgumentException
            }
            // ...
        }
    }
    

    The lazy (or clever programmer) will realize that the integers are immutable, so why go to all that trouble:

    class MyActor (one: Int, two :Int, three :Int) extends Actress(StrS("ignored")) {
        protected def make () :Actor = actor {
            // ...
        }
    }
    

    Then some other programmer will come along and derail the whole effort and add some mutable parameter:

    class MyActor (one: Int, two :Int, three :Int, four :Array[Int]) extends Actress(StrS("ignored")) {
        protected def make () :Actor = actor {
            // ...
        }
    }
    

    and we’re right back where we started.

    Further, if you wish to provide no arguments to your Actress constructor, you have to needlessly pass some placeholder Sendable to Actress’s constructor to satisfy its demand for an argument.

    class MyActor extends Actress(StrS("ignored")) {
        protected def make () :Actor = // ...
    }
    

    As you can also see above, your constructor arguments then need to be extracted from this single compound argument before you can use them. Boilerplate++.

  • He claims that allowing only Sendable derivations as arguments to the ! method encourages the programmer not to send mutable state as an argument, but them immediately goes on to point out that it is trivially easy to create something like IntArrayS(data :Arrray[Int]) which allows mutable data to be used as an argument. When you are forced to package every single message argument in these niggling wrapper classes, it would be hardly surprising for a naive programmer to simply wrap up their mutable state and send it along, thinking this was just “how things worked.”

I’ll stop ranting now. Clearly a solution that achieves an increase in safety through a reduction in elegance and convenience rubs me the wrong way. On the plus side, I can thank the author for redoubling my interest in working on a proper solution to this problem.

Source: PDF

Posted to papers by mdb at 9:24 am | Comments (0)       

Older »
MDB

Bits
Ludography
Reviewlets
Camera Wrestling
Archives:

samskivert.com ©2001 - 2009 Michael Bayne <mdb@samskivert.com>