Tuesday, August 31, 2010

1010data: Operating Well in a Parallel Universe?

Recently, I received a blurb from a company named 1010data, claiming that its personnel had been doing columnar databases for more than 30 years. As someone who was integrally involved at a technical level in the big era of database theory development (1975-1988), when everything from relational to versioning to distributed to inverted-list technology (the precursor to much of today’s columnar technology) first saw the light, I was initially somewhat skeptical. My wariness was heightened by marketing claims that 1010data’s data-warehousing performance was better not only than that of relational databases but also than that of competitors’ columnar databases, even though 1010data does little in the way of indexing; and this performance advantage supposedly applied not only to ad-hoc queries with little discernible pattern, but also to many repetitive queries for which index-style optimization would seem the logical thing to do.

1010data’s marketing is not clear as to why this should be so; but after talking to the company and reading its technical white paper, I have come up with a theory as to why it might be. The theory goes like this: 1010data is not living in the same universe.

That sounds drastic. What I mean is that, back in the 1980s, while the great mass of database theory and practice went one way, 1010data went another, and by now, in many cases, the two really aren’t talking the same language. So what follows is an attempt to recast 1010data’s technology in terms familiar to me. Here’s the way I see it:

Since the 1980s, people have been wrestling with the problem of read and write locks on data. The problem is that if you update a datum while another person is attempting to read it, that person either sees a different value than you do or can’t predict which value he/she will see. To avoid this, the updater can block all other access via a write lock – which in turn slows the reader down drastically; or the “query from hell” can block updaters via a read lock on all the data it touches. In a data warehouse, updates are typically held and then rushed through at certain times (end of day or week) in order to avoid locking problems. Columnar databases also sometimes provide what is called “versioning”, in which previous values of a datum are kept around, so that the updater can operate on one value while the reader operates on another.
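To make the contrast concrete, here is a minimal sketch of the versioning idea in Python. This is my own illustration, not 1010data’s code, and the class name is invented; the point is simply that a reader pins the version that existed when its session began, so neither side ever waits on a lock.

# Illustrative sketch of versioning (my own, not 1010data's): readers pin the
# version that existed when their session began; writers append new versions.
class VersionedValue:
    def __init__(self, initial):
        self.versions = [initial]             # every value ever written, oldest first

    def latest_version(self):
        return len(self.versions) - 1         # a reader "pins" this index at session start

    def read(self, pinned_version):
        return self.versions[pinned_version]  # never blocked by a concurrent write

    def write(self, new_value):
        self.versions.append(new_value)       # never blocked by concurrent reads

balance = VersionedValue(100)
snapshot = balance.latest_version()   # a long-running reader starts here
balance.write(250)                    # the updater proceeds without any lock
print(balance.read(snapshot))         # the reader still sees 100, consistently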

1010data provides a data warehouse/business intelligence solution as a remote service – the “database as a service” variant of SaaS/public cloud. However, 1010data’s solution does not start by worrying about locking. Instead, it worries about how to provide each end user with a consistent “slice of time” database of his/her own. It appears to do this as follows: all data is divided into what 1010data calls “master” tables (as in “master data management” of customer and supplier records), which are relatively small, and time-associated/time-series “transactional” tables, which are the really large ones.

Master tables change relatively rarely, and therefore a full copy of the table after each update (really, each “burst” of updates) can be stored on disk, and loaded into main memory if an end user needs it, with little storage and processing overhead. That isn’t feasible for the transactional tables; but 1010data sees old versions of these as integral parts of the time series, not as superseded data, so the amount of “excess” data appended to a table – if the maximum session length for an end user is a day – is small in all realistic circumstances. As a result, two versions of a transactional table consist of a pointer to a common ancestor plus a small “append” apiece. That is, the storage overhead of the additional versioning data is small compared to that of some other columnar technologies, and not that much more than that of row-oriented relational databases.
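Here is a rough sketch of how I understand the “common ancestor plus append” scheme, again in Python with invented names rather than anything from 1010data’s actual storage format:

# My reconstruction of "ancestor plus append" table versioning; names are invented.
class TableVersion:
    def __init__(self, ancestor=None, appended_rows=None):
        self.ancestor = ancestor                  # shared, unchanged prior version
        self.appended_rows = appended_rows or []  # only the new rows are stored

    def rows(self):
        # Materialize this version's full table by walking back through its ancestors.
        inherited = self.ancestor.rows() if self.ancestor else []
        return inherited + self.appended_rows

monday = TableVersion(appended_rows=[("2010-08-30", "SKU-1", 4)])
tuesday = TableVersion(ancestor=monday,
                       appended_rows=[("2010-08-31", "SKU-2", 7)])  # one day's burst
print(len(monday.rows()), len(tuesday.rows()))  # 1 2 -- monday's version is untouched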

Now the other shoe drops: in my very rough approximation, versioning entire tables instead of particular bits of data allows you to keep those bits of data pretty much sequential on disk – hence the lack of need for indexing. It is as if each burst of updates comes with an online reorganization that restores the sequentiality of the resulting table version, so that reads during queries involve almost no seek time. The storage overhead means that more data must be loaded from disk; but that is more than compensated for by eliminating the need to jerk from one end of the disk to the other in order to inhale all the needed data.
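A back-of-envelope calculation shows why sequential layout can dominate even when extra data has to be read. The disk figures below are rough 2010-era numbers of my own choosing, not 1010data’s:

# Rough 2010-era disk figures; these numbers are my assumptions, not 1010data's.
seek_time_s = 0.008              # ~8 ms average seek for a random read
sequential_mb_per_s = 100.0      # ~100 MB/s streaming throughput
column_mb = 400.0                # one column of a large transactional table
random_reads = 50_000            # scattered 8 KB reads if the same data were fragmented
read_kb = 8

sequential_s = column_mb / sequential_mb_per_s
per_read_s = seek_time_s + (read_kb / 1024.0) / sequential_mb_per_s
scattered_s = random_reads * per_read_s
print(f"sequential scan: {sequential_s:.0f}s   scattered reads: {scattered_s:.0f}s")
# sequential scan: 4s   scattered reads: 404s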

So here’s my take: 1010data’s claim to better performance, as well as to competitive scalability, is credible. We live in a universe in which indexing to minimize disk seek time, plus minimizing added storage to minimize disk accesses in the first place, lets us push against the limits of locking constraints; so we are properly appreciative of the ability of columnar technology to provide additional storage savings and bit-mapped indexing that keep more data in memory. 1010data lives in a universe in which locking never happens and data is stored pretty much sequentially, so it can happily forget indexes, squander a little disk storage, and still perform better.

1010data Loves Sushi
At this point, I could say that I have summarized 1010data’s technical value-add, and move on to considering best uses. However, to do that would be to ignore another way that 1010data does not operate in the same universe: it loves raw data. It would prefer to operate on data before any detection of errors and inconsistencies, as it views these problems as important data in their own right.

As a strong proponent of improving the quality of data provided to the end user, I might be expected to disagree strongly. However, as a proponent of “data usefulness”, I feel that the potential drawbacks of 1010data’s approach are counterbalanced by some significant advantages in the real world.

In the first place, 1010data is not doctrinaire about ETL (Extract, Transform, Load) technology. Rather, 1010data allows you to apply ETL at system implementation time, or simply to start with an existing “sanitized” data warehouse (although it is philosophically opposed to both approaches), or to apply transforms online, at query time. It’s nice that skipping the transform step when you start up the data warehouse speeds implementation. It’s also nice that you have the choice of going raw or staying baked.
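As a sketch of what transforming at query time looks like (purely illustrative; the rows and the function are mine, not 1010data’s API), the raw rows stay exactly as loaded and the cleansing rule runs as part of the query, so suspect data is surfaced rather than silently scrubbed:

# Illustration of transforming/validating at query time rather than at load time.
raw_sales = [
    {"store": "NYC", "amount": "125.50"},
    {"store": "nyc", "amount": "-1"},       # suspect row kept as loaded, not scrubbed away
]

def query_total(rows):
    total, suspect = 0.0, 0
    for row in rows:
        amount = float(row["amount"])
        if amount < 0:                      # the rule is applied when the query runs
            suspect += 1
            continue
        total += amount
    return total, suspect

print(query_total(raw_sales))               # (125.5, 1) -- bad data is surfaced, not hidden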

In the second place, data quality is not the only factor that determines how useful data is. Another key consideration is the ability of a wide array of end users to employ the warehoused data to perform more in-depth analysis. 1010data offers a user interface built on the Excel spreadsheet metaphor and supporting column/time-oriented analysis (as well as an Excel add-in), thus providing better rolling/ad-hoc time-series analysis to a wider class of business users familiar with Excel. Of course, someone else may come along and develop an equally flexible interface, although 1010data would seem to have a lead as of now; but in the meanwhile, the wider scope and additional analytic capabilities of 1010data appear to compensate for any problems with operating on incorrect data – especially when there are 1010data features to ensure that analyses take possible incorrectness into account.
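For concreteness, here is the kind of rolling time-series computation I have in mind, written in plain Python rather than in 1010data’s spreadsheet-style interface; the daily figures are invented:

# A simple trailing moving average over a daily series (illustrative data).
daily_sales = [120, 135, 128, 160, 175, 150, 190]

def rolling_average(series, window):
    # Emit an average only once a full window of trailing values is available.
    return [sum(series[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(series))]

print(rolling_average(daily_sales, 3))
# [127.67, 141.0, 154.33, 161.67, 171.67] (rounded)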

Caveat
To me, some of the continuing advantages of 1010data’s approach depend fundamentally on the idea that users of large transactional tables require ad-hoc historical analysis. To put it another way, if users really don’t need to keep historical data around for more than an hour, and require frequent updates/additions for “real-time analysis” (or online transaction processing), then tables will require frequent reorganizing and will include a lot of storage-wasting historical data, and 1010data’s performance advantages will decrease or vanish.

However, there will always be ad-hoc, in-depth queriers, and these are pretty likely to be interested in historical analysis. So while 1010data may or may not be the be-all, end-all data-warehousing database for all verticals forever, it is very likely to offer distinct advantages for particular end users, and therefore should remain a valuable complement to a data warehouse that handles vanilla querying on a “no such thing as yesterday” basis.

Conclusion
Not being in the mainstream of database technology does not mean irrelevance; not being in the same database universe can mean that you solve the same problems better. It appears that taking the road less travelled has allowed 1010data to come up with a new and very possibly improved solution to data warehousing, just as inverted-list technology resurfaced in the last few years to provide new and better columnar databases. And it is not improbable that 1010data can maintain its performance and ad-hoc-analysis advantages over the next few years.

Of course, proof of these assertions in the real world is an ongoing process. I would recommend that BI/data warehousing users in large enterprises in all verticals kick the tires of 1010data – as noted, testbed implementation is pretty swift – and then performance test it and take a crack at the really tough analyst wish lists. To misquote Santayana, those who do not analyze history are condemned to repeat it – and that’s not good for the bottom line.
