Monday, July 27, 2015

In-Memory Computing Summit 2: Database Innovation

In the late 1990s, my boss at Aberdeen Group asked me a thought-provoking question:  Why was I continuing to cover databases?  After all, he pointed out, it seemed at first glance like a mature market – the pace of technology innovation had slowed and nothing important seemed to be on the horizon.  Moreover, consolidation meant that there were fewer and fewer suppliers to cover.  My answer at the time was that users in “mature” markets like the mainframe still needed advice on key technologies – advice they would not otherwise find, because analysts following my boss’s logic would flee the field.
However, as it turned out, this was not the right reason at all – shortly thereafter, the database field saw a new round of innovations centering on columnar databases, data virtualization, and low-end Linux efforts that tinkered with the ACID properties of relational databases.  These, in turn, led to the master data management, data governance, global repository, and columnar/analytic database technologies of the late 2000s.  In the early 2010s, we saw the Hadoop-driven “NoSQL” loosening of commit constraints to enable rapid analytics on Big Data by sacrificing some data quality.
As a result, the late-1990s “no one ever got fired for buying Oracle” mature market wisdom is now almost gone from memory – and databases delivering analytics are close to the heart of many firms’ strategies.  And so, it seems that the real reason to cover databases today is that their markets, far from being mature, are rapidly spawning new entrants and offering many technology-driven strategic reasons to upgrade and expand.
The recent 2015 In-Memory Computing Summit suggests that a new round of database innovation, driven by the needs listed in my first post, is bringing changes to user environments especially in three key areas:
1. Redefinition of data-store storage tiering, leading to redesign of data-access software;
2. Redefinition of “write to storage” update completion, allowing flexible loosening of availability constraints in order to achieve real-time or near-real-time processing of operational and sensor-driven Big Data; and
3. Balkanization of database architectures, as achieving Fast Data turns out to mean embedding a wide range of databases and database suppliers in both open source and proprietary forms.

Real-Time Flat-Memory Storage Tiering:  Intel Is a Database Company

All right, now that I’ve gotten your attention, I must admit that casting Intel as primarily a database company is a bit of an exaggeration.  But there’s no doubt in my mind that Intel is now thinking about its efforts in data-processing software as strategic to its future.  Here’s the scenario that Intel laid out at the summit:
At present, only 20% of a typical Intel CPU is being used, and the primary reason is that it sits around waiting for I/O to occur (i.e., for needed data to be loaded into main memory) – or, to use a long-disused phrase, applications running on Intel-based systems are I/O-bound.  To fix this problem, Intel aims to deliver faster I/O – equivalently, the ability to service the I/O requests of multiple concurrently running applications, and to do so faster.  Since disk does not offer much prospect for I/O-speed improvement, Intel has championed a software protocol standard for flash memory, NVMe (NVM Express).  However, to ensure that this protocol does indeed speed things up adequately, Intel must write the necessary data-loading and data-processing software itself.
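For the arithmetic-minded, here is a back-of-the-envelope model of what "I/O-bound" means for utilization; the per-request microsecond figures are my own illustrative assumptions, not Intel's numbers:

# Back-of-the-envelope model of an I/O-bound CPU (illustrative numbers only).
# The CPU does useful work only while it has data in hand; the rest of the
# time it waits for I/O to complete.

def cpu_utilization(compute_us: float, io_wait_us: float) -> float:
    """Fraction of wall-clock time the CPU spends doing useful work."""
    return compute_us / (compute_us + io_wait_us)

# Hypothetical per-request costs: 20 microseconds of compute stuck behind
# 80 microseconds of I/O wait yields the roughly 20% utilization cited above.
print(cpu_utilization(20, 80))   # 0.2 -- I/O-bound
# Cut the I/O wait (faster flash, NVMe-style queueing) and utilization climbs.
print(cpu_utilization(20, 5))    # 0.8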
So will this be enough for Intel, so that it can go back to optimizing chip sets? 
Well, I predict that Intel will find that speeding up I/O from flash storage, which treats flash purely as storage, will not be enough to fully optimize I/O.  Rather, I think that the company will also need to treat flash as an extension of main memory:  Intel will need to virtualize (in the old sense of virtual memory) flash memory and treat main memory and flash as if they were on the same top storage tier, with the balancing act between faster-response main memory and slower-response flash taking place "beneath the software covers."  Or, to coin another phrase, Intel will need to provide virtual processors handling I/O as part of their DNA.  And from there, it is only a short step to handling the basics of data processing in the CPU -- as IBM is already doing via IBM DB2 BLU Acceleration.
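To make that concrete, here is a toy sketch – entirely my own, with made-up names and sizes – of a single "flat" tier spanning main memory and flash, with the balancing act handled beneath the covers:

from collections import OrderedDict

class FlatMemory:
    """Toy model of one 'flat' tier spanning DRAM and flash.

    Callers see a single get/put space; the balancing act between fast DRAM
    and slower flash happens beneath the covers, LRU-style.  (Illustrative
    only -- real implementations work at page granularity inside the OS or
    the database engine, not in application code.)
    """

    def __init__(self, dram_slots: int = 4):
        self.dram_slots = dram_slots
        self.dram = OrderedDict()   # hot data, fast access
        self.flash = {}             # colder data, slower access

    def put(self, key, value):
        self.dram[key] = value
        self.dram.move_to_end(key)
        self._spill()

    def get(self, key):
        if key in self.dram:                 # DRAM hit
            self.dram.move_to_end(key)
            return self.dram[key]
        value = self.flash.pop(key)          # flash hit: promote to DRAM
        self.put(key, value)
        return value

    def _spill(self):
        # Evict least-recently-used items from DRAM down to flash.
        while len(self.dram) > self.dram_slots:
            cold_key, cold_value = self.dram.popitem(last=False)
            self.flash[cold_key] = cold_value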

Real-Time Flat-Memory Storage Tiering:  Redis Labs Discovers Variable Tiering

"Real time" is another of those computer science phrases that I hate to see debased by marketers and techie newbies.  In the not-so-good old days, it meant processing and responding to every single (sensor) input in a timely fashion (usually less than a second) no matter what.  That, in turn, meant always-up and performance-optimized software aimed specifically at delivering on that "no matter what" guarantee.  Today, the term seems to have morphed into one in which basic old-style real-time stream data processing (e.g., keep the car engine running) sits cheek-by-jowl with do-the-best-you-can stream processing involving more complex processing of huge multi-source sensor-data streams (e.g., check if there's a stuck truck around the corner you might possibly bash into).  The challenge in the second case (complex processing of huge data streams) is to optimize performance and then prioritize speed-to-accept-data based on the data's "importance".
I must admit to having favorites among vendors based on their technological approach and the likelihood that it will deliver new major benefits to customers, in the long run as well as the short. At this conference, Redis Labs was my clear favorite.  Here's my understanding of their approach:
The Redis architecture begins with a database cluster with several variants, allowing users to trade off failover/high availability against performance and to maximize main memory and processors in a scale-out environment.  Then, however, the Redis Labs solution focuses on "operational" processing (online, real-time, update/modify-heavy, sensor-driven transaction processing).  To do this, of course, the Redis Labs database puts data in main memory where possible.  Where it does not fit (according to the presentation), the Redis Labs database treats flash as if it were main memory, mimicking flat-memory access on flash interfaces.  As the presenter put it, at times of high numbers of updates, flash is main memory; at other times, it's storage.
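As I understand it, that tiering is invisible to the application: a standard redis-py call sequence (host, port, and key names below are placeholders) looks the same whether a given value is currently sitting in DRAM or on flash:

import redis  # standard redis-py client; host and port below are placeholders

# As I understand the Redis Labs approach, whether a value currently sits in
# DRAM or has been pushed down to flash is invisible to the application --
# the client issues the same commands either way.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Heavy sensor-style writes: at peak, flash behaves like extra main memory.
for sensor_id in range(1000):
    r.set(f"sensor:{sensor_id}:latest", "42.7")

# Reads look identical whether the value is served from DRAM or flash.
print(r.get("sensor:17:latest"))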
Redis Labs cited numerous benchmarks to show that the resulting database was the fastest kid on the block for "operational" data streams.  To me, that's a side effect of a very smart approach to performance optimization that crucially includes the ideas of using flash as if it were main memory and of varying the use of flash as storage – meaning that sometimes all of flash is the traditional tier-2 storage and sometimes all of flash is tier-1 processing territory.  And, of course, in situations where main memory and flash are all that is typically needed for processing and storage, we might as well junk the tiering idea altogether:  it's all flat-memory data processing.

memSQL and Update Completion

Redis Labs' approach may or may not complete the optimizations needed for flash-memory operational data processing.  At the Summit, memSQL laid out the most compelling approach I heard to the so-called "write-to-storage" issue.
I first ran into write-to-storage when I was a developer attending meetings of the committee overseeing the next generation of Prime Computer's operating system.  As they described it, in those days there was main-memory storage, which vanished whenever you turned off your machine or the system crashed, and there was everything else (disk and tape, mostly), which kept that information for a long time whether the system crashed or not.  So database data, files, or anything new or changed didn't really "exist" until it was written to disk or tape.  And that meant that further access to that data (modifications or reads) had to wait until the write-to-disk had finished.  Not a major problem in the typical case; but where performance needs for operational (or, in those days, online transaction processing/OLTP) data processing required it, write-to-storage was and is a major bottleneck.
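Here is a minimal sketch, in modern terms, of why that wait matters: the update cannot be acknowledged – and nothing else can safely proceed against it – until the write has reached durable media (the file name and record format below are illustrative only):

import os

def durable_append(path: str, record: bytes) -> None:
    # The change does not really "exist" until it reaches durable media, so
    # we block on fsync() before acknowledging -- this is the slow part.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, record + b"\n")
        os.fsync(fd)        # wait for the device
    finally:
        os.close(fd)

durable_append("/tmp/txn.log", b"account=42 balance=100")
# Only now is it safe to report the update as committed.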
memSQL, sensibly enough, takes the "master-slave" approach to addressing this shortcoming.  While the master software continues on its merry way with the updated data in main memory, the slave busies itself with writing the data to longer-term storage tiers (including flash used as storage).  Problem solved?  Not quite.
If the system crashes after a modification has arrived but before the slave has finished writing (a less likely occurrence, but still possible), then that change – and any subsequent change layered on top of it – is lost.  However, in keeping with the have-it-your-way approach of Hadoop, memSQL allows the user to choose a tradeoff between raw performance and what it calls a "high availability" version.  And so flash, plus master-slave processing, plus a choice of "availability" means that performance increases in both operational and analytical processing of sensor-type data, the incidence of the write-to-storage problem decreases, and the user can flexibly choose to accept some data loss to achieve the highest performance, or vice versa.
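A generic sketch of that master-slave tradeoff – not memSQL's actual implementation, and with invented names throughout – might look like this:

import queue, threading, time

memory_store = {}               # the master's in-memory state
persist_queue = queue.Queue()   # updates waiting to reach durable storage
persisted = {}                  # stand-in for disk or flash-used-as-storage

def background_writer():
    # The "slave": drains the queue and writes to longer-term storage.
    while True:
        key, value = persist_queue.get()
        time.sleep(0.01)        # simulate the slow write-to-storage
        persisted[key] = value
        persist_queue.task_done()

threading.Thread(target=background_writer, daemon=True).start()

def update(key, value, wait_for_persist: bool = False):
    memory_store[key] = value           # the master continues on its merry way
    persist_queue.put((key, value))
    if wait_for_persist:                # "high availability" mode: slower, safer
        persist_queue.join()
    # Fast mode returns now; a crash in the next few milliseconds loses the update.

update("sensor:42", 98.6)                               # fast, small window of loss
update("ledger:42", 1000000, wait_for_persist=True)     # slower, but durable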

The Balkanization of Databases:  The Users Begin To Speak

Balkanization may be an odd phrase for some readers, so let me add a little pseudo-history here.  Starting a bit before 100 BC, the Roman Empire had the entire region from what is now Hungary to northern Greece (the "Balkans") under its thumb.  In the late 400s, a series of in-migrations led to a series of new occupiers of the region, and some splintering, but around 1450 the Ottoman Empire again conquered the Balkans.  Then, in the 1800s, nationalism arrived, Romantics dug up or created rationales for much smaller nations, and the whole Balkan area irretrievably broke up into small states.  Ever since, such a fragmentation has been referred to as the "Balkanization" of a region.
In the case of the new database scenarios, users appear to be carving out similarly smaller territories via a multitude of databases.  One presenter's real-world use case involved, on the analytics side, HP Vertica among others, and, on the operational side, several NoSQL databases, MongoDB among them.  I conclude that, within enterprises and in public clouds, there is a strong trend towards Balkanization of databases.
That is new.  Before, the practice was always for organizations to at least try to minimize the number of databases, and for major vendors to try to beat out or acquire each other.  Now, I see more of the opposite, because (as memSQL's speaker noted) it makes more sense if one is trying to handle sensor-driven data to go to the Hadoop-based folks for the operational side of these tasks, and to the traditional database vendors for the analytical side.  Given that the Hadoop-based side is rapidly evolving technologically and spawning new open-source vendors as a result, it is reasonable to expect users to add more database vendors than they consolidate, at least in the near term.  And in the database field, it's often very hard to get rid of a database once installed. 

SAP and Oracle Say Columnar Analytics Is "De Rigueur"

Both SAP and Oracle presented at the Summit.  Both had a remarkably similar vision of "in-memory computing" that involved primarily columnar relational databases and in-main-memory analytics.
In the case of SAP, that perhaps is not so surprising.  SAP HANA marketing has featured its in-main-memory and columnar-relational technologies for some time.  Oracle’s positioning is a bit more startling:  in the past, its acquisition of TimesTen and its development of columnar technologies had been treated in its marketing as a bit more of a check-list item -- yeah, we have them too, now about our incredibly feature-full, ultra-scalable traditional database ...
Perhaps the most likely answer to why both Oracle and SAP were there talking about columnar is that, for flat-memory analytics, columnar's ability to compress data – and hence fit it into the main-memory and/or flash tier more frequently – trumps traditional row-oriented relational strengths where joins involving fewer than three compressible rows are concerned.  Certainly, the use case cited above, in which HP Vertica's columnar technology was called into service, makes the same point.
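A toy example of why compression matters so much here (the data and byte counts are from this made-up example, not a benchmark): a column with few distinct values run-length-encodes down to almost nothing, so far more of it fits in the main-memory/flash tier:

from itertools import groupby

column = ["OK"] * 900 + ["WARN"] * 90 + ["FAIL"] * 10   # one column, 1,000 rows

# Run-length encoding: store each run of identical values as (value, count).
runs = [(value, len(list(group))) for value, group in groupby(column)]
print(runs)                     # [('OK', 900), ('WARN', 90), ('FAIL', 10)]

raw_bytes = sum(len(value) for value in column)          # 2,200 bytes of text
rle_bytes = sum(len(value) + 4 for value, _ in runs)     # 22 bytes with 4-byte counts
print(raw_bytes, rle_bytes)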
And yet, the rise in columnar's importance in the new flat-memory systems also reinforces the Balkanization of databases, if subtly.  In Oracle's case, it changes the analytical product mix.  In SAP's case, it reinforces the value of a relatively new entrant into the database field.  In HP's case, it brings in a relatively new database, from a relatively new database vendor, that most users have never touched or have used only lightly before.  Even within the traditionally non-Balkanized turf of analytical-database vendors, some effective Balkanization is beginning to happen, and one of its key driving forces is the usefulness of columnar databases in sensor-driven-data analytics.

A Final Note:  IBM and Cisco Are Missing At the Feast But Still Important

Both IBM's DB2 BLU Acceleration and Cisco Data Virtualization, imho, are important technologies in this Brave New World of flat-memory database innovation; but neither was a presenter at the Summit.  That may be because the Summit was a bit Silicon-Valley-heavy, but I don't know for sure.  I hope at some point to give a full discussion of the assets these products bring to the new database architectures, but not today.  Hopefully, the following brief set of thoughts will give an idea of why I think they are important.
In the case of IBM, what is now DB2 BLU Acceleration anticipated and leapfrogged the Summit in several ways, I think.  Not only did BLU Acceleration optimize main-memory and to some extent flash memory analytics using the columnar approach; it also optimized the CPU itself.  Among several other valuable BLU Acceleration technologies is one that promises to further speed update processing and, hence, operational-plus-analytic columnar processing.  The only barrier -- and so far, it has proved surprisingly high -- is to get other vendors interested in these technologies, so that database "frameworks" which offer one set of databases for operational and another for analytical processing can incorporate "intermediate" choices between operational and analytic, or optimize operational processing yet further.
In the case of Cisco, its data virtualization capabilities offer a powerful approach to creating a framework for the new database architecture along the lines of a TP monitor -- and so much more.  The Cisco Data Virtualization product is pre-built to optimize analytics and update transactions across scale-out clusters, so is well acquainted with all but the very latest Hadoop-based databases, and has excellent user interfaces.  It can also serve as a front end to databases within a slot/"framework", or as a gateway to the entire database architecture.  As I once wrote, this is amazing "Swiss army knife" technology -- there's a tool for everything.  And for those in Europe, Denodo’s solutions are effective in this case as well.
I am sure that I am leaving out important innovations and potential technologies here.  That's how rich the Summit was to a database analyst, and how exciting it should be to users.
So why am I a database analyst, again?  I guess I would say, for moments like these.
