Wednesday, June 29, 2011

Big Data, MapReduce, Hadoop, NoSQL: Pay No Attention to the Relational Technology Behind the Curtain

One of the more interesting features of vendors’ recent marketing push to sell BI and analytics is the emphasis on the notion of Big Data, often associated with NoSQL, Google MapReduce, and Apache Hadoop – without a clear explanation of what these are, and where they are useful. It is as if we were back in the days of “checklist marketing”, where the aim of a vendor like IBM or Oracle was to convince you that if competitors’ products didn’t support a long list of features, those competitors could not provide the cradle-to-grave support you needed to survive computing’s fast-moving technology. As it turned out, many of those features were unnecessary in the short run, and a waste of money in the long run; remember rules-based AI? Or so-called standard UNIX? The technology in those features was later used quite effectively in other, more valuable pieces of software, but the value-add of the feature itself turned out to be illusory.

As it turns out, we are not back in those days, and Big Data via Hadoop and NoSQL does indeed have a part to play in scaling Web data. However, I find that IT buyers’ misunderstandings of these concepts may indeed lead to much wasted money, not to mention serious downtime. These misunderstandings stem from a common source: marketing’s failure to explain how Big Data relates to the relational databases that have fueled almost all data analysis and data-management scaling for the last 25 years. It resembles the scene in The Wizard of Oz where a small man, trying to sell himself as a powerful wizard by manipulating stage machines from behind a curtain, becomes so wrapped up in the production that when someone notes “There’s a man behind the curtain,” he shouts “Pay no attention to the man behind the curtain!” In this case, marketers are shouting so loudly about the virtues of Big Data, new data management tools, and “NoSQL” that they fail to note the extent to which relational technology is complementary to, necessary to, or simply the basis of, the new features.

So here is my understanding of the present state of the art in Big Data, and the ways in which IT buyers should and should not seek to use it as an extension of their present (relational) BI and information management capabilities. As it turns out, when we understand both the relational technology behind the curtain and the ways it has been extended, we can do a much better job of applying Big Data to long-term IT tasks.

NoSQL or NoREL?
The best way to understand the place of Hadoop in the computing universe is to view the history of data processing as a constant battle between parallelism and concurrency. Think of the database as a data store plus a protective layer of software that is constantly being bombarded by transactions – and often, another transaction on a piece of data arrives before the first is finished. To handle all the transactions, databases have two choices at each stage in computation: parallelism, in which two transactions are literally being processed at the same time, and concurrency, in which a processor switches between the two rapidly in the middle of the transaction. Pure parallelism is obviously faster; but to avoid inconsistencies in the results of the transaction, you often need coordinating software, and that coordinating software is hard to operate in parallel, because it involves frequent communication between the parallel “threads” of the two transactions.
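To make the tradeoff concrete, here is a minimal sketch (in Python, purely illustrative and not drawn from any particular database engine) of why uncoordinated parallel updates to the same piece of data go wrong, and of the lock that plays the role of the “coordinating software” – at the cost of serializing the critical section:

```python
# Illustrative sketch only: two "transactions" updating the same balance.
# Without coordination, interleaved read/write steps can lose updates;
# a lock restores consistency but serializes the contested work.
import sys
import threading

sys.setswitchinterval(1e-6)   # switch threads aggressively so the race is easy to see

balance = 0
lock = threading.Lock()

def deposit_uncoordinated(amount, times):
    global balance
    for _ in range(times):
        current = balance            # read...
        balance = current + amount   # ...then write: a concurrent update can be overwritten

def deposit_coordinated(amount, times):
    global balance
    for _ in range(times):
        with lock:                   # the "coordinating software": one thread at a time
            balance += amount

def run(worker):
    global balance
    balance = 0
    threads = [threading.Thread(target=worker, args=(1, 100_000)) for _ in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()
    return balance

print("uncoordinated:", run(deposit_uncoordinated))  # usually less than 200000 (lost updates)
print("coordinated:  ", run(deposit_coordinated))    # always 200000
```

The uncoordinated version typically loses updates because the read and write of the shared value interleave across threads; the coordinated version is always correct but spends time waiting on the lock – which is exactly the coordination overhead that becomes harder to hide as you add more parallel workers.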

At a global level (like that of the Internet) the choice now translates into a choice between “distributed” and “scale-up” single-system processing. As it happens, back in graduate school I did a calculation of the relative performance merits of tree networks of microcomputers versus machines with a fixed number of parallel processors, which provides some general rules. There are two key factors that are relevant here: “data locality” and “number of connections used” – which means that you can get away with parallelism if, say, you can operate on a small chunk of the overall data store on each node, and if you don’t have to coordinate too many nodes at one time.

Enter the problems of cost and scalability. The server farms that grew like Topsy during Web 1.0 had hundreds or thousands of PC-like servers set up to handle transactions in parallel. This had obvious cost advantages, since PCs were far cheaper; but data locality was a problem in trying to scale, since even when data was partitioned correctly in the beginning between clusters of PCs, over time data copies and data links proliferated, requiring more and more coordination. Meanwhile, in the High Performance Computing (HPC) area, users of grids of small PC-type machines operating in parallel found that scaling required all sorts of caching and coordination “tricks”, even when, by choosing the transaction type carefully, they could minimize the need for coordination.

For certain problems, however, relational databases designed for “scale-up” systems and structured data fared even worse. For indexing and serving massive amounts of “rich-text” (text plus graphics, audio, and video) data like Facebook pages, for streaming media, and of course for HPC, a relational database would insist on careful consistency between data copies in a distributed configuration, and so could not squeeze the last ounce of parallelism out of these transaction streams. And so, to squeeze costs to a minimum, and to maximize the parallelism of these types of transactions, Google, the open source movement, and various others turned to MapReduce, Hadoop, and other non-relational approaches.

These efforts combined open-source software, typically related to Apache, large numbers of small PC-type servers, and a loosening of consistency constraints on the distributed transactions – an approach called eventual consistency. The basic idea was to minimize coordination by identifying types of transactions where it didn’t matter if some users got “old” rather than the latest data, or if some users got an answer while others didn’t. As a communication from Pervasive Software about an upcoming conference notes, a study of one implementation found 60 instances of unexpected unavailability (“interruptions”) in 500 days – certainly not up to the standards of the typical business-critical operational database, but also not an overriding concern to users.
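For readers who prefer the idea in code rather than prose, here is a minimal sketch (again in Python, with made-up replica structures – this is not any vendor’s actual replication protocol) of eventual consistency: a write is acknowledged as soon as one replica has it, the second replica catches up asynchronously, and a read in the meantime may return stale data.

```python
# Illustrative sketch of "eventual consistency" as asynchronous replication.
import time
import threading

replicas = [{"profile": "old"}, {"profile": "old"}]

def write(key, value, replication_lag=0.5):
    replicas[0][key] = value                 # acknowledged after just one replica
    def replicate():
        time.sleep(replication_lag)          # propagation delay, no global coordination
        replicas[1][key] = value             # the second replica eventually converges
    threading.Thread(target=replicate).start()

def read(key, replica_index):
    return replicas[replica_index][key]

write("profile", "new")
print(read("profile", 1))   # likely "old" -- stale, but fast and always answerable
time.sleep(1)
print(read("profile", 1))   # "new" -- the replicas have converged
```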

The eventual-consistency part of this overall effort has sometimes been called NoSQL. However, Wikipedia notes that it might more correctly be called NoREL, meaning “for situations where relational is not appropriate.” In other words, Hadoop and the like by no means exclude all relational technology, and many of their proponents concede that relational “scale-up” databases are more appropriate in some cases even within the broad overall category of Big Data (i.e., rich-text Web data and HPC data). And, indeed, some implementations provide extended-SQL or SQL-like interfaces to these non-relational databases.

Where Are the Boundaries?
The most popular “spearhead” of Big Data, right now, appears to be Hadoop. As noted, it couples a MapReduce engine with a distributed file system “veneer” for data-intensive applications: the MapReduce layer divides nodes into a master coordinator and slave task executors, the Hadoop Distributed File System (HDFS) clusters multiple machines behind a single file-system namespace, and Hadoop Common supplies the shared utilities. It therefore allows parallel scaling of transactions against rich-text data such as some social-media data. It operates by dividing a “task” into “sub-tasks” that it hands out redundantly to back-end servers, which all operate in parallel (conceptually, at least) on a common data store.
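Hadoop itself is a large Java framework, but the MapReduce pattern it parallelizes is simple enough to sketch. The following is a toy word count in Python, with multiprocessing workers standing in for the slave task executors – an illustration of the pattern, not Hadoop’s API:

```python
# Illustrative MapReduce pattern: map sub-tasks run independently on input splits,
# a shuffle groups intermediate (key, value) pairs, and reduce combines each group.
from collections import defaultdict
from multiprocessing import Pool

def map_task(document):
    # one "sub-task": emit (word, 1) for every word in its input split
    return [(word.lower(), 1) for word in document.split()]

def reduce_task(item):
    word, counts = item
    return (word, sum(counts))

if __name__ == "__main__":
    documents = ["Pay no attention to the man behind the curtain",
                 "the man behind the curtain is relational"]
    with Pool() as pool:
        # "map" phase, run in parallel across worker processes
        mapped = pool.map(map_task, documents)
        # "shuffle" phase: group intermediate pairs by key
        grouped = defaultdict(list)
        for pairs in mapped:
            for word, count in pairs:
                grouped[word].append(count)
        # "reduce" phase, also parallelizable per key
        print(dict(pool.map(reduce_task, grouped.items())))
```

The point of the pattern is that the map sub-tasks need no coordination with each other at all; the only global step is the shuffle that groups intermediate results by key – which is also, not coincidentally, where the metadata and coordination issues discussed next show up.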

As it turns out, there are also limits even on Hadoop’s eventual-consistency type of parallelism. In particular, it now appears that the metadata that supports recombination of the results of “sub-tasks” must itself be “federated” across multiple nodes, for both availability and scalability purposes. And Pervasive Software notes that its own investigations show that using multiple-core “scale-up” nodes for the sub-tasks improves performance compared to proliferating yet more distributed single-processor PC servers. In other words, the most scalable system, even in Big Data territory, is one that combines strict and eventual consistency, parallelism and concurrency, distributed and scale-up single-system architectures, and NoSQL and relational technology.

Solutions like Hadoop are effectively out there “in the cloud” and therefore outside the enterprise’s data centers. Thus, there are fixed and probably permanent physical and organizational boundaries between IT’s data stores and those serviced by Hadoop. Moreover, it should be apparent from the above that existing BI and analytics systems will not suddenly convert to Hadoop files and access mechanisms, nor will “mini-Hadoops” suddenly spring up inside the corporate firewall and create havoc with enterprise data governance. The use cases are too different.

The remaining boundaries – the ones that should matter to IT buyers – are those between existing relational BI and analytics databases and data stores on one side, and Hadoop’s file system and files on the other. And here is where “eventual consistency” really matters. The enterprise cannot treat this data as just another BI data source. It differs fundamentally in that the enterprise can be far less sure that the data is up to date – or even available at all times. So scheduled reporting or business-critical computing based on this data is much more difficult to pull off.

On the other hand, this is data that would otherwise be unavailable – and because of the low-cost approach to building the solution, should be exceptionally low-cost to access. However, pointing the raw data at existing BI tools is like pointing a fire hose at your mouth. The savvy IT organization needs to have plans in place to filter the data before it begins to access it.
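What might “filtering first” look like in practice? Here is a minimal sketch, with hypothetical field and file names, of the kind of pre-aggregation step that could sit between a raw Hadoop-style extract and the relational BI environment:

```python
# Illustrative pre-filter/pre-aggregate step (hypothetical field names).
# Only records that pass the filter are rolled up and handed to the BI side,
# rather than pointing the full raw feed at the BI tools.
import csv
from collections import Counter

def filter_and_aggregate(path, min_score=0.8):
    """Keep only high-confidence records and roll them up by region."""
    totals = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if float(row["score"]) < min_score:   # drop low-value records early
                continue
            totals[row["region"]] += 1
    return totals

# e.g. load the small, filtered summary -- not the raw feed -- into the BI database:
# summary = filter_and_aggregate("hadoop_extract.csv")
```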

The Long-Run Bottom Line

The impression given by marketers is that Hadoop and its ilk are required for Big Data, where Big Data is more broadly defined as most Web-based semi-structured and unstructured data. If that is your impression, I believe it to be untrue. Instead, handling Big Data is likely to require a careful mix of relational and non-relational, data-center and extra-enterprise BI, with relational in-enterprise BI taking the lead role. And as the limits to parallel scalability of Hadoop and the like become more evident, the use of SQL-like interfaces and relational databases within Big Data use cases will become more frequent, not less.

Therefore, I believe that Hadoop and its brand of Big Data will always remain a useful but not business-critical adjunct to an overall BI and information management strategy. Instead, users should anticipate that it will take its place alongside relational access to other types of Big Data, and that the key to IT success in Big Data BI will be in intermixing the two in the proper proportions, and with the proper security mechanisms. Hadoop, MapReduce, NoSQL, and Big Data, they’re all useful – but only if you pay attention to the relational technology behind the curtain.
