Sunday, September 23, 2012

Is Your Organization Suffering From Data Warehouse Disease?


I have a feeling that a fair number of readers – especially vendors and IT BI types – are going to be upset by what I have to say in this post. However, having viewed some of the material that has passed across my desk recently, I really think it’s time to raise the question of whether giving too much organizational power to data warehouse folks is beginning to cause significant under-performance in meeting today’s key organizational information management needs.

The immediate occasion for these reflections is that I am partway through a book on a related subject that goes into some detail on data warehousing’s view of the world:  how BI should be handled, what the organizational information architecture should be, and how we got this way.  This book will remain nameless, because in many ways it’s an excellent primer.  However, over the last 22-31 years (depending on whether you count my software development days), I have had a cross-organization, cross-vendor view of the same area, and I have to say that the book redefines history and the purposes of various things in the ideal information architecture in major ways.

Usually, I find that going over history just wastes time in a blog post – but here, it helps to see how data warehousing’s definitions of common information management terms lead its practitioners to reinterpret the purposes of the underlying products, making the information architecture – and the whole information handling process – potentially (and, probably, actually) less effective in the medium and long term. So let’s combine history with the exposition of my assertion.

A Data Warehousing View of the World

In brief, the book’s view of the information architecture seems to be as follows: Data of all types comes in to production systems, which immediately pass it on to the data warehouse for cleansing and aggregation. Behind the data warehouse is an optional operational data store for key data, and things like master data management operate in parallel with the data warehouse to provide a global view of multiple local ways to store customer data. On top of the data warehouse are key Business Intelligence applications, which include both repetitive, scheduled reporting and analytics.

Now, this view of the world seems reasonable if you were born yesterday, or if you’ve spent the last fifteen years entirely in data warehousing. However, there are, in my view, some major problems with it.

In the first place, as far as I know, only in data warehousing are the databases at the initial entry point referred to as “production systems”. For twenty years, I have been calling them “operational databases”. In fact, they were business-critical before data warehousing existed, and so were the apps on top of them – like ERP.

Why does this matter? Because it allows data warehouse folks to shift the “operational data store” behind the data warehouse. The operational data store is a later concept, and one that I (among others, I assume) wrote papers proposing around 2004 and 2005. The idea is that the data warehouse is simply too slow to react immediately to key operational data, but that operational data is scattered across multiple operational databases. An “operational data store” therefore makes sure that a subset of operational data needed for quick decision-making is either put in a central point for quick analysis in parallel with its arrival, or monitored by a central “virtual database.” Putting the operational data store behind the data warehouse defeats its entire purpose.

Likewise, the master data management system. I wrote papers on this in assessing IBM’s version of the concept in 2006 and 2007. Again, the notion was of combining operational data coming in to operational databases – in this case, by enforcing a common format that allowed cross-organization and cross-country leveraging of operational data by ERP and customer intelligence apps. By redefining master data management as existing within the data warehouse, or at the same remove from operational databases, data warehouse folks ensure that master data management moves no faster than the data warehouse.

And finally, there is the idea that (implicitly) analytics is entirely contained in BI, and hence is entirely dependent on the data warehouse. On the contrary, an increasing amount of analytics goes on outside of BI. For example, analytics is part of products that analyze computer infrastructure semi-automatically to optimize performance or detect upcoming problems. Or, it is used to analyze key computer-supported business processes. This is “intelligence” in the sense of “military intelligence” – proactively going out and finding out what’s going on – but it is not “business intelligence” in the sense of taking data that is handed to you and finding out what is going on inside and outside the business that your reporting tools are too slow or shallow to tell you. In other words, these applications of analytics are entirely outside of a reactive data warehouse.

Why It Matters

There are two places where over-emphasis on data warehousing can impede organizational BI and other information management effectiveness: the information architecture, and the organization’s “agility” in responding to new kinds of information from outside. As I’ve suggested in the previous section, a data warehousing view of the information architecture shifts operations that involve lots of “updates” and data just arrived from outside to the data warehouse or behind it. That means the data must go through the data-warehouse cleansing and aggregation process and arrive in a centralized location that is handling queries from all over the organization and is optimized for adding new data not “on the fly” but in delayed bursts. There is simply no way that is going to be as timely as performing tasks on the data as it arrives in the operational systems.
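
To make the timeliness point concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it (an hourly load schedule, a 15-minute cleansing and aggregation step, one minute of processing at the operational system) is an illustrative assumption of mine, not a measurement of any product, and the function names are hypothetical.

```python
def batch_visibility_delay(arrival_min: int, batch_interval_min: int,
                           load_duration_min: int) -> int:
    """Minutes from a record's arrival until it is queryable in a
    warehouse that loads in scheduled bursts (illustrative model only)."""
    # Ceiling division: the first scheduled load at or after the arrival time.
    batches_elapsed = -(-arrival_min // batch_interval_min)
    next_batch_start = batches_elapsed * batch_interval_min
    # Wait for the load to start, then for cleansing/aggregation to finish.
    return (next_batch_start - arrival_min) + load_duration_min


def on_arrival_delay(processing_min: int) -> int:
    """Minutes until a record is usable when it is handled where it lands."""
    return processing_min


# Assumed figures, chosen only for illustration.
BATCH_INTERVAL_MIN = 60      # warehouse loads once an hour
LOAD_DURATION_MIN = 15       # cleansing and aggregation per load
ON_ARRIVAL_PROCESSING_MIN = 1

arrivals = [5, 50, 61, 119]  # minutes past midnight at which records arrive

batch_delays = [
    batch_visibility_delay(t, BATCH_INTERVAL_MIN, LOAD_DURATION_MIN)
    for t in arrivals
]
arrival_delays = [on_arrival_delay(ON_ARRIVAL_PROCESSING_MIN) for _ in arrivals]

for t, b, a in zip(arrivals, batch_delays, arrival_delays):
    print(f"arrived at minute {t}: batch path {b} min, on-arrival path {a} min")
```

Under these assumptions the batch path can never beat its own load duration, however lucky the arrival time, while the on-arrival path is bounded only by its processing cost. Real warehouses can shorten the load interval, but the structure of the delay stays the same.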

Just as troubling, the entire emphasis of the organization is now more reactive and focused farther away from the organization’s “antennae” to the outside environment. The IT organization appears to be focused on responding to new demands from business for timelier data, not actively seeking the latest new information and merging it back into existing systems. The IT organization appears to emphasize cleaning up the data and merging it and only then analyzing it at an internal “choke point”, rather than handling the information faster where it arrives. 

If you think these concerns are theoretical, think about the case of social-media Big Data. Yes, Oracle as a major vendor is emphasizing inhaling huge amounts of this data from multiple clouds into the data warehouse and then analyzing it – when the whole purpose of the NoSQL movement is to allow rapid in-cloud analysis of inconsistent, uncleansed data – but it would not do so unless there were some organizational push to avoid analytics outside the data warehouse. I conclude that there is some strong evidence that a data warehousing focus is impeding the organization’s ability to process key information and feed it to business decision makers in as timely a fashion as possible.

Moreover, there is some sense that this is not an organizational quirk but a tendency so embedded in the IT organization that this impediment is a symptom not of a temporary problem that is easy to fix, but rather of an organizational “disease.” In other words, simply directing the organization to pay more attention to doing social-media processing in the cloud will probably not work.

Action Strategies and Conclusion

First (although I think there is little danger of this) I must caution against throwing the baby out with the bathwater. There are very good reasons to have a data warehouse performing the core functions of querying for BI. I have, in the past, conjectured that if I were to design a new information architecture today, I might not create a data warehouse or data mart at all – instead, I might impose “data virtualization” and master data management tools over existing operational databases. However, practically speaking, in most if not all cases, the sheer experience behind today’s data warehousing products makes them far preferable for core functions.

Rather, I would suggest that data warehousing be placed under, and be responsive to rather than dominant over, an information architecture and information strategy function aimed more at the edge of the organization than its central data center. This is not a matter of making the organization more responsive to the business; it is a matter of making the IT organization more agile (by my definition, which stresses the utility of proactive and outside-the-organization-directed agility).

Until I saw this book, which suggested that data warehouse folks had gone too far in asserting “IT information handling is all about the data warehouse”, I was not too concerned about data warehousing folks; I would get into annoying arguments with folks who thought I just didn’t “get” data warehousing, but it seemed to me that the benefits of a powerful database-related IT function outweighed the negatives of data warehouse folks’ “not invented here” blind spots.  Now, I am rethinking my position.  If the result of this type of rewriting of history is an increasingly sub-optimal information architecture, then such a “disease” is not so harmless after all.

Does your organization suffer from data warehouse disease?  If so, what do you think should be done about it?
