By Uday Singh, Craig Kane and Anshuman Jaiswal
Are enterprise data warehouses (EDWs), systems that have been used for decades for reporting, analysis and dashboards but that have some key limitations, going to be surpassed by Hadoop, an open-source framework built to process large data sets at seemingly limitless scale? Or will EDWs remain the preferred business intelligence (BI) foundation for today’s information-hungry organizations?
We believe the answer lies somewhere in the middle. EDWs—and the batch extraction, transformation and loading (ETL) processes commonly used with them—can work together with Hadoop as parts of a unified strategy that offers a clear road map for core BI and analytics systems.
Successful companies have created and implemented clear processes to generate democratized data access, richer analytics and real business value from data projects. They have armed their business units with access to the power of information as never before.
The debate is understandable. An EDW is the heart of business intelligence systems, and it is the most frequently used system for well-governed, structured data, such as that used in financial reporting.
However, EDWs are expensive storage structures and come with capacity limitations. Combining high-volume data from various sources is difficult for several reasons, including limited access to historical data, slow batch processing, complexity and unavailability.
With the exponential growth in both the quantity and types of data (structured and unstructured) available, it is clear that the EDW is being stretched beyond its original purpose, which was to provide answers from well-structured data to recurring reporting and analytical needs. It was never meant to hold huge data sets or to support unstructured data analysis.
The ETL process has been the typical way to feed data into EDWs, but the huge volumes of often unstructured data have pushed ETL to its limit. Some business leaders, frustrated in their desire for a 360-degree, real-time view of their business, now want to reduce their reliance on EDWs.
The old systems aren’t the only problem: Every new technology comes with its own set of challenges. Hadoop, an open-source framework designed to process large data sets, has become popular in recent years as a tool that can process very large sets of both structured and unstructured data. (Human genome analysis is a good example.)
But some business leaders are understandably wary of Hadoop and have opted for the comfort of EDWs and their more manageable structure. Hadoop is like a data ocean: mining the right data, planning for data discovery, communicating the value of Hadoop efforts to business partners, and allaying risk and compliance concerns all remain major challenges to adopting this technology.
A Pragmatic Approach to the EDW-Hadoop Debate
There is a balanced solution to choosing between EDWs and Hadoop, but it requires a pragmatic and flexible approach.
Create a road map for future EDW-based reporting and analytics.
While their growth lags behind that of Hadoop storage, EDWs will continue to be, for the foreseeable future, the main sources of well-modeled, known and structured financial reporting and audit-intensive data.
Build Hadoop to provide for several needs.
As Hadoop grows, its range of capabilities increases correspondingly. Hadoop should be built to provide storage and staging (giving access to vast volumes of data, both unstructured and structured) as well as archiving (where its low storage costs make it a good fit).
Create a Hadoop road map.
Determine what capabilities are needed on Hadoop, both for experimentation and for the analytics use cases with the highest business value.
Optimize and simplify the ETL batch process.
This includes removing and consolidating batch jobs, resequencing scripts, bringing in only the data that is needed, and performing better quality checks to support the capabilities outlined in the EDW road map.
Build non-ETL capabilities.
These include real-time data integration; extraction, loading and transformation (ELT), in which data is transformed after it is loaded rather than before; and change data capture.
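To make change data capture concrete, here is a minimal Python sketch of one simple variant: comparing two snapshots of a table and emitting only the rows that were inserted, updated or deleted. The table contents, the function name capture_changes and the snapshot-comparison approach are all assumptions chosen for illustration; production tools often read the database transaction log instead.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Change = Tuple[str, Any, Row]  # (operation, primary key, row contents)


def capture_changes(previous: Dict[Any, Row], current: Dict[Any, Row]) -> List[Change]:
    """Compare two snapshots keyed by primary key and return row-level changes."""
    changes: List[Change] = []
    # Rows present now: either new, modified, or untouched.
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif row != previous[key]:
            changes.append(("update", key, row))
    # Rows that existed before but are gone now.
    for key, row in previous.items():
        if key not in current:
            changes.append(("delete", key, row))
    return changes


# Hypothetical account table before and after a processing window.
before = {1: {"balance": 100}, 2: {"balance": 250}, 3: {"balance": 75}}
after = {1: {"balance": 100}, 2: {"balance": 300}, 4: {"balance": 10}}

changes = capture_changes(before, after)
for operation, key, row in changes:
    print(operation, key, row)
```

Only the three changed rows move downstream; the unchanged row for key 1 is skipped, which is the point of the technique compared with reloading the full table in every batch.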
Develop data virtualization solutions.
Data virtualization can help combine data from EDWs, Hadoop and native data sources for broader access, without disturbing the structure and growth of the EDW and Hadoop systems.
While it’s not easy to develop a unified data strategy with a clear road map, it is possible. And it can help ensure that you are setting the stage for a data-driven, forward-looking organization.
Uday Singh is a partner in management consulting firm A.T. Kearney’s Financial Institutions practice and is based in New York. He can be reached at [email protected].
Craig Kane is a partner in A.T. Kearney’s Strategic IT Practice and is based in Dallas. He can be reached at [email protected].
Anshuman Jaiswal is a manager in A.T. Kearney’s Strategic IT Practice and is based in Atlanta. He can be reached at [email protected].