Optimizing an Enterprise Data Warehouse on Hadoop
As companies move from analytic data marts and data warehouses built on Teradata, Vertica, or even Oracle/MySQL to a Hadoop-based architecture, consumption of data for BI and analytics workloads becomes critical. Hadoop has traditionally not been geared toward data consumption, as users of Tableau know very well: Hive queries are slow.
Products like Impala and Presto have eased the pain somewhat, but building a high-performance enterprise data warehouse at low cost remains a major challenge.
A fast BI stack should give business users consuming data from Hadoop a smooth experience in terms of response times. The engine should also be easy to deploy and cost-effective.
Data architects looking to build a solution should take the following elements into account in order to future-proof it.
1. Separate consumption from ETL.
Hadoop MapReduce and Hive are great for ETL workloads: batch jobs that run overnight to produce one or more clean datasets geared for analytics consumption. However, SQL engines on Hadoop are not suitable for BI because they were not designed for interactive workloads. Even Spark, with its in-memory architecture, is not enough for an enterprise data warehouse's BI needs.
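As a toy illustration of that separation (all names and data here are hypothetical), the sketch below has a batch "ETL" step produce a clean dataset, while the interactive query path only ever touches the cleaned output, never the raw data or the batch engine:

```python
# Hypothetical sketch of the ETL/consumption split. In practice the
# batch step would be a Hive/MapReduce job and the serving table would
# live in a fast store; here both are plain Python for illustration.

raw_events = [
    {"user": "a", "amount": "12.5", "ok": "true"},
    {"user": "b", "amount": "bad",  "ok": "true"},   # malformed record
    {"user": "a", "amount": "7.5",  "ok": "false"},  # filtered out
]

def nightly_etl(events):
    """Batch step: clean and conform raw records for consumption."""
    clean = []
    for e in events:
        try:
            if e["ok"] == "true":
                clean.append({"user": e["user"], "amount": float(e["amount"])})
        except ValueError:
            continue  # drop rows that fail conversion
    return clean

# Serving path: interactive queries hit only the cleaned dataset.
serving_table = nightly_etl(raw_events)

def spend_for(user):
    return sum(r["amount"] for r in serving_table if r["user"] == user)
```

The point of the split is that the expensive, failure-tolerant cleanup runs once per batch window, while the consumption path stays small and predictable.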
2. In-memory is not enough.
Leaving aside the challenge of fitting all data in memory, a purely in-memory solution still has to do a full scan of the data for every query. The scans are faster, but for interactive slice-and-dice BI use cases they can become expensive and slow. Join performance on star schemas can be a challenge as well.
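A minimal sketch of why scans hurt, using illustrative data: a scan-based engine touches every row on each query, even in memory, while an aggregate built once at load time answers the same slice with a constant-time lookup:

```python
import math
import random

# Illustrative data; the schema (region, revenue) is hypothetical.
random.seed(0)
rows = [(random.choice(["EU", "US", "APAC"]), random.random())
        for _ in range(100_000)]

def revenue_by_scan(region):
    # Full scan: per-query cost grows with table size.
    return sum(amount for r, amount in rows if r == region)

# Build the aggregate once, at load time.
rollup = {}
for r, amount in rows:
    rollup[r] = rollup.get(r, 0.0) + amount

def revenue_by_rollup(region):
    # Lookup: per-query cost is independent of table size.
    return rollup.get(region, 0.0)

# Same answer either way; only the per-query work differs.
assert math.isclose(revenue_by_scan("EU"), revenue_by_rollup("EU"),
                    rel_tol=1e-9)
```

Real engines mix both strategies (and add indexes, cubes, or materialized views), but the trade-off is the same: pay once at load time, or pay on every interactive query.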
3. Fit in well with your user workloads.
The optimization stack should be able to handle various consumption needs, from reporting, ad-hoc slice and dice, and periodic filtered volume extracts to machine learning and AI. The engine should also work well with traditional BI tools like Tableau.
4. Easy to deploy.
One of the main challenges of Hadoop-based products is that they are hard to deploy and manage, requiring an army of extract engineers whose sole job is to maintain summary tables for reporting. This is a fundamentally broken model. Data should be maintained at the lowest grain possible, and aggregations and reporting should be done on the fly where possible, while still satisfying SLAs of sub-second response times.
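The lowest-grain approach can be sketched as follows (schema and data are hypothetical): one grain-level fact table answers any report shape on the fly, instead of a hand-maintained summary table per report:

```python
from collections import defaultdict

# Facts kept at the lowest grain: one row per sale.
sales = [
    # (date, region, product, units, revenue)
    ("2024-01-01", "EU", "widget", 3, 30.0),
    ("2024-01-01", "US", "widget", 5, 50.0),
    ("2024-01-02", "EU", "gadget", 2, 80.0),
    ("2024-01-02", "EU", "widget", 1, 10.0),
]

def aggregate(facts, dims):
    """Roll grain-level facts up by any combination of dimensions."""
    names = ("date", "region", "product")
    out = defaultdict(lambda: [0, 0.0])  # [units, revenue]
    for row in facts:
        key = tuple(row[names.index(d)] for d in dims)
        out[key][0] += row[3]
        out[key][1] += row[4]
    return dict(out)

# The same base table serves different reports on the fly.
by_region = aggregate(sales, ("region",))
by_day_product = aggregate(sales, ("date", "product"))
```

Because every report derives from the same grain-level table, there is no summary table to keep in sync when a new report shape is requested; the engine's job is to make this on-the-fly rollup fast enough to meet the SLA.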
5. Low cost.
The cost of professional services should be factored in when deploying a solution. Open source is not free, and can be more expensive when delivered under a consulting business model. The solution should eliminate much of the existing manual labor and reduce cluster costs. Customers should not have to deploy a proprietary stack to get high performance.
An enterprise fast-BI product is not designed for a single user. A typical enterprise has thousands of users consuming data across various workloads (see #3). The product should be able to handle thousands of concurrent users and satisfy user SLAs at low cost.