5 ways to rethink your data access strategy
Leveraging data is increasingly critical to business success. However, much of that data is locked in slow data marts, legacy OLAP cubes, and inaccessible Hadoop clusters. As a result, business users are stuck in second gear in an old car, unable to move at the speed needed to execute effectively.
Dashboards, AI, and cool visualizations may dominate the conversation around analytics and BI, but the real differentiator is the unsexy engine that connects enterprise data to end users.
The biggest frustration for businesses is the inability to access data fast and interactively, exploring the contours of the detailed data. A smart visualization product is important, but as data volumes grow, visualization becomes ineffective if every drag-and-drop analysis takes a minute to return.
Chief data officers and business intelligence leaders should treat the data access platform as a key investment theme and solidify this layer to support not only BI but also machine learning and modeling use cases.
In our work with customers and prospects, we have found some common patterns that can help you rethink your data strategy and lay the foundation for a long-term analytics platform.
Separate data storage, compute, and visualization. Phase out tools that combine compute and storage. Scale-out platforms are needed as your data volumes and user access needs grow, and you don't want to provision for the worst-case compute scenario.
Avoid pre-aggregating data. Modern big data use cases require analysis at the lowest grain of data. Pre-aggregating imposes a huge expense in ETL work and rework, and it forces business users into second gear because any change involves weeks of effort.
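A minimal sketch of why the lowest grain matters, using SQLite as a stand-in for any SQL engine (the table and values are hypothetical): when transaction-level rows are kept, any rollup, by region, by month, or by anything else, is just a query, with no new ETL cycle.

```python
import sqlite3

# Hypothetical sales data stored at the lowest grain: one row per transaction.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_date TEXT, region TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("2023-01-05", "EMEA", "widget", 120.0),
        ("2023-01-06", "EMEA", "gadget", 80.0),
        ("2023-01-06", "APAC", "widget", 200.0),
        ("2023-02-01", "APAC", "gadget", 50.0),
    ],
)

# Any rollup can be derived at query time -- by region...
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# ...or by month, with no pre-built aggregate table and no ETL rework.
by_month = conn.execute(
    "SELECT substr(sale_date, 1, 7) AS month, SUM(amount) "
    "FROM sales GROUP BY month ORDER BY month"
).fetchall()

print(by_region)  # [('APAC', 250.0), ('EMEA', 200.0)]
print(by_month)   # [('2023-01', 400.0), ('2023-02', 50.0)]
```

Had the data been pre-aggregated by region, the monthly view would have required a new ETL job rather than a one-line query.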
Separate ETL workloads from analytics/BI workloads. Hadoop is great for batch processing, and there is a tendency to assume that what is good for nightly batch jobs is good for interactive analysis. This is the single biggest mistake we see in deployments. Hadoop clusters are typically configured for batch workloads, and trying to make interactive queries work in an environment of constant ETL jobs is a non-starter.
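One common way to enforce this separation on a shared Hadoop cluster is YARN's Capacity Scheduler, which fences off resources per workload. The sketch below is illustrative only: the queue names (`etl`, `bi`) and percentages are assumptions you would tune for your own cluster.

```xml
<!-- capacity-scheduler.xml (fragment): separate queues so batch ETL
     cannot starve interactive BI queries. Names and values are examples. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>etl,bi</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.etl.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.bi.capacity</name>
  <value>30</value>
</property>
<property>
  <!-- Let interactive queries burst beyond their share when ETL is idle. -->
  <name>yarn.scheduler.capacity.root.bi.maximum-capacity</name>
  <value>60</value>
</property>
```

Even with queues, a dedicated cluster or serving layer for interactive analysis is often the cleaner answer; queues only soften the contention.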
Pick a single data access layer that all your visualization tools can connect to, whether the workloads are SQL or non-SQL (R/Python/Scala). Having a single layer that can serve multiple use cases helps maintain consistency of logical data models, in addition to standardizing access patterns.
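The consistency point can be sketched with a shared view as the single logical model, again using SQLite as a stand-in (the `net_orders` model and its rule are hypothetical): a BI tool's SQL query and a Python workload both read the same definition, so the metric cannot drift between tools.

```python
import sqlite3

# One shared logical model served to every consumer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, cancelled INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("acme", 100.0, 0), ("acme", 40.0, 1), ("globex", 75.0, 0)],
)

# The business rule lives in one place: net revenue excludes cancelled orders.
conn.execute(
    "CREATE VIEW net_orders AS SELECT customer, amount FROM orders WHERE cancelled = 0"
)

# "BI tool" path: plain SQL against the shared view.
bi_total = conn.execute("SELECT SUM(amount) FROM net_orders").fetchone()[0]

# "Data science" path: pull rows into Python from the same view.
rows = conn.execute("SELECT amount FROM net_orders").fetchall()
ds_total = sum(amount for (amount,) in rows)

# Both consumers agree because they share one logical model.
print(bi_total, ds_total)  # 175.0 175.0
```

If each tool instead embedded its own copy of the "exclude cancelled orders" rule, the numbers would diverge the first time one copy was updated and another was not.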
In-memory alone is not the answer for fast BI. Most EDWs have star-schema joins, slice-and-dice analysis, drill-through, and count-distinct queries. Advanced OLAP-style analysis, such as year-over-year comparisons on calculated fields and hierarchy navigation, requires more than just caching data in memory.
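To make the query shapes concrete, here is a tiny star-schema example in SQLite (table and column names are illustrative): a fact table joined to a dimension with a count-distinct aggregate. The point is that this needs a query engine with join and aggregation machinery, not merely data parked in RAM.

```python
import sqlite3

# Illustrative star schema: fact_sales joined to dim_store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales (store_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO dim_store VALUES (1, 'East'), (2, 'West');
    INSERT INTO fact_sales VALUES
        (1, 101, 20.0), (1, 101, 35.0), (1, 102, 10.0),
        (2, 103, 50.0), (2, 103, 5.0);
""")

# Distinct customers per region: a join plus COUNT(DISTINCT ...),
# a classic EDW query shape that a plain in-memory cache cannot answer
# without an engine to execute it.
result = conn.execute("""
    SELECT d.region, COUNT(DISTINCT f.customer_id) AS customers
    FROM fact_sales f
    JOIN dim_store d ON f.store_id = d.store_id
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()

print(result)  # [('East', 2), ('West', 1)]
```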
A strong data access layer is critical for business users to leverage the power of modern distributed computing systems like Spark. Large enterprises have invested in multiple tools such as Tableau, Qlik, Spotfire, and OBIEE, and each has its own in-memory/extract layer. Users ultimately want to access enterprise data live, not through extracts or disparate in-memory caches. A common data access layer lets disparate visualization tools connect to live data sets in Hadoop, eliminating costly extracts and democratizing data access within the enterprise.