Making data useful and ubiquitous
Data warehouses have evolved over the years. With Hadoop reaching a level of maturity and Spark serving as a powerful engine for a variety of workloads, we are now at a point where we can truly democratize the consumption of data to drive insights.
Savvy data-driven companies combine the power of automated data analysis with human insight. For everyone in an organization to make use of the data, data at the lowest grain has to be accessible for analysis.
Analysts and data scientists use their favorite tools, such as Tableau, Qlik and R, to slice and dice the data, find patterns through machine learning, or simply run daily reports and present them in dashboards. Executives need summarized data, but to get to the summary, one has to start at the detail.
With the volume of data exploding, traditional methods such as building summary tables and cubes no longer work. They are inefficient, they do not serve the needs of a variety of workloads, and they are expensive and error-prone.
Moreover, consuming the data remains a huge issue even after summary tables and cubes have been built. As data gets larger, slicing out a small subset for analysis takes time, and queries run slowly. A user on Tableau cannot be expected to wait a minute for each query to come back.
Thus, acceleration becomes the first immediate need. In-memory and columnar technologies have been a good first step, but they are still not enough.
Secondly, the platform has to adapt to various workloads. Some queries ask for all data over a time period (reporting). Some explore the data (drill-downs on geography or other hierarchies, drill-through, rollups and summarizations on various attributes). And some need iterative access to various slices of data (get all transactions between two dates for a specific category of products, run a model, and repeat for other categories).
The platform has to digest the patterns from various workloads and apply “self-learning” to create the necessary optimizations to organize the data based on demand. The response to a query should be based on location of data, type of request, cardinality, aggregation, types of filters and many such factors.
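As a rough illustration of learning from workload patterns, the sketch below counts how often each combination of grouping attributes appears in a query log and suggests the frequently requested combinations as candidates for optimization. It is a minimal sketch, not a description of how any particular platform works: the log format, the function name recommend_rollups and the min_hits threshold are all hypothetical.

```python
from collections import Counter

def recommend_rollups(query_log, min_hits=2):
    """Return the grouping-attribute sets seen at least min_hits times,
    most frequent first. Attribute order within a query is ignored."""
    counts = Counter(frozenset(q["group_by"]) for q in query_log)
    return [sorted(dims) for dims, hits in counts.most_common() if hits >= min_hits]

# Hypothetical query log: each entry records the attributes a query grouped by.
query_log = [
    {"group_by": ["month", "region"]},
    {"group_by": ["region", "month"]},
    {"group_by": ["day", "category"]},
    {"group_by": ["month", "region"]},
]

print(recommend_rollups(query_log))
```

A real adaptive layer would weigh many more signals, such as the filter types, cardinality and data location mentioned above. Frequency of the grouping attributes is used here only because it is the simplest signal to show.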
A typical data lake contains the lowest-grain data, probably organized as a star schema, as a starting point for analytics. However, users consuming the data need various slices and rollups at different points in time. Finance users may need monthly or quarterly rollups of some dimensions. Sales analysis requires daily or weekly analysis on other dimensions. Data scientists need access to the lowest-grain data. Most companies resort to manually building summary tables and static extracts to solve these problems. However, these cannot adapt quickly to new business needs, and they are expensive. The platform should be able to amplify the lowest-grain data, without manual duplication and summarization, in an automated way: deriving snapshots of the aggregated data and adapting to queries by learning through the adaptive layer described above.
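To make the idea of deriving rollups on demand concrete, here is a minimal sketch in which aggregates at different time grains are computed from the lowest-grain rows the first time they are requested and then reused. The sample transactions, the GRAINS table and the rollup function are hypothetical and exist only for illustration. A production system would do this over a distributed store rather than an in-memory list.

```python
from collections import defaultdict
from datetime import date

# Hypothetical lowest-grain facts: (transaction date, product category, amount).
transactions = [
    (date(2016, 1, 5), "books", 20.0),
    (date(2016, 1, 20), "books", 30.0),
    (date(2016, 2, 3), "toys", 15.0),
    (date(2016, 2, 3), "books", 5.0),
]

# Each grain maps a date to the label of the time bucket it falls in.
GRAINS = {
    "day": lambda d: d.isoformat(),
    "month": lambda d: d.strftime("%Y-%m"),
    "quarter": lambda d: f"{d.year}-Q{(d.month - 1) // 3 + 1}",
}

_cache = {}  # rollups already derived, keyed by (grain, by_category)

def rollup(grain, by_category=True):
    """Sum amounts per time bucket (and optionally per category),
    deriving the aggregate from the detail rows only on first request."""
    key = (grain, by_category)
    if key not in _cache:
        totals = defaultdict(float)
        for day, category, amount in transactions:
            bucket = GRAINS[grain](day)
            totals[(bucket, category) if by_category else (bucket,)] += amount
        _cache[key] = dict(totals)
    return _cache[key]

print(rollup("month"))                       # monthly totals per category
print(rollup("quarter", by_category=False))  # quarterly totals overall
```

The detail rows stay the single source of truth, so a new business question only needs a new grain or grouping rather than a new hand-built summary table.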
This is the future of a modern analytics platform and is what we are working on at Sparkline Data through our SNAP platform.