Multi-Dimensional analytics at scale
In a typical enterprise there are broadly two kinds of BI projects.
1. Focus on factual reporting and analysis
– These projects involve implementing Hadoop or another Big Data stack to organize and manage an enterprise data warehouse, then running SQL queries for reporting or connecting a tool such as Tableau for slice-and-dice analysis.
– Some level of ad hoc querying using single-node BI tools.
2. Focus on data exploration
– These projects involve a combination of data drills and data science: exploring data for anomalies, patterns and general behavioral analysis of the business entities.
SNAP combines #1 and #2 in one platform because it is fast and flexible and allows companies to build a metamodel to represent entities of importance to a data scientist as well as a business analyst.
Let us look at use cases involving data exploration for #2.
A typical starting point of analysis is a set of facts (events) and related dimensions, for example sales and customers. We want to look at how sales have changed across multiple dimension combinations (product/channel/promotion type), grains (quarter/month/day) and locations (state/zip code), and find important jumps or deviations.
The grain of facts refers to the lowest level at which events are recorded. For example, the sale of a product is recorded with a sale time, which is a timestamp. Sales can then be aggregated at multiple levels, such as revenue by channel and product, or revenue by category, product, store and state.
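As a minimal sketch of the idea of grain, the snippet below rolls up a handful of timestamp-level sale records to several coarser levels. The rows, field names and the `rollup` helper are hypothetical and exist only for illustration; they are not part of SNAP or Tableau.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical fact rows at the lowest grain: one row per sale, with a timestamp.
sales = [
    {"sale_time": datetime(2023, 1, 5, 10, 30), "product": "P1", "channel": "web", "state": "CA", "revenue": 100.0},
    {"sale_time": datetime(2023, 1, 5, 14, 0), "product": "P1", "channel": "store", "state": "CA", "revenue": 50.0},
    {"sale_time": datetime(2023, 1, 20, 9, 15), "product": "P2", "channel": "web", "state": "NY", "revenue": 70.0},
    {"sale_time": datetime(2023, 2, 2, 16, 45), "product": "P1", "channel": "web", "state": "NY", "revenue": 30.0},
]


def rollup(rows, key_fn):
    """Sum revenue over the groups defined by key_fn(row)."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_fn(row)] += row["revenue"]
    return dict(totals)


# The same facts aggregated at three different levels.
by_channel_product = rollup(sales, lambda r: (r["channel"], r["product"]))
by_month = rollup(sales, lambda r: r["sale_time"].strftime("%Y-%m"))
by_day_state = rollup(sales, lambda r: (r["sale_time"].date().isoformat(), r["state"]))

print(by_channel_product)
print(by_month)
print(by_day_state)
```

Because the rows are kept at the timestamp grain, any of these roll-ups can be computed on demand. Had only `by_month` been stored, the channel, product and daily views could not be recovered from it.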
Facts, dimensions and grains are important to understand because a metric does not behave the same way across dimensions and granularities.
For example, trends and anomalies in the monthly revenue of a company will most likely not look the same as those in the monthly revenue of an individual product by channel, or in the daily revenue of individual products. Thus, the ability to explore data behavior at multiple granularities and dimensions is important for realizing high ROI on business intelligence projects.
In the past, OLAP cubes were used to derive value from multi-dimensional data. However, cubing involves pre-aggregation, which reduces the data: detail below the chosen level is lost. The right granularity for aggregation is usually hard to anticipate in advance. Thus pre-aggregated cubes and materialized summary tables break down when it comes to Big Data analytics and machine learning.
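One way to see why the right aggregation is hard to pick in advance is to count the candidate groupings. The sketch below assumes each dimension has a small hierarchy of levels and that any grouping either uses one level of a dimension or aggregates that dimension away. The dimension names and level counts are hypothetical, loosely following the sales example above.

```python
from math import prod

# Hypothetical number of hierarchy levels per dimension, e.g.
# time: quarter/month/day, location: state/zip code, product: category/product.
levels = {"time": 3, "location": 2, "product": 2, "channel": 1, "promotion": 1}


def count_groupings(levels_per_dim):
    """Count distinct group-by choices.

    Each dimension contributes one of its levels, or is left out entirely
    (aggregated away), hence the factor (levels + 1) per dimension.
    """
    return prod(n + 1 for n in levels_per_dim.values())


print(count_groupings(levels))  # candidate groupings for a single metric
```

Every added dimension or hierarchy level multiplies this count, and the total is per metric, so pre-building a summary for each grouping grows quickly.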
While dashboards and reporting-style queries may not need a rich multi-dimensional framework (SQL on Hadoop or Druid/Elasticsearch, for example, provide simple reporting of facts by dimensions), deriving insights requires data exploration across regions of a cube, looking for anomalies.
For example, simple reporting on pre-aggregated dimensions may show an increase in revenue from month to month for some months and no change for others, but deeper analysis may reveal that a large increase in revenue in one subcategory was offset by a decrease in other categories. Pre-aggregated monthly revenue will hide such useful pieces of information. It is impractical to build cubes for all combinations of all dimensions and metrics using traditional methods to account for such deep dives.
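The offsetting effect described above can be illustrated with a toy calculation. All figures and subcategory names below are made up for the sketch and do not come from any real dashboard.

```python
# Hypothetical monthly revenue by subcategory within one category group.
revenue = {
    "2023-09": {"sub_a": 100.0, "sub_b": 100.0},
    "2023-10": {"sub_a": 140.0, "sub_b": 60.0},
}


def pct_change(prev, curr):
    """Percentage change from prev to curr."""
    return (curr - prev) * 100.0 / prev


prev, curr = revenue["2023-09"], revenue["2023-10"]

# Pre-aggregated view: total revenue of the category group per month.
group_change = pct_change(sum(prev.values()), sum(curr.values()))

# Detailed view: the same months, one level down.
sub_changes = {name: pct_change(prev[name], curr[name]) for name in prev}

print(f"group: {group_change:+.1f}%")
for name, change in sub_changes.items():
    print(f"{name}: {change:+.1f}%")
```

In this made-up case the group total is unchanged from one month to the next, while one subcategory grew by 40% and the other shrank by 40%. A summary table holding only the group total cannot show either movement.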
Let us walk through an example with Tableau and SNAP.
Below is a slice of a sales dashboard snapshot showing the increase in ticket sales by month and category group.
We can see that while September showed strong growth, October was flat and December was declining. Such summary-level data may be hiding the underlying behaviors of the components that make up the category group, or behaviors over other time periods (for example, monthly sales will show different behavior from daily sales).
Exploding the category group, we see the following. What seemed to be flat in October for ‘shows’ is actually interesting if we look at the detail: ‘Opera’ showed strong growth while ‘Plays’ declined quite a bit. These, in turn, may be hiding behaviors that need further drill-downs.
If executives just had a category-group-level dashboard hard-coded to some time periods because the BI tool could not handle more data, they would miss the real action!
In real-life data with multiple dimensions of analysis (product, channel, promotion types, customer segments …), a rich multi-dimensional framework is key to realizing business insights and ROI. Automatically finding such behavioral anomalies and changes in metrics is impossible without proper modeling. Furthermore, exploring anomalies across several dimensions and metrics requires a very fast compute engine that understands both SQL and statistics and can seamlessly blend one with the other at blinding speeds.
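To make the idea of automated exploration concrete, here is a brute-force sketch that scans every combination of dimensions and reports the groups whose revenue changed sharply between two periods. It is only an illustration under stated assumptions: the rows, the `scan_changes` helper and the fixed percentage threshold are hypothetical, the threshold stands in for a proper statistical test, and nothing here describes how SNAP itself works.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical revenue rows for two periods, by product and channel.
rows = [
    {"period": "prev", "product": "P1", "channel": "web", "revenue": 100.0},
    {"period": "prev", "product": "P1", "channel": "store", "revenue": 100.0},
    {"period": "prev", "product": "P2", "channel": "web", "revenue": 100.0},
    {"period": "prev", "product": "P2", "channel": "store", "revenue": 100.0},
    {"period": "curr", "product": "P1", "channel": "web", "revenue": 160.0},
    {"period": "curr", "product": "P1", "channel": "store", "revenue": 95.0},
    {"period": "curr", "product": "P2", "channel": "web", "revenue": 50.0},
    {"period": "curr", "product": "P2", "channel": "store", "revenue": 100.0},
]
DIMENSIONS = ("product", "channel")


def scan_changes(rows, dimensions, threshold_pct):
    """Return (dims, group key, pct change) for groups that moved by at least threshold_pct."""
    findings = []
    # Size 0 is the grand total; larger sizes drill into more dimensions.
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(lambda: {"prev": 0.0, "curr": 0.0})
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key][row["period"]] += row["revenue"]
            for key, t in totals.items():
                if t["prev"] == 0:
                    continue  # no baseline to compare against
                change = (t["curr"] - t["prev"]) * 100.0 / t["prev"]
                if abs(change) >= threshold_pct:
                    findings.append((dims, key, change))
    # Largest movements first.
    return sorted(findings, key=lambda f: abs(f[2]), reverse=True)


findings = scan_changes(rows, DIMENSIONS, threshold_pct=20.0)
for dims, key, change in findings:
    print(dims, key, f"{change:+.1f}%")
```

With this toy data the grand total moves by just over 1% and is not flagged, while several product and product/channel groups are. The scan recomputes an aggregate for every combination of dimensions, which is why this kind of exploration needs detailed data and a fast engine once the dimensions and metrics multiply.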
SNAP Qubes and the SNAP platform provide a natural framework to systematically explore patterns at different granularities and dimensions at scale and speed.