Enterprise B.I. Platform at Scale

More and more enterprise data is moving to data lakes, which may live on an on-premise scale-out cluster but increasingly sit in a cloud object store. Enterprises are leveraging these datasets for a variety of analyses, from Operational Analytics to Reporting to Business Intelligence and Data Science, and everything in between. There is a plethora of scale-out analytic platform vendors ready to help customers develop these solutions.

Each tool or platform has its own technical capabilities, purpose-built for a particular kind of analytic solution: for example, the platform needs of a Fraud Detection solution are not the same as those of an Operational Analytics solution. On the other hand, there is enough commonality in platform needs that it is tempting to take a "use my favorite platform" approach to any problem domain. Experience has shown us that it is easy to get started down such a path, but it usually leads to a lot of disappointment.

Given this backdrop, we start by making the case for how a B.I. solution differs from other kinds of analytic solutions and what its unique needs are. This leads us to the why and how of our SNAP offering.

From our conversations with customers, and from working on a variety of analytics use cases over the past decade, we see the following buckets.

Operational reporting: Time-series trends, streamed events, and a real-time need to report on “what happened”. Examples: active user count, or clicks and impressions over the last 7 days.

Batch/SQL reporting: Charting and dashboarding on various business metrics, for example a Sales dashboard. Queries are simple projections, filters, and scans.

Enterprise B.I. OLAP: Advanced insights into historical data to find patterns. Traditionally this was done using OLAP cubes with pre-aggregation, designed to answer questions involving business hierarchies, allocation of budgets across multiple levels of a hierarchy, Campaign Attribution, Forecasting and Planning with scenario analysis, and multi-level, multi-dimensional slice-and-dice analysis.

Data science: Taking a sample of historical data and projecting trends or finding patterns based on feature modeling.

Let us take a look at each and see where they fit in and why.

Operational Reporting

Companies like Splunk and products like Druid and Elasticsearch excel at this kind of reporting. They are easy to set up, track time-series events, and do simple pivots on a limited set of dimensions and metrics.

Their architecture and data model capabilities are driven by events, such as those found in Adtech or log files. Historical analysis, by contrast, requires combining fact data with multiple dimensions and modeling these for analysis. Operational reporting tools are not set up to answer questions leading to insights, and many of them do not provide full SQL support for analyzing enterprise transactional data.

Reporting and Dashboards

Another set of applications involves building reports and dashboards on top of the EDW (Enterprise Data Warehouse). Here the focus is on SQL capabilities and on SLAs around the speed and cost of delivering reports. Fast SQL is a key platform capability for these applications, which are built on very rich SQL data models developed by EDW data architects.

The open-source Spark-Druid offering of SparklineData is an example of a platform that addresses this need: OLAP indexing from Apache Druid is combined with Spark SQL to provide a full-fledged SQL environment with the benefit of very fast slice-and-dice using an OLAP Index.

But we found that Spark + Druid did not fully address the needs of typical B.I. solutions such as General Ledger Analytics, Travel Cube, and Campaign Performance Management.

The federated architecture of Spark + Druid has major drawbacks: the semantics of the Druid inverted index are very different from SQL semantics; index management is tied to Druid and not integrated into SQL; and, critically, because the system is federated and does not capture B.I.-level metadata, the scope for optimization is very limited.
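For readers unfamiliar with inverted-index lookups, the following is a toy Python sketch of the general idea behind slice-and-dice on such an index. It is not Druid's or SparklineData's implementation; the rows, column names, and the helpers `build_inverted_index` and `slice_sum` are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Toy fact rows; the column names and values are made up for illustration.
ROWS = [
    {"region": "EMEA", "product": "A", "revenue": 100},
    {"region": "EMEA", "product": "B", "revenue": 50},
    {"region": "APAC", "product": "A", "revenue": 70},
    {"region": "APAC", "product": "B", "revenue": 30},
    {"region": "EMEA", "product": "A", "revenue": 20},
]


def build_inverted_index(rows, dimensions):
    """Map each (dimension, value) pair to the set of row ids holding it."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for dim in dimensions:
            index[(dim, row[dim])].add(row_id)
    return index


def slice_sum(rows, index, filters, metric):
    """Sum `metric` over the rows matching every dimension=value filter."""
    matching = set(range(len(rows)))
    for dim, value in filters.items():
        # Intersect row-id sets instead of testing every row against the filter.
        matching &= index.get((dim, value), set())
    return sum(rows[i][metric] for i in matching)


INDEX = build_inverted_index(ROWS, ["region", "product"])
emea_a = slice_sum(ROWS, INDEX, {"region": "EMEA", "product": "A"}, "revenue")
print(emea_a)
```

Note that the lookup is driven by dimension values and row-id sets rather than by relational operators, which hints at why mapping such an index onto full SQL semantics takes extra work.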

Enterprise B.I.

That brings us to the key platform elements needed to support Enterprise B.I. The system must capture B.I. metadata (Cubes, Dimensions, Hierarchies, KPIs), not just the SQL data model; the Cube data structure must be integrated into the EDW, not surfaced as a separate system; the optimization layer must leverage both the Cube structure and the B.I. metadata; and the runtime must have both SQL and B.I. building blocks.
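To illustrate what capturing B.I. metadata on top of a plain data model might look like, here is a minimal Python sketch. The classes `Hierarchy`, `Dimension`, `KPI`, and `Cube`, the `rollup` helper, and the sample sales rows are all hypothetical; they are not SNAP's metadata model, only one plausible shape for the concepts named above.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Hierarchy:
    name: str
    levels: list  # ordered from coarsest to finest, e.g. ["year", "quarter"]


@dataclass
class Dimension:
    name: str
    hierarchies: list = field(default_factory=list)


@dataclass
class KPI:
    name: str
    metric: str     # fact column to aggregate
    aggregate: str  # only "sum" is handled in this sketch


@dataclass
class Cube:
    name: str
    fact_table: str
    dimensions: list
    kpis: list


def rollup(rows, hierarchy, level, kpi):
    """Sum a KPI at one level of a hierarchy, keyed by that level and its ancestors."""
    depth = hierarchy.levels.index(level) + 1
    keys = hierarchy.levels[:depth]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[kpi.metric]
    return dict(totals)


# Made-up fact rows for illustration.
SALES = [
    {"year": 2018, "quarter": "Q1", "amount": 10},
    {"year": 2018, "quarter": "Q2", "amount": 15},
    {"year": 2019, "quarter": "Q1", "amount": 20},
    {"year": 2019, "quarter": "Q1", "amount": 5},
]

calendar = Hierarchy("calendar", ["year", "quarter"])
revenue = KPI("revenue", metric="amount", aggregate="sum")
cube = Cube("sales_cube", "sales", [Dimension("time", [calendar])], [revenue])

by_year = rollup(SALES, calendar, "year", revenue)
by_quarter = rollup(SALES, calendar, "quarter", revenue)
print(by_year)
print(by_quarter)
```

Because the hierarchy is declared once as metadata, the same `rollup` call can answer questions at any level without the user restating the grouping columns, which is the kind of information a SQL data model alone does not carry.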



Cube Data Structure + B.I. metadata + combined Optimization and runtime is the winning combination, unmatched by other platform architectures used to support B.I.

How does SNAP address the requirements of fast, true B.I.?



SNAP is a Spark-native B.I. platform: by that we mean that we leverage everything Spark offers and add B.I. capabilities to it. We enable B.I. metadata (Cubes, Star Schemas, Dimensions, Hierarchies) to be captured on Spark SQL tables. Our Cube FileFormat structures data in an OLAP Index that provides fast access and partial aggregation on slices of a large multi-dimensional Cube. We enhance Spark’s Catalyst layer with many optimizations for B.I. query patterns, for example Star-Join elimination, Eager and Partial Aggregation, and Dimension/Hierarchy Semi-Join. Finally, we have enhanced the Spark runtime with Cube operations and optimized access to Cube data.
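As an illustration of one of the optimizations named above, the sketch below shows the general idea of eager aggregation: pre-aggregating fact rows by the join key before joining to a dimension. It is plain Python over made-up data, not SNAP's Catalyst rules; the tables `FACTS` and `STORES` and both functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical star schema: fact rows keyed by store_id, plus a store dimension.
FACTS = [
    {"store_id": 1, "amount": 10},
    {"store_id": 1, "amount": 5},
    {"store_id": 2, "amount": 7},
    {"store_id": 3, "amount": 8},
    {"store_id": 3, "amount": 2},
]
STORES = {1: "East", 2: "East", 3: "West"}  # store_id -> region


def join_then_aggregate(facts, stores):
    """Baseline plan: look up the dimension for every fact row, then group."""
    totals = defaultdict(int)
    for fact in facts:
        totals[stores[fact["store_id"]]] += fact["amount"]
    return dict(totals)


def eager_aggregate(facts, stores):
    """Eager plan: pre-aggregate facts by join key, then join the smaller result."""
    partial = defaultdict(int)
    for fact in facts:
        partial[fact["store_id"]] += fact["amount"]
    totals = defaultdict(int)
    for store_id, subtotal in partial.items():
        totals[stores[store_id]] += subtotal
    # Also report how many rows actually reached the join.
    return dict(totals), len(partial)


baseline = join_then_aggregate(FACTS, STORES)
eager, joined_rows = eager_aggregate(FACTS, STORES)
print(baseline, eager, joined_rows)
```

Both plans produce the same totals per region, but the eager plan sends one row per store to the join instead of one row per fact, which matters when the fact table is orders of magnitude larger than the dimension.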



SNAP is Spark-native B.I. The runtime footprint is Spark, and operationally it is as simple as managing a Spark cluster. It can take data from HDFS/S3 or any Spark data source and expose it to visualization tools like Tableau/OBIEE/Spotfire for on-demand ad hoc reporting. It can also plug into notebooks (Jupyter/Zeppelin) for data science.

SNAP has four components designed to meet the complex reporting and enterprise B.I. needs of Big Data.


Different analytics solutions require different platform capabilities. Force-fitting solutions to platforms can lead to high cost and functional issues.

Cube Data Structure + B.I. metadata + combined Optimization and runtime is the winning combination for B.I. at scale.

SNAP on Spark provides a compelling platform for building B.I. solutions at scale.


