Fast aggregations/metrics on Spark with Tableau

Ad-hoc queries with sub-second response times are critical for enterprises. Vast amounts of data live in Hadoop or AWS data lakes, and consuming this data at scale and speed with existing BI tools like Tableau is a challenge. Transactions at the lowest grain (hourly, daily, etc.) are stored in fact tables. To achieve an acceptable level of performance, companies resort to writing extracts or summary tables with pre-aggregated data, often failing business SLAs.
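To illustrate, the pre-aggregation workaround often looks something like the sketch below (table and column names are hypothetical): a summary table materialized from the fact table on a schedule. Every new reporting requirement tends to mean another such table to build and refresh.

```sql
-- Hypothetical fact table sales_fact holds transactions at the
-- lowest (hourly) grain. A common workaround is to materialize a
-- daily summary per store, rebuilt on a schedule:
CREATE TABLE sales_daily_summary AS
SELECT store_id,
       CAST(event_ts AS DATE) AS sale_date,
       SUM(amount)            AS total_amount,
       COUNT(*)               AS txn_count
FROM   sales_fact
GROUP  BY store_id, CAST(event_ts AS DATE);
```

Any question the summary table does not anticipate (a new dimension, a different grain, a new join) forces either a slow scan of the fact table or yet another summary table.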

Extracts and summary tables can become onerous to manage as well. Requirements change, and joins with other tables may be needed. Even with pre-aggregation and extracts, the queries can be painfully slow.

With Sparkline SNAP, the data can be stored at the lowest grain, without any pre-aggregation. Sparkline SNAP returns fast results on the fly, with no manually materialized pre-aggregated cubes or tables.

It is simple to point Tableau at Sparkline SNAP using Tableau's built-in Spark SQL connector. As an example, consider the dashboard below, which shows a few metrics. The dashboard draws data from three sheets, and the underlying data spans multiple years and can run into hundreds of millions of rows.

When the user applies filters, Tableau submits multiple queries, one or more per sheet, to the live data source (in our case, Sparkline SNAP). For a good user experience, these queries should all execute and return within seconds. Furthermore, the same SLAs should hold for hundreds of concurrent users.
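The queries Tableau generates for a sheet are typically straightforward aggregations over the live connection. A sketch of one such query (with hypothetical table and column names) shows why this is hard at scale without pre-aggregation:

```sql
-- One of several queries Tableau might issue when a filter changes:
-- aggregate a metric over an arbitrary dimension and date range,
-- directly against the lowest-grain fact table.
SELECT region,
       SUM(amount) AS total_amount
FROM   sales_fact
WHERE  event_ts >= '2014-01-01'
  AND  event_ts <  '2016-01-01'
GROUP  BY region;
```

Because the filtered dimension and date range change with every user interaction, no fixed summary table can anticipate them all; the engine must answer each query quickly from the detail data itself.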

In-memory databases work well for some use cases but are not sufficient for large enterprise scale. Date filters spanning multiple years, filters on specific dimensions, and on-the-fly aggregations on any dimension or date would still require a full scan. Sparkline SNAP is optimized for these kinds of ad-hoc queries, adding advanced indexing and columnar compression to in-memory data, along with several logical optimization techniques.

SNAP is designed to scale horizontally on your existing Spark clusters, supporting hundreds of concurrent users.
