Advanced Tableau on Spark/Hadoop
Most benchmarks on data warehouse optimizations and SQL engines stop at simple examples. The real world uses business intelligence tools, where the use case is not a single user running a single SQL query as in a simulated benchmark.
Modern B.I on Big Data should satisfy three key requirements:
- Should be able to respond interactively, in seconds, as a user drills down into data in Hadoop/Spark.
- While B.I is not about retrieving a million-row result set back to the user, the platform should allow querying slices of a petabyte-scale multi-dimensional dataset, where each user accesses a specific part of the dataset.
- Should work at high levels of concurrency. Hadoop, Spark and similar platforms were originally designed for ETL or data science use cases, where multi-user concurrency requirements were lower. With mainstream B.I, the platform should be able to support multi-user queries across terabyte-scale datasets.
Let us take an example of a Tableau workbook with a what-if analysis on a sales demo dataset.
This sales demo dataset is based on the TPC-H benchmark dataset of orders, parts, suppliers, etc. We want to see how revenue would have been impacted if the discount applied to various orders had been different.
The sheet in Tableau below looks at 3 years of historical data in Hadoop and applies filters.
- A parameter-driven dimension filter: in this case, on Order Priority. The query translates to asking “What would be the revenue change if I had applied X% less discount on orders with MEDIUM priority? Exclude the market segment FURNITURE from this analysis.”
- Users may also want to drill down into specific Regions and Nations which are defined as Hierarchies in Tableau.
- The slider on the What-If segment can be moved to simulate changes in revenue based on the filters.
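As a rough illustration of the kind of query such a sheet boils down to, here is a minimal sketch in Python. It is not the SQL that Tableau actually generates. It assumes a hypothetical denormalized `sales` table with column names loosely modeled on TPC-H, uses an in-memory SQLite database as a stand-in for the Spark SQL engine, and reads "X% less discount" as a relative reduction of each order's discount (one possible interpretation). The helper `what_if_revenue` is a made-up name for this example.

```python
import sqlite3


def what_if_revenue(conn, priority, excluded_segment, discount_cut):
    """Return (actual, simulated) revenue for the filtered slice.

    discount_cut is the assumed relative reduction of each discount,
    e.g. 0.10 means "apply 10% less discount".
    """
    sql = """
        SELECT
            SUM(extendedprice * (1 - discount)) AS actual,
            SUM(extendedprice * (1 - discount * (1 - :cut))) AS simulated
        FROM sales
        WHERE orderpriority = :priority      -- parameter-driven dimension filter
          AND mktsegment <> :excluded        -- excluded market segment
    """
    params = {"cut": discount_cut, "priority": priority, "excluded": excluded_segment}
    row = conn.execute(sql, params).fetchone()
    return row[0], row[1]


# Tiny hypothetical sample; real data would live in Hadoop/Spark.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "orderpriority TEXT, mktsegment TEXT, extendedprice REAL, discount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("3-MEDIUM", "BUILDING", 1000.0, 0.10),
        ("3-MEDIUM", "FURNITURE", 500.0, 0.05),   # excluded by segment filter
        ("1-URGENT", "BUILDING", 2000.0, 0.20),   # excluded by priority filter
        ("3-MEDIUM", "MACHINERY", 400.0, 0.05),
    ],
)

# Slider position: halve the discount on MEDIUM-priority orders.
actual, simulated = what_if_revenue(conn, "3-MEDIUM", "FURNITURE", 0.5)
print(f"actual={actual:.2f} simulated={simulated:.2f} change={simulated - actual:.2f}")
```

Each move of the slider only changes the `:cut` parameter, so the same aggregate is re-run against the filtered slice. That is why interactive response times matter for this kind of sheet.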
This is one sheet operating on a multi-dimensional dataset, but hundreds of users could be running such queries against the dataset at the same time, and SLAs still have to be maintained.
Each sheet in Tableau can produce multiple SQL queries.
A short video on analyzing Hadoop/Spark data with Sparkline Data from Tableau: interactive analysis on large datasets.
Sparkline Data allows business users to deploy Tableau on live data, with the full power of Tableau, against large-scale Hadoop or S3-based data lakes.
Next, we will show how you can pass windowing functions through from Tableau into Sparkline so that they can be executed at scale on your cluster. Stay tuned.