Fast BI on Spark SQL

A typical slice-and-dice query on a database has the following pattern.
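The original snippet illustrating the pattern does not survive in this text, so here is a hedged sketch of what such a query typically looks like: filter on a time range and dimensions, group by a few dimensions, aggregate a metric, then order and limit. The table and column names (`sales`, `region`, `product`, `amount`) are illustrative assumptions, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical sales table; schema and data are illustrative, not from the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, sale_date TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('east', 'widget', '2016-01-05', 100.0),
        ('east', 'gadget', '2016-01-07', 250.0),
        ('west', 'widget', '2016-01-09', 75.0);
""")

# The classic slice-and-dice shape: filter, group, aggregate, order, limit.
rows = conn.execute("""
    SELECT region, product, SUM(amount) AS total
    FROM sales
    WHERE sale_date BETWEEN '2016-01-01' AND '2016-01-31'
    GROUP BY region, product
    ORDER BY total DESC
    LIMIT 10
""").fetchall()

for region, product, total in rows:
    print(region, product, total)
```

Every filter change or drill-down a user makes in a BI tool re-issues a query of this shape with different predicates and grouping columns.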


On large datasets, responses to such interactive queries have to be on the order of one or two seconds as users navigate across Tableau worksheets or choose filters in their web application.

A standard in-memory solution may be suboptimal for such slice-and-dice queries. First, caching large amounts of data in memory may not be feasible. Second, even when the data is cached, such queries may not perform at think-time speeds.

Sparkline Data leverages an in-memory OLAP index, built with Druid, alongside Spark to accelerate such queries. Queries like the one above are translated into a Druid groupBy query (under certain conditions the groupBy is optimized to a Druid timeseries or topN query). The results of the Druid query are bridged into the Spark plan via a DruidRDD instance; a Projection on top of the DruidRDD handles any datatype translations and value projections.
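To make the translation concrete, below is a sketch of the Druid groupBy query that a slice-and-dice SQL query could be rewritten into. The structure (queryType, dataSource, intervals, dimensions, aggregations, limitSpec) follows Druid's native JSON query API; the datasource, dimension, and metric names are illustrative assumptions, not taken from Sparkline's planner.

```python
import json

# Sketch of a Druid groupBy query: the SQL GROUP BY columns become
# "dimensions", SUM(amount) becomes a doubleSum aggregation, the time
# filter becomes "intervals", and ORDER BY/LIMIT becomes the limitSpec.
group_by_query = {
    "queryType": "groupBy",
    "dataSource": "sales",
    "granularity": "all",
    "intervals": ["2016-01-01/2016-02-01"],
    "dimensions": ["region", "product"],
    "aggregations": [
        {"type": "doubleSum", "name": "total", "fieldName": "amount"}
    ],
    "limitSpec": {
        "type": "default",
        "limit": 10,
        "columns": [{"dimension": "total", "direction": "descending"}]
    }
}

print(json.dumps(group_by_query, indent=2))
```

Because the index pre-aggregates and inverts the dimension columns, Druid can answer this shape of query without scanning raw rows, which is where the interactive-speed gain comes from.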

More technical details can be found on our GitHub wiki.

