Fast analytics on Spark – Really fast

Interactive ad-hoc analytics requires fast responses. Yet "fast" in many benchmarks refers to single-user tests that do not reflect how business users actually use an analytics or Big Data platform.

A good test is a comprehensive simulation of Tableau users pounding a system with a variety of queries. We simulated one such use case on a single r3.2x node on AWS (6 CPUs, 60 GB memory) against a 122-million-row dataset with 30 dimensions and 10 metrics. The dataset was based on TPC-H.

Our simulation covered 11 concurrent users, which translates to 100+ active Tableau users or 1000+ named users (based on IBM's Cognos BI guidance for translating active users to concurrent users).

The test results are evidence that analytics can be both blindingly fast and operationally cheap, leading to immense savings in TCO.

An r3.2x AWS machine costs around 10 cents per hour, but there are other costs: S3 costs, data movement costs, etc. Factoring in all of this and the fluctuations in spot prices, we very conservatively assume here that the cost per hour is 15 cents.

In our benchmark, queries are processed at a rate of 34.5/min, or around 2000/hour. The average time per query across all query types is 1.277 secs, with slice-and-dice queries running in 100s of milliseconds on average. At 15 cents per hour, the average cost per query is therefore about 0.0075 cents. Indexing speed for the TPC-H dataset (around 35 dimensions, 10 metrics) is 0.5 GB/min, or 30 GB/hour, so the indexing cost is around 0.5 cents ($0.005) per GB of input data.
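The arithmetic behind these per-query and per-GB figures can be reproduced with a short sketch. The numbers come from the text above; the function and variable names are illustrative only and are not part of the benchmark tooling.

```python
# Figures quoted in the post; names below are illustrative assumptions.
COST_PER_HOUR_CENTS = 15.0   # conservative all-in hourly cost assumed above
QUERIES_PER_MINUTE = 34.5    # measured query throughput
INDEX_GB_PER_MINUTE = 0.5    # measured indexing speed


def cost_per_query_cents(cost_per_hour_cents, queries_per_minute):
    """Hourly cost divided by the number of queries served in an hour."""
    queries_per_hour = queries_per_minute * 60
    return cost_per_hour_cents / queries_per_hour


def indexing_cost_cents_per_gb(cost_per_hour_cents, gb_per_minute):
    """Hourly cost divided by the number of GB indexed in an hour."""
    gb_per_hour = gb_per_minute * 60
    return cost_per_hour_cents / gb_per_hour


per_query = cost_per_query_cents(COST_PER_HOUR_CENTS, QUERIES_PER_MINUTE)
per_gb = indexing_cost_cents_per_gb(COST_PER_HOUR_CENTS, INDEX_GB_PER_MINUTE)

print(f"queries per hour:       {QUERIES_PER_MINUTE * 60:.0f}")
print(f"cost per query (cents): {per_query:.4f}")
print(f"indexing cost per GB (cents): {per_gb:.2f}")
```

Using the exact rate of 34.5 queries/min (2070/hour), the cost per query comes out slightly below the 0.0075 cents obtained from the rounded 2000/hour figure. The indexing cost works out to half a cent, that is $0.005, per GB.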

We ran the benchmark using Apache JMeter; details on the DDL and JMeter configuration are here. Each query is parameterized on the part of the multi-dimensional space it operates on: predicates on the product brand, the ship/order date range, and the customer market segment are randomly chosen from a set of possible values for each query. Depending on the query type, performance ranges from sub-second to a few seconds.
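To illustrate the idea of parameterizing each query, here is a minimal Python sketch of randomly choosing predicates. It is not the JMeter setup used in the benchmark: the value sets, the table name `tpch_flat`, the query template, and the function names are all hypothetical; only the column names follow the usual TPC-H naming.

```python
import random
from datetime import date, timedelta

# Hypothetical value sets; the benchmark's real sets live in its JMeter config.
BRANDS = ["Brand#11", "Brand#23", "Brand#34", "Brand#45"]
SEGMENTS = ["AUTOMOBILE", "BUILDING", "FURNITURE", "HOUSEHOLD", "MACHINERY"]
DATE_MIN, DATE_MAX = date(1992, 1, 1), date(1998, 12, 31)


def random_date_range(rng, max_days=90):
    """Pick a ship-date window of 1..max_days days inside [DATE_MIN, DATE_MAX]."""
    span = (DATE_MAX - DATE_MIN).days
    start = DATE_MIN + timedelta(days=rng.randint(0, span - max_days))
    end = start + timedelta(days=rng.randint(1, max_days))
    return start, end


def build_query(rng):
    """Fill a slice-and-dice style template with randomly chosen predicates."""
    brand = rng.choice(BRANDS)
    segment = rng.choice(SEGMENTS)
    start, end = random_date_range(rng)
    return (
        "SELECT p_brand, c_mktsegment, SUM(l_extendedprice) AS revenue "
        "FROM tpch_flat "  # hypothetical flattened table name
        f"WHERE p_brand = '{brand}' "
        f"AND c_mktsegment = '{segment}' "
        f"AND l_shipdate BETWEEN '{start.isoformat()}' AND '{end.isoformat()}' "
        "GROUP BY p_brand, c_mktsegment"
    )


rng = random.Random(42)  # fixed seed so a test run is repeatable
for _ in range(3):
    print(build_query(rng))
```

Because every simulated user draws different predicate values, repeated runs touch different parts of the multi-dimensional space rather than replaying one identical query.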

Now think about the implications of this for your massive data lake. You can run 2000 complex queries an hour, supporting 100s of users, on a single box that costs 15 cents an hour. Since this scales horizontally, the ROI can be immense.

Contact us for more details on the benchmark or to pilot your dataset on SNAP.

 

 


sparklinedata
