Our earlier benchmarking exercises showed how efficient SNAP can be at delivering the best performance at the lowest cost. SNAP does not need specialized hardware, GPUs, or large clusters.

Many benchmarks do not focus on real use cases involving true ad hoc queries that access multiple regions of a large multi-dimensional dataset. For example, running Tableau workloads is different from running hand-written benchmark queries. This is because of Quick Filters and the fact that each Tableau sheet can result in multiple queries. Further, Tableau users can drag and drop date filters, which results in a min/max query over the entire dataset to get the date range. These kinds of real-world use cases can be challenging for traditional SQL-on-Hadoop and OLAP-on-Hadoop tools and products.
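To make the fan-out concrete, here is a minimal sketch of the queries a single sheet might trigger. It is an assumed pattern for illustration, not the SQL Tableau actually generates: the `impressions` table, its columns, the sample rows, and the `queries_for_sheet` helper are all hypothetical, and SQLite stands in for the real backend.

```python
import sqlite3

# Hypothetical toy fact table; names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE impressions "
    "(event_date TEXT, advertiser TEXT, country TEXT, clicks INTEGER)"
)
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?, ?)",
    [
        ("2016-07-01", "acme", "FR", 10),
        ("2016-07-15", "acme", "US", 7),
        ("2016-08-03", "globex", "FR", 4),
        ("2016-08-20", "globex", "US", 9),
    ],
)


def queries_for_sheet(quick_filter_dims, date_col="event_date"):
    """Return the SQL statements one sheet might issue (an assumed pattern)."""
    queries = []
    # One domain query per quick filter, to populate its list of values.
    for dim in quick_filter_dims:
        queries.append(f"SELECT DISTINCT {dim} FROM impressions ORDER BY {dim}")
    # A date-range filter needs the bounds of the date column,
    # which is a min/max over the whole table.
    queries.append(f"SELECT MIN({date_col}), MAX({date_col}) FROM impressions")
    # The query that actually draws the visualization.
    queries.append(
        "SELECT advertiser, SUM(clicks) FROM impressions "
        "GROUP BY advertiser ORDER BY advertiser"
    )
    return queries


sheet_queries = queries_for_sheet(["advertiser", "country"])
results = [conn.execute(sql).fetchall() for sql in sheet_queries]
date_bounds = results[2][0]
print(len(sheet_queries), date_bounds)
```

Even this toy sheet with two quick filters issues four queries, and three of them scan the whole table regardless of what the user eventually filters on.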

Recently we worked with a large media/adtech company to deploy SNAP on an in-house machine with 48 cores and 244 GB of RAM. Three datasets totaling 1.5 TB were snapped (SNAP OLAP Index) and loaded for ad hoc queries from Tableau and a web UI.

The original datasets resided in an HDFS deep store, and SNAP connected to a Cloudera HDFS cluster to access the data.

The Tableau environment was as follows:

10 worksheets and 2 dashboards

An average of 5 dimension quick filters per sheet

1.2 billion rows, 45 dimensions, and 12 metrics

20+ concurrent users (users with sessions at the exact same time)
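As a rough back-of-the-envelope illustration of what these numbers can mean for the backend, the sketch below estimates a burst of queries. Only the user count and quick-filter count come from the environment above; the per-sheet query pattern is an assumption, not a measurement from this deployment.

```python
# Figures taken from the environment description.
concurrent_users = 20
quick_filters_per_sheet = 5

# Assumptions (not from the deployment): each user opens one sheet at the
# same moment, each quick filter issues one domain query, and the sheet adds
# one date min/max query plus one query for the visualization itself.
queries_per_sheet = quick_filters_per_sheet + 1 + 1
burst_queries = concurrent_users * queries_per_sheet
print(queries_per_sheet, burst_queries)
```

Under these assumptions a single synchronized refresh produces 140 queries, and a dashboard that combines several sheets would multiply that further.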

Response times with SNAP were consistently under 4 seconds for ad hoc queries with date range filters, drag-and-drop dimensional analysis, rollups, drill-downs, and so on. More importantly, the cost of this performance is just one server at under $500 a month.

Multi-dimensional analysis on billion-row datasets involves various users accessing different parts of the cube. For example, a 5 billion row ad-tech dataset can contain data for multiple advertisers and multiple campaigns across various regions and time periods. Each user can potentially access a different section of the cube: some may look for advertisers in July, while others may look for campaigns in August for France. Physical partitioning can be restrictive, and in many dimensional-analysis cases it is not possible at all. SNAP overcomes these limitations through in-memory multi-dimensional analysis, which allows various portions of the Index to be accessed while still maintaining query SLAs.
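The access pattern described here can be sketched with a generic in-memory inverted index over a toy dataset. This is a textbook technique used only for illustration, not a description of how the SNAP Index is implemented; the rows, dimension names, and the `total_impressions` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy rows: (month, country, advertiser, campaign, impressions).
ROWS = [
    ("July", "FR", "acme", "summer", 120),
    ("July", "US", "acme", "summer", 300),
    ("August", "FR", "globex", "launch", 80),
    ("August", "FR", "acme", "summer", 50),
    ("August", "US", "globex", "launch", 210),
]
DIMS = ("month", "country", "advertiser", "campaign")

# Build one in-memory inverted index per dimension: value -> set of row ids.
index = {dim: defaultdict(set) for dim in DIMS}
for row_id, row in enumerate(ROWS):
    for dim, value in zip(DIMS, row):
        index[dim][value].add(row_id)


def total_impressions(**filters):
    """Sum impressions over the rows matching every dimension=value filter."""
    matching = set(range(len(ROWS)))
    for dim, value in filters.items():
        # Intersect with the row ids that carry this dimension value.
        matching &= index[dim].get(value, set())
    return sum(ROWS[row_id][4] for row_id in matching)


# Two users slicing the same data along different dimensions.
july_acme = total_impressions(advertiser="acme", month="July")
august_france = total_impressions(month="August", country="FR")
print(july_acme, august_france)
```

Because every dimension is indexed the same way, neither query depends on how the rows happen to be laid out. A physical partitioning by month alone would not narrow a query that filters only on country or advertiser.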

SNAP can run on Spark standalone or YARN and supports multi-billion-row datasets efficiently. Try it on your dataset today.

