Exploring Big Data with Sparkline SNAP, Spark SQL and nteract with Python
Sparkline SNAP is used as a full fledged datawarehouse in place of traditional MPPs at large enterprises where fast data access is required for ad hoc analysis and reporting.
Increasingly we see data engineers, used to tools like Jupyter notebooks, accessing SNAP data because its fast and iterative and simple to use with Python.
First let us start with the connection – SNAP uses the Spark/hive thrift server and so connecting to a SNAP /Spark cluster is a simple step. In production SNAP supports Kerberos and various authentication mechanisms that are available as part of the Spark JDBC/ODBC driver.
The query is a Spark SQL query( The way to query SNAP is through Spark SQL ). We use the ipython-sql magic and query a 120 million row dataset running on an r3 4x demo machine that is shared among various use cases. The original dataset and the SNAP Qube(sales_demo_index1) live on S3. The cluster is a single node Spark cluster. There is no Hadoop or other processing engines involved.
The first thing to notice is the time. Querying a slice of a 120 million row Qube on a single shared r3.4x takes under 2 seconds. SNAP scales out elastically and if you allocate more cores most queries will be in the order of 100 milli seconds ( SLA for some of our customers ) .
Also note that we are running on standard R3 on AWS and there are no GPU gimmicks.
So now that we have an idea of the quantity and volume of revenue by region, month and nation, lets drill further into detail. I want to drill into month 6 and let see how the revenue is distributed by quantity buckets.
NOTE: Since SNAP does not pre-aggregate data unlike old style OLAP, users can drill to any level of detail in-place.
Again we query SNAP to get the Brand and Part sizes, drilling into month=6, in addition to other filters we had earlier. Now we are still querying the live 120 million row dataset but further exploring a slice of the Qube.
The above query is returning over 10000 rows back to the client on my laptop( I am running nteract a desktop app on my laptop). Hence the 3.4 s. The actual query execution on the SNAP side as expected should be < 1s. Let’s verify that on the Spark UI (All SNAP queries can be monitored through the Spark Console). The first query below is the same one that just got executed and it took 0.7 s in SNAP with the remaining 2.7s to ship 10000 result rows back to my app.
So now that we have our data, lets again pivot it with Pandas and look at the distribution by size and a new quantity bucket across countries.
Note: We are using binning here with Pandas but you can do server level binning for higher volume queries on the SNAP end as well.
I do want to explore all brands for size 6 across countries and see if there are any unseen issues from all this data.
Note through all this process I am filtering down and across. This same in-memory SNAP Qube is also being accessed by several other users, some using Tableau and some using Jupyter notebooks for their own needs. Every user has their own filters and aggregation.
SNAP Qubes are in memory and hence provide a very fast Analytics view that can be shared by hundreds of concurrent users accessing various parts of the multi-dimensional dataset. And every user would see lightning fast response times. Unlike query caching, in-memory SNAP does not cache query results by default. Query caching will not work when there is truly adhoc querying across multiple dimensions and filters by several hundred users. SNAP allows query caching and can be turned on if needed. For this exercise we have turned it off.
So now, let us look at the Brands with part size 6 and see how revenue is distributed.
We use the ‘funtastic’ Altair library to plot and visualize the result. Revenue for Brands 31 , 51 and some more seem to be much lower than others. I can now pass on this dataset with notes to my colleagues looking for “store/zip code” wise anomalies for these brands and sizes.
Before that let us look at what is dragging down these brands by drilling down to day level detail.
With nteract/Jupyter notebook and Sparkline SNAP, I can analyze datasets fast and try to get to the root cause. With SNAP on Spark, I can do such B.I queries at blazing speed and also combine machine learning use cases all in the same environment.
In the next post we will see some examples of combining B.I with machine learning .