Tableau and OLAP analytics on Hadoop data

Most customers find Hadoop based query access to tools like Tableau cumbersome and slow. When you account for concurrency requirements of enterprises, using a B.I tool on top of Hadoop turns to a project with a slew of summarized extracts consuming enormous resources and time and leading to productivity sinks.

With SNAP we have a different approach to dealing with the ad-hoc query requirements on large datasets- whether they are in Hadoop or elsewhere.

Lets us take an example of a a typical Tableau workbook. We have a demo data based on a dataset from the Transportation government data.

 

The units of Ad-hoc analysis are dimensions and measures. Analysts ask questions like – What is the average arrival delay by airports for the year 2016. How does Arrival delays vary by time of day for San Francisco.

Such questions invariably involve aggregations( sum, average) and filters ( where airport=San Francisco) an timestamps ( between date 1 and date 2 or hour 1 and hour2).

It has been shown many times over( Druid for example) that to enable ad-hoc analysis at scale and OLAP index in memory is the fastest path to performance.

SNAP takes your datasets and creates an OLAP index with dimensions and metrics at the “lowest grain”. This is important since we believe that summary tables and extracts are onerous and high maintenance and all analysts should have access to ALL data ALL the time.

An OLAP index with SNAP consists of your data in a columnar format with an inverted index against it. AN inverted index has a mapping of column values to row indices – for example San Francisco –> rows( 5, 20, 25 ) [ 0, 0,0,0,1,0,0…]

There are many papers on using Bit Map indices for OLAP workloads and prior generation products like Sybase IQ have implemented aspects of such methods.

The definition of an OLAP index in SNAP mirrors what you would normally see in B.I products – Dimensions and Metrics.

create olap index flights_snap_local on flights_fixed_local
timestamp dimension issue_date spark timestampformat "MM/dd/yyyy" is nullable nullvalue "01/01/1970"
dimension Month is nullable nullvalue "NA"
dimension DayofMonth is nullable nullvalue "NA"
dimension DayOfWeek is nullable nullvalue "NA"
dimension UniqueCarrier is nullable nullvalue "NA"
dimension FlightNum is nullable nullvalue "NA"
dimension TailNum is nullable nullvalue "NA"
dimension Origin is nullable nullvalue "NA"
dimension Dest is nullable nullvalue "NA"
dimension Cancelled is nullable nullvalue "NA"
dimension CancellationCode is nullable nullvalue "NA"
dimension Diverted is nullable nullvalue "NA"
dimension DepTime is nullable nullvalue "NA"
dimension CRSDepTime is nullable nullvalue "NA"
dimension ArrTime is nullable nullvalue "NA"
dimension CRSArrTime is nullable nullvalue "NA"
dimension type is nullable nullvalue "NA"
dimension manuf is nullable nullvalue "NA"
dimension model is nullable nullvalue "NA"
dimension status is nullable nullvalue "NA"
dimension aircraft_type is nullable nullvalue "NA"
dimension engine_type is nullable nullvalue "NA"
dimension manufyear is nullable nullvalue "NA"
dimension description is nullable nullvalue "NA"
dimension flights_airports_dest.airport is nullable nullvalue "NA"
dimension flights_airports_dest.city is nullable nullvalue "NA"
dimension flights_airports_dest.state is nullable nullvalue "NA"
dimension flights_airports_dest.country is nullable nullvalue "NA"
dimension flights_airports_dest.lat is nullable nullvalue "NA"
dimension flights_airports_dest.longi is nullable nullvalue "NA"
metric ActualElapsedTime aggregator longSum is nullable nullvalue "0.0"
metric CRSElapsedTime aggregator longSum is nullable nullvalue "0.0"
metric AirTime aggregator longSum is nullable nullvalue "0.0"
metric ArrDelay aggregator longSum is nullable nullvalue "0.0"
metric Distance aggregator longSum is nullable nullvalue "0.0"
metric TaxiIn aggregator longSum is nullable nullvalue "0.0"
metric TaxiOut aggregator longSum is nullable nullvalue "0.0"
metric CarrierDelay aggregator longSum is nullable nullvalue "0.0"
metric WeatherDelay aggregator longSum is nullable nullvalue "0.0"
metric NASDelay aggregator longSum is nullable nullvalue "0.0"
metric SecurityDelay aggregator longSum is nullable nullvalue "0.0"
metric LateAircraftDelay aggregator longSum is nullable nullvalue "0.0"
metric ArrDelay aggregator longSum is nullable nullvalue "0.0"
options (path "/snap/samples/flights/flights_snap_all", nonaggregatequeryhandling "push_project_and_filters",avgsizeperpartition "800mb",
 preferredsegmentsize "50mb",
 rowflushboundary "100000",
 defaultNullValueInIndex "",
 indexSizeReduction "0.6"
)
partition by year
 ;

When a Tableau query is issued to SNAP we analyze the plans and optimize the query to go against our SNAP index and return results in seconds. Since all this happens within Spark with no proprietary components, the acceleration applies to any Spark workloads- not just SQL.

Contact us now for a free POC on your dataset.

 


sparklinedata

Comments are closed.