Terabyte-Scale Data Lake Analytics on S3 and Hadoop with Spark

In our recent work with customers, there is one constant: the need to make sense of terabytes of fact and time-series data that lands in the data lake (physically, S3 or HDFS).

Here is a typical process before we get engaged. 

The first step in this process is organizing data in the data lake.

A typical fact table for our customers, such as a log of all advertising-exposure events, contains data at a very low granularity (an event captured every second, say) and consists of several dimensions, metrics, and timestamps.

Fact tables are typically “time-series” data and hence can be partitioned by date/hour. Most queries filter data by time, and new data arrives in new partitions.
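Concretely, date/hour partitioning is often expressed as Hive-style paths, so a time filter only scans the matching partitions. Here is a minimal sketch in plain Python; the bucket name, table name, and path layout are illustrative assumptions, not a specific customer setup:

```python
from datetime import datetime, timedelta

def partition_path(bucket, table, ts):
    """Hive-style partition path for one hour of fact data (illustrative layout)."""
    return f"s3://{bucket}/{table}/dt={ts:%Y-%m-%d}/hr={ts:%H}/"

def partitions_for_range(bucket, table, start, end):
    """List the hourly partitions a time-range filter would need to scan."""
    paths, ts = [], start
    while ts < end:
        paths.append(partition_path(bucket, table, ts))
        ts += timedelta(hours=1)
    return paths

paths = partitions_for_range(
    "adtech-lake", "ad_exposures",
    datetime(2016, 3, 1, 0), datetime(2016, 3, 1, 3))
# a 3-hour filter touches exactly 3 partitions, not the whole table
```

This partition pruning is what keeps time-filtered queries from reading the entire table, whether the engine is Spark or anything else reading the same layout.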

All of this data can physically reside on S3 or HDFS.

In addition to such fact tables, customers have dimension tables containing master data: the names and other metadata for the IDs in the fact tables. For example, a campaign dimension table would contain campaign names, start and end dates, and information about the strategies and creatives employed.

Analysts use Tableau or other tools to query both dimension and fact tables, joining them at query time or pre-joining them into a flattened table.
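The pre-join step can be sketched in plain Python with tiny in-memory stand-ins for the tables; the rows and column names here are invented for illustration:

```python
# Tiny in-memory stand-ins for a fact table and a campaign dimension table.
facts = [
    {"campaign_id": 1, "ts": "2016-03-01T00:00:05", "impressions": 3},
    {"campaign_id": 2, "ts": "2016-03-01T00:00:09", "impressions": 7},
]
campaigns = {
    1: {"campaign_name": "Spring Sale", "strategy": "retargeting"},
    2: {"campaign_name": "Brand Push",  "strategy": "prospecting"},
}

def flatten(facts, campaigns):
    """Pre-join facts with the dimension, producing one wide (flattened) table."""
    return [{**f, **campaigns[f["campaign_id"]]} for f in facts]

flat = flatten(facts, campaigns)
# each flattened row now carries the campaign's name and strategy
# alongside its metrics, so no join is needed at query time
```

Pre-joining trades storage for query speed: the flattened table is wider, but analysts' queries become simple scans and filters instead of runtime joins.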

The challenge, pre-Sparkline Data.

The key challenge is the speed of consumption (queries). Ad-hoc queries on data at this scale are so expensive (many nodes/clusters) and so slow that most companies expose only the most recent week or month of data to analysts. If more history is needed, the hourly-granularity data is aggregated to monthly and exposed as a summary table.
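The hourly-to-monthly rollup just described amounts to a group-by over a truncated timestamp. A minimal sketch in plain Python, with invented column names and values:

```python
from collections import defaultdict

def rollup_monthly(rows):
    """Aggregate hourly rows to (month, campaign) totals."""
    totals = defaultdict(int)
    for r in rows:
        month = r["ts"][:7]  # "YYYY-MM" prefix of an ISO timestamp
        totals[(month, r["campaign_id"])] += r["impressions"]
    return dict(totals)

hourly = [
    {"ts": "2016-03-01T00", "campaign_id": 1, "impressions": 10},
    {"ts": "2016-03-01T01", "campaign_id": 1, "impressions": 5},
    {"ts": "2016-04-02T00", "campaign_id": 1, "impressions": 2},
]
summary = rollup_monthly(hourly)
# → {("2016-03", 1): 15, ("2016-04", 1): 2}
```

The summary table is small and fast to query, but the hourly detail is gone: any question that needs a dimension or granularity not baked into the rollup cannot be answered from it, which is exactly the limitation described above.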

This does not scale when business needs change and more dimensions and metrics are introduced, and the “last 10 days extract” approach restricts analysts' depth of view into the data.

The Sparkline Data approach

Sparkline Data eliminates these bottlenecks by leveraging an OLAP index, built on Druid, to store a flattened dataset (pre-joined, if needed, between facts and dimensions) at the lowest granularity analysts require. Because the index is compressed, our customers can store 12+ months of this data. The dataset can then be queried through Sparkline and exposed to standard tools such as Tableau or other SQL front ends.

When dealing with very large datasets, as in ad tech, IoT, payments, retail/e-commerce, banking, telecom/mobile data, etc., analysts need ad-hoc analysis at the speed of thought. It's not just the response time for a few users that matters, but the performance of the system as a whole when thousands of users are pounding on a multi-terabyte dataset.

Sparkline Data is Spark native and can work with data in S3 or HDFS.

Learn more and get a personalized demo on your dataset.

Watch our videos on how you can get to fast BI with minimal spend on terabyte-scale data.

