Going beyond Data Lakes

We often see customers start to build data lakes on Hadoop or S3 as a way to bring their transactional data and dimensional data together in a common place. This data is cleaned and organized in a star schema, much like in an enterprise data warehouse.

The challenge begins here: consuming data from a Hadoop data lake is not easy.

The first challenge is ad hoc analytics. When terabytes of data are stored in the data lake and the fact tables run into millions or billions of rows, think-time responses to queries are not possible unless every single workload is known beforehand and summary tables or cubes are purpose-built for each need.

Of course, in the absence of any other option, engineers are asked to build a slew of tables, each satisfying a specific business case. Managing these soon becomes a massive challenge, while users constantly complain about performance.
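The summary-table pattern described above can be sketched in miniature. This is an illustrative example only (the table and column names are invented, and SQLite stands in for a warehouse engine): one pre-aggregated table answers one known question cheaply, but every new question demands yet another table.

```python
import sqlite3

# Hypothetical fact table: sales by store and day (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (store TEXT, day TEXT, amount REAL);
INSERT INTO sales_fact VALUES
  ('nyc', '2020-01-01', 10.0),
  ('nyc', '2020-01-02', 20.0),
  ('sfo', '2020-01-01', 5.0);

-- One purpose-built summary table per known workload:
-- here, total sales per store.
CREATE TABLE sales_by_store AS
  SELECT store, SUM(amount) AS total
  FROM sales_fact
  GROUP BY store;
""")

# Dashboards read the small summary instead of scanning the fact table.
rows = dict(conn.execute("SELECT store, total FROM sales_by_store"))
print(rows)  # {'nyc': 30.0, 'sfo': 5.0}

# A new question (say, sales per day) needs yet another summary table,
# which is how the proliferation of tables begins.
```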

The technical aspect of satisfying modern query workloads, SQL or otherwise, involves two layers:

  1. An OLAP index on large datasets, since 99% of queries on a data warehouse are not full table scans.
  2. A query serving layer that can dynamically understand the query patterns (filters, aggregations, joins, etc.) and respond by querying the right index for the task.
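The two layers above can be sketched as follows. This is a minimal, hypothetical illustration, not Sparkline's implementation: a pre-aggregated "cube" and an inverted filter index stand in for real OLAP indexes, and `route_query()` plays the serving layer that inspects the query shape and picks the right one.

```python
# Layer 1: indexes built ahead of time over the fact data (invented names).
FACT = [
    {"region": "east", "year": 2020, "sales": 10},
    {"region": "east", "year": 2021, "sales": 20},
    {"region": "west", "year": 2020, "sales": 5},
]

# Pre-aggregated cube keyed by region, for GROUP BY region queries.
CUBE = {}
for row in FACT:
    CUBE[row["region"]] = CUBE.get(row["region"], 0) + row["sales"]

# Inverted index from (column, value) to row ids, for selective filters.
FILTER_INDEX = {}
for i, row in enumerate(FACT):
    FILTER_INDEX.setdefault(("year", row["year"]), []).append(i)

# Layer 2: the serving layer routes each query to the right index.
def route_query(query):
    if query.get("group_by") == "region" and not query.get("filter"):
        return CUBE  # answered entirely from the cube, no scan
    if query.get("filter"):
        col, val = query["filter"]
        ids = FILTER_INDEX.get((col, val), [])
        return sum(FACT[i]["sales"] for i in ids)  # touch only matching rows
    return sum(r["sales"] for r in FACT)  # fall back to a full scan

print(route_query({"group_by": "region"}))      # {'east': 30, 'west': 5}
print(route_query({"filter": ("year", 2020)}))  # 15
```

The point of the sketch is the routing decision: neither index alone serves all query shapes, so the serving layer must recognize the pattern before choosing where to read.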

Sparkline Data’s SNAP platform is designed to scale consumption of data on a data lake, whether the need is reporting, ad hoc analytics, or machine learning/AI workloads.
