Apache Spark for Enterprise Data Warehousing

Most people, when they think of Apache Spark, think of machine learning and data science. Spark is much more than that.

Enterprises today struggle to make sense of the alphabet soup in Hadoop. Big Data used to be synonymous with Hadoop. However, Hadoop is not one thing.

The biggest value of Hadoop for analytics and BI was, and is, HDFS, its distributed file system. But with object stores and a decoupled compute/storage architecture, even that is questionable.

Take enterprise BI workloads, which are SQL-heavy, for example. It is well known that many enterprises that run Hadoop clusters don’t really have petabyte-scale data. What is good for Facebook or Netflix may not be good for you in enterprise IT when you are running a 10-node Hadoop cluster. The headaches of managing such clusters are simply not worth it.

Companies that traditionally used Hive or Impala can now get a faster and better ROI for most use cases by running Spark SQL against an object store like S3. Using S3 as a data lake on AWS is a much simpler option when coupled with Spark. Apache Spark with Spark SQL can be deployed with or without a Hadoop cluster or a YARN resource manager, and it supports fast queries on billions of rows of data.
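As a rough illustration of what querying an S3 data lake with Spark SQL can look like, here is a minimal PySpark sketch. The bucket path, the `orders` view, and the `order_date` and `amount` columns are all hypothetical, and the sketch assumes the cluster already has the S3A connector (hadoop-aws) on its classpath and AWS credentials available from the environment.

```python
def build_daily_revenue_query(view_name: str) -> str:
    """Build a typical BI-style aggregate query (hypothetical schema)."""
    return (
        "SELECT order_date, SUM(amount) AS revenue "
        f"FROM {view_name} "
        "GROUP BY order_date "
        "ORDER BY order_date"
    )


def run_on_s3(path: str = "s3a://example-bucket/warehouse/orders/"):
    """Read Parquet files from S3 and run the aggregate with Spark SQL.

    The path is a placeholder. PySpark is imported here so the query
    builder above can be used without a Spark installation.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-sql-sketch").getOrCreate()

    # Expose the S3 data to SQL by registering it as a temporary view.
    orders = spark.read.parquet(path)
    orders.createOrReplaceTempView("orders")

    # Returns a DataFrame; nothing is computed until an action is called.
    return spark.sql(build_daily_revenue_query("orders"))
```

On a suitably configured cluster, `run_on_s3().show()` would print the aggregated rows. No HDFS is involved in this sketch: the data stays in S3 and Spark supplies only the compute.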

One knock on Spark and Hive has been interactive query performance on true BI workloads meant to replace traditional OLAP use cases, especially when querying object stores.

Sparkline SNAP addresses precisely this gap. SNAP is built on Apache Spark and uses Spark SQL, but provides very fast responses through its powerful in-memory indexes, advanced optimization and elastic caching mechanisms, all purpose-built for enterprise data warehousing needs. The operational cost can be as low as 1/10th of that of operating a Hadoop cluster for analytics workloads, and the far simpler management and operational setup leads to huge savings.

SNAP works with Tableau, Looker and Spotfire, as well as Jupyter and Zeppelin notebooks, and allows you to slice and dice your data at the speed of thought.

Schedule a demo of Spark SQL on S3 data if you are interested in learning more.

