BI and AI with context – a key to deeper insights.
Business Intelligence and Data Science initiatives drive at similar outcomes: gaining insights and extracting useful patterns about a particular business activity. The activity could be anything from analyzing sales, to web traffic, to ad campaigns, and on and on. The insights sought are equally varied: trends in sales, performance of campaigns based on clicks, problematic customers based on returns, etc.
But across these vastly different activities, it is likely that the Analysis involves:
- A relational database system, and hence at least part of the analysis is done using SQL queries.
- Analysts have a semantic model in mind about their data, even if it is informal; for example it is quite common to think in terms of Dimensions and Metrics. Metrics capture details about an activity, like sales_amount and sales_quantity for a sales event; Dimensions capture the context of the event, like the product sold, the store in which the sale happened, the customer who bought, etc.
Further, even though the analyses can be very varied, it is possible to talk about common patterns such as:
- Looking for anomalies by contrasting behavior with a Benchmark: for example, find the problematic customers for a store, i.e. the ones whose total_returns exceed the average_store_return benchmark by 20%.
- Understanding patterns about an Interesting Entity Set, where set membership is based on activity: for example, look for buying and demographic patterns in Customers who bought via all channels (catalog, web and store). The Customers and/or activity may be restricted to sales in this year, or to a particular region, or to some household demographic, etc.
- Contrasting behavior between different Contexts: for example, find the Items whose sales dropped more than 10% compared to last year. This could be a first step in a broader analysis to understand the drivers of the drop in sales. Again, the Context may be further focused on Items in a particular Category, or a time period, etc.
It is equally likely for an analysis to be focused on a subpopulation or sub-context as on the entire dataset. So we could be looking for problematic customers in a geographic region, or trying to understand customer behavior for a certain age group or household demographic; as opposed to a report on all problematic customers, or a job to model buying behavior across all customers.
Doing this in SQL is hard
SQL isn’t the easiest language for expressing detailed analytical questions: questions that involve multiple Contexts, that require comparing behavior across Contexts, or that involve many layers of sub-calculations before you get to the final result.
For example, consider the first pattern, finding problematic customers; here is a query that does that (it is from the TPCDS Benchmark). We will go over this query in the next section; the point here is that it is really hard to see the question being asked from the SQL query alone:
WITH customer_total_return AS (
  SELECT sr_customer_sk AS ctr_customer_sk,
         sr_store_sk AS ctr_store_sk,
         sum(sr_return_amt) AS ctr_total_return
  FROM store_returns, date_dim
  WHERE sr_returned_date_sk = d_date_sk
    AND d_date > to_date("2000-01-01")
  GROUP BY sr_customer_sk, sr_store_sk)
SELECT c_customer_id
FROM customer_total_return ctr1, store, customer
WHERE ctr1.ctr_total_return >
      (SELECT avg(ctr_total_return) * 1.2
       FROM customer_total_return ctr2
       WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
  AND s_store_sk = ctr1.ctr_store_sk
  AND s_state = 'TN'
  AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id
LIMIT 100
It takes a minute to realize the analysis is restricted to returns from the year 2000 onward and to the State of Tennessee. Because the medium of expression in SQL is tables and relational operations, the intent of the analytical question can get buried when you just look at the SQL query.
Very often the semantics of the domain are missing from the SQL layer: the fact that the tables are laid out in a Star Schema (which implies a lossless decomposition of fact information), the functional dependencies between columns, the statistics of columns, the hierarchies typically navigated when analyzing the data, etc.
Worse, an analyst often outsources the SQL writing to a SQL specialist or to a tool that generates SQL. By the time they are done, it is often only about the SQL: it is unfair to expect them to understand the domain or the intent of the analysis.
In this problematic-customers query, a key piece of information is that we are focusing on the State of Tennessee. But does it matter whether we convey this context or not? The SQL Engine will faithfully and correctly execute the query either way.
Knowing the context, and having an understanding of the data model, can help the SQL Optimizer generate smarter plans: ones that do just the right amount of work. Restricting the work done by the SQL Engine to just the facts from the State of Tennessee makes a huge difference in query performance and in the total work done by the system. No knock on Tennessee, but it is reasonable to assume at least a 100-fold reduction in the facts that need to be processed.
This observation led us to investigate two paths:
- Is there a way to be more explicit about the Analysis Context in the question being asked? MDX has been tried, but getting folks to move off SQL has proven to be a fool’s errand. There is, however, a way to make substantial progress through the use of Analytical Views. In a subsequent post we will talk about Hierarchical Views, which make comparisons of different Contexts (like Country Sales vs. State Sales) easy and direct to express, and Cube Views, which expose Calculated Measures. These are columns defined using a formula language that combines SQL row-level functions with navigation functions like children, parent and siblings (allowing the expression of calculations that combine, say, State sales with their parent Country sales).
- On the other hand, can we infer the Analytical Context from a SQL query? This led to the development of a patented technique, ‘Dimensional Context Propagation for Analytical SQL Queries’, which is the focus of this blog post.
A Sales Analysis Scenario
Let’s walk through a representative scenario and a couple of sample analyses. Consider an enterprise with sales data from three channels: store, web and catalog. The sales data is embellished with a lot of rich context: customer information (demographics, household demographics), product information (classification, manufacturer, color, size, etc.), time and date information (weekend, holiday, first day of week/quarter/month, etc.) and store information (geography, management, floor space, hours, etc.). Here is a high-level logical model of the tables involved:
Store, Catalog and Web Sales tables capture the actual sales activity(facts) and are linked to contextual information about the Item, Date, Time, Customer, Customer Address and Customer Demographics. In addition each Sale type captures information about the Store, Web page or Catalog where the sale happened. A detailed description of this scenario can be found in the TPCDS Benchmark.
What the pure relational model doesn’t capture is the semantics of the data, which our Apache Spark native SNAP BI Platform (you can read about SNAP in an earlier blog post) fills in: it allows the domain expert to define Star Schemas (implying the lossless decomposition of facts into Fact tables and Dimension Contexts), business hierarchies, functional dependencies and much more. Here is an example of the store_sales Star Schema definition:
create star schema on store_sales as
  many_to_one join of store_sales with time_dim on ss_sold_time_sk = t_time_sk
  many_to_one join of store_sales with store on ss_store_sk = s_store_sk
  many_to_one join of store_sales with date_dim on ss_sold_date_sk = d_date_sk
  many_to_one join of store_sales with item on ss_item_sk = i_item_sk
  many_to_one join of store_sales with promotion on ss_promo_sk = p_promo_sk
  many_to_one join of store_sales with customer_demographics on ss_cdemo_sk = cd_demo_sk
  many_to_one join of store_sales with customer on ss_customer_sk = c_customer_sk
  many_to_one join of store_sales with customer_address on ss_addr_sk = ca_address_sk
  many_to_one join of store_sales with household_demographics on ss_hdemo_sk = hd_demo_sk
  many_to_one join of store with date_dim on s_closed_date_sk = d_date_sk
  many_to_one join of promotion with date_dim on p_start_date_sk = d_date_sk
  many_to_one join of promotion with date_dim on p_end_date_sk = d_date_sk
  many_to_one join of promotion with item on p_item_sk = i_item_sk
  many_to_one join of customer with date_dim on c_customer_sk = d_date_sk
  many_to_one join of customer with customer_demographics on c_current_cdemo_sk = cd_demo_sk
  many_to_one join of customer with customer_address on c_current_addr_sk = ca_address_sk
  many_to_one join of customer with household_demographics on c_current_hdemo_sk = hd_demo_sk
  many_to_one join of household_demographics with income_band on hd_income_band_sk = ib_income_band_sk
  table date_dim
    d_date_id determines d_date_sk
    d_date determines d_date_sk
    d_day_name determines d_weekend
    level hierarchy d_week_seq, d_month_seq, d_quarter_seq
  table time_dim
    t_time_id determines t_time_sk
    t_time determines t_time_sk
    t_hour determines t_am_pm
  table store
    s_division_id determines s_division_name
    s_company_id determines s_company_name
  table promotion
    p_promo_id determines p_promo_sk
  table customer
    c_customer_id determines c_customer_sk
  table customer_address
    ca_address_id determines ca_address_sk
Now let’s walk through two typical analysis questions: we look at the SQL and then show how understanding and applying the dimensional context to their Query Plans has a huge impact on Query Performance.
TPCDS Query 1: Identify Problem Customers
We have already posted the query above; the figure to the right shows what a typical optimized SQL query plan looks like. The left aggregation sub-plan computes return amounts at the Customer and Store grain, while the right aggregation computes return amounts at the Store grain; these are then joined to extract the Customer, Store combinations whose return amounts exceed the corresponding store average by 20%. After this, the rows are filtered on the store’s state being Tennessee.
This is an expensive query to process because of the two scans on store_returns. But it is clear from the model semantics that the Context of the analysis is the State of Tennessee and the year 2000.
Dimensional Context Propagation (DCP) is a query optimization technique that analyzes the Query Plan Graph from the perspective of each FactSource in the plan. A FactSource is any relation that is a source of facts (the sales tables in the TPCDS model): this could be the original fact table or some other materialization of the facts.
DCP traverses the Query Plan Graph from each FactSource, using data model semantics (the lossless-join decomposition of the Fact and Dimension Context join graph, the functional dependencies among columns, and the physical materialization of facts) to actively push the Dimensional Context onto the operator scanning the facts. This leads to a significant reduction in the number of facts processed and can also reduce the cost of processing each fact row.
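To make the idea concrete, here is a minimal Python sketch of the core propagation step. All names here are hypothetical illustrations (the real implementation works on Spark query plans, not Python dicts): given a star schema of many-to-one, lossless joins from a fact table to its dimensions, a predicate on a dimension column can be re-expressed as a filter on the fact scan itself.

```python
# Hypothetical join-graph metadata: fact table -> {dimension: (fact_key, dim_key)}.
STAR_SCHEMA = {
    "store_returns": {
        "date_dim": ("sr_returned_date_sk", "d_date_sk"),
        "store":    ("sr_store_sk", "s_store_sk"),
    }
}

def propagate_context(fact, dim_filters, lookup_keys):
    """For each filtered dimension joined many-to-one to `fact`, turn the
    dimension predicate into an in-list on the fact's join key, which can
    then be pushed onto the fact scan."""
    pushed = {}
    for dim, pred in dim_filters.items():
        if dim not in STAR_SCHEMA[fact]:
            continue  # dimension not joined to this FactSource; nothing to push
        fact_key, dim_key = STAR_SCHEMA[fact][dim]
        # Evaluate the predicate on the (small) dimension table to get
        # the qualifying dimension key values.
        pushed[fact_key] = lookup_keys(dim, dim_key, pred)
    return pushed

# Toy dimension data and key lookup for illustration.
DIMS = {
    "store": [{"s_store_sk": 1, "s_state": "TN"},
              {"s_store_sk": 2, "s_state": "CA"}],
}

def lookup_keys(dim, dim_key, pred):
    return [row[dim_key] for row in DIMS[dim] if pred(row)]

pushed = propagate_context(
    "store_returns",
    {"store": lambda r: r["s_state"] == "TN"},
    lookup_keys,
)
# pushed == {"sr_store_sk": [1]}: only facts for Tennessee stores get scanned
```

The point of the sketch is the direction of information flow: the predicate starts on a dimension table, but because the join is many-to-one and lossless, it can be restated as a restriction on the fact scan before any join runs.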
The two most common materialization techniques are pre-aggregated views and Cube representations such as OLAP Indexes. Our SNAP Platform enables fact data to be maintained as OLAP Indexes: pre-joined materializations of facts (pre-joined with contextual attributes). In a sense ‘Index’ is a misnomer, as this structure (as is typical of an ‘OLAP Index’) contains the data in columnar form (just like the ORC or Parquet file formats) along with inverted indexes on dimension values. For each dimension value a position bitmap is maintained, so applying a predicate like state = 'Tennessee' and date = '2000' is a very fast, computationally cheap bitmap AND operation, followed by a scan based on the positions remaining after the bitmap AND.
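The bitmap mechanics can be sketched in a few lines of Python (illustrative only; SNAP's actual on-disk index format is not shown here, and the bit patterns are made up). Each dimension value maps to a position bitmap over the fact rows; a conjunctive predicate becomes a bitwise AND, and the surviving bit positions drive the scan.

```python
# Inverted index per dimension value: bit i set => fact row i matches.
state_index = {"TN": 0b10110, "CA": 0b01001}
year_index  = {2000: 0b10011, 1999: 0b01100}

def matching_positions(*bitmaps):
    """AND the per-value bitmaps and return the surviving row positions,
    i.e. the positions a skip-scan would visit."""
    acc = ~0  # all ones; identity for AND
    for b in bitmaps:
        acc &= b
    positions, i = [], 0
    while acc:
        if acc & 1:
            positions.append(i)
        acc >>= 1
        i += 1
    return positions

# state = 'TN' AND year = 2000: one cheap bitwise AND, then a position scan.
rows = matching_positions(state_index["TN"], year_index[2000])
# rows == [1, 4]: only two of the five fact rows are ever touched
```

Note that the AND itself never touches fact data; the cost of predicate evaluation is proportional to the bitmap size, not to the number of fact columns.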
Besides DCP, SNAP has many optimizations and physical operations that make Aggregate-Join plans involving Facts and Dimensions very fast and efficient in resource utilization. Some of the features relevant to this discussion are:
- The ability to rewrite Fact-Dimension table joins to use the SNAP Index and hence eliminate potentially expensive Star Joins. An added benefit of join elimination is the pushing of often highly selective dimension filters (applied as fast bitmap operations) down to the SNAP Index scan.
- The ability to do partial aggregation at the block level (think of a block as a small region in a vast multi-dimensional space). Performing block-level partial aggregation can greatly reduce the number of output rows coming out of a FactSource scan.
- An Index SemiJoin operation, similar to the traditional Left-Semi Join used to optimize joins in distributed SQL engines, with the additional step of pushing the joining predicate as an `in-list` to the SNAP Index, so the filtering of facts is a very fast bitmap operation followed by a skip-scan based on the position bits set in the final bitmap.
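The second feature above, block-level partial aggregation, can be sketched as follows (the block layout and row contents are invented for illustration): the scan emits one partially aggregated row per (block, group) instead of one row per fact, and a final aggregation merges the partials.

```python
from collections import defaultdict

def scan_with_partial_agg(blocks, key_fn, measure_fn):
    """Aggregate within each block before emitting rows; downstream
    operators see far fewer rows than there are facts."""
    for block in blocks:
        partial = defaultdict(float)
        for row in block:
            partial[key_fn(row)] += measure_fn(row)
        yield from partial.items()

# Two blocks of fact rows (hypothetical data).
blocks = [
    [{"store": 1, "amt": 10.0}, {"store": 1, "amt": 5.0}],
    [{"store": 1, "amt": 2.0}, {"store": 2, "amt": 7.0}],
]
partials = list(scan_with_partial_agg(blocks,
                                      lambda r: r["store"],
                                      lambda r: r["amt"]))
# 4 fact rows collapse to 3 partial rows across the two blocks.

# Final merge of the partials (the cheap, small aggregation step).
final = defaultdict(float)
for store, amt in partials:
    final[store] += amt
# final == {1: 17.0, 2: 7.0}
```

With real data the reduction is far more dramatic: a block holding thousands of facts for a handful of group keys emits only that handful of rows.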
The DCP technique accounts for the SNAP Index capabilities, and the final plan we get is shown on the right. It involves a SNAP Index scan with:
- predicates on state and date pushed to the SNAP Index,
- the Date dimension join eliminated,
- partial aggregation performed by block within the SNAP Index scan.
So eliminating the Date join, pushing the highly selective predicate on the state column, the fast bitmap-based predicate evaluation, the position-bitmap-based scan and the ability to perform block-level partial aggregates all combine to provide an execution that is orders of magnitude faster and cheaper than the original plan.
SNAP Indexing in itself provides robust performance gains on typical Aggregate-Join query plans; combining this with the DCP technique of inferring pushable context enhances the advantage of the SNAP Index manyfold.
Now suppose the predicate in the query were on the store_manager column, and the semantics of your model dictated that store_manager be modeled as a Type 1 (update-in-place) change. In this case we cannot push the store_manager predicate to the SNAP Index scan, because pre-joining update-in-place fields is not an option: we want to analyze based on the current store_manager value. In such cases the Index SemiJoin can provide almost the same benefit. Based on the store_manager predicate we can very cheaply compute an in-list over some dimension column in the index, typically the key of the dimension (the store key in this case). This in-list can be added as a predicate on the SNAP Index scan, potentially giving the same benefit as the predicate on state. We use column-level statistics to estimate the selectivity of pushable predicates and the potential cost reduction on the scan of facts; the Index SemiJoin kicks in only if the cost reduction is above a certain threshold.
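A minimal sketch of that decision, with invented table contents and an invented selectivity threshold: the manager predicate is evaluated on the dimension table, and the qualifying store keys become an in-list for the index scan only when they are selective enough to pay off.

```python
# Hypothetical store dimension with a Type 1 (update-in-place) attribute.
store_dim = [
    {"s_store_sk": 1, "s_manager": "Ann"},
    {"s_store_sk": 2, "s_manager": "Bob"},
    {"s_store_sk": 3, "s_manager": "Ann"},
]

def index_semi_join_inlist(dim_rows, key, pred, selectivity_threshold=0.5):
    """Evaluate the dimension predicate and return the in-list of keys to
    push to the index scan, or None when the predicate is not selective
    enough for the semi-join to pay off."""
    keys = [r[key] for r in dim_rows if pred(r)]
    if len(keys) / len(dim_rows) > selectivity_threshold:
        return None  # too many matching keys; scan without the in-list
    return keys

inlist = index_semi_join_inlist(store_dim, "s_store_sk",
                                lambda r: r["s_manager"] == "Bob")
# inlist == [2]: the index scan can filter on s_store_sk IN (2) via bitmaps
```

In the real optimizer the threshold decision would come from column-level statistics rather than a hard-coded ratio; the ratio here just stands in for that cost estimate.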
TPCDS Query 10: Understand Customers who shop using all channels
Query 10 is about customers who have shopped using all Channels(store, web and catalog). It focuses on Quarter 1 of 2002 and on certain counties.
SELECT cd_gender, cd_marital_status, cd_education_status, count(*) cnt1,
       cd_purchase_estimate, count(*) cnt2, cd_credit_rating, count(*) cnt3,
       cd_dep_count, count(*) cnt4, cd_dep_employed_count, count(*) cnt5,
       cd_dep_college_count, count(*) cnt6
FROM customer c, customer_address ca, customer_demographics
WHERE c.c_current_addr_sk = ca.ca_address_sk
  AND ca_county IN ('Rush County', 'Toole County', 'Jefferson County',
                    'Dona Ana County', 'La Porte County')
  AND cd_demo_sk = c.c_current_cdemo_sk
  AND exists(SELECT * FROM store_sales, date_dim
             WHERE c.c_customer_sk = ss_customer_sk
               AND ss_sold_date_sk = d_date_sk
               AND d_year = 2002 AND d_moy BETWEEN 1 AND 1 + 3)
  AND (exists(SELECT * FROM web_sales, date_dim
              WHERE c.c_customer_sk = ws_bill_customer_sk
                AND ws_sold_date_sk = d_date_sk
                AND d_year = 2002 AND d_moy BETWEEN 1 AND 1 + 3)
       OR exists(SELECT * FROM catalog_sales, date_dim
                 WHERE c.c_customer_sk = cs_ship_customer_sk
                   AND cs_sold_date_sk = d_date_sk
                   AND d_year = 2002 AND d_moy BETWEEN 1 AND 1 + 3))
GROUP BY cd_gender, cd_marital_status, cd_education_status, cd_purchase_estimate,
         cd_credit_rating, cd_dep_count, cd_dep_employed_count, cd_dep_college_count
ORDER BY cd_gender, cd_marital_status, cd_education_status, cd_purchase_estimate,
         cd_credit_rating, cd_dep_count, cd_dep_employed_count, cd_dep_college_count
LIMIT 100
The context of the analysis is hard to see from the SQL, but gets a little clearer when you see a typical optimized Query Plan:
The driver of this analysis is the Customer table scan on the left; the joins with the three Sales tables ensure we only consider customers who have shopped using each channel.
Assuming that the semantics of the data model are such that the fact tables are linked to customer_address rows via the customer table, we can infer that the Context of the Sales scans is restricted by the counties predicate on the address table. It is important to understand the implication of this assumption, though: the model is saying that sales should be tied to the customer’s address at the time of the query; another way to model this is to record the customer’s address at the time of the sale.
Applying DCP, along with SNAP Indexes on each of the fact tables, the plan becomes:
Again, since the DCP rewrite can push the highly selective predicates on Date and County onto the SNAP Index, the query execution improvement (and processing cost reduction) is massive. Note that even with a semantic model where customer addresses are tied to the time of sale, we can still get a significant improvement in the SNAP platform by inferring the Date context of Quarter 1 of 2002.
Data analysis activities imply a semantic model on the data, even if only informally: using Dimensions and Metrics is a very common way to anchor an analysis. No matter the analysis domain or activity, many common patterns show up: looking for anomalies, seeking insights about entities that share some behavioral pattern, and so on. An analysis is often focused on a particular subpopulation or sub-context, not on the entire dataset. Inferring the Dimensional Context of an analysis from SQL queries and applying it during query execution can have a huge payoff.