Causal Analytics

otterk10 · on Aug 15, 2019

Scott here from ClearBrain - the ML engineer who built the underlying model behind our causal analytics platform.

We’re really excited to release this feature after months of R&D. Many of our customers want to understand the causal impact of their products, but are unable to iterate quickly enough running A/B tests. Rather than taking the easy path and serving correlation based insights, we took the harder approach of automating causal inference through what's known as an observational study, which can simulate A/B experiments on historical data and eliminate spurious effects. This involved a mix of linear regression, PCA, and large-scale custom Spark infra. Happy to share more about what we did behind the scenes!

6gvONxR4sf7o · on Aug 16, 2019

>observational study, which can simulate A/B experiments

This is 100% overselling. Observational studies can be suggestive, but cannot replace experiments. Unobserved variables cannot be accounted for.

otterk10 · on Aug 16, 2019

Thanks for the feedback! I totally agree about observational studies being suggestive but not replacing A/B tests - that’s why the main use case I listed in the blog (and how current customers have used the product so far) is “prioritization of a/b tests”, not replacing a/b tests themselves. The language around “simulating a/b tests” is just a way to try to concisely explain to someone at a high level who may not be very technical or have much experience with causal inference. Happy for suggestions on how to better convey this without over-selling!

bertil · on Aug 16, 2019

I’ve noticed two questions on twitter:

- Do you use a causal graph? Would it make sense?

- Spark seems over-kill for what you yourself describe as regression: is there something more intensive here that we could be missing?

otterk10 · on Aug 16, 2019

Our analysis runs over our user’s customer data (usually collected through either a tag manager or a CDP such as Segment), which is a few petabytes of data for some of our larger customers. The reason for using Spark is to quickly transform this massive amount of raw data into a ML-ready format. You’re correct that the regression itself does not need to be done inside of Spark.

otterk10 · on Aug 16, 2019

We didn’t explore causal graphs because doing so would require manually creating a causal graph for each relationship that you wish to explore. Our goal was to create an automated approach that could provide an estimate of the treatment effect for any page/event within your app.

lootsauce · on Aug 15, 2019

Would love to hear more about the architecture and ml behind your approach. We have been doing more ml in BigQuery and it has been a great fit for us.

otterk10 · on Aug 15, 2019

Good to hear! In my experience, BigQuery ML (and other cloud ml products) is great for creating basic models out of the box, but don't provide a ton of flexibility for non-standard ML use-cases. For example, our approach to causal analytics requires doing things such as dimensionality reduction and computing a covariance matrix that are not available through BigQuery ML.

So what we've done instead is create a SparkML task that can read in a feature matrix and trains and scores the causal analytics model. The causal lift estimates for each user are then written out to BigQuery so that in our frontend a customer can filter for, say, users between the ages of 18-35, and then within seconds we'll return them the causal lift of viewing page X for this segment.

cuchoi · on Aug 16, 2019

Very exciting to see causal theory being productionized!

From the article, this seems like a normal regression to me. Would be interesting to know what makes it causal (or at least better) compared to an OLS. PCA has been used for a long time to select the features to use in regression. Would it be accurate to say that the innovation is on how the regression is calculated rather than the statistical methodology?

Either way, it would interesting to test this approach against an A/B test and check how much an observational study differs from the A/B estimates, and how sensitive is this approach to including (or not) a set of features. Also would be interesting to compare it to other quasi-experimental methodologies, such as propensity score matching.

Is there a more extended document explaining the approach?

Good luck!

otterk10 · on Aug 16, 2019

Yes, you're correct that the underlying algorithm used is very close to OLS. What allows the regression to provide an estimate for average treatment effects is how it is structured. Namely, adding in pre-treatment confounders as well as interactions between the treatment and confounders. I found this chapter (http://www.stat.columbia.edu/~gelman/arm/chap9.pdf) on causal inference does a good job of outlining the approach.

Yes, we actually explored other approaches such as PSM. The main reason we did not initially go with PSM was because of the compute power required - you would need to train a model for each treatment variable. However, we're actually in the midst of developing a way to train a model for each treatment variable efficiently, which will allow us to add items such as inverse propensity weighting (or explore other approaches such as PSM).

lern_too_spel · on Aug 16, 2019

This approach only works if all confounders are known, which is never the case in practice, so the model you fit is correlational and not suitable for causal inference. Propensity matching suffers from the same issue if the propensities are estimated from the same features. If not all confounders are known, you must be able to find instrumental variables to build a causal model.

cuchoi · on Aug 16, 2019

Thanks for answering!

6gvONxR4sf7o · on Aug 15, 2019

I only skimmed it, so forgive me if I got this wrong. The causal model used here makes some incredibly strong (unlikely to be close enough to accurate) assumptions. Are these results valid if there are unobserved confounders or selection bias?

benmaraschino · on Aug 16, 2019

Well, at the end of the day, you can never really be sure of strongly ignorable treatment assignment/unconfoundedness, no matter what problem you’re working on. Especially if you’re an economist or an epidemiologist working with data that’ve been collected by someone else—you can’t exactly easily go back and measure more predictors of treatment assignment. But if you’re running a website, there are a lot of variables you can measure on the user end and more opportunities to iterate, and so SITA then begins to look more and more like a better bet.

6gvONxR4sf7o · on Aug 16, 2019

Maybe you run a different kind of website or ask different kinds of questions, but despite being able to measure all kinds of things, there's so much at my job that you need experimentation for. You do the observational study, and it points in this direction. Sometimes it's true and sometimes it's not. Selling this kind of observational analysis as 'you don't need A/B tests anymore' is totally disingenuous.

otterk10 · on Aug 16, 2019

Thanks for the feedback! I totally agree about observational studies being suggestive but not replacing A/B tests - that’s why the main use case I listed in the blog (and how current customers have used the product so far) is “prioritization of a/b tests”, not replacing a/b tests themselves. The language around “simulating a/b tests” is just a way to try to concisely explain to someone at a high level who may not be very technical or has never heard of an observational study. Happy for suggestions on how to better explain observational studies to less technical customers without over-selling! It’s something we’ve been iterating on ourselves.

lern_too_spel · on Aug 16, 2019

You're right. I would go so far as to say the assumptions that their model makes are never valid in practice on real world problems.

mrbonner · on Aug 16, 2019

I have been involving in causal inference analysis since 2015. We use a mixed model of decision tree and fixed effect regressions. I read your paper and could not find a reference of why, while one cannot do AB test to verify the relationship but can use observational analysis to do it. Could you share a reference please? Thank you for this insightful article!

otterk10 · on Aug 16, 2019

You can definitely do an AB test to verify the causal relationship - in fact, that's the preferred method! Our platform is for situations where you didn't run an A/B test - either because you can't run as many as you'd like or you forgot - in order to give you an estimate after the fact.

whoisnnamdi · on Aug 16, 2019

Cool stuff, thanks for sharing publicly.

Did you all consider using Double Selection [1] or Double Machine Learning [2]?

The reason I ask is that your approach is very reminiscent of a Lasso style regression where you first run lasso for feature selection then re-run a normal OLS with only those controls included (Post-Lasso). This is somewhat problematic because Lasso has a tendency to drop too many controls if they are too correlated with one another, introducing omitted variable bias. Compounding the issue, some of those variables may be correlated with the treatment variable, which increases the chance they will be dropped.

The solution proposed is to run two separates Lasso regressions, one with the original dependent variable and another with the treatment variable as the dependent variable, recovering two sets of potential controls, and then using the union of those sets as the final set of controls. This is explained in simple language at [3].

Now, you all are using PCA, not Lasso, so I don't know if these concerns apply or not. My sense is that you still may be omitting variables if the right variables are not included at the start, which is not a problem that any particular methodology can completely avoid. Would love to hear your thoughts.

Also, you don't show any examples or performance testing of your method. An example would be demonstrating in a situation where you "know" (via A/B test perhaps) what the "true" causal effect is that your method is able to recover a similar point estimate. As presented, how do we / you know that this is generating reasonable results?

[1] http://home.uchicago.edu/ourminsky/Variable_Selection.pdf [2] https://arxiv.org/abs/1608.00060 [3] https://medium.com/teconomics-blog/using-ml-to-resolve-exper...

otterk10 · on Aug 16, 2019

Thanks! Yes, the concerns you mentioned would also apply to PCA. What we've actually done to help alleviate this is a union of components from y-aware[1] and normal PCA to capture variables that are correlated to both the dependent variables and (hopefully) most of the treatment variables. This seems similar to the double selection approach you mention - the difference being that since we are trying to run this at scale for 1000s of treatment variables, running a feature selection with each of the 1000 treatment variables as the dependent variable isn't super feasible, so the normal PCA acts as proxy for this part of the double selection.

Regardless, we're never going to completely remove omitted variable bias, as we're never going to capture 100% of relevant variables. One way we monitor our model's bias is by looking at the error distribution between users in the treatment vs control. If these aren't similar, there's too much bias in our estimate of the treatment effect, so we wouldn't want to serve an estimate of the treatment effect for this variable to our customers.

The current product is in beta and we're working with some of our current customers to try to re-create our results with A/B tests. I'm hoping that by our GA release in the fall we'll have some case studies with specific examples!

[1] http://www.win-vector.com/blog/2016/05/pcr_part2_yaware/

kk58 · on Aug 16, 2019

Did you guys look into Partial mutual information for confounding variable selection

Granger causality for estimating Granger cause

whirlofpearl · on Aug 15, 2019

Looks like you lifted this straight of Judea Pearl's seminal research.

Congratulations! Just remember to patent it :)

otterk10 · on Aug 15, 2019

Thanks! You're correct in surmising that our approach was heavily influenced by Judea Pearl's research.

And yes, the timing of the blog post isn't a coincidence, we actually filed a patent last week :)

bertil · on Aug 16, 2019

If you are implementing a documented technique, what are your claims of originality for a patent?

I’m asking because we are building our own implementation of mSPRT. There are some variants but I didn’t expect enough to patent. We are confronted with internal debates and I’d rather have actual examples than ageing memory of my law class.

bmahmood · on Aug 16, 2019

Thanks for the interest! (Cofounder of Clearbrain here).

The patent covers a combination of statistical techniques and engineering systems we built. The tricky part of this is the infrastructure needed to select confounding variables and estimate treatment effects for thousands of variables at scale in seconds. That was what we filed a patent on.

stewart87 · on Aug 15, 2019

Just read Pearl's book! Can you link to the patent?

selimthegrim · on Aug 16, 2019

I'm reading his book right now as well. Looking forward to your work.

move-on-by · on Aug 16, 2019

An analytics platform without a privacy policy? :(

404: https://www.clearbrain.com/privacy

404: https://www.clearbrain.com/terms

bmahmood · on Aug 16, 2019

Apologies! Looks like the site was mid-update when you noted the 404s. It's back live now :)

Rainymood · on Aug 16, 2019

Interesting to note that ClearBrain is in YC.