class: inverse, center, middle
# Bayesian Causal Inference for Data-Driven Businesses

## A Bayesian framework for decision making

### Ignacio Martinez

### [whybayes.martinez.fyi](https://whybayes.martinez.fyi)

#### Bay Area Tech Economics Seminar Series

#### 2023/04/20

---
class: center, middle, inverse

# Disclaimer:

### I am speaking on my own behalf and not on behalf of Google or Alphabet.

---
class: center, middle, inverse

# Before we start

## A bit about me and the Econ Team @Google

---

## A bit about you

<center>
<iframe src="https://www.mentimeter.com/app/presentation/blzmj791umda1w3tynjr8kr5m77cyaje" width="65%" height="400px" data-external="1"></iframe>
</center>

---

## My goals for today

1. You feel comfortable asking questions, and openly disagreeing with me.
2. I convince you that:
  + The lure of incredible certitude is a big problem.
  + Decision-makers should "think in bets".
  + "Data scientists" need to spend a lot more time asking questions before touching any data.
  + "Data scientists" should make sure that decision-makers understand that quality of evidence is a continuum, and there is no such thing as a free lunch.
  + The Bayesian framework is underutilized, and we should not be afraid of using informative priors.
  + The quality of your presentation is as important as the quality of your analysis.

---
class: center, middle, inverse

# Everyone wants to be data-driven

## What does that mean?

---

## Some questions that people ask us

<!---- This is a macro for Mathjax ----->
$$\newcommand\dollars{\$}$$
<!---- --------------------------- ----->

.large[
1. Does a new launch meaningfully move the needle?
2. Is the ROI of a given investment high enough to justify that investment?
3. How likely is it that the ROI of investing `\(\dollars X\)` in A is higher than the ROI of investing `\(\dollars X\)` in B?
4. Is the effect of the new shiny thing stable over time?
5. What is the effect of pitching X to a client?
6. What is the effect of adopting X for a client?
7. What kinds of customers are affected more/less after a new launch?
]

---

## Challenges

.large[
1. Do we have a valid counterfactual?
  + Spillovers
  + Heterogeneity
  + General equilibrium effects
  + Long-term vs. short-term effects
2. Communicating our findings in a clear and actionable way.
]

---
class: center, middle, inverse

# The status quo

---

# The lure of incredible certitude

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt5[
A common perception among economists who act as consultants is that the public is either unwilling or unable to cope with uncertainty. Hence, they argue that pragmatism dictates provision of point predictions and estimates, even though they may not be credible.
.tr[
— Charles F. Manski, <a href="https://www.nber.org/papers/w24905.pdf" target="_blank">The Lure of Incredible Certitude (2018)</a>
]]

---

# The Null Hypothesis Ritual

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt5[
1. Set up a statistical null hypothesis of "no mean difference" or "zero correlation." Don't specify the predictions of your research hypothesis or of any alternative substantive hypotheses.
2. Use 5% as a convention for rejecting the null. If significant, accept your research hypothesis.
3. Always perform this procedure.
.tr[
— Gigerenzer et al., <a href="https://www.researchgate.net/publication/241372934_The_Null_Ritual_What_You_Always_Wanted_to_Know_About_Significance_Testing_but_Were_Afraid_to_Ask" target="_blank">The Null Ritual (2004)</a>
]]

--

.bg-washed-red.b--dark-red.ba.bw2.br3.shadow-5.ph4.mt5[
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
.tr[
— Wasserstein & Lazar, <a href="https://www.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108?needAccess=true" target="_blank">ASA Statement on Statistical Significance and P-Values (2016)</a>
]]

---

### The problem with p-values

.smaller[
* The p-value is not the probability that the impact is zero.
* The p-value cannot answer most business-relevant questions (e.g., the probability that the pilot moved the needle).
* A p-value of 0.05 does not imply a 95% probability that the pilot worked.
* A correct interpretation of a statistically significant p < 0.05 does not imply a 95% probability that the pilot moved the needle in a meaningful way.
* A **correct** interpretation of a statistically significant finding:
> When .red[the true effect is zero], there is a 5% chance that .blue[the impact estimate is statistically significant (p < 0.05).]
* An **incorrect** interpretation:
> When .blue[the impact estimate is statistically significant (p < 0.05)], there is a 5% chance that .red[the true effect is zero.]
* A statistically significant finding can have the wrong magnitude, or even the wrong sign.
]

???
Stress how ridiculous sharp nulls are. The true effect is exactly zero!

1. [Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129-133.](https://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108#.Vt2XIOaE2MN)
2. [Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641-651.](https://journals.sagepub.com/doi/full/10.1177/1745691614551642)

---

# In Praise of Confidence Intervals

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt5[
Most empirical papers in economics focus on two aspects of their results: whether the estimates are statistically significantly different from zero and the interpretation of the point estimates. This focus obscures important information about the implications of the results for economically interesting hypotheses about values of the parameters other than zero, and in some cases, about the strength of the evidence against values of zero. This limitation can be overcome by reporting confidence intervals for papers' main estimates and discussing their economic interpretation.
.tr[
— David Romer, <a href="https://www.aeaweb.org/articles?id=10.1257/pandp.20201059" target="_blank">In Praise of Confidence Intervals (2020)</a>
]]

---

### Confidence intervals are often misinterpreted too

* **Correct** interpretation:
> A particular procedure, when used repeatedly across a series of hypothetical data sets, yields intervals that contain the true parameter value in 95% of the cases. The key is that CIs do not provide a statement about the parameter as it relates to the particular sample at hand.

* **Incorrect** interpretations:
> 1. The probability that the true mean is greater than 0 is at least 95%.
> 2. The probability that the true mean equals 0 is smaller than 5%.
> 3. The "null hypothesis" that the true mean equals 0 is likely to be incorrect.
> 4. There is a 95% probability that the true mean lies between 0.1 and 0.4.
> 5. We can be 95% confident that the true mean lies between 0.1 and 0.4.
> 6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4.

???
[Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157-1164.](https://ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf)
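---

### What those definitions mean in practice

A hedged sketch (not from the original deck): simulate many pilots where .red[the true effect is exactly zero] and check the two frequentist guarantees directly — about 5% of estimates come out "significant," and about 95% of the confidence intervals contain the true (zero) effect.

```r
# Sketch: the frequentist guarantees, checked by simulation.
set.seed(9782)
sims <- 5000

results <- replicate(sims, {
  y0 <- rnorm(100)  # control outcomes; the true treatment effect is exactly zero
  y1 <- rnorm(100)  # treatment outcomes, drawn from the same distribution
  test <- t.test(y1, y0)
  c(p = test$p.value,
    covers = test$conf.int[1] <= 0 & test$conf.int[2] >= 0)
})

mean(results["p", ] < 0.05)  # ~0.05: "significant" findings under a true zero effect
mean(results["covers", ])    # ~0.95: CIs that contain the true parameter value
```

Neither number tells you the probability that *your* pilot moved the needle; that question needs a posterior.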
---

# Statistical Significance, p-Values, and the Reporting of Uncertainty

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt5[
Although I do not endorse a ban on the reporting of p-values, I do agree that over the years, and in some disciplines more than others, p-values and statistical significance have been overemphasized. In many cases, the p-value or the measure of statistical significance is not the relevant output from an analysis of a dataset. Therefore, its prominence in the abstracts of many empirical papers is misplaced. It would be preferable if reporting standards emphasized confidence intervals (as Romer 2020 suggests) or standard errors, and, even better, Bayesian posterior intervals.
.tr[
— Guido Imbens, <a href="https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.35.3.157" target="_blank">Statistical Significance, p-Values, and the Reporting of Uncertainty (2021)</a>
]]

---
class: center, middle, inverse

# What I believe we should be doing

---

## Thinking in bets

.large[
* We should all think of our decisions as bets.
* We should focus on improving our odds of success, rather than trying to predict the future with certainty.
* Beware of hindsight bias, confirmation bias, and anchoring bias.
]

.footnote[
\* [Duke, A. (2019). Thinking in bets: Making smarter decisions when you don't have all the facts.](https://www.google.com/books/edition/Thinking_in_Bets/CI-RDwAAQBAJ?hl=en&gbpv=1&dq=+%22thinking+in+bets%22++by+Annie+Duke+&pg=PA1&printsec=frontcover)
]

???
* Hindsight bias is the tendency to see past events as having been more predictable than they actually were. This bias can lead us to overestimate our ability to predict the future.
* Confirmation bias is the tendency to seek out information that confirms our existing beliefs, and to ignore or discount information that contradicts them. This bias can lead us to make poor decisions, because we are not considering all of the available information.
* Anchoring bias is the tendency to rely too heavily on the first piece of information we receive, and to use it as the starting point for our decisions.

**Here are some examples of each bias:**

* Hindsight bias: After a major sporting event, people often say that they "knew" the winning team would win. In reality, many of them would have said the same had the losing team won.
* Confirmation bias: A person who believes that vaccines cause autism may only seek out information that supports that belief, and may ignore or discount information that contradicts it.
* Anchoring bias: A person who is asked to estimate the value of a car may be influenced by the first price they hear, even if that price is far from the actual value of the car.
---

# Questions you need to know the answer to

.large[
1. What is the business decision that we are trying to inform?
  + What is the outcome that we are trying to improve?
  + How are we going to measure that outcome?
2. What are the relevant questions we should be asking to inform that decision?
3. What are the possible answers to those questions, and the actions that decision-makers would take in each case?
4. What are the different alternatives we have available to answer those questions?
  + What are the pros and cons of each of those?
  + How much uncertainty are we willing to tolerate?
5. What is the best way of communicating what we have learned?
]

---

# The Bayesian approach

.large[
* Bayesian statistics is a method that allows you to use data to reallocate credibility across possibilities.
]

--

.large[
* In general, before doing a study we know that there are only three possibilities:
  + The intervention will meaningfully improve the outcome we care about.
  + The intervention will have no meaningful impact on the outcome we care about.
  + The intervention will meaningfully hurt the outcome we care about.
]

--

.large[
* This approach estimates the probability of each of these outcomes using the data we have available.
]

--

.large[
* These probabilities can be used to inform decisions.
]

---
class: center, middle, inverse

# A simple example

---

## Set-up

.left-column[
* Imagine you want to decide whether or not to kill a product.
* You would like to keep the product if, and only if, the effect of this product on the outcome of interest is at least .red[0.1].
* In order to inform this decision, you ran a well-designed pilot.
]

.right-column[
```r
library(dplyr)
library(ggplot2)
library(broom)
library(rstan)

seed <- 9782
set.seed(seed)
N <- 200

fake_data <- tibble(
  x = rnorm(n = N, mean = 0, sd = 1),
  t = rbinom(n = N, size = 1, prob = 0.5),
  e = rnorm(n = N, mean = 0, sd = 0.4),
  seed = seed
) %>%
* mutate(y = 0.6 * x + 0.02 * t + e) # true effect of t is 0.02, below the 0.1 bar
```
]

---

## Status quo

.pull-left[
```r
lm1 <- lm(data = fake_data, formula = y ~ x + t) %>%
  tidy(., conf.int = TRUE, conf.level = 0.95) %>%
  filter(term == "t")

plot <- ggplot(data = lm1, aes(y = estimate, x = term)) +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
  # dashed red line: the smallest effect worth keeping the product (0.1)
  geom_hline(yintercept = 0.1, linetype = "dashed", color = "red") +
  # dotted blue line: no effect at all
  geom_hline(yintercept = 0, linetype = "dotted", color = "blue") +
  scale_y_continuous(breaks = seq(-0.2, 0.2, by = 0.02)) +
  theme_bw(base_size = 18) +
  xlab("") +
  ylab("Treatment Effect") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
```
]

.pull-right[
<img src="why_bayes_files/figure-html/unnamed-chunk-3-1.png" width="90%" height="500px" />
]

---

### Bayesian approach
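The original slide showed the posterior for the treatment effect. As a stand-in, here is a minimal sketch of the idea, assuming a normal likelihood and an illustrative normal prior. The prior values and the conjugate shortcut are my assumptions, not the talk's actual model (the setup chunk loads `rstan`, so the real analysis presumably fit a Stan model):

```r
# Sketch: normal-normal conjugate update for the treatment effect.
est <- lm(y ~ x + t, data = fake_data) %>%
  tidy() %>%
  filter(term == "t")

prior_mean <- 0    # assumption: skeptical prior centered at "no effect"
prior_sd   <- 0.1  # assumption: most plausible effects are small

# Precision-weighted combination of prior and likelihood
post_var  <- 1 / (1 / prior_sd^2 + 1 / est$std.error^2)
post_mean <- post_var * (prior_mean / prior_sd^2 +
                         est$estimate / est$std.error^2)

# The three probabilities a decision-maker actually needs:
pnorm(0.1, post_mean, sqrt(post_var), lower.tail = FALSE)  # P(meaningful improvement)
pnorm(0.1, post_mean, sqrt(post_var)) -
  pnorm(-0.1, post_mean, sqrt(post_var))                   # P(no meaningful impact)
pnorm(-0.1, post_mean, sqrt(post_var))                     # P(meaningful harm)
```

Unlike the p-value, these three numbers map one-to-one onto the keep/kill decision.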
---
class: center, middle, inverse
background-image: url("./bruno.webp")

# We should talk about ~~Bruno~~ priors

---

## Priors

.large[
* Priors can be very useful.
* Flat priors are, generally, a really bad idea.
* Be transparent about your priors.
* Run prior sensitivity analysis.
]

.footnote[
\* [Gelman's Prior Choice Recommendations](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations)
]
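---

## Prior sensitivity, sketched

One way to act on "run prior sensitivity analysis": rerun the update under several priors and watch the decision-relevant probability. This reuses the hypothetical conjugate sketch from the earlier slide (`est` and the 0.1 bar); the grid of priors is illustrative, not from the talk.

```r
# Sketch: how sensitive is P(effect >= 0.1) to the choice of prior?
sensitivity <- tibble(
  prior_mean = c(0, 0, 0.1),
  prior_sd   = c(0.05, 0.1, 0.5)  # skeptical, moderate, and diffuse
) %>%
  mutate(
    post_var  = 1 / (1 / prior_sd^2 + 1 / est$std.error^2),
    post_mean = post_var * (prior_mean / prior_sd^2 +
                            est$estimate / est$std.error^2),
    p_keep    = pnorm(0.1, post_mean, sqrt(post_var), lower.tail = FALSE)
  )

sensitivity  # if p_keep swings across rows, the data are not settling the question
```

If the conclusion survives every reasonable prior, report that; if it does not, the honest summary is that the pilot was not informative enough.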
---
class: center, middle, inverse

# Conclusions

---

## Take-home messages

.large[
* Data does not speak for itself. Our job is to speak on its behalf.
* Know your audience, and use that knowledge when summarizing your findings.
* If we don't answer a business question directly, decision-makers can be misguided by our analysis.
* Point estimates, confidence intervals, and p-values are not always the best way of summarizing findings.
* Do not be afraid of using informative priors.
* It is rare that we can make data-driven decisions without some degree of uncertainty. We have to think in bets.
]

---
class: center, middle, inverse

# Questions?

## [martinezig@google.com](mailto:MartinezIg@google.com)

### Slides: <a href="https://whybayes.martinez.fyi" target="_blank">https://whybayes.martinez.fyi</a>

---

# References

* [Chandler, J. J., Martinez, I., Finucane, M. M., Terziev, J. G., & Resch, A. M. (2020). Speaking on data's behalf: What researchers say and how audiences choose. Evaluation Review, 44(4), 325-353.](https://drive.google.com/file/d/1KAOGG_KgiSpwylO8VvJT8lTUIDslPuzk/view?usp=share_link)
* [Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641-651.](https://journals.sagepub.com/doi/full/10.1177/1745691614551642)
* [Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual. The Sage handbook of quantitative methodology for the social sciences, 391-408.](http://www.onemol.org.uk/Gigerenzer-2004.pdf)
* [Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157-1164.](https://ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf)
* [Imbens, G. W. (2021). Statistical significance, p-values, and the reporting of uncertainty. Journal of Economic Perspectives, 35(3), 157-174.](https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.35.3.157)
* [Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129-133.](https://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108#.Vt2XIOaE2MN)

---
class: inverse, center, middle

# Appendix

---
class: inverse, center, middle

## Asking One Question and Answering Another:
### When decisions and statistical analysis are not aligned.

#### [Ignacio Martinez<sup>1</sup>](https://ignacio.martinez.fyi),
#### Jesse Chandler<sup>2</sup>,
#### Daniel Thal<sup>2</sup>

[1] [The Chief Economist's Team at Google](https://ignacio.martinez.fyi/). [2] [Mathematica](https://mathematica.org/).

---
name: the_experiment

# The Experiment

.panelset[
.panel[.panel-name[The Basics]

* **Our Question:** What is the effect of presenting study results in different ways on choices and confidence?
* **Our Data:** We asked U.S. adults to make a decision about whether or not to adopt a fictional education technology based on results from a study using synthetic data.
* **Participants saw findings depicted in one of four ways:**
  + a traditional frequentist test for an effect different from zero,
  + a Bayesian analysis that calculates the probability that the effect is greater than zero,
  + a frequentist analysis that estimates whether an effect is of a minimum meaningful size, and
  + a Bayesian analysis that estimates whether an effect is of a minimum meaningful size.
]

.panel[.panel-name[Participants]

* Convenience sample recruited from Amazon Mechanical Turk.
* Participants were eligible for inclusion if:
  + they were U.S. residents
  + who had successfully completed at least 95% of the tasks they had previously submitted through the platform.
* Excluded 84 participants who failed 3 or more comprehension checks.
* Median age 35.29 years (SD = 10.30).
* 57% had at least a 4-year college degree.
* n = 612 responses.
]
]

---

# What Participants Saw

.panelset[
.panel[.panel-name[Common Text]

Imagine that you are the curriculum coordinator for a school district with about 20 schools in it. .whitesmoke[As a part of your job, you have to decide which programs should or should not be adopted.]

.whitesmoke[One option that your district is considering is a somewhat expensive [math/english] after-school program.] The school board thinks that it would be worth the cost if the program increases student achievement on a standardized [math/english] test, on average, by at least 0.2 standard deviations .whitesmoke[(standard deviation is a measure of variation in test scores: roughly two-thirds of students will score within +/- 1 standard deviation of the average test score).]

A neighboring district recently published a report, in which they describe a test of the new program. The report described an experiment in which 100 students were selected. .whitesmoke[Half received the [math/english] program for one school year (the treatment group) and the other half did not (the control group). The district then compared end-of-year standardized test scores between the two groups.]

.whitesmoke[The report states that on average, the students that used the new [math/english] program scored higher on the year-end tests than the students in the control group. However, the report also states that because only 100 students were involved in the experiment, at least some of the estimated difference in test scores between the treatment and control groups could be due to random chance.]
]

.panel[.panel-name[Bayesian MME = 0.2, text]

.whitesmoke[To help readers make sense of the reported findings, the researchers reported the probability that the true program effect is large enough to be worth the cost. This probability is calculated by drawing on two sources of information: (1) the difference in mean outcomes observed in the researchers' study and (2) a literature review the researchers conducted regarding how common it is for educational interventions to have effects of varying sizes. Based on a cost-benefit analysis like the one performed by your district, the researchers concluded that an effect of 0.2 standard deviations is large enough to warrant adopting the program. The researchers report that prior evidence from their literature review found that: 42% of educational interventions will increase achievement by at least 0.20 standard deviations, 16% have no meaningful effect, and 42% reduce achievement by at least 0.20 standard deviations.]
Combining the data from their experiment with the prior evidence, the researchers conclude that there is a: 28% probability that the after-school program increased achievement by at least 0.20 standard deviations, 72% probability that it had no meaningful effect, and 0% probability that it decreased achievement by at least 0.20 standard deviations.

The following visualization shows you the prior evidence and the probability that the program had a meaningful effect based on the prior evidence and the data from the researchers' new study. .whitesmoke[By clicking on "posterior", you can switch between probabilities based only on the prior evidence and probabilities based on both the prior evidence and study data. By clicking on the distribution icon, you can switch between the discretized and continuous versions of the possible outcomes.]
]

.panel[.panel-name[Bayesian MME = 0.2, viz]
<iframe src="https://ignacio.martinez.fyi/ROPE/bayesian_data1_mme2.html" width="70%" height="400px" data-external="1" style="border: none; display:block; margin:auto;"></iframe>
]

.panel[.panel-name[Bayesian MME = 0, text]

.whitesmoke[Data from the experiment were analyzed using Bayesian statistics, a method that allows researchers to reallocate credibility across possibilities. Before looking at the data, the researchers define two possible answers to their question: the after-school program will either increase achievement or decrease it. The researchers state that prior to conducting the experiment, their beliefs were as follows: 50% probability that the after-school program will increase achievement, and 50% probability that it will reduce achievement.]

The data from the experiment suggest that they should update their beliefs to: 98% probability that the after-school program will increase achievement, and 2% probability that it will reduce achievement.

The following visualization shows you the researchers' prior beliefs and how these beliefs should be updated based on the data. .whitesmoke[By clicking on "posterior", you can switch between their beliefs before and after looking at the data. By clicking on the distribution icon, you can switch between the discretized and continuous versions of the possible outcomes.]
]

.panel[.panel-name[Bayesian MME = 0, viz]
<iframe src="https://ignacio.martinez.fyi/ROPE/bayesian_data1_mme0.html" width="70%" height="400px" data-external="1" style="border: none; display:block; margin:auto;"></iframe>
]

.panel[.panel-name[Frequentist CI, text]

.whitesmoke[To help readers make sense of the reported findings, the researchers reported their findings in two different ways. First, they reported the probability of obtaining an effect at least as large as the one observed in the sample data, assuming that there is in fact no true difference between groups. This probability is called a "p-value." Second, the researchers reported an interval around the reported effect size called a confidence interval. If a large number of studies were conducted, the confidence intervals from those studies would contain the true effect of the program a specified percentage of the time (for example, 95%).]

Regression analysis of the data from the experiment showed that the point estimate for the effect of the program is 0.159. This is statistically significant (p-value = 0.042), with a 95% confidence interval from 0.006 to 0.312.
The following visualization shows the point estimate and its associated confidence interval:
]

.panel[.panel-name[Frequentist CI, viz]
<img src="https://ignacio.martinez.fyi/ROPE/round2/round_2_frequentist_point_data1.png" width="70%" height="350px" style="display: block; margin: auto;" />
]

.panel[.panel-name[Frequentist TOST, text]

.whitesmoke[To help readers make sense of the reported findings, the researchers reported their findings in three different ways. First, they reported the probability of obtaining an effect at least as large as the one observed in the sample data, assuming that there is in fact no true difference between groups. This probability is called a "p-value." Second, they reported the probability (p-value) of obtaining an effect at least as small (close to zero) as the one observed in the sample data, assuming that there is in fact a meaningful difference between groups. In this case, the researchers defined a meaningful difference as 0.2 standard deviations. Third, the researchers reported an interval around the reported effect size called a confidence interval. If a large number of studies were conducted, the confidence intervals from those studies would contain the true effect of the program a specified percentage of the time (for example, 95%).]

Regression analysis of the data from the experiment showed that the point estimate for the effect of the program is 0.159. This is statistically different from zero (p-value = 0.042), with a 95% confidence interval from 0.006 to 0.312. It is not statistically smaller than 0.2 (p-value = 0.299).

The following visualization shows the point estimate and its associated confidence interval:
]

.panel[.panel-name[TOST, viz]
<img src="https://ignacio.martinez.fyi/ROPE/round2/round_2_frequentist_tost_data1.png" width="70%" height="400px" style="display: block; margin: auto;" />
]
]
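---

### Checking the reported numbers (sketch)

The stimulus numbers above are internally consistent, which is easy to verify with a normal approximation. This back-of-the-envelope reconstruction is mine, not the study's code:

```r
# Sketch: recover the SE from the reported two-sided p-value, then check
# the 95% CI and the one-sided TOST-style p-value against the 0.2 MME.
estimate <- 0.159
p_two_sided <- 0.042

se <- estimate / qnorm(1 - p_two_sided / 2)  # implied standard error, ~0.078

estimate + c(-1, 1) * qnorm(0.975) * se      # ~(0.006, 0.312): the reported CI

pnorm((estimate - 0.2) / se)                 # ~0.30: matches the reported p = 0.299
```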
---

# Impacts on Choices

Proportion of participants who recommended switching programs:

|             | Test against null | Test against MME  |
|-------------|-------------------|-------------------|
| Frequentist | 61%<sup>a</sup>   | 57%<sup>a,c</sup> |
| Bayesian    | 84%<sup>b</sup>   | 42%<sup>c</sup>   |

*Note:* Proportions with different subscripts have at least a 95% probability of differing by at least 10 percentage points.
???
* People were more likely to recommend switching to the new program when presented with a Bayesian analysis with MME equal to zero (84%) than with only a traditional frequentist nil-hypothesis test (61%). This replicates findings from <a href="https://journals.sagepub.com/doi/abs/10.1177/0193841X19834968?journalCode=erxb" target="_blank">Chandler et al. 2019</a>.
* People were less likely to recommend switching to the new program when presented with a Bayesian test that included the MME (42%) than when presented with a frequentist test that used TOST to compare the data to the MME (57%).
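---

### How the table's note is computed (sketch)

The note's "at least a 95% probability of differing by at least 10 percentage points" is a posterior statement about two proportions. A minimal sketch of one such comparison, with illustrative cell sizes (the slide reports only percentages; ~150 per cell is my assumption given n = 612 across four conditions) and flat Beta priors:

```r
# Sketch: P(p1 - p2 >= 0.10) via independent Beta-Binomial posteriors.
set.seed(9782)
draws <- 100000
n1 <- 150; x1 <- round(0.84 * n1)  # Bayesian, test against null: 84%
n2 <- 150; x2 <- round(0.61 * n2)  # Frequentist, test against null: 61%

p1 <- rbeta(draws, x1 + 1, n1 - x1 + 1)  # Beta(1, 1) prior + binomial likelihood
p2 <- rbeta(draws, x2 + 1, n2 - x2 + 1)

mean(p1 - p2 >= 0.10)  # posterior probability of a >= 10-point difference
```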
---

# Impacts on Confidence

Confidence in the recommendation to switch programs or maintain the current program:

|             | Test against null     | Test against MME        |
|-------------|-----------------------|-------------------------|
| Frequentist | 6.9 (2.0)<sup>a</sup> | 6.8 (2.2)<sup>b</sup>   |
| Bayesian    | 8.0 (1.7)<sup>b</sup> | 7.3 (1.9)<sup>a,b</sup> |

*Note:* Standard deviations are in parentheses. Means with different subscripts have at least a 95% probability of differing by at least half a scale point (or 5%).
???
* People making a decision based on a Bayesian analysis with MME equal to zero were more confident in their decision (M = 8.0, SD = 1.7) than those making a decision based on a traditional frequentist test (M = 6.9, SD = 2.0). There is a 98% probability that the difference between these conditions is at least 5 percentage points (0.5 scale points), but only an 11% probability that it exceeds a full point. This replicates the findings from <a href="https://journals.sagepub.com/doi/abs/10.1177/0193841X19834968?journalCode=erxb" target="_blank">Chandler et al. 2019</a>.
* When comparing TOST to the Bayesian framing with an MME, we estimate a 70% probability that the TOST framing lowers confidence by at least 5 percentage points.