class: center, middle, inverse, title-slide

# POL90: Statistics
## Matching
### Prof. Wasow, Politics, Pomona College
### 2022-04-21

---

<style type="text/css">
.regression10 table { font-size: 10px; }
.regression12 table { font-size: 12px; }
.regression14 table { font-size: 14px; }
</style>

# Announcements

.large[

- Assignments
    - PS09
    - Report 3 teams assigned
- Reading:
    - Skim: Elizabeth Stuart & Donald Rubin (2007), "Matching methods for causal inference: Designing observational studies"

]

---

# Schedule

<table>
<thead>
<tr> <th style="text-align:right;"> Week </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Day </th> <th style="text-align:left;"> Title </th> <th style="text-align:right;"> Chapter </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> Mar 30 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Interaction terms </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Apr 4 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Logistic regression </td> <td style="text-align:right;"> 20 </td> </tr>
<tr> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Apr 6 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Logistic regression </td> <td style="text-align:right;"> 20 </td> </tr>
<tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> Apr 11 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Missing Data </td> <td style="text-align:right;"> Handout </td> </tr>
<tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> Apr 13 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Panel Data </td> <td style="text-align:right;"> Handout </td> </tr>
<tr> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 14 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Apr 18 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Mon </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Matching </td> <td style="text-align:right;color: black !important;background-color: yellow !important;"> Handout </td> </tr>
<tr> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> Apr 20 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Matching </td> <td style="text-align:right;"> Handout </td> </tr>
<tr> <td style="text-align:right;"> 15 </td> <td style="text-align:left;"> Apr 25 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Causal inference: Panel data </td> <td style="text-align:right;"> Handout </td> </tr>
<tr> <td style="text-align:right;"> 15 </td> <td style="text-align:left;"> Apr 27 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Causal inference: Natural Experiments </td> <td style="text-align:right;"> Dunning </td> </tr>
<tr> <td style="text-align:right;"> 16 </td> <td style="text-align:left;"> May 2 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Causal inference: RDD </td> <td style="text-align:right;"> NA </td> </tr>
<tr> <td style="text-align:right;"> 16 </td> <td style="text-align:left;"> May 4 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> No class </td> <td style="text-align:right;"> NA </td> </tr>
</tbody>
</table>

---

## Assignment schedule

<table>
<thead>
<tr> <th style="text-align:right;"> Week </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Day </th> <th style="text-align:left;"> Assignment </th> <th style="text-align:right;"> Percent </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 13 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Apr 18 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Mon </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> PS09 </td> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 3 </td> </tr>
<tr> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> Apr 25 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> PS10 </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:right;"> 15 </td> <td style="text-align:left;"> May 2 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Report3 </td> <td style="text-align:right;"> 10 </td> </tr>
</tbody>
</table>

---
class: center, middle

# Causal Inference

---

## Potential outcomes: Data

<br><br><br>

<img src="images/potential_outcomes_table1.png" width="647" style="display: block; margin: auto;" />

---

## Potential outcomes: Treatment condition

<br><br><br>

<img src="images/potential_outcomes_table2.png" width="647" style="display: block; margin: auto;" />

---

## Potential outcomes: Outcome under treatment

<br><br><br>

<img src="images/potential_outcomes_table3.png" width="647" style="display: block; margin: auto;" />

---

## Potential outcomes: Outcome under control

<br><br><br>

<img src="images/potential_outcomes_table4.png" width="647" style="display: block; margin: auto;" />

---

## Potential outcomes: Covariates included?
<br><br><br>

<img src="images/potential_outcomes_table5.png" width="647" style="display: block; margin: auto;" />

---

## The fundamental problem of causal inference

<br><br><br><br>

<!-- - In particular, the causal effect for individual `\(i\)` is the comparison of individual `\(i\)`’s outcome if individual `\(i\)` receives the treatment (the potential outcome under treatment), `\(Y_i(1)\)`, and individual `\(i\)`’s outcome if individual `\(i\)` receives the control (the potential outcome under control), `\(Y_i(0)\)`. -->

<!-- - For simplicity, we use the term “individual” to refer to the units that receive the treatment of interest, but the formulation would stay the same if the units were schools or communities. -->

- The “fundamental problem of causal inference” (Holland, 1986) is that, for each individual, we can observe only one of these potential outcomes, because each unit (each individual at a particular point in time) will receive either treatment or control, not both.

--

- The estimation of causal effects can thus be thought of as a missing data problem (Rubin, 1976a), where we are interested in predicting the unobserved potential outcomes.

.footnote[Source: Elizabeth A. Stuart (2010). "Matching methods for causal inference: A review and a look forward," https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2943670/]

---

## For any unit, cannot observe both potential outcomes

<br><br><br>

<img src="images/potential_outcomes_table.png" width="647" style="display: block; margin: auto;" />

---
class: center, middle

# Causal Inference
# with Experimental Data

---

## Revisiting the Creativity Study

.vertical-center[
<img src="images/ss_display_1_6.png" width="100%" style="display: block; margin: auto;" />
]

.footnote[Source: *Statistical Sleuth*, Display 1.6]

---

## What does random assignment get us?
.vertical-center[
<img src="images/ss_display_1_6_modified_average.png" width="100%" style="display: block; margin: auto;" />
]

.footnote[Source: *Statistical Sleuth*, Display 1.6]

---

## Why is random assignment important?

<img src="images/randomization_selection_assignment.jpg" width="90%" style="display: block; margin: auto;" />

.pull-right[
.footnote[Source: *Statistical Sleuth*, Display 1.5]
]

---

## Why is random assignment important?

<img src="images/randomization_selection_assignment_modified.jpg" width="90%" style="display: block; margin: auto;" />

.pull-right[
.footnote[Source: *Statistical Sleuth*, Display 1.5]
]

---
class: center, middle

# Causal Inference
# with Observational Data

---

## Approximate experiments with observational data?

<br><br>

.large[

- Want to compare apples to apples, not apples to oranges
- We want to approximate a twin study
- How can we identify the effect of our "treatment" and plausibly rule out confounding?
    - Often called omitted variable bias
- This week and next week we'll cover a variety of approaches
    - All have assumptions that can be hard to meet

<!-- - Aspire to causal claim: -->
<!-- - change in `\(T_i\)` (treatment) causes `\(Y_i\)` (outcome) to change, even when holding `\(\stackrel{\rightarrow}{X}_i\)` (our control variables) constant -->

]

---
class: center, middle

# Birdkeeping study

---

## Birdkeeping and Lung Cancer -- A Retrospective Observational Study

*Statistical Sleuth*, Chapter 20, Case 02

- A 1972-1981 health survey in The Hague, Netherlands, discovered an association between keeping pet birds and increased risk of lung cancer.
- To investigate bird-keeping as a risk factor, researchers conducted a case-control study of patients in 1985 at four hospitals in The Hague (population 450,000).
- They identified 49 cases of lung cancer among patients who were registered with a general practice, who were age 65 or younger, and who had resided in the city since 1965.
- They also selected 98 controls from a population of residents having the same general age structure.

(Data based on P.A. Holst, D. Kromhout, and R. Brand, "For Debate: Pet Birds as an Independent Risk Factor for Lung Cancer," *British Medical Journal* 297 (1988): 13-21.)

---

## Birdkeeping Data

The data have been gathered on the following variables:

- LC = Lung cancer status
- FM = Sex (1 = F, 0 = M)
- AG = Age, in years
- SS = Socioeconomic status (1 = High, 0 = Low), determined by occupation of the household's principal wage earner
- YR = Years of smoking prior to diagnosis or examination
- CD = Average rate of smoking, in cigarettes per day
- BK = Indicator of birdkeeping (caged birds in the home for more than 6 consecutive months from 5 to 14 years before diagnosis (cases) or examination (controls))

---

## Birdkeeping + lung cancer data (modified)

```r
birds <- Sleuth3::case2002 %>%
    janitor::clean_names()

head(birds, 10)
```

```
           lc   fm   ss     bk ag yr cd
1  LungCancer Male  Low   Bird 37 19 12
2  LungCancer Male  Low   Bird 41 22 15
3  LungCancer Male High NoBird 43 19 15
4  LungCancer Male  Low   Bird 46 24 15
5  LungCancer Male  Low   Bird 49 31 20
6  LungCancer Male High NoBird 51 24 15
7  LungCancer Male High   Bird 52 31 20
8  LungCancer Male  Low NoBird 53 33 20
9  LungCancer Male  Low   Bird 56 33 10
10 LungCancer Male High NoBird 56 26 25
```

---

## Research question: birdkeeping `\(\rightarrow\)` lung cancer?

<img src="week14_01_files/figure-html/unnamed-chunk-20-1.png" width="720" style="display: block; margin: auto;" />

---

## What if age `\(\rightarrow\)` both birdkeeping + lung cancer?

<img src="week14_01_files/figure-html/unnamed-chunk-21-1.png" width="720" style="display: block; margin: auto;" />

---

## What if age is related to birdkeeping?

```r
birds %>%
    ggplot() +
    aes(x = ag, color = bk) +
    geom_density() +
    theme_bw()
```

<img src="week14_01_files/figure-html/unnamed-chunk-23-1.png" width="70%" style="display: block; margin: auto;" />

---

## What if age is related to birdkeeping?
```r
birds %>%
    ggplot() +
    aes(x = ag) +
    geom_density() +
    facet_grid(~bk) +
    theme_bw()
```

<img src="week14_01_files/figure-html/unnamed-chunk-24-1.png" width="70%" style="display: block; margin: auto;" />

---

## What would an experiment look like?

<br>

.large[

- We would randomly assign birdkeeping to subjects irrespective of age
- On average, the age of both birdkeepers and non-birdkeepers would be the same
- Knowing the age of a subject would not give us any additional information about whether they are in the "treated" or "control" group
- Matching attempts to approximate this balance in covariates across conditions with observational data

]

---

## In this toy example a small difference in age

```r
mean(birds$ag[birds$bk == "Bird"])
```

```
[1] 53.26
```

```r
mean(birds$ag[birds$bk == "NoBird"])
```

```
[1] 59.91
```

---

## What if we match?

```r
birds <- birds %>%
    mutate(
        lc_bin = ifelse(lc == "LungCancer", 1, 0),
        bk_bin = ifelse(bk == "Bird", 1, 0)
    )

# find "nearest neighbors"
match_out <- MatchIt::matchit(
    formula = bk_bin ~ ag,
    data = birds,
    method = "nearest",
    caliper = 0.1
)
```

---

## What is a caliper?

.vertical-center[
<img src="images/1200px-Vernier_caliper.svg.png" width="100%" style="display: block; margin: auto;" />
]

---

## Let's look at match output: It's a lot!

```r
summary(match_out)
```

```

Call:
MatchIt::matchit(formula = bk_bin ~ ag, data = birds, method = "nearest", 
    caliper = 0.1)

Summary of Balance for All Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance         0.529         0.347           0.836      1.716      0.22
ag              53.264        59.907          -0.819      1.695      0.22
         eCDF Max
distance    0.399
ag          0.399

Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance         0.412         0.412           0.002      1.007     0.004
ag              57.429        57.449          -0.003      1.010     0.004
         eCDF Max Std. Pair Dist.
distance    0.041           0.008
ag          0.041           0.013

Percent Balance Improvement:
         Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance            99.8       98.6      98.4     89.8
ag                  99.7       98.1      98.4     89.8

Sample Sizes:
          Control Treated
All           118      87
Matched        49      49
Unmatched      69      38
Discarded       0       0
```

---

## Let's look at *N* in match output

```r
summary(match_out)$nn[c(2,4,5), ]
```

```
          Control Treated
All           118      87
Matched        49      49
Unmatched      69      38
```

---

## Compare original data to matched data

```r
summary(match_out)$sum.all[ , 1:2]
```

```
         Means Treated Means Control
distance        0.5291        0.3472
ag             53.2644       59.9068
```

```r
summary(match_out)$sum.matched[, 1:2]
```

```
         Means Treated Means Control
distance         0.412        0.4116
ag              57.429       57.4490
```

---

## Let's look at match output for everything

```r
summary(match_out)
```

```

Call:
MatchIt::matchit(formula = bk_bin ~ ag, data = birds, method = "nearest", 
    caliper = 0.1)

Summary of Balance for All Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance         0.529         0.347           0.836      1.716      0.22
ag              53.264        59.907          -0.819      1.695      0.22
         eCDF Max
distance    0.399
ag          0.399

Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance         0.412         0.412           0.002      1.007     0.004
ag              57.429        57.449          -0.003      1.010     0.004
         eCDF Max Std. Pair Dist.
distance    0.041           0.008
ag          0.041           0.013

Percent Balance Improvement:
         Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance            99.8       98.6      98.4     89.8
ag                  99.7       98.1      98.4     89.8

Sample Sizes:
          Control Treated
All           118      87
Matched        49      49
Unmatched      69      38
Discarded       0       0
```

---

## Is age balanced in matched data?

```r
birds_matched <- MatchIt::match.data(match_out)

birds_matched %>%
    ggplot() +
    aes(x = ag) +
    geom_density() +
    facet_grid(~bk) +
    theme_bw()
```

<img src="week14_01_files/figure-html/unnamed-chunk-32-1.png" width="70%" style="display: block; margin: auto;" />

---

## What about imbalance in other variables?

```r
birds %>%
    arsenal::tableby(bk ~ fm + ss, test = FALSE, data = .) %>%
    summary()
```

|                         | Bird (N=87) | NoBird (N=118) | Total (N=205) |
|:------------------------|:-----------:|:--------------:|:-------------:|
|**fm**                   |             |                |               |
| Female                  | 33 (37.9%)  |   21 (17.8%)   |  54 (26.3%)   |
| Male                    | 54 (62.1%)  |   97 (82.2%)   |  151 (73.7%)  |
|**ss**                   |             |                |               |
| High                    | 16 (18.4%)  |   45 (38.1%)   |  61 (29.8%)   |
| Low                     | 71 (81.6%)  |   73 (61.9%)   |  144 (70.2%)  |

---

## What if we match across more variables?

```r
# find "nearest neighbors"
match_out <- MatchIt::matchit(
    formula = bk_bin ~ ag + yr + fm + ss,
    data = birds,
    method = "nearest",
    caliper = 0.02
)
```

---

## Let's look at match output with `bal.tab`

```r
# balance table
cobalt::bal.tab(match_out)
```

```
Call
 MatchIt::matchit(formula = bk_bin ~ ag + yr + fm + ss, data = birds, 
    method = "nearest", caliper = 0.02)

Balance Measures
             Type Diff.Adj
distance Distance    0.003
ag        Contin.    0.123
yr        Contin.    0.311
fm_Male    Binary    0.029
ss_Low     Binary    0.086

Sample sizes
          Control Treated
All           118      87
Matched        35      35
Unmatched      83      52
```

---

## Now compare multiple variables with `love.plot`

```r
cobalt::love.plot(match_out)
```

<img src="week14_01_files/figure-html/unnamed-chunk-37-1.png" width="60%" style="display: block; margin: auto;" />

---

## `love.plot` with absolute differences

```r
cobalt::love.plot(match_out, abs = TRUE)
```

<img src="week14_01_files/figure-html/unnamed-chunk-38-1.png" width="60%" style="display: block; margin: auto;" />

---

## Can model with matched data as we would normally

```r
class(match_out)
```

```
[1] "matchit"
```

```r
# extract matched data from matchit object
birds_matched <- MatchIt::match.data(match_out)
class(birds_matched)
```

```
[1] "matchdata"  "data.frame"
```

```r
glm_out <- glm(
    formula = lc_bin ~ bk + ag + yr + fm + ss,
    family = binomial,
    data = birds_matched)
```

---

## Can model with matched data as we would normally

```r
summary(glm_out)
```

```

Call:
glm(formula = lc_bin ~ bk + ag + yr + fm + ss, family = binomial, 
    data = birds_matched)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.507  -0.709  -0.459   0.956   2.179  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -0.909771   2.721597   -0.33   0.7382   
bkNoBird    -1.904163   0.605824   -3.14   0.0017 **
ag          -0.000649   0.049702   -0.01   0.9896   
yr           0.048756   0.030243    1.61   0.1069   
fmMale      -0.600487   0.990174   -0.61   0.5442   
ssLow        0.056698   0.661300    0.09   0.9317   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 90.008  on 69  degrees of freedom
Residual deviance: 73.170  on 64  degrees of freedom
AIC: 85.17

Number of Fisher Scoring iterations: 4
```

---

## Can plot model as we might normally

```r
sjPlot::plot_model(
    model = glm_out,
    type = "eff",
    terms = c("bk"))
```

<img src="week14_01_files/figure-html/unnamed-chunk-41-1.png" width="70%" style="display: block; margin: auto;" />

---

## Can plot model as we might normally

```r
sjPlot::plot_model(
    model = glm_out,
    type = "eff",
    terms = c("bk", "ss"))
```

<img src="week14_01_files/figure-html/unnamed-chunk-42-1.png" width="70%" style="display: block; margin: auto;" />

---
class: center, middle

# What's Going On in Matching?
# Exact Matching

---

## 

<img src="images/cem020BestCase.png" width="100%" style="display: block; margin: auto;" />

---

## 

<img src="images/cem021BestCase.png" width="100%" style="display: block; margin: auto;" />

---

## 

<img src="images/cem022BestCase.png" width="100%" style="display: block; margin: auto;" />

---
class: center, middle

# What's Going On in Matching?
# Coarsened Exact Matching

---

## 

<img src="images/cem000.png" width="100%" style="display: block; margin: auto;" />

.footnote[Source: Gary King, https://gking.harvard.edu/presentations]

---

## 

<img src="images/cem001.png" width="100%" style="display: block; margin: auto;" />

---

## 

<img src="images/cem002.png" width="100%" style="display: block; margin: auto;" />

---

## 

<img src="images/cem003.png" width="100%" style="display: block; margin: auto;" />

---

## 

<img src="images/cem004.png" width="100%" style="display: block; margin: auto;" />

---

## 

<img src="images/cem005.png" width="100%" style="display: block; margin: auto;" />

---

## 

<img src="images/cem006.png" width="100%" style="display: block; margin: auto;" />

---
class: center, middle

# Questions?
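
---

## Sketch: nearest-neighbor matching by hand

A minimal sketch of the idea behind `method = "nearest"` with a caliper, assuming we match directly on age in years. This is a simplification, not what `matchit()` does internally: the earlier calls matched on an estimated distance (the `distance` row in the summaries) rather than on raw age. The `toy` data and the `greedy_match()` helper are hypothetical, invented for illustration.

```r
# Toy data invented for illustration (not the birdkeeping data)
toy <- data.frame(
    id      = 1:8,
    treated = c(1, 1, 1, 0, 0, 0, 0, 0),
    age     = c(40, 52, 70, 41, 50, 55, 58, 63)
)

# Greedy 1:1 nearest-neighbor matching on age, without replacement.
# `caliper` is the largest age gap (in years) accepted for a match.
greedy_match <- function(data, caliper) {
    treated_rows <- which(data$treated == 1)
    available    <- which(data$treated == 0)
    pairs <- data.frame(treated_id = integer(0), control_id = integer(0))

    for (i in treated_rows) {
        if (length(available) == 0) break
        gaps <- abs(data$age[available] - data$age[i])
        best <- which.min(gaps)           # closest remaining control
        if (gaps[best] <= caliper) {
            pairs <- rbind(pairs, data.frame(
                treated_id = data$id[i],
                control_id = data$id[available[best]]
            ))
            available <- available[-best] # each control used at most once
        }
    }
    pairs
}

matches <- greedy_match(toy, caliper = 3)
print(matches)
```

With a 3-year caliper, the 70-year-old treated unit has no control close enough and is left unmatched, which illustrates why a matched sample can be smaller than the original data.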