class: center, middle, inverse, title-slide # POL90: Statistics ## Logistic regression & Data visualization ### Prof. Wasow, PoliticsPomona College ### 2022-04-05 --- <style type="text/css"> .regression10 table { font-size: 10px; } .regression12 table { font-size: 12px; } .regression14 table { font-size: 14px; } </style> # Announcements .large[ - Report 2 - Doesn't have to be feeling thermometer - Doesn't have to be 2020 - Should be "plausibly analogous to an experiment" - Do not use an "immutable characteristic" - *Statistical Sleuth* - This week: Skim Chapter 20 - http://appliedstats.org/chapter20.html ] --- # Schedule <table> <thead> <tr> <th style="text-align:right;"> Week </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Day </th> <th style="text-align:left;"> Title </th> <th style="text-align:right;"> Chapter </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> Mar 21 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Null hypothesis, R-squared </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> Mar 23 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Multiple regression </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> Mar 28 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Interaction terms </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> Mar 30 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Interaction terms </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 12 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Apr 4 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Mon </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Logistic regression </td> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 20 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Apr 6 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Logistic regression </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> Apr 11 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Missing data </td> <td style="text-align:right;"> Handout </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> Apr 13 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Missing data </td> <td style="text-align:right;"> Handout </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> Apr 18 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Matching </td> <td style="text-align:right;"> Handout </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> Apr 20 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Matching </td> <td style="text-align:right;"> Handout </td> </tr> </tbody> </table> --- ## Assignment schedule <table> <thead> <tr> <th style="text-align:right;"> Week </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Day </th> <th style="text-align:left;"> Assignment </th> <th style="text-align:right;"> Percent </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> Mar 25 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS07 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> Apr 1 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS08 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 12 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Apr 8 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Fri </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Report2 </td> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 8 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> Apr 15 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS09 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> Apr 22 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS10 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 15 </td> <td style="text-align:left;"> Apr 29 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> Report3 </td> <td style="text-align:right;"> 10 </td> </tr> </tbody> </table> --- class: center, middle # Interactions with Albuquerque # Real Estate Data --- ## Example using Albuquerque real estate data ```r # read in Albuquerque data alb_real_estate <- read.table(here("data", "alb.dat.txt"), sep = "\t", header = TRUE) head(alb_real_estate, 3) ``` ``` price sqft age feats ne cust cor tax 1 2050 2650 13 7 1 1 0 1639 2 2080 2600 NA 4 1 1 0 1088 3 2150 2664 6 5 1 1 0 1193 ``` ```r # Transform ne dummy and select three columns alb <- alb_real_estate %>% * mutate(ne_fct = ifelse(ne == 1, "yes", "no") %>% as.factor()) %>% rename(ne_bin = ne) %>% select(price, sqft, ne_bin, ne_fct) head(alb, 3) ``` ``` price sqft ne_bin ne_fct 1 2050 2650 1 yes 2 2080 2600 1 yes 3 2150 2664 1 yes ``` --- ## Visualize Albuquerque real estate data .left-code[ ```r ggplot(data = alb) + aes(x = sqft, y = price, color = ne_fct) + geom_point() ``` ] .right-plot[ <img src="week11_01_files/figure-html/alb_plot-1.png" width="432" style="display: block; margin: auto;" /> ] --- ## We could subset and run two separate models .left-code[ ```r ggplot(data = alb) + aes(x = sqft, y = price, color = ne_fct) + geom_point() + geom_smooth(method = "lm") ``` ] .right-plot[ <img src="week11_01_files/figure-html/alb_plot2-1.png" width="432" style="display: block; margin: auto;" /> ] --- ## Subset for `ne_fct == 'yes'` and model ```r alb_ne_yes <- alb %>% filter(ne_fct == 'yes') lm(price ~ sqft, data = alb_ne_yes) %>% summary() ``` ``` Call: lm(formula = price ~ sqft, data = alb_ne_yes) Residuals: Min 1Q Median 3Q Max -433.2 -89.4 -9.7 65.3 466.6 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 37.1691 46.8781 0.79 0.43 sqft 0.6180 0.0221 27.96 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 159 on 106 degrees of freedom Multiple R-squared: 0.881, Adjusted R-squared: 0.879 F-statistic: 782 on 1 and 106 DF, p-value: <2e-16 ``` --- ## Subset for `ne_fct == 'no'` and model ```r alb_ne_no <- alb %>% filter(ne_fct == 'no') lm(price ~ sqft, data = alb_ne_no) %>% summary() ``` ``` Call: lm(formula = price ~ sqft, data = alb_ne_no) Residuals: Min 1Q Median 3Q Max -290.9 -106.1 -1.0 66.3 923.6 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 293.104 63.011 4.65 1.6e-05 *** sqft 0.417 0.031 13.48 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 183 on 66 degrees of freedom Multiple R-squared: 0.734, Adjusted R-squared: 0.729 F-statistic: 182 on 1 and 66 DF, p-value: <2e-16 ``` --- ## Compare with interaction model ```r lm(price ~ sqft + ne_fct + sqft * ne_fct, data = alb) %>% summary() ``` ``` Call: lm(formula = price ~ sqft + ne_fct + sqft * ne_fct, data = alb) Residuals: Min 1Q Median 3Q Max -433.2 -94.3 -5.1 65.6 923.6 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 293.1036 58.0879 5.05 1.1e-06 *** sqft 0.4175 0.0286 14.62 < 2e-16 *** ne_fctyes -255.9345 76.4422 -3.35 0.001 *** sqft:ne_fctyes 0.2005 0.0369 5.43 1.9e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 168 on 172 degrees of freedom Multiple R-squared: 0.848, Adjusted R-squared: 0.846 F-statistic: 320 on 3 and 172 DF, p-value: <2e-16 ``` --- class: center, middle # Reporting Logistic Results --- ## Example: Survival in the Donner Party From *Statistical Sleuth*, Chapter 20 (p 602, 3e): >In 1846 the Donner and Reed families left Springfield, Illinois, for California by covered wagon. In July, the Donner Party, as it became known, reached Fort Bridger, Wyoming. There its leaders decided to attempt a new and untested route to the Sacramento Valley. Having reached its full size of 87 people and 20 wagons, the party was delayed by a difficult crossing of the Wasatch Range and again in the crossing of the desert west of the Great Salt Lake. The group became stranded in the eastern Sierra Nevada mountains when the region was hit by heavy snows in late October. By the time the last survivor was rescued on April 21, 1847, 40 of the 87 members had died from famine and exposure to extreme cold. .footnote[From *Statistical Sleuth*, Chapter 20 (p 602, 3e):] --- ## Load Donner data ```r # load data donner <- Sleuth3::case2001 %>% clean_names() # convert Survived/Died to ones/zeros donner <- donner %>% mutate( survived_bin = as.numeric(status == "Survived"), # male = reference category sex = sex %>% fct_relevel("Male") ) # view data head(donner) ``` ``` age sex status survived_bin 1 23 Male Died 0 2 40 Female Survived 1 3 40 Male Survived 1 4 30 Male Died 0 5 28 Male Died 0 6 40 Male Died 0 ``` --- ## Visualizing Donner data .left-code[ ```r ggplot(data = donner) + aes(x = age, y = survived_bin) + geom_point(size = 3, alpha = 0.5) ``` ] .right-plot[ <img src="week11_01_files/figure-html/donner_plot1-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Logistic regression in `R` ```r # run logistic regression with family = binomial glm_sex <- glm(survived_bin ~ sex, family = binomial, data = donner) glm_age <- glm(survived_bin ~ age, family = binomial, data = donner) glm_sa <- glm(survived_bin ~ sex + age, family = binomial, data = donner) glm_sa %>% broom::tidy() %>% kable(digits = 2) ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 1.63 </td> <td style="text-align:right;"> 1.11 </td> <td style="text-align:right;"> 1.47 </td> <td style="text-align:right;"> 0.14 </td> </tr> <tr> <td style="text-align:left;"> sexFemale </td> <td style="text-align:right;"> 1.60 </td> <td style="text-align:right;"> 0.76 </td> <td style="text-align:right;"> 2.11 </td> <td style="text-align:right;"> 0.03 </td> </tr> <tr> <td style="text-align:left;"> age </td> <td style="text-align:right;"> -0.08 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> -2.10 </td> <td style="text-align:right;"> 0.04 </td> </tr> </tbody> </table> --- ## Comparing Donner models with `stargazer` .regression12[ ```r # create regression table stargazer(glm_sex, glm_age, glm_sa, type = 'html', header = FALSE, single.row = TRUE, omit.stat = c("aic", "ll")) ``` <table style="text-align:center"><tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3"><em>Dependent variable:</em></td></tr> <tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr> <tr><td style="text-align:left"></td><td colspan="3">survived_bin</td></tr> <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td></tr> <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">sexFemale</td><td>1.386<sup>**</sup> (0.671)</td><td></td><td>1.597<sup>**</sup> (0.755)</td></tr> <tr><td style="text-align:left">age</td><td></td><td>-0.066<sup>**</sup> (0.032)</td><td>-0.078<sup>**</sup> (0.037)</td></tr> <tr><td style="text-align:left">Constant</td><td>-0.693<sup>*</sup> (0.387)</td><td>1.819<sup>*</sup> (0.999)</td><td>1.633 (1.110)</td></tr> <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>45</td><td>45</td><td>45</td></tr> <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr> </table> ] --- ## Explaining results .large[ From *Statistical Sleuth*, (p 608, 3e) The fit of a logistic regression model to the Donner Party data, where `\(\pi\)` represents survival probability, gives $$ logit(\hat{\pi}) = 1.63 - 0.078 \times age + 1.60 \times female $$ ] --- ## What are odds & log(odds)? <br><br><br> $$ \textrm{odds} = \frac{\textrm{\# favorable outcomes}}{\textrm{\# unfavorable outcomes}} = \frac{\textrm{p(success)}}{\textrm{p(failure) }} = \frac{p}{q} = \frac{p}{1-p} $$ $$ \textrm{log(odds)} = \textrm{log}_e \left( \frac{p}{q} \right) = \textrm{log}_e \left( \frac{p}{1-p} \right) $$ --- ```r p <- c(0.001, 0.01, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.999, 0.9999) odds <- p/(1-p) log_odds <- log(odds) ``` .pull-left[ ```r data.frame(p, odds, log_odds) ``` ``` p odds log_odds 1 0.0010 1.001e-03 -6.9068 2 0.0100 1.010e-02 -4.5951 3 0.1500 1.765e-01 -1.7346 4 0.2000 2.500e-01 -1.3863 5 0.2500 3.333e-01 -1.0986 6 0.3000 4.286e-01 -0.8473 7 0.3500 5.385e-01 -0.6190 8 0.4000 6.667e-01 -0.4055 9 0.4500 8.182e-01 -0.2007 *10 0.5000 1.000e+00 0.0000 11 0.5500 1.222e+00 0.2007 12 0.6000 1.500e+00 0.4055 13 0.6500 1.857e+00 0.6190 14 0.7000 2.333e+00 0.8473 15 0.7500 3.000e+00 1.0986 *16 0.8000 4.000e+00 1.3863 17 0.8500 5.667e+00 1.7346 18 0.9000 9.000e+00 2.1972 19 0.9990 9.990e+02 6.9068 20 0.9999 9.999e+03 9.2102 ``` ] -- .pull-left[ ```r # convert p to odds & log odds by hand # if p = 0.5 .5/.5 log(.5/.5) # if p = 0.8 .8/.2 log(.8/.2) # if p = 0.75? # if p = 0.20? # if p = 0.25? # if p = 0.90? ``` ] --- ## Exercises .large[ - Compare women 50 years old with women 20 years old. Report the odds ratio for survival. - Compare a woman (`female = 1`) with a man (`female = 0`) of the same age. ] $$ logit(\hat{\pi}) = 1.63 - 0.078 \times age + 1.60 \times female $$ --- ## Exercises .large[ - Log odds of survival for women 50 years old ] ```r w50 <- 1.63 - 0.078 * 50 + 1.60 * 1 w50 ``` ``` [1] -0.67 ``` .large[ - Log odds of survival for women 20 years old ] ```r w20 <- 1.63 - 0.078 * 20 + 1.60 * 1 w20 ``` ``` [1] 1.67 ``` ```r exp(w50 - w20) ``` ``` [1] 0.09633 ``` --- ## Exercises .large[ - Can also be calculated as a simultaneous equation ] ```r exp(-0.078 * (50 - 20)) ``` ``` [1] 0.09633 ``` .large[ - Summary: Odds of survival for 50 vs 20 years old is estimated to be `0.096`, or about 1/10. - So the odds of a 20-year-old woman surviving were about 10 times the odds of a 50-year-old woman surviving. ] --- ## How about male vs female at age 30 .large[ - The estimated odds ratio of moving from male to female is `exp(1.60(1 - 0)) = 4.95` ] ```r female_log_odds <- 1.63 - 0.078 * 30 + 1.60 * 1 female_log_odds ``` ``` [1] 0.89 ``` ```r male_log_odds <- 1.63 - 0.078 * 30 + 1.60 * 0 male_log_odds ``` ``` [1] -0.71 ``` ```r exp(female_log_odds - male_log_odds) ``` ``` [1] 4.953 ``` .large[ - A woman's odds of survival were about 5 times the odds of a man of the same age. ] --- ## How to report logistic results? .vertical-center[ - **Continuous `\(X_1\)`, binary `\(Y\)` **: with logistic regression: A one unit change in `\(X_1\)` is associated with/causes the odds that `\(Y = 1\)` to change by a multiplicative factor of exp `\((\beta_1)\)`, holding all other variables constant. - **Binary `\(X_1\)`, binary `\(Y\)` **: with logistic regression: moving from group 1 to group 2 is associated with/causes the odds that `\(Y = 1\)` to change by a multiplicative factor of exp `\((\beta_1)\)`, holding all other variables constant. ] --- ## Comparing models - For ANOVA with logistic regression, need to specify test = "Chisq" ```r # for ANOVA with logistic regression, need to specify test = "Chisq" *anova(glm_sex, glm_sa, test = "Chisq") ``` ``` Analysis of Deviance Table Model 1: survived_bin ~ sex Model 2: survived_bin ~ sex + age Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 43 57.3 2 42 51.3 1 6.03 0.014 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- ## Creating tables with `stargazer` and `exp()` .regression12[ ```r # create regression table stargazer(glm_sex, glm_age, glm_sa, type = 'html', header = FALSE, single.row = TRUE, omit.stat = c("aic", "ll"), * apply.coef = function(x) {exp(x)}) ``` <table style="text-align:center"><tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3"><em>Dependent variable:</em></td></tr> <tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr> <tr><td style="text-align:left"></td><td colspan="3">survived_bin</td></tr> <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td></tr> <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">sexFemale</td><td>4.000<sup>***</sup> (0.671)</td><td></td><td>4.940<sup>***</sup> (0.755)</td></tr> <tr><td style="text-align:left">age</td><td></td><td>0.936<sup>***</sup> (0.032)</td><td>0.925<sup>***</sup> (0.037)</td></tr> <tr><td style="text-align:left">Constant</td><td>0.500 (0.387)</td><td>6.163<sup>***</sup> (0.999)</td><td>5.120<sup>***</sup> (1.110)</td></tr> <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>45</td><td>45</td><td>45</td></tr> <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr> </table> ] --- ## Visualizing "effect" of sex with `sjPlot` - The `sjPlot` package offers a `plot_model` command which offers a number of options to easily plot predicted probabilities from logistic regression models. <img src="week11_01_files/figure-html/sjPlot-1.png" width="576" style="display: block; margin: auto;" /> --- ## Visualizing "effect" of sex and age with `sjPlot` <img src="week11_01_files/figure-html/sjPlot2-1.png" width="648" style="display: block; margin: auto;" /> --- ## Visualizing "effect" of age and sex as moderator <img src="week11_01_files/figure-html/sjPlot3b-1.png" width="720" style="display: block; margin: auto;" /> --- ## Statistical Conclusion >While it generally appears that older adults were less likely than younger adults to survive, the data provide highly suggestive evidence that the age effect is more pronounced for females than for males (the two-sided *p*-value for interaction of gender and age is 0.05). Ignoring the interaction leads to an informal conclusion about the overall difference between men and women: The data provide highly suggestive evidence that the survival probability was higher for women (in the Donner Party) than for the men (two-sided *p*-value = 0.02 from a drop-in-deviance test). The odds of survival for women are estimated to be 5 times the odds of survival for similarly aged men (95% confidence interval: 1.2 times to 25.2 times, from a drop-in-deviance-based confidence interval). --- ## Scope of Inference >Since the data are observational, the result cannot be used as proof that women were more apt to survive than men; the possibility of confounding variables cannot be excluded. Furthermore, since the 45 individuals were not drawn at random from any population, inference to a broader population is not justified. --- class: middle, center # Questions?