class: center, middle, inverse, title-slide

# POL90: Statistics
## Logistic regression & Data visualization
### Prof. Wasow, Politics, Pomona College
### 2022-04-06

---

<style type="text/css">
.regression10 table {
  font-size: 10px;
}
.regression12 table {
  font-size: 12px;
}
.regression14 table {
  font-size: 14px;
}
</style>

# Announcements

.large[
- Report 2
    - Often multiple versions of questions
    - Common to have a follow-up if someone answers "I don't know"
    - Be careful to avoid the second or third question in a sequence
- *Statistical Sleuth*
    - This week: Skim Chapter 20
    - http://appliedstats.org/chapter20.html
]

---

# Schedule

<table>
<thead>
<tr> <th style="text-align:right;"> Week </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Day </th> <th style="text-align:left;"> Title </th> <th style="text-align:right;"> Chapter </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> Mar 23 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Multiple regression </td> <td style="text-align:right;"> 8 </td> </tr>
<tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> Mar 28 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Interaction terms </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> Mar 30 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Interaction terms </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Apr 4 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Logistic regression </td> <td style="text-align:right;"> 20 </td> </tr>
<tr> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 12 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Apr 6 </td>
<td style="text-align:left;color: black !important;background-color: yellow !important;"> Wed </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Logistic regression </td> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 20 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> Apr 11 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Missing data </td> <td style="text-align:right;"> Handout </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> Apr 13 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Missing data </td> <td style="text-align:right;"> Handout </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> Apr 18 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Matching </td> <td style="text-align:right;"> Handout </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> Apr 20 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Matching </td> <td style="text-align:right;"> Handout </td> </tr> <tr> <td style="text-align:right;"> 15 </td> <td style="text-align:left;"> Apr 25 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Causal inference: Panel data </td> <td style="text-align:right;"> Handout </td> </tr> </tbody> </table> --- ## Assignment schedule <table> <thead> <tr> <th style="text-align:right;"> Week </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Day </th> <th style="text-align:left;"> Assignment </th> <th style="text-align:right;"> Percent </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> Mar 25 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS07 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td 
style="text-align:right;"> 11 </td> <td style="text-align:left;"> Apr 1 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS08 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 12 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Apr 8 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Fri </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Report2 </td> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 8 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> Apr 15 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS09 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> Apr 22 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS10 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 15 </td> <td style="text-align:left;"> Apr 29 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> Report3 </td> <td style="text-align:right;"> 10 </td> </tr> </tbody> </table> --- class: center, middle # Why Logistic Regression? 
--- ## Simple example: college sports with 64 teams - Imagine 64 universities with student populations from 1000 to 50000 ```r # 64 draws from a random uniform distribution size <- runif(64, 1000, 50000) size <- size[order(size)] size ``` ``` [1] 1959 4934 4952 5730 6835 7500 8820 9133 9211 9423 9621 9900 [13] 10606 10683 10719 12046 12327 14106 15429 17393 17415 17736 18703 18739 [25] 20172 22298 24411 25824 26252 26625 26859 27208 29183 30096 30665 34230 [37] 35325 35896 37240 37307 37694 37925 38218 39083 40091 40101 41374 41887 [49] 42015 42840 43251 43587 43709 44012 44054 44115 44162 44882 44902 45360 [61] 47666 48255 49405 49698 ``` --- ## Imagine probability of winning is a function of size ```r # Assume probability of winning is determined by school size # And, assume schools with below 25000 pop lose at higher rates prob <- case_when( * size < 25000 ~ size/100000, * TRUE ~ size/ 60000 ) %>% round(2) prob ``` ``` [1] 0.02 0.05 0.05 0.06 0.07 0.07 0.09 0.09 0.09 0.09 0.10 0.10 0.11 0.11 0.11 [16] 0.12 0.12 0.14 0.15 0.17 0.17 0.18 0.19 0.19 0.20 0.22 0.24 0.43 0.44 0.44 [31] 0.45 0.45 0.49 0.50 0.51 0.57 0.59 0.60 0.62 0.62 0.63 0.63 0.64 0.65 0.67 [46] 0.67 0.69 0.70 0.70 0.71 0.72 0.73 0.73 0.73 0.73 0.74 0.74 0.75 0.75 0.76 [61] 0.79 0.80 0.82 0.83 ``` ```r # Now each team plays one "game" and we toss a coin # with the pre-assigned probability to determine the winner win <- rbinom(64, 1, prob) win ``` ``` [1] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 [39] 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 ``` --- ```r # Create a data.frame of our data teams <- data.frame(size, prob, win) ``` .pull-left[ ```r # some small school teams win head(teams, 20) ``` ``` size prob win 1 1959 0.02 0 2 4934 0.05 0 3 4952 0.05 0 4 5730 0.06 0 5 6835 0.07 0 6 7500 0.07 0 7 8820 0.09 0 8 9133 0.09 0 *9 9211 0.09 1 10 9423 0.09 0 11 9621 0.10 0 12 9900 0.10 0 13 10606 0.11 0 14 10683 0.11 0 15 10719 0.11 0 16 12046 0.12 0 17 12327 0.12 
0 18 14106 0.14 0 19 15429 0.15 0 *20 17393 0.17 1 ``` ] .pull-right[ ```r # some big school teams lose tail(teams, 20) ``` ``` size prob win 45 40091 0.67 0 46 40101 0.67 1 47 41374 0.69 1 48 41887 0.70 1 49 42015 0.70 1 50 42840 0.71 1 51 43251 0.72 1 52 43587 0.73 1 53 43709 0.73 1 54 44012 0.73 1 55 44054 0.73 1 *56 44115 0.74 0 57 44162 0.74 1 58 44882 0.75 1 59 44902 0.75 1 *60 45360 0.76 0 61 47666 0.79 1 62 48255 0.80 1 63 49405 0.82 1 64 49698 0.83 1 ``` ] --- ```r # plot with body size on x-axis and win/loss (1 or 0) on y-axis plot(win ~ size, data = teams, xlab = "Student Body Size", ylab = "Lost = 0, Won = 1") ``` <img src="week11_02_files/figure-html/unnamed-chunk-10-1.png" width="50%" style="display: block; margin: auto;" /> --- ## How to model? ```r plot(win ~ size, data = teams, xlab = "Student Body Size", ylab = "Lost = 0, Won = 1") lm_out <- lm(win ~ size, data = teams) abline(lm_out) ``` <img src="week11_02_files/figure-html/unnamed-chunk-11-1.png" width="50%" style="display: block; margin: auto;" /> --- ## Are residuals approximately normally distributed? <img src="week11_02_files/figure-html/unnamed-chunk-13-1.png" width="432" style="display: block; margin: auto;" /> --- ## Challenges of binary outcome variables <br><br><br> .large[ - Error term not normally distributed - Conditional on `\(X\)`, predictions can fall outside of 0 / 1 - Relationship between `\(Y\)` and `\(X\)` not linear - Scale issues: what's 0.7 death? - Other issues beyond scope of this class ] --- ## Recall OLS: Ideal, Normal, Simple Linear Regression .center[ <img src="images/ss_display_7_5.png" width="60%" style="display: block; margin: auto;" /> ] .footnote[Source: *Statistical Sleuth*, 3e, Display 7.5] --- ```r # calculate logistic regression using 1. glm() and 2. 
# family = binomial
*logit_out <- glm(win ~ size, family = binomial, data = teams)
# see ?glm
summary(logit_out)
```

```

Call:
glm(formula = win ~ size, family = binomial, data = teams)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.113  -0.509  -0.214   0.575   2.545  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.552159   1.107121   -4.11  3.9e-05 ***
size         0.000147   0.000033    4.45  8.5e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 88.160  on 63  degrees of freedom
Residual deviance: 50.064  on 62  degrees of freedom
AIC: 54.06

Number of Fisher Scoring iterations: 5
```

---

.top-code[

```r
# draws a curve based on prediction from logistic regression model
plot(win ~ size, data = teams, xlab = "Student Body Size")
curve(predict(logit_out, data.frame(size = x), type = "response"), add = TRUE)
```
]

.bottom-plot[
<img src="week11_02_files/figure-html/unnamed-chunk-16-1.png" width="50%" style="display: block; margin: auto;" />
]

---

.top-code[

```r
# draws a curve based on prediction from logistic regression model
plot(win ~ size, data = teams, xlab = "Student Body Size")
curve(predict(logit_out, data.frame(size = x), type = "response"), add = TRUE)

# show unobserved probabilities of winning (we only see 0/1)
points(size, fitted(logit_out), pch = 20)
```
]

.bottom-plot[
<img src="week11_02_files/figure-html/unnamed-chunk-17-1.png" width="50%" style="display: block; margin: auto;" />
]

---

## Advantages of logistic regression with binary outcomes

.vertical-center[
.large[
- Predicted values of `\(y\)` asymptotically approach 0 or 1
- Predicted values of `\(y\)` can be interpreted as predicted probabilities
- With logistic regression, the log(odds) of `\(Y\)` is a linear function of `\(X\)`
]
]

---
class: middle, center

# Interpreting and Transforming
# Odds, log(odds) and Probabilities

---

## What are odds & log(odds)?
<br><br><br> $$ \textrm{odds} = \frac{\textrm{\# favorable outcomes}}{\textrm{\# unfavorable outcomes}} = \frac{\textrm{p(success)}}{\textrm{p(failure) }} = \frac{p}{q} = \frac{p}{1-p} $$ $$ \textrm{log(odds)} = \textrm{log}_e \left( \frac{p}{q} \right) = \textrm{log}_e \left( \frac{p}{1-p} \right) $$ --- ```r prob <- c(0.001, 0.01, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.999, 0.9999) odds <- (prob/(1-prob)) %>% round(3) log_odds <- log(odds) %>% round(3) ``` .pull-left[ ```r data.frame(prob, odds, log_odds) ``` ``` prob odds log_odds 1 0.0010 0.001 -6.908 2 0.0100 0.010 -4.605 3 0.1500 0.176 -1.737 4 0.2000 0.250 -1.386 5 0.2500 0.333 -1.100 6 0.3000 0.429 -0.846 7 0.3500 0.538 -0.620 8 0.4000 0.667 -0.405 9 0.4500 0.818 -0.201 *10 0.5000 1.000 0.000 11 0.5500 1.222 0.200 12 0.6000 1.500 0.405 13 0.6500 1.857 0.619 14 0.7000 2.333 0.847 15 0.7500 3.000 1.099 *16 0.8000 4.000 1.386 17 0.8500 5.667 1.735 18 0.9000 9.000 2.197 19 0.9990 999.000 6.907 20 0.9999 9999.000 9.210 ``` ] -- .pull-left[ ```r # convert prob to odds & log odds by hand # if prob = 0.5 .5/.5 log(.5/.5) # if prob = 0.8 .8/.2 log(.8/.2) # if prob = 0.75? # if prob = 0.20? # if prob = 0.25? # if prob = 0.90? ``` ] --- class: center, middle # With multiple regression, # how do we plot multiple dimensions # of data in two dimensions? 
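---

## Odds & log(odds): a worked sketch

Looking back at the by-hand conversions on the odds & log(odds) slides, the remaining cases could be worked roughly as follows. This is a minimal sketch: the helper names `prob_to_odds()`, `prob_to_log_odds()` and `log_odds_to_prob()` are hypothetical and introduced only for illustration, while `qlogis()` and `plogis()` are the base R equivalents. The last step plugs in the rounded coefficients printed in the `glm()` summary, so the result should be read as approximate.

```r
# Hypothetical helpers (illustrative names, not from the slides)
prob_to_odds     <- function(p) p / (1 - p)
prob_to_log_odds <- function(p) log(p / (1 - p))
log_odds_to_prob <- function(lo) 1 / (1 + exp(-lo))

# The probabilities left as questions on the earlier slide
p <- c(0.75, 0.20, 0.25, 0.90)

data.frame(
  prob     = p,
  odds     = round(prob_to_odds(p), 3),
  log_odds = round(prob_to_log_odds(p), 3)
)

# Base R equivalents: qlogis() is the logit, plogis() is its inverse
all.equal(qlogis(p), prob_to_log_odds(p))
all.equal(plogis(qlogis(p)), p)

# Going the other way: a linear predictor on the log(odds) scale,
# built from the rounded coefficients shown in summary(logit_out),
# converted to a predicted probability for a school of 30,000 students
log_odds_30k <- -4.552159 + 0.000147 * 30000
p_win <- log_odds_to_prob(log_odds_30k)
p_win
```

The point of the round trip is that the model is linear on the log(odds) scale, and `plogis()` (or the hand-written inverse) maps that scale back to probabilities between 0 and 1.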
---
class: center, middle

# Florence Nightingale's
# Rose Chart

---

## Recommended podcast on Florence Nightingale and visualization

<img src="images/nightingale_tim_harford_podcast_cautionary_tales.png" width="55%" style="display: block; margin: auto;" />

.footnote[https://timharford.com/2021/03/cautionary-tales-florence-nightingale-and-her-geeks-declare-war-on-death/]

---
background-image: url("images/nightingale_original.jpg")
background-position: center
background-size: contain

---
background-image: url("images/nightingale_rose_plot.png")
background-position: center
background-size: contain

---
background-image: url("images/kelly_cotton_tweet_nightingale.png")
background-position: center
background-size: contain

---
background-image: url("images/rose_diagram_mortality_2020.jpg")
background-position: center
background-size: contain

---

## Charles Joseph Minard: Napoleon's March (1869)

<img src="images/tufte_minard_napoleans_march_Figure527_lite.jpg" width="100%" style="display: block; margin: auto;" />

.footnote[Source: Edward Tufte, *The Visual Display of Quantitative Information*]

---
class: center, middle

# W.E.B. Du Bois Charts
# at World's Fair

---

<img src="images/kieran_healy_dv-cover-executive-b.jpg" width="58%" style="display: block; margin: auto;" />

---

## Data visualization

.vertical-center[
* About the Cover Image: Large detail from "The amalgamation of the white and black elements of the population in the United States", by Atlanta University students and W.E.B. Du Bois. Chart prepared for the Negro Exhibit of the American Section at the Paris Exposition Universelle in 1900 to show the economic and social progress of African Americans since emancipation. LOT 11931, no. 54 (M), http://www.loc.gov/pictures/item/2014645360.
]
--- background-image: url("images/dubois_worlds_fair_exhibition.png") background-position: center background-size: contain --- background-image: url("images/dubois_title_page.png") background-position: center background-size: contain --- <img src="images/dubois_worlds_fair_slaves_and_free_negroes.jpg" width="65%" style="display: block; margin: auto;" /> --- <img src="images/dubois_worlds_fair_city_urban.jpg" width="65%" style="display: block; margin: auto;" /> --- background-image: url("images/dubois_enslaved_orig_service-pnp-ppmsca-33900-33913v.jpg") background-position: center background-size: contain --- background-image: url("images/dubois_tidy_enslaved_plot.png") background-position: center background-size: contain --- class: center, middle # Sushi chef theory of statistics --- class: center, middle # Questions? --- class: center, middle # How to pick the right model? --- ## What's the type or distribution of our `\(Y\)`? * Type - Continuous? - Categorical or nominal? - Binary? - Interval? - Ordinal? * Distribution - Skewed? - Symmetrical? .footnote[See: https://stats.idre.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-interval-variables/] --- ## What's the data generation process? .vertical-center[ .large[ - Censored or truncated? - Excess zeros? - Missingness? - Attrition? - Selection bias? 
] ] --- ## Censoring example: Math study from Chapter 4 .left-code[ ```r math_study <- Sleuth3::case0402 %>% clean_names() ggplot(math_study) + aes(x = time) + geom_histogram(bins = 15) + facet_grid(~treatment) ``` ] .right-plot[ <img src="week11_02_files/figure-html/math_study_plot-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Overview of (Some) Regression Methods <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Category </th> <th style="text-align:left;"> Test Name </th> <th style="text-align:left;"> Example Y Data </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> OLS </td> <td style="text-align:left;"> Linear </td> <td style="text-align:left;"> `-\infty` `\ldots` `+\infty` </td> </tr> <tr> <td style="text-align:left;"> Binary & Categorical </td> <td style="text-align:left;"> Logit </td> <td style="text-align:left;"> 0/1, True/False </td> </tr> <tr> <td style="text-align:left;"> Binary & Categorical </td> <td style="text-align:left;"> Ordered Logit </td> <td style="text-align:left;"> Small, Medium, Large </td> </tr> <tr> <td style="text-align:left;"> Binary & Categorical </td> <td style="text-align:left;"> Multinomial </td> <td style="text-align:left;"> Trump, Stein, Clinton </td> </tr> <tr> <td style="text-align:left;"> Count </td> <td style="text-align:left;"> Poisson </td> <td style="text-align:left;"> 0 to `+\infty`, integer </td> </tr> <tr> <td style="text-align:left;"> Count </td> <td style="text-align:left;"> Zero-Inflated Poisson </td> <td style="text-align:left;"> 0 to `+\infty`, integer </td> </tr> <tr> <td style="text-align:left;"> Count </td> <td style="text-align:left;"> Negative binomial </td> <td style="text-align:left;"> 0 to `+\infty`, integer </td> </tr> </tbody> </table> .footnote[See: https://stats.idre.ucla.edu/r/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-r/] --- ## OLS <table class="table" style="margin-left: 
auto; margin-right: auto;">
<tbody>
<tr> <td style="text-align:left;"> Category </td> <td style="text-align:left;"> OLS </td> </tr>
<tr> <td style="text-align:left;"> Test Name </td> <td style="text-align:left;"> Linear </td> </tr>
<tr> <td style="text-align:left;"> Distribution </td> <td style="text-align:left;"> Normal </td> </tr>
<tr> <td style="text-align:left;"> Guidelines </td> <td style="text-align:left;"> Approximately continuous </td> </tr>
<tr> <td style="text-align:left;"> Example Y Data </td> <td style="text-align:left;"> `-\infty` `\ldots` `+\infty` </td> </tr>
<tr> <td style="text-align:left;"> Command </td> <td style="text-align:left;"> lm </td> </tr>
<tr> <td style="text-align:left;"> Package </td> <td style="text-align:left;"> base </td> </tr>
<tr> <td style="text-align:left;"> Syntax </td> <td style="text-align:left;"> lm(y ~ x, data = data) </td> </tr>
<tr> <td style="text-align:left;"> Link </td> <td style="text-align:left;"> NA </td> </tr>
</tbody>
</table>

---

## Logit

<table class="table" style="margin-left: auto; margin-right: auto;">
<tbody>
<tr> <td style="text-align:left;"> Category </td> <td style="text-align:left;"> Binary & Categorical </td> </tr>
<tr> <td style="text-align:left;"> Test Name </td> <td style="text-align:left;"> Logit </td> </tr>
<tr> <td style="text-align:left;"> Distribution </td> <td style="text-align:left;"> Binomial </td> </tr>
<tr> <td style="text-align:left;"> Guidelines </td> <td style="text-align:left;"> Binary or logical </td> </tr>
<tr> <td style="text-align:left;"> Example Y Data </td> <td style="text-align:left;"> 0/1, True/False </td> </tr>
<tr> <td style="text-align:left;"> Command </td> <td style="text-align:left;"> glm </td> </tr>
<tr> <td style="text-align:left;"> Package </td> <td style="text-align:left;"> base </td> </tr>
<tr> <td style="text-align:left;"> Syntax </td> <td style="text-align:left;"> glm(y ~ x, family = "binomial", data = data) </td> </tr>
<tr> <td style="text-align:left;"> Link </td> <td
style="text-align:left;"> https://stats.idre.ucla.edu/r/dae/logit-regression/ </td> </tr> </tbody> </table> --- ## Ordered Logit <table class="table" style="margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:left;"> Category </td> <td style="text-align:left;"> Binary & Categorical </td> </tr> <tr> <td style="text-align:left;"> Test Name </td> <td style="text-align:left;"> Ordered Logit </td> </tr> <tr> <td style="text-align:left;"> Distribution </td> <td style="text-align:left;"> NA </td> </tr> <tr> <td style="text-align:left;"> Guidelines </td> <td style="text-align:left;"> Multiple levels, ordered </td> </tr> <tr> <td style="text-align:left;"> Example Y Data </td> <td style="text-align:left;"> Small, Medium, Large </td> </tr> <tr> <td style="text-align:left;"> Command </td> <td style="text-align:left;"> polr </td> </tr> <tr> <td style="text-align:left;"> Package </td> <td style="text-align:left;"> MASS </td> </tr> <tr> <td style="text-align:left;"> Syntax </td> <td style="text-align:left;"> polr(y ~ x, data = data) </td> </tr> <tr> <td style="text-align:left;"> Link </td> <td style="text-align:left;"> https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/ </td> </tr> </tbody> </table> --- ## Multinomial <table class="table" style="margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:left;"> Category </td> <td style="text-align:left;"> Binary & Categorical </td> </tr> <tr> <td style="text-align:left;"> Test Name </td> <td style="text-align:left;"> Multinomial </td> </tr> <tr> <td style="text-align:left;"> Distribution </td> <td style="text-align:left;"> Multinomial </td> </tr> <tr> <td style="text-align:left;"> Guidelines </td> <td style="text-align:left;"> Multiple categories, unordered </td> </tr> <tr> <td style="text-align:left;"> Example Y Data </td> <td style="text-align:left;"> Trump, Stein, Clinton </td> </tr> <tr> <td style="text-align:left;"> Command </td> <td style="text-align:left;"> multinom </td> 
</tr> <tr> <td style="text-align:left;"> Package </td> <td style="text-align:left;"> nnet </td> </tr> <tr> <td style="text-align:left;"> Syntax </td> <td style="text-align:left;"> multinom(y ~ x, data = data) </td> </tr> <tr> <td style="text-align:left;"> Link </td> <td style="text-align:left;"> https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/ </td> </tr> </tbody> </table> --- ## Poisson <table class="table" style="margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:left;"> Category </td> <td style="text-align:left;"> Count </td> </tr> <tr> <td style="text-align:left;"> Test Name </td> <td style="text-align:left;"> Poisson </td> </tr> <tr> <td style="text-align:left;"> Distribution </td> <td style="text-align:left;"> Poisson </td> </tr> <tr> <td style="text-align:left;"> Guidelines </td> <td style="text-align:left;"> Count data </td> </tr> <tr> <td style="text-align:left;"> Example Y Data </td> <td style="text-align:left;"> 0 to `+\infty`, integer </td> </tr> <tr> <td style="text-align:left;"> Command </td> <td style="text-align:left;"> glm </td> </tr> <tr> <td style="text-align:left;"> Package </td> <td style="text-align:left;"> base </td> </tr> <tr> <td style="text-align:left;"> Syntax </td> <td style="text-align:left;"> glm(y ~ x, family = "poisson", data = data) </td> </tr> <tr> <td style="text-align:left;"> Link </td> <td style="text-align:left;"> https://stats.idre.ucla.edu/r/dae/poisson-regression/ </td> </tr> </tbody> </table> --- ## Visualizing Poisson Distributed Count Data - Count of # of paragraphs in *New York Times* articles about protests .left-code[ ```r ggplot(dca) + aes(x = paragraph) + geom_histogram(bins = 15) ``` ] .right-plot[ <img src="week11_02_files/figure-html/paragraph_hist_plot-1.png" width="100%" style="display: block; margin: auto;" /> ] .footnote[ Protest data from: https://web.stanford.edu/group/collectiveaction/cgi-bin/drupal/] --- ## Modeling with Poisson ```r # police4 = highest level of 
police violence dummy # viold = protester violence dummy # state1 = state # evyy = event year p1 <- glm(stories ~ police_violence * protester_violence + state + year, family = poisson, data = dca) p2 <- glm(paragraph ~ police_violence * protester_violence + state + year, family = poisson, data = dca) p3 <- glm(page ~ police_violence * protester_violence + state + year, family = poisson, data = dca) ``` --- ## First solution: Regression Table .left-table[ ```r stargazer( p1, p2, p3, type = 'html', header = FALSE, font.size = "scriptsize", digits = 2, omit.stat = c("f", "ser", "adj.rsq", "ll", "aic"), single.row = TRUE, omit = c("state", "year")) ``` ] .right-table[ <table style="text-align:center"><tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3"><em>Dependent variable:</em></td></tr> <tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr> <tr><td style="text-align:left"></td><td>stories</td><td>paragraph</td><td>page</td></tr> <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td></tr> <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">police_violence1</td><td>0.15<sup>*</sup> (0.09)</td><td>0.37<sup>***</sup> (0.03)</td><td>-0.21<sup>***</sup> (0.02)</td></tr> <tr><td style="text-align:left">protester_violence1</td><td>0.06 (0.04)</td><td>0.02 (0.01)</td><td>-0.11<sup>***</sup> (0.01)</td></tr> <tr><td style="text-align:left">police_violence1:protester_violence1</td><td>0.51<sup>***</sup> (0.10)</td><td>0.37<sup>***</sup> (0.03)</td><td>-0.01 (0.03)</td></tr> <tr><td style="text-align:left">Constant</td><td>0.05 (0.31)</td><td>2.22<sup>***</sup> (0.09)</td><td>3.13<sup>***</sup> (0.07)</td></tr> <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>3,668</td><td>3,669</td><td>3,652</td></tr> <tr><td colspan="4" style="border-bottom: 1px 
solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr> </table> ] --- ## Second solution: Marginal effects table <img src="images/leading_table18.png" width="85%" style="display: block; margin: auto;" /> --- ## Visualizing stories vs police_violence .left-code[ ```r plot_model( p1, type = "eff", terms = c("police_violence", "protester_violence")) ``` ] .right-plot[ <img src="week11_02_files/figure-html/dca_plot1-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Visualizing paragraphs vs police_violence .left-code[ ```r plot_model( p2, type = "eff", terms = c("police_violence", "protester_violence")) ``` ] .right-plot[ <img src="week11_02_files/figure-html/dca_plot2-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Visualizing page vs police_violence .left-code[ ```r plot_model( p3, type = "eff", terms = c("police_violence", "protester_violence")) ``` ] .right-plot[ <img src="week11_02_files/figure-html/dca_plot3-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Fourth solution: Plot with continuous `\(x_1\)` & categorical moderator `\(x_2\)` .vertical-center[ <img src="images/nyt_marginal_effects_plots-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Zero-Inflated Poisson <table class="table" style="margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:left;"> Category </td> <td style="text-align:left;"> Count </td> </tr> <tr> <td style="text-align:left;"> Test Name </td> <td style="text-align:left;"> Zero-Inflated Poisson </td> </tr> <tr> <td style="text-align:left;"> Distribution </td> <td style="text-align:left;"> Poisson + Binomial </td> </tr> <tr> <td style="text-align:left;"> Guidelines </td> <td style="text-align:left;"> Count data that has an excess of zero counts </td> </tr> <tr> <td style="text-align:left;"> Example Y Data 
</td> <td style="text-align:left;"> 0 to `+\infty`, integer </td> </tr> <tr> <td style="text-align:left;"> Command </td> <td style="text-align:left;"> zeroinfl </td> </tr> <tr> <td style="text-align:left;"> Package </td> <td style="text-align:left;"> pscl </td> </tr> <tr> <td style="text-align:left;"> Syntax </td> <td style="text-align:left;"> zeroinfl(y ~ x1 | x2, data = data) </td> </tr> <tr> <td style="text-align:left;"> Link </td> <td style="text-align:left;"> https://stats.idre.ucla.edu/r/dae/zip/ </td> </tr> </tbody> </table> --- ## Negative Binomial <table class="table" style="margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:left;"> Category </td> <td style="text-align:left;"> Count </td> </tr> <tr> <td style="text-align:left;"> Test Name </td> <td style="text-align:left;"> Negative binomial </td> </tr> <tr> <td style="text-align:left;"> Distribution </td> <td style="text-align:left;"> Negative binomial </td> </tr> <tr> <td style="text-align:left;"> Guidelines </td> <td style="text-align:left;"> Count data with over-dispersion </td> </tr> <tr> <td style="text-align:left;"> Example Y Data </td> <td style="text-align:left;"> 0 to `+\infty`, integer </td> </tr> <tr> <td style="text-align:left;"> Command </td> <td style="text-align:left;"> glm.nb </td> </tr> <tr> <td style="text-align:left;"> Package </td> <td style="text-align:left;"> MASS </td> </tr> <tr> <td style="text-align:left;"> Syntax </td> <td style="text-align:left;"> glm.nb(y ~ x, data = data) </td> </tr> <tr> <td style="text-align:left;"> Link </td> <td style="text-align:left;"> https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/ </td> </tr> </tbody> </table>
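
---

## Poisson vs. negative binomial: a sketch

The guideline above ("count data with over-dispersion") can be checked informally before choosing between `glm(..., family = poisson)` and `glm.nb()`. The following is a minimal sketch on simulated data, not the protest data used earlier. The object names `sim`, `pois_out` and `nb_out`, the seed, and the simulation parameters are all assumptions made for illustration.

```r
set.seed(90)  # arbitrary seed so the simulation is reproducible

# Simulated (not real) counts whose variance exceeds their mean
n   <- 500
x   <- rnorm(n)
mu  <- exp(0.5 + 0.3 * x)
y   <- rnbinom(n, size = 1.5, mu = mu)
sim <- data.frame(x, y)

# Poisson fit, which assumes the variance equals the mean
pois_out <- glm(y ~ x, family = poisson, data = sim)

# Rough over-dispersion check: Pearson chi-square / residual df.
# Values well above 1 suggest the Poisson assumption is too restrictive.
dispersion <- sum(residuals(pois_out, type = "pearson")^2) /
  df.residual(pois_out)
dispersion

# Negative binomial fit, which estimates an extra dispersion parameter (theta)
# MASS:: is used instead of library(MASS) to avoid masking dplyr::select()
nb_out <- MASS::glm.nb(y ~ x, data = sim)
summary(nb_out)

# Compare the two fits; a lower AIC indicates the preferred model
AIC(pois_out, nb_out)
```

The dispersion ratio is only a heuristic. With real data, excess zeros can also inflate it, in which case a zero-inflated model may be worth comparing as well.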