
# POL90: Statistics
## Logistic regression & Data visualization
### Prof. Wasow, Politics<br>Pomona College
### 2022-04-06

---


# Announcements

- Report 2
  - Often multiple versions of questions
  - Common to have follow-up if someone answers "I don't know"
  - Be careful to avoid second or third question in sequence
  
- *Statistical Sleuth*

- This week: Skim Chapter 20
  - http://appliedstats.org/chapter20.html

---
# Schedule

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Week </th>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Day </th>
   <th style="text-align:left;"> Title </th>
   <th style="text-align:right;"> Chapter </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:left;"> Mar 23 </td>
   <td style="text-align:left;"> Wed </td>
   <td style="text-align:left;"> Multiple regression </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 11 </td>
   <td style="text-align:left;"> Mar 28 </td>
   <td style="text-align:left;"> Mon </td>
   <td style="text-align:left;"> Interaction terms </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 11 </td>
   <td style="text-align:left;"> Mar 30 </td>
   <td style="text-align:left;"> Wed </td>
   <td style="text-align:left;"> Interaction terms </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 12 </td>
   <td style="text-align:left;"> Apr 4 </td>
   <td style="text-align:left;"> Mon </td>
   <td style="text-align:left;"> Logistic regression </td>
   <td style="text-align:right;"> 20 </td>
  </tr>
  <tr>
   <td style="text-align:right;color: black !important;background-color: yellow !important;"> 12 </td>
   <td style="text-align:left;color: black !important;background-color: yellow !important;"> Apr 6 </td>
   <td style="text-align:left;color: black !important;background-color: yellow !important;"> Wed </td>
   <td style="text-align:left;color: black !important;background-color: yellow !important;"> Logistic regression </td>
   <td style="text-align:right;color: black !important;background-color: yellow !important;"> 20 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 13 </td>
   <td style="text-align:left;"> Apr 11 </td>
   <td style="text-align:left;"> Mon </td>
   <td style="text-align:left;"> Missing data </td>
   <td style="text-align:right;"> Handout </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 13 </td>
   <td style="text-align:left;"> Apr 13 </td>
   <td style="text-align:left;"> Wed </td>
   <td style="text-align:left;"> Missing data </td>
   <td style="text-align:right;"> Handout </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 14 </td>
   <td style="text-align:left;"> Apr 18 </td>
   <td style="text-align:left;"> Mon </td>
   <td style="text-align:left;"> Matching </td>
   <td style="text-align:right;"> Handout </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 14 </td>
   <td style="text-align:left;"> Apr 20 </td>
   <td style="text-align:left;"> Wed </td>
   <td style="text-align:left;"> Matching </td>
   <td style="text-align:right;"> Handout </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 15 </td>
   <td style="text-align:left;"> Apr 25 </td>
   <td style="text-align:left;"> Mon </td>
   <td style="text-align:left;"> Causal inference: Panel data </td>
   <td style="text-align:right;"> Handout </td>
  </tr>
</tbody>
</table>

---
## Assignment schedule

---
class: center, middle

# Why Logistic Regression?

---
## Simple example: college sports with 64 teams

- Imagine 64 universities with student populations ranging from 1,000 to 50,000

```r
# 64 draws from a random uniform distribution
size <- runif(64, 1000, 50000)
size <- size[order(size)]
size
```

```
 [1]  1959  4934  4952  5730  6835  7500  8820  9133  9211  9423  9621  9900
[13] 10606 10683 10719 12046 12327 14106 15429 17393 17415 17736 18703 18739
[25] 20172 22298 24411 25824 26252 26625 26859 27208 29183 30096 30665 34230
[37] 35325 35896 37240 37307 37694 37925 38218 39083 40091 40101 41374 41887
[49] 42015 42840 43251 43587 43709 44012 44054 44115 44162 44882 44902 45360
[61] 47666 48255 49405 49698
```

---
## Imagine probability of winning is a function of size

```r
library(dplyr)  # provides case_when() and %>%

# Assume probability of winning is determined by school size,
# and that schools with fewer than 25,000 students lose at higher rates
prob <- case_when(
* size < 25000 ~ size/100000,
* TRUE         ~ size/ 60000
  )  %>% round(2)
prob
```

```
 [1] 0.02 0.05 0.05 0.06 0.07 0.07 0.09 0.09 0.09 0.09 0.10 0.10 0.11 0.11 0.11
[16] 0.12 0.12 0.14 0.15 0.17 0.17 0.18 0.19 0.19 0.20 0.22 0.24 0.43 0.44 0.44
[31] 0.45 0.45 0.49 0.50 0.51 0.57 0.59 0.60 0.62 0.62 0.63 0.63 0.64 0.65 0.67
[46] 0.67 0.69 0.70 0.70 0.71 0.72 0.73 0.73 0.73 0.73 0.74 0.74 0.75 0.75 0.76
[61] 0.79 0.80 0.82 0.83
```

```r
# Now each team plays one "game" and we toss a coin 
# with the pre-assigned probability to determine the winner
win <- rbinom(64, 1, prob)
win
```

```
 [1] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0
[39] 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1
```

---

```r
# Create a data.frame of our data
teams <- data.frame(size, prob, win)
```
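
The simulation steps above can be rerun reproducibly. The sketch below is one self-contained way to do it in base R. The seed value and the use of `ifelse()` in place of `case_when()` are assumptions for illustration, so the draws will differ from the output shown on these slides.

```r
# Reproducible sketch of the 64-team simulation (base R only).
# The seed is an arbitrary choice for illustration, not from the slides.
set.seed(90)

# 64 school sizes drawn uniformly between 1,000 and 50,000, sorted
size <- sort(runif(64, 1000, 50000))

# Win probability rises with size and jumps at 25,000 students
prob <- round(ifelse(size < 25000, size / 100000, size / 60000), 2)

# One game per team: a weighted coin toss with the assigned probability
win <- rbinom(64, 1, prob)

teams <- data.frame(size, prob, win)
str(teams)
```

Changing the seed gives a different set of wins and losses from the same underlying probabilities.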

```r
# some small school teams win
head(teams, 20)
```

```
    size prob win
1   1959 0.02   0
2   4934 0.05   0
3   4952 0.05   0
4   5730 0.06   0
5   6835 0.07   0
6   7500 0.07   0
7   8820 0.09   0
8   9133 0.09   0
*9   9211 0.09   1
10  9423 0.09   0
11  9621 0.10   0
12  9900 0.10   0
13 10606 0.11   0
14 10683 0.11   0
15 10719 0.11   0
16 12046 0.12   0
17 12327 0.12   0
18 14106 0.14   0
19 15429 0.15   0
*20 17393 0.17   1
```

```r
# some big school teams lose
tail(teams, 20)
```

```
    size prob win
45 40091 0.67   0
46 40101 0.67   1
47 41374 0.69   1
48 41887 0.70   1
49 42015 0.70   1
50 42840 0.71   1
51 43251 0.72   1
52 43587 0.73   1
53 43709 0.73   1
54 44012 0.73   1
55 44054 0.73   1
*56 44115 0.74   0
57 44162 0.74   1
58 44882 0.75   1
59 44902 0.75   1
*60 45360 0.76   0
61 47666 0.79   1
62 48255 0.80   1
63 49405 0.82   1
64 49698 0.83   1
```

---

```r
# plot with body size on x-axis and win/loss (1 or 0) on y-axis
plot(win  ~ size, 
     data = teams, 
     xlab = "Student Body Size", 
     ylab = "Lost = 0, Won = 1") 
```

---
## How to model?

```r
plot(win ~ size, data = teams, xlab = "Student Body Size", ylab = "Lost = 0, Won = 1") 
lm_out <- lm(win  ~ size, data = teams)
abline(lm_out)
```

---
## Are residuals approximately normally distributed?
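
One way to check is to look at the residuals from the linear fit directly. The sketch below re-simulates data of the same form so that it runs on its own (the seed is an arbitrary assumption), then draws a histogram and a normal Q-Q plot of the residuals.

```r
# Self-contained sketch: residuals from a linear model fit to a 0/1 outcome.
set.seed(90)  # arbitrary seed, for illustration only
size <- sort(runif(64, 1000, 50000))
prob <- ifelse(size < 25000, size / 100000, size / 60000)
teams <- data.frame(size, win = rbinom(64, 1, prob))

lm_out <- lm(win ~ size, data = teams)
res <- resid(lm_out)

# With a binary outcome, each residual is either (0 - fitted) or (1 - fitted),
# so the residuals fall along two bands rather than one bell-shaped cloud.
par(mfrow = c(1, 2))
hist(res, main = "Residuals from lm()", xlab = "Residual")
qqnorm(res)
qqline(res)
```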

---
## Challenges of binary outcome variables

<br><br><br>
.large[

- Error term not normally distributed

- Conditional on `$X$`, predictions can fall outside the 0 to 1 range
 
- Relationship between `$Y$` and `$X$` not linear
 
- Scale issues: what does 0.7 of a death mean?
 
- Other issues beyond scope of this class

]
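
The point about predictions falling outside the 0 to 1 range can be seen with a tiny deterministic example. The data below are made up for illustration: ten observations where the outcome switches from 0 to 1 halfway through.

```r
# Hypothetical toy data: outcome is 0 for x = 1..5 and 1 for x = 6..10
toy <- data.frame(x = 1:10, y = rep(c(0, 1), each = 5))

lm_toy <- lm(y ~ x, data = toy)

# Fitted values from the straight line at the smallest and largest x
range(fitted(lm_toy))

# Count fitted "probabilities" that fall below 0 or above 1
n_outside <- sum(fitted(lm_toy) < 0 | fitted(lm_toy) > 1)
n_outside
```

Here the line predicts values below 0 at the low end and above 1 at the high end, which cannot be probabilities.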
 
---
## Recall OLS: Ideal, Normal, Simple Linear Regression

.center[
<img src="images/ss_display_7_5.png" width="60%" style="display: block; margin: auto;" />
]
.footnote[Source: *Statistical Sleuth*, 3e, Display 7.5]

---

```r
# calculate logistic regression using 1. glm() and 2. family = binomial
*logit_out <- glm(win ~ size, family = binomial, teams)
# see ?glm
summary(logit_out)
```

```

Call:
glm(formula = win ~ size, family = binomial, data = teams)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.113  -0.509  -0.214   0.575   2.545

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.552159   1.107121   -4.11  3.9e-05 ***
size         0.000147   0.000033    4.45  8.5e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 88.160  on 63  degrees of freedom
Residual deviance: 50.064  on 62  degrees of freedom
AIC: 54.06

Number of Fisher Scoring iterations: 5
```
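
The coefficients are on the log(odds) scale, so they need to be transformed before they can be read as odds or probabilities. A minimal sketch, typing in the two estimates from the summary above (the 30,000-student school and the 10,000-student step are arbitrary choices for illustration):

```r
# Estimates copied from the summary() output above
b0 <- -4.552159   # intercept (log odds of winning when size = 0)
b1 <-  0.000147   # change in log odds per additional student

# Odds ratio for a 10,000-student increase in size
or_10k <- exp(b1 * 10000)
or_10k

# Predicted probability of winning for a school of 30,000 students;
# plogis() is the inverse logit, exp(x) / (1 + exp(x))
p_30k <- plogis(b0 + b1 * 30000)
p_30k

# Size at which the predicted probability is exactly 0.5 (log odds = 0)
size_50 <- -b0 / b1
size_50
```

With a fitted model object, `exp(coef(logit_out))` and `predict(logit_out, newdata, type = "response")` give the same kinds of quantities without retyping the estimates.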

---
.top-code[

```r
# draws a curve based on prediction from logistic regression model
plot(win ~ size, data = teams, xlab = "Student Body Size") 
curve(predict(logit_out, data.frame(size = x), type = "response"), add = TRUE)
```
]

.bottom-plot[
<img src="week11_02_files/figure-html/unnamed-chunk-16-1.png" width="50%" style="display: block; margin: auto;" />
]

---

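.top-code[

```r
# redraw the data and the fitted logistic curve
plot(win ~ size, data = teams, xlab = "Student Body Size") 
curve(predict(logit_out, data.frame(size = x), type = "response"), add = TRUE)
# show unobserved probabilities of winning (we only see 0/1)
points(teams$size, fitted(logit_out), pch = 20) 
```
]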

.bottom-plot[
<img src="week11_02_files/figure-html/unnamed-chunk-17-1.png" width="50%" style="display: block; margin: auto;" />
]

---
## Advantages of logistic regression with binary outcomes

- Predicted values of `$y$` can be interpreted as predicted probabilities

- With logistic regression, the log(odds) of `$Y = 1$` are a linear function of `$X$`

---
class: middle, center

# Interpreting and Transforming
# Odds, log(odds) and Probabilities

---

## What are odds & log(odds)?

$$ 
\textrm{odds}  = \frac{\textrm{\# favorable outcomes}}{\textrm{\# unfavorable outcomes}} = \frac{\textrm{p(success)}}{\textrm{p(failure) }} = \frac{p}{q} = \frac{p}{1-p} 
$$

$$
\textrm{log(odds)} =  \textrm{log}_e \left( \frac{p}{q} \right) = \textrm{log}_e \left( \frac{p}{1-p} \right)
$$
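
Going the other way, from odds or log(odds) back to a probability, follows by solving the definitions above for `$p$`:

$$
p = \frac{\textrm{odds}}{1 + \textrm{odds}} = \frac{e^{\textrm{log(odds)}}}{1 + e^{\textrm{log(odds)}}}
$$

For example, odds of 4 correspond to `$p = 4/5 = 0.8$`.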

---

```r
prob <- c(0.001, 0.01, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.999, 0.9999)
odds     <- (prob/(1-prob)) %>% round(3)
# take the log of the unrounded odds so rounding error does not carry over
log_odds <- log(prob/(1-prob)) %>% round(3)
```
.pull-left[

```r
data.frame(prob, odds, log_odds)
```

```
     prob     odds log_odds
1  0.0010    0.001   -6.907
2  0.0100    0.010   -4.595
3  0.1500    0.176   -1.735
4  0.2000    0.250   -1.386
5  0.2500    0.333   -1.099
6  0.3000    0.429   -0.847
7  0.3500    0.538   -0.619
8  0.4000    0.667   -0.405
9  0.4500    0.818   -0.201
*10 0.5000    1.000    0.000
11 0.5500    1.222    0.201
12 0.6000    1.500    0.405
13 0.6500    1.857    0.619
14 0.7000    2.333    0.847
15 0.7500    3.000    1.099
*16 0.8000    4.000    1.386
17 0.8500    5.667    1.735
18 0.9000    9.000    2.197
19 0.9990  999.000    6.907
20 0.9999 9999.000    9.210
```
]

.pull-right[

```r
# convert prob to odds & log odds by hand
# if prob = 0.5
.5/.5
log(.5/.5)

# if prob = 0.8
.8/.2
log(.8/.2)

# if prob = 0.75?

# if prob = 0.20?

# if prob = 0.25?

# if prob = 0.90?
```
]
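
The same conversions can be wrapped in small helper functions. The function names below are hypothetical; `qlogis()` and `plogis()` are the base R logit and inverse-logit functions.

```r
# Hypothetical helper names, for illustration only
prob_to_odds     <- function(p) p / (1 - p)
prob_to_log_odds <- function(p) log(p / (1 - p))           # same as qlogis(p)
log_odds_to_prob <- function(lo) exp(lo) / (1 + exp(lo))   # same as plogis(lo)

# The two worked cases from above
prob_to_odds(c(0.5, 0.8))
prob_to_log_odds(c(0.5, 0.8))

# Round trip: converting to log odds and back recovers the probability
log_odds_to_prob(prob_to_log_odds(0.8))
```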

---

# With multiple regression,
# how do we plot multiple dimensions 
# of data in two dimensions?

---

# Florence Nightingale's
# Rose Chart

---
## Recommended podcast on Florence Nightingale and visualization

.footnote[https://timharford.com/2021/03/cautionary-tales-florence-nightingale-and-her-geeks-declare-war-on-death/]

---
background-image: url("images/nightingale_original.jpg")
background-position: center
background-size: contain

---
background-image: url("images/nightingale_rose_plot.png")
background-position: center
background-size: contain

---
background-image: url("images/kelly_cotton_tweet_nightingale.png")
background-position: center
background-size: contain

---
background-image: url("images/rose_diagram_mortality_2020.jpg")
background-position: center
background-size: contain

---

## Charles Joseph Minard: Napoleon's March (1869)

---

# W.E.B. Du Bois Charts
# at World's Fair

---

## Data visualization

.vertical-center[
* About the Cover Image: Large detail from "The amalgamation of the white and black elements of the population in the United States," by Atlanta University students and W.E.B. Du Bois. Chart prepared for the Negro Exhibit of the American Section at the Paris Exposition Universelle in 1900 to show the economic and social progress of African Americans since emancipation. LOT 11931, no. 54 (M), http://www.loc.gov/pictures/item/2014645360.
]

---
background-image: url("images/dubois_worlds_fair_exhibition.png")
background-position: center
background-size: contain

---
background-image: url("images/dubois_title_page.png")
background-position: center
background-size: contain

---
background-image: url("images/dubois_enslaved_orig_service-pnp-ppmsca-33900-33913v.jpg")
background-position: center
background-size: contain

---
background-image: url("images/dubois_tidy_enslaved_plot.png")
background-position: center
background-size: contain

---
class: center, middle

# Sushi chef theory of statistics

---
class: center, middle

# Questions?

---
class: center, middle

# How to pick the right model?

---
## What's the type or distribution of our `$Y$`?

* Type
  
    - Continuous?
  
    - Categorical or nominal?
  
    - Binary?
  
    - Interval?
  
    - Ordinal?
  
* Distribution
  
    - Skewed? 
    
    - Symmetrical?

.footnote[See: https://stats.idre.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-interval-variables/]

---
## What's the data generation process?

.vertical-center[
.large[
  - Censored or truncated?
  
  - Excess zeros?
  
  - Missingness?
  
  - Attrition?
  
  - Selection bias?
]
]
---
## Censoring example: Math study from Chapter 4

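```r
library(dplyr)    # %>%
library(janitor)  # clean_names()
library(ggplot2)

# clean_names() converts the column names to lower snake_case
math_study <- Sleuth3::case0402 %>%
  clean_names()

ggplot(math_study) +
  aes(x = time) +
  geom_histogram(bins = 15) +
  facet_grid(~treatment)
```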

.right-plot[
<img src="week11_02_files/figure-html/math_study_plot-1.png" width="100%" style="display: block; margin: auto;" />
]
  
---
## Overview of (Some) Regression Methods

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Category </th>
   <th style="text-align:left;"> Test Name </th>
   <th style="text-align:left;"> Example Y Data </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> OLS </td>
   <td style="text-align:left;"> Linear </td>
   <td style="text-align:left;"> `$-\infty \ldots +\infty$` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Binary &amp; Categorical </td>
   <td style="text-align:left;"> Logit </td>
   <td style="text-align:left;"> 0/1, True/False </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Binary &amp; Categorical </td>
   <td style="text-align:left;"> Ordered Logit </td>
   <td style="text-align:left;"> Small, Medium, Large </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Binary &amp; Categorical </td>
   <td style="text-align:left;"> Multinomial </td>
   <td style="text-align:left;"> Trump, Stein, Clinton </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Count </td>
   <td style="text-align:left;"> Poisson </td>
   <td style="text-align:left;"> 0 to `$+\infty$`, integer </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Count </td>
   <td style="text-align:left;"> Zero-Inflated Poisson </td>
   <td style="text-align:left;"> 0 to `$+\infty$`, integer </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Count </td>
   <td style="text-align:left;"> Negative binomial </td>
   <td style="text-align:left;"> 0 to `$+\infty$`, integer </td>
  </tr>
</tbody>
</table>

.footnote[See: https://stats.idre.ucla.edu/r/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-r/]

---
## OLS

<table class="table" style="margin-left: auto; margin-right: auto;">
<tbody>
  <tr>
   <td style="text-align:left;"> Category </td>
   <td style="text-align:left;"> OLS </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Test Name </td>
   <td style="text-align:left;"> Linear </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Distribution </td>
   <td style="text-align:left;"> Normal </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Guidelines </td>
   <td style="text-align:left;"> Approximately continuous </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Example Y Data </td>
   <td style="text-align:left;"> `$-\infty \ldots +\infty$` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Command </td>
   <td style="text-align:left;"> lm </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Package </td>
   <td style="text-align:left;"> stats (base R) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Syntax </td>
   <td style="text-align:left;"> lm(y ~ x, data = data) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Link </td>
   <td style="text-align:left;"> NA </td>
  </tr>
</tbody>
</table>

---
## Logit

<table class="table" style="margin-left: auto; margin-right: auto;">
<tbody>
  <tr>
   <td style="text-align:left;"> Category </td>
   <td style="text-align:left;"> Binary &amp; Categorical </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Test Name </td>
   <td style="text-align:left;"> Logit </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Distribution </td>
   <td style="text-align:left;"> Binomial </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Guidelines </td>
   <td style="text-align:left;"> Binary or logical </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Example Y Data </td>
   <td style="text-align:left;"> 0/1, True/False </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Command </td>
   <td style="text-align:left;"> glm </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Package </td>
   <td style="text-align:left;"> stats (base R) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Syntax </td>
   <td style="text-align:left;"> glm(y ~ x, family = "binomial", data = data) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Link </td>
   <td style="text-align:left;"> https://stats.idre.ucla.edu/r/dae/logit-regression/ </td>
  </tr>
</tbody>
</table>

---
## Ordered Logit

<table class="table" style="margin-left: auto; margin-right: auto;">
<tbody>
  <tr>
   <td style="text-align:left;"> Category </td>
   <td style="text-align:left;"> Binary &amp; Categorical </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Test Name </td>
   <td style="text-align:left;"> Ordered Logit </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Distribution </td>
   <td style="text-align:left;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Guidelines </td>
   <td style="text-align:left;"> Multiple levels, ordered </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Example Y Data </td>
   <td style="text-align:left;"> Small, Medium, Large </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Command </td>
   <td style="text-align:left;"> polr </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Package </td>
   <td style="text-align:left;"> MASS </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Syntax </td>
   <td style="text-align:left;"> polr(y ~ x, data = data) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Link </td>
   <td style="text-align:left;"> https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/ </td>
  </tr>
</tbody>
</table>

---
## Multinomial

<table class="table" style="margin-left: auto; margin-right: auto;">
<tbody>
  <tr>
   <td style="text-align:left;"> Category </td>
   <td style="text-align:left;"> Binary &amp; Categorical </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Test Name </td>
   <td style="text-align:left;"> Multinomial </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Distribution </td>
   <td style="text-align:left;"> Multinomial </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Guidelines </td>
   <td style="text-align:left;"> Multiple categories, unordered </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Example Y Data </td>
   <td style="text-align:left;"> Trump, Stein, Clinton </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Command </td>
   <td style="text-align:left;"> multinom </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Package </td>
   <td style="text-align:left;"> nnet </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Syntax </td>
   <td style="text-align:left;"> multinom(y ~ x, data = data) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Link </td>
   <td style="text-align:left;"> https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/ </td>
  </tr>
</tbody>
</table>

---
## Poisson

<table class="table" style="margin-left: auto; margin-right: auto;">
<tbody>
  <tr>
   <td style="text-align:left;"> Category </td>
   <td style="text-align:left;"> Count </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Test Name </td>
   <td style="text-align:left;"> Poisson </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Distribution </td>
   <td style="text-align:left;"> Poisson </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Guidelines </td>
   <td style="text-align:left;"> Count data </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Example Y Data </td>
   <td style="text-align:left;"> 0 to `$+\infty$`, integer </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Command </td>
   <td style="text-align:left;"> glm </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Package </td>
   <td style="text-align:left;"> stats (base R) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Syntax </td>
   <td style="text-align:left;"> glm(y ~ x, family = "poisson", data = data) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Link </td>
   <td style="text-align:left;"> https://stats.idre.ucla.edu/r/dae/poisson-regression/ </td>
  </tr>
</tbody>
</table>
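
A minimal sketch of this syntax on simulated data. The variable names, sample size, seed, and true coefficients are all assumptions chosen for illustration.

```r
set.seed(90)  # arbitrary seed
n <- 1000
x <- runif(n, 0, 2)

# Simulate counts whose log mean is linear in x: log(mu) = 0.3 + 0.5 * x
y <- rpois(n, lambda = exp(0.3 + 0.5 * x))
sim <- data.frame(x, y)

pois_out <- glm(y ~ x, family = poisson, data = sim)

# Coefficients are on the log scale
coef(pois_out)

# Exponentiate to get multiplicative effects on the expected count
exp(coef(pois_out))
```

The estimated slope should land near the simulated value of 0.5, though not exactly on it, because the counts are random.
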
---
## Visualizing Poisson Distributed Count Data

- Number of paragraphs in *New York Times* articles about protests

```r
ggplot(dca) + 
  aes(x = paragraph) + 
  geom_histogram(bins = 15)
```

.right-plot[
<img src="week11_02_files/figure-html/paragraph_hist_plot-1.png" width="100%" style="display: block; margin: auto;" />
]

---
## Modeling with Poisson

```r
# police_violence = highest level of police violence dummy
# protester_violence = protester violence dummy
# state = state
# year = event year

p1 <- glm(stories ~ police_violence * protester_violence + state + year, 
          family = poisson, data = dca)

p2 <- glm(paragraph ~ police_violence * protester_violence + state + year, 
          family = poisson, data = dca)

p3 <- glm(page ~ police_violence * protester_violence + state + year, 
          family = poisson, data = dca) 
```

---
## First solution: Regression Table

```r
stargazer(
  p1, p2, p3, 
  type = 'html', 
  header = FALSE, 
  font.size = "scriptsize",
  digits = 2, 
  omit.stat = c("f", "ser", "adj.rsq", "ll", "aic"), 
  single.row = TRUE, 
  omit = c("state", "year"))
```

<table style="text-align:center"><tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3"><em>Dependent variable:</em></td></tr>
<tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left"></td><td>stories</td><td>paragraph</td><td>page</td></tr>
<tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">police_violence1</td><td>0.15<sup>*</sup> (0.09)</td><td>0.37<sup>***</sup> (0.03)</td><td>-0.21<sup>***</sup> (0.02)</td></tr>
<tr><td style="text-align:left">protester_violence1</td><td>0.06 (0.04)</td><td>0.02 (0.01)</td><td>-0.11<sup>***</sup> (0.01)</td></tr>
<tr><td style="text-align:left">police_violence1:protester_violence1</td><td>0.51<sup>***</sup> (0.10)</td><td>0.37<sup>***</sup> (0.03)</td><td>-0.01 (0.03)</td></tr>
<tr><td style="text-align:left">Constant</td><td>0.05 (0.31)</td><td>2.22<sup>***</sup> (0.09)</td><td>3.13<sup>***</sup> (0.07)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>3,668</td><td>3,669</td><td>3,652</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
</table>
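
Poisson coefficients are on the log-count scale, so exponentiating them gives multiplicative effects on the expected count. A short sketch using the `paragraph` column (model 2) of the table above, with the coefficients typed in by hand:

```r
# Coefficients copied from column (2) of the table
b_police      <- 0.37   # police_violence1
b_interaction <- 0.37   # police_violence1:protester_violence1

# Multiplicative change in expected paragraphs for police violence
# when there is no protester violence
ratio_no_protester_violence <- exp(b_police)
ratio_no_protester_violence

# Same comparison when protester violence is present:
# the main effect and the interaction add on the log scale
ratio_with_protester_violence <- exp(b_police + b_interaction)
ratio_with_protester_violence
```

These ratios hold the state and year terms fixed, since those were included in the models but omitted from the table.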

---
## Second solution: Marginal effects table

---
## Visualizing stories vs police_violence

```r
plot_model(
  p1, 
  type = "eff", 
  terms = c("police_violence", "protester_violence"))
```

.right-plot[
<img src="week11_02_files/figure-html/dca_plot1-1.png" width="100%" style="display: block; margin: auto;" />
]
---
## Visualizing paragraphs vs police_violence

```r
plot_model(
  p2, 
  type = "eff", 
  terms = c("police_violence", "protester_violence"))
```

.right-plot[
<img src="week11_02_files/figure-html/dca_plot2-1.png" width="100%" style="display: block; margin: auto;" />
]
---
## Visualizing page vs police_violence

```r
plot_model(
  p3, 
  type = "eff", 
  terms = c("police_violence", "protester_violence"))
```

.right-plot[
<img src="week11_02_files/figure-html/dca_plot3-1.png" width="100%" style="display: block; margin: auto;" />
]
---
## Fourth solution: Plot with continuous `$x_1$` & categorical moderator `$x_2$`

.vertical-center[
<img src="images/nyt_marginal_effects_plots-1.png" width="100%" style="display: block; margin: auto;" />
]

---
## Zero-Inflated Poisson

<table class="table" style="margin-left: auto; margin-right: auto;">
<tbody>
  <tr>
   <td style="text-align:left;"> Category </td>
   <td style="text-align:left;"> Count </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Test Name </td>
   <td style="text-align:left;"> Zero-Inflated Poisson </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Distribution </td>
   <td style="text-align:left;"> Poisson + Binomial </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Guidelines </td>
   <td style="text-align:left;"> Count data that has an excess of zero counts </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Example Y Data </td>
   <td style="text-align:left;"> 0 to `$+\infty$`, integer </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Command </td>
   <td style="text-align:left;"> zeroinfl </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Package </td>
   <td style="text-align:left;"> pscl </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Syntax </td>
   <td style="text-align:left;"> zeroinfl(y ~ x1 | x2, data = data) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Link </td>
   <td style="text-align:left;"> https://stats.idre.ucla.edu/r/dae/zip/ </td>
  </tr>
</tbody>
</table>
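
A quick way to see what "an excess of zero counts" means is to compare the share of zeros in the data with the share a plain Poisson distribution with the same mean would predict. The sketch below uses a made-up data generating process; the 40% share of structural zeros, the Poisson mean of 3, and the seed are assumptions for illustration.

```r
set.seed(90)  # arbitrary seed
n <- 2000

# Hypothetical process: 40% of units are "structural" zeros,
# the rest follow a Poisson distribution with mean 3
always_zero <- rbinom(n, 1, 0.4)
y <- ifelse(always_zero == 1, 0, rpois(n, lambda = 3))

# Share of zeros actually observed
observed_zero <- mean(y == 0)

# Share of zeros a plain Poisson with the same mean would predict
expected_zero <- dpois(0, lambda = mean(y))

c(observed = observed_zero, poisson = expected_zero)
```

A large gap between the two shares is one sign that `zeroinfl()` may fit better than a plain Poisson `glm()`.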

---
## Negative Binomial

<table class="table" style="margin-left: auto; margin-right: auto;">
<tbody>
  <tr>
   <td style="text-align:left;"> Category </td>
   <td style="text-align:left;"> Count </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Test Name </td>
   <td style="text-align:left;"> Negative binomial </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Distribution </td>
   <td style="text-align:left;"> Negative binomial </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Guidelines </td>
   <td style="text-align:left;"> Count data with over-dispersion </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Example Y Data </td>
   <td style="text-align:left;"> 0 to `$+\infty$`, integer </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Command </td>
   <td style="text-align:left;"> glm.nb </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Package </td>
   <td style="text-align:left;"> MASS </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Syntax </td>
   <td style="text-align:left;"> glm.nb(y ~ x, data = data) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Link </td>
   <td style="text-align:left;"> https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/ </td>
  </tr>
</tbody>
</table>
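
Over-dispersion means the counts vary more than a Poisson model allows (variance greater than the mean). One common check, sketched here on simulated data, is to fit a Poisson model and divide the sum of squared Pearson residuals by the residual degrees of freedom. The seed, sample size, and coefficients are assumptions for illustration.

```r
set.seed(90)  # arbitrary seed
n <- 1000
x <- runif(n, 0, 2)

# Simulate over-dispersed counts: negative binomial with mean mu and size 1,
# so the variance is mu + mu^2 rather than mu
mu <- exp(1 + 0.5 * x)
y <- rnbinom(n, size = 1, mu = mu)
sim <- data.frame(x, y)

# Fit a Poisson model anyway and compute the Pearson dispersion statistic
pois_out <- glm(y ~ x, family = poisson, data = sim)
dispersion <- sum(resid(pois_out, type = "pearson")^2) / df.residual(pois_out)
dispersion
```

Values well above 1 suggest over-dispersion, in which case `MASS::glm.nb(y ~ x, data = sim)` is a common alternative.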