POL90: Statistics

class: center, middle, inverse, title-slide

# POL90: Statistics
## Logistic regression & Data visualization
### Prof. Wasow, Politics</br>Pomona College
### 2022-04-05

---

.regression12 table {
  font-size: 12px;     
}

.regression14 table {
  font-size: 14px;     
}

</style>

# Announcements

.large[

- Report 2
  - Doesn't have to be feeling thermometer
  - Doesn't have to be 2020
  - Should be "plausibly analogous to an experiment"
    - Do not use an "immutable characteristic"
  
- *Statistical Sleuth*

- This week: Skim Chapter 20
      - http://appliedstats.org/chapter20.html

]

---
# Schedule

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Week </th>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Day </th>
   <th style="text-align:left;"> Title </th>
   <th style="text-align:right;"> Chapter </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:left;"> Mar 21 </td>
   <td style="text-align:left;"> Mon </td>
   <td style="text-align:left;"> Null hypothesis, R-squared </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:left;"> Mar 23 </td>
   <td style="text-align:left;"> Wed </td>
   <td style="text-align:left;"> Multiple regression </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 11 </td>
   <td style="text-align:left;"> Mar 28 </td>
   <td style="text-align:left;"> Mon </td>
   <td style="text-align:left;"> Interaction terms </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 11 </td>
   <td style="text-align:left;"> Mar 30 </td>
   <td style="text-align:left;"> Wed </td>
   <td style="text-align:left;"> Interaction terms </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:right;color: black !important;background-color: yellow !important;"> 12 </td>
   <td style="text-align:left;color: black !important;background-color: yellow !important;"> Apr 4 </td>
   <td style="text-align:left;color: black !important;background-color: yellow !important;"> Mon </td>
   <td style="text-align:left;color: black !important;background-color: yellow !important;"> Logistic regression </td>
   <td style="text-align:right;color: black !important;background-color: yellow !important;"> 20 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 12 </td>
   <td style="text-align:left;"> Apr 6 </td>
   <td style="text-align:left;"> Wed </td>
   <td style="text-align:left;"> Logistic regression </td>
   <td style="text-align:right;"> 20 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 13 </td>
   <td style="text-align:left;"> Apr 11 </td>
   <td style="text-align:left;"> Mon </td>
   <td style="text-align:left;"> Missing data </td>
   <td style="text-align:right;"> Handout </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 13 </td>
   <td style="text-align:left;"> Apr 13 </td>
   <td style="text-align:left;"> Wed </td>
   <td style="text-align:left;"> Missing data </td>
   <td style="text-align:right;"> Handout </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 14 </td>
   <td style="text-align:left;"> Apr 18 </td>
   <td style="text-align:left;"> Mon </td>
   <td style="text-align:left;"> Matching </td>
   <td style="text-align:right;"> Handout </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 14 </td>
   <td style="text-align:left;"> Apr 20 </td>
   <td style="text-align:left;"> Wed </td>
   <td style="text-align:left;"> Matching </td>
   <td style="text-align:right;"> Handout </td>
  </tr>
</tbody>
</table>

---
## Assignment schedule

---
class: center, middle

# Interactions with Albuquerque
# Real Estate Data

---
## Example using Albuquerque real estate data

```r
# read in Albuquerque data
alb_real_estate <- read.table(here("data", "alb.dat.txt"), sep = "\t", header = TRUE)
head(alb_real_estate, 3)
```

```
  price sqft age feats ne cust cor  tax
1  2050 2650  13     7  1    1   0 1639
2  2080 2600  NA     4  1    1   0 1088
3  2150 2664   6     5  1    1   0 1193
```

```r
# Transform ne dummy and select three columns
alb <- alb_real_estate %>% 
* mutate(ne_fct = ifelse(ne == 1, "yes", "no") %>% as.factor()) %>%
  rename(ne_bin = ne) %>% 
  select(price, sqft, ne_bin, ne_fct)

head(alb, 3)
```

```
  price sqft ne_bin ne_fct
1  2050 2650      1    yes
2  2080 2600      1    yes
3  2150 2664      1    yes
```

---
## Visualize Albuquerque real estate data

.left-code[

```r
ggplot(data = alb) +
  aes(x = sqft,
      y = price,
      color = ne_fct) + 
  geom_point()
```
]

.right-plot[
<img src="week11_01_files/figure-html/alb_plot-1.png" width="432" style="display: block; margin: auto;" />
]

---
## We could subset and run two separate models

.left-code[

```r
ggplot(data = alb) +
  aes(x = sqft,
      y = price,
      color = ne_fct) + 
  geom_point() +
  geom_smooth(method = "lm")
```
]

.right-plot[
<img src="week11_01_files/figure-html/alb_plot2-1.png" width="432" style="display: block; margin: auto;" />
]

---
## Subset for `ne_fct == 'yes'` and model

```r
alb_ne_yes <- alb %>%  filter(ne_fct == 'yes')

lm(price ~ sqft, data = alb_ne_yes) %>% summary()
```

```

Call:
lm(formula = price ~ sqft, data = alb_ne_yes)

Residuals:
   Min     1Q Median     3Q    Max 
-433.2  -89.4   -9.7   65.3  466.6

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.1691    46.8781    0.79     0.43    
sqft          0.6180     0.0221   27.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 159 on 106 degrees of freedom
Multiple R-squared:  0.881,	Adjusted R-squared:  0.879 
F-statistic:  782 on 1 and 106 DF,  p-value: <2e-16
```

---
## Subset for `ne_fct == 'no'` and model

```r
alb_ne_no <- alb %>%  filter(ne_fct == 'no')

lm(price ~ sqft, data = alb_ne_no) %>% summary()
```

```

Call:
lm(formula = price ~ sqft, data = alb_ne_no)

Residuals:
   Min     1Q Median     3Q    Max 
-290.9 -106.1   -1.0   66.3  923.6

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  293.104     63.011    4.65  1.6e-05 ***
sqft           0.417      0.031   13.48  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 183 on 66 degrees of freedom
Multiple R-squared:  0.734,	Adjusted R-squared:  0.729 
F-statistic:  182 on 1 and 66 DF,  p-value: <2e-16
```

---
## Compare with interaction model

```r
lm(price ~ sqft + ne_fct + sqft * ne_fct, data = alb) %>% summary()
```

```

Call:
lm(formula = price ~ sqft + ne_fct + sqft * ne_fct, data = alb)

Residuals:
   Min     1Q Median     3Q    Max 
-433.2  -94.3   -5.1   65.6  923.6

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     293.1036    58.0879    5.05  1.1e-06 ***
sqft              0.4175     0.0286   14.62  < 2e-16 ***
ne_fctyes      -255.9345    76.4422   -3.35    0.001 ***
sqft:ne_fctyes    0.2005     0.0369    5.43  1.9e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 168 on 172 degrees of freedom
Multiple R-squared:  0.848,	Adjusted R-squared:  0.846 
F-statistic:  320 on 3 and 172 DF,  p-value: <2e-16
```

---
class: center, middle

# Reporting Logistic Results

---

## Example: Survival in the Donner Party

From *Statistical Sleuth*, Chapter 20 (p 602, 3e):

>In 1846 the Donner and Reed families left Springfield, Illinois, for California by covered wagon. In July, the Donner Party, as it became known, reached Fort Bridger, Wyoming. There its leaders decided to attempt a new and untested route to the Sacramento Valley. Having reached its full size of 87 people and 20 wagons, the party was delayed by a difficult crossing of the Wasatch Range and again in the crossing of the desert west of the Great Salt Lake. The group became stranded in the eastern Sierra Nevada mountains when the region was hit by heavy snows in late October. By the time the last survivor was rescued on April 21, 1847, 40 of the 87 members had died from famine and exposure to extreme cold.

.footnote[From *Statistical Sleuth*, Chapter 20 (p 602, 3e):]

---
## Load Donner data

```r
# load data
donner <- Sleuth3::case2001 %>% 
  clean_names()

# convert Survived/Died to ones/zeros
donner <- donner %>%
  mutate(
    survived_bin = as.numeric(status == "Survived"),
    # male = reference category
    sex = sex %>% fct_relevel("Male")
    )

# view data
head(donner)
```

```
  age    sex   status survived_bin
1  23   Male     Died            0
2  40 Female Survived            1
3  40   Male Survived            1
4  30   Male     Died            0
5  28   Male     Died            0
6  40   Male     Died            0
```

---
## Visualizing Donner data

.left-code[

```r
ggplot(data = donner) +
   aes(x = age,
       y = survived_bin) +
   geom_point(size = 3,
              alpha = 0.5)     
```
]

.right-plot[
<img src="week11_01_files/figure-html/donner_plot1-1.png" width="100%" style="display: block; margin: auto;" />
]

---
## Logistic regression in `R`

```r
# run logistic regression with family = binomial
glm_sex <- glm(survived_bin ~ sex,       family = binomial, data = donner)
glm_age <- glm(survived_bin ~ age,       family = binomial, data = donner)
glm_sa  <- glm(survived_bin ~ sex + age, family = binomial, data = donner)

glm_sa %>% broom::tidy() %>% kable(digits = 2)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 1.63 </td>
   <td style="text-align:right;"> 1.11 </td>
   <td style="text-align:right;"> 1.47 </td>
   <td style="text-align:right;"> 0.14 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sexFemale </td>
   <td style="text-align:right;"> 1.60 </td>
   <td style="text-align:right;"> 0.76 </td>
   <td style="text-align:right;"> 2.11 </td>
   <td style="text-align:right;"> 0.03 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> age </td>
   <td style="text-align:right;"> -0.08 </td>
   <td style="text-align:right;"> 0.04 </td>
   <td style="text-align:right;"> -2.10 </td>
   <td style="text-align:right;"> 0.04 </td>
  </tr>
</tbody>
</table>

---
## Comparing Donner models with `stargazer`

.regression12[

```r
# create regression table
stargazer(glm_sex, glm_age, glm_sa, type = 'html', header = FALSE, single.row = TRUE, omit.stat = c("aic", "ll"))
```

<table style="text-align:center"><tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3"><em>Dependent variable:</em></td></tr>
<tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left"></td><td colspan="3">survived_bin</td></tr>
<tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">sexFemale</td><td>1.386<sup>**</sup> (0.671)</td><td></td><td>1.597<sup>**</sup> (0.755)</td></tr>
<tr><td style="text-align:left">age</td><td></td><td>-0.066<sup>**</sup> (0.032)</td><td>-0.078<sup>**</sup> (0.037)</td></tr>
<tr><td style="text-align:left">Constant</td><td>-0.693<sup>*</sup> (0.387)</td><td>1.819<sup>*</sup> (0.999)</td><td>1.633 (1.110)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>45</td><td>45</td><td>45</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
</table>
]

---
## Explaining results

.large[
From *Statistical Sleuth*, (p 608, 3e)

The fit of a logistic regression model to the Donner Party data, where `$\pi$` represents survival probability, gives
$$
logit(\hat{\pi}) = 1.63 - 0.078 \times age + 1.60 \times female
$$
]

---

## What are odds & log(odds)?

$$ 
\textrm{odds}  = \frac{\textrm{\# favorable outcomes}}{\textrm{\# unfavorable outcomes}} = \frac{\textrm{p(success)}}{\textrm{p(failure) }} = \frac{p}{q} = \frac{p}{1-p} 
$$

$$
\textrm{log(odds)} =  \textrm{log}_e \left( \frac{p}{q} \right) = \textrm{log}_e \left( \frac{p}{1-p} \right)
$$

---

```r
p <- c(0.001, 0.01, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.999, 0.9999)
odds <- p/(1-p)
log_odds <- log(odds)
```
.pull-left[

```r
data.frame(p, odds, log_odds)
```

```
        p      odds log_odds
1  0.0010 1.001e-03  -6.9068
2  0.0100 1.010e-02  -4.5951
3  0.1500 1.765e-01  -1.7346
4  0.2000 2.500e-01  -1.3863
5  0.2500 3.333e-01  -1.0986
6  0.3000 4.286e-01  -0.8473
7  0.3500 5.385e-01  -0.6190
8  0.4000 6.667e-01  -0.4055
9  0.4500 8.182e-01  -0.2007
*10 0.5000 1.000e+00   0.0000
11 0.5500 1.222e+00   0.2007
12 0.6000 1.500e+00   0.4055
13 0.6500 1.857e+00   0.6190
14 0.7000 2.333e+00   0.8473
15 0.7500 3.000e+00   1.0986
*16 0.8000 4.000e+00   1.3863
17 0.8500 5.667e+00   1.7346
18 0.9000 9.000e+00   2.1972
19 0.9990 9.990e+02   6.9068
20 0.9999 9.999e+03   9.2102
```
]

.pull-left[

```r
# convert p to odds & log odds by hand
# if p = 0.5
.5/.5
log(.5/.5)

# if p = 0.8
.8/.2
log(.8/.2)

# if p = 0.75?

# if p = 0.20?

# if p = 0.25?

# if p = 0.90?
```
]

---
## Exercises

.large[
 - Compare women 50 years old with women 20 years old. Report the odds ratio for survival.

- Compare a woman (`female = 1`) with a man (`female = 0`) of the same age.

] 
$$
logit(\hat{\pi}) = 1.63 - 0.078 \times age + 1.60 \times female
$$

---
## Exercises

.large[
- Log odds of survival for women 50 years old 
]

```r
w50 <- 1.63 - 0.078 * 50 + 1.60 * 1
w50
```

```
[1] -0.67
```

.large[
- Log odds of survival for women 20 years old 
]

```r
w20 <- 1.63 - 0.078 * 20 + 1.60 * 1
w20
```

```
[1] 1.67
```

```r
exp(w50 - w20)
```

```
[1] 0.09633
```

---
## Exercises

.large[
- Can also be calculated as a simultaneous equation 
]

```r
exp(-0.078 * (50 - 20))
```

```
[1] 0.09633
```

.large[

- Summary: Odds of survival for 50 vs 20 years old is estimated to be `0.096`, or about 1/10. 
 
 - So the odds of a 20-year-old woman surviving were about 10 times the odds of a 50-year-old woman surviving. 
]

---
## How about male vs female at age 30

.large[
 - The estimated odds ratio of moving from male to female is `exp(1.60(1 - 0)) = 4.95`
 ]

```r
female_log_odds <- 1.63 - 0.078 * 30 + 1.60 * 1
female_log_odds
```

```
[1] 0.89
```

```r
male_log_odds   <- 1.63 - 0.078 * 30 + 1.60 * 0
male_log_odds
```

```
[1] -0.71
```

```r
exp(female_log_odds - male_log_odds)
```

```
[1] 4.953
```
 
.large[
 - A woman's odds of survival were about 5 times the odds of a man of the same age.
]

---
## How to report logistic results?

.vertical-center[

-  **Continuous `$X_1$`, binary `$Y$` **: with logistic regression: A one unit change in `$X_1$` is associated with/causes the odds that `$Y = 1$` to change by a multiplicative factor of exp `$(\beta_1)$`, holding all other variables constant.

- **Binary `$X_1$`, binary `$Y$` **: with logistic regression: moving from group 1 to group 2 is associated with/causes the odds that `$Y = 1$` to change by a multiplicative factor of exp `$(\beta_1)$`, holding all other variables constant.

]

---

## Comparing models

- For ANOVA with logistic regression, need to specify test = "Chisq"

```r
# for ANOVA with logistic regression, need to specify test = "Chisq"
*anova(glm_sex, glm_sa, test = "Chisq")
```

```
Analysis of Deviance Table

Model 1: survived_bin ~ sex
Model 2: survived_bin ~ sex + age
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1        43       57.3                       
2        42       51.3  1     6.03    0.014 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
## Creating tables with `stargazer` and `exp()`

.regression12[

```r
# create regression table
stargazer(glm_sex, glm_age, glm_sa, type = 'html', header = FALSE, single.row = TRUE, omit.stat = c("aic", "ll"), 
*  apply.coef = function(x) {exp(x)})
```

<table style="text-align:center"><tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3"><em>Dependent variable:</em></td></tr>
<tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left"></td><td colspan="3">survived_bin</td></tr>
<tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">sexFemale</td><td>4.000<sup>***</sup> (0.671)</td><td></td><td>4.940<sup>***</sup> (0.755)</td></tr>
<tr><td style="text-align:left">age</td><td></td><td>0.936<sup>***</sup> (0.032)</td><td>0.925<sup>***</sup> (0.037)</td></tr>
<tr><td style="text-align:left">Constant</td><td>0.500 (0.387)</td><td>6.163<sup>***</sup> (0.999)</td><td>5.120<sup>***</sup> (1.110)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>45</td><td>45</td><td>45</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
</table>
]

---

## Visualizing "effect" of sex with `sjPlot`

- The `sjPlot` package offers a `plot_model` command which offers a number of options to easily plot predicted probabilities from logistic regression models.

---

## Visualizing "effect" of sex and age with `sjPlot`

---

## Visualizing "effect" of age and sex as moderator

---
## Statistical Conclusion

>While it generally appears that older adults were less likely than younger adults to survive, the data provide highly suggestive evidence that the age effect is more pronounced for females than for males (the two-sided *p*-value for interaction of gender and age is 0.05). Ignoring the interaction leads to an informal conclusion about the overall difference between men and women: The data provide highly suggestive evidence that the survival probability was higher for women (in the Donner Party) than for the men (two-sided *p*-value = 0.02 from a drop-in-deviance test). The odds of survival for women are estimated to be 5 times the odds of survival for similarly aged men (95% confidence interval: 1.2 times to 25.2 times, from a drop-in-deviance-based confidence interval).

---

## Scope of Inference
>Since the data are observational, the result cannot be used as proof that women were more apt to survive than men; the possibility of confounding variables cannot be excluded. Furthermore, since the 45 individuals were not drawn at random from any population, inference to a broader population is not justified.

---
class: middle, center

# Questions?