class: center, middle, inverse, title-slide # POL90: Statistics ## Null Hypothesis, R-squared ### Prof. Wasow, PoliticsPomona College ### 2022-03-22 --- <style type="text/css"> .regression10 table { font-size: 10px; } .regression12 table { font-size: 12px; } .regression14 table { font-size: 14px; } </style> # Announcements .large[ * Assignments + PS07 + Report 2 ] -- .large[ * Statistical Sleuth + Read Chapter 8 + Supplement - http://appliedstats.org/chapter8.html ] --- # Schedule <table> <thead> <tr> <th style="text-align:right;"> Week </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Day </th> <th style="text-align:left;"> Title </th> <th style="text-align:right;"> Chapter </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Mar 7 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Simple Linear Regression </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Mar 9 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Regression by Calculation </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Mar 14 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Spring Recess </td> <td style="text-align:right;"> - </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Mar 16 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Spring Recess </td> <td style="text-align:right;"> - </td> </tr> <tr> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 10 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Mar 21 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Mon </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Null hypothesis, R-squared </td> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 8 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> Mar 23 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Multiple regression </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> Mar 28 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Interaction terms </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> Mar 30 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Interaction terms </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Apr 4 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Logistic regression </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Apr 6 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Logistic regression </td> <td style="text-align:right;"> 20 </td> </tr> </tbody> </table> --- ## Assignment schedule <table> <thead> <tr> <th style="text-align:right;"> Week </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Day </th> <th style="text-align:left;"> Assignment </th> <th style="text-align:right;"> Percent </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Mar 11 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS06 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Mar 18 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> Spring break </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 10 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Mar 25 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Fri </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> PS07 </td> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> Apr 1 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS08 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Apr 8 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> Report2 </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> Apr 15 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS09 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> Apr 22 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS10 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 15 </td> <td style="text-align:left;"> Apr 29 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> Report3 </td> <td style="text-align:right;"> 10 </td> </tr> </tbody> </table> --- class: middle, center # Report 1: Did Conant v. Walters (2002) # influence public opinion in # favor of legalization? --- ## Akhter, Mandel, Park & Zhai (2022) <img src="images/Akhter_Mandel_Park_Zhai_1.png" width="856" style="display: block; margin: auto;" /> --- ## Akhter, Mandel, Park & Zhai (2022) <img src="images/Akhter_Mandel_Park_Zhai_1.png" width="856" style="display: block; margin: auto;" /> --- ## Akhter, Mandel, Park & Zhai (2022) <img src="images/Akhter_Mandel_Park_Zhai_2.png" width="858" style="display: block; margin: auto;" /> --- ## Akhter, Mandel, Park & Zhai (2022) <img src="images/Akhter_Mandel_Park_Zhai_3.png" width="763" style="display: block; margin: auto;" /> --- ## Akhter, Mandel, Park & Zhai (2022) <img src="images/Akhter_Mandel_Park_Zhai_4.png" width="763" style="display: block; margin: auto;" /> --- ## Akhter, Mandel, Park & Zhai (2022) <img src="images/Akhter_Mandel_Park_Zhai_5.png" width="856" style="display: block; margin: auto;" /> --- ## Akhter, Mandel, Park & Zhai (2022) <img src="images/Akhter_Mandel_Park_Zhai_6.png" width="866" style="display: block; margin: auto;" /> --- class: center, middle # Permutation for intuition about # hypothesis testing with regression --- ## Revisiting sampling and assignment <img src="images/randomization_selection_assignment.jpg" width="90%" style="display: block; margin: auto;" /> .pull-right[ .footnote[Source: *Statistical Sleuth*, Display 1.5] ] --- ## Revisiting sampling and assignment <img src="images/randomization_selection_assignment_markedup.jpg" width="90%" style="display: block; margin: auto;" /> .pull-right[ .footnote[Source: *Statistical Sleuth*, Display 1.5] ] --- ## Island Area & Species .vertical-center[ * Does reducing the area of wild land reduce ecological diversity? * E.O. Wilson offered a conservatively optimistic guess that the current rate of environmental destruction amounts to 27,000 species vanishing each year and that 20% of all plant and animal species currently on earth will be extinct by the year 2022. * Biologists have noticed a consistent relation between the area of islands and the number of animal and plant species living on them. If `\(S\)` is the number of species and `\(A\)` is the area, then `\(S = CA^{\gamma}\)` (roughly), where `\(C\)` is a constant and `\(\gamma\)` is a biologically meaningful parameter that depends on the group of organisms (birds, reptiles, or grasses, for example). ] --- ## Island Area & Species ```r habitat <- Sleuth3::case0801 %>% clean_names() habitat %>% kable() %>% kable_styling(font_size = 16, full_width = FALSE) ``` <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> area </th> <th style="text-align:right;"> species </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 44218 </td> <td style="text-align:right;"> 100 </td> </tr> <tr> <td style="text-align:right;"> 29371 </td> <td style="text-align:right;"> 108 </td> </tr> <tr> <td style="text-align:right;"> 4244 </td> <td style="text-align:right;"> 45 </td> </tr> <tr> <td style="text-align:right;"> 3435 </td> <td style="text-align:right;"> 53 </td> </tr> <tr> <td style="text-align:right;"> 32 </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 11 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 7 </td> </tr> </tbody> </table> --- ## Island Area & Species .left-code[ ```r habitat %>% ggplot() + aes(x = area, y = species) + geom_point() + geom_smooth( method = 'loess', se = FALSE, * span = 2 ) ``` ] .right-plot[ <img src="week09_01_files/figure-html/habitat_unlogged_plot-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Island Area & Species (logged) .left-code[ ```r habitat <- habitat %>% mutate( log_area = log(area), log_species = log(species) ) habitat %>% ggplot() + aes(x = log_area, y = log_species) + geom_point() + geom_smooth( method = 'lm', se = FALSE ) ``` ] .right-plot[ <img src="week09_01_files/figure-html/habitat_logged_plot-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Island Area & Species (logged in ggplot) .left-code[ ```r habitat %>% ggplot() + aes(x = area, y = species) + geom_point() + geom_smooth( method = 'lm', se = FALSE ) + * scale_y_log10() + * scale_x_log10() ``` ] .right-plot[ <img src="week09_01_files/figure-html/habitat_gglogged_plot-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Island Area & Species .vertical-center[ * The parameter `\(\gamma\)` in the species-area relation, Median `\(\{S|A\}= CA^{\gamma}\)`, is estimated to be 0.250 (a 95% confidence interval is 0.219 to 0.281). <mark>It is estimated that the median number of species increases by 19% with each doubling of area.</mark> * Recall from rules of logarithms that if we take the log of both sides of `\(S= CA^{\gamma}\)`, we'll have `\(log(S) = \gamma \cdot (C + log(A)) = \gamma C + \gamma log(A)\)` * The statistical association from <mark>these observational data cannot be used to establish a causal connection.</mark> Furthermore, any generalization of these results to islands in other parts of the world or to a wider population, like "islands of rain forest", is purely speculative. Nevertheless, the results from these data are consistent with results for other islands and with small-scale randomized experiments.] --- ```r lm_out <- lm(log_species ~ log_area, data = habitat) stargazer(lm_out, type = 'html', header = FALSE, single.row = TRUE) ``` <table style="text-align:center"><tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td><em>Dependent variable:</em></td></tr> <tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr> <tr><td style="text-align:left"></td><td>log_species</td></tr> <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">log_area</td><td>0.250<sup>***</sup> (0.012)</td></tr> <tr><td style="text-align:left">Constant</td><td>1.937<sup>***</sup> (0.088)</td></tr> <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>7</td></tr> <tr><td style="text-align:left">R<sup>2</sup></td><td>0.988</td></tr> <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.986</td></tr> <tr><td style="text-align:left">Residual Std. Error</td><td>0.128 (df = 5)</td></tr> <tr><td style="text-align:left">F Statistic</td><td>425.300<sup>***</sup> (df = 1; 5)</td></tr> <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr> </table> --- ## How did we get from `\(\beta_1\)` = 0.250 to 19%? <img src="images/statistical_style_guide_screen_grab1.png" width="90%" style="display: block; margin: auto;" /> --- ## How did we get from `\(\beta_1\)` = 0.250 to 19%? <img src="images/statistical_style_guide_screen_grab2.png" width="100%" style="display: block; margin: auto;" /> --- ## Interpretation after log transformations * Where possible, best to back-transform coefficients into original units. * From [*The Statistical Sleuth*, 3ed](http://www.statisticalsleuth.com), "Interpretation After Log Transformations," (p 216). * When Both the Response and Explanatory Variables are Logged: - If `\(\mu\{\textrm{log}(Y)|\textrm{log}(X)\} = \beta_0 + \beta_1\textrm{log}(X)\)`, - then Median `\(\{Y|X\} = \textrm{exp}(\beta_0)X^{\beta_1}\)` - A doubling of `\(X\)` is associated with a multiplicative change of `\(2^{\beta_1}\)` in the median of `\(Y\)`. Or, a ten-fold increase in `\(X\)` is associated with a `\(10^{\beta_1}\)`-fold change in the median of `\(Y\)`. ```r 2^0.250 ``` ``` [1] 1.189 ``` --- class: center, middle # Null hypothesis with linear regression --- ## What is the null hypothesis with linear regression? - Imagine slope if no relationship btwn Species & Area .left-code[ ```r habitat %>% ggplot() + aes(x = area, y = species) + geom_point() + geom_smooth( method = 'lm', se = FALSE ) + scale_y_log10() + scale_x_log10() + geom_hline( yintercept = 30, linetype = 2 ) ``` ] .right-plot[ <img src="week09_01_files/figure-html/habitat_gglogged_plot2-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Permute the data, sort by area ```r # use infer to generate a permutation of the original data habitat_permuted <- habitat %>% specify(log_species ~ log_area) %>% hypothesize(null = "independence") %>% generate(reps = 1, type = "permute") ``` .pull-left[ ### Original data <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> log_species </th> <th style="text-align:right;"> log_area </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4.605 </td> <td style="text-align:right;"> 10.697 </td> </tr> <tr> <td style="text-align:right;"> 4.682 </td> <td style="text-align:right;"> 10.288 </td> </tr> <tr> <td style="text-align:right;"> 3.807 </td> <td style="text-align:right;"> 8.353 </td> </tr> <tr> <td style="text-align:right;"> 3.970 </td> <td style="text-align:right;"> 8.142 </td> </tr> <tr> <td style="text-align:right;"> 2.773 </td> <td style="text-align:right;"> 3.466 </td> </tr> <tr> <td style="text-align:right;"> 2.398 </td> <td style="text-align:right;"> 1.609 </td> </tr> <tr> <td style="text-align:right;"> 1.946 </td> <td style="text-align:right;"> 0.000 </td> </tr> </tbody> </table> ] .pull-right[ ### Permuted data <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> log_species </th> <th style="text-align:right;"> log_area </th> <th style="text-align:right;"> replicate </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 3.970 </td> <td style="text-align:right;"> 10.697 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 4.682 </td> <td style="text-align:right;"> 10.288 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 2.773 </td> <td style="text-align:right;"> 8.353 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 1.946 </td> <td style="text-align:right;"> 8.142 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 3.807 </td> <td style="text-align:right;"> 3.466 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 4.605 </td> <td style="text-align:right;"> 1.609 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 2.398 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table> ] --- ## Plot one permutation of the data. .left-code[ ```r habitat_permuted %>% ggplot() + aes(x = log_area, y = log_species) + geom_point() + geom_smooth( method = 'lm', se = FALSE ) ``` ] .right-plot[ <img src="week09_01_files/figure-html/habitat_permuted_plot-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Generate two permutations ```r habitat_permuted <- habitat %>% specify(log_species ~ log_area) %>% hypothesize(null = "independence") %>% generate(reps = 2, type = "permute") ``` .pull-left[ ```r habitat_permuted %>% filter(replicate == 1) ``` ``` # A tibble: 7 × 3 # Groups: replicate [1] log_species log_area replicate <dbl> <dbl> <int> 1 2.40 10.7 1 2 3.97 10.3 1 3 4.68 8.35 1 4 3.81 8.14 1 5 2.77 3.47 1 6 1.95 1.61 1 7 4.61 0 1 ``` ] .pull-right[ ```r habitat_permuted %>% filter(replicate == 2) ``` ``` # A tibble: 7 × 3 # Groups: replicate [1] log_species log_area replicate <dbl> <dbl> <int> 1 1.95 10.7 2 2 2.40 10.3 2 3 3.97 8.35 2 4 4.68 8.14 2 5 2.77 3.47 2 6 3.81 1.61 2 7 4.61 0 2 ``` ] --- ## Plot two permutations .left-code[ ```r habitat_permuted %>% ggplot() + aes(x = log_area, y = log_species, color = factor(replicate)) + geom_jitter( width = .1, height = .1, alpha = .3) + geom_smooth( method = 'lm', se = FALSE) ``` ] .right-plot[ <img src="week09_01_files/figure-html/habitat_permuted_plot5_2-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Generate five permutations ```r habitat_permuted <- habitat %>% specify(log_species ~ log_area) %>% hypothesize(null = "independence") %>% generate(reps = 5, type = "permute") ``` .pull-left[ ```r habitat_permuted %>% filter(replicate == 4) ``` ``` # A tibble: 7 × 3 # Groups: replicate [1] log_species log_area replicate <dbl> <dbl> <int> 1 3.97 10.7 4 2 4.68 10.3 4 3 3.81 8.35 4 4 2.40 8.14 4 5 4.61 3.47 4 6 2.77 1.61 4 7 1.95 0 4 ``` ] .pull-right[ ```r habitat_permuted %>% filter(replicate == 5) ``` ``` # A tibble: 7 × 3 # Groups: replicate [1] log_species log_area replicate <dbl> <dbl> <int> 1 4.61 10.7 5 2 2.40 10.3 5 3 3.81 8.35 5 4 4.68 8.14 5 5 1.95 3.47 5 6 2.77 1.61 5 7 3.97 0 5 ``` ] --- ## Plot five permutations .left-code[ ```r habitat_permuted %>% ggplot() + aes(x = log_area, y = log_species, color = factor(replicate)) + geom_jitter( width = .1, height = .1, alpha = .3) + geom_smooth( method = 'lm', se = FALSE) ``` ] .right-plot[ <img src="week09_01_files/figure-html/habitat_permuted_plot5_5-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Plot ten permutations .left-code[ ```r habitat_permuted <- habitat %>% specify(log_species ~ log_area) %>% hypothesize(null = "independence") %>% generate(reps = 10, type = "permute") habitat_permuted %>% ggplot() + aes(x = log_area, y = log_species, color = factor(replicate)) + geom_jitter( width = .1, height = .1, alpha = .3) + geom_smooth( method = 'lm', se = FALSE) ``` ] .right-plot[ <img src="week09_01_files/figure-html/habitat_permuted_plot5_10-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Plot twenty permutations .left-code[ ```r habitat_permuted <- habitat %>% specify(log_species ~ log_area) %>% hypothesize(null = "independence") %>% generate(reps = 20, type = "permute") habitat_permuted %>% ggplot() + aes(x = log_area, y = log_species, color = factor(replicate)) + geom_jitter( width = .1, height = .1, alpha = .3) + geom_smooth( method = 'lm', se = FALSE) ``` ] .right-plot[ <img src="week09_01_files/figure-html/habitat_permuted_plot5_20-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Plot forty permutations .left-code[ ```r habitat_permuted <- habitat %>% specify(log_species ~ log_area) %>% hypothesize(null = "independence") %>% generate(reps = 40, type = "permute") habitat_permuted %>% ggplot() + aes(x = log_area, y = log_species, color = factor(replicate)) + geom_jitter( width = .1, height = .1, alpha = .3) + geom_smooth( method = 'lm', se = FALSE) ``` ] .right-plot[ <img src="week09_01_files/figure-html/habitat_permuted_plot5_40-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## What happens without `factor(replicate)`? .left-code[ ```r habitat_permuted <- habitat %>% specify(log_species ~ log_area) %>% hypothesize(null = "independence") %>% generate(reps = 40, type = "permute") habitat_permuted %>% ggplot() + aes(x = log_area, y = log_species, color = replicate) + geom_jitter( width = .1, height = .1, alpha = .3) + geom_smooth( method = 'lm', se = FALSE) ``` ] .right-plot[ <img src="week09_01_files/figure-html/habitat_permuted_plot5_40_no_factor-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## View 10 slopes for permuted data ```r # use infer to permute the data and calculate one slope habitat_permuted %>% calculate(stat = "slope") %>% head(10) ``` ``` Response: log_species (numeric) Explanatory: log_area (numeric) Null Hypothesis: independence # A tibble: 10 × 2 replicate stat <int> <dbl> 1 1 -0.0489 2 2 0.0550 3 3 0.0583 4 4 -0.0108 5 5 -0.205 6 6 -0.0852 7 7 -0.0928 8 8 0.128 9 9 -0.0521 10 10 -0.0102 ``` --- ## Estimate 2,000 slopes ```r habitat_permuted <- habitat %>% specify(log_species ~ log_area) %>% hypothesize(null = "independence") %>% generate(reps = 2000, type = "permute") %>% calculate(stat = "slope") head(habitat_permuted, 10) ``` ``` Response: log_species (numeric) Explanatory: log_area (numeric) Null Hypothesis: independence # A tibble: 10 × 2 replicate stat <int> <dbl> 1 1 0.160 2 2 0.00963 3 3 0.117 4 4 0.0290 5 5 0.217 6 6 -0.0613 7 7 -0.0555 8 8 -0.0239 9 9 -0.0982 10 10 0.0519 ``` --- ## Plot 2,000 slopes .left-code[ ```r habitat_permuted %>% ggplot() + aes(x = stat) + geom_histogram(bins = 20) ``` ] .right-plot[ <img src="week09_01_files/figure-html/habitat_permuted_hist_plot-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Compare observed slope to permuted data slopes ```r broom::tidy(lm_out) %>% kable() %>% kable_styling(font_size = 16, full_width = FALSE) ``` <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 1.9365 </td> <td style="text-align:right;"> 0.0881 </td> <td style="text-align:right;"> 21.97 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> log_area </td> <td style="text-align:right;"> 0.2497 </td> <td style="text-align:right;"> 0.0121 </td> <td style="text-align:right;"> 20.62 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> ```r observed_coef <- lm_out$coefficients[2] observed_coef ``` ``` log_area 0.2497 ``` --- ## How extreme is our observed slope if null is true? ```r habitat_permuted$stat %>% head() ``` ``` log_area log_area log_area log_area log_area log_area 0.159894 0.009632 0.116808 0.029049 0.216536 -0.061322 ``` ```r # how many permuted data slopes are as extreme # or more extreme than our observed slope? number_extreme <- sum(abs(habitat_permuted$stat) >= abs(observed_coef)) number_extreme ``` ``` [1] 0 ``` ```r # what percent of permuted data slopes are as extreme # or more extreme than our observed slope? options(scipen = 10) # set a penalty for scientific notation number_extreme / 2000 ``` ``` [1] 0 ``` --- ## Visualize observed slope vs permuted data slopes .left-code[ ```r # plot with infer::visualize habitat_permuted %>% visualize(bins = 30) + geom_vline( xintercept = c(-observed_coef, observed_coef), col = "red") ``` ] .right-plot[ <img src="week09_01_files/figure-html/habitat_permuted_hist_ci_plot-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Summarizing bootstrapping * With bootstrapping, we shuffle the composition of the sample but keep the relationship between key variables the same for each observation (e.g., data within rows remains intact). - We simulate different samples to generate a range of plausible estimates of the relationship between `\(X\)` and `\(Y\)`. - We're asking "how might our estimate differ with a slightly different sample?" - This generates a simulation-based *sampling distribution* --- ## Summarizing regression permutation test * With permutation or randomization tests, we keep the original sample but assume no true relationship between key variables (e.g., we shuffle data within a column or between treated and control groups). - We simulate plausible alternate relationships between some `\(X\)` and `\(Y\)` under the null hypothesis of no relationship. - We're asking "under the assumption that there is no relationship between our *x* and our *y*, how extreme is our estimate?" - This generates a simulation-based *null distribution* --- ## In summary <br><br> .large[ - What is a slope of zero? ] -- .large[ - Why is a null hypothesis with OLS a slope of zero? ] -- .large[ - What is a *p*-value in regression? ] --- class: center, middle # Chapter 8: R-squared --- ## Components of Sums of Squares <img src="images/components_sums_of_squares.png" width="70%" style="display: block; margin: auto;" /> --- ## Sums of Squares * Total Sum of Squares: how much variance in `\(Y_i\)` is there to explain? `\begin{align*} \mathrm{TSS} & = \sum_{i=1}^N ( Y_i - \bar{Y})^2 \end{align*}` * Estimated Sum of Squares: how much this variance do we explain? `\begin{align*} \mathrm{ESS} & = \sum_{i=1}^N ( \hat{Y_i} - \bar{Y})^2 \end{align*}` * Residual Sum of Squares: how much variance is unexplained? `\begin{align*} \mathrm{RSS} & = \sum_{i=1}^N ( Y_i - \hat{Y_i})^2 \end{align*}` --- ## `\(R^2\)`: The Proportion of Variation Explained .vertical-center[ * <span style="color:red"> `\(R^2\)` </span>: The Proportion of Variation Explained * Definition: The `\(R^2\)` statistic, or coefficient of determination, is the percentage of the total response explained by the explanatory variable. `\begin{align*} R^2 & = 100 \left( \frac{\textrm{Total Sum of Squares - Residual Sum of Squares} } {\textrm{Total Sum of Squares}} \right) \% \end{align*}` ] --- ## `\(R^2\)`: The Proportion of Variation Explained .vertical-center[ * Interpretation: - A judgement about what constitutes "good" values for `\(R^2\)` depends on the context of the study. In precise laboratory work, `\(R^2\)` values under 90% may be low enough to require refinements in technique or the inclusion of other explanatory information. In some social science contexts, a low `\(R^2\)` may be considered remarkably good - `\(R^2\)` should not be used for inference and it should not be used to assess the adequacy of the straight-line model, because `\(R^2\)` can be quite large even when the simple linear regression model is inadequate ] --- ## Simulate data to assess `\(R^2\)` as "goodness of fit" ```r # generate a y with a linear relationship to x set.seed(321) n <- 1000 x <- rnorm(n) hist(x) ``` <img src="week09_01_files/figure-html/goodness_of_fit-1.png" width="45%" style="display: block; margin: auto;" /> --- ## Simulate data to assess `\(R^2\)` as "goodness of fit" ```r # generate y with linear relationship to x y_linear <- x + rnorm(n) lm_example1 <- lm(y_linear ~ x) plot(y_linear ~ x) ``` <img src="week09_01_files/figure-html/unnamed-chunk-26-1.png" width="45%" style="display: block; margin: auto;" /> --- ## Simulate data to assess `\(R^2\)` as "goodness of fit" ```r # generate y with non-linear relationship to x y_nonlinear <- exp(x) + rnorm(n) lm_example2 <- lm(y_nonlinear ~ x) plot(y_nonlinear ~ x) ``` <img src="week09_01_files/figure-html/unnamed-chunk-27-1.png" width="45%" style="display: block; margin: auto;" /> --- ## Does a higher `\(R^2\)` mean better "goodness of fit"? <img src="week09_01_files/figure-html/unnamed-chunk-28-1.png" width="720" style="display: block; margin: auto;" /> --- ## Really important to plot your data .regression10[ ```r stargazer(lm_example1, lm_example2, type = 'html', font.size = "small", omit.stat = c("f", "ser")) ``` <table style="text-align:center"><tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="2"><em>Dependent variable:</em></td></tr> <tr><td></td><td colspan="2" style="border-bottom: 1px solid black"></td></tr> <tr><td style="text-align:left"></td><td>y_linear</td><td>y_nonlinear</td></tr> <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td></tr> <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">x</td><td>0.996<sup>***</sup></td><td>1.666<sup>***</sup></td></tr> <tr><td style="text-align:left"></td><td>(0.032)</td><td>(0.048)</td></tr> <tr><td style="text-align:left"></td><td></td><td></td></tr> <tr><td style="text-align:left">Constant</td><td>0.016</td><td>1.619<sup>***</sup></td></tr> <tr><td style="text-align:left"></td><td>(0.032)</td><td>(0.048)</td></tr> <tr><td style="text-align:left"></td><td></td><td></td></tr> <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>1,000</td><td>1,000</td></tr> <tr><td style="text-align:left">R<sup>2</sup></td><td>0.487</td><td>0.547</td></tr> <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.487</td><td>0.547</td></tr> <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="2" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr> </table> ] --- ## Adjusted `\(R^2\)` * Adjusted `\(R^2\)`: In theory, using an infinite number of explanatory variables to explain the change in a response variable would result in an `\(R^2\)` of one * The Adjusted `\(R^2\)` value is an attempt to correct this shortcoming by adjusting both the numerator and the denominator by their respective degrees of freedom `\begin{align*} \textrm{Adjusted } R^2 & = 1 - \left( \frac{ \frac{RSS}{(n - p - 1)} } { \frac{TSS}{n-1}}\right) \\ & = 1 - (1 - R^2) \left( \frac{n-1}{n - p -1} \right) \end{align*}` where `\(p\)` is the number of explanatory variables --- ## Adjusted `\(R^2\)` .vertical-center[ * Unlike the `\(R^2\)`, the adjusted `\(R^2\)` can decline in value if the contribution to the explained deviation by the additional variable is less than the impact on the degrees of freedom * Note, however, that while the `\(R^2\)` is a percent, the adjusted `\(R^2\)` is not and should not be referred to as a proportion] --- ## How to report `\(R^2\)`? .large[ - Will depend on discipline but in political science increasingly *not* reported in prose or plots unless there's something noteworthy. Standad in tables. - Can be useful for comparing very similar models with exact same data. Like ANOVA, we can note that adding one term significantly increases `\(R^2\)` (or not) - Adjusted `\(R^2\)` can be included in tables but rarely discussed - Many other measures of models beyond scope of this class - Can convert other measures to "Pseudo `\(R^2\)`" with packages like `DescTools`. For more, see https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/ ] --- ## Let's play <img src="images/guess_the_correlation.png" width="100%" style="display: block; margin: auto auto auto 0;" /> .footnote[http://guessthecorrelation.com] --- class: center, middle # Questions?