class: center, middle, inverse, title-slide # POL90: Statistics ### Prof Wasow
Assistant Professor, Politics
Pomona College ### 2022-02-14 --- # Announcements .large[ * Assignments + PS04 due <mark>Friday</mark> + Report 1 ] -- .large[ * Statistical Sleuth + Important to read along + Skim Chapter 4 ] --- # Schedule <table> <thead> <tr> <th style="text-align:right;"> Week </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Day </th> <th style="text-align:left;"> Title </th> <th style="text-align:right;"> Chapter </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Jan 31 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Inference Using t-Distributions </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Feb 2 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Inference Using t-Distributions </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Feb 7 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Inference Using t-Distributions </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Feb 9 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Inference Using t-Distributions </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 5 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Feb 14 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Mon </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> A Closer Look at Assumptions </td> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> 
<td style="text-align:left;"> Feb 16 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> A Closer Look at Assumptions </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> Feb 21 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Alternatives to the t-Tools </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> Feb 23 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Alternatives to the t-Tools </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> Feb 28 </td> <td style="text-align:left;"> Mon </td> <td style="text-align:left;"> Comparison Among Several Samples </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> Mar 2 </td> <td style="text-align:left;"> Wed </td> <td style="text-align:left;"> Comparison Among Several Samples </td> <td style="text-align:right;"> 5 </td> </tr> </tbody> </table> --- ## Assignment schedule <table> <thead> <tr> <th style="text-align:right;"> Week </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Day </th> <th style="text-align:left;"> Assignment </th> <th style="text-align:right;"> Percent </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Feb 11 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS03 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 5 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Feb 18 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Fri </td> <td 
style="text-align:left;color: black !important;background-color: yellow !important;"> PS04 </td> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> Feb 25 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS05 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 7 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Mar 4 </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Fri </td> <td style="text-align:left;color: black !important;background-color: yellow !important;"> Report1 </td> <td style="text-align:right;color: black !important;background-color: yellow !important;"> 6 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Mar 11 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS06 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Mar 18 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> Spring break </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> Mar 25 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS07 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> Apr 1 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS08 </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Apr 8 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> Report2 </td> <td style="text-align:right;"> 8 
</td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> Apr 15 </td> <td style="text-align:left;"> Fri </td> <td style="text-align:left;"> PS09 </td> <td style="text-align:right;"> 3 </td> </tr> </tbody> </table> --- class: center, middle, inverse # Sports News --- <img src="images/nathan_chen.png" width="55%" style="display: block; margin: auto;" /> --- class: center, middle, inverse # Review --- class: center, middle, inverse # Revisiting the Standard Error & # Pooled Sample Standard Deviation --- ## Are Population Dist. and Sampling Dist. the Same? <img src="images/ss_display_2_4_blank_1_2_3_mu.png" width="80%" style="display: block; margin: auto;" /> .footnote[Source: Statistical Sleuth 3e, Display 2.4] --- ## Are Population Mean and Sampling Average the Same? <img src="images/ss_display_2_4_blank_1_2_3_mu.png" width="80%" style="display: block; margin: auto;" /> .footnote[Source: Statistical Sleuth 3e, Display 2.4] --- ## Are Population Mean and Sampling Average the Same? <img src="images/ss_display_2_4_blank_2_3.png" width="80%" style="display: block; margin: auto;" /> .footnote[Source: Statistical Sleuth 3e, Display 2.4] --- ## StatKey: Sampling Distribution of the Mean <img src="images/statkey_sampling_dist_mean_home.png" width="100%" style="display: block; margin: auto;" /> .footnote[https://www.lock5stat.com/StatKey/index.html] --- ## StatKey: Sampling Distribution of the Mean <img src="images/statkey_sampling_dist_mean_baseball.png" width="90%" style="display: block; margin: auto;" /> .footnote[https://www.lock5stat.com/StatKey/index.html] --- ## Are Population `\(\sigma\)` and Sampling Dist. SD the Same? <img src="images/ss_display_2_4_blank_2_3.png" width="80%" style="display: block; margin: auto;" /> .footnote[Source: Statistical Sleuth 3e, Display 2.4] --- ## Are Population `\(\sigma\)` and Sampling Dist. SD the Same? 
<img src="images/ss_display_2_4_blank_3_question_150.png" width="80%" style="display: block; margin: auto;" /> .footnote[Source: Statistical Sleuth 3e, Display 2.4] --- ## Standard Error for a Sample Average .vertical-center[ \begin{aligned}\text{SE}(\bar{Y})&=\dfrac{s}{\sqrt{n}}\newline \text{d.f.}&=(n-1)\end{aligned}] --- ## Why do we pool SD with two samples? <img src="images/ss_display_2_7.png" width="65%" style="display: block; margin: auto;" /> .footnote[Source: Statistical Sleuth 3e, Display 2.7] --- ## <mark>Standard Error</mark> for the Difference .vertical-center[ `\begin{eqnarray*} \textrm{SD}(\bar{Y}_2 - \bar{Y}_1) & = & \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} } \end{eqnarray*}` * Under the assumption of equal variance (i.e., `\(\sigma_1 = \sigma_2\)`), the formula reduces to `\begin{eqnarray*} \textrm{SE}(\bar{Y}_2 - \bar{Y}_1) & = & s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2} } \end{eqnarray*}` ] --- ## <mark>Pooled Standard Deviation</mark> for Two Samples .vertical-center[ `\begin{eqnarray*} s_p & = & \sqrt{\frac{(n_1 -1)s^2_1 + (n_2-1)s^2_2}{(n_1 + n_2 -2)} }, \\ \textrm{d.f.} & = & (n_1 + n_2 - 2) \end{eqnarray*}` ] --- ## How to think about CI with repeated samples? - `\(\gamma\)` is the parameter describing space curvature - Estimates and confidence intervals for `\(\gamma\)`, the deflection of light around the sun, from 20 experiments <img src="images/ss_display_2_13.png" width="60%" style="display: block; margin: auto;" /> .footnote[Source: Statistical Sleuth 3e, Section 2.5.2, Display 2.13] --- class: center, middle, inverse # Chapter 3 --- ## Case 1: Cloud seeding experiment .large[ * On each of 52 days that were deemed suitable for cloud seeding, a random mechanism was used to decide whether to seed the target cloud on that day or to leave it unseeded as a control. 
* An airplane flew through the cloud in both cases, since the experimenters and the pilot were themselves unaware of whether on any particular day the seeding mechanism in the plane was loaded or not (that is, they were blind to the treatment). * Precipitation was measured as the total rain volume falling from the cloud base following the airplane seeding run, as measured by radar. * Did cloud seeding have an effect on rainfall in this experiment? ] --- ## Case 1: Cloud seeding experiment .large[ Data were collected in southern Florida between 1968 and 1972 to test a hypothesis that massive injection of silver iodide into cumulus clouds can lead to increased rainfall. (Data from J. Simpson, A. Olsen, and J. Eden, "A Bayesian Analysis of a Multiplicative Treatment Effect in Weather Modification," *Technometrics* 17 (1975): 161-66.) ] ```r cloud <- Sleuth3::case0301 %>% clean_names() head(cloud, 4) ``` ``` rainfall treatment 1 1202.6 Unseeded 2 830.1 Unseeded 3 372.4 Unseeded 4 345.5 Unseeded ``` --- ## Case 1: Cloud seeding experiment .left-code[ ```r cloud <- Sleuth3::case0301%>% clean_names() ggplot(data = cloud) + aes(x = treatment) + aes(y = rainfall) + geom_boxplot() ``` ] .right-plot[ <img src="week05_01_files/figure-html/cloud_seeding_plot1-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Cloud seeding experiment (logged) .left-code[ ```r cloud <- Sleuth3::case0301 %>% clean_names() ggplot(data = cloud) + aes(x = treatment) + aes(y = rainfall) + geom_boxplot() + * scale_y_log10() # log scale ``` ] .right-plot[ <img src="week05_01_files/figure-html/cloud_seeding_log_plot1-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Cloud seeding experiment (ridgeplot) .left-code[ ```r suppressMessages( # extends ggplot * library(ggridges) ) ggplot(data = cloud) + aes(x = rainfall) + aes(y = treatment) + # new geom * geom_density_ridges( * stat = "binline") ``` ] .right-plot[ <img 
src="week05_01_files/figure-html/cloud_seeding_density_plot-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Cloud seeding experiment (ridgeplot logged) .left-code[ ```r suppressMessages( library(ggridges) ) ggplot(data = cloud) + aes(x = rainfall) + aes(y = treatment) + geom_density_ridges( stat = "binline") + * scale_x_log10() # log scale ``` ] .right-plot[ <img src="week05_01_files/figure-html/cloud_seeding_density_log_plot-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Cloud seeding *t*-test ```r t.test(rainfall ~ treatment, var.equal = TRUE, data = cloud) ``` ``` Two Sample t-test data: rainfall by treatment t = 1.9982, df = 50, p-value = 0.05114 alternative hypothesis: true difference in means between group Seeded and group Unseeded is not equal to 0 95 percent confidence interval: -1.431856 556.224164 sample estimates: mean in group Seeded mean in group Unseeded 441.9846 164.5885 ``` --- ## Cloud seeding *t*-test ```r t.test(rainfall ~ treatment, var.equal = TRUE, data = cloud) ``` ``` Two Sample t-test data: rainfall by treatment *t = 1.9982, df = 50, p-value = 0.05114 alternative hypothesis: true difference in means between group Seeded and group Unseeded is not equal to 0 95 percent confidence interval: -1.431856 556.224164 sample estimates: mean in group Seeded mean in group Unseeded 441.9846 164.5885 ``` --- ## How should we interpret this *t*-test? ```r t.test(rainfall ~ treatment, var.equal = TRUE, data = cloud) ``` ``` Two Sample t-test data: rainfall by treatment t = 1.9982, df = 50, p-value = 0.05114 alternative hypothesis: true difference in means between group Seeded and group Unseeded is not equal to 0 95 percent confidence interval: -1.431856 556.224164 sample estimates: mean in group Seeded mean in group Unseeded * 441.9846 164.5885 ``` --- ## How should we interpret this *t*-test? 
```r t.test(rainfall ~ treatment, var.equal = TRUE, data = cloud) ``` ``` Two Sample t-test data: rainfall by treatment t = 1.9982, df = 50, p-value = 0.05114 alternative hypothesis: true difference in means between group Seeded and group Unseeded is not equal to 0 95 percent confidence interval: * -1.431856 556.224164 sample estimates: mean in group Seeded mean in group Unseeded * 441.9846 164.5885 ``` --- ## Interpreting the size of a *p*-value <br> <br> <img src="images/ss_display_2_12.png" width="522" style="display: block; margin: auto;" /> .footnote[Source: Statistical Sleuth 3e, Display 2.12] --- ## Cloud seeding *t*-test with log ```r *t.test(I(log(rainfall)) ~ treatment, # I() = Isolate var.equal = TRUE, data = cloud) ``` ``` Two Sample t-test data: I(log(rainfall)) by treatment t = 2.5444, df = 50, p-value = 0.01408 alternative hypothesis: true difference in means between group Seeded and group Unseeded is not equal to 0 95 percent confidence interval: 0.240865 2.046697 sample estimates: mean in group Seeded mean in group Unseeded 5.134187 3.990406 ``` --- ## How to interpret *t*-test with log? 
```r t.test(I(log(rainfall)) ~ treatment, var.equal = TRUE, data = cloud) ``` ``` Two Sample t-test data: I(log(rainfall)) by treatment t = 2.5444, df = 50, p-value = 0.01408 alternative hypothesis: true difference in means between group Seeded and group Unseeded is not equal to 0 95 percent confidence interval: * 0.240865 2.046697 sample estimates: mean in group Seeded mean in group Unseeded * 5.134187 3.990406 ``` --- ## Using `exp()` to Report in Original Units ```r # difference in means 5.134187 - 3.990406 # mean of each group ``` ``` [1] 1.143781 ``` ```r exp(5.134187 - 3.990406) ``` ``` [1] 3.138613 ``` ```r # 95% confidence interval c(exp(0.240865), exp(2.046697)) ``` ``` [1] 1.272349 7.742286 ``` --- ## Reporting Results of Cloud Seeding Experiment .large[ * "It is estimated that the volume of rainfall on days when clouds were seeded was 3.1 times as large as when not seeded. A 95% confidence interval for this multiplicative effect is 1.3 times to 7.7 times. Since randomization was used to determine whether any particular suitable day was seeded or not, it is safe to interpret this as evidence that the seeding caused the larger rainfall amount." ] --- class: center, middle # Log Transformation --- ## Mathematical Assumptions on `\(t\)`-Tests and CIs .vertical-center[ .large[ * Two samples are independent * Each sample is a random sample from a normal population * The population variances are equal ] ] --- ## What do We Mean by Independent? 
<img src="images/cartoon_guide_independence.png" width="100%" style="display: block; margin: auto;" /> .footnote[Source: Gonick & Smith (2005), *Cartoon Guide to Statistics*, p 43] --- ## Cases Where Assumptions May Not be Satisfied .vertical-center[ .large[ * The shape of the population distribution may not resemble the shape of a normal distribution * The population variances may not be comparable * The independence assumption may not be met: - cluster effects and serial effects ] ] --- ## Visualizing the Log Transformation <img src="images/ss_display_3_8.png" width="65%" style="display: block; margin: auto;" /> .footnote[Source: Statistical Sleuth, Display 3.8] --- ## Visualizing the Log Transformation in `R` .left-code[ ```r # start with standard normal norm_dist <- rnorm( n = 10000, mean = 0, sd = 1 ) hist(norm_dist) ``` ] .right-plot[ <img src="week05_01_files/figure-html/visualizing_log_plot1-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Visualizing the Log Transformation in `R` .left-code[ ```r # start with standard normal norm_dist <- rnorm( n = 10000, mean = 0, sd = 1 ) plot(density(norm_dist)) ``` ] .right-plot[ <img src="week05_01_files/figure-html/visualizing_log_plot2-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Exponentiate Standard Normal .left-code[ ```r # exponentiate standard normal exp_norm <- exp(norm_dist) hist(exp_norm) ``` ] .right-plot[ <img src="week05_01_files/figure-html/visualizing_log_plot3-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Exponentiate Standard Normal .left-code[ ```r # exponentiate standard normal exp_norm <- exp(norm_dist) plot(density(exp_norm)) ``` ] .right-plot[ <img src="week05_01_files/figure-html/visualizing_log_plot4-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Back Transform Standard Normal .left-code[ ```r # back transform log_trans <- log(exp_norm) plot(density(log_trans)) ``` ] .right-plot[ <img 
src="week05_01_files/figure-html/visualizing_log_plot5-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Log transformation -- - Many real data sets more closely meet the ideal model after a log transformation (e.g., a data set that exhibits a higher level of variation at large values) -- - It permits interpretation in terms of multiplicative effects -- - Example of <span style="color:red">additive effect</span>: the mean salary for Pomona grads is $500 more than the mean for Stanford grads -- - Example of <span style="color:red">multiplicative effect</span>: the median salary for Pomona grads is 15% more than the median for Stanford grads -- - What indicates that a log transformation might help? -- - Distributions are skewed to the right - Spread is greater in the distribution with the larger center - A multiplicative statement is desirable --- class: center, middle ## Reporting Results --- ## Reporting Results of Cloud Seeding Experiment * "It is estimated that the <mark>volume of rainfall</mark> on days when clouds were seeded was 3.1 times as large as when not seeded. A 95% confidence interval for this multiplicative effect is 1.3 times to 7.7 times. Since randomization was used to determine whether any particular suitable day was seeded or not, it is safe to interpret this as evidence that the seeding caused the larger rainfall amount." * Elements of statistical writing - Outcome of interest: volume of rainfall --- ## Reporting Results of Cloud Seeding Experiment * "It is estimated that the volume of rainfall on <mark>days when clouds were seeded</mark> was 3.1 times as large as <mark>when not seeded</mark>. A 95% confidence interval for this multiplicative effect is 1.3 times to 7.7 times. Since randomization was used to determine whether any particular suitable day was seeded or not, it is safe to interpret this as evidence that the seeding caused the larger rainfall amount." 
* Elements of statistical writing - Outcome of interest: volume of rainfall - Groups - 'treated' group: days when clouds were seeded - 'control' group: when not seeded --- ## Reporting Results of Cloud Seeding Experiment * "It is estimated that the volume of rainfall on days when clouds were seeded was <mark>3.1</mark> <mark>times</mark> <mark>as large</mark> as when not seeded. A 95% confidence interval for this multiplicative effect is 1.3 times to 7.7 times. Since randomization was used to determine whether any particular suitable day was seeded or not, it is safe to interpret this as evidence that the seeding caused the larger rainfall amount." * Elements of statistical writing - Outcome of interest: volume of rainfall - Groups - 'treated' group: days when clouds were seeded - 'control' group: when not seeded - Magnitude - estimate : 3.1 - additive or multiplicative?: times - sign: as large (positive) --- ## Reporting Results of Cloud Seeding Experiment * "It is estimated that the volume of rainfall on days when clouds were seeded was 3.1 times as large as when not seeded. A 95% confidence interval for this multiplicative effect is 1.3 times to 7.7 times. Since randomization was used to determine whether any particular suitable day was seeded or not, it is safe to interpret this as evidence that <mark>the seeding caused</mark> the larger rainfall amount." * Elements of statistical writing - Outcome of interest: volume of rainfall - Groups - 'treated' group: days when clouds were seeded - 'control' group: when not seeded - Magnitude - estimate : 3.1 - additive or multiplicative?: times - sign: as large (positive) - Causality - associated with or caused?: the seeding caused --- ## Reporting Results of Cloud Seeding Experiment * "It is estimated that the volume of rainfall on days when clouds were seeded was 3.1 times as large as when not seeded. A 95% confidence interval for this multiplicative effect is 1.3 times to 7.7 times. 
Since randomization was used to determine whether any particular suitable day was seeded or not, it is safe to interpret this as evidence that the seeding caused the larger rainfall amount." * Elements of statistical writing - Outcome of interest: volume of rainfall - Groups - 'treated' group: days when clouds were seeded - 'control' group: when not seeded - Magnitude - estimate : 3.1 - additive or multiplicative?: times - sign: as large (positive) - Causality? - associated with or caused?: the seeding caused - Generalizability? - <mark>was random sampling used?</mark> --- ## Reporting Results in General * Writing statistical conclusions - Multiplicative effect: - It is estimated that the (outcome of interest) on (treated group) (is associated with / caused) (estimate) times (increase / decrease) versus the (control group). - One general form: - Moving from ('control' group) to ('treated' group) is (associated with / causes) a (estimate) (increase / decrease) (additive / multiplicative) change in the (mean / median) (outcome of interest) as measured in (units). --- class: center, middle ## Questions? 
--- class: middle, center background-color: #000000 <iframe width="1120" height="630" src="https://www.youtube.com/embed/bpwKeyg0v9w" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <!-- --- --> <!-- class: middle, center --> <!-- background-color: #000000 --> <!-- <iframe width="1120" height="630" src="https://www.youtube.com/embed/6YDHBFVIvIs" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> --> <!-- --- --> <!-- class: middle, center --> <!-- background-color: #000000 --> <!-- <iframe width="1120" height="630" src="https://www.youtube.com/embed/gyC9B8O_EKs" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> --> <!-- # Virtual Galton Board --> <!-- * Open up RStudio --> <!-- * Type `rbinom(n = 1, size = 10, prob = 0.5)` --> <!-- * Go to http://bit.ly/346galton --> <!-- * Record number of 'successes' --> --- ## Log Transformation Appendix -- - Procedure for two samples: -- - Initial inspection may suggest trying log(Y ). -- - Transform to get two new columns: `\begin{eqnarray*} Z_1 &= & \textrm{log}(Y_1)\\ Z_2 &= & \textrm{log}(Y_2) \end{eqnarray*}` -- - Graphically examine `\(Z_1\)` and `\(Z_2\)` -- - If appropriate, use `\(t\)`-tools on `\(Z_1\)` and `\(Z_2\)` -- - Interpret results on original scale. 
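
The procedure above can be sketched in `R`. This is a minimal illustration on simulated data: the lognormal samples and the names `y1`, `y2`, `z1`, `z2` are made up for the example, not taken from the Sleuth data.

```r
# Sketch of the two-sample log-transform procedure on simulated data
# (lognormal samples stand in for the two groups; all names illustrative)
set.seed(1)
y1 <- rlnorm(26, meanlog = 4, sdlog = 1.5)  # 'control' sample
y2 <- rlnorm(26, meanlog = 5, sdlog = 1.5)  # 'treated' sample

# transform to get two new columns
z1 <- log(y1)
z2 <- log(y2)

# graphically examine Z1 and Z2, e.g.:
# hist(z1); hist(z2)

# if appropriate, use t-tools on Z1 and Z2
tt <- t.test(z2, z1, var.equal = TRUE)

# interpret on the original scale: exponentiate the difference in means
w <- exp(mean(z2) - mean(z1))
```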
--- ## Log transformation <br/><br/> -- - Interpretation: statement about the ratio of the population medians -- - First, note that `\begin{eqnarray*} E[\textrm{log}(Y)] & \neq & \textrm{log}(E[Y]) \end{eqnarray*}` -- - In fact, `\begin{eqnarray*} E[\textrm{log}(Y)] &\leq& \textrm{log}(E[Y]) \end{eqnarray*}` for any positive random variable `\(Y\)` (Jensen's inequality, since log is concave) -- - On the other hand, for the population median `\(m\)`, `\begin{eqnarray*} m(\textrm{log}(Y)) & = & \textrm{log}(m(Y)) \end{eqnarray*}` --- ## Log, mean, median ```r measure_a <- seq(from = 1, to = 1000, by = 1) head(measure_a) ``` ``` [1] 1 2 3 4 5 6 ``` ```r log(head(measure_a)) ``` ``` [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 ``` ```r mean(log(measure_a)) ``` ``` [1] 5.912128 ``` ```r log(mean(measure_a)) ``` ``` [1] 6.215608 ``` --- ## Why doesn't log(mean) = mean(log)? * Linear function vs non-linear function ```r ## linear function, no problem mean(measure_a * 2) ``` ``` [1] 1001 ``` ```r 2 * (mean(measure_a)) ``` ``` [1] 1001 ``` ```r # non-linear function, issues mean(log(measure_a)) ``` ``` [1] 5.912128 ``` ```r log(mean(measure_a)) ``` ``` [1] 6.215608 ``` --- ## Why doesn't log(mean) = mean(log)? .left-code[ ```r ## linear function, no problem double_a <- measure_a * 2 plot(double_a ~ measure_a) ``` ] .right-plot[ <img src="week05_01_files/figure-html/linear_function_plot-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Why doesn't log(mean) = mean(log)? .left-code[ ```r ## non-linear function log_a <- log(measure_a) plot(log_a ~ measure_a) ``` ] .right-plot[ <img src="week05_01_files/figure-html/nonlinear_function_plot-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Does log(median) = median(log)? 
* The `log` function preserves rank ```r median(log(measure_a)) ``` ``` [1] 6.215607 ``` ```r log(median(measure_a)) ``` ``` [1] 6.215608 ``` --- ## Log transformation -- - `\(\bar{Z}_2 - \bar{Z}_1\)` estimates `\(E[\textrm{log}(Y_2)] - E[\textrm{log}(Y_1)]\)` -- - If the distribution of `\(\log(Y)\)` is symmetric, -- `\begin{eqnarray*} E [\textrm{log}(Y)] & = & m(\textrm{log}(Y)) = \textrm{log}(m(Y)). \end{eqnarray*}` -- - Therefore, if the distributions of `\(Z_2\)` and `\(Z_1\)` are symmetric, `\(\bar{Z}_2 - \bar{Z}_1\)` estimates -- `\begin{eqnarray*} \textrm{log}(m(Y_2)) - \textrm{log}(m(Y_1))& = & \textrm{log}\left( \frac{m(Y_2)}{m(Y_1)} \right) \end{eqnarray*}` - Exponentiating both sides, we conclude that `\(\exp(\bar{Z}_2 - \bar{Z}_1)\)` estimates `\(\displaystyle \frac{m(Y_2)}{m(Y_1)}\)`, the ratio of the population medians --- ## Log transformation <br/><br/> -- - State the conclusion in the original scale: `\begin{eqnarray*} w & = &\exp(\bar{Z}_2 - \bar{Z}_1) \end{eqnarray*}` -- - Randomization study: It is estimated that the response of an experimental unit to treatment 2 will be `\(w\)` times as large as its response to treatment 1 - Observational study: It is estimated that the median for population 2 is `\(w\)` times as large as the median for population 1
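 --- ## Checking the estimator by simulation

The claim above can be checked numerically in `R`: for lognormal data, `\(\exp(\bar{Z}_2 - \bar{Z}_1)\)` tracks the ratio of medians. This is an illustrative sketch; the sample sizes and distribution parameters are chosen for the example, not taken from any data set in the course.

```r
# Check that exp(Zbar2 - Zbar1) estimates m(Y2)/m(Y1) for lognormal data
# (sample sizes and meanlog/sdlog values are illustrative)
set.seed(42)
y1 <- rlnorm(1e5, meanlog = 4, sdlog = 1)  # sample from population 1
y2 <- rlnorm(1e5, meanlog = 5, sdlog = 1)  # sample from population 2

w     <- exp(mean(log(y2)) - mean(log(y1)))  # estimator from the slides
ratio <- median(y2) / median(y1)             # ratio of sample medians

# both estimate exp(5 - 4) = e, the true ratio of population medians
c(w = w, ratio = ratio)
```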