Supplement to POL90
The first table in almost every published study we read gives descriptive statistics. Depending on the field, type of project, and publication, there will be some variation in the design of these tables. Broadly, however, they serve the same purpose: to help readers get a sense of the distribution of the variables used in the study. Measures included will also vary; some authors will include median and IQR, while others will use mean and standard deviation, and others will include all 4. Generally, the number of observations for each variable will be included; this helps get a sense of where there is missingnessness. Summary tables also help us, the researchers to identify issues in how variables are coded and to interpret our own coefficients.
In this document, we will be looking at a selection of variables from
the 2016 American National Election Survey. To make our tables, we will
use two packages: arsenal
and knitr
. You will
by now be familiar with the latter’s kable()
function. When
paired with arsenal::tableby()
, it can be used to make
informative, well-formatted summary tables with ease.
NOTE: library(kableExtra)
which extends
kable()
may cause formatting issues with
library(arsenal)
. If you want to use both, we recommend
loading library(kableExtra)
only after you’ve finished
creating your summary statistics table with the tableby()
command.
For more details on the tableby()
function, consult
this great resource.
# set global options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(tidyverse)
library(knitr)
library(foreign)
library(stringr)
library(MatchIt)
library(naniar)
# install.packages("arsenal")
library(arsenal)
# NOTE library(kableExtra) may cause output issues with library(arsenal)
The hardest part of assembling a summary table is to have the variables you’ll use exactly as you want them. Here, I selected a series of variables from the 2016 ANES and cleaned them.
anes <- read.dta("anes_timeseries_2016_Stata12.dta")
anes <- anes %>%
mutate(
race = V161310x,
age = V161267,
death_penalty = V161233x,
vaccines = V162162x,
scientists = V162112,
obedient_self_reliant = V162241,
considerate_behaved = V162242,
stocks = V161350,
pid = V161158x,
educ = V161270,
married = V161269
)
anes1 <- anes %>%
dplyr::select(race, age, death_penalty, vaccines, scientists,
obedient_self_reliant,
stocks, pid, educ, married) # keep selected variables
anes1 <- anes1 %>%
# mutate_if conducts a logical test (.predicate) like is.factor(),
# if TRUE, runs function (.funs), here convert to character
mutate_if(.predicate = is.factor, .funs = as.character)
# remove negative values -- those are just NAs
anes1 <- anes1 %>%
# mutate_all applies a function to all columns
# here we run a case_when function on each column, one at a time
# str_sub function checks if first character == "-", assign NA,
# otherwise return original data in as.character() format
mutate_all(
.funs = function(x)
case_when(
str_sub(x, start = 1, end = 1) == "-" ~ NA_character_,
TRUE ~ as.character(x)
)
)
# I want to remove the numbers at the beginning of strings
# make vaccines, death_penalty, age, scientists thermometer numeric
# and make educ and pid indicator variables
anes1 <- anes1 %>%
# mutate_at applies a function to specfic columns
# str_sub extracts and replaces substrings from a character vector
mutate_at(.vars = c("vaccines", "death_penalty"),
.funs = function(x) str_sub(x, start = 1, end = 2)) %>%
mutate_at(.vars = c("age", "scientists", "vaccines", "death_penalty"),
.funs = as.numeric) %>%
mutate_at(.vars = c("stocks", "married", "race", "obedient_self_reliant"),
.funs = function(x) str_sub(x, start = 4))
anes1 <- anes1 %>%
mutate(
educ_BA = ifelse(str_sub(educ, start = 1, end = 2) %in% c("13", "14", "15", "16"),
"BA or Higher", "Less than BA"),
republican = ifelse(pid %in% c("7. Strong Republican",
"6. Not very strong Republican",
"5. Independent-Republican"),
"Yes", "No") %>% as.factor(),
race = ifelse(str_sub(race, start = 1, end = 5) == "Other", "Other", race),
scientists = ifelse(scientists > 100, NA, scientists)
)
anes1 <- anes1 %>%
dplyr::select(age, race, educ_BA, obedient_self_reliant,
republican, stocks, scientists, vaccines, death_penalty)
After cleaning, the data looks like this. Note that some variables
are numeric (such as age
and scientists
, the
feeling thermometer towards scientists). All others are character types.
As of now, none of these are factor variables.
head(anes1, 4)
age race educ_BA obedient_self_reliant
1 29 White, non-Hispanic Less than BA Obedience
2 26 White, non-Hispanic BA or Higher Self-reliance
3 23 White, non-Hispanic Less than BA Self-reliance
4 58 White, non-Hispanic Less than BA Self-reliance
republican stocks scientists vaccines death_penalty
1 Yes No 70 4 1
2 Yes No 70 1 4
3 No No 70 5 1
4 Yes Yes 50 4 1
Now let’s look at a very simple summary table using
tableby
. You’ll note that tableby
uses the
same syntax as lm()
. Anything to the left of the tilde
(~
) will be a column, and anything on the right will be a
row. In this first example, I leave the left side blank. This means that
summary statistics will be presented for the full data, not
separating by group.
tableby( ~ republican + race + age + educ_BA +
stocks + vaccines + scientists + obedient_self_reliant,
data = anes1,
digits = 2) %>%
summary() %>%
kable()
Overall (N=4270) | |
---|---|
republican | |
No | 2541 (59.5%) |
Yes | 1729 (40.5%) |
race | |
N-Miss | 33 |
Asian, native Hawaiian or other Pacif Islr,non-Hispanic | 148 (3.5%) |
Black, non-Hispanic | 397 (9.4%) |
Hispanic | 450 (10.6%) |
Native American or Alaska Native, non-Hispanic | 27 (0.6%) |
Other | 177 (4.2%) |
White, non-Hispanic | 3038 (71.7%) |
age | |
N-Miss | 121 |
Mean (SD) | 49.58 (17.58) |
Range | 18.00 - 90.00 |
educ_BA | |
BA or Higher | 1635 (38.3%) |
Less than BA | 2635 (61.7%) |
stocks | |
N-Miss | 102 |
No | 2218 (53.2%) |
Yes | 1950 (46.8%) |
vaccines | |
N-Miss | 671 |
Mean (SD) | 2.33 (1.67) |
Range | 1.00 - 7.00 |
scientists | |
N-Miss | 655 |
Mean (SD) | 76.70 (19.44) |
Range | 0.00 - 100.00 |
obedient_self_reliant | |
N-Miss | 646 |
Both | 70 (1.9%) |
Obedience | 1684 (46.5%) |
Self-reliance | 1870 (51.6%) |
For numeric variables, like age
and
vaccines
, we see the mean and sd. Character types, by
contrast, are treated as categorical variables. For them, we receive a
count and proportion.
If you want to break down your explanatory and control variables by
subgroups, doing so is extremely easy. Just put the variable you want to
group by before the tilde (~
).
tableby(republican ~ race + age + educ_BA + stocks +
vaccines + scientists + obedient_self_reliant,
data = anes1,
test = FALSE, # do not do t-tests or give p-values
digits = 2) %>%
summary() %>%
kable()
No (N=2541) | Yes (N=1729) | Total (N=4270) | |
---|---|---|---|
race | |||
N-Miss | 22 | 11 | 33 |
Asian, native Hawaiian or other Pacif Islr,non-Hispanic | 93 (3.7%) | 55 (3.2%) | 148 (3.5%) |
Black, non-Hispanic | 380 (15.1%) | 17 (1.0%) | 397 (9.4%) |
Hispanic | 347 (13.8%) | 103 (6.0%) | 450 (10.6%) |
Native American or Alaska Native, non-Hispanic | 19 (0.8%) | 8 (0.5%) | 27 (0.6%) |
Other | 115 (4.6%) | 62 (3.6%) | 177 (4.2%) |
White, non-Hispanic | 1565 (62.1%) | 1473 (85.7%) | 3038 (71.7%) |
age | |||
N-Miss | 70 | 51 | 121 |
Mean (SD) | 48.13 (17.46) | 51.73 (17.54) | 49.58 (17.58) |
Range | 18.00 - 90.00 | 18.00 - 90.00 | 18.00 - 90.00 |
educ_BA | |||
BA or Higher | 952 (37.5%) | 683 (39.5%) | 1635 (38.3%) |
Less than BA | 1589 (62.5%) | 1046 (60.5%) | 2635 (61.7%) |
stocks | |||
N-Miss | 69 | 33 | 102 |
No | 1479 (59.8%) | 739 (43.6%) | 2218 (53.2%) |
Yes | 993 (40.2%) | 957 (56.4%) | 1950 (46.8%) |
vaccines | |||
N-Miss | 405 | 266 | 671 |
Mean (SD) | 2.39 (1.70) | 2.25 (1.61) | 2.33 (1.67) |
Range | 1.00 - 7.00 | 1.00 - 7.00 | 1.00 - 7.00 |
scientists | |||
N-Miss | 391 | 264 | 655 |
Mean (SD) | 79.32 (19.27) | 72.84 (19.06) | 76.70 (19.44) |
Range | 0.00 - 100.00 | 0.00 - 100.00 | 0.00 - 100.00 |
obedient_self_reliant | |||
N-Miss | 389 | 257 | 646 |
Both | 38 (1.8%) | 32 (2.2%) | 70 (1.9%) |
Obedience | 903 (42.0%) | 781 (53.1%) | 1684 (46.5%) |
Self-reliance | 1211 (56.3%) | 659 (44.8%) | 1870 (51.6%) |
# change variable names:
var_names <- list(
republican = "Republican",
race = "Race",
stocks = "Stock Market Participation",
death_penalty = "Support/Oppose Death Penalty"
)
# and add title (always do this):
tableby(republican ~ race + stocks + death_penalty,
data = anes1,
test = FALSE,
digits = 2) %>%
summary(labelTranslations = var_names) %>% # variable names
kable(caption = "Selected Variables by Republican Party ID") # title
No (N=2541) | Yes (N=1729) | Total (N=4270) | |
---|---|---|---|
Race | |||
N-Miss | 22 | 11 | 33 |
Asian, native Hawaiian or other Pacif Islr,non-Hispanic | 93 (3.7%) | 55 (3.2%) | 148 (3.5%) |
Black, non-Hispanic | 380 (15.1%) | 17 (1.0%) | 397 (9.4%) |
Hispanic | 347 (13.8%) | 103 (6.0%) | 450 (10.6%) |
Native American or Alaska Native, non-Hispanic | 19 (0.8%) | 8 (0.5%) | 27 (0.6%) |
Other | 115 (4.6%) | 62 (3.6%) | 177 (4.2%) |
White, non-Hispanic | 1565 (62.1%) | 1473 (85.7%) | 3038 (71.7%) |
Stock Market Participation | |||
N-Miss | 69 | 33 | 102 |
No | 1479 (59.8%) | 739 (43.6%) | 2218 (53.2%) |
Yes | 993 (40.2%) | 957 (56.4%) | 1950 (46.8%) |
Support/Oppose Death Penalty | |||
N-Miss | 69 | 25 | 94 |
Mean (SD) | 2.23 (1.23) | 1.54 (0.93) | 1.95 (1.17) |
Range | 1.00 - 4.00 | 1.00 - 4.00 | 1.00 - 4.00 |
# change which summary statistics are displayed
tableby(republican ~ race + vaccines + scientists,
numeric.stats = c("N", "median", "q1q3"), # give us count, median, and IQR for numeric
data = anes1,
test = FALSE,
digits = 2) %>%
summary() %>%
kable(caption = "Selected Variables by Republican Party ID") # title
No (N=2541) | Yes (N=1729) | Total (N=4270) | |
---|---|---|---|
race | |||
N-Miss | 22 | 11 | 33 |
Asian, native Hawaiian or other Pacif Islr,non-Hispanic | 93 (3.7%) | 55 (3.2%) | 148 (3.5%) |
Black, non-Hispanic | 380 (15.1%) | 17 (1.0%) | 397 (9.4%) |
Hispanic | 347 (13.8%) | 103 (6.0%) | 450 (10.6%) |
Native American or Alaska Native, non-Hispanic | 19 (0.8%) | 8 (0.5%) | 27 (0.6%) |
Other | 115 (4.6%) | 62 (3.6%) | 177 (4.2%) |
White, non-Hispanic | 1565 (62.1%) | 1473 (85.7%) | 3038 (71.7%) |
vaccines | |||
N | 2136 | 1463 | 3599 |
Median | 2.00 | 2.00 | 2.00 |
Q1, Q3 | 1.00, 4.00 | 1.00, 3.00 | 1.00, 4.00 |
scientists | |||
N | 2150 | 1465 | 3615 |
Median | 85.00 | 70.00 | 85.00 |
Q1, Q3 | 70.00, 99.00 | 60.00, 85.00 | 60.00, 93.00 |
There is also a way to conduct t.tests
and other
significance tests. To do so, set test = TRUE
. For this
option, I also recommend using results = "asis"
in the
chunk header instead of using kable()
, and
setting pfootnote
in the summary function to
TRUE
. This gives details of what kind of test was used to
establish significance. Finally, to set a title on this table without
kable
, use title = "something"
in the
summary()
function.
# is there a significant difference between groups? Chi-square and t-test options
tableby(republican ~ age + death_penalty + vaccines + scientists,
numeric.stats = c("N", "mean", "sd"),
data = anes1,
test = TRUE, # set test to true (and remember to pair with mean/sd)
numeric.test = "kwt", cat.test = "chisq", # set test types (linear ANOVA default)
digits = 2) %>%
summary(title = "Selected Variables by Republican Party ID",
pfootnote = TRUE) # display methods used
No (N=2541) | Yes (N=1729) | Total (N=4270) | p value | |
---|---|---|---|---|
age | < 0.0011 | |||
N | 2471 | 1678 | 4149 | |
Mean | 48.13 | 51.73 | 49.58 | |
SD | 17.46 | 17.54 | 17.58 | |
death_penalty | < 0.0011 | |||
N | 2472 | 1704 | 4176 | |
Mean | 2.23 | 1.54 | 1.95 | |
SD | 1.23 | 0.93 | 1.17 | |
vaccines | 0.1391 | |||
N | 2136 | 1463 | 3599 | |
Mean | 2.39 | 2.25 | 2.33 | |
SD | 1.70 | 1.61 | 1.67 | |
scientists | < 0.0011 | |||
N | 2150 | 1465 | 3615 | |
Mean | 79.32 | 72.84 | 76.70 | |
SD | 19.27 | 19.06 | 19.44 |
Finally, after conducting matching, it is possible to use
tableby()
to show that your covariates are now balanced. In
this example, I match the treatment, republican
, on two
variables, age
and stocks
. Stocks, here, I am
using as a proxy for status/wealth. Notice that I must do some data
cleaning before matching. I had to (1) change the character vectors to
dummy variables, and (b) omit any NAs from this subsample of the data. I
conducted 3 t-tests and noted a statisically significant difference for
stocks
and age
on probability of being
republican, but not on educ_BA
.
anes1 <- anes1 %>%
mutate(
educ_BA_bin = ifelse(educ_BA == "BA or Higher", 1, 0),
republican = ifelse(republican == "Yes", 1, 0),
stocks = ifelse(stocks == "Yes", 1, 0),
id = row_number()
) %>%
na.omit()
with(anes1, t.test(age ~ republican))
Welch Two Sample t-test
data: age by republican
t = -6, df = 2920, p-value = 6e-09
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-4.79 -2.38
sample estimates:
mean in group 0 mean in group 1
47.6 51.2
Welch Two Sample t-test
data: stocks by republican
t = -9, df = 2946, p-value <2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-0.186 -0.118
sample estimates:
mean in group 0 mean in group 1
0.411 0.563
Welch Two Sample t-test
data: educ_BA_bin by republican
t = -1, df = 2947, p-value = 0.3
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-0.0521 0.0153
sample estimates:
mean in group 0 mean in group 1
0.392 0.410
# another way to show this:
# summary(glm(republican ~ educ_BA_bin + stocks + age, data = anes1, family = binomial()))
matched <- matchit(republican ~ age + stocks + educ_BA_bin,
data = anes1,
method = "nearest")
anes_matched <- match.data(matched)
Now, I only need to feed the matched data into our previous
tableby()
output:
# using tableby
tableby(republican ~ age + stocks + educ_BA_bin,
numeric.stats = "meansd", # only one summary stat
numeric.simplify = TRUE, # put on the same row as variable name
numeric.test = "kwt", # manually set test types
cat.test = "chisq", # manually set test types
data = anes_matched,
test = TRUE, # set test to true (remember to pair w/ mean/sd)
digits = 2) %>%
summary(title = "Covariate Balance on Matched Characteristics",
pfootnote = TRUE)
0 (N=1378) | 1 (N=1378) | Total (N=2756) | p value | |
---|---|---|---|---|
age | 50.79 (17.03) | 51.22 (17.67) | 51.01 (17.35) | 0.6071 |
stocks | 0.56 (0.50) | 0.56 (0.50) | 0.56 (0.50) | 0.9691 |
educ_BA_bin | 0.44 (0.50) | 0.41 (0.49) | 0.42 (0.49) | 0.1331 |
# using tableby to create table with three categories
tableby(
age_fct ~ stocks + educ_BA_bin,
numeric.stats = "meansd", # only one summary stat
numeric.simplify = TRUE, # put on the same row as variable name
numeric.test = "kwt", # manually set test types
cat.test = "chisq", # manually set test types
data = anes_matched,
test = TRUE, # set test to true (remember to pair w/ mean/sd)
digits = 2
) %>%
summary(title = "Covariate Balance on Matched Characteristics",
pfootnote = TRUE)
middle (N=2033) | senior (N=634) | youth (N=89) | Total (N=2756) | p value | |
---|---|---|---|---|---|
stocks | 0.56 (0.50) | 0.63 (0.48) | 0.17 (0.38) | 0.56 (0.50) | < 0.0011 |
educ_BA_bin | 0.44 (0.50) | 0.42 (0.49) | 0.03 (0.18) | 0.42 (0.49) | < 0.0011 |
This supplement was put together by Beatriz Barros. Please email any questions, corrections or concerns to omar.wasow@pomona.edu.