POL90: Intro to Summary Tables

Introduction

The first table in almost every published study we read gives descriptive statistics. Depending on the field, type of project, and publication, there will be some variation in the design of these tables. Broadly, however, they serve the same purpose: to help readers get a sense of the distribution of the variables used in the study. Measures included will also vary; some authors will include median and IQR, while others will use mean and standard deviation, and others will include all 4. Generally, the number of observations for each variable will be included; this helps get a sense of where there is missingnessness. Summary tables also help us, the researchers to identify issues in how variables are coded and to interpret our own coefficients.

In this document, we will be looking at a selection of variables from the 2016 American National Election Survey. To make our tables, we will use two packages: arsenal and knitr. You will by now be familiar with the latter’s kable() function. When paired with arsenal::tableby(), it can be used to make informative, well-formatted summary tables with ease.

NOTE: library(kableExtra) which extends kable() may cause formatting issues with library(arsenal). If you want to use both, we recommend loading library(kableExtra) only after you’ve finished creating your summary statistics table with the tableby() command.

For more details on the tableby() function, consult this great resource.

Loading packages

# set global options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

library(tidyverse)
library(knitr)
library(foreign)
library(stringr)
library(MatchIt)
library(naniar)

# install.packages("arsenal")
library(arsenal)

# NOTE library(kableExtra) may cause output issues with library(arsenal)

Loading and cleaning the data

The hardest part of assembling a summary table is to have the variables you’ll use exactly as you want them. Here, I selected a series of variables from the 2016 ANES and cleaned them.

anes <- read.dta("anes_timeseries_2016_Stata12.dta")

anes <- anes %>% 
  mutate(
    race                  = V161310x,
    age                   = V161267,
    death_penalty         = V161233x,
    vaccines              = V162162x,
    scientists            = V162112,
    obedient_self_reliant = V162241,
    considerate_behaved   = V162242,
    stocks                = V161350,
    pid                   = V161158x,
    educ                  = V161270,
    married               = V161269
    )

anes1 <- anes %>% 
  dplyr::select(race, age, death_penalty, vaccines, scientists, 
         obedient_self_reliant,
         stocks, pid, educ, married) # keep selected variables

anes1 <- anes1 %>% 
  # mutate_if conducts a logical test (.predicate) like is.factor(), 
  # if TRUE, runs function (.funs), here convert to character
  mutate_if(.predicate = is.factor, .funs = as.character) 

# remove negative values -- those are just NAs
anes1 <- anes1 %>% 
  # mutate_all applies a function to all columns
  # here we run a case_when function on each column, one at a time 
  # str_sub function checks if first character == "-", assign NA,
  # otherwise return original data in as.character() format
  mutate_all(
    .funs = function(x)
      case_when(
        str_sub(x, start = 1, end = 1) == "-" ~ NA_character_, 
        TRUE ~ as.character(x)
      )
    )

# I want to remove the numbers at the beginning of strings
# make vaccines, death_penalty, age, scientists thermometer numeric
# and make educ and pid indicator variables

anes1 <- anes1 %>% 
  # mutate_at applies a function to specfic columns
  # str_sub extracts and replaces substrings from a character vector
  mutate_at(.vars = c("vaccines", "death_penalty"), 
            .funs = function(x) str_sub(x, start = 1, end = 2)) %>% 
  mutate_at(.vars = c("age", "scientists", "vaccines", "death_penalty"),
            .funs = as.numeric) %>% 
  mutate_at(.vars = c("stocks", "married", "race", "obedient_self_reliant"), 
            .funs = function(x) str_sub(x, start = 4))


anes1 <- anes1 %>% 
  mutate(
    educ_BA    = ifelse(str_sub(educ, start = 1, end = 2) %in% c("13", "14", "15", "16"), 
                          "BA or Higher", "Less than BA"),
    republican = ifelse(pid %in% c("7. Strong Republican", 
                                   "6. Not very strong Republican", 
                                   "5. Independent-Republican"), 
                                    "Yes", "No") %>% as.factor(),
    race       = ifelse(str_sub(race, start = 1, end = 5) == "Other", "Other", race),
    scientists = ifelse(scientists > 100, NA, scientists)
    )

anes1 <- anes1 %>% 
  dplyr::select(age, race, educ_BA, obedient_self_reliant,
         republican, stocks, scientists, vaccines, death_penalty)

After cleaning, the data looks like this. Note that some variables are numeric (such as age and scientists, the feeling thermometer towards scientists). All others are character types. As of now, none of these are factor variables.

head(anes1, 4)

  age                race      educ_BA obedient_self_reliant
1  29 White, non-Hispanic Less than BA             Obedience
2  26 White, non-Hispanic BA or Higher         Self-reliance
3  23 White, non-Hispanic Less than BA         Self-reliance
4  58 White, non-Hispanic Less than BA         Self-reliance
  republican stocks scientists vaccines death_penalty
1        Yes     No         70        4             1
2        Yes     No         70        1             4
3         No     No         70        5             1
4        Yes    Yes         50        4             1

Simple summary table

Now let’s look at a very simple summary table using tableby. You’ll note that tableby uses the same syntax as lm(). Anything to the left of the tilde (~) will be a column, and anything on the right will be a row. In this first example, I leave the left side blank. This means that summary statistics will be presented for the full data, not separating by group.

tableby( ~ republican + race + age +  educ_BA + 
           stocks + vaccines + scientists + obedient_self_reliant, 
         data   = anes1,
         digits = 2) %>% 
  summary() %>% 
  kable()

	Overall (N=4270)
republican
No	2541 (59.5%)
Yes	1729 (40.5%)
race
N-Miss	33
Asian, native Hawaiian or other Pacif Islr,non-Hispanic	148 (3.5%)
Black, non-Hispanic	397 (9.4%)
Hispanic	450 (10.6%)
Native American or Alaska Native, non-Hispanic	27 (0.6%)
Other	177 (4.2%)
White, non-Hispanic	3038 (71.7%)
age
N-Miss	121
Mean (SD)	49.58 (17.58)
Range	18.00 - 90.00
educ_BA
BA or Higher	1635 (38.3%)
Less than BA	2635 (61.7%)
stocks
N-Miss	102
No	2218 (53.2%)
Yes	1950 (46.8%)
vaccines
N-Miss	671
Mean (SD)	2.33 (1.67)
Range	1.00 - 7.00
scientists
N-Miss	655
Mean (SD)	76.70 (19.44)
Range	0.00 - 100.00
obedient_self_reliant
N-Miss	646
Both	70 (1.9%)
Obedience	1684 (46.5%)
Self-reliance	1870 (51.6%)

For numeric variables, like age and vaccines, we see the mean and sd. Character types, by contrast, are treated as categorical variables. For them, we receive a count and proportion.

Subgroups

If you want to break down your explanatory and control variables by subgroups, doing so is extremely easy. Just put the variable you want to group by before the tilde (~).

tableby(republican ~ race + age +  educ_BA + stocks + 
           vaccines + scientists + obedient_self_reliant, 
         data   = anes1, 
         test   = FALSE, # do not do t-tests or give p-values
         digits = 2) %>% 
  summary() %>% 
  kable()

	No (N=2541)	Yes (N=1729)	Total (N=4270)
race
N-Miss	22	11	33
Asian, native Hawaiian or other Pacif Islr,non-Hispanic	93 (3.7%)	55 (3.2%)	148 (3.5%)
Black, non-Hispanic	380 (15.1%)	17 (1.0%)	397 (9.4%)
Hispanic	347 (13.8%)	103 (6.0%)	450 (10.6%)
Native American or Alaska Native, non-Hispanic	19 (0.8%)	8 (0.5%)	27 (0.6%)
Other	115 (4.6%)	62 (3.6%)	177 (4.2%)
White, non-Hispanic	1565 (62.1%)	1473 (85.7%)	3038 (71.7%)
age
N-Miss	70	51	121
Mean (SD)	48.13 (17.46)	51.73 (17.54)	49.58 (17.58)
Range	18.00 - 90.00	18.00 - 90.00	18.00 - 90.00
educ_BA
BA or Higher	952 (37.5%)	683 (39.5%)	1635 (38.3%)
Less than BA	1589 (62.5%)	1046 (60.5%)	2635 (61.7%)
stocks
N-Miss	69	33	102
No	1479 (59.8%)	739 (43.6%)	2218 (53.2%)
Yes	993 (40.2%)	957 (56.4%)	1950 (46.8%)
vaccines
N-Miss	405	266	671
Mean (SD)	2.39 (1.70)	2.25 (1.61)	2.33 (1.67)
Range	1.00 - 7.00	1.00 - 7.00	1.00 - 7.00
scientists
N-Miss	391	264	655
Mean (SD)	79.32 (19.27)	72.84 (19.06)	76.70 (19.44)
Range	0.00 - 100.00	0.00 - 100.00	0.00 - 100.00
obedient_self_reliant
N-Miss	389	257	646
Both	38 (1.8%)	32 (2.2%)	70 (1.9%)
Obedience	903 (42.0%)	781 (53.1%)	1684 (46.5%)
Self-reliance	1211 (56.3%)	659 (44.8%)	1870 (51.6%)

Added features

# change variable names:
var_names <- list(
  republican    = "Republican",
  race          = "Race",
  stocks        = "Stock Market Participation",
  death_penalty = "Support/Oppose Death Penalty"
)

# and add title (always do this):
tableby(republican ~ race + stocks + death_penalty, 
         data   = anes1, 
         test   = FALSE, 
         digits = 2) %>% 
  summary(labelTranslations = var_names) %>%   # variable names
  kable(caption = "Selected Variables by Republican Party ID") # title

Table 1: Selected Variables by Republican Party ID
	No (N=2541)	Yes (N=1729)	Total (N=4270)
Race
N-Miss	22	11	33
Asian, native Hawaiian or other Pacif Islr,non-Hispanic	93 (3.7%)	55 (3.2%)	148 (3.5%)
Black, non-Hispanic	380 (15.1%)	17 (1.0%)	397 (9.4%)
Hispanic	347 (13.8%)	103 (6.0%)	450 (10.6%)
Native American or Alaska Native, non-Hispanic	19 (0.8%)	8 (0.5%)	27 (0.6%)
Other	115 (4.6%)	62 (3.6%)	177 (4.2%)
White, non-Hispanic	1565 (62.1%)	1473 (85.7%)	3038 (71.7%)
Stock Market Participation
N-Miss	69	33	102
No	1479 (59.8%)	739 (43.6%)	2218 (53.2%)
Yes	993 (40.2%)	957 (56.4%)	1950 (46.8%)
Support/Oppose Death Penalty
N-Miss	69	25	94
Mean (SD)	2.23 (1.23)	1.54 (0.93)	1.95 (1.17)
Range	1.00 - 4.00	1.00 - 4.00	1.00 - 4.00

# change which summary statistics are displayed
tableby(republican ~ race + vaccines + scientists,
        numeric.stats = c("N", "median", "q1q3"), # give us count, median, and IQR for numeric
        data   = anes1, 
        test   = FALSE, 
        digits = 2) %>% 
  summary() %>% 
  kable(caption = "Selected Variables by Republican Party ID") # title

Table 2: Selected Variables by Republican Party ID
	No (N=2541)	Yes (N=1729)	Total (N=4270)
race
N-Miss	22	11	33
Asian, native Hawaiian or other Pacif Islr,non-Hispanic	93 (3.7%)	55 (3.2%)	148 (3.5%)
Black, non-Hispanic	380 (15.1%)	17 (1.0%)	397 (9.4%)
Hispanic	347 (13.8%)	103 (6.0%)	450 (10.6%)
Native American or Alaska Native, non-Hispanic	19 (0.8%)	8 (0.5%)	27 (0.6%)
Other	115 (4.6%)	62 (3.6%)	177 (4.2%)
White, non-Hispanic	1565 (62.1%)	1473 (85.7%)	3038 (71.7%)
vaccines
N	2136	1463	3599
Median	2.00	2.00	2.00
Q1, Q3	1.00, 4.00	1.00, 3.00	1.00, 4.00
scientists
N	2150	1465	3615
Median	85.00	70.00	85.00
Q1, Q3	70.00, 99.00	60.00, 85.00	60.00, 93.00

Significance tests

There is also a way to conduct t.tests and other significance tests. To do so, set test = TRUE. For this option, I also recommend using results = "asis" in the chunk header instead of using kable(), and setting pfootnote in the summary function to TRUE. This gives details of what kind of test was used to establish significance. Finally, to set a title on this table without kable, use title = "something" in the summary() function.

# is there a significant difference between groups? Chi-square and t-test options
tableby(republican ~ age + death_penalty + vaccines + scientists,
        numeric.stats = c("N", "mean", "sd"),
        data          = anes1, 
        test          = TRUE, # set test to true (and remember to pair with mean/sd)
        numeric.test  = "kwt", cat.test = "chisq", # set test types (linear ANOVA default)
        digits        = 2) %>% 
  summary(title = "Selected Variables by Republican Party ID",
          pfootnote = TRUE) # display methods used

Table 3: Selected Variables by Republican Party ID
	No (N=2541)	Yes (N=1729)	Total (N=4270)	p value
age				< 0.001¹
N	2471	1678	4149
Mean	48.13	51.73	49.58
SD	17.46	17.54	17.58
death_penalty				< 0.001¹
N	2472	1704	4176
Mean	2.23	1.54	1.95
SD	1.23	0.93	1.17
vaccines				0.139¹
N	2136	1463	3599
Mean	2.39	2.25	2.33
SD	1.70	1.61	1.67
scientists				< 0.001¹
N	2150	1465	3615
Mean	79.32	72.84	76.70
SD	19.27	19.06	19.44

Kruskal-Wallis rank sum test

Showing balance on covariates after matching:

Finally, after conducting matching, it is possible to use tableby() to show that your covariates are now balanced. In this example, I match the treatment, republican, on two variables, age and stocks. Stocks, here, I am using as a proxy for status/wealth. Notice that I must do some data cleaning before matching. I had to (1) change the character vectors to dummy variables, and (b) omit any NAs from this subsample of the data. I conducted 3 t-tests and noted a statisically significant difference for stocks and age on probability of being republican, but not on educ_BA.

anes1 <- anes1 %>% 
  mutate(
    educ_BA_bin = ifelse(educ_BA == "BA or Higher", 1, 0),
    republican  = ifelse(republican == "Yes", 1, 0),
    stocks      = ifelse(stocks == "Yes", 1, 0),
    id          = row_number()
    ) %>% 
  na.omit()

with(anes1, t.test(age         ~ republican))


    Welch Two Sample t-test

data:  age by republican
t = -6, df = 2920, p-value = 6e-09
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -4.79 -2.38
sample estimates:
mean in group 0 mean in group 1 
           47.6            51.2

with(anes1, t.test(stocks      ~ republican))


    Welch Two Sample t-test

data:  stocks by republican
t = -9, df = 2946, p-value <2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -0.186 -0.118
sample estimates:
mean in group 0 mean in group 1 
          0.411           0.563

with(anes1, t.test(educ_BA_bin ~ republican)) # not statistically significant


    Welch Two Sample t-test

data:  educ_BA_bin by republican
t = -1, df = 2947, p-value = 0.3
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -0.0521  0.0153
sample estimates:
mean in group 0 mean in group 1 
          0.392           0.410

# another way to show this:
# summary(glm(republican ~ educ_BA_bin + stocks + age, data = anes1, family = binomial()))

matched <- matchit(republican ~ age + stocks + educ_BA_bin,
                   data   = anes1,
                   method = "nearest")

anes_matched <- match.data(matched)

Now, I only need to feed the matched data into our previous tableby() output:

# using tableby
tableby(republican ~ age + stocks + educ_BA_bin,
        numeric.stats    = "meansd", # only one summary stat
        numeric.simplify = TRUE,     # put on the same row as variable name
        numeric.test     = "kwt",    # manually set test types
        cat.test         = "chisq",  # manually set test types
        data             = anes_matched, 
        test             = TRUE,     # set test to true (remember to pair w/ mean/sd)
        digits           = 2) %>% 
  summary(title = "Covariate Balance on Matched Characteristics",
          pfootnote = TRUE)

Table 4: Covariate Balance on Matched Characteristics
	0 (N=1378)	1 (N=1378)	Total (N=2756)	p value
age	50.79 (17.03)	51.22 (17.67)	51.01 (17.35)	0.607¹
stocks	0.56 (0.50)	0.56 (0.50)	0.56 (0.50)	0.969¹
educ_BA_bin	0.44 (0.50)	0.41 (0.49)	0.42 (0.49)	0.133¹

Kruskal-Wallis rank sum test

More than two categories

# recode age to three categories
anes_matched <- anes_matched %>% 
  mutate(
    age_fct = case_when(
      age <= 21            ~ "youth", 
      age > 21 & age <= 65 ~ "middle", 
      age > 65             ~ "senior")
    )

# using tableby to create table with three categories
tableby(
  age_fct ~ stocks + educ_BA_bin,
  numeric.stats    = "meansd", # only one summary stat
  numeric.simplify = TRUE,     # put on the same row as variable name
  numeric.test     = "kwt",    # manually set test types
  cat.test         = "chisq",  # manually set test types
  data             = anes_matched, 
  test             = TRUE,     # set test to true (remember to pair w/ mean/sd)
  digits           = 2
  ) %>% 
  summary(title = "Covariate Balance on Matched Characteristics",
          pfootnote = TRUE)

Table 5: Covariate Balance on Matched Characteristics
	middle (N=2033)	senior (N=634)	youth (N=89)	Total (N=2756)	p value
stocks	0.56 (0.50)	0.63 (0.48)	0.17 (0.38)	0.56 (0.50)	< 0.001¹
educ_BA_bin	0.44 (0.50)	0.42 (0.49)	0.03 (0.18)	0.42 (0.49)	< 0.001¹

Kruskal-Wallis rank sum test

This supplement was put together by Beatriz Barros. Please email any questions, corrections or concerns to omar.wasow@pomona.edu.