3 Descriptives and visualizations

3.1 Overview

In this section, I describe and visualize the variables. We got variables on the person-level and on the wave level.

Person-level

  • Age and gender (age, gender)
  • What region of the UK participants are from (region)
  • Whether they’ve used a medium in the three months before the first wave (variables starting with filter)
  • To what extent they identify with a medium (variables ending with identity)

Wave-level

  • How much they used a medium per wave (variables ending with time)
  • Their well-being per wave (well_being)

3.2 Person-level

Let’s have a look at the final sample. Overall, our sample size is N = 2159. Participants, on average, were M = 47 years old, SD = 14.6, see Figure 3.1. The gender distribution is pretty equal (1123 women (52.01%), 1035 men, and 1 non-binary participants).
Distribution of age

Figure 3.1: Distribution of age

Alright, next let’s have a look again (now that we have the final sample) at how many people had used a medium in the three months before wave 1 (Table 3.1).
Table 3.1: Frequency of how much a medium was used at wave 1
Medium Used Not Used Proportion
audiobooks 274 1885 12.7
ebooks 1032 1127 47.8
films 1339 820 62.0
magazines 498 1661 23.1
music 1420 739 65.8
tv 1310 849 60.7
video_games 624 1535 28.9
Now let’s have a look at how much people identified with each medium. Note that people weren’t asked those identity questions if they indicated that they hadn’t used a medium at wave 1.
Table 3.2: Descriptive information for identity scales
variable N mean sd median min max range cilow cihigh
music_identity 1420 2.19 0.67 2.14 1 4 3 2.16 2.23
tv_identity 1310 2.16 0.63 2.14 1 4 3 2.13 2.20
films_identity 1339 2.26 0.67 2.14 1 4 3 2.22 2.29
games_identity 624 2.48 0.82 2.43 1 4 3 2.42 2.55
e_publishing_identity 1211 2.33 0.69 2.29 1 4 3 2.29 2.37
Figure 3.2 shows the distributions of those identity variables. Apparently gamers identified more with games than for example music listeners identified with music.
Distribution of identity variables

Figure 3.2: Distribution of identity variables

3.3 Wave-level

Let’s inspect the descriptive information for the different media. Table 3.3 shows that people used TV, music, and films the most, but spent little time on magazines and audio books.
Table 3.3: Descriptive information for use time (including zero estimates)
variable N mean sd median min max range cilow cihigh
music_time 10985 2.28 2.57 2.00 0 22.22 22.22 2.24 2.33
tv_time 10985 3.25 3.38 2.50 0 22.00 22.00 3.18 3.31
films_time 10985 2.33 2.62 2.00 0 22.25 22.25 2.28 2.37
games_time 10985 0.98 2.24 0.00 0 22.00 22.00 0.94 1.02
books_time 10985 1.09 1.96 0.17 0 21.00 21.00 1.05 1.12
magazines_time 10985 0.24 0.87 0.00 0 20.00 20.00 0.23 0.26
audiobooks_time 10985 0.16 0.78 0.00 0 20.00 20.00 0.15 0.18
Figure 3.3 shows the distributions of those time variables. Note that most distributions are zero-inflated with a heavy skew.
Distribution of time variables (with zeros)

Figure 3.3: Distribution of time variables (with zeros)

Let’s also inspect those summary statistics and distributions without zeros. Table 3.4 shows the same pattern was with zeros, but overall higher means. Interestingly, those who listen to audiobooks report to do so for a long time per day.
Table 3.4: Descriptive information for use time (excluding zero estimates)
variable N mean sd median min max range cilow cihigh
music_time 8251 3.04 2.54 2.25 0.02 22.22 22.20 2.99 3.09
tv_time 8537 4.18 3.29 3.00 0.02 22.00 21.98 4.11 4.25
films_time 7956 3.21 2.58 2.33 0.02 22.25 22.23 3.15 3.27
games_time 3873 2.78 3.04 2.00 0.02 22.00 21.98 2.69 2.88
books_time 5561 2.14 2.31 1.50 0.02 21.00 20.98 2.08 2.21
magazines_time 2442 1.09 1.58 0.67 0.02 20.00 19.98 1.03 1.16
audiobooks_time 1101 1.64 1.93 1.00 0.02 20.00 19.98 1.52 1.75
Figure 3.4 shows a less skewed distribution once we remove zeros.
Distribution of time variables (without zeros)

Figure 3.4: Distribution of time variables (without zeros)

As a next step, let’s see how use developed over time. Figure 3.5 shows that most media stay stable, except maybe an overall spike in wave 2.
Average use over time per medium

Figure 3.5: Average use over time per medium

Last, let’s see the overall descriptives of the well-being variables. Table 3.5 shows that both life satisfaction and affect were above the mid-point of the scale - even though for affect, it’s a close call.
Table 3.5: Well-being descriptives
variable N mean sd median min max range cilow cihigh
affect 10985 5.94 2.18 6 0 10 10 5.90 5.98
life_satisfaction 10985 6.40 2.08 7 0 10 10 6.37 6.44
Let’s plot the distribution of both affect and life satisfaction in Figure 3.6.
Distribution of well-being across waves and participants

Figure 3.6: Distribution of well-being across waves and participants

Last, let’s look at how well-being develops over time. Figure 3.7 shows that life satisfaction appears stable: note the y-axis is on decimal points (original range: 0 to 10). In contrast, affect has significantly decreased, mostly because anxiety has decreased (not shown on graph).
Average estimates over time per medium

Figure 3.7: Average estimates over time per medium

3.4 Correlation matrices

Let’s have a look at the row correlations between the media use variables and well-being. Figure 3.8 shows that a) most media use is correlated, and b) there are negative, but very small correlations between well-being and media use. The significance is negligible here just because the sample is so large. That said, we haven’t done any grouping yet, so the correlation matrix treats all observations as independent, which is clearly not the case.
Correlation matrix of hours spent with a medium and well-being

Figure 3.8: Correlation matrix of hours spent with a medium and well-being

3.5 Plots for Paper

For the paper, I want to avoid confronting readers with blocks of numbers. Therefore, I’ll try to put as much info as possible into figures rather than tables or in-text reports of numbers.

I think the following plots are necessary for the paper:

  1. A plot of the response rate per wave, simply as a descriptive measure.
  2. A plot showcasing the exclusions. I’ll describe the exclusion criteria in the paper, but don’t want to overwhelm readers with numbers for each.
  3. A plot showing the distributions, M, and SD of media use for different categories.
  4. Same plot, but for well-being.
  5. A plot of the results.

First, a plot for the response rate for each wave. We already showed response rates as a table in the processing section, so we can re-use the completion_table object for a plot. Not sure about the colors yet, might go all black for this one.

ggplot(
  data = completion_table %>% 
    rename(
      Wave = Waves,
      `Participants per wave (%)` = `Participants per wave`
    ) %>% 
    mutate(
      Wave = as.factor(Wave)
    ),
  aes(
    x = Wave,
    y = `Participants per wave (%)`#,
    # color = Wave,
    # fill = Wave
  )
) +
  geom_segment(
    aes(
      xend = Wave,
      y = 0,
      yend = `Participants per wave (%)`
    ),
    size = 1
  ) +
  geom_point(
    size = 2
  ) +
  geom_text(
    aes(
      label = paste0(`Participants per wave (%)`, "(", `Frequency per wave`, "%)"),
      y = `Participants per wave (%)` + 160
    )
  ) +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  theme(
    legend.position = "none"
  ) -> figure1

figure1

# create figure folder
dir.create("figures/", FALSE, TRUE)

ggsave(
  filename = here("figures", "figure1.png"),
  plot = figure1,
  width = 21 * 0.8,
  height = 29.7 * 0.4,
  units = "cm",
  dpi = 300
)

Next, I plot the exclusions. There’s two ways to go about this. Either I show the absolute number of participants and observations that we exclude with each criterion in relation to the total sample size. I do that below. Note that getting those exclusions in a nice format requires some wrangling. I already created the exclusion_plot_data object in the processing section. Here, I turn it into the long format first.

# to long format for plotting
exclusion_plot_data <-
  exclusion_plot_data %>%
  filter(!str_detect(Exclusion, "Total")) %>% # don't include the total numbers for now
  rename(Participants = PPs) %>% # nicer name
  pivot_longer(
    Participants:Observations,
    names_to = "Measure",
    values_to = "Value"
  ) %>% 
  mutate( # create new variable that's stable across exclusions
    Total = case_when(
      Measure == "Participants" ~ exclusion_plot_data %>% filter(Type == "Total") %>% pull(PPs),
      TRUE ~ exclusion_plot_data %>% filter(Type == "Total") %>% pull(Observations),
    ),
    `Total Before` = case_when(
      Measure == "Participants" ~ exclusion_plot_data %>% filter(Type == "Total Before") %>% pull(PPs),
      TRUE ~ exclusion_plot_data %>% filter(Type == "Total Before") %>% pull(Observations),
    ) 
  ) %>% 
  mutate(
    across(
      Type:Measure,
      as.factor
    )
  )

# https://stackoverflow.com/questions/11889625/annotating-text-on-individual-facet-in-ggplot2
# I want to show the total numbers once, not for each facet/criterion combination, which is why I create a data frame that only has a matching value for the positions in the plot that I want
label_positions <- 
  tibble(
    Type = rep("Wave-level", 2),
    Exclusion = c(3, 3), # on the right, where wave-level exclusions are
    Measure = c("Participants", "Observations")
  ) %>% 
  mutate(
    across(
      everything(),
      as.factor
    )
  )

# add the total values
label_positions <- 
  left_join(
    label_positions,
    exclusion_plot_data %>% select(-Value)
  )

# then plot
ggplot(
  exclusion_plot_data,
    aes(
      x = Exclusion,
      y = Value,
      color = Type
    )
  ) +
  geom_segment(
    aes(
      x = Exclusion,
      xend = Exclusion,
      y = 0,
      yend = Value
    ),
    alpha = 1,
    size = 1
  ) +
  facet_grid(Measure ~ Type, scales = "free") +
  geom_point(alpha = 1, size = 2) +
  geom_text(
    aes(
      label = paste0("-", `Total Before` - Value),
      x = as.numeric(Exclusion) + 0.35,
      y = Total * 0.1
    )
  ) +
  geom_hline(
    aes(
      yintercept = Total
    ),
    linetype = "dashed"
  ) +
  geom_hline(
    aes(
      yintercept = `Total Before`
    ),
    linetype = "solid"
  ) +
  geom_text(
    data = label_positions,
    aes(
      label = `Total`,
      x = 3.3,
      y = Total * 1.05
    ),
    color = "black"
  ) +
  geom_text(
    data = label_positions,
    aes(
      label = `Total Before`,
      x = 3.3,
      y = `Total Before` * 1.05
    ),
    color = "black"
  ) +
  xlab("Exclusion Criterion") +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  theme(
    axis.title.y = element_blank(),
    legend.position = "none",
    panel.background=element_blank(),
    panel.border=element_blank(),
    panel.grid.major=element_blank(),
    panel.grid.minor=element_blank(),
    plot.background=element_blank(),
    strip.background = element_blank()
  ) -> exclusion_figure1

exclusion_figure1

Alternatively, we could emphasize the contributions of each criterion in relation to the other criteria. Not sure I like this because it doesn’t show the total sample size and the reduction in total sample size. Then again, could simply add those to the figure captions.

ggplot(
  exclusion_plot_data %>% 
    mutate(
      Value = `Total Before` - Value
    ),
    aes(
      x = Exclusion,
      y = Value,
      color = Type
    )
  ) +
  geom_segment(
    aes(
      x = Exclusion,
      xend = Exclusion,
      y = 0,
      yend = Value
    ),
    alpha = 1,
    size = 1
  ) +
  facet_grid(Measure ~ Type, scales = "free") +
  geom_point(alpha = 1, size = 2) +
  geom_text(
    aes(
      label = paste0(round(Value/`Total Before`, digits = 4) * 100, "%"),
      y = Value
    ),
    vjust = 1,
    hjust = -0.2,
    size = 3
  ) +
  xlab("Exclusion Criterion") +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  theme(
    axis.title.y = element_blank(),
    legend.position = "none",
    panel.background=element_blank(),
    panel.border=element_blank(),
    panel.grid.major=element_blank(),
    panel.grid.minor=element_blank(),
    plot.background=element_blank(),
    strip.background = element_blank()
  ) -> exclusion_figure2

exclusion_figure2

Next, I plot the distributions of the use variables. I want to show distributions, but there are a couple of single large values, so violin or density plots will look strange with a long tail. Raincloud plots could solve that to a degree, but they duplicate information. I’ll go with a doptplot, namely beeswarm plots.

Like before, because I facet, I’ll turn the data into the long format first. I’m also not sure whether I like vertical or horizontal beeswarms, so I’ll do both.

# temporary data file in long format
dat <- 
  working_file %>% 
  pivot_longer(
    contains("_time"), # only the time variables
    names_to = "measure",
    values_to = "Hours per day"
  ) %>% 
  mutate(
    measure = as.factor(str_to_title(str_remove(measure, "_time"))), # prettier factor levels
    measure = fct_recode(
      measure,
      "TV" = "Tv"
    ),
    measure = fct_relevel(
      measure,
      c(
        "Music",
        "TV",
        "Films",
        "Games",
        "Books",
        "Magazines",
        "Audiobooks",
        "Total"
      )
    )
  )

# summary stats for plotting means over time
dat_summary <- 
  dat %>% 
  group_by(wave, measure) %>% 
  summarise(
    mean = mean(`Hours per day`, na.rm = T),
    sd = sd(`Hours per day`, na.rm = T)
  ) %>% 
  mutate(
    across(
      mean:sd,
      ~ round(.x, digits = 1)
    )
  )

# let's try horizontal first (the commented out section adds a vertical line where the mean is)
ggplot(
  dat,
  aes(
    x = `Hours per day`,
    y = 1,
    color = measure,
    fill = measure
  )
) +
  geom_quasirandom(groupOnX=FALSE, size = 0.1, alpha = 0.5) +
  # geom_vline(
  #   data = dat_summary,
  #   aes(
  #     xintercept = mean,
  #     color = measure
  #   ),
  #   linetype = "dashed"
  # ) +
  geom_point(
    data = dat_summary,
    aes(
      x = mean,
      y = 0.55
    ),
    shape = 25,
    size = 2
  ) +
  facet_grid(measure ~ wave) +
  geom_text(
    data = dat_summary,
    aes(
      x = 20,
      y = 1.4,
      label = paste0("M = ", mean)
    ),
    size = 3,
    color = "black"
  ) +
  geom_text(
    data = dat_summary,
    aes(
      x = 19.6,
      y = 1.3,
      label = paste0("SD = ", sd)
    ),
    size = 3,
    color = "black"
  ) +
  theme_cowplot() +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  theme(
    axis.text.y = element_blank(),
    axis.title.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.line.y = element_blank(),
    strip.background.x = element_blank(),
    strip.background.y = element_blank(),
    legend.position = "none"
  ) -> figure2.1

figure2.1

I guess the trendline makes it easier to see developments (or lack thereof) over time.

ggplot(
  dat,
  aes(
    x = as.factor(wave),
    y = `Hours per day`,
    color = measure,
    fill = measure,
    group = 1
  )
) +
  geom_quasirandom(size = 0.1, alpha = 0.2) +
  geom_line(
    data = dat_summary,
    aes(
      y = mean
    ),
    size = 1
  ) +
  facet_wrap(~ measure, ncol = 2) +
  geom_text(
    data = dat_summary,
    aes(
      x = as.factor(wave),
      y = -1.2,
      label = paste0("M = ", mean)
    ),
    size = 2.5,
    color = "black"
  ) +
  geom_text(
    data = dat_summary,
    aes(
      x = as.factor(wave),
      y = -2.9,
      label = paste0("SD = ", sd)
    ),
    size = 2.5,
    color = "black"
  ) +
  theme_cowplot() +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  xlab("Wave") +
  theme(
    strip.background.x = element_blank(),
    strip.background.y = element_blank(),
    legend.position = "none"
  ) -> figure2.2

figure2.2

Then again, for distributions with a small mean, the trendline blocks everything. Plus, because total time extends the y-axis, there’s lots of white space. I’ll give total time its separate plot to solve that issue. I also won’t have trendlines for the separate media, only for the total (too lazy to write the above into a function, so I’ll copy paste the code from above.).

ggplot(
  dat %>% 
    filter(measure != "Total"),
  aes(
    x = as.factor(wave),
    y = `Hours per day`,
    color = measure,
    fill = measure,
    group = 1
  )
) +
  geom_quasirandom(size = 0.1, alpha = 0.2) +
  # geom_line(
  #   data = dat_summary,
  #   aes(
  #     y = mean
  #   ),
  #   size = 1
  # ) +
  facet_wrap(~ measure, ncol = 2) +
  geom_text(
    data = dat_summary %>% 
      filter(measure != "Total"),
    aes(
      x = as.factor(wave),
      y = -1.2,
      label = paste0("M = ", mean)
    ),
    size = 2.5,
    color = "black"
  ) +
  geom_text(
    data = dat_summary %>% 
      filter(measure != "Total"),
    aes(
      x = as.factor(wave),
      y = -2.9,
      label = paste0("SD = ", sd)
    ),
    size = 2.5,
    color = "black"
  ) +
  theme_cowplot() +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  xlab("Wave") +
  theme(
    strip.background.x = element_blank(),
    strip.background.y = element_blank(),
    legend.position = "none"
  ) -> figure2.3

figure2.3

ggsave(
  here("figures", "figure2.png"),
  plot = figure2.3,
  width = 21 * 0.8,
  height = 29.7 * 0.8,
  units = "cm",
  dpi = 300
)

Then a separate plot for total time.

ggplot(
  dat %>% 
    filter(measure == "Total"),
  aes(
    x = as.factor(wave),
    y = `Hours per day`,
    group = 1
  )
) +
  geom_quasirandom(size = 0.1, alpha = 0.2, color = "#CC79A7") +
  geom_line(
    data = dat_summary %>% 
      filter(measure == "Total"),
    aes(
      y = mean
    ),
    size = 0.5,
    color = "#CC79A7"
  ) +
  geom_text(
    data = dat_summary %>% 
      filter(measure == "Total"),
    aes(
      x = as.factor(wave),
      y = -1.2,
      label = paste0("M = ", mean)
    ),
    size = 3,
    color = "black"
  ) +
  geom_text(
    data = dat_summary %>% 
      filter(measure == "Total"),
    aes(
      x = as.factor(wave),
      y = -2.4,
      label = paste0("SD = ", sd)
    ),
    size = 3,
    color = "black"
  ) +
  theme_cowplot() +
  xlab("Wave") +
  ylab("Total hours per day") +
  theme(
    strip.background.x = element_blank(),
    strip.background.y = element_blank(),
    legend.position = "none"
  ) -> figure3

figure3

ggsave(
  here("figures", "figure3.png"),
  plot = figure3,
  width = 21 * 0.9,
  height = 29.7 * 0.4,
  units = "cm",
  dpi = 300
)

Then, I need to think of a way to visualize use vs. nonuse over time. Bar graphs could work, but that would mean a lot of bar graphs (one per medium) per wave, which is probably overwhelming. Plus, if I show raw counts, it could be confusing as well, simply because height of the bar != number of participants. If everyone at wave 1 used only one medium, but at wave two everyone used all media, it’ll look like an increase in sample size.

So proportions make the most sense, simple and straightforward.

# get proportions of users per medium and wave
dat <- 
  working_file %>% 
  pivot_longer(
    contains("_time"), # only the time variables
    names_to = "measure",
    values_to = "Hours per day"
  ) %>% 
  mutate(
    # get use vs. no use in long format
    used = if_else(`Hours per day` == 0, 0, 1)
  ) %>% 
  # no need for total time
  filter(used == 1, measure != "total_time") %>%
  # remove "tome appendix
  mutate(measure = as.factor(str_to_title(str_remove(measure, "_time")))) %>% 
  # count how often a medium was used
  count(wave, measure, used) %>% 
  # add total N per wave (of only those that will be included in analysis)
  left_join(
    .,
    working_file %>% 
      count(wave, name = "N")
  ) %>% 
  # get proprotion
  mutate(
    proportion = round(n / N, digits = 2) * 100,
    measure = fct_recode(measure, "TV" = "Tv"),
    measure = fct_relevel(
      measure,
      c(
        "Music",
        "TV",
        "Films",
        "Games",
        "Books",
        "Magazines",
        "Audiobooks"
      )
    )
  )

ggplot(
  dat,
  aes(
    x = as.factor(wave),
    y = proportion,
    group = measure,
    color = measure
  )
) +
  geom_point(size = 3) +
  geom_line(size = 1) +
  scale_colour_manual(values=cb_palette) +
  scale_fill_manual(values = cb_palette) +
  ylab("% of people who used medium") +
  xlab("Wave") +
  geom_dl(aes(label = measure), method = list(dl.trans(x = x + 0.25), "last.points", cex = 0.6)) +
  theme(
    legend.position = "none"
  ) -> figure4

figure4

ggsave(
  here("figures", "figure4.png"),
  plot = figure4,
  width = 21 * 0.9,
  height = 29.7 * 0.4,
  units = "cm",
  dpi = 300
)

Next, I’ll do a beeswarm plot for well-being.

# temporary data file in long format
dat <- 
  working_file %>% 
  pivot_longer(
    c(affect, life_satisfaction), # only the aggregated well-being variables
    names_to = "measure",
    values_to = "Value"
  ) %>% 
  mutate(
    measure = fct_recode(
      as.factor(measure),
      "Affect" = "affect",
      "Life satisfaction" = "life_satisfaction"
    )
  )

# summary stats for plotting means over time
dat_summary <- 
  dat %>% 
  group_by(wave, measure) %>% 
  summarise(
    mean = mean(Value, na.rm = T),
    sd = sd(Value, na.rm = T)
  ) %>% 
  mutate(
    across(
      mean:sd,
      ~ round(.x, digits = 1)
    )
  )

ggplot(
  dat,
  aes(
    x = as.factor(wave),
    y = Value,
    color = measure,
    fill = measure,
    group = 1
  )
) +
  geom_quasirandom(size = 0.1, alpha = 0.2) +
  geom_line(
    data = dat_summary,
    aes(
      y = mean
    ),
    size = 0.5
  ) +
  facet_wrap(~ measure, ncol = 2) +
  geom_text(
    data = dat_summary,
    aes(
      x = as.factor(wave),
      y = -0.5,
      label = paste0("M = ", mean)
    ),
    size = 2.5,
    color = "black"
  ) +
  geom_text(
    data = dat_summary,
    aes(
      x = as.factor(wave),
      y = -0.9,
      label = paste0("SD = ", sd)
    ),
    size = 2.5,
    color = "black"
  ) +
  theme_cowplot() +
  scale_colour_manual(values=c("#E69F00", "#000000")) +
  scale_fill_manual(values = c("#E69F00", "#000000")) +
  xlab("Wave") +
  theme(
    axis.title.y = element_blank(),
    strip.background.x = element_blank(),
    strip.background.y = element_blank(),
    legend.position = "none"
  ) -> figure5

figure5

ggsave(
  here("figures", "figure5.png"),
  plot = figure5,
  width = 21 * 0.8,
  height = 29.7 * 0.4,
  units = "cm",
  dpi = 300
)