Data Visualization (1)

Data viz intro

Weekly design


Pre-class video



# Data Visualization

# Average
# MARGIN = 1: mean of each row (mixes x and y values; shown only for contrast)
apply(anscombe, 1, mean)
 [1]  8.65250  7.45250 10.47125  8.56625  9.35875 10.49250  6.33750  7.03125
 [9]  9.71000  6.92625  5.75500
# MARGIN = 2: mean of each column
apply(anscombe, 2, mean)
      x1       x2       x3       x4       y1       y2       y3       y4 
9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909 
# Dispersion (variance of each column)
apply(anscombe, 2, var)
       x1        x2        x3        x4        y1        y2        y3        y4 
11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620  4.123249 
# Correlation (correlation coefficient)
cor(anscombe$x1, anscombe$y1)
[1] 0.8164205
cor(anscombe$x2, anscombe$y2)
[1] 0.8162365
cor(anscombe$x3, anscombe$y3)
[1] 0.8162867
cor(anscombe$x4, anscombe$y4)
[1] 0.8165214
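
The summaries above are nearly identical across the four x/y pairs, which is why the data need to be plotted. A minimal base R sketch that draws each pair with its fitted line follows; the names slopes, xi, yi, and fit are introduced here for illustration only.

```r
# anscombe ships with base R (package datasets), so no library() call is needed
slopes <- numeric(4)

op <- par(mfrow = c(2, 2))            # 2 x 2 grid of panels
for (i in 1:4) {
  xi <- anscombe[[paste0("x", i)]]
  yi <- anscombe[[paste0("y", i)]]
  fit <- lm(yi ~ xi)                  # simple linear regression for this pair
  slopes[i] <- coef(fit)[["xi"]]
  plot(xi, yi, main = paste("Set", i), xlim = c(3, 20), ylim = c(2, 14))
  abline(fit, col = "red")            # overlay the fitted line
}
par(op)                               # restore the previous graphics settings

round(slopes, 2)                      # the fitted slopes barely differ
```

The four panels look very different even though the fitted lines almost coincide.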
library(gapminder)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
y <- gapminder %>% group_by(year, continent) %>% summarize(c_pop = sum(pop))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
head(y, 20)
# A tibble: 20 × 3
# Groups:   year [4]
    year continent      c_pop
   <int> <fct>          <dbl>
 1  1952 Africa     237640501
 2  1952 Americas   345152446
 3  1952 Asia      1395357351
 4  1952 Europe     418120846
 5  1952 Oceania     10686006
 6  1957 Africa     264837738
 7  1957 Americas   386953916
 8  1957 Asia      1562780599
 9  1957 Europe     437890351
10  1957 Oceania     11941976
11  1962 Africa     296516865
12  1962 Americas   433270254
13  1962 Asia      1696357182
14  1962 Europe     460355155
15  1962 Oceania     13283518
16  1967 Africa     335289489
17  1967 Americas   480746623
18  1967 Asia      1905662900
19  1967 Europe     481178958
20  1967 Oceania     14600414
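
The message printed by summarise() above says the result is still grouped by year. A short sketch of the `.groups` argument it mentions follows, assuming an ungrouped result is wanted; the name y_ungrouped is introduced here for illustration, and as.numeric() is only a guard in case pop is stored as integer.

```r
library(gapminder)
library(dplyr)

y_ungrouped <- gapminder %>%
  group_by(year, continent) %>%
  # .groups = "drop" returns a plain tibble and silences the message
  summarize(c_pop = sum(as.numeric(pop)), .groups = "drop")

is_grouped_df(y_ungrouped)   # check whether any grouping remains
head(y_ungrouped)
```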
plot(y$year, y$c_pop)

plot(y$year, y$c_pop, col = y$continent)

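# pch = 1:5 is recycled down the rows; it matches continent only because y is sorted by year, then continent
plot(y$year, y$c_pop, col = y$continent, pch = c(1:5))
# Same plot, with the number of symbols taken from the factor levels
plot(y$year, y$c_pop, col = y$continent, pch = c(1:length(levels(y$continent))))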

# Hard-code the number of legend entries
legend("topright", legend = levels(y$continent), pch = c(1:5), col = c(1:5))

# Derive the number of legend entries from the factor levels
legend("bottomleft", legend = levels(y$continent), pch = c(1:length(levels(y$continent))), col = c(1:length(levels(y$continent))))
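
A variant that does not depend on row order takes the symbol from each row's own continent. This is a sketch, not the course's original code: cont_id is a name introduced here, and y is rebuilt so the block runs on its own.

```r
library(gapminder)
library(dplyr)

# Rebuild the yearly continent totals used above
y <- gapminder %>%
  group_by(year, continent) %>%
  summarize(c_pop = sum(as.numeric(pop)), .groups = "drop")

# Integer code of each row's continent (1 = first factor level, and so on)
cont_id <- as.integer(y$continent)

plot(y$year, y$c_pop, col = cont_id, pch = cont_id)
legend("topleft",
       legend = levels(y$continent),
       col = seq_along(levels(y$continent)),
       pch = seq_along(levels(y$continent)))
```

Because colour and symbol both come from cont_id, the legend stays correct even if the rows are reordered.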

# 02 Basic features of visualization #
plot(gapminder$gdpPercap, gapminder$lifeExp, col = gapminder$continent)
legend("bottomright",
       legend = levels(gapminder$continent),
       pch = 1,  # the plot above draws every point with the default symbol
       col = seq_along(levels(gapminder$continent)))

plot(log10(gapminder$gdpPercap), gapminder$lifeExp, col = gapminder$continent)
legend("bottomright", legend = levels(gapminder$continent), pch = 1, col = seq_along(levels(gapminder$continent)))

# install.packages("ggplot2")
library(ggplot2)

# Skeleton: the data is piped in and the variable mappings go inside aes()
gapminder %>% ggplot(aes())

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, col = continent)) +
   geom_point() +
   scale_x_log10()

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, col = continent, size = pop)) +
   geom_point() +
   scale_x_log10()

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, col = continent, size = pop)) +
   geom_point(alpha = 0.5) +
   scale_x_log10()

table(gapminder$year)

1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 
 142  142  142  142  142  142  142  142  142  142  142  142 
gapminder %>% filter(year==1977) %>%
   ggplot(., aes(x=gdpPercap, y=lifeExp, col=continent, size=pop)) +
   geom_point(alpha=0.5) +
   scale_x_log10()

gapminder %>% filter(year==2007) %>%
   ggplot(., aes(x=gdpPercap, y=lifeExp, col=continent, size=pop)) +
   geom_point(alpha=0.5) +
   scale_x_log10()

ggplot(gapminder, aes(x=gdpPercap, y=lifeExp, col=continent, size=pop)) +
   geom_point(alpha=0.5) +
   scale_x_log10() +
   facet_wrap(~year)

gapminder %>%
   filter(year == 1952 & continent =="Asia") %>%
   ggplot(aes(reorder(country, pop), pop)) +
   geom_bar(stat = "identity") +
   coord_flip()

gapminder %>%
   filter(year == 1952 & continent == "Asia") %>%
   ggplot(aes(reorder(country, pop), pop)) +
   geom_bar(stat = "identity") +
   scale_y_log10() +
   coord_flip()

gapminder %>%
   filter(country == "Korea, Rep.") %>%
   ggplot(aes(year, lifeExp, col = country)) +
   geom_point() +
   geom_line()

gapminder %>%
   filter(country == "Korea, Rep.") %>%
   ggplot(aes(year, lifeExp, col = country)) +
   # geom_point() +
   geom_line()

gapminder %>%
   ggplot(aes(x = year, y = lifeExp, col = continent)) +
   geom_point(alpha = 0.2) +
   geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

x <- filter(gapminder, year == 1952)
hist(x$lifeExp, main = "Histogram of lifeExp in 1952")

x %>% ggplot(aes(lifeExp)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
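
The message above asks for an explicit bin width. A small sketch follows, assuming 5-year bins are a reasonable choice for life expectancy; the value 5 and the name p_hist are illustrative, not from the original.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

x <- filter(gapminder, year == 1952)

# binwidth sets the width of each bar in data units (years of life expectancy);
# boundary = 0 aligns the bin edges to multiples of the bin width
p_hist <- ggplot(x, aes(lifeExp)) +
  geom_histogram(binwidth = 5, boundary = 0)
p_hist
```

Setting binwidth (or bins) explicitly also silences the message.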

x %>% ggplot(aes(continent, lifeExp)) + geom_boxplot()

plot(log10(gapminder$gdpPercap), gapminder$lifeExp)

Class


Data visualization is an essential skill in data science, helping to turn complex results into comprehensible insights. In R, one of the most powerful tools for creating professional and visually appealing graphs is ggplot2. This package, built on the principles of the Grammar of Graphics by Leland Wilkinson, allows users to create graphs that are both informative and attractive. Let’s delve into the concepts and practical applications of ggplot2 to enhance your data visualization skills.



Grammar of Graphics

ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

See the official home of ggplot2: https://ggplot2.tidyverse.org/


Understanding ggplot2’s Grammar of Graphics

Components of the Grammar

At its core, ggplot2 operates on a coherent set of principles known as the “Grammar of Graphics.” This framework allows you to specify graphs in terms of their underlying components:

  • Aesthetics (aes): These define how data is mapped to visual properties like size, shape, and color.

  • Geoms (geometric objects): These are the actual visual elements that represent data—points, lines, bars, etc.

  • Stats (statistical transformations): Some plots require transformations, such as calculating means or fitting a regression line, which are handled by stats.

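  • Scales: These control how data values are translated into visual values, for example which colors or sizes are used or whether an axis is on a log scale, and they produce the axes and legends.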

  • Coordinate systems: These define how plots are oriented, with Cartesian coordinates being the most common, but others like polar coordinates are available for specific needs.

  • Facets: Faceting allows you to generate multiple plots based on a grouping variable, creating a matrix of panels.
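
As a rough illustration, the components listed above can each be pointed to in a single plot. This sketch uses the mpg dataset that ships with ggplot2; the particular variables and axis limits are chosen only for demonstration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +                # data + aesthetics
  geom_point(aes(colour = class), alpha = 0.6) +           # geom: points
  geom_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                                # stat: fitted line
  scale_x_log10() +                                        # scale: log x-axis
  coord_cartesian(ylim = c(10, 45)) +                      # coordinate system
  facet_wrap(~drv)                                         # facets: one panel per drive type
p
```

Removing any one line changes only that aspect of the plot, which is the point of describing graphics as a grammar.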


The official introduction to ggplot2 walks through each of these components: https://ggplot2.tidyverse.org/articles/ggplot2.html


Setting Up Your Environment

Before creating plots, load ggplot2 in your R session. It is part of the tidyverse, so loading the tidyverse attaches it (install it first with install.packages("tidyverse") if needed):

# ggplot2 is one of the core tidyverse packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0
✔ readr     2.1.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Check that it works, using the mpg dataset that ships with ggplot2.

 ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()


Practical Examples

Basic Plots

Let’s start with a basic scatter plot to examine the relationship between two variables in the mtcars dataset:

ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point()

This code plots miles per gallon (mpg) against weight (wt) for the cars in mtcars. The aes() function maps each variable to an aesthetic: wt to the x-axis and mpg to the y-axis.
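
One distinction worth seeing in code is mapping a variable to an aesthetic versus setting an aesthetic to a fixed value. In this sketch the choice of cyl and of the colour "steelblue" is arbitrary, and the names p_mapped and p_fixed are introduced here.

```r
library(ggplot2)

# Mapping: colour varies with a variable, so it goes inside aes()
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# Setting: one fixed colour for every point, so it goes outside aes()
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "steelblue")

p_mapped
p_fixed
```

Only the mapped version produces a legend, because only there does colour carry information from the data.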

Enhancing Visualizations

To enhance this plot, we might want to add a linear regression line to summarize the relationship between weight and fuel efficiency:

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal() +
  labs(title = "Fuel Efficiency vs. Weight", x = "Weight (1000 lbs)", y = "Miles per Gallon")
`geom_smooth()` using formula = 'y ~ x'

This code not only adds the regression line (with its default 95% confidence band) but also cleans up the look of the plot with a minimal theme and adds labels that clarify what each axis represents.
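
The line drawn by geom_smooth(method = "lm") with the formula y ~ x corresponds to an ordinary least-squares fit of mpg on wt. To see the numbers behind the line, the same model can be fitted directly; fit is a name introduced here.

```r
# Same model as the smoothing layer: mpg explained by wt
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)      # intercept and slope of the fitted line
summary(fit)   # standard errors, R-squared, and so on
```

The slope is negative: heavier cars tend to get fewer miles per gallon.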

Practice once more with the Palmer Penguins dataset.

library(palmerpenguins)
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Drop rows with missing values

penguins %>% 
  drop_na()
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>
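
Note that the call above only prints the result; penguins itself is unchanged. A brief sketch of keeping the cleaned data and counting what was dropped follows; the name penguins_complete is introduced here for illustration.

```r
library(palmerpenguins)
library(tidyr)

# Keep the complete cases in a new object
penguins_complete <- drop_na(penguins)

nrow(penguins)            # all rows
nrow(penguins_complete)   # rows with no missing value in any column

# Rows missing either bill measurement
n_missing_bill <- sum(is.na(penguins$bill_length_mm) |
                      is.na(penguins$bill_depth_mm))
n_missing_bill
```

drop_na() removes a row if any column is missing, so it discards more rows than a plot of the two bill variables alone would need to.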
ggplot(penguins) +
  aes(x = bill_length_mm,
      y = bill_depth_mm,
      colour = species) +
  geom_point(shape = "circle", size = 1.5) +
  scale_color_manual(
    values = c(Adelie = "#F8766D",
    Chinstrap = "#00C19F",
    Gentoo = "#FF61C3")
  ) +
  ggthemes::theme_fivethirtyeight() +
  theme(legend.position = "bottom")
Warning: Removed 2 rows containing missing values (`geom_point()`).

  • Components used in the plot above

    • Aesthetic mapping: connects variables to the x-axis, the y-axis, and colour

      • aes(x = bill_length_mm, y = bill_depth_mm, colour = species)
    • Geom layer: draws each observation as a point

      • geom_point(shape = "circle", size = 1.5)
    • Scale: assigns a colour to each species

      • scale_color_manual( values = c(Adelie = "#F8766D", Chinstrap = "#00C19F", Gentoo = "#FF61C3") )
    • Complete theme: sets the overall look of the plot

      • ggthemes::theme_fivethirtyeight()
    • Theme adjustment: moves the legend below the plot

      • theme(legend.position = "bottom")
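
Because each component is added with +, the same plot can be built one step at a time and inspected along the way. In this sketch the names p0 to p3 are introduced here, and theme_minimal() stands in for ggthemes::theme_fivethirtyeight() only to avoid the extra package.

```r
library(ggplot2)
library(palmerpenguins)

# Step 1: data and aesthetic mapping (nothing is drawn yet)
p0 <- ggplot(penguins) +
  aes(x = bill_length_mm, y = bill_depth_mm, colour = species)

# Step 2: add the points; na.rm = TRUE drops missing rows without a warning
p1 <- p0 + geom_point(shape = "circle", size = 1.5, na.rm = TRUE)

# Step 3: choose the colour for each species
p2 <- p1 + scale_color_manual(
  values = c(Adelie = "#F8766D", Chinstrap = "#00C19F", Gentoo = "#FF61C3")
)

# Step 4: overall theme, then the legend position
p3 <- p2 + theme_minimal() + theme(legend.position = "bottom")
p3
```

Printing p0, p1, and p2 in turn shows what each addition contributes.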


Advanced ggplot2 Features

Faceting for Comparative Analysis

To compare how the relationship between weight and fuel efficiency varies by the number of cylinders in the engine, we can use faceting:

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~cyl)

This creates a separate panel for each cylinder count (4, 6, and 8), making it easy to see differences across categories.

  • facet: a particular aspect or feature of something
ggplot(penguins) +
  aes(x = bill_length_mm,
      y = bill_depth_mm,
      colour = species) +
  geom_point(shape = "circle", size = 1.5) +
  scale_color_manual(
    values = c(Adelie = "#F8766D",
    Chinstrap = "#00C19F",
    Gentoo = "#FF61C3")
  ) +
  ggthemes::theme_fivethirtyeight() +
  theme(legend.position = "bottom") +
  facet_wrap(~island)
Warning: Removed 2 rows containing missing values (`geom_point()`).

penguins %>% drop_na() %>% 
  ggplot() +
  aes(x = bill_length_mm,
      y = bill_depth_mm,
      colour = species) +
  geom_point(shape = "circle", size = 1.5) +
  scale_color_manual(
    values = c(Adelie = "#F8766D",
    Chinstrap = "#00C19F",
    Gentoo = "#FF61C3")
  ) +
  ggthemes::theme_fivethirtyeight() +
  theme(legend.position = "bottom") +
  facet_wrap(sex ~ island)
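
With two faceting variables, facet_wrap() wraps every sex and island combination into one sequence of panels. An alternative sketch uses facet_grid(), which puts one variable on the rows and the other on the columns; the name p_grid is introduced here and the styling is simplified.

```r
library(ggplot2)
library(palmerpenguins)
library(tidyr)

p_grid <- drop_na(penguins) |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) +
  geom_point(size = 1.5) +
  facet_grid(sex ~ island) +          # rows = sex, columns = island
  theme(legend.position = "bottom")
p_grid
```

The grid layout makes it easier to compare along one variable while holding the other fixed.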

Customization and Extensions

Check out the extensions of ggplot2: https://exts.ggplot2.tidyverse.org/gallery/

ggplot2 is highly customizable, allowing extensive control over nearly every visual aspect of a plot. For users interested in making interactive plots, ggplot2 can be integrated with the plotly library, transforming static charts into interactive visualizations.
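
A minimal sketch of the plotly integration follows, assuming the plotly package is installed; the requireNamespace() check is there only so the block still runs when it is not.

```r
library(ggplot2)

# An ordinary static ggplot2 object
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# ggplotly() converts the ggplot object into an interactive widget
if (requireNamespace("plotly", quietly = TRUE)) {
  p_interactive <- plotly::ggplotly(p)
  p_interactive
}
```

In the resulting widget, hovering over a point shows its values, and the plot can be zoomed and panned.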

The power and flexibility of ggplot2 make it an indispensable tool for data visualization in R. Whether you are a beginner or an experienced user, there is always more to explore and learn with ggplot2. Practice regularly, and don’t hesitate to experiment with different components to discover the best ways to convey your insights visually.


To master ggplot2, see the videos below:

ggplot2 workshop part 1 by Thomas Lin Pedersen

https://www.youtube.com/watch?v=h29g21z0a68

ggplot2 workshop part 2 by Thomas Lin Pedersen

https://www.youtube.com/watch?v=0m4yywqNPVY