Data Visualization and Descriptive Statistics (1)

Exploratory Data Analysis (1)

Weekly content




Data Visualization

Data visualization is an essential skill in data science, helping to turn complex results into comprehensible insights. In R, one of the most powerful tools for creating professional and visually appealing graphs is ggplot2. This package, built on the principles of the Grammar of Graphics by Leland Wilkinson, allows users to create graphs that are both informative and attractive. Let’s delve into the concepts and practical applications of ggplot2 to enhance your data visualization skills.


Grammar of Graphics

ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

See the official home of ggplot2: https://ggplot2.tidyverse.org/


Understanding ggplot2’s Grammar of Graphics

Components of the Grammar

At its core, ggplot2 operates on a coherent set of principles known as the “Grammar of Graphics.” This framework allows you to specify graphs in terms of their underlying components:

  • Aesthetics (aes): These define how data is mapped to visual properties like size, shape, and color.

  • Geoms (geometric objects): These are the actual visual elements that represent data—points, lines, bars, etc.

  • Stats (statistical transformations): Some plots require transformations, such as calculating means or fitting a regression line, which are handled by stats.

  • Scales: These control how data values are mapped to visual properties.

  • Coordinate systems: These define how plots are oriented, with Cartesian coordinates being the most common, but others like polar coordinates are available for specific needs.

  • Facets: Faceting allows you to generate multiple plots based on a grouping variable, creating a matrix of panels.
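As a rough sketch of how these components combine, the following builds a single plot in which each line corresponds to one item in the list above. It uses the mpg dataset that ships with ggplot2; the chosen variables, palette, and axis limits are illustrative assumptions, not something the grammar prescribes.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +    # data + aesthetics
  geom_point() +                                             # geom: one point per car
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # stat: fitted line per drv
  scale_colour_brewer(palette = "Dark2") +                   # scale: colour palette
  coord_cartesian(ylim = c(10, 45)) +                        # coordinate system
  facet_wrap(~ year)                                         # facets: one panel per year

p
```

Removing any single line after the first still leaves a valid plot, which is what makes the grammar convenient for building graphs step by step.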


Setting Up Your Environment

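Before creating plots, install ggplot2 (once) and load it in your R session. ggplot2 is part of the tidyverse, so loading tidyverse loads it as well:

# install.packages("tidyverse")  # run once if not yet installed
# ggplot2 is part of the tidyverse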
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Test that it works:

ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()


Practical Examples

Basic Plots

Let’s start with a basic scatter plot to examine the relationship between two variables in the mtcars dataset:

ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point()

This code plots the miles per gallon (mpg) against the weight (wt) of various cars. The aes function maps the aesthetics to the respective variables.
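A point that often causes confusion is the difference between mapping a variable inside aes() and setting a fixed value outside it. Here is a minimal sketch using the same mtcars data; the colour choice is arbitrary.

```r
library(ggplot2)

# Mapping: colour varies with a variable, so ggplot2 builds a scale and a legend
mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl)))

# Setting: every point gets the same fixed colour, and no legend is drawn
fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "steelblue")

mapped
fixed
```

Writing aes(colour = "steelblue") would instead treat the string as a one-level variable: ggplot2 would pick a colour from its default palette and add a legend, which is rarely what is intended.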

The next examples follow the official introduction to ggplot2: https://ggplot2.tidyverse.org/articles/ggplot2.html

library(tidyverse)

mpg %>% select(hwy, cty, cyl)
# A tibble: 234 × 3
     hwy   cty   cyl
   <int> <int> <int>
 1    29    18     4
 2    29    21     4
 3    31    20     4
 4    30    21     4
 5    26    16     6
 6    26    18     6
 7    27    18     6
 8    26    18     4
 9    25    16     4
10    28    20     4
# ℹ 224 more rows
ggplot(mpg, aes(hwy, cty)) +
  geom_point(aes(color = as.factor(cyl)))

ggplot(mpg, aes(hwy, cty)) +
  geom_point(aes(color = as.factor(cyl))) +
  geom_smooth(method ="lm")
`geom_smooth()` using formula = 'y ~ x'

ggplot(mpg, aes(hwy, cty)) +
  geom_point(aes(color = as.factor(cyl))) +
  geom_smooth(method ="glm")
`geom_smooth()` using formula = 'y ~ x'

ggplot(mpg, aes(hwy, cty)) +
  geom_point(aes(color = cyl)) +
  geom_smooth(method ="lm") +
  # coord_cartesian() +
  # scale_color_gradient() +
  theme_bw()
`geom_smooth()` using formula = 'y ~ x'

# Returns the last plot
last_plot()
`geom_smooth()` using formula = 'y ~ x'

# Saves the last plot as a 5 x 5 inch file named "plot.png" in the
# working directory. Matches file type to file extension.
# ggsave("plot.png", width = 5, height = 5)
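For a version of the save step that can be run without touching the working directory, here is a small sketch that writes to a temporary folder. The file name and the explicit plot, units, and dpi arguments are illustrative; ggsave() defaults to the last plot and to inches.

```r
library(ggplot2)

p <- ggplot(mpg, aes(hwy, cty)) + geom_point()

# Write to a temporary file so the working directory stays clean
out <- file.path(tempdir(), "plot.png")
ggsave(out, plot = p, width = 5, height = 5, units = "in", dpi = 300)

file.exists(out)
```

Passing the plot explicitly avoids saving the wrong figure when several plots have been drawn in the same session.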

One variable

# Continuous
a <- ggplot(mpg, aes(hwy))
a

a + geom_area(stat = "bin")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

a + geom_density(kernel = "gaussian")

a + geom_dotplot()
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

a + geom_freqpoly()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

a + geom_histogram(binwidth = 4)

mpg %>% ggplot()+
  geom_area(aes(hwy), stat="bin")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Discrete
b <- ggplot(mpg, aes(fl))
b + geom_bar()

Two variables

  • Continuous X & Continuous Y
# Continuous X & Continuous Y
f <- ggplot(mpg, aes(cty, hwy))
f + geom_blank()

f + geom_jitter()

f + geom_point()

# install.packages("quantreg")
library(quantreg)
Warning: package 'quantreg' was built under R version 4.4.2
Loading required package: SparseM
f + geom_quantile() + geom_jitter()
Smoothing formula not specified. Using: y ~ x

f + geom_rug(sides = "bl") + geom_jitter()

f + geom_rug(sides = "bl") + geom_point()

# The argument is `method`, not `model`
f + geom_smooth(method = "lm") + geom_point()
`geom_smooth()` using formula = 'y ~ x'

f + geom_text(aes(label = cty)) + 
  geom_jitter()

f + geom_text(aes(label = fl))

# Fixed values such as alpha and colour are set outside aes()
mpg %>% 
  ggplot(aes(cty, hwy, label = fl)) +
  geom_text(alpha = 0.1, colour = "red") +
  geom_jitter(alpha = 0.1, colour = "red")

# install.packages("ggimage")
library(ggimage)

img <- list.files(system.file("extdata", 
                              package="ggimage"),
                  pattern="png", full.names=TRUE)

img[2]
[1] "C:/R/R-4.4.1/library/ggimage/extdata/Rlogo.png"
f + geom_image(aes(image=img[2]))

  • Discrete X & Continuous Y
# Discrete X & Continuous Y
g <- ggplot(mpg, aes(class, hwy))
levels(as.factor(mpg$class))
[1] "2seater"    "compact"    "midsize"    "minivan"    "pickup"    
[6] "subcompact" "suv"       
str(mpg$class)
 chr [1:234] "compact" "compact" "compact" "compact" "compact" "compact" ...
unique(mpg$class)
[1] "compact"    "midsize"    "suv"        "2seater"    "minivan"   
[6] "pickup"     "subcompact"
mpg %>% count(class)
# A tibble: 7 × 2
  class          n
  <chr>      <int>
1 2seater        5
2 compact       47
3 midsize       41
4 minivan       11
5 pickup        33
6 subcompact    35
7 suv           62
# The ten rows with the highest highway mileage
top_hwy <- mpg %>% 
  select(manufacturer, class, hwy) %>% 
  arrange(desc(hwy)) %>% 
  head(10)
g

g + geom_bar(stat = "identity")

g + geom_boxplot()

g + geom_dotplot(binaxis = "y",
                 stackdir = "center")
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

g + geom_violin(scale = "area")

  • Discrete X & Discrete Y
# Discrete X & Discrete Y
diamonds
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
h <- ggplot(diamonds, aes(cut, color))
h + geom_jitter()

  • Continuous Bivariate Distribution
# Continuous Bivariate Distribution
# install.packages("ggplot2movies")
library(ggplot2movies)

movies %>% glimpse
Rows: 58,788
Columns: 24
$ title       <chr> "$", "$1000 a Touchdown", "$21 a Day Once a Month", "$40,0…
$ year        <int> 1971, 1939, 1941, 1996, 1975, 2000, 2002, 2002, 1987, 1917…
$ length      <int> 121, 71, 7, 70, 71, 91, 93, 25, 97, 61, 99, 96, 10, 10, 10…
$ budget      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ rating      <dbl> 6.4, 6.0, 8.2, 8.2, 3.4, 4.3, 5.3, 6.7, 6.6, 6.0, 5.4, 5.9…
$ votes       <int> 348, 20, 5, 6, 17, 45, 200, 24, 18, 51, 23, 53, 44, 11, 12…
$ r1          <dbl> 4.5, 0.0, 0.0, 14.5, 24.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4…
$ r2          <dbl> 4.5, 14.5, 0.0, 0.0, 4.5, 4.5, 0.0, 4.5, 4.5, 0.0, 0.0, 0.…
$ r3          <dbl> 4.5, 4.5, 0.0, 0.0, 0.0, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5…
$ r4          <dbl> 4.5, 24.5, 0.0, 0.0, 14.5, 14.5, 4.5, 4.5, 0.0, 4.5, 14.5,…
$ r5          <dbl> 14.5, 14.5, 0.0, 0.0, 14.5, 14.5, 24.5, 4.5, 0.0, 4.5, 24.…
$ r6          <dbl> 24.5, 14.5, 24.5, 0.0, 4.5, 14.5, 24.5, 14.5, 0.0, 44.5, 4…
$ r7          <dbl> 24.5, 14.5, 0.0, 0.0, 0.0, 4.5, 14.5, 14.5, 34.5, 14.5, 24…
$ r8          <dbl> 14.5, 4.5, 44.5, 0.0, 0.0, 4.5, 4.5, 14.5, 14.5, 4.5, 4.5,…
$ r9          <dbl> 4.5, 4.5, 24.5, 34.5, 0.0, 14.5, 4.5, 4.5, 4.5, 4.5, 14.5,…
$ r10         <dbl> 4.5, 14.5, 24.5, 45.5, 24.5, 14.5, 14.5, 14.5, 24.5, 4.5, …
$ mpaa        <chr> "", "", "", "", "", "", "R", "", "", "", "", "", "", "", "…
$ Action      <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0…
$ Animation   <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ Comedy      <int> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1…
$ Drama       <int> 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ Documentary <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ Romance     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Short       <int> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1…
i <- ggplot(movies, aes(year, rating))
i + geom_bin2d(binwidth = c(5, 0.5))

i + geom_density2d()

# install.packages("hexbin")
library(hexbin)
i + geom_hex()

# Continuous Function
j <- ggplot(economics, aes(date, unemploy))
j + geom_area()

j + geom_line()

j + geom_step(direction = "hv")

# Visualizing error
df <- data.frame(grp = c("A", "B"), fit = 4:5, se = 1:2)
k <- ggplot(df, 
            aes(grp, fit, 
                ymin = fit-se, 
                ymax = fit+se))

k + geom_crossbar(fatten = 2)

# A fixed colour is set outside aes()
k + geom_errorbar(col = "grey") +
  geom_point(col = "red")

k + geom_linerange()

k + geom_pointrange()

Three variables

# Three variables
?seals
starting httpd help server ... done
seals$z <- with(seals, sqrt(delta_long^2 + delta_lat^2))
m <- ggplot(seals, aes(long, lat))

m + geom_tile(aes(fill = z))

m + geom_contour(aes(z = z))

m + geom_raster(aes(fill = z), hjust=0.5,
                vjust=0.5, interpolate=FALSE)

# Scales
n <- b + geom_bar(aes(fill = fl))
n

n + scale_fill_manual(
  values = c("skyblue", "royalblue", "blue", "navy"),
  limits = c("d", "e", "p", "r"), breaks =c("d", "e", "p", "r"),
  name = "Fuel", labels = c("D", "E", "P", "R"))

# Color and fill scales
n <- b + geom_bar(aes(fill = fl))
o <- a + geom_dotplot(aes(fill = after_stat(x)))
# install.packages("RColorBrewer")
library(RColorBrewer)

n + scale_fill_brewer(palette = "Blues")

display.brewer.all()

n + scale_fill_grey(
  start = 0.2, end = 0.8,
  na.value = "red")

o + scale_fill_gradient(
  low = "red",
  high = "yellow")
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

o + scale_fill_gradientn(
  colours = terrain.colors(5))
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

# Also: rainbow(), heat.colors(),
# topo.colors(), cm.colors(),
# RColorBrewer::brewer.pal()
# Shape scales
f

p <- f + geom_point(aes(shape = fl))
p

p + scale_shape(solid = FALSE)

p + scale_shape_manual(values = c(3:7))

# Coordinate Systems
r <- b+geom_bar()
r + coord_cartesian(xlim = c(0, 5))

r + coord_fixed(ratio = 1/2)

r + coord_fixed(ratio = 1/10)

r + coord_fixed(ratio = 1/100)

r + coord_flip()

r + coord_polar(theta = "x", direction=1 )

# Position Adjustments

s <- ggplot(mpg, aes(fl, fill = drv))

# Arrange elements side by side
s + geom_bar(position = "dodge")

# Stack elements on top of one another, normalize height
s + geom_bar(position = "fill")

# Stack elements on top of one another
s + geom_bar(position = "stack")

# Add random noise to X and Y position of each element to avoid overplotting
f + geom_point(position = "jitter")

# Theme
r + theme_bw()

r + theme_classic()

r + theme_grey()

r + theme_minimal()

# Faceting

t <- ggplot(mpg, aes(cty, hwy)) + geom_point()

# Facet into columns based on fl
t + facet_grid(. ~ fl)

# Facet into rows based on fl
t + facet_grid(fl ~ .)

# Facet into rows based on year
t + facet_grid(year ~ .)

# Facet into both rows and columns
t + facet_grid(year ~ fl)

# Wrap facets into a rectangular layout
t + facet_wrap(~ fl)

# Labels
# Add a main title above the plot
t + ggtitle("New Plot Title")

# Change the label on the X axis
t + xlab("New X label")

# Change the label on the Y axis
t + ylab("New Y label")

# Set the title and both axis labels at once
t + labs(title = "New title", x = "New x", y = "New y")


R code for a graph like the one below: average fuel economy by manufacturer
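A minimal sketch of one way to draw average fuel economy by manufacturer. It assumes the mpg data, highway mileage (hwy) as the fuel-economy measure, and a sorted horizontal bar chart; the name avg_mpg is made up for this example, and the intended figure may differ in all of these respects.

```r
library(dplyr)
library(ggplot2)

# Average highway mileage per manufacturer
# (assumption: hwy is the measure of interest)
avg_mpg <- mpg %>%
  group_by(manufacturer) %>%
  summarise(mean_hwy = mean(hwy), .groups = "drop")

# Sort the bars by the average so the ranking is easy to read
ggplot(avg_mpg, aes(x = reorder(manufacturer, mean_hwy), y = mean_hwy)) +
  geom_col() +
  coord_flip() +
  labs(title = "Average fuel economy by manufacturer",
       x = "Manufacturer",
       y = "Mean highway mileage (mpg)")
```

Swapping hwy for cty gives the city-mileage version of the same chart.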


Enhancing Visualizations

To enhance the basic mtcars scatter plot of weight against fuel efficiency shown earlier, we can add a linear regression line that summarizes the relationship:

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal() +
  labs(title = "Fuel Efficiency vs. Weight", x = "Weight (1000 lbs)", y = "Miles per Gallon")
`geom_smooth()` using formula = 'y ~ x'

This code not only adds the regression line but also improves the aesthetics with a minimal theme and labels that clarify what each axis represents.
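The line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit. As a quick cross-check, the same model can be fitted directly with base R's lm(); this is an illustrative aside, and the object name fit is an assumption of the sketch.

```r
# Fit the same straight line that geom_smooth(method = "lm") draws
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)               # intercept and slope of the fitted line
summary(fit)$r.squared  # share of the variance in mpg explained by wt
```

The slope is negative: heavier cars tend to travel fewer miles per gallon, which matches the downward trend in the plot.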

Practice once more with the Palmer Penguins dataset.

library(palmerpenguins)
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Drop rows with missing values

penguins %>% 
  drop_na()
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>
ggplot(penguins) +
  aes(x = bill_length_mm,
      y = bill_depth_mm,
      colour = species) +
  geom_point(shape = "circle", size = 1.5) +
  scale_color_manual(
    values = c(Adelie = "#F8766D",
    Chinstrap = "#00C19F",
    Gentoo = "#FF61C3")
  ) +
  ggthemes::theme_fivethirtyeight() +
  theme(legend.position = "bottom")
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

  • Layers in use above

    • A layer that maps variables to the x-axis, y-axis, and colour

      • aes(x = bill_length_mm, y = bill_depth_mm, colour = species)
    • A layer that draws the data as points

      • geom_point(shape = "circle", size = 1.5)
    • A layer that sets the colour used for each species

      • scale_color_manual( values = c(Adelie = "#F8766D", Chinstrap = "#00C19F", Gentoo = "#FF61C3") )
    • A layer that sets the overall theme of the graph

      • ggthemes::theme_fivethirtyeight()
    • A layer that sets the position of the legend

      • theme(legend.position = "bottom")


Advanced ggplot2 Features

Faceting for Comparative Analysis

To compare how the relationship between weight and fuel efficiency varies by the number of cylinders in the engine, we can use faceting:

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~cyl)

This will create a separate plot for each number of cylinders, making it easy to see differences across categories.
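facet_wrap() also accepts arguments that control the panel layout and strip labels. Here is a small sketch, assuming we want all panels in one row and each strip to show the variable name as well as its value; nrow and labeller = label_both are standard ggplot2 options, while p_facets is just a name chosen for this example.

```r
library(ggplot2)

p_facets <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # One row of panels; strips read "cyl: 4", "cyl: 6", "cyl: 8"
  facet_wrap(~ cyl, nrow = 1, labeller = label_both)

p_facets
```

By default all panels share the same axes, which keeps the panels directly comparable.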

  • facet: a particular aspect or feature of something
ggplot(penguins) +
  aes(x = bill_length_mm,
      y = bill_depth_mm,
      colour = species) +
  geom_point(shape = "circle", size = 1.5) +
  scale_color_manual(
    values = c(Adelie = "#F8766D",
    Chinstrap = "#00C19F",
    Gentoo = "#FF61C3")
  ) +
  ggthemes::theme_fivethirtyeight() +
  theme(legend.position = "bottom") +
  facet_wrap(~island)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

penguins %>% drop_na %>% 
ggplot() +
  aes(x = bill_length_mm,
      y = bill_depth_mm,
      colour = species) +
  geom_point(shape = "circle", size = 1.5) +
  scale_color_manual(
    values = c(Adelie = "#F8766D",
    Chinstrap = "#00C19F",
    Gentoo = "#FF61C3")
  ) +
  ggthemes::theme_fivethirtyeight() +
  theme(legend.position = "bottom") +
  facet_wrap(sex ~ island)

Customization and Extensions

Check out the extensions of ggplot2: https://exts.ggplot2.tidyverse.org/gallery/

ggplot2 is highly customizable, allowing extensive control over nearly every visual aspect of a plot. For users interested in making interactive plots, ggplot2 can be integrated with the plotly library, transforming static charts into interactive visualizations.
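As a hedged illustration of the plotly integration, the sketch below converts a ggplot object with plotly::ggplotly(). It assumes the plotly package is installed, and skips the conversion otherwise; the example plot and the name interactive_plot are arbitrary.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# plotly is a separate package: install.packages("plotly")
if (requireNamespace("plotly", quietly = TRUE)) {
  # Convert the static plot into an interactive widget with tooltips, zoom, and pan
  interactive_plot <- plotly::ggplotly(p)
  interactive_plot
}
```

Not every ggplot2 feature or extension converts cleanly, so it is worth checking the interactive result against the static plot.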

The power and flexibility of ggplot2 make it an indispensable tool for data visualization in R. Whether you are a beginner or an experienced user, there is always more to explore and learn with ggplot2. Practice regularly, and don’t hesitate to experiment with different components to discover the best ways to convey your insights visually.


To master ggplot2, see the videos below:

ggplot2 workshop part 1 by Thomas Lin Pedersen

https://www.youtube.com/watch?v=h29g21z0a68

ggplot2 workshop part 2 by Thomas Lin Pedersen

https://www.youtube.com/watch?v=0m4yywqNPVY


Practice more with the gapminder dataset

library(gapminder)
library(dplyr)

y <- gapminder %>% group_by(year, continent) %>% summarize(c_pop = sum(pop))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
head(y, 20)
# A tibble: 20 × 3
# Groups:   year [4]
    year continent      c_pop
   <int> <fct>          <dbl>
 1  1952 Africa     237640501
 2  1952 Americas   345152446
 3  1952 Asia      1395357351
 4  1952 Europe     418120846
 5  1952 Oceania     10686006
 6  1957 Africa     264837738
 7  1957 Americas   386953916
 8  1957 Asia      1562780599
 9  1957 Europe     437890351
10  1957 Oceania     11941976
11  1962 Africa     296516865
12  1962 Americas   433270254
13  1962 Asia      1696357182
14  1962 Europe     460355155
15  1962 Oceania     13283518
16  1967 Africa     335289489
17  1967 Americas   480746623
18  1967 Asia      1905662900
19  1967 Europe     481178958
20  1967 Oceania     14600414
plot(y$year, y$c_pop)

plot(y$year, y$c_pop, col = y$continent)

plot(y$year, y$c_pop, col = y$continent, pch = c(1:5))
plot(y$year, y$c_pop, col = y$continent, pch = c(1:length(levels(y$continent))))

# Hard-code the number of legend entries
legend("topright", legend = levels(y$continent), pch = 1:5, col = 1:5)

# Derive the number of legend entries from the factor levels
legend("bottomleft",
       legend = levels(y$continent),
       pch = seq_along(levels(y$continent)),
       col = seq_along(levels(y$continent)))

# 02 Basic features of visualization #
plot(gapminder$gdpPercap, gapminder$lifeExp, col = gapminder$continent)
# The plot uses the default symbol (pch = 1), so the legend does too
legend("bottomright",
       legend = levels(gapminder$continent),
       pch = 1,
       col = seq_along(levels(gapminder$continent)))

plot(log10(gapminder$gdpPercap), gapminder$lifeExp, col = gapminder$continent)
legend("bottomright",
       legend = levels(gapminder$continent),
       pch = 1,
       col = seq_along(levels(gapminder$continent)))

# install.packages("ggplot2")
library(ggplot2)
# Data only: no aesthetics or geoms yet, so this draws an empty canvas
gapminder %>% ggplot()

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, col = continent)) +
   geom_point() +
   scale_x_log10()

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, col = continent, size = pop)) +
   geom_point() +
   scale_x_log10()

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, col = continent, size = pop)) +
   geom_point(alpha = 0.5) +
   scale_x_log10()

table(gapminder$year)

1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 
 142  142  142  142  142  142  142  142  142  142  142  142 
gapminder %>% filter(year==1977) %>%
   ggplot(., aes(x=gdpPercap, y=lifeExp, col=continent, size=pop)) +
   geom_point(alpha=0.5) +
   scale_x_log10()

gapminder %>% filter(year==2007) %>%
   ggplot(., aes(x=gdpPercap, y=lifeExp, col=continent, size=pop)) +
   geom_point(alpha=0.5) +
   scale_x_log10()

ggplot(gapminder, aes(x=gdpPercap, y=lifeExp, col=continent, size=pop)) +
   geom_point(alpha=0.5) +
   scale_x_log10() +
   facet_wrap(~year)

gapminder %>%
   filter(year == 1952 & continent =="Asia") %>%
   ggplot(aes(reorder(country, pop), pop)) +
   geom_bar(stat = "identity") +
   coord_flip()

gapminder %>% 
  filter(year==1952 & continent== "Asia") %>% 
  ggplot(aes(reorder(country, pop), pop)) + 
  geom_bar(stat = "identity") + 
  scale_y_log10() + 
  coord_flip()

gapminder %>%
   filter(country == "Korea, Rep.") %>%
   ggplot(aes(year, lifeExp, col = country)) +
   geom_point() +
   geom_line()

gapminder %>%
   filter(country == "Korea, Rep.") %>%
   ggplot(aes(year, lifeExp, col = country)) +
   # geom_point() +
   geom_line()

gapminder %>%
   ggplot(aes(x = year, y = lifeExp, col = continent)) +
   geom_point(alpha = 0.2) +
   geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'