데이터 유형 및 구조(2)

R을 활용한 데이터 전처리(2)

Weekly content


Practice: Data Frame & List



  • Korea Media Panel data is used in news articles based on media statistics, such as the article below.


  • Consists of


I made a toy data set with the KMP like below.

  • Please download R data set here: List_KMP.RData

  • Place it in your working directory



load("data/List_KMP.RData")

str(List.KMP)
List of 4
 $ :'data.frame':   10 obs. of  11 variables:
  ..$ pid              : num [1:10] 9920004 12500003 17350004 17670001 23860001 ...
  ..$ age              : num [1:10] 9 26 12 55 70 58 50 68 39 37
  ..$ gender           : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 1 1 2
  ..$ Mobile.lv        : Factor w/ 5 levels "3G","LTE","LTE-A",..: 5 2 2 2 5 2 2 2 2 3
  ..$ Telecom          : Factor w/ 5 levels "SKT","KT","LG U+",..: 5 2 3 1 5 1 1 1 2 2
  ..$ Smartphone.brand : Factor w/ 8 levels "Samsung","Apple",..: 8 1 1 1 8 1 1 1 1 3
  ..$ sp.mobile        : num [1:10] 0 42 19 38 18 65 32 58 63 54
  ..$ sp.device        : num [1:10] 0 0 0 0 0 0 0 10 0 10
  ..$ sp.online.content: num [1:10] 0 53 19 38 18 65 32 68 63 114
  ..$ sp.offline.contet: num [1:10] 10 22 0 0 0 50 36 0 25 40
  ..$ year             : num [1:10] 2017 2017 2017 2017 2017 ...
 $ :'data.frame':   10 obs. of  11 variables:
  ..$ pid              : num [1:10] 9920004 12500003 17350004 17670001 23860001 ...
  ..$ age              : num [1:10] 10 27 13 56 71 59 51 69 40 38
  ..$ gender           : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 1 1 2
  ..$ Mobile.lv        : Factor w/ 5 levels "3G","LTE","LTE-A",..: 5 3 3 2 5 2 2 2 3 3
  ..$ Telecom          : Factor w/ 5 levels "SKT","KT","LG U+",..: 5 3 3 2 5 3 1 2 1 2
  ..$ Smartphone.brand : Factor w/ 8 levels "Samsung","Apple",..: 8 1 3 1 8 3 1 3 1 3
  ..$ sp.mobile        : num [1:10] 0 90 20 39 30 80 33 36 40 59
  ..$ sp.device        : num [1:10] 0 60 35 0 0 15 0 10 10 12
  ..$ sp.online.content: num [1:10] 0 359 55 39 30 95 33 46 50 71
  ..$ sp.offline.contet: num [1:10] 8 120 0 0 0 12 20 0 0 100
  ..$ year             : num [1:10] 2018 2018 2018 2018 2018 ...
 $ :'data.frame':   10 obs. of  11 variables:
  ..$ pid              : num [1:10] 9920004 12500003 17350004 17670001 23860001 ...
  ..$ age              : num [1:10] 11 28 14 57 72 60 52 70 41 39
  ..$ gender           : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 1 1 2
  ..$ Mobile.lv        : Factor w/ 5 levels "3G","LTE","LTE-A",..: 5 2 3 2 2 2 2 2 3 3
  ..$ Telecom          : Factor w/ 5 levels "SKT","KT","LG U+",..: 5 3 3 1 2 1 1 2 3 3
  ..$ Smartphone.brand : Factor w/ 8 levels "Samsung","Apple",..: 8 2 1 1 1 1 1 3 3 3
  ..$ sp.mobile        : num [1:10] 0 60 21 30 32 78 35 54 73 45
  ..$ sp.device        : num [1:10] 0 30 0 0 0 0 0 0 20 15
  ..$ sp.online.content: num [1:10] 0 90 21 30 32 78 35 54 93 60
  ..$ sp.offline.contet: num [1:10] 21 0 0 20 0 24 90 0 20 60
  ..$ year             : num [1:10] 2019 2019 2019 2019 2019 ...
 $ :'data.frame':   10 obs. of  4 variables:
  ..$ pid           : num [1:10] 9920004 12500003 17350004 17670001 23860001 ...
  ..$ Smarphone.time: num [1:10] 0 60 345 90 40 65 170 95 85 75
  ..$ SNS.time      : num [1:10] 0 0 10 0 0 0 0 0 0 0
  ..$ year          : num [1:10] 2019 2019 2019 2019 2019 ...


Personal Data 2017

List.KMP[[1]]
        pid age gender     Mobile.lv       Telecom Smartphone.brand sp.mobile
1   9920004   9   Male No Smartphone No Smartphone    No Smartphone         0
2  12500003  26   Male           LTE            KT          Samsung        42
3  17350004  12 Female           LTE         LG U+          Samsung        19
4  17670001  55   Male           LTE           SKT          Samsung        38
5  23860001  70   Male No Smartphone No Smartphone    No Smartphone        18
6  24450001  58   Male           LTE           SKT          Samsung        65
7  27570001  50   Male           LTE           SKT          Samsung        32
8  53620001  68   Male           LTE           SKT          Samsung        58
9  59570001  39   Male           LTE            KT          Samsung        63
10 65840001  37 Female         LTE-A            KT               LG        54
   sp.device sp.online.content sp.offline.contet year
1          0                 0                10 2017
2          0                53                22 2017
3          0                19                 0 2017
4          0                38                 0 2017
5          0                18                 0 2017
6          0                65                50 2017
7          0                32                36 2017
8         10                68                 0 2017
9          0                63                25 2017
10        10               114                40 2017

Personal Data 2018

List.KMP[[2]]
        pid age gender     Mobile.lv       Telecom Smartphone.brand sp.mobile
1   9920004  10   Male No Smartphone No Smartphone    No Smartphone         0
2  12500003  27   Male         LTE-A         LG U+          Samsung        90
3  17350004  13 Female         LTE-A         LG U+               LG        20
4  17670001  56   Male           LTE            KT          Samsung        39
5  23860001  71   Male No Smartphone No Smartphone    No Smartphone        30
6  24450001  59   Male           LTE         LG U+               LG        80
7  27570001  51   Male           LTE           SKT          Samsung        33
8  53620001  69   Male           LTE            KT               LG        36
9  59570001  40   Male         LTE-A           SKT          Samsung        40
10 65840001  38 Female         LTE-A            KT               LG        59
   sp.device sp.online.content sp.offline.contet year
1          0                 0                 8 2018
2         60               359               120 2018
3         35                55                 0 2018
4          0                39                 0 2018
5          0                30                 0 2018
6         15                95                12 2018
7          0                33                20 2018
8         10                46                 0 2018
9         10                50                 0 2018
10        12                71               100 2018

Personal Data 2019

List.KMP[[3]]
        pid age gender     Mobile.lv       Telecom Smartphone.brand sp.mobile
1   9920004  11   Male No Smartphone No Smartphone    No Smartphone         0
2  12500003  28   Male           LTE         LG U+            Apple        60
3  17350004  14 Female         LTE-A         LG U+          Samsung        21
4  17670001  57   Male           LTE           SKT          Samsung        30
5  23860001  72   Male           LTE            KT          Samsung        32
6  24450001  60   Male           LTE           SKT          Samsung        78
7  27570001  52   Male           LTE           SKT          Samsung        35
8  53620001  70   Male           LTE            KT               LG        54
9  59570001  41   Male         LTE-A         LG U+               LG        73
10 65840001  39 Female         LTE-A         LG U+               LG        45
   sp.device sp.online.content sp.offline.contet year
1          0                 0                21 2019
2         30                90                 0 2019
3          0                21                 0 2019
4          0                30                20 2019
5          0                32                 0 2019
6          0                78                24 2019
7          0                35                90 2019
8          0                54                 0 2019
9         20                93                20 2019
10        15                60                60 2019

Personal Media Diary 2019

List.KMP[[4]]
        pid Smarphone.time SNS.time year
1   9920004              0        0 2019
2  12500003             60        0 2019
3  17350004            345       10 2019
4  17670001             90        0 2019
5  23860001             40        0 2019
6  24450001             65        0 2019
7  27570001            170        0 2019
8  53620001             95        0 2019
9  59570001             85        0 2019
10 65840001             75        0 2019


Let’s name the list elements

# Check the existing names
names(List.KMP)
NULL
# Give names to each element of the list
names(List.KMP) <- c("p17", "p18", "p19", "d19")


How can we extract the first element of the list?

# one way
List.KMP[[1]]
        pid age gender     Mobile.lv       Telecom Smartphone.brand sp.mobile
1   9920004   9   Male No Smartphone No Smartphone    No Smartphone         0
2  12500003  26   Male           LTE            KT          Samsung        42
3  17350004  12 Female           LTE         LG U+          Samsung        19
4  17670001  55   Male           LTE           SKT          Samsung        38
5  23860001  70   Male No Smartphone No Smartphone    No Smartphone        18
6  24450001  58   Male           LTE           SKT          Samsung        65
7  27570001  50   Male           LTE           SKT          Samsung        32
8  53620001  68   Male           LTE           SKT          Samsung        58
9  59570001  39   Male           LTE            KT          Samsung        63
10 65840001  37 Female         LTE-A            KT               LG        54
   sp.device sp.online.content sp.offline.contet year
1          0                 0                10 2017
2          0                53                22 2017
3          0                19                 0 2017
4          0                38                 0 2017
5          0                18                 0 2017
6          0                65                50 2017
7          0                32                36 2017
8         10                68                 0 2017
9          0                63                25 2017
10        10               114                40 2017
# the other way
List.KMP[['p17']]
        pid age gender     Mobile.lv       Telecom Smartphone.brand sp.mobile
1   9920004   9   Male No Smartphone No Smartphone    No Smartphone         0
2  12500003  26   Male           LTE            KT          Samsung        42
3  17350004  12 Female           LTE         LG U+          Samsung        19
4  17670001  55   Male           LTE           SKT          Samsung        38
5  23860001  70   Male No Smartphone No Smartphone    No Smartphone        18
6  24450001  58   Male           LTE           SKT          Samsung        65
7  27570001  50   Male           LTE           SKT          Samsung        32
8  53620001  68   Male           LTE           SKT          Samsung        58
9  59570001  39   Male           LTE            KT          Samsung        63
10 65840001  37 Female         LTE-A            KT               LG        54
   sp.device sp.online.content sp.offline.contet year
1          0                 0                10 2017
2          0                53                22 2017
3          0                19                 0 2017
4          0                38                 0 2017
5          0                18                 0 2017
6          0                65                50 2017
7          0                32                36 2017
8         10                68                 0 2017
9          0                63                25 2017
10        10               114                40 2017

Lists of a list


# Create an empty list 
List.KMP[[5]] <- list(0)
# See the structure
str(List.KMP)
List of 5
 $ p17:'data.frame':    10 obs. of  11 variables:
  ..$ pid              : num [1:10] 9920004 12500003 17350004 17670001 23860001 ...
  ..$ age              : num [1:10] 9 26 12 55 70 58 50 68 39 37
  ..$ gender           : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 1 1 2
  ..$ Mobile.lv        : Factor w/ 5 levels "3G","LTE","LTE-A",..: 5 2 2 2 5 2 2 2 2 3
  ..$ Telecom          : Factor w/ 5 levels "SKT","KT","LG U+",..: 5 2 3 1 5 1 1 1 2 2
  ..$ Smartphone.brand : Factor w/ 8 levels "Samsung","Apple",..: 8 1 1 1 8 1 1 1 1 3
  ..$ sp.mobile        : num [1:10] 0 42 19 38 18 65 32 58 63 54
  ..$ sp.device        : num [1:10] 0 0 0 0 0 0 0 10 0 10
  ..$ sp.online.content: num [1:10] 0 53 19 38 18 65 32 68 63 114
  ..$ sp.offline.contet: num [1:10] 10 22 0 0 0 50 36 0 25 40
  ..$ year             : num [1:10] 2017 2017 2017 2017 2017 ...
 $ p18:'data.frame':    10 obs. of  11 variables:
  ..$ pid              : num [1:10] 9920004 12500003 17350004 17670001 23860001 ...
  ..$ age              : num [1:10] 10 27 13 56 71 59 51 69 40 38
  ..$ gender           : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 1 1 2
  ..$ Mobile.lv        : Factor w/ 5 levels "3G","LTE","LTE-A",..: 5 3 3 2 5 2 2 2 3 3
  ..$ Telecom          : Factor w/ 5 levels "SKT","KT","LG U+",..: 5 3 3 2 5 3 1 2 1 2
  ..$ Smartphone.brand : Factor w/ 8 levels "Samsung","Apple",..: 8 1 3 1 8 3 1 3 1 3
  ..$ sp.mobile        : num [1:10] 0 90 20 39 30 80 33 36 40 59
  ..$ sp.device        : num [1:10] 0 60 35 0 0 15 0 10 10 12
  ..$ sp.online.content: num [1:10] 0 359 55 39 30 95 33 46 50 71
  ..$ sp.offline.contet: num [1:10] 8 120 0 0 0 12 20 0 0 100
  ..$ year             : num [1:10] 2018 2018 2018 2018 2018 ...
 $ p19:'data.frame':    10 obs. of  11 variables:
  ..$ pid              : num [1:10] 9920004 12500003 17350004 17670001 23860001 ...
  ..$ age              : num [1:10] 11 28 14 57 72 60 52 70 41 39
  ..$ gender           : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 1 1 2
  ..$ Mobile.lv        : Factor w/ 5 levels "3G","LTE","LTE-A",..: 5 2 3 2 2 2 2 2 3 3
  ..$ Telecom          : Factor w/ 5 levels "SKT","KT","LG U+",..: 5 3 3 1 2 1 1 2 3 3
  ..$ Smartphone.brand : Factor w/ 8 levels "Samsung","Apple",..: 8 2 1 1 1 1 1 3 3 3
  ..$ sp.mobile        : num [1:10] 0 60 21 30 32 78 35 54 73 45
  ..$ sp.device        : num [1:10] 0 30 0 0 0 0 0 0 20 15
  ..$ sp.online.content: num [1:10] 0 90 21 30 32 78 35 54 93 60
  ..$ sp.offline.contet: num [1:10] 21 0 0 20 0 24 90 0 20 60
  ..$ year             : num [1:10] 2019 2019 2019 2019 2019 ...
 $ d19:'data.frame':    10 obs. of  4 variables:
  ..$ pid           : num [1:10] 9920004 12500003 17350004 17670001 23860001 ...
  ..$ Smarphone.time: num [1:10] 0 60 345 90 40 65 170 95 85 75
  ..$ SNS.time      : num [1:10] 0 0 10 0 0 0 0 0 0 0
  ..$ year          : num [1:10] 2019 2019 2019 2019 2019 ...
 $    :List of 1
  ..$ : num 0
# The first element of the fifth element
List.KMP[[5]][[1]]<-c(1:10)

# The second element of the fifth element
List.KMP[[5]][[2]]<-matrix(c(1:12), nrow=4)
List.KMP[[5]]
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12


Extract the ‘p17’ element

p17_df <- List.KMP[["p17"]]
p17_df
        pid age gender     Mobile.lv       Telecom Smartphone.brand sp.mobile
1   9920004   9   Male No Smartphone No Smartphone    No Smartphone         0
2  12500003  26   Male           LTE            KT          Samsung        42
3  17350004  12 Female           LTE         LG U+          Samsung        19
4  17670001  55   Male           LTE           SKT          Samsung        38
5  23860001  70   Male No Smartphone No Smartphone    No Smartphone        18
6  24450001  58   Male           LTE           SKT          Samsung        65
7  27570001  50   Male           LTE           SKT          Samsung        32
8  53620001  68   Male           LTE           SKT          Samsung        58
9  59570001  39   Male           LTE            KT          Samsung        63
10 65840001  37 Female         LTE-A            KT               LG        54
   sp.device sp.online.content sp.offline.contet year
1          0                 0                10 2017
2          0                53                22 2017
3          0                19                 0 2017
4          0                38                 0 2017
5          0                18                 0 2017
6          0                65                50 2017
7          0                32                36 2017
8         10                68                 0 2017
9          0                63                25 2017
10        10               114                40 2017
# Summary Statistics
summary(p17_df)
      pid                age           gender          Mobile.lv
 Min.   : 9920004   Min.   : 9.00   Male  :8   3G           :0  
 1st Qu.:17430003   1st Qu.:28.75   Female:2   LTE          :7  
 Median :24155001   Median :44.50              LTE-A        :1  
 Mean   :31235002   Mean   :42.40              5G           :0  
 3rd Qu.:47107501   3rd Qu.:57.25              No Smartphone:2  
 Max.   :65840001   Max.   :70.00                               
                                                                
          Telecom       Smartphone.brand   sp.mobile       sp.device 
 SKT          :4   Samsung      :7       Min.   : 0.00   Min.   : 0  
 KT           :3   No Smartphone:2       1st Qu.:22.25   1st Qu.: 0  
 LG U+        :1   LG           :1       Median :40.00   Median : 0  
 MVNO         :0   Apple        :0       Mean   :38.90   Mean   : 2  
 No Smartphone:2   Pantech      :0       3rd Qu.:57.00   3rd Qu.: 0  
                   Xiaomi       :0       Max.   :65.00   Max.   :10  
                   (Other)      :0                                   
 sp.online.content sp.offline.contet      year     
 Min.   :  0.00    Min.   : 0.00     Min.   :2017  
 1st Qu.: 22.25    1st Qu.: 0.00     1st Qu.:2017  
 Median : 45.50    Median :16.00     Median :2017  
 Mean   : 47.00    Mean   :18.30     Mean   :2017  
 3rd Qu.: 64.50    3rd Qu.:33.25     3rd Qu.:2017  
 Max.   :114.00    Max.   :50.00     Max.   :2017  
                                                   


  • Among 10 people, How many people did use Samsung phone at 2017?

  • How much did people spend for the mobile communication on average?


Do the same thing to 2019 data set and answer the questions below.

  1. In 2019, how many people did use Samsung phone?

  2. Draw boxplot of people’s spending on the mobile communication
    *Hint: use boxplot()


Filter

# 02 Conditional statements and loop statements for data purification #

test = c(15, 20, 30, NA, 45) # If it is a vector
test[test<40] # extract elements with value less than 40
[1] 15 20 30 NA
test[test%%3!= 0] # Extract elements whose value is not divisible by 3
[1] 20 NA
test[is.na(test)] # Extract elements that are NA
[1] NA
test[!is.na(test)] # Extract non-NA elements
[1] 15 20 30 45
test[test%%2==0 & !is.na(test)] # Extract elements that are multiples of 2 but are not NA
[1] 20 30
characters = data.frame(name = c("Gil-dong", "Chunhyang", "Cheolsu"),
                         age = c(30, 16, 21),
                         gender = factor(c("M", "F","M")))
# In case of data frame

characters
       name age gender
1  Gil-dong  30      M
2 Chunhyang  16      F
3   Cheolsu  21      M
characters[characters$gender =="F",1] # Extract rows where gender is female
[1] "Chunhyang"
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
characters %>% filter(gender=="F") %>% select(name)
       name
1 Chunhyang
characters[characters$age<30 & characters$gender =="M",]
     name age gender
3 Cheolsu  21      M
# Extract rows of men under 30 years old
characters %>% filter(age<30 & gender=="M")
     name age gender
1 Cheolsu  21      M


Let’s use the List.KMP again

load("data/List_KMP.RData")
names(List.KMP) <- c("p17", "p18", "p19", "d19")

See the structure

str(List.KMP)
List of 4
 $ p17:'data.frame':    10 obs. of  11 variables:
  ..$ pid              : num [1:10] 9920004 12500003 17350004 17670001 23860001 ...
  ..$ age              : num [1:10] 9 26 12 55 70 58 50 68 39 37
  ..$ gender           : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 1 1 2
  ..$ Mobile.lv        : Factor w/ 5 levels "3G","LTE","LTE-A",..: 5 2 2 2 5 2 2 2 2 3
  ..$ Telecom          : Factor w/ 5 levels "SKT","KT","LG U+",..: 5 2 3 1 5 1 1 1 2 2
  ..$ Smartphone.brand : Factor w/ 8 levels "Samsung","Apple",..: 8 1 1 1 8 1 1 1 1 3
  ..$ sp.mobile        : num [1:10] 0 42 19 38 18 65 32 58 63 54
  ..$ sp.device        : num [1:10] 0 0 0 0 0 0 0 10 0 10
  ..$ sp.online.content: num [1:10] 0 53 19 38 18 65 32 68 63 114
  ..$ sp.offline.contet: num [1:10] 10 22 0 0 0 50 36 0 25 40
  ..$ year             : num [1:10] 2017 2017 2017 2017 2017 ...
 $ p18:'data.frame':    10 obs. of  11 variables:
  ..$ pid              : num [1:10] 9920004 12500003 17350004 17670001 23860001 ...
  ..$ age              : num [1:10] 10 27 13 56 71 59 51 69 40 38
  ..$ gender           : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 1 1 2
  ..$ Mobile.lv        : Factor w/ 5 levels "3G","LTE","LTE-A",..: 5 3 3 2 5 2 2 2 3 3
  ..$ Telecom          : Factor w/ 5 levels "SKT","KT","LG U+",..: 5 3 3 2 5 3 1 2 1 2
  ..$ Smartphone.brand : Factor w/ 8 levels "Samsung","Apple",..: 8 1 3 1 8 3 1 3 1 3
  ..$ sp.mobile        : num [1:10] 0 90 20 39 30 80 33 36 40 59
  ..$ sp.device        : num [1:10] 0 60 35 0 0 15 0 10 10 12
  ..$ sp.online.content: num [1:10] 0 359 55 39 30 95 33 46 50 71
  ..$ sp.offline.contet: num [1:10] 8 120 0 0 0 12 20 0 0 100
  ..$ year             : num [1:10] 2018 2018 2018 2018 2018 ...
 $ p19:'data.frame':    10 obs. of  11 variables:
  ..$ pid              : num [1:10] 9920004 12500003 17350004 17670001 23860001 ...
  ..$ age              : num [1:10] 11 28 14 57 72 60 52 70 41 39
  ..$ gender           : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 1 1 1 2
  ..$ Mobile.lv        : Factor w/ 5 levels "3G","LTE","LTE-A",..: 5 2 3 2 2 2 2 2 3 3
  ..$ Telecom          : Factor w/ 5 levels "SKT","KT","LG U+",..: 5 3 3 1 2 1 1 2 3 3
  ..$ Smartphone.brand : Factor w/ 8 levels "Samsung","Apple",..: 8 2 1 1 1 1 1 3 3 3
  ..$ sp.mobile        : num [1:10] 0 60 21 30 32 78 35 54 73 45
  ..$ sp.device        : num [1:10] 0 30 0 0 0 0 0 0 20 15
  ..$ sp.online.content: num [1:10] 0 90 21 30 32 78 35 54 93 60
  ..$ sp.offline.contet: num [1:10] 21 0 0 20 0 24 90 0 20 60
  ..$ year             : num [1:10] 2019 2019 2019 2019 2019 ...
 $ d19:'data.frame':    10 obs. of  4 variables:
  ..$ pid           : num [1:10] 9920004 12500003 17350004 17670001 23860001 ...
  ..$ Smarphone.time: num [1:10] 0 60 345 90 40 65 170 95 85 75
  ..$ SNS.time      : num [1:10] 0 0 10 0 0 0 0 0 0 0
  ..$ year          : num [1:10] 2019 2019 2019 2019 2019 ...





Did ‘KT’ users in 2017 change their telecom companies in 2018 and 2019?

  1. Extract data.frames from the List
  2. Find the pid(personal ID) whose telecom company was “KT”
  3. Try the code below and explain how it works
  4. Merge the two data.frames above by the key variable “pid”, and change the second and the third column names to “y2018” and “y2019”

  1. Extract data.frames from the List

    p17_df <- List.KMP[["p17"]]
    p18_df <- List.KMP[["p18"]]
    p19_df <- List.KMP[["p19"]]
  2. Find the pid(personal ID) whose telecom company was “KT”

    pid_kt_user <- p17_df[p17_df$Telecom=="KT", "pid"]
    pid_kt_user
    [1] 12500003 59570001 65840001
  3. Try the code below and explain how it works

    kt_user_18 <- p18_df[p18_df$pid %in% pid_kt_user, c("pid", "Telecom")]
    kt_user_19 <- p19_df[p19_df$pid %in% pid_kt_user, c("pid", "Telecom")]
    kt_user_18
            pid Telecom
    2  12500003   LG U+
    9  59570001     SKT
    10 65840001      KT
    kt_user_19
            pid Telecom
    2  12500003   LG U+
    9  59570001   LG U+
    10 65840001   LG U+
  4. Merge the two data.frames above by the key variable “pid”, and change the second and the third column names to “y2018” and “y2019”

    DF3 <- merge(kt_user_18, kt_user_19, by="pid")
    names(DF3)[c(2,3)]<-c("y2018", "y2019")
    DF3
           pid y2018 y2019
    1 12500003 LG U+ LG U+
    2 59570001   SKT LG U+
    3 65840001    KT LG U+


What kinds of studies we can do with the analysis above? Let’s think about it.

Download and use this data for you study

You can also download the original questionnaires, codebooks, and user guides documents


Did ‘Samsung’ users in 2017 change their phone brand in 2018 and 2019?

Let’s analyse this using the code we learnt


Practice & QZ (more)

Let’s use ‘airquality’ data stored in base R

airquality
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
5      NA      NA 14.3   56     5   5
6      28      NA 14.9   66     5   6
7      23     299  8.6   65     5   7
8      19      99 13.8   59     5   8
9       8      19 20.1   61     5   9
10     NA     194  8.6   69     5  10
11      7      NA  6.9   74     5  11
12     16     256  9.7   69     5  12
13     11     290  9.2   66     5  13
14     14     274 10.9   68     5  14
15     18      65 13.2   58     5  15
16     14     334 11.5   64     5  16
17     34     307 12.0   66     5  17
18      6      78 18.4   57     5  18
19     30     322 11.5   68     5  19
20     11      44  9.7   62     5  20
21      1       8  9.7   59     5  21
22     11     320 16.6   73     5  22
23      4      25  9.7   61     5  23
24     32      92 12.0   61     5  24
25     NA      66 16.6   57     5  25
26     NA     266 14.9   58     5  26
27     NA      NA  8.0   57     5  27
28     23      13 12.0   67     5  28
29     45     252 14.9   81     5  29
30    115     223  5.7   79     5  30
31     37     279  7.4   76     5  31
32     NA     286  8.6   78     6   1
33     NA     287  9.7   74     6   2
34     NA     242 16.1   67     6   3
35     NA     186  9.2   84     6   4
36     NA     220  8.6   85     6   5
37     NA     264 14.3   79     6   6
38     29     127  9.7   82     6   7
39     NA     273  6.9   87     6   8
40     71     291 13.8   90     6   9
41     39     323 11.5   87     6  10
42     NA     259 10.9   93     6  11
43     NA     250  9.2   92     6  12
44     23     148  8.0   82     6  13
45     NA     332 13.8   80     6  14
46     NA     322 11.5   79     6  15
47     21     191 14.9   77     6  16
48     37     284 20.7   72     6  17
49     20      37  9.2   65     6  18
50     12     120 11.5   73     6  19
51     13     137 10.3   76     6  20
52     NA     150  6.3   77     6  21
53     NA      59  1.7   76     6  22
54     NA      91  4.6   76     6  23
55     NA     250  6.3   76     6  24
56     NA     135  8.0   75     6  25
57     NA     127  8.0   78     6  26
58     NA      47 10.3   73     6  27
59     NA      98 11.5   80     6  28
60     NA      31 14.9   77     6  29
61     NA     138  8.0   83     6  30
62    135     269  4.1   84     7   1
63     49     248  9.2   85     7   2
64     32     236  9.2   81     7   3
65     NA     101 10.9   84     7   4
66     64     175  4.6   83     7   5
67     40     314 10.9   83     7   6
68     77     276  5.1   88     7   7
69     97     267  6.3   92     7   8
70     97     272  5.7   92     7   9
71     85     175  7.4   89     7  10
72     NA     139  8.6   82     7  11
73     10     264 14.3   73     7  12
74     27     175 14.9   81     7  13
75     NA     291 14.9   91     7  14
76      7      48 14.3   80     7  15
77     48     260  6.9   81     7  16
78     35     274 10.3   82     7  17
79     61     285  6.3   84     7  18
80     79     187  5.1   87     7  19
81     63     220 11.5   85     7  20
82     16       7  6.9   74     7  21
83     NA     258  9.7   81     7  22
84     NA     295 11.5   82     7  23
85     80     294  8.6   86     7  24
86    108     223  8.0   85     7  25
87     20      81  8.6   82     7  26
88     52      82 12.0   86     7  27
89     82     213  7.4   88     7  28
90     50     275  7.4   86     7  29
91     64     253  7.4   83     7  30
92     59     254  9.2   81     7  31
93     39      83  6.9   81     8   1
94      9      24 13.8   81     8   2
95     16      77  7.4   82     8   3
96     78      NA  6.9   86     8   4
97     35      NA  7.4   85     8   5
98     66      NA  4.6   87     8   6
99    122     255  4.0   89     8   7
100    89     229 10.3   90     8   8
101   110     207  8.0   90     8   9
102    NA     222  8.6   92     8  10
103    NA     137 11.5   86     8  11
104    44     192 11.5   86     8  12
105    28     273 11.5   82     8  13
106    65     157  9.7   80     8  14
107    NA      64 11.5   79     8  15
108    22      71 10.3   77     8  16
109    59      51  6.3   79     8  17
110    23     115  7.4   76     8  18
111    31     244 10.9   78     8  19
112    44     190 10.3   78     8  20
113    21     259 15.5   77     8  21
114     9      36 14.3   72     8  22
115    NA     255 12.6   75     8  23
116    45     212  9.7   79     8  24
117   168     238  3.4   81     8  25
118    73     215  8.0   86     8  26
119    NA     153  5.7   88     8  27
120    76     203  9.7   97     8  28
121   118     225  2.3   94     8  29
122    84     237  6.3   96     8  30
123    85     188  6.3   94     8  31
124    96     167  6.9   91     9   1
125    78     197  5.1   92     9   2
126    73     183  2.8   93     9   3
127    91     189  4.6   93     9   4
128    47      95  7.4   87     9   5
129    32      92 15.5   84     9   6
130    20     252 10.9   80     9   7
131    23     220 10.3   78     9   8
132    21     230 10.9   75     9   9
133    24     259  9.7   73     9  10
134    44     236 14.9   81     9  11
135    21     259 15.5   76     9  12
136    28     238  6.3   77     9  13
137     9      24 10.9   71     9  14
138    13     112 11.5   71     9  15
139    46     237  6.9   78     9  16
140    18     224 13.8   67     9  17
141    13      27 10.3   76     9  18
142    24     238 10.3   68     9  19
143    16     201  8.0   82     9  20
144    13     238 12.6   64     9  21
145    23      14  9.2   71     9  22
146    36     139 10.3   81     9  23
147     7      49 10.3   69     9  24
148    14      20 16.6   63     9  25
149    30     193  6.9   70     9  26
150    NA     145 13.2   77     9  27
151    14     191 14.3   75     9  28
152    18     131  8.0   76     9  29
153    20     223 11.5   68     9  30

Format

A data frame with 153 observations on 6 variables.

[,1] Ozone numeric Ozone (ppb)
[,2] Solar.R numeric Solar R (lang)
[,3] Wind numeric Wind (mph)
[,4] Temp numeric Temperature (degrees F)
[,5] Month numeric Month (1–12)

Details

Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.

  • Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island

  • Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park

  • Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport

  • Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.

pairs(airquality, panel = panel.smooth, main = "airquality data")

  1. What are the column names of the data frame?

    *Hint: names()


  2. What are the row names of the data frame?

    *Hint: rownames()


  3. Extract the first 10 rows in ‘airquality’

    *Hint: One way: by using [] / Alternative way: by using head()


  4. Check the number of rows (observations) in the dataset

    *Hint: length() or nrow()


  5. How many missing values are in the ‘Ozone’ column ?

    *Hint: is.na() and sum()


  6. What is the mean value of the ‘Ozone’ column? (Exclude missing values (coded as NA) from this calculation)

    *Hint: Use mean() with an option na.rm = T


  7. Extract the subset where Ozone values are above 31 and Temp values are above 90.

    *Hint: Use subset()


  8. Use the apply function to calculate the standard deviation of each column in the data frame

    *Hint: Use apply() with an option na.rm = T / Use sd() function


  9. Calculate the mean value of ‘Ozone’ for each Month in the data frame and create a vector containing the monthly means (exclude all missing values)

    *Hint: Use tapply() with an option na.rm = T


  10. Draw a random sample of 5 rows from the data frame

    *Hint: Use sample(), nrow(), airquality[]



Introduction to Tidyverse

The tidyverse is a powerful collection  of  R  packages  that are actually  data tools for transforming and visualizing data. All packages of the tidyverse share an underlying philosophy and common APls.

The core packages are: 

  • ggplot2, which implements the grammar of graphics. You can use it to visualize your data.

  • dplyr is a grammar of data You can use it to solve the most common data manipulation challenges.

  • tidyr helps you to create tidy data or data where each variable is in a column, each observation is a row end each value is a column, each observation is a row end each value is a cell. 

  • readr is a fast and friendly way to read rectangular

  • purrr enhances R’s functional programming (FP)toolkit by providing a complete and consistent set of tools for working with functions and vectors. 

  • tibble is a modern re-imaginging of the data

  • stringr provides a cohesive set of functions designed to make working with strings as easy as possible

  • forcats provide a suite of useful  tools that solve common problems with factors.  



The introduction of the tidyverse package in R has significantly influenced the way data science is performed using R, impacting coding practices, data analysis methodologies, and the overall approach to data manipulation and visualization.

Before Tidyverse

Before the tidyverse, R programming was largely centered around base R functions and packages. This included using base R functions for data manipulation (like subset, merge, and apply functions) and visualization (such as plotting with plot and hist). The syntax and methods varied widely across different packages, which often led to inconsistent coding practices and a steeper learning curve for beginners. Each task could be approached in multiple ways without a clear ‘best’ method, leading to fragmented and less readable code.

After Tidyverse

The tidyverse, developed by Hadley Wickham and others, brought a suite of packages designed to work harmoniously together using a consistent syntax and underlying philosophy. Key features and impacts include:

  1. Consistent Syntax: The tidyverse introduced a consistent and readable syntax that leverages chaining operations using the %>% operator from the magrittr package. This has made code more readable and easier to write and understand, especially for newcomers.

  2. Data Manipulation: With dplyr, data manipulation became more intuitive and less verbose. Functions like filter(), arrange(), select(), mutate(), and summarise() allow for straightforward data operations that are both faster and easier to code compared to base R functions.

  3. Data Importing and Tidying: readr for reading data and tidyr for tidying data introduced more efficient data reading and transforming capabilities, making it simpler to convert data into a tidy format. Tidy data, where each column is a variable and each row is an observation, has become a standard for data analysis, facilitating easier manipulation and analysis.

  4. Visualization: ggplot2 transformed data visualization in R by allowing for the layering of plots intuitively and flexibly, using a system based on the grammar of graphics. This has enabled users to create complex, publication-quality graphs more easily than was possible with base R plotting functions.

  5. Community and Accessibility: The tidyverse has fostered a strong community and has contributed significantly to teaching materials that are user-friendly and accessible to beginners. This has democratized data analysis in R, making it more accessible to non-programmers.

  6. Impact on Package Development: The tidyverse’s philosophy and popularity have influenced the development of other packages, even those not part of the tidyverse, to adopt tidy principles and interoperate smoothly with tidyverse packages.

The tidyverse has not only changed the syntax and functionality of R coding but also its philosophy towards data analysis. It promotes a workflow that is coherent, transparent, and efficient, which has been widely adopted in academia, industry, and teaching. While some veteran R users prefer the flexibility and control of base R, the tidyverse’s approachable syntax and powerful capabilities have made it a pivotal tool in modern R programming, particularly for data science.


Please check out the homepage of tidyverse: https://www.tidyverse.org/


You can install the complete tidyverse with:

> install.packages(“tidyverse”)

Then, load the core tidyverse and make it available in your current R session by running:

> library(tidyverse)

Please see this cheat sheet:

https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf


Pipe

Subset Observations

# filter
iris %>% filter(Sepal.Length > 7)
   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1           7.1         3.0          5.9         2.1 virginica
2           7.6         3.0          6.6         2.1 virginica
3           7.3         2.9          6.3         1.8 virginica
4           7.2         3.6          6.1         2.5 virginica
5           7.7         3.8          6.7         2.2 virginica
6           7.7         2.6          6.9         2.3 virginica
7           7.7         2.8          6.7         2.0 virginica
8           7.2         3.2          6.0         1.8 virginica
9           7.2         3.0          5.8         1.6 virginica
10          7.4         2.8          6.1         1.9 virginica
11          7.9         3.8          6.4         2.0 virginica
12          7.7         3.0          6.1         2.3 virginica
levels(iris$Species)
[1] "setosa"     "versicolor" "virginica" 
iris %>% filter(Species == "setosa")
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa
16          5.7         4.4          1.5         0.4  setosa
17          5.4         3.9          1.3         0.4  setosa
18          5.1         3.5          1.4         0.3  setosa
19          5.7         3.8          1.7         0.3  setosa
20          5.1         3.8          1.5         0.3  setosa
21          5.4         3.4          1.7         0.2  setosa
22          5.1         3.7          1.5         0.4  setosa
23          4.6         3.6          1.0         0.2  setosa
24          5.1         3.3          1.7         0.5  setosa
25          4.8         3.4          1.9         0.2  setosa
26          5.0         3.0          1.6         0.2  setosa
27          5.0         3.4          1.6         0.4  setosa
28          5.2         3.5          1.5         0.2  setosa
29          5.2         3.4          1.4         0.2  setosa
30          4.7         3.2          1.6         0.2  setosa
31          4.8         3.1          1.6         0.2  setosa
32          5.4         3.4          1.5         0.4  setosa
33          5.2         4.1          1.5         0.1  setosa
34          5.5         4.2          1.4         0.2  setosa
35          4.9         3.1          1.5         0.2  setosa
36          5.0         3.2          1.2         0.2  setosa
37          5.5         3.5          1.3         0.2  setosa
38          4.9         3.6          1.4         0.1  setosa
39          4.4         3.0          1.3         0.2  setosa
40          5.1         3.4          1.5         0.2  setosa
41          5.0         3.5          1.3         0.3  setosa
42          4.5         2.3          1.3         0.3  setosa
43          4.4         3.2          1.3         0.2  setosa
44          5.0         3.5          1.6         0.6  setosa
45          5.1         3.8          1.9         0.4  setosa
46          4.8         3.0          1.4         0.3  setosa
47          5.1         3.8          1.6         0.2  setosa
48          4.6         3.2          1.4         0.2  setosa
49          5.3         3.7          1.5         0.2  setosa
50          5.0         3.3          1.4         0.2  setosa
iris %>% filter(Species == "setosa") %>% plot

iris %>% filter(Species == "setosa") %>% summary
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
 Median :5.000   Median :3.400   Median :1.500   Median :0.200  
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
       Species  
 setosa    :50  
 versicolor: 0  
 virginica : 0  
                
                
                
# distinct: remove duplication (take only unique values)
iris %>% distinct(Species)
     Species
1     setosa
2 versicolor
3  virginica
# random sampling
iris %>% nrow
[1] 150
iris %>% sample_frac(0.5, replace=T)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           6.7         3.1          4.7         1.5 versicolor
2           6.0         3.4          4.5         1.6 versicolor
3           5.8         2.7          5.1         1.9  virginica
4           5.1         2.5          3.0         1.1 versicolor
5           6.0         2.9          4.5         1.5 versicolor
6           4.3         3.0          1.1         0.1     setosa
7           5.0         2.3          3.3         1.0 versicolor
8           5.0         2.0          3.5         1.0 versicolor
9           5.0         3.6          1.4         0.2     setosa
10          5.5         2.4          3.8         1.1 versicolor
11          4.3         3.0          1.1         0.1     setosa
12          6.8         2.8          4.8         1.4 versicolor
13          5.4         3.9          1.3         0.4     setosa
14          7.7         2.6          6.9         2.3  virginica
15          5.1         3.8          1.9         0.4     setosa
16          4.9         2.5          4.5         1.7  virginica
17          5.5         2.3          4.0         1.3 versicolor
18          6.4         2.9          4.3         1.3 versicolor
19          6.5         3.0          5.8         2.2  virginica
20          6.7         3.1          4.7         1.5 versicolor
21          4.7         3.2          1.6         0.2     setosa
22          6.4         2.9          4.3         1.3 versicolor
23          5.0         3.6          1.4         0.2     setosa
24          7.0         3.2          4.7         1.4 versicolor
25          6.4         3.2          5.3         2.3  virginica
26          6.6         3.0          4.4         1.4 versicolor
27          4.9         3.0          1.4         0.2     setosa
28          5.1         3.5          1.4         0.2     setosa
29          6.2         3.4          5.4         2.3  virginica
30          4.6         3.6          1.0         0.2     setosa
31          7.7         2.6          6.9         2.3  virginica
32          6.1         3.0          4.9         1.8  virginica
33          6.3         2.5          5.0         1.9  virginica
34          6.6         3.0          4.4         1.4 versicolor
35          6.2         2.8          4.8         1.8  virginica
36          7.7         3.0          6.1         2.3  virginica
37          6.1         2.8          4.0         1.3 versicolor
38          5.4         3.4          1.5         0.4     setosa
39          5.7         4.4          1.5         0.4     setosa
40          6.5         3.0          5.5         1.8  virginica
41          6.3         2.3          4.4         1.3 versicolor
42          4.8         3.1          1.6         0.2     setosa
43          6.2         2.9          4.3         1.3 versicolor
44          4.6         3.6          1.0         0.2     setosa
45          4.9         2.5          4.5         1.7  virginica
46          6.4         2.9          4.3         1.3 versicolor
47          4.4         3.2          1.3         0.2     setosa
48          5.1         3.5          1.4         0.2     setosa
49          5.0         2.0          3.5         1.0 versicolor
50          6.4         3.1          5.5         1.8  virginica
51          6.0         2.7          5.1         1.6 versicolor
52          4.8         3.0          1.4         0.3     setosa
53          6.3         3.4          5.6         2.4  virginica
54          5.0         3.5          1.3         0.3     setosa
55          6.7         3.0          5.2         2.3  virginica
56          6.2         3.4          5.4         2.3  virginica
57          5.0         2.0          3.5         1.0 versicolor
58          5.5         2.6          4.4         1.2 versicolor
59          6.3         2.5          4.9         1.5 versicolor
60          6.1         2.9          4.7         1.4 versicolor
61          5.7         2.9          4.2         1.3 versicolor
62          4.8         3.1          1.6         0.2     setosa
63          4.9         2.4          3.3         1.0 versicolor
64          6.7         3.0          5.2         2.3  virginica
65          6.3         2.5          5.0         1.9  virginica
66          6.2         2.2          4.5         1.5 versicolor
67          5.0         2.0          3.5         1.0 versicolor
68          5.8         2.6          4.0         1.2 versicolor
69          5.1         3.8          1.6         0.2     setosa
70          6.6         3.0          4.4         1.4 versicolor
71          6.5         3.0          5.2         2.0  virginica
72          5.8         2.7          5.1         1.9  virginica
73          5.4         3.7          1.5         0.2     setosa
74          5.7         2.8          4.1         1.3 versicolor
75          5.4         3.7          1.5         0.2     setosa
iris %>% sample_frac(0.5, replace=T) %>% nrow
[1] 75
iris %>% sample_n(10, replace=T) 
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           6.8         3.2          5.9         2.3  virginica
2           5.9         3.2          4.8         1.8 versicolor
3           4.6         3.4          1.4         0.3     setosa
4           5.6         2.7          4.2         1.3 versicolor
5           7.7         3.8          6.7         2.2  virginica
6           6.7         2.5          5.8         1.8  virginica
7           4.6         3.4          1.4         0.3     setosa
8           6.1         2.6          5.6         1.4  virginica
9           6.4         3.2          5.3         2.3  virginica
10          5.5         2.5          4.0         1.3 versicolor
iris %>% sample_n(10, replace=T) %>% nrow 
[1] 10
# slice 
iris %>% slice(10:15)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.9         3.1          1.5         0.1  setosa
2          5.4         3.7          1.5         0.2  setosa
3          4.8         3.4          1.6         0.2  setosa
4          4.8         3.0          1.4         0.1  setosa
5          4.3         3.0          1.1         0.1  setosa
6          5.8         4.0          1.2         0.2  setosa
# Top n in X
iris %>% top_n(5, Sepal.Length)
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          7.7         3.8          6.7         2.2 virginica
2          7.7         2.6          6.9         2.3 virginica
3          7.7         2.8          6.7         2.0 virginica
4          7.9         3.8          6.4         2.0 virginica
5          7.7         3.0          6.1         2.3 virginica
iris %>% top_n(5, Sepal.Width)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.4         3.9          1.7         0.4  setosa
2          5.8         4.0          1.2         0.2  setosa
3          5.7         4.4          1.5         0.4  setosa
4          5.4         3.9          1.3         0.4  setosa
5          5.2         4.1          1.5         0.1  setosa
6          5.5         4.2          1.4         0.2  setosa

Subset Variables

# pull & select
iris %>% pull(Petal.Width)
  [1] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1 0.1 0.2 0.4 0.4 0.3
 [19] 0.3 0.3 0.2 0.4 0.2 0.5 0.2 0.2 0.4 0.2 0.2 0.2 0.2 0.4 0.1 0.2 0.2 0.2
 [37] 0.2 0.1 0.2 0.2 0.3 0.3 0.2 0.6 0.4 0.3 0.2 0.2 0.2 0.2 1.4 1.5 1.5 1.3
 [55] 1.5 1.3 1.6 1.0 1.3 1.4 1.0 1.5 1.0 1.4 1.3 1.4 1.5 1.0 1.5 1.1 1.8 1.3
 [73] 1.5 1.2 1.3 1.4 1.4 1.7 1.5 1.0 1.1 1.0 1.2 1.6 1.5 1.6 1.5 1.3 1.3 1.3
 [91] 1.2 1.4 1.2 1.0 1.3 1.2 1.3 1.3 1.1 1.3 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8
[109] 1.8 2.5 2.0 1.9 2.1 2.0 2.4 2.3 1.8 2.2 2.3 1.5 2.3 2.0 2.0 1.8 2.1 1.8
[127] 1.8 1.8 2.1 1.6 1.9 2.0 2.2 1.5 1.4 2.3 2.4 1.8 1.8 2.1 2.4 2.3 1.9 2.3
[145] 2.5 2.3 1.9 2.0 2.3 1.8
iris %>% select(Petal.Width)
    Petal.Width
1           0.2
2           0.2
3           0.2
4           0.2
5           0.2
6           0.4
7           0.3
8           0.2
9           0.2
10          0.1
11          0.2
12          0.2
13          0.1
14          0.1
15          0.2
16          0.4
17          0.4
18          0.3
19          0.3
20          0.3
21          0.2
22          0.4
23          0.2
24          0.5
25          0.2
26          0.2
27          0.4
28          0.2
29          0.2
30          0.2
31          0.2
32          0.4
33          0.1
34          0.2
35          0.2
36          0.2
37          0.2
38          0.1
39          0.2
40          0.2
41          0.3
42          0.3
43          0.2
44          0.6
45          0.4
46          0.3
47          0.2
48          0.2
49          0.2
50          0.2
51          1.4
52          1.5
53          1.5
54          1.3
55          1.5
56          1.3
57          1.6
58          1.0
59          1.3
60          1.4
61          1.0
62          1.5
63          1.0
64          1.4
65          1.3
66          1.4
67          1.5
68          1.0
69          1.5
70          1.1
71          1.8
72          1.3
73          1.5
74          1.2
75          1.3
76          1.4
77          1.4
78          1.7
79          1.5
80          1.0
81          1.1
82          1.0
83          1.2
84          1.6
85          1.5
86          1.6
87          1.5
88          1.3
89          1.3
90          1.3
91          1.2
92          1.4
93          1.2
94          1.0
95          1.3
96          1.2
97          1.3
98          1.3
99          1.1
100         1.3
101         2.5
102         1.9
103         2.1
104         1.8
105         2.2
106         2.1
107         1.7
108         1.8
109         1.8
110         2.5
111         2.0
112         1.9
113         2.1
114         2.0
115         2.4
116         2.3
117         1.8
118         2.2
119         2.3
120         1.5
121         2.3
122         2.0
123         2.0
124         1.8
125         2.1
126         1.8
127         1.8
128         1.8
129         2.1
130         1.6
131         1.9
132         2.0
133         2.2
134         1.5
135         1.4
136         2.3
137         2.4
138         1.8
139         1.8
140         2.1
141         2.4
142         2.3
143         1.9
144         2.3
145         2.5
146         2.3
147         1.9
148         2.0
149         2.3
150         1.8
iris %>% pull(Petal.Width) %>% str
 num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
iris %>% select(Petal.Width) %>% str
'data.frame':   150 obs. of  1 variable:
 $ Petal.Width: num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
iris %>% select(Petal.Length, Petal.Width) %>% head
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
4          1.5         0.2
5          1.4         0.2
6          1.7         0.4
# useful helpers: starts_with(), contains()
iris %>% select(starts_with("Peta")) %>% head
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
4          1.5         0.2
5          1.4         0.2
6          1.7         0.4
iris %>% select(contains("tal")) %>% head
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
4          1.5         0.2
5          1.4         0.2
6          1.7         0.4

Reshaping & Arrange data

# let's use mtcars dataset
mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
# About mtcars dataset
# help(mtcars)

# arrange
mtcars %>% arrange(mpg) %>% head
                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
mtcars %>% add_rownames %>% head
Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
ℹ Please use `tibble::rownames_to_column()` instead.
# A tibble: 6 × 12
  rowname        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
mtcars %>% add_rownames %>% arrange(mpg) %>% 
head
Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
ℹ Please use `tibble::rownames_to_column()` instead.
# A tibble: 6 × 12
  rowname        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Cadillac Fl…  10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
2 Lincoln Con…  10.4     8   460   215  3     5.42  17.8     0     0     3     4
3 Camaro Z28    13.3     8   350   245  3.73  3.84  15.4     0     0     3     4
4 Duster 360    14.3     8   360   245  3.21  3.57  15.8     0     0     3     4
5 Chrysler Im…  14.7     8   440   230  3.23  5.34  17.4     0     0     3     4
6 Maserati Bo…  15       8   301   335  3.54  3.57  14.6     0     1     5     8
mtcars %>% add_rownames %>% arrange(desc(mpg)) %>% select(rowname)
Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
ℹ Please use `tibble::rownames_to_column()` instead.
# A tibble: 32 × 1
   rowname       
   <chr>         
 1 Toyota Corolla
 2 Fiat 128      
 3 Honda Civic   
 4 Lotus Europa  
 5 Fiat X1-9     
 6 Porsche 914-2 
 7 Merc 240D     
 8 Datsun 710    
 9 Merc 230      
10 Toyota Corona 
# ℹ 22 more rows

Summarise data

# Summarise

iris %>% summarise(avg.PL=mean(Petal.Length))
  avg.PL
1  3.758
iris %>% summarise(sd.PL=sd(Petal.Length))
     sd.PL
1 1.765298
iris %>% summarise(avg.PL=mean(Petal.Length),
                   sd.PL=sd(Petal.Length),
                   min.PL=min(Petal.Length))
  avg.PL    sd.PL min.PL
1  3.758 1.765298      1
iris %>% count(Species)
     Species  n
1     setosa 50
2 versicolor 50
3  virginica 50
iris %>% sample_frac(0.3) %>% count(Species)
     Species  n
1     setosa 16
2 versicolor 14
3  virginica 15
iris %>% summarise_all(mean)
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `Species = (function (x, ...) ...`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1     5.843333    3.057333        3.758    1.199333      NA
iris %>% summarise_at("Petal.Length", sum)
  Petal.Length
1        563.7
iris %>% summarise_at(c("Petal.Length", "Petal.Width"), mean)
  Petal.Length Petal.Width
1        3.758    1.199333

Group and Summarise data

# Group
iris %>% group_by(Species)
# A tibble: 150 × 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows
iris %>% group_by(Species) %>% 
  summarise_all(mean)
# A tibble: 3 × 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33 
3 virginica          6.59        2.97         5.55       2.03 
iris %>% group_by(Species) %>% 
  select(starts_with("Sep")) %>% 
  summarise_all(mean)
Adding missing grouping variables: `Species`
# A tibble: 3 × 3
  Species    Sepal.Length Sepal.Width
  <fct>             <dbl>       <dbl>
1 setosa             5.01        3.43
2 versicolor         5.94        2.77
3 virginica          6.59        2.97
mtcars %>% group_by(am) %>% 
  summarise(hp.avg=mean(hp),
            hp.sd=sd(hp))
# A tibble: 2 × 3
     am hp.avg hp.sd
  <dbl>  <dbl> <dbl>
1     0   160.  53.9
2     1   127.  84.1

Make New Variables

# mutate
# mpg to kml
mtcars %>% mutate(kml=0.425144*mpg)
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
                          kml
Mazda RX4            8.928024
Mazda RX4 Wag        8.928024
Datsun 710           9.693283
Hornet 4 Drive       9.098082
Hornet Sportabout    7.950193
Valiant              7.695106
Duster 360           6.079559
Merc 240D           10.373514
Merc 230             9.693283
Merc 280             8.162765
Merc 280C            7.567563
Merc 450SE           6.972362
Merc 450SL           7.354991
Merc 450SLC          6.462189
Cadillac Fleetwood   4.421498
Lincoln Continental  4.421498
Chrysler Imperial    6.249617
Fiat 128            13.774666
Honda Civic         12.924378
Toyota Corolla      14.412382
Toyota Corona        9.140596
Dodge Challenger     6.589732
AMC Javelin          6.462189
Camaro Z28           5.654415
Pontiac Firebird     8.162765
Fiat X1-9           11.606431
Porsche 914-2       11.053744
Lotus Europa        12.924378
Ford Pantera L       6.717275
Ferrari Dino         8.375337
Maserati Bora        6.377160
Volvo 142E           9.098082
mtcars %>% mutate(kml=round(0.425144*mpg, 1))
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb  kml
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4  8.9
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  8.9
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1  9.7
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1  9.1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2  8.0
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1  7.7
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4  6.1
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 10.4
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2  9.7
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4  8.2
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4  7.6
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3  7.0
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3  7.4
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3  6.5
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4  4.4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4  4.4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4  6.2
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1 13.8
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2 12.9
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 14.4
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  9.1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2  6.6
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2  6.5
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4  5.7
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2  8.2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1 11.6
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2 11.1
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2 12.9
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4  6.7
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6  8.4
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8  6.4
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2  9.1
mtcars %>% mutate(kml=round(0.425144*mpg, 1)) %>% 
  select(mpg, kml) %>% 
  top_n(5, mpg)
                mpg  kml
Fiat 128       32.4 13.8
Honda Civic    30.4 12.9
Toyota Corolla 33.9 14.4
Fiat X1-9      27.3 11.6
Lotus Europa   30.4 12.9
# transmute
mtcars %>% transmute(hp/wt)
                       hp/wt
Mazda RX4           41.98473
Mazda RX4 Wag       38.26087
Datsun 710          40.08621
Hornet 4 Drive      34.21462
Hornet Sportabout   50.87209
Valiant             30.34682
Duster 360          68.62745
Merc 240D           19.43574
Merc 230            30.15873
Merc 280            35.75581
Merc 280C           35.75581
Merc 450SE          44.22604
Merc 450SL          48.25737
Merc 450SLC         47.61905
Cadillac Fleetwood  39.04762
Lincoln Continental 39.63864
Chrysler Imperial   43.03087
Fiat 128            30.00000
Honda Civic         32.19814
Toyota Corolla      35.42234
Toyota Corona       39.35091
Dodge Challenger    42.61364
AMC Javelin         43.66812
Camaro Z28          63.80208
Pontiac Firebird    45.51365
Fiat X1-9           34.10853
Porsche 914-2       42.52336
Lotus Europa        74.68605
Ford Pantera L      83.28076
Ferrari Dino        63.17690
Maserati Bora       93.83754
Volvo 142E          39.20863
# rename
mtcars %>% names
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
mtcars %>% mutate(new=1) %>% head
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb new
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   1
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   1
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1   1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   1
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1   1
mtcars %>% mutate(new=1) %>% 
rename(change.name=new) %>% head
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
                  change.name
Mazda RX4                   1
Mazda RX4 Wag               1
Datsun 710                  1
Hornet 4 Drive              1
Hornet Sportabout           1
Valiant                     1




Joins! (Combine Datasets)

Thanks to Garrick Aden-Buie (garrickadenbuie.com), we can teach joins very easily with his fantastic diagrams.


Let’s use some example datasets that come pre-loaded in R to demonstrate how joins work using dplyr. We will use the mtcars dataset, splitting it into two separate data frames to illustrate how the different types of joins operate.

First, let’s set up our example data frames from the mtcars dataset:

# Create two data frames from mtcars
df1 <- mtcars[1:10, c("mpg", "cyl", "disp")]  # first 10 rows, select specific columns
df2 <- mtcars[5:15, c("disp", "hp", "drat")]  # rows 5 to 15, select specific columns

# Make sure there's a common key for joining; here, we'll use 'disp'
# Print out the data frames to see what they contain
print(df1)
                   mpg cyl  disp
Mazda RX4         21.0   6 160.0
Mazda RX4 Wag     21.0   6 160.0
Datsun 710        22.8   4 108.0
Hornet 4 Drive    21.4   6 258.0
Hornet Sportabout 18.7   8 360.0
Valiant           18.1   6 225.0
Duster 360        14.3   8 360.0
Merc 240D         24.4   4 146.7
Merc 230          22.8   4 140.8
Merc 280          19.2   6 167.6
print(df2)
                    disp  hp drat
Hornet Sportabout  360.0 175 3.15
Valiant            225.0 105 2.76
Duster 360         360.0 245 3.21
Merc 240D          146.7  62 3.69
Merc 230           140.8  95 3.92
Merc 280           167.6 123 3.92
Merc 280C          167.6 123 3.92
Merc 450SE         275.8 180 3.07
Merc 450SL         275.8 180 3.07
Merc 450SLC        275.8 180 3.07
Cadillac Fleetwood 472.0 205 2.93

Inner Join

All rows from x where there are matching values in y, and all columns from x and y.

inner_join_result <- inner_join(df1, df2, by = "disp")
Warning in inner_join(df1, df2, by = "disp"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 5 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
print(inner_join_result)
   mpg cyl  disp  hp drat
1 18.7   8 360.0 175 3.15
2 18.7   8 360.0 245 3.21
3 18.1   6 225.0 105 2.76
4 14.3   8 360.0 175 3.15
5 14.3   8 360.0 245 3.21
6 24.4   4 146.7  62 3.69
7 22.8   4 140.8  95 3.92
8 19.2   6 167.6 123 3.92
9 19.2   6 167.6 123 3.92

Left Join

All rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns.

left_join_result <- left_join(df1, df2, by = "disp")
Warning in left_join(df1, df2, by = "disp"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 5 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
print(left_join_result)
    mpg cyl  disp  hp drat
1  21.0   6 160.0  NA   NA
2  21.0   6 160.0  NA   NA
3  22.8   4 108.0  NA   NA
4  21.4   6 258.0  NA   NA
5  18.7   8 360.0 175 3.15
6  18.7   8 360.0 245 3.21
7  18.1   6 225.0 105 2.76
8  14.3   8 360.0 175 3.15
9  14.3   8 360.0 245 3.21
10 24.4   4 146.7  62 3.69
11 22.8   4 140.8  95 3.92
12 19.2   6 167.6 123 3.92
13 19.2   6 167.6 123 3.92

Left Join (Extra Rows in y)

… If there are multiple matches between x and y, all combinations of the matches are returned.

Right Join

All rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns.

right_join_result <- right_join(df1, df2, by = "disp")
Warning in right_join(df1, df2, by = "disp"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 5 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
print(right_join_result)
    mpg cyl  disp  hp drat
1  18.7   8 360.0 175 3.15
2  18.7   8 360.0 245 3.21
3  18.1   6 225.0 105 2.76
4  14.3   8 360.0 175 3.15
5  14.3   8 360.0 245 3.21
6  24.4   4 146.7  62 3.69
7  22.8   4 140.8  95 3.92
8  19.2   6 167.6 123 3.92
9  19.2   6 167.6 123 3.92
10   NA  NA 275.8 180 3.07
11   NA  NA 275.8 180 3.07
12   NA  NA 275.8 180 3.07
13   NA  NA 472.0 205 2.93

Full Join

All rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

full_join_result <- full_join(df1, df2, by = "disp")
Warning in full_join(df1, df2, by = "disp"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 5 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
print(full_join_result)
    mpg cyl  disp  hp drat
1  21.0   6 160.0  NA   NA
2  21.0   6 160.0  NA   NA
3  22.8   4 108.0  NA   NA
4  21.4   6 258.0  NA   NA
5  18.7   8 360.0 175 3.15
6  18.7   8 360.0 245 3.21
7  18.1   6 225.0 105 2.76
8  14.3   8 360.0 175 3.15
9  14.3   8 360.0 245 3.21
10 24.4   4 146.7  62 3.69
11 22.8   4 140.8  95 3.92
12 19.2   6 167.6 123 3.92
13 19.2   6 167.6 123 3.92
14   NA  NA 275.8 180 3.07
15   NA  NA 275.8 180 3.07
16   NA  NA 275.8 180 3.07
17   NA  NA 472.0 205 2.93

Filtering Joins

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. … Semi-joins are useful for matching filtered summary tables back to the original rows. … Anti-joins are useful for diagnosing join mismatches.
R for Data Science: Filtering Joins

Semi Join

All rows from x where there are matching values in y, keeping just columns from x.

semi_join_result <- semi_join(df1, df2, by = "disp")
print(semi_join_result)
                   mpg cyl  disp
Hornet Sportabout 18.7   8 360.0
Valiant           18.1   6 225.0
Duster 360        14.3   8 360.0
Merc 240D         24.4   4 146.7
Merc 230          22.8   4 140.8
Merc 280          19.2   6 167.6

Anti Join

All rows from x where there are not matching values in y, keeping just column

# Assuming df1 and df2 from the previous example
anti_join_result <- anti_join(df1, df2, by = "disp")
print(anti_join_result)
                mpg cyl disp
Mazda RX4      21.0   6  160
Mazda RX4 Wag  21.0   6  160
Datsun 710     22.8   4  108
Hornet 4 Drive 21.4   6  258