Data Types and Structures (1)

Data Preprocessing with R (1)

Weekly content



R & R Studio


Base R

  • Datasets included in base R

    • List them with the data() command

    • e.g.) ChickWeight data “Weight versus age of chicks on different diets”, women data “Average heights and weights for American women aged 30-39”

women
   height weight
1      58    115
2      59    117
3      60    120
4      61    123
5      62    126
6      63    129
7      64    132
8      65    135
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164
  • cars dataset
str(cars)
'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
cars
   speed dist
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
8     10   26
9     10   34
10    11   17
11    11   28
12    12   14
13    12   20
14    12   24
15    12   28
16    13   26
17    13   34
18    13   34
19    13   46
20    14   26
21    14   36
22    14   60
23    14   80
24    15   20
25    15   26
26    15   54
27    16   32
28    16   40
29    17   32
30    17   40
31    17   50
32    18   42
33    18   56
34    18   76
35    18   84
36    19   36
37    19   46
38    19   68
39    20   32
40    20   48
41    20   52
42    20   56
43    20   64
44    22   66
45    23   54
46    24   70
47    24   92
48    24   93
49    24  120
50    25   85

str function: summarizes the structure of the data (number of observations, number of variables, and the type of each variable)

  • Various visualization functions

    • Most widely used function in base R: plot

      plot(women)

  • Apply different visualization options

    • Color options (parameters) col,

    • xlab and ylab to name the axis,

    • pch to specify the symbol shape

      plot(cars)

      plot(cars, col = 'blue')

      plot(cars, col = 'blue', xlab = "speed")

      plot(cars, col = 'blue', xlab = "speed", ylab = 'distance')

      plot(cars, col = 'blue', xlab = "speed", ylab = 'distance', pch = 18)


Good habits in learning data science


# ?plot
# help(plot)
  • Think incrementally (Step by Step)

    • Build the most basic version first and check its behavior, then add one new feature at a time and verify each addition.

    • If you write everything first and only check at the end, it is hard to find where a problem comes from.

    • See the plot examples above: check the most basic plot function, add the col option and check, add the xlab and ylab options, then add the pch option and check

  • Specify working directory

    • Data files are read from and saved to a specified directory (folder), called the working directory

    • getwd() function displays the current working directory

      getwd()
      [1] "C:/R/Rproj/[2]web_pages/changjunlee_com_2/teaching/grad_stat/weekly_2/posts"
    • setwd() to set the new working directory

  • Use of library (package)

    • Libraries are software that collects R functions developed for specific fields.

      • E.g.) ggplot2 is a collection of functions that visualize your data neatly and consistently

      • E.g.) gapminder is a library for working with the gapminder data, which records population, GDP per capita, and life expectancy every five years from 1952 to 2007.

    • R is so powerful and popular because of its huge collection of libraries

    • If you visit the CRAN site, you will see that new libraries are still being added.

      • [Packages] menu: see all libraries provided by R

      • [Task Views] menu: introduces libraries field by field
  • To use a library, attach it with the library function

  • Library installation saves the library files to your hard disk

  • Library attachment loads them from the hard disk into main memory
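
The two steps above (install once, attach every session) can be sketched as follows. This is only an illustration: is_attached is a hypothetical helper name, ggplot2 is used simply because it is mentioned above, and the install line is commented out because it needs an internet connection.

```r
# Hypothetical helper (not part of base R): is the package on the search path?
is_attached <- function(pkg) paste0("package:", pkg) %in% search()

pkg <- "ggplot2"

# Step 1 (once per machine): download the library and save it to disk.
# install.packages(pkg)

# Step 2 (once per session): load the library from disk into memory.
# requireNamespace() returns FALSE instead of failing if it is not installed.
installed <- requireNamespace(pkg, quietly = TRUE)
if (installed) {
  library(pkg, character.only = TRUE)
}

# TRUE only when the library was installed and has been attached
is_attached(pkg)
```

If the last line returns FALSE, the library has not been installed yet, so run the install step first.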

Data for example..

  • Lovely iris data

    • In the 1930s, Edgar Anderson collected irises on the Gaspé Peninsula in eastern Canada.

    • He collected 50 flowers from each of three species (setosa, versicolor, virginica) on the same day

    • The same person measured the width and length of the petals and sepals with the same ruler

    • The data has been famous since the statistician Ronald Fisher published a paper using it in 1936, and it is still widely used.

    str(iris)
    'data.frame':   150 obs. of  5 variables:
     $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
     $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
     $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
     $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
     $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
    head(iris, 10)
       Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1           5.1         3.5          1.4         0.2  setosa
    2           4.9         3.0          1.4         0.2  setosa
    3           4.7         3.2          1.3         0.2  setosa
    4           4.6         3.1          1.5         0.2  setosa
    5           5.0         3.6          1.4         0.2  setosa
    6           5.4         3.9          1.7         0.4  setosa
    7           4.6         3.4          1.4         0.3  setosa
    8           5.0         3.4          1.5         0.2  setosa
    9           4.4         2.9          1.4         0.2  setosa
    10          4.9         3.1          1.5         0.1  setosa
    plot(iris)

  • See the correlation of two properties

    • col = iris$Species is an option to draw colors differently by species

      plot(iris$Petal.Width, 
           iris$Petal.Length,
           col = iris$Species)
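
The relationship seen in the scatter plot can also be checked numerically. The sketch below uses the base R cor function; the variable names r_all and r_by_species are chosen here only for illustration.

```r
# Pearson correlation between petal width and petal length, all species pooled
r_all <- cor(iris$Petal.Width, iris$Petal.Length)

# The same correlation computed separately within each species
r_by_species <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Petal.Width, d$Petal.Length)
)

round(r_all, 2)
round(r_by_species, 2)
```

Comparing the pooled value with the per-species values shows how much of the overall relationship comes from differences between species.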

Data Science Process with example data

Collecting Data → EDA → Modeling

  • Tips data

    • Tips earned at tables in a restaurant

    • Can we get more tips using data science?

  • Step 1: Data collecting

    • Collect values in seven variables

      • total_bill

      • tip

      • sex (gender of the person paying)

      • smoker

      • day

      • time

      • size: number of people at a table

    • After weeks of hard work, 244 records were collected and saved to the tips.csv file

tips = read.csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
str(tips)
'data.frame':   244 obs. of  7 variables:
 $ total_bill: num  17 10.3 21 23.7 24.6 ...
 $ tip       : num  1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
 $ sex       : chr  "Female" "Male" "Male" "Male" ...
 $ smoker    : chr  "No" "No" "No" "No" ...
 $ day       : chr  "Sun" "Sun" "Sun" "Sun" ...
 $ time      : chr  "Dinner" "Dinner" "Dinner" "Dinner" ...
 $ size      : int  2 3 3 2 4 4 2 4 2 2 ...
head(tips, 10)
   total_bill  tip    sex smoker day   time size
1       16.99 1.01 Female     No Sun Dinner    2
2       10.34 1.66   Male     No Sun Dinner    3
3       21.01 3.50   Male     No Sun Dinner    3
4       23.68 3.31   Male     No Sun Dinner    2
5       24.59 3.61 Female     No Sun Dinner    4
6       25.29 4.71   Male     No Sun Dinner    4
7        8.77 2.00   Male     No Sun Dinner    2
8       26.88 3.12   Male     No Sun Dinner    4
9       15.04 1.96   Male     No Sun Dinner    2
10      14.78 3.23   Male     No Sun Dinner    2

Reading the first row: at a table of two having dinner on Sunday, with no smokers, a woman paid a total bill of $16.99 and left a $1.01 tip.


  • Step 2: Exploratory Data Analysis (EDA)

    • summary function to check the summary statistics

    • How to explain the summary statistics below?

summary(tips)              
   total_bill         tip             sex               smoker         
 Min.   : 3.07   Min.   : 1.000   Length:244         Length:244        
 1st Qu.:13.35   1st Qu.: 2.000   Class :character   Class :character  
 Median :17.80   Median : 2.900   Mode  :character   Mode  :character  
 Mean   :19.79   Mean   : 2.998                                        
 3rd Qu.:24.13   3rd Qu.: 3.562                                        
 Max.   :50.81   Max.   :10.000                                        
     day                time                size     
 Length:244         Length:244         Min.   :1.00  
 Class :character   Class :character   1st Qu.:2.00  
 Mode  :character   Mode  :character   Median :2.00  
                                       Mean   :2.57  
                                       3rd Qu.:3.00  
                                       Max.   :6.00  

This statistical summary doesn’t reveal the effect of day or gender on the tip, so let’s explore it further with visualization.
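
A grouped summary is another quick way to look at such effects. The sketch below is only an illustration: it retypes the ten rows shown by head(tips, 10) into a small data frame (tips10 is a hypothetical name) so that it runs without downloading the file, which means the numbers describe those ten rows and not the full data.

```r
# First ten rows of the tips data, retyped from the head(tips, 10) output
tips10 <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59,
                 25.29, 8.77, 26.88, 15.04, 14.78),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61,
                 4.71, 2.00, 3.12, 1.96, 3.23),
  sex        = c("Female", "Male", "Male", "Male", "Female",
                 "Male", "Male", "Male", "Male", "Male")
)

# Average tip for each value of sex
tip_by_sex <- aggregate(tip ~ sex, data = tips10, FUN = mean)
tip_by_sex

# Tip as a share of the bill, one value per table
tips10$tip_rate <- tips10$tip / tips10$total_bill
round(tips10$tip_rate, 3)
```

With the full data, replacing tips10 by tips gives the same kind of table for all 244 records.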

  • Attach the dplyr and ggplot2 libraries (for now, just run the code; study its meaning afterwards)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
  • What do you see in the figures below?

    • Distribution of the number of people per table
tips %>% ggplot(aes(size)) + geom_histogram()                                            
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Tip amount according to bill amount (total_bill)

tips %>% ggplot(aes(total_bill, tip)) + geom_point()                                     

Added day information using color

tips %>% ggplot(aes(total_bill, tip)) + 
  geom_point(aes(col = day))                       

Women and men separated by different symbols

tips %>% ggplot(aes(total_bill, tip)) + 
  geom_point(aes(col = day, pch = sex), size = 3) 

  • Step 3: Modeling

    • Limitations of Exploratory Data Analysis: You can design a strategy to make more money, but you can’t predict exactly how much more income will come from the new strategy.

    • Modeling allows predictions

    • Create future financial portfolios

      • E.g.) Know how much your income will increase as the number of people at a table grows, and how much it will change with the gender of the person paying
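
As a rough illustration of what a model adds, the sketch below fits a linear regression with the base R lm function. Linear regression is an assumption made here (the text does not name a modeling method), and the data are only the ten rows shown by head(tips, 10), retyped as tips10 (a hypothetical name), so the fitted numbers are not meaningful beyond the demonstration.

```r
# First ten rows of the tips data, retyped from the head(tips, 10) output
tips10 <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59,
                 25.29, 8.77, 26.88, 15.04, 14.78),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61,
                 4.71, 2.00, 3.12, 1.96, 3.23),
  size       = c(2, 3, 3, 2, 4, 4, 2, 4, 2, 2)
)

# Fit tip as a linear function of the bill amount and the party size
model <- lm(tip ~ total_bill + size, data = tips10)
coef(model)

# Predict the tip for a new table: a $30 bill shared by 4 people
new_table <- data.frame(total_bill = 30, size = 4)
predicted_tip <- predict(model, newdata = new_table)
predicted_tip
```

Unlike a plot, the fitted model returns a concrete number for a table that was never observed.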


Setup your project!

  • Create a new project

    • *.Rproj

    • *.R

    • getwd()

  • Variable and Object

    • An object in R is a data structure used for storing data: Everything in R is an object, including functions, numbers, character strings, vectors, and lists. Each object has attributes such as its type (e.g., integer, numeric, character), its length, and often its dimensions. Objects can be complex structures, like data frames that hold tabular data, or simpler structures like a single numeric value or vector.

    • A variable in R is a name that you assign to an object so that you can refer to it later in your code. When you assign data to a variable, you are effectively labeling that data with a name that you can use to call up the object later on.

Here’s a simple example in R:

my_vector <- c(1, 2, 3)
  • my_vector is a variable. It’s a symbolic name that we’re using to refer to some data we’re interested in.

  • c(1, 2, 3) creates a vector object containing the numbers 1, 2, and 3.

  • This vector is the object, and it’s the actual data structure that R is storing in memory.
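
The difference between a name and the object it refers to can be seen in a short sketch (another_name is a hypothetical variable introduced only for this illustration):

```r
my_vector <- c(1, 2, 3)

# Attributes of the object that my_vector refers to
class(my_vector)    # "numeric"
length(my_vector)   # 3

# A second variable can refer to the same values
another_name <- my_vector
identical(my_vector, another_name)   # TRUE

# Modifying through one name does not change the other:
# R makes a copy when the object is modified
another_name[1] <- 100
my_vector[1]        # 1
another_name[1]     # 100
```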

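# Remove all objects stored in the workspace
# (rm() with no arguments removes nothing; pass the list of object names)
rm(list = ls())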

# Create a vector 1 to 10
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
# Sampling 10 values from the vector 1:10
sample(1:10, 10)
 [1]  7  1 10  4  5  9  8  3  6  2
X <- sample(1:10, 10)
# Extract 2nd to 5th elements of X
X[2:5]
[1] 2 4 5 9
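
Because sample draws at random, the values above change on every run. A minimal sketch of making the draw reproducible with set.seed follows; the seed value 123 is arbitrary.

```r
# Fix the random number generator state, then draw
set.seed(123)
first_draw <- sample(1:10, 10)

# Resetting the same seed reproduces exactly the same draw
set.seed(123)
second_draw <- sample(1:10, 10)

identical(first_draw, second_draw)   # TRUE

# The draw is a permutation: sorting it recovers 1 to 10
sort(first_draw)
```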


Basic Syntax

  • Grammar of data science.

    • Variable: data storage space

    • Data types: numeric, character, categorical, logical, special constants, etc.

    • Operators: arithmetic, comparison, logical operators

    • Vector: a collection of single values of the same type

    • Array: a set of data arranged in rows and columns (or a set of vectors)

    • Data frame: a structure in which different data types are organized in tabular form. Every property (column) has the same length.

    • List: a structure similar to a data frame, but the length of each property can differ.

  • Grammar study is essential to save data and process operations

    • a = 1

    • b = 2

    • c = a+b

  • When a lot of data is needed, such as in student grade processing

    • A single variable cannot represent all the data

    • By using a vector, matrix, data frame, list, etc., it is possible to store a lot of data under one variable name.

    • Many things around us are organized in tabular form for easy data management (e.g. attendance checking, grades, member management).

  • Storing values in variables

    • Value assignment using =, <-, ->

      # Assign 1 to x
      x = 1 
      
      # Assign 2 to y
      y = 2
      
      z = x + y
      
      z
      [1] 3
      x + y -> z
      
      z
      [1] 3
  • Example of exchanging two values

    • Make temporary storage space and save one value in advance

      x = 1
      y = 2
      temp = x
      x = y
      y = temp
      
      x
      [1] 2
      y
      [1] 1
  • Basic data types of R

    • Numeric: int / num / cplx

    • Character: chr

    • Categorical: factor

    • Logical: TRUE (T), FALSE (F)

    • Special constant

      • NULL: undefined value

      • NA: missing value

      • Inf & -Inf: Positive & Negative infinity

      • NaN: Not a Number, values cannot be computed such as 0/0, Inf/Inf, etc

  • Examples for basic data types in R

Numeric

# Data type #

x = 5
y = 2
x/y
[1] 2.5

Complex

xi = 1 + 2i
yi = 1 - 2i
xi+yi
[1] 2+0i

Character (string)

str = "Hello, World!"
str
[1] "Hello, World!"

Categorical (factor)

blood.type = factor(c('A', 'B', 'O', 'AB'))
blood.type
[1] A  B  O  AB
Levels: A AB B O

Logical & Special constant

T
[1] TRUE
F
[1] FALSE
xinf = Inf
yinf = -Inf
xinf/yinf
[1] NaN
  • Data type verification and conversion functions

    • Functions to check data type

      • class(x)

      • typeof(x)

      • is.integer(x)

      • is.numeric(x)

      • is.complex(x)

      • is.character(x)

      • is.na(x)

    • Functions to transform data type

      • as.factor(x)

      • as.integer(x)

      • as.numeric(x)

      • as.character(x)

      • as.matrix(x)

      • as.array(x)

x = 1       # If you simply put 1 in x, x is a numeric type.
x
[1] 1
is.integer(x)
[1] FALSE
x = 1L      # If 1L is entered in x, x is an integer.
x
[1] 1
is.integer(x)
[1] TRUE
x = as.integer(1)    

is.integer(x)
[1] TRUE
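
The conversion functions listed above can be tried in the same way. The sketch below shows a few common cases; the variable names are chosen only for illustration.

```r
# A number stored as text cannot be used in arithmetic until converted
x <- "3.14"
is.character(x)            # TRUE
y <- as.numeric(x)
y + 1                      # 4.14

# as.integer drops the fractional part
as.integer(3.9)            # 3

# Numbers can be turned back into text
as.character(42L)          # "42"

# Text that is not a number becomes NA (R also issues a warning,
# which is suppressed here to keep the output clean)
z <- suppressWarnings(as.numeric("abc"))
is.na(z)                   # TRUE

# Character values can be converted to a categorical type
f <- as.factor(c("A", "B", "A"))
class(f)                   # "factor"
levels(f)                  # "A" "B"
```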
  • Arithmetic Operators

    Operator    Description
    +           addition
    -           subtraction
    *           multiplication
    /           division
    ^ or **     exponentiation
    x %% y      modulus (x mod y); 5 %% 2 is 1
    x %/% y     integer division; 5 %/% 2 is 2
  • Logical Operators

    Operator    Description
    <           less than
    <=          less than or equal to
    >           greater than
    >=          greater than or equal to
    ==          exactly equal to
    !=          not equal to
    !x          NOT x
    x | y       x OR y
    x & y       x AND y
    isTRUE(x)   test if x is TRUE
  • More information for operators: https://www.statmethods.net/management/operators.html
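
A short sketch that exercises the arithmetic, comparison, and logical operators with two numbers (the variable names are chosen only for illustration):

```r
a <- 5
b <- 2

# Arithmetic operators
remainder <- a %% b        # 1
quotient  <- a %/% b       # 2
power     <- a ^ b         # 25

# Comparison operators return logical values
a > b                      # TRUE
a == b                     # FALSE

# Logical operators combine logical values
both   <- (a > 3) & (b > 3)    # FALSE: only the first condition holds
either <- (a > 3) | (b > 3)    # TRUE: at least one condition holds
negated <- !(a == b)           # TRUE

isTRUE(either)             # TRUE
```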


Vector

In R, a vector is one of the most basic data structures used to store a sequence of elements of the same type. Vectors can hold numeric, character, or logical data. They are a fundamental part of R programming, especially for statistical operations.


Creating a Vector

You can create a vector in R using the c() function, which combines individual values into a single vector

# Numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)

# Character vector
char_vector <- c("apple", "banana", "cherry")

# Logical vector
logical_vector <- c(TRUE, FALSE, TRUE)

Accessing Elements of a Vector

You can access elements in a vector using square brackets [ ]:

# Access the second element of the numeric vector
numeric_vector[2]  # Output: 2
[1] 2
# Access multiple elements
numeric_vector[c(1, 3)]  # Output: 1 3
[1] 1 3

Vectorized Operations

R supports vectorized operations, meaning you can perform operations on entire vectors without needing to loop through individual elements:

# Adding a constant to each element of a numeric vector
numeric_vector + 1  # Output: 2 3 4 5 6
[1] 2 3 4 5 6
# Element-wise addition of two vectors
other_vector <- c(5, 4, 3, 2, 1)
numeric_vector + other_vector  # Output: 6 6 6 6 6
[1] 6 6 6 6 6

Common Functions with Vectors

Here are some basic functions you can use with vectors:

  • length(): Returns the number of elements in a vector.

  • sum(): Sums all elements (for numeric vectors).

  • mean(): Calculates the average (for numeric vectors).

Example:

length(numeric_vector)  # Output: 5
[1] 5
sum(numeric_vector)     # Output: 15
[1] 15
mean(numeric_vector)    # Output: 3
[1] 3

More Practices..

# Create a vector with 7 elements by increasing the numbers 1 to 7 by 1.
1:7         
[1] 1 2 3 4 5 6 7
# Decrease by 1 from 7 to 1 to create a vector with 7 elements.
7:1     
[1] 7 6 5 4 3 2 1
vector(length = 5)
[1] FALSE FALSE FALSE FALSE FALSE
# Create a vector of the elements 1 to 5. Same as 1:5
c(1:5)      
[1] 1 2 3 4 5
# Create a vector of elements 1 to 6 by combining elements 1 to 3 and elements 4 to 6
c(1, 2, 3, c(4:6))  
[1] 1 2 3 4 5 6
# Store a vector of the elements 1 to 3 in x
x = c(1, 2, 3)  
x       
[1] 1 2 3
# Create y as an empty vector
y = c()         

# Append the c(1:3) vector to the existing y vector
y = c(y, c(1:3))    
y   
[1] 1 2 3
# Create a vector from 1 to 10 in increments of 2
seq(from = 1, to = 10, by = 2)  
[1] 1 3 5 7 9
# Same as above
seq(1, 10, by = 2)      
[1] 1 3 5 7 9
# Create a vector with 11 elements from 0 to 1 in increments of 0.1
seq(0, 1, by = 0.1)             
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
# Create a vector with 11 elements from 0 to 1
seq(0, 1, length.out = 11)      
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
# Create a vector by repeating the (1, 2, 3) vector twice
rep(c(1:3), times = 2)  
[1] 1 2 3 1 2 3
# Create a vector by repeating each element of the (1, 2, 3) vector twice
rep(c(1:3), each = 2)       
[1] 1 1 2 2 3 3
x = c(2, 4, 6, 8, 10)

# Find the length (size) of the x vector
length(x)   
[1] 5
# Find the value of element 1 of the x vector
x[1]        
[1] 2
# x[1, 2, 3] raises an error: commas inside [ ] separate dimensions, and a vector has only one

# x[1, 2, 3]        

# To select elements 1, 2, and 3 of the x vector, group the indices into a vector.

x[c(1, 2, 3)] 
[1] 2 4 6
# Output the value excluding elements 1, 2, and 3 from the x vector

x[-c(1, 2, 3)] 
[1]  8 10
# Print elements 1 to 3 in the x vector
x[c(1:3)]       
[1] 2 4 6
# Add 2 to each individual element of the x vector
x = c(1, 2, 3, 4)
y = c(5, 6, 7, 8)
z = c(3, 4)
w = c(5, 6, 7)
x+2         
[1] 3 4 5 6
# Since the size of the x vector and y vector are the same, each element is added
x + y   
[1]  6  8 10 12
# If the length of x is an integer multiple of the length of z, the elements of the shorter vector are recycled
x + z       
[1] 4 6 6 8
# A warning (not an error) is issued because the length of x is not an integer multiple of the length of w; the result is still computed
x + w       
Warning in x + w: longer object length is not a multiple of shorter object
length
[1]  6  8 10  9
# Check whether each element of the x vector is greater than 5

x > 5       
[1] FALSE FALSE FALSE FALSE
# Check if all elements of the x vector are greater than 5
all(x > 5)      
[1] FALSE
# Check if any of the element values of the x vector are greater than 5
any(x > 5)  
[1] FALSE
x = 1:10
# Extract the first 6 elements of data
head(x)         
[1] 1 2 3 4 5 6
# Extract the last 6 elements of data
tail(x)         
[1]  5  6  7  8  9 10
# Extract the first 3 elements of data
head(x, 3)  
[1] 1 2 3
# Extract the last 3 elements of data
tail(x, 3) 
[1]  8  9 10
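
The comparison x > 5 shown above returns a logical vector, and that vector can itself be used as an index. A minimal sketch, using a small vector defined here for illustration:

```r
x <- c(2, 4, 6, 8, 10)

# Logical vector: TRUE where the condition holds
is_large <- x > 5
is_large                # FALSE FALSE  TRUE  TRUE  TRUE

# Use the logical vector as an index to keep only matching elements
x[is_large]             # 6 8 10

# Positions of the matching elements
which(x > 5)            # 3 4 5

# TRUE counts as 1, so sum() counts the matches
sum(x > 5)              # 3
```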


Sets

x = c(1, 2, 3)
y = c(3, 4, 5)
z = c(3, 1, 2)

# Union set
union(x, y) 
[1] 1 2 3 4 5
# Intersection set
intersect(x, y) 
[1] 3
# Set difference (X - Y)
setdiff(x, y)   
[1] 1 2
# Set difference (Y - X)
setdiff(y, x)   
[1] 4 5
# Compare whether x and y have the same elements
setequal(x, y)  
[1] FALSE
# Compare whether x and z have the same elements
setequal(x, z) 
[1] TRUE
  • Vectorized codes
c(1, 2, 4) + c(2, 3, 5)
[1] 3 5 9


X <- c(1,2,4,5)

X * 2
[1]  2  4  8 10
  • Recycling rule
1:4 + c(1, 2)
[1] 2 4 4 6
X<-c(1,2,4,5)
X * 2
[1]  2  4  8 10
1:4 + 1:3
Warning in 1:4 + 1:3: longer object length is not a multiple of shorter object
length
[1] 2 4 6 5
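
The recycling rule can be made explicit with rep: the shorter vector is repeated until it matches the longer one, and then the operation is element-wise. A sketch with illustrative variable names:

```r
long  <- 1:4
short <- c(1, 2)

# What R does implicitly: repeat the shorter vector to the longer length
recycled <- rep(short, length.out = length(long))
recycled                                   # 1 2 1 2

# The implicit and explicit versions give the same result
identical(long + short, long + recycled)   # TRUE

# A single number is a vector of length 1, so it is recycled too
long * 2                                   # 2 4 6 8
```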


Array

Understanding Arrays in R: Concepts and Examples

Arrays are a fundamental data structure in R that extend vectors by allowing you to store multi-dimensional data. While a vector has one dimension, arrays in R can have two or more dimensions, making them incredibly versatile for complex data organization.

What is an Array in R?

An array in R is a collection of elements of the same type arranged in a grid of a specified dimensionality. It is a multi-dimensional data structure that can hold values in more than two dimensions. Arrays are particularly useful in scenarios where operations on multi-dimensional data are required, such as matrix computations, tabulations, and various applications in data analysis and statistics.

Creating an Array

To create an array in R, you can use the array function. This function takes a vector of data and a vector of dimensions as arguments. For example:

# Create a 2x3 array
my_array <- array(1:6, dim = c(2, 3))
print(my_array)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

This code snippet creates a 2x3 array (2 rows and 3 columns) with the numbers 1 to 6.

Accessing Array Elements

Elements within an array can be accessed using indices for each dimension in square brackets []. For example:

# Access the element in the 1st row and 2nd column
element <- my_array[1, 2]
print(element)
[1] 3

Modifying Arrays

Just like vectors, you can modify the elements of an array by accessing them using their indices and assigning new values. For example:

# Modify the element in the 1st row and 2nd column to be 20
my_array[1, 2] <- 20
print(my_array)
     [,1] [,2] [,3]
[1,]    1   20    5
[2,]    2    4    6

Operations on Arrays

R allows you to perform operations on arrays. These operations can be element-wise or can involve the entire array. For example, you can add two arrays of the same dimensions, and R will perform element-wise addition.
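
A minimal sketch of element-wise operations on two arrays of the same dimensions (the arrays a and b are defined here only for illustration):

```r
# Two 2x3 arrays filled column by column
a <- array(1:6, dim = c(2, 3))
b <- array(6:1, dim = c(2, 3))

# Element-wise addition: each cell of a is added to the matching cell of b
total <- a + b
total                 # every cell is 7

# An operation with a single number is applied to every cell
doubled <- a * 2
doubled

# The result keeps the dimensions of the inputs
dim(total)            # 2 3
```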

Example: Creating and Manipulating a 3D Array

# Create a 3x2x2 array
my_3d_array <- array(1:12, dim = c(3, 2, 2))
print(my_3d_array)
, , 1

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

, , 2

     [,1] [,2]
[1,]    7   10
[2,]    8   11
[3,]    9   12
# Access an element (2nd row, 1st column, 2nd matrix)
element_3d <- my_3d_array[2, 1, 2]
print(element_3d)
[1] 8


# Create N-dimensional array

# Fill a 2×4 array with the values 1 to 5 (recycled to fill all 8 cells)
x = array(1:5, c(2, 4)) 

x
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    2
[2,]    2    4    1    3
# Print row 1 element value
x[1, ] 
[1] 1 3 5 2
# Print 2nd column element values
x[, 2] 
[1] 3 4
# Set row and column names
dimnamex = list(c("1st", "2nd"), c("1st", "2nd", "3rd", "4th")) 

x = array(1:5, c(2, 4), dimnames = dimnamex)
x
    1st 2nd 3rd 4th
1st   1   3   5   2
2nd   2   4   1   3
x["1st", ]
1st 2nd 3rd 4th 
  1   3   5   2 
x[, "4th"]
1st 2nd 
  2   3 
# Create a two-dimensional array
x = 1:12
x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
matrix(x, nrow = 3)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
matrix(x, nrow = 3, byrow = T)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
# Create an array by combining vectors
v1 = c(1, 2, 3, 4)
v2 = c(5, 6, 7, 8)
v3 = c(9, 10, 11, 12)

# Create an array by binding by column
cbind(v1, v2, v3) 
     v1 v2 v3
[1,]  1  5  9
[2,]  2  6 10
[3,]  3  7 11
[4,]  4  8 12
# Create array by binding row by row
rbind(v1, v2, v3) 
   [,1] [,2] [,3] [,4]
v1    1    2    3    4
v2    5    6    7    8
v3    9   10   11   12
# Various matrix operations using the operators in [Table 3-7]
# Store two 2×2 matrices in x and y, respectively
x = array(1:4, dim = c(2, 2))
y = array(5:8, dim = c(2, 2))
x
     [,1] [,2]
[1,]    1    3
[2,]    2    4
y
     [,1] [,2]
[1,]    5    7
[2,]    6    8
x+y
     [,1] [,2]
[1,]    6   10
[2,]    8   12
x-y
     [,1] [,2]
[1,]   -4   -4
[2,]   -4   -4
# element-wise multiplication
x * y 
     [,1] [,2]
[1,]    5   21
[2,]   12   32
# mathematical matrix multiplication
x %*% y 
     [,1] [,2]
[1,]   23   31
[2,]   34   46
# transpose matrix of x
t(x) 
     [,1] [,2]
[1,]    1    2
[2,]    3    4
# inverse of x
solve(x) 
     [,1] [,2]
[1,]   -2  1.5
[2,]    1 -0.5
# determinant of x
det(x) 
[1] -2
x = array(1:12, c(3, 4))
x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# If the second argument (MARGIN) is 1, the function is applied to each row
apply(x, 1, mean) 
[1] 5.5 6.5 7.5
# If the second argument (MARGIN) is 2, the function is applied to each column
apply(x, 2, mean) 
[1]  2  5  8 11
x = array(1:12, c(3, 4))
dim(x)
[1] 3 4
x = array(1:12, c(3, 4))

# Randomly shuffle the array elements
sample(x) 
 [1]  2  7  1  8  3 10  4  5 12 11  6  9
# Randomly draw 10 elements from the array (without replacement)
sample(x, 10) 
 [1] 10 11  2  5  8  3  4  6  7 12
library(dplyr)
# ?sample

# The extraction probability for each element can be varied
sample(x, 10, prob = c(1:12)/24) 
 [1]  7 10  9  2 11  6  5  3 12  4
# Passing a single number n draws from 1:n
sample(10) 
 [1]  1  5  6  3  2  9 10  8  7  4


Data.frame

In R, a data.frame is a two-dimensional table-like data structure that holds data in rows and columns. It’s one of the most commonly used data structures, especially when dealing with tabular data, similar to spreadsheets or SQL tables.

Characteristics of data.frame

  • Each column can hold different types of data (numeric, character, logical, etc.).

  • Each row represents a single observation, and each column represents a variable.

  • Columns can have different data types, but all values within a column must be of the same type.

Creating a data.frame

You can create a data.frame using the data.frame() function by combining vectors of equal length.

# Create a data frame with numeric, character, and logical columns
my_data <- data.frame(
  ID = c(1, 2, 3),
  Name = c("John", "Sarah", "Mike"),
  Age = c(25, 30, 22),
  IsStudent = c(TRUE, FALSE, TRUE)
)

# View the data frame
my_data
  ID  Name Age IsStudent
1  1  John  25      TRUE
2  2 Sarah  30     FALSE
3  3  Mike  22      TRUE

Accessing Elements in a data.frame

You can access elements by referring to rows and columns:

  • By column name: You can use the $ operator or square brackets [ , ] to extract a column.
# Extract the 'Name' column using $
my_data$Name  # Output: "John" "Sarah" "Mike"
[1] "John"  "Sarah" "Mike" 
# Extract the 'Age' column using square brackets
my_data[, "Age"]  # Output: 25 30 22
[1] 25 30 22

  • By row number: You can also use row indices to access specific rows or a combination of rows and columns.

# Extract the first row
my_data[1, ]  # Output: 1 "John" 25 TRUE
  ID Name Age IsStudent
1  1 John  25      TRUE
# Extract the value in the second row, third column
my_data[2, 3]  # Output: 30
[1] 30

Adding New Columns or Rows

You can add new columns or rows to an existing data.frame:

  • Adding a new column:
my_data$Grade <- c("A", "B", "A")
my_data
  ID  Name Age IsStudent Grade
1  1  John  25      TRUE     A
2  2 Sarah  30     FALSE     B
3  3  Mike  22      TRUE     A
  • Adding a new row:
new_row <- data.frame(ID = 4, Name = "Emma", Age = 28, IsStudent = FALSE, Grade = "B")
my_data <- rbind(my_data, new_row)
my_data
  ID  Name Age IsStudent Grade
1  1  John  25      TRUE     A
2  2 Sarah  30     FALSE     B
3  3  Mike  22      TRUE     A
4  4  Emma  28     FALSE     B


More Practices

# Data Frame #
name = c("Cheolsu", "Chunhyang", "Gildong")
age = c(22, 20, 25)
gender = factor(c("M", "F", "M"))
blood.type = factor(c("A", "O", "B"))
patients = data.frame(name, age, gender, blood.type)
patients
       name age gender blood.type
1   Cheolsu  22      M          A
2 Chunhyang  20      F          O
3   Gildong  25      M          B
# The same data frame can also be created in a single call:
patients1 = data.frame(name = c("Cheolsu", "Chunhyang", "Gildong"), 
                       age = c(22, 20, 25), 
                       gender = factor(c("M", "F", "M")), 
                       blood.type = factor(c("A", "O", "B")))

patients1
       name age gender blood.type
1   Cheolsu  22      M          A
2 Chunhyang  20      F          O
3   Gildong  25      M          B
patients$name # Print name attribute value
[1] "Cheolsu"   "Chunhyang" "Gildong"  
patients[1, ] # Print row 1 value
     name age gender blood.type
1 Cheolsu  22      M          A
patients[, 2] # Print 2nd column values
[1] 22 20 25
patients[3, 1] # Print the value in row 3, column 1
[1] "Gildong"
patients[patients$name=="Cheolsu", ] # Extract Cheolsu's information from the patients
     name age gender blood.type
1 Cheolsu  22      M          A
patients[patients$name=="Cheolsu", c("name", "age")] # Extract only Cheolsu's name and age information
     name age
1 Cheolsu  22
head(cars) # Check the cars data set. By default, the head function returns the first 6 rows.
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
attach(cars) # Use the attach function to make each column of cars available as a variable
speed # The variable name speed can be used directly.
 [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
[26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
detach(cars) # The detach function undoes attach, so the columns are no longer available as variables
# speed # After detach, accessing speed fails because no such variable exists.

# Apply functions using data properties
mean(cars$speed)
[1] 15.4
max(cars$speed)
[1] 25
# Apply a function using the with function
with(cars, mean(speed))
[1] 15.4
with(cars, max(speed))
[1] 25
# Extract only data with speed greater than 20
subset(cars, speed > 20)
   speed dist
44    22   66
45    23   54
46    24   70
47    24   92
48    24   93
49    24  120
50    25   85
# Extract only the dist column for rows with speed over 20; to select several columns, list them in c() separated by commas
subset(cars, speed > 20, select = c(dist))
   dist
44   66
45   54
46   70
47   92
48   93
49  120
50   85
# Extract every column except dist for rows with speed over 20
subset(cars, speed > 20, select = -c(dist))
   speed
44    22
45    23
46    24
47    24
48    24
49    24
50    25
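
# The same rows and columns can be selected with square brackets.
# A sketch for comparison with subset(); fast and fast_dist are
# illustrative names, and drop = FALSE keeps a one-column result
# as a data frame instead of simplifying it to a vector.

```r
# Rows with speed over 20, all columns
fast <- cars[cars$speed > 20, ]
nrow(fast)                     # 7

# Rows with speed over 20, only the dist column (still a data frame)
fast_dist <- cars[cars$speed > 20, "dist", drop = FALSE]
names(fast_dist)               # "dist"

# Both forms select the same data as the subset() calls
all.equal(fast, subset(cars, speed > 20))
all.equal(fast_dist, subset(cars, speed > 20, select = c(dist)))
```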
head(airquality) # airquality data contains NA
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
head(na.omit(airquality)) # Drop the rows that contain NA
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
7    23     299  8.6   65     5   7
8    19      99 13.8   59     5   8
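
# Other base R tools for missing values, shown as a sketch alongside
# na.omit(); the names na_counts, complete, and clean are illustrative.

```r
# Number of NA values in each column
na_counts <- colSums(is.na(airquality))
na_counts

# complete.cases() marks the rows that contain no NA at all
complete <- complete.cases(airquality)
clean <- airquality[complete, ]

# Keeping only complete rows matches what na.omit() keeps
nrow(clean) == nrow(na.omit(airquality))   # TRUE

# Many summary functions return NA if any value is missing ...
mean(airquality$Ozone)                     # NA

# ... unless the missing values are removed with na.rm = TRUE
mean(airquality$Ozone, na.rm = TRUE)
```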
# merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"), incomparables = NULL, ...)

name = c("Cheolsu", "Chunhyang", "Gildong")
age = c(22, 20, 25)
gender = factor(c("M", "F", "M"))
blood.type = factor(c("A", "O", "B"))
patients1 = data.frame(name, age, gender)
patients1
       name age gender
1   Cheolsu  22      M
2 Chunhyang  20      F
3   Gildong  25      M
patients2 = data.frame(name, blood.type)
patients2
       name blood.type
1   Cheolsu          A
2 Chunhyang          O
3   Gildong          B
patients = merge(patients1, patients2, by = "name")
patients
       name age gender blood.type
1   Cheolsu  22      M          A
2 Chunhyang  20      F          O
3   Gildong  25      M          B
# If the two data frames have no key column with the same name,
# specify the column to use from each one with by.x and by.y of the merge function.
name1 = c("Cheolsu", "Chunhyang", "Gildong")
name2 = c("Minsu", "Chunhyang", "Gildong")
age = c(22, 20, 25)
gender = factor(c("M", "F", "M"))
blood.type = factor(c("A", "O", "B"))
patients1 = data.frame(name1, age, gender)
patients1
      name1 age gender
1   Cheolsu  22      M
2 Chunhyang  20      F
3   Gildong  25      M
patients2 = data.frame(name2, blood.type)
patients2
      name2 blood.type
1     Minsu          A
2 Chunhyang          O
3   Gildong          B
patients = merge(patients1, patients2, by.x = "name1", by.y = "name2")
patients
      name1 age gender blood.type
1 Chunhyang  20      F          O
2   Gildong  25      M          B
# all = TRUE keeps the unmatched rows of both data frames and fills the missing cells with NA
patients = merge(patients1, patients2, by.x = "name1", by.y = "name2", all = TRUE)
patients
      name1 age gender blood.type
1   Cheolsu  22      M       <NA>
2 Chunhyang  20      F          O
3   Gildong  25      M          B
4     Minsu  NA   <NA>          A
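The merge() signature shown earlier also lists all.x and all.y, which keep the unmatched rows of only one side. A small sketch with the same names as above; p1 and p2 are illustrative stand-ins for patients1 and patients2.

```r
p1 = data.frame(name1 = c("Cheolsu", "Chunhyang", "Gildong"),
                age = c(22, 20, 25))
p2 = data.frame(name2 = c("Minsu", "Chunhyang", "Gildong"),
                blood.type = c("A", "O", "B"))

# all.x = TRUE keeps every row of the first data frame (left join)
left = merge(p1, p2, by.x = "name1", by.y = "name2", all.x = TRUE)
left

# all.y = TRUE keeps every row of the second data frame (right join)
right = merge(p1, p2, by.x = "name1", by.y = "name2", all.y = TRUE)
right
```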
x = array(1:12, c(3, 4))

# Currently x is not a data frame
is.data.frame(x) 
[1] FALSE
as.data.frame(x)
  V1 V2 V3 V4
1  1  4  7 10
2  2  5  8 11
3  3  6  9 12
# Just calling the is.data.frame function does not turn x into a data frame
is.data.frame(x)
[1] FALSE
# Convert x to data frame format with the as.data.frame function
x = as.data.frame(x)
x
  V1 V2 V3 V4
1  1  4  7 10
2  2  5  8 11
3  3  6  9 12
# Verify that x has been converted to data frame format
is.data.frame(x)
[1] TRUE
# The column names assigned automatically during conversion can be replaced with the names() function.
names(x) = c("1st", "2nd", "3rd", "4th")
x
  1st 2nd 3rd 4th
1   1   4   7  10
2   2   5   8  11
3   3   6   9  12
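Names such as "1st" start with a digit, so they are not valid R identifiers. The sketch below shows two ways to access such columns and what make.names() would produce for them.

```r
x = as.data.frame(array(1:12, c(3, 4)))
names(x) = c("1st", "2nd", "3rd", "4th")

# With $, a non-syntactic name must be quoted with backticks
x$`1st`

# [[ ]] takes the name as an ordinary character string
x[["2nd"]]

# make.names() shows the syntactic form R would generate for these names
make.names(names(x))
```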

List

In R, a list is a data structure that can store elements of several types at once, including vectors, other lists, data frames, and functions. Unlike an atomic vector, whose elements must share one type, or a data frame, whose columns must share one length, a list places no such restriction on its elements.

Characteristics of a List

  • A list can hold different data types (numeric, character, logical, etc.) within the same structure.

  • Each element in a list can be of different lengths and types, including even other lists or data frames.
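To illustrate the two points above, the sketch below contrasts c(), which coerces mixed values to one type, with list(), which keeps each element as it is. The names mixed_vector and mixed_list are illustrative.

```r
# c() builds an atomic vector, so mixed values are coerced to one type
mixed_vector <- c(1, "a", TRUE)
class(mixed_vector)

# list() keeps each element's own type and length
mixed_list <- list(1, "a", TRUE, c(10, 20, 30))
sapply(mixed_list, class)
lengths(mixed_list)
```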

Creating a List

You can create a list in R using the list() function:

# Creating a list with different data types
my_list <- list(
  Name = "John",
  Age = 25,
  Scores = c(90, 85, 88),
  Passed = TRUE
)

# View the list
my_list
$Name
[1] "John"

$Age
[1] 25

$Scores
[1] 90 85 88

$Passed
[1] TRUE

Accessing Elements of a List

You can access elements in a list using the $ operator, double square brackets [[ ]], or single square brackets [ ]:

  • By name (using $ or [[ ]]):
# Access the 'Name' element using $
my_list$Name  # Output: "John"
[1] "John"
# Access the 'Scores' element using [[ ]]
my_list[["Scores"]]  # Output: 90 85 88
[1] 90 85 88
  • By position (using [[ ]]):
# Access the second element (Age) by position
my_list[[2]]  # Output: 25
[1] 25
  • Using single square brackets [ ]: This returns a sublist, rather than the element itself.
# Access the 'Name' element as a sublist
my_list["Name"]  # Output: a sublist containing "Name"
$Name
[1] "John"
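The difference between [[ ]] and [ ] matters as soon as the result is passed to another function. A short sketch, rebuilding my_list so that it runs on its own; subset_list is an illustrative name.

```r
my_list <- list(Name = "John", Age = 25, Scores = c(90, 85, 88), Passed = TRUE)

# [[ ]] returns the element itself, so numeric functions work on it
class(my_list[["Scores"]])
mean(my_list[["Scores"]])

# [ ] returns a list, which is useful for picking several elements at once
class(my_list["Scores"])
subset_list <- my_list[c("Name", "Age")]
names(subset_list)
```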

Modifying a List

  • Adding new elements: You can add a new element to an existing list by assigning a value to a name that is not yet in the list.
# Adding a new element 'Grade'
my_list$Grade <- "A"
my_list
$Name
[1] "John"

$Age
[1] 25

$Scores
[1] 90 85 88

$Passed
[1] TRUE

$Grade
[1] "A"
  • Modifying existing elements: You can modify elements by assigning a new value to them.
# Modify the 'Age' element
my_list$Age <- 26
my_list$Age  # Output: 26
[1] 26
  • Removing elements: To remove an element from a list, you can set it to NULL.
# Remove the 'Grade' element
my_list$Grade <- NULL
my_list
$Name
[1] "John"

$Age
[1] 26

$Scores
[1] 90 85 88

$Passed
[1] TRUE
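Because assigning NULL deletes an element, storing an actual NULL value needs a different form. The sketch below shows that case, along with adding an element through [[ ]]; the variable field and the element Note are illustrative.

```r
my_list <- list(Name = "John", Age = 26)

# [[ ]] can also add or modify an element, handy when the name is held in a variable
field <- "Grade"
my_list[[field]] <- "A"

# Assigning NULL with $ or [[ ]] deletes the element
my_list$Grade <- NULL
length(my_list)

# To keep an element whose value is NULL, assign list(NULL) with single brackets
my_list["Note"] <- list(NULL)
length(my_list)
names(my_list)
```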

Nested Lists

Lists can also contain other lists, making them very flexible for storing hierarchical or structured data.

# Creating a nested list
nested_list <- list(
  Name = "Sarah",
  Details = list(Age = 28, Occupation = "Data Scientist"),
  Skills = c("R", "Python", "SQL")
)

# Accessing elements within a nested list
nested_list$Details$Age  # Output: 28
[1] 28
nested_list[["Details"]][["Occupation"]]  # Output: "Data Scientist"
[1] "Data Scientist"
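Elements inside a nested list can be modified with the same chained syntax used to read them. A minimal sketch; the City field and its value are made up for illustration.

```r
nested_list <- list(
  Name = "Sarah",
  Details = list(Age = 28, Occupation = "Data Scientist"),
  Skills = c("R", "Python", "SQL")
)

# Modify a value inside the inner list
nested_list$Details$Age <- 29

# Add a new field to the inner list (illustrative value)
nested_list$Details$City <- "Seoul"

# Index into a vector stored inside the list
nested_list$Skills[2]

# str() prints a compact outline of the whole hierarchy
str(nested_list)
```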

Combining Lists

You can combine multiple lists using the c() function:

# Combine two lists
list1 <- list(A = 1, B = 2)
list2 <- list(C = 3, D = 4)
combined_list <- c(list1, list2)
combined_list
$A
[1] 1

$B
[1] 2

$C
[1] 3

$D
[1] 4
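Note that c() flattens one level, whereas list() would nest the two lists instead. The sketch below contrasts the two and also shows append(), which inserts at a chosen position; flat, nested, and inserted are illustrative names.

```r
list1 <- list(A = 1, B = 2)
list2 <- list(C = 3, D = 4)

# c() joins the elements into one flat list
flat <- c(list1, list2)
length(flat)

# list() wraps the two lists as two elements of a new list
nested <- list(list1, list2)
length(nested)

# append() inserts one list into another after the given position
inserted <- append(list1, list2, after = 1)
names(inserted)
```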

Common Functions for Lists

  • length(): Returns the number of elements in a list.
length(my_list)  # Output: 4 (after removing Grade)
[1] 4
  • str(): Displays the structure of the list.
  • lapply(): Applies a function to each element of a list and returns a list.
# Apply the 'mean' function to each element (for lists with numeric values)
num_list <- list(a = c(1, 2, 3), b = c(4, 5, 6))
lapply(num_list, mean)
$a
[1] 2

$b
[1] 5
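  • unlist(): Flattens a list into a single vector. Elements of different types are coerced to a common type, which is why every value below becomes a character string.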
# Convert a list to a vector
unlist(my_list)
   Name     Age Scores1 Scores2 Scores3  Passed 
 "John"    "26"    "90"    "85"    "88"  "TRUE" 

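How unlist() names and types its result depends on the input. A small sketch, assuming the num_list from the lapply() example; flat_num and flat_mixed are illustrative names.

```r
num_list <- list(a = c(1, 2, 3), b = c(4, 5, 6))

# All elements are numeric, so the result stays numeric
flat_num <- unlist(num_list)
flat_num
class(flat_num)

# use.names = FALSE drops the generated names such as a1, a2, ...
unlist(num_list, use.names = FALSE)

# Mixing character and numeric elements forces everything to character
flat_mixed <- unlist(list(Name = "John", Age = 26))
class(flat_mixed)
```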

More Practices

# List #
patients = data.frame(name = c("Cheolsu", "Chunhyang", "Gildong"), 
                      age = c(22, 20, 25), 
                      gender = factor(c("M", "F", "M")),
                      blood.type = factor(c("A", "O", "B")))

no.patients = data.frame(day = c(1:6), no = c(50, 60, 55, 52, 65, 58))


# Combine the two data frames into a list without naming the elements
listPatients = list(patients, no.patients)
listPatients
[[1]]
       name age gender blood.type
1   Cheolsu  22      M          A
2 Chunhyang  20      F          O
3   Gildong  25      M          B

[[2]]
  day no
1   1 50
2   2 60
3   3 55
4   4 52
5   5 65
6   6 58
# Give each element a name when building the list
listPatients = list(patients = patients, no.patients = no.patients)
listPatients
$patients
       name age gender blood.type
1   Cheolsu  22      M          A
2 Chunhyang  20      F          O
3   Gildong  25      M          B

$no.patients
  day no
1   1 50
2   2 60
3   3 55
4   4 52
5   5 65
6   6 58
# Access an element by name with $
listPatients$patients
       name age gender blood.type
1   Cheolsu  22      M          A
2 Chunhyang  20      F          O
3   Gildong  25      M          B
# Access an element by position with [[ ]]
listPatients[[1]]
       name age gender blood.type
1   Cheolsu  22      M          A
2 Chunhyang  20      F          O
3   Gildong  25      M          B
# Access an element by its quoted name inside [[ ]]
listPatients[["patients"]]
       name age gender blood.type
1   Cheolsu  22      M          A
2 Chunhyang  20      F          O
3   Gildong  25      M          B
# The quoted name works the same way for the second element
listPatients[["no.patients"]] 
  day no
1   1 50
2   2 60
3   3 55
4   4 52
5   5 65
6   6 58
# Compute the mean of each column of the no.patients element
lapply(listPatients$no.patients, mean)
$day
[1] 3.5

$no
[1] 56.66667
# Compute the mean of each column of the patients element. Non-numeric columns cannot be averaged, so they return NA with a warning.
lapply(listPatients$patients, mean)
Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
returning NA
Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
returning NA
Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
returning NA
$name
[1] NA

$age
[1] 22.33333

$gender
[1] NA

$blood.type
[1] NA
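One way to avoid these warnings is to keep only the numeric columns before averaging. A minimal sketch using Filter() from base R; patients_demo, numeric_cols, and numeric_means are illustrative names.

```r
patients_demo = data.frame(name = c("Cheolsu", "Chunhyang", "Gildong"),
                           age = c(22, 20, 25),
                           gender = factor(c("M", "F", "M")),
                           blood.type = factor(c("A", "O", "B")))

# Keep only the columns for which is.numeric() returns TRUE
numeric_cols = Filter(is.numeric, patients_demo)

# Averaging now runs without warnings
numeric_means = lapply(numeric_cols, mean)
numeric_means
```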
sapply(listPatients$no.patients, mean)
     day       no 
 3.50000 56.66667 
# With simplify = FALSE, sapply() returns a list, the same result as lapply().
sapply(listPatients$no.patients, mean, simplify = FALSE)
$day
[1] 3.5

$no
[1] 56.66667
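Since the shape of the sapply() result depends on what the function returns, base R also offers vapply(), where the expected type of each result is declared up front. A short sketch that rebuilds the no.patients data so it runs on its own; col_means is an illustrative name.

```r
no.patients = data.frame(day = 1:6, no = c(50, 60, 55, 52, 65, 58))

# The third argument declares that each result must be one numeric value;
# vapply() stops with an error if a result does not match
col_means = vapply(no.patients, mean, numeric(1))
col_means

# The values match what sapply() returns for the same input
all.equal(col_means, sapply(no.patients, mean))
```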