# Data input, assessment
library(titanic)
library(readr) # Data input with readr::read_csv()
library(descr) # descr::CrossTable() - Frequency by category, check with ratio figures
# Visualization
library(VIM) # Missing values assessment used by VIM::aggr()
library(RColorBrewer) # Plot color setting
library(scales) # plot setting - x, y axis
# Feature engineering, Data Pre-processing
library(tidyverse) # dplyr, ggplot2, purrr, etc.
library(ggpubr)
library(randomForest)
# Model validation
library(caret) # caret::confusionMatrix()
library(ROCR) # Plotting ROC Curve
About ML & Modelling
EDA Review
Before attending class for Week 2, please complete the following tasks:
Watch the pre-class video for Week 2.
Submit Discussion #1. Note that submissions will not be accepted after class.
Ensure that your laptop has R and RStudio installed and ready to use.
Pre-class video
- What is ML?
- What is Modelling?
Discussion
Discussion #1
Class content
1. Introduction
This document walks through an analysis of Titanic, one of the best-known Kaggle competitions.
2. Prepare to work
2.1 Packages
see “What is a package in R”
This step loads the packages used for the analysis: widely used R packages such as tidyverse (which includes ggplot2 and dplyr), plus the others listed above.
2.2 Raw data import
In the Titanic competition, the train data used to build the model and the test data used for the actual prediction (estimation) are provided separately.
Here, we load both data sets and combine them into one. Combining them ensures that feature engineering and pre-processing are applied in the same way to the input variables of both sets.
Please see this link if you want to know about the story of Titanic.
titanic_train %>% glimpse()
Rows: 891
Columns: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
$ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
titanic_test %>% glimpse()
Rows: 418
Columns: 11
$ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903…
$ Pclass <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3…
$ Name <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)", "M…
$ Sex <chr> "male", "female", "male", "male", "female", "male", "femal…
$ Age <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.0…
$ SibSp <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0…
$ Parch <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Ticket <chr> "330911", "363272", "240276", "315154", "3101298", "7538",…
$ Fare <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292, 2…
$ Cabin <chr> "", "", "", "", "", "", "", "", "", "", "", "", "B45", "",…
$ Embarked <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S", "S"…
train <- titanic_train
test <- titanic_test
full <- dplyr::bind_rows(train, test)
full %>% glimpse()
Rows: 1,309
Columns: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
$ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
rbind() was not used to combine the two data sets into full because Survived, the dependent variable (target variable, Y) of the Titanic competition, does not exist in test. The two data sets therefore have different numbers of columns (12 and 11), and rbind() cannot merge them. dplyr::bind_rows(), in contrast, fills Survived with NA for the test rows and merges the two into one.
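The difference can be sketched with base R alone. The two small data frames below are made-up stand-ins for train and test (their names and values are assumptions for illustration, not the real data); the manual NA fill mimics what dplyr::bind_rows() does automatically.

```r
# Toy stand-ins for train (has Survived) and test (no Survived); values are made up
train_toy <- data.frame(PassengerId = 1:2, Survived = c(0L, 1L), Age = c(22, 38))
test_toy  <- data.frame(PassengerId = 3:4, Age = c(34.5, 47))

# rbind() stops with an error because the column counts differ (3 vs 2)
rbind_result <- tryCatch(rbind(train_toy, test_toy),
                         error = function(e) "error")

# Manual equivalent of bind_rows(): add the missing column as NA, then stack
test_toy$Survived <- NA_integer_
full_toy <- rbind(train_toy, test_toy[, names(train_toy)])

print(rbind_result)
print(full_toy)
```

The rows that came from the toy test set end up with Survived = NA, which is exactly the situation in full.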
2.3 Variable descriptions
Variable name | Description | Type |
---|---|---|
PassengerId | Unique ID number that identifies the passenger | Integer |
Survived | Indicates whether or not the passenger survived. Survival is 1 and death is 0. | Factor |
Pclass | The class of the cabin, with 3 categories from 1st class (1) to 3rd class (3). | Ord.Factor |
Name | Passenger’s name | Factor |
Sex | Passenger’s gender | Factor |
Age | Age of passenger | Numeric |
SibSp | Variable describing the number of siblings or spouses accompanying each passenger. It can range from 0 to 8. | Integer |
Parch | Variable describing the number of parents or children accompanying each passenger, from 0 to 9. | Integer |
Ticket | String variable for the ticket the passenger boarded | Factor |
Fare | Amount the passenger paid for the trip | Numeric |
Cabin | Variable that distinguishes each passenger’s cabin, with too many categories and missing values. | Factor |
Embarked | Indicates the boarding port and departure port, and consists of three categories: C, Q, and S. | Factor |
2.4 Change the variables type
Before the full-scale EDA and feature engineering, let’s change the type of some variables. For example, Pclass is stored as a number, but 1, 2, and 3 are actually factor levels representing 1st, 2nd, and 3rd class.
full <- full %>%
  dplyr::mutate(Survived = factor(Survived),
                Pclass = factor(Pclass, ordered = TRUE),
                Name = factor(Name),
                Sex = factor(Sex),
                Ticket = factor(Ticket),
                Cabin = factor(Cabin),
                Embarked = factor(Embarked))
3. EDA : Exploratory data analysis
EDA is the process of exploring and understanding raw data, such as how the data is structured and whether it contains missing values or outliers.
We will use various functions and visualizations here.
3.1 Data confirmation using numerical values
First of all, let’s check the data through the output of various functions such as head() and summary().
3.1.1 head()
head(full, 10)
PassengerId Survived Pclass
1 1 0 3
2 2 1 1
3 3 1 3
4 4 1 1
5 5 0 3
6 6 0 3
7 7 0 1
8 8 0 3
9 9 1 3
10 10 1 2
Name Sex Age SibSp Parch
1 Braund, Mr. Owen Harris male 22 1 0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
3 Heikkinen, Miss. Laina female 26 0 0
4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
5 Allen, Mr. William Henry male 35 0 0
6 Moran, Mr. James male NA 0 0
7 McCarthy, Mr. Timothy J male 54 0 0
8 Palsson, Master. Gosta Leonard male 2 3 1
9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2
10 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0
Ticket Fare Cabin Embarked
1 A/5 21171 7.2500 S
2 PC 17599 71.2833 C85 C
3 STON/O2. 3101282 7.9250 S
4 113803 53.1000 C123 S
5 373450 8.0500 S
6 330877 8.4583 Q
7 17463 51.8625 E46 S
8 349909 21.0750 S
9 347742 11.1333 S
10 237736 30.0708 C
Looking at the result of head(), we can see that there is a missing value (NA) in Age.
Is Age the only variable with missing values in the entire data set?
For the answer, please refer to 3.2 Missing values.
3.1.2 str()
str(full)
'data.frame': 1309 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 156 287 531 430 23 826 775 922 613 855 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 929 levels "110152","110413",..: 721 817 915 66 650 374 110 542 478 175 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 187 levels "","A10","A11",..: 1 108 1 72 1 1 165 1 1 1 ...
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
After combining the train and test data, the total number of observations (records, rows) is 1309 (train: 891, test: 418), and the number of variables (columns, features) is 12.
The output also shows the type of each variable and, for factor variables, how many categories there are.
It also shows that missing data is not limited to Age, the only variable that showed NA in head(): Cabin and Embarked, for example, contain an empty-string category ("").
3.1.3 summary()
summary(full)
PassengerId Survived Pclass Name
Min. : 1 0 :549 1:323 Connolly, Miss. Kate : 2
1st Qu.: 328 1 :342 2:277 Kelly, Mr. James : 2
Median : 655 NA's:418 3:709 Abbing, Mr. Anthony : 1
Mean : 655 Abbott, Master. Eugene Joseph : 1
3rd Qu.: 982 Abbott, Mr. Rossmore Edward : 1
Max. :1309 Abbott, Mrs. Stanton (Rosa Hunt): 1
(Other) :1301
Sex Age SibSp Parch Ticket
female:466 Min. : 0.17 Min. :0.0000 Min. :0.000 CA. 2343: 11
male :843 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000 1601 : 8
Median :28.00 Median :0.0000 Median :0.000 CA 2144 : 8
Mean :29.88 Mean :0.4989 Mean :0.385 3101295 : 7
3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000 347077 : 7
Max. :80.00 Max. :8.0000 Max. :9.000 347082 : 7
NA's :263 (Other) :1261
Fare Cabin Embarked
Min. : 0.000 :1014 : 2
1st Qu.: 7.896 C23 C25 C27 : 6 C:270
Median : 14.454 B57 B59 B63 B66: 5 Q:123
Mean : 33.295 G6 : 5 S:914
3rd Qu.: 31.275 B96 B98 : 4
Max. :512.329 C22 C26 : 4
NA's :1 (Other) : 271
summary() provides a lot of information about the data.
The representative values of quantitative variables (Integer, Numeric), the number of categories of categorical (Factor) variables, and the number of observations belonging to each category are all shown as numerical values.
Here are the points to check before moving on:
Survived: This is the target variable for this competition; the 418 missing values come from the test data.
Pclass: There are three categories (1st, 2nd, and 3rd class), and 3rd class has the most passengers.
Name: Some passengers share the same or similar names, which suggests that some passengers traveled alone while others traveled with their families.
Sex: There are almost twice as many males as females.
Age: It ranges from 0.17 to 80 years. The minimum of 0.17 should be checked, since it may be an outlier caused by mistyping 17. There are 263 missing values.
SibSp: It ranges from 0 to 8, and the 3rd quartile is 1, so at least 75% of passengers boarded with at most one sibling or spouse.
Parch: It ranges from 0 to 9, but the 3rd quartile is 0, so at least 75% of passengers boarded without parents or children.
Both SibSp and Parch describe family relationships. Although they do not tell us who was on board, they give the total number of family members traveling together, and from that we will create a categorical derived variable called FamilySized that represents the size of the family.
Ticket: Looking at the results of 3.1.2 str() and summary(), some passengers have exactly the same ticket, some have tickets that overlap only partially, and some have completely different tickets. We plan to use this to create a derived variable called ticket.size.
Fare: It ranges from 0 to 512, with 1 missing value. Note the large gap between the 3rd quartile (31.275) and the maximum (512).
Cabin: It has the most missing entries (1014 blank values) of all 12 features. It represents the area of the ship, but if there is no way to use it, it should probably be discarded.
Embarked: It consists of 3 categories, S is the most frequent, and there are 2 missing values.
When performing a basic exploration of the data, compare the outputs of several functions, not only summary() and str().
3.2 Missing values
This is the process of checking which of the variables mentioned above have missing values and how many.
I’m going to check it numerically and visually at the same time using the dplyr, ggplot2, and VIM packages.
You don’t have to use all the code I’ve run; as you read, use only the parts you need or like.
3.2.1 VIM package
VIM::aggr(full, prop = FALSE, combined = TRUE, numbers = TRUE,
          sortVars = TRUE, sortCombs = TRUE)
Variables sorted by number of missings:
Variable Count
Survived 418
Age 263
Fare 1
PassengerId 0
Pclass 0
Name 0
Sex 0
SibSp 0
Parch 0
Ticket 0
Cabin 0
Embarked 0
3.2.2 tidyverse packages
Besides checking all missing values at once with the VIM package, you can also check them with packages from the tidyverse.
First, find the proportion of missing values for each variable with dplyr.
full %>%
  # mean(is.na(x)) is the share of NA values in x, i.e. sum(is.na(x)) / n()
  dplyr::summarize(dplyr::across(dplyr::everything(), ~ mean(is.na(.x))))
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
1 0 0.3193277 0 0 0 0.2009167 0 0 0
Fare Cabin Embarked
1 0.0007639419 0 0
The proportions of missing values can be read from the numbers above, but they can also be checked visually.
Please see the two bar plots below.
# Calculate the missing value ratio of each feature -> a data frame with 1 row and 12 columns.
missing_values <- full %>%
  dplyr::summarize(dplyr::across(dplyr::everything(), ~ mean(is.na(.x))))
missing_values %>% head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
1 0 0.3193277 0 0 0 0.2009167 0 0 0
Fare Cabin Embarked
1 0.0007639419 0 0
# Reshape the missing_values obtained above into a 12 x 2 data frame
missing_values <- tidyr::gather(missing_values,
                                key = "feature", value = "missing_pct")
missing_values %>% head(12)
feature missing_pct
1 PassengerId 0.0000000000
2 Survived 0.3193277311
3 Pclass 0.0000000000
4 Name 0.0000000000
5 Sex 0.0000000000
6 Age 0.2009167303
7 SibSp 0.0000000000
8 Parch 0.0000000000
9 Ticket 0.0000000000
10 Fare 0.0007639419
11 Cabin 0.0000000000
12 Embarked 0.0000000000
# Visualization with missing_values
missing_values %>%
  # Aesthetic setting: sort in descending order of missing_pct
  ggplot(aes(x = reorder(feature, missing_pct), y = missing_pct)) +
  # Bar plot
  geom_bar(stat = "identity", fill = "red") +
  # Title generation
  ggtitle("Rate of missing values in each feature") +
  # Title detail setting
  theme(plot.title = element_text(face = "bold", # font face
                                  hjust = 0.5, # horizontal position = 0.5 (centered)
                                  size = 15, color = "darkblue")) +
  # x, y axis label setting
  labs(x = "Feature names", y = "Rate") +
  # Swap the plot's x and y axes
  coord_flip()
The bar graph above shows the percentage of missing values for all features.
However, what we actually want to know is which variables have missing values and how many.
Therefore, after calculating the percentage of missing values with the purrr package, I kept only the variables with at least one missing value and visualized them.
# Calculate the percentage of missing values for each variable
miss_pct <- purrr::map_dbl(full, function(x){round((sum(is.na(x))/length(x)) * 100, 1) })
# Select only the variables whose missing percentage is greater than 0%
miss_pct <- miss_pct[miss_pct > 0]
# Create a data frame
data.frame(miss = miss_pct, var = names(miss_pct), row.names = NULL) %>%
  # Aesthetic setting: sort in descending order of miss
  ggplot(aes(x = reorder(var, miss), y = miss)) +
  # Bar plot
  geom_bar(stat = 'identity', fill = 'red') +
  # Plot title setting
  ggtitle("Rate of missing values") +
  # Title detail setting
  theme(plot.title = element_text(face = "bold", # font face
                                  hjust = 0.5, # horizontal position = 0.5 (centered)
                                  size = 15, color = "darkblue")) +
  # x, y axis label setting
  labs(x = 'Feature names', y = 'Rate of missing values') +
  # Swap the plot's x and y axes
  coord_flip()
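In total, only 4 of the 12 variables have missing values (excluding Survived, whose missing values come from the test data). In order of the number of missing values, they are Cabin, Age, Embarked, and Fare. Note that the outputs above report 0 NA values for Cabin and Embarked because their missing entries are stored as empty strings ("") rather than NA; summary() shows 1014 blank values in Cabin and 2 in Embarked.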
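Because blank entries stored as "" are invisible to is.na(), one option is to convert them to NA before counting. The following is a minimal base R sketch on a made-up data frame; `toy` and the helper `blank_to_na()` are hypothetical names introduced only for this illustration.

```r
# Made-up data: Age uses NA, while Cabin and Embarked use "" for missing entries
toy <- data.frame(Age = c(22, NA, 35, 54),
                  Cabin = c("", "C85", "", "E46"),
                  Embarked = c("S", "", "C", "Q"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: turn empty strings into NA for character/factor columns
blank_to_na <- function(x) {
  if (is.character(x) || is.factor(x)) {
    x <- as.character(x)
    x[!is.na(x) & x == ""] <- NA
  }
  x
}

toy_na <- as.data.frame(lapply(toy, blank_to_na), stringsAsFactors = FALSE)

# Missing counts before and after the conversion
miss_before <- colSums(is.na(toy))
miss_after  <- colSums(is.na(toy_na))
print(miss_before)
print(miss_after)
```

After the conversion, the blank Cabin and Embarked entries are counted together with the NA values of Age.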
Next, we analyze and explore the features through visualization.
3.3 Age
age.p1 <- full %>%
  ggplot(aes(Age)) +
  geom_histogram(breaks = seq(0, 80, by = 1), # interval setting
                 col = "red", # bar border color
                 fill = "green", # bar inner color
                 alpha = .5) + # Bar Transparency = 50%
  # Plot title
  ggtitle("All Titanic passengers age histogram") +
  theme(plot.title = element_text(face = "bold",
                                  hjust = 0.5, # horizontal position = 0.5 (centered)
                                  size = 15, color = "darkblue"))
age.p2 <- full %>%
  # Exclude rows from the test dataset, where Survived is NA
  filter(!is.na(Survived)) %>%
  ggplot(aes(Age, fill = Survived)) +
  geom_density(alpha = .5) +
  ggtitle("Titanic passengers age density plot") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5,
                                  size = 15, color = "darkblue"))
# Display the two graphs created above on one screen
ggarrange(age.p1, age.p2, ncol = 2)
3.4 Pclass
Let’s visualize the number of passengers in each Pclass.
After grouping by Pclass with the dplyr package, a data frame of frequencies by category was created and visualized with ggplot.
full %>%
  # Get Pclass frequencies using dplyr::group_by(), summarize()
  group_by(Pclass) %>%
  summarize(N = n()) %>%
  # Aesthetic setting
  ggplot(aes(Pclass, N)) +
  geom_col() +
  geom_text(aes(label = N),
            size = 5,
            vjust = 1.2,
            color = "#FFFFFF") +
  # Plot title
  ggtitle("Number of each Pclass's passengers") +
  # Title setting
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15)) +
  # x, y axis name change
  labs(x = "Pclass", y = "Count")
It can be seen that the largest number of passengers boarded in 3rd class.
3.5 Fare
This is a visualization of the ‘Fare’ variable, which represents the amount paid by the passenger.
A histogram and a boxplot were used.
# Histogram
Fare.p1 <- full %>%
  ggplot(aes(Fare)) +
  geom_histogram(col = "yellow",
                 fill = "blue",
                 alpha = .5) +
  ggtitle("Histogram of passengers Fare") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15))
# Boxplot
Fare.p2 <- full %>%
  filter(!is.na(Survived)) %>%
  ggplot(aes(Survived, Fare)) +
  # Observations are drawn as gray dots, but overlapping areas are spread out.
  geom_jitter(col = "gray") +
  # Boxplot: 50% transparency
  geom_boxplot(alpha = .5) +
  ggtitle("Boxplot of passengers Fare") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15))
ggarrange(Fare.p1, Fare.p2, ncol = 2)
You can see that the survivors have a higher ‘Fare’ than the deceased passengers, but not by much.
3.6 Sex
Are there differences in survival rates between men and women? See the plots below.
sex.p1 <- full %>%
  dplyr::group_by(Sex) %>%
  summarize(N = n()) %>%
  ggplot(aes(Sex, N)) +
  geom_col() +
  geom_text(aes(label = N), size = 5, vjust = 1.2, color = "#FFFFFF") +
  ggtitle("Bar plot of Sex") +
  labs(x = "Sex", y = "Count")
sex.p2 <- full[1:891, ] %>%
  ggplot(aes(Sex, fill = Survived)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent) +
  ggtitle("Survival Rate by Sex") +
  labs(x = "Sex", y = "Rate")
ggarrange(sex.p1, sex.p2, ncol = 2)
mosaicplot(Survived ~ Sex,
           data = full[1:891, ], col = TRUE,
           main = "Survival rate by passengers gender")
The plots show that the survival rate is higher for female passengers, even though there are far more males than females.
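The proportions drawn by geom_bar(position = "fill") can also be computed as numbers. Below is a small base R sketch on made-up data (the data frame `toy` and its values are assumptions for illustration, not the Titanic figures): table() counts each combination and prop.table() with margin = 1 turns each row into rates.

```r
# Made-up data: 3 females (2 survived) and 4 males (1 survived)
toy <- data.frame(Sex = c("female", "female", "female",
                          "male", "male", "male", "male"),
                  Survived = c(1, 1, 0, 0, 0, 0, 1))

# Count each Sex x Survived combination
counts <- table(toy$Sex, toy$Survived)

# Row-wise proportions: each row (Sex) sums to 1
rates <- prop.table(counts, margin = 1)
print(round(rates, 3))
```

Applied to the first 891 rows of full, the same two calls would give the survival rate per sex that the stacked bar plot displays.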
4. Feature engineering & Data Pre-processing
This is the process of filling in missing values (‘NA’) based on the contents of ‘Chapter 3 EDA’ and creating derived variables at the same time.
4.1 Age -> Age.Group
full <- full %>%
  # Fill missing values (NA) first, using the mean of the non-missing ages
  mutate(Age = ifelse(is.na(Age), mean(full$Age, na.rm = TRUE), Age),
         # Create a categorical derived variable Age.Group based on Age values
         Age.Group = case_when(Age < 13 ~ "Age.0012",
                               Age >= 13 & Age < 18 ~ "Age.1317",
                               Age >= 18 & Age < 60 ~ "Age.1859",
                               Age >= 60 ~ "Age.60inf"),
         # Convert Chr attribute to Factor
         Age.Group = factor(Age.Group))
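As an alternative sketch, the same four age bands can be produced with base R's cut(). This is not the method used in this document, only an equivalent illustration; the vector `age` holds made-up boundary values, and right = FALSE makes each interval include its lower bound and exclude its upper bound, matching the conditions in case_when() above.

```r
# Made-up ages chosen to hit both sides of each boundary
age <- c(2, 12.9, 13, 17, 18, 59, 60, 80)

# Intervals: [-Inf, 13), [13, 18), [18, 60), [60, Inf)
age_group <- cut(age,
                 breaks = c(-Inf, 13, 18, 60, Inf),
                 labels = c("Age.0012", "Age.1317", "Age.1859", "Age.60inf"),
                 right = FALSE)

print(data.frame(age, age_group))
```

cut() returns a factor directly, so the separate factor() conversion step is not needed in this variant.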
4.3 SibSp & Parch -> FamilySized
full <- full %>%
  # First create a derived variable called FamilySize by adding SibSp, Parch and 1 (self)
  mutate(FamilySize = SibSp + Parch + 1,
         # Create a categorical derived variable FamilySized according to the value of FamilySize
         FamilySized = dplyr::case_when(FamilySize == 1 ~ "Single",
                                        FamilySize >= 2 & FamilySize < 5 ~ "Small",
                                        FamilySize >= 5 ~ "Big"),
         # Convert the Chr property FamilySized to a factor
         # Order the levels by family size
         FamilySized = factor(FamilySized, levels = c("Single", "Small", "Big")))
FamilySized was created using SibSp and Parch.
Reducing these two variables to one has the advantage of simplifying the model.
A similar use case is to combine height and weight into a BMI index.
4.4 Name & Sex -> title
The results of ‘Chapter 3.6 Sex’ confirmed that the survival rate of women was higher than that of men.
This raises a question: would it be useful to extract only the part of Name that is related to gender and categorize it?
First, extract the column vector Name from the full data and save it as title.
# First, extract only the Name column vector and store it in the title vector
title <- full$Name
title %>% head(20)
[1] Braund, Mr. Owen Harris
[2] Cumings, Mrs. John Bradley (Florence Briggs Thayer)
[3] Heikkinen, Miss. Laina
[4] Futrelle, Mrs. Jacques Heath (Lily May Peel)
[5] Allen, Mr. William Henry
[6] Moran, Mr. James
[7] McCarthy, Mr. Timothy J
[8] Palsson, Master. Gosta Leonard
[9] Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
[10] Nasser, Mrs. Nicholas (Adele Achem)
[11] Sandstrom, Miss. Marguerite Rut
[12] Bonnell, Miss. Elizabeth
[13] Saundercock, Mr. William Henry
[14] Andersson, Mr. Anders Johan
[15] Vestrom, Miss. Hulda Amanda Adolfina
[16] Hewlett, Mrs. (Mary D Kingcome)
[17] Rice, Master. Eugene
[18] Williams, Mr. Charles Eugene
[19] Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)
[20] Masselmani, Mrs. Fatima
1307 Levels: Abbing, Mr. Anthony ... Zimmerman, Mr. Leo
# Using a regular expression and gsub(), extract only the part of the name that is closely related to gender and save it as the title vector
title <- gsub("^.*, (.*?)\\..*$", "\\1", title)
title %>% head(20)
[1] "Mr" "Mrs" "Miss" "Mrs" "Mr" "Mr" "Mr" "Master"
[9] "Mrs" "Mrs" "Miss" "Miss" "Mr" "Mr" "Miss" "Mrs"
[17] "Master" "Mr" "Mrs" "Mrs"
# Save the title vector back into full as a derived variable named title
full$title <- title
full$title %>% head(20)
[1] "Mr" "Mrs" "Miss" "Mrs" "Mr" "Mr" "Mr" "Master"
[9] "Mrs" "Mrs" "Miss" "Miss" "Mr" "Mr" "Miss" "Mrs"
[17] "Master" "Mr" "Mrs" "Mrs"
gsub("^.*, (.*?)\\..*$", "\\1", title)
In short, this code uses gsub() to search for a pattern in each title string and replace it. The pattern is the regular expression "^.*, (.*?)\\..*$": everything up to and including a comma and a space, then a captured group that runs to the first period, then the rest of the string. The replacement "\\1" keeps only the captured group. If you want to understand more about regular expressions, please see my blog post: What are Regular Expressions and How to Use Them in R
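To see the pattern at work in isolation, here is a small sketch that applies the same gsub() call to three names copied from the Name column shown earlier; the vector names `names_sample` and `titles_sample` are introduced only for this illustration.

```r
# Three names copied from the Name column shown earlier
names_sample <- c("Braund, Mr. Owen Harris",
                  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
                  "Palsson, Master. Gosta Leonard")

# "^.*, "  : everything up to and including the comma and the space
# "(.*?)"  : capture group 1, as few characters as possible
# "\\..*$" : a literal period followed by the rest of the string
# "\\1"    : replace the whole match with capture group 1
titles_sample <- gsub("^.*, (.*?)\\..*$", "\\1", names_sample)
print(titles_sample)
```

The result for these three names is "Mr", "Mrs", and "Master", matching the corresponding entries in the output above.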
Then check what the unique titles are.
unique(full$title)
[1] "Mr" "Mrs" "Miss" "Master" "Don"
[6] "Rev" "Dr" "Mme" "Ms" "Major"
[11] "Lady" "Sir" "Mlle" "Col" "Capt"
[16] "the Countess" "Jonkheer" "Dona"
You can see that there are 18 categories in total.
If you use this derived variable title as it is, the complexity of the model (especially a tree-based model) increases considerably, so you need to reduce the number of categories.
Before that, let’s check the frequency and proportion of each category using the descr package.
# Check frequency, ratio by category
descr::CrossTable(full$title)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
| Capt | Col | Don | Dona | Dr |
|--------------|--------------|--------------|--------------|--------------|
| 1 | 4 | 1 | 1 | 8 |
| 0.001 | 0.003 | 0.001 | 0.001 | 0.006 |
|--------------|--------------|--------------|--------------|--------------|
| Jonkheer | Lady | Major | Master | Miss |
|--------------|--------------|--------------|--------------|--------------|
| 1 | 1 | 2 | 61 | 260 |
| 0.001 | 0.001 | 0.002 | 0.047 | 0.199 |
|--------------|--------------|--------------|--------------|--------------|
| Mlle | Mme | Mr | Mrs | Ms |
|--------------|--------------|--------------|--------------|--------------|
| 2 | 1 | 757 | 197 | 2 |
| 0.002 | 0.001 | 0.578 | 0.150 | 0.002 |
|--------------|--------------|--------------|--------------|--------------|
| Rev | Sir | the Countess |
|--------------|--------------|--------------|
| 8 | 1 | 1 |
| 0.006 | 0.001 | 0.001 |
|--------------|--------------|--------------|
The frequencies and proportions of the 18 categories are very different.
So let’s narrow these down to a total of five categories.
# Simplify into 5 categories
full <- full %>%
  # If you use "==" instead of "%in%", it won't work as intended because of the recycling rule.
  mutate(title = ifelse(title %in% c("Mlle", "Ms", "Lady", "Dona"), "Miss", title),
         title = ifelse(title == "Mme", "Mrs", title),
         title = ifelse(title %in% c("Capt", "Col", "Major", "Dr", "Rev", "Don",
                                     "Sir", "the Countess", "Jonkheer"), "Officer", title),
         title = factor(title))
# After creating the derived variable, check the frequency and ratio for each category
descr::CrossTable(full$title)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
| Master | Miss | Mr | Mrs | Officer |
|---------|---------|---------|---------|---------|
| 61 | 266 | 757 | 198 | 27 |
| 0.047 | 0.203 | 0.578 | 0.151 | 0.021 |
|---------|---------|---------|---------|---------|
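The code comment about using %in% instead of == can be illustrated with a short base R sketch. The vectors `title_toy` and `targets` are made-up examples: == compares element by element and recycles the shorter vector, whereas %in% asks, for each element, whether it appears anywhere in the target set.

```r
# Made-up titles and the set of values we want to match
title_toy <- c("Ms", "Mlle", "Lady", "Mr", "Ms", "Lady")
targets   <- c("Mlle", "Ms", "Lady")

# "==" recycles targets: the comparison is position by position against
# "Mlle", "Ms", "Lady", "Mlle", "Ms", "Lady"
eq_result <- title_toy == targets

# "%in%" checks set membership for every element
in_result <- title_toy %in% targets

print(eq_result)
print(in_result)
```

With ==, "Ms" in the first position is missed because it is compared only with "Mlle"; %in% finds every matching title regardless of position.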
4.5 Ticket -> ticket.size
As we saw in Chapter 3.1.3 summary(), the number of passengers (train and test together) is 1309. However, not every passenger has a different ticket.
See the results of summary() and unique() below.
# We used length() to get only the number of unique categories.
length(unique(full$Ticket))
[1] 929
# Printing all of them was too messy, so only 10 were printed.
head(summary(full$Ticket), 10)
CA. 2343 1601 CA 2144 3101295 347077 347082
11 8 8 7 7 7
PC 17608 S.O.C. 14879 113781 19950
7 7 6 6
Why are there only 929 unique tickets when the feature has no missing values?
The ticket CA. 2343 alone is shared by 11 passengers.
Let’s see who those passengers are.
full %>%
  # Filter only the 11 passengers with matching tickets
  filter(Ticket == "CA. 2343") %>%
  # We don't need to check all variables, so we only look at the variables below.
  select(Pclass, Name, Age, FamilySized)
Pclass Name Age FamilySized
1 3 Sage, Master. Thomas Henry 29.88114 Big
2 3 Sage, Miss. Constance Gladys 29.88114 Big
3 3 Sage, Mr. Frederick 29.88114 Big
4 3 Sage, Mr. George John Jr 29.88114 Big
5 3 Sage, Miss. Stella Anna 29.88114 Big
6 3 Sage, Mr. Douglas Bullen 29.88114 Big
7 3 Sage, Miss. Dorothy Edith "Dolly" 29.88114 Big
8 3 Sage, Miss. Ada 29.88114 Big
9 3 Sage, Mr. John George 29.88114 Big
10 3 Sage, Master. William Henry 14.50000 Big
11 3 Sage, Mrs. John (Annie Bullen) 29.88114 Big
You can see that the 11 passengers above are all members of the same family.
While some passengers have exactly the same ticket, others have tickets that match only partially.
Create a derived variable ticket.unique that holds the number of passengers who share each ticket.
Then create a derived variable ticket.size with 3 categories based on ticket.unique.
# First, initialize ticket.unique with all zeros
ticket.unique <- rep(0, nrow(full))
# Extract only the unique values of the Ticket feature and store them in the tickets vector
tickets <- unique(full$Ticket)
# Using nested loops, find the passengers who share each ticket and store the number of passengers in that group.
for (i in seq_along(tickets)) {
  current.ticket <- tickets[i]
  party.indexes <- which(full$Ticket == current.ticket)
  # Nested for loop
  for (k in seq_along(party.indexes)) {
    ticket.unique[party.indexes[k]] <- length(party.indexes)
  }
}
# Save ticket.unique calculated above as a derived variable
full$ticket.unique <- ticket.unique
# Create ticket.size variable by dividing it into three categories according to ticket.unique
full <- full %>%
  mutate(ticket.size = case_when(ticket.unique == 1 ~ "Single",
                                 ticket.unique < 5 & ticket.unique >= 2 ~ "Small",
                                 ticket.unique >= 5 ~ "Big"),
         ticket.size = factor(ticket.size,
                              levels = c("Single", "Small", "Big")))
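The nested loop counts how many passengers share each ticket. As an alternative sketch (not the code used in this document), the same counts can be obtained without loops by indexing a frequency table with the ticket values. The vector `tickets_toy` is made up for illustration, and cut() is used here as an assumed shortcut for the three size categories.

```r
# Made-up ticket numbers: "CA. 2343" appears 3 times, "1601" twice, "113803" once
tickets_toy <- c("CA. 2343", "1601", "CA. 2343", "113803", "1601", "CA. 2343")

# table() counts each ticket; indexing by name returns the count for every row
ticket_count <- as.vector(table(tickets_toy)[as.character(tickets_toy)])

# 1 -> Single, 2 to 4 -> Small, 5 or more -> Big
ticket_size <- cut(ticket_count,
                   breaks = c(0, 1, 4, Inf),
                   labels = c("Single", "Small", "Big"))

print(data.frame(tickets_toy, ticket_count, ticket_size))
```

On the full data this avoids looping over 929 tickets, although for a data set of this size the difference in run time is negligible.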
4.6 Embarked
This feature has two missing values. For Embarked, replace them with S, the most frequent of the three categories.
full$Embarked <- replace(full$Embarked, # Specify Data$feature to replace
                         # Find only missing values (NA, or blank "" as stored in the titanic package data)
                         which(is.na(full$Embarked) | full$Embarked == ""),
                         'S') # Specify the value to replace
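Instead of hard-coding S, the most frequent category can be looked up from the data. The sketch below uses a made-up factor `embarked_toy` in which missing entries are stored as "", as in the str() output earlier; it is an illustration of mode imputation under that assumption, not the code used in this document.

```r
# Made-up Embarked values; "" marks a missing entry
embarked_toy <- factor(c("S", "C", "", "S", "Q", "", "S"))

# Treat both NA and "" as missing
is_missing <- is.na(embarked_toy) | embarked_toy == ""

# Most frequent category among the non-missing values
mode_level <- names(which.max(table(embarked_toy[!is_missing])))

# Fill the missing entries and drop the now-unused "" level
embarked_toy[is_missing] <- mode_level
embarked_toy <- droplevels(embarked_toy)

print(mode_level)
print(table(embarked_toy))
```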
4.7 Fare
For Fare, there was only one missing value.
Based on the histogram seen above (Chapter 3.5 Fare), the missing value is replaced with 0.
full$Fare <- replace(full$Fare, which(is.na(full$Fare)), 0)
At this point, data preprocessing is complete.
The next step is to explore the derived variables created so far and select the variables to be used for building the model.
In other words, feature selection.
5. Relationship to target feature Survived & Feature selection
Since the purpose here is to see how strongly each variable is related to survival, we do not use the entire full data but only the train data set, for which survival is known.
Also note that some plots used above are repeated here as they are.
5.0 Data set split
First, use the code below to split the preprocessed full data into train and test.
# Before feature selection, select all variables first.
train <- full[1:891, ]
test <- full[892:1309, ]
5.1 Pclass
train %>%
  ggplot(aes(Pclass, fill = Survived)) +
geom_bar(position = "fill") +
# Set plot theme: Converts to a more vivid color.
scale_fill_brewer(palette = "Set1") +
# Y axis setting
scale_y_continuous(labels = percent) +
# Set x, y axis names and plot main title, sub title
labs(x = "Pclass", y = "Rate",
title = "Bar plot", subtitle = "How many people survived in each Pclass?")
5.2 Sex
Same as Chapter 3.6 Sex.
mosaicplot(Survived ~ Sex,
data = train, col = TRUE,
main = "Survival rate by passengers gender")
5.3 Embarked
train %>%
  ggplot(aes(Embarked, fill = Survived)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
scale_y_continuous(labels = percent) +
labs(x = "Embarked", y = "Rate",
title = "Bar plot", subtitle = "How many people survived in each Embarked?")
5.4 FamilySized
train %>%
  ggplot(aes(FamilySized, fill = Survived)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
scale_y_continuous(labels = percent) +
labs(x = "FamilySized", y = "Rate",
title = "Bar plot", subtitle = "Survival rate by FamilySized")
It can be seen that the survival rate differs by family size, and that ‘FamilySized’ and ‘Survived’ have a non-linear relationship.
5.5 Age.Group
train %>%
  ggplot(aes(Age.Group, fill = Survived)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
scale_y_continuous(labels = percent) +
labs(x = "Age group", y = "Rate",
title = "Bar plot", subtitle = "Survival rate by Age group")
5.6 title
train %>%
  ggplot(aes(title, fill = Survived)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
scale_y_continuous(labels = percent) +
labs(x = "title", y = "Rate",
title = "Bar plot", subtitle = "Survival rate by passengers title")
5.7 ticket.size
train %>%
  ggplot(aes(ticket.size, fill = Survived)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set1") +
scale_y_continuous(labels = percent) +
labs(x = "ticket.size", y = "Rate",
title = "Bar plot", subtitle = "Survival rate by ticket.size")
5.8 Description of actual used features
Now that all the derived variables created so far have been found to be useful, select and save only the variables you will actually use.
The table below is a brief description of the actual selected variables.
variable name | Type | Description |
---|---|---|
Survived | factor | Target feature, survival == 1, death == 0 |
Sex | factor | gender, male or female |
Pclass | factor | Cabin Class, First Class (1), Second Class (2), Third Class (3) |
Embarked | factor | Port of embarkation, Southampton (S), Cherbourg (C), Queenstown (Q) |
FamilySized | factor | Family size, a derived variable created using SibSp and Parch , with 3 categories |
Age.Group | factor | Age group, a derived variable created using Age , with 4 categories |
title | factor | A part of the name, a derived variable made using Name , and 5 categories |
ticket.size | factor | Size of the group sharing the same ticket, a derived variable created using Ticket , with 3 categories |
# Excluding the ID number, select and save the 7 input variables and 1 target variable to actually use
train <- train %>%
  select("Pclass", "Sex", "Embarked", "FamilySized",
         "Age.Group", "title", "ticket.size", "Survived")
# For Submit, extract the Id column vector and store it in ID
ID <- test$PassengerId
# Select and save the 7 input variables, excluding Id and Survived
test <- test %>%
  select("Pclass", "Sex", "Embarked", "FamilySized",
         "Age.Group", "title", "ticket.size")
6. Machine learning model generation
Now it is time to create a machine learning model using the train data set.
Strictly speaking, the right approach is to create train, validation, and test data sets first, build several models, and then select the final model through cross validation (CV). These steps are omitted here: we build only a random forest, predict (estimate) the test data, and create the data to submit to the competition.
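For readers who want to try the omitted cross validation step, the following base R sketch shows only how the fold indices could be prepared. The number of folds `k = 5` is an assumption chosen for illustration, `n_rows` is the 891 rows of train, and the model fitting and evaluation inside the loop are left as comments.

```r
set.seed(1901)   # same seed as used in this document, for reproducibility
n_rows <- 891    # number of rows in train
k <- 5           # number of folds (an assumption for this sketch)

# Assign every row to one of k folds, in random order
fold_id <- sample(rep(seq_len(k), length.out = n_rows))
fold_sizes <- table(fold_id)

for (f in seq_len(k)) {
  valid_idx <- which(fold_id == f)   # rows held out for validation
  train_idx <- which(fold_id != f)   # rows used for fitting
  # Fit a model on train[train_idx, ] and evaluate it on train[valid_idx, ] here
  cat("fold", f, ": training rows =", length(train_idx),
      ", validation rows =", length(valid_idx), "\n")
}
```

Averaging the validation accuracy over the k folds would give an estimate of how well a model generalizes before predicting the test data.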
6.1 Random Forest model generation
# Set the seed number for reproducibility.
set.seed(1901)
titanic.rf <- randomForest(Survived ~ ., data = train, importance = TRUE, ntree = 2000)
6.2 Feature importance check
importance(titanic.rf)
0 1 MeanDecreaseAccuracy MeanDecreaseGini
Pclass 47.442449 53.94070 64.73724 36.807804
Sex 54.250630 37.30378 58.66109 57.223102
Embarked -6.328112 38.10930 27.32587 9.632958
FamilySized 32.430898 31.24383 50.13349 18.086894
Age.Group 15.203313 26.72696 29.36321 10.201187
title 48.228450 41.60124 57.02653 73.146999
ticket.size 39.544367 37.80849 59.59915 22.570142
varImpPlot(titanic.rf)
6.3 Predict test data and create submit data
# Prediction
pred.rf <- predict(object = titanic.rf, newdata = test, type = "class")
# Data frame generation (the Kaggle submission file uses the column names PassengerId and Survived)
submit <- data.frame(PassengerId = ID, Survived = pred.rf)
# Write the submit data frame to file : csv is created in the folder designated by setwd().
write.csv(submit, file = './titanic_submit.csv', row.names = FALSE)