Reading the Titanic Dataset
[codesyntax lang=“php”]train <- read.csv("C:/[PATH]/train.csv", header = TRUE)[/codesyntax]
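Since R 4.0.0, read.csv() no longer turns text columns into factors by default, so on a current installation the structure output can show chr columns where this post shows Factor columns. The following is a minimal sketch of that difference, assuming a small temporary CSV (a made-up stand-in for train.csv with three records taken from the head() output in this post); the names csv_path and toy are illustrative only.

```r
# Toy CSV with three records; the temporary file stands in for train.csv.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "survived,pclass,name,sex,age",
  "0,3,\"Braund, Mr. Owen Harris\",male,22",
  "1,3,\"Heikkinen, Miss. Laina\",female,26",
  "0,3,\"Moran, Mr. James\",male,"
), csv_path)

# stringsAsFactors = TRUE reproduces the older default (text columns become factors).
toy <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)

str(toy)          # name and sex are listed as Factor columns
is.na(toy$age)    # the empty age field of the third record is read as NA
```

If the factor representation is not needed, the argument can simply be left out.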
Information about the data structure
[codesyntax lang=“php”]str (train)[/codesyntax]
Result:
'data.frame': 891 obs. of 11 variables:
 $ survived: int 0 1 1 1 0 0 0 0 1 1 ...
 $ pclass  : int 3 1 3 1 3 3 1 3 3 2 ...
 $ name    : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ age     : num 22 38 26 35 35 NA 54 2 27 14 ...
 $ sibsp   : int 1 1 0 1 0 0 0 3 0 1 ...
 $ parch   : int 0 0 0 0 0 0 0 1 2 0 ...
 $ ticket  : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ fare    : num 7.25 71.28 7.92 53.1 8.05 ...
 $ cabin   : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Display the first records including column header
[codesyntax lang=“php”]head (train)[/codesyntax]
Result:
  survived pclass name                                                sex    age sibsp parch ticket              fare cabin embarked
1        0      3 Braund, Mr. Owen Harris                             male    22     1     0 A/5 21171         7.2500       S
2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0 PC 17599         71.2833 C85   C
3        1      3 Heikkinen, Miss. Laina                              female  26     0     0 STON/O2. 3101282  7.9250       S
4        1      1 Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35     1     0 113803           53.1000 C123  S
5        0      3 Allen, Mr. William Henry                            male    35     0     0 373450            8.0500       S
6        0      3 Moran, Mr. James                                    male    NA     0     0 330877            8.4583       Q
Number of rows (records) and columns
[codesyntax lang=“php”]dim(train)
# or
nrow(train)
ncol(train)[/codesyntax]
Result: 891 rows, 11 columns
Number of surviving and deceased passengers
[codesyntax lang=“php”]table(train$survived)[/codesyntax]
Result: 342 survivors, 549 deceased
Visualization: [codesyntax lang=“php”]# table() lists 0 (deceased) before 1 (survived), so the labels follow that order
pie(table(train$survived),
    labels = paste(c("Deceased =", "Survivors ="), table(train$survived), "people"),
    main = "Number of surviving and deceased passengers")[/codesyntax]

Percentage analysis
[codesyntax lang=“php”]prop.table(table (train$survived))[/codesyntax]
Result: 38% survivors, 62% deceased
Overview of passenger gender
[codesyntax lang=“php”]table (train$sex)[/codesyntax]
Result: 314 female passengers, 577 male passengers
Visualization: [codesyntax lang=“php”]barplot(table(train$sex), col = c("pink", "blue"), ylab = "Number of passengers", main = "Number of passengers by gender", legend.text = TRUE, args.legend = list(x = 0.5, y = 600), ylim = c(0, 600))[/codesyntax]

Number of surviving and deceased passengers by gender
[codesyntax lang=“php”]table(train$sex, train$survived)[/codesyntax]
Result: surviving women: 233, deceased women: 81, surviving men: 109, deceased men: 468
Visualization: [codesyntax lang=“php”]barplot(table(train$sex, train$survived), col = c("pink", "blue"), ylab = "Number of passengers", xlab = "0 = deceased | 1 = survived", main = "Overview of surviving and deceased passengers", legend.text = TRUE, args.legend = list(x = 2.4, y = 600), ylim = c(0, 600))[/codesyntax]

Percentage of surviving and deceased passengers by gender (based on total number of passengers)
[codesyntax lang=“php”]prop.table(table(train$sex, train$survived))[/codesyntax]
Result (share of all 891 passengers): surviving women: 26%, deceased women: 9%, surviving men: 12%, deceased men: 53%
Percentage of surviving and deceased passengers (gender-specific)
[codesyntax lang=“php”]prop.table(table(train$sex, train$survived),1)[/codesyntax]
Result: surviving women: 74% of all women, deceased women: 26% of all women; surviving men: 19% of all men, deceased men: 81% of all men
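The two prop.table() calls differ only in the margin argument: without it, every cell is divided by the grand total; with margin 1, every cell is divided by its row total. A minimal sketch of the difference, assuming the counts from the table(train$sex, train$survived) result above are typed in by hand (the matrix name counts is illustrative, so train.csv is not needed to run it):

```r
# Counts copied from the result of table(train$sex, train$survived) in this post.
counts <- matrix(c(81, 468, 233, 109), nrow = 2,
                 dimnames = list(sex = c("female", "male"),
                                 survived = c("0", "1")))

overall <- round(100 * prop.table(counts))      # each cell / all 891 passengers
by_sex  <- round(100 * prop.table(counts, 1))   # each cell / total of its row (sex)

print(overall)
print(by_sex)   # each row now sums to 100
```

Reading by_sex row by row gives the survival rate within each gender, which is usually the more informative figure.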
Number of people per class
[codesyntax lang=“php”]table (train$pclass)[/codesyntax]
Result: first class: 216 people, second class: 184 people, third class: 491 people
Number of survivors and deceased per class (based on total number of passengers)
[codesyntax lang=“php”]table(train$survived, train$pclass)[/codesyntax]
Result: First class: 136 survivors, 80 deceased
Second class: 87 survivors, 97 deceased
Third class: 119 survivors, 372 deceased
Visualization: [codesyntax lang=“php”]dotchart(table(train$survived, train$pclass), pch = 15, col = "blue", gcolor = "red", lcolor = "orange", xlab = "Number of surviving and deceased passengers", ylab = "Surviving and deceased passengers by class", main = "All surviving and deceased passengers by class")[/codesyntax]

Percentage of survivors per class (based on total number of passengers)
[codesyntax lang=“php”]prop.table(table(train$survived, train$pclass))[/codesyntax]
Result: First class: 15% survivors, 9% deceased
Second class: 10% survivors, 11% deceased
Third class: 13% survivors, 42% deceased
Percentage of surviving and deceased passengers from first class
[codesyntax lang=“php”]100/216*136 # 62.96296
100/216*80 # 37.03704[/codesyntax]
Result: 63% of first class passengers survived, 37% died
Result for second class: 47% of second class passengers survived, 53% died
Result for third class: 24% of third class passengers survived, 76% died
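The per-class percentages can also be obtained in one step instead of one manual division per cell. A minimal sketch, assuming the counts from the table(train$survived, train$pclass) result are entered by hand (the names counts and by_class are illustrative); margin 2 divides each cell by its column total, that is, by the size of the class:

```r
# Counts copied from the per-class result in this post (rows: survived, columns: pclass).
counts <- matrix(c(80, 136, 97, 87, 372, 119), nrow = 2,
                 dimnames = list(survived = c("0", "1"),
                                 pclass = c("1", "2", "3")))

colSums(counts)                                  # class sizes: 216, 184, 491
by_class <- round(100 * prop.table(counts, 2))   # each cell / total of its class
print(by_class)
```

On the full data frame, the same idea would be prop.table(table(train$survived, train$pclass), 2).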
Number of surviving women and men from each class
[codesyntax lang=“php”]table(train$sex, train$survived, train$pclass)[/codesyntax]
Result: First class: surviving women: 91, deceased women: 3, surviving men: 45, deceased men: 77
Second class: surviving women: 70, deceased women: 6, surviving men: 17, deceased men: 91
Third class: surviving women: 72, deceased women: 72, surviving men: 47, deceased men: 300
Age of passengers
[codesyntax lang=“php”]library(ggplot2) # qplot() is provided by the ggplot2 package
qplot(age, ylab = "Number", main = "Age of passengers", data = train, geom = "bar")[/codesyntax]

Average age of passengers
[codesyntax lang=“php”]summary(train$age)[/codesyntax]
Result:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.42   20.12   28.00   29.70   38.00   80.00     177
Average age of surviving women from first class
[codesyntax lang=“php”]summary(train[train$survived == 1 & train$sex == "female" & train$pclass == 1, ]$age)[/codesyntax]
Result:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  14.00   23.25   35.00   34.94   44.00   63.00       9
Result for second class:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   2.00   21.75   28.00   28.08   35.25   55.00       2
Result for third class:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.75   13.50   19.00   19.33   26.50   63.00      25
Number of children in first class (assumption: children are under 18)
[codesyntax lang=“php”]sum(table(train[train$pclass == 1 & train$age < 18, ]$age))[/codesyntax]
Result: 12
Result for second class: 23
Result for third class: 78
Number of deceased children from first class
[codesyntax lang=“php”]sum(table(train[train$survived == 0 & train$pclass == 1 & train$age < 18, ]$age))[/codesyntax]
Result: 1
Result for second class: 2
Result for third class: 49
Number of surviving children (girls) from first class
[codesyntax lang=“php”]sum(table(train[train$survived == 1 & train$sex == "female" & train$pclass == 1 & train$age < 18, ]$age))[/codesyntax]
Result: 7
Result for second class: 12
Result for third class: 19
Number of deceased children (girls) from first class
[codesyntax lang=“php”]sum(table(train[train$survived == 0 & train$sex == "female" & train$pclass == 1 & train$age < 18, ]$age))[/codesyntax]
Result: 1
Result for second class: 0
Result for third class: 16
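The child counts above filter on age, and age contains missing values. A comparison with NA yields NA, and an NA in a logical row index adds an all-NA row, so counting the rows of the subset directly would overcount; wrapping the ages in table() works because table() drops NA by default. A minimal sketch of this behaviour on made-up toy data (the data frame toy and the vector is_child are illustrative, not part of the Titanic data):

```r
# Made-up toy data: the second passenger has no recorded age.
toy <- data.frame(pclass = c(1, 1, 1, 3),
                  age    = c(4, NA, 40, 10))

is_child <- toy$pclass == 1 & toy$age < 18   # TRUE, NA, FALSE, FALSE

nrow(toy[is_child, ])                        # counts the NA row as well
sum(table(toy[is_child, ]$age))              # table() drops NA, as in the post
sum(is_child, na.rm = TRUE)                  # same count, without building a subset
nrow(subset(toy, pclass == 1 & age < 18))    # subset() treats NA conditions as FALSE
```

The last three lines agree with each other; only the plain row count is affected by the missing age.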
Information Gain
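Definition: information gain measures how much knowing an attribute reduces the uncertainty (entropy) about the target value. In decision tree learning, the attribute with the highest information gain is chosen for the next split, with the aim of keeping the tree shallow. Source: [Artificial Intelligence: A Modern Approach]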
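To make the definition concrete, the following is a minimal sketch of the calculation in base R, assuming the textbook formula (entropy of the target minus the weighted entropy within each attribute value) with natural logarithms. The helper functions entropy and info_gain are hypothetical and written for this illustration; they are not part of any package, and their result need not match a package's output exactly, since no preprocessing is reproduced here. The counts for sex are the ones from the gender table earlier in this post.

```r
# Shannon entropy (natural log) of a vector of class counts.
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Information gain of an attribute, given a contingency table with
# rows = attribute values and columns = target classes.
info_gain <- function(tab) {
  weights <- rowSums(tab) / sum(tab)
  entropy(colSums(tab)) - sum(weights * apply(tab, 1, entropy))
}

# sex x survived counts from this post.
by_sex <- matrix(c(81, 468, 233, 109), nrow = 2,
                 dimnames = list(sex = c("female", "male"),
                                 survived = c("0", "1")))
ig_sex <- info_gain(by_sex)   # roughly 0.15

# Made-up ID-like attribute: every row has its own value, so each group is pure
# and the gain equals the full entropy of the target.
by_id <- matrix(c(1, 0, 1, 0,
                  0, 1, 0, 1), nrow = 4)
ig_id <- info_gain(by_id)

print(c(sex = ig_sex, id = ig_id, target_entropy = entropy(colSums(by_id))))
```

The second example shows why an attribute with a distinct value per record always reaches the maximum possible gain under this formula, regardless of whether it would help to predict unseen passengers.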
Install and load ‘FSelector’ package
[codesyntax lang=“php”]install.packages("FSelector")
library(FSelector)[/codesyntax]
Calculation of the information gain; the target value is 'survived' (train$survived).
[codesyntax lang=“php”]information.gain(survived ~ ., train)[/codesyntax]
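Result:
         attr_importance
pclass       0.053608163
name         1.609435396
sex          0.145685212
age          0.008687071
sibsp        0.000000000
parch        0.000000000
ticket       1.355042740
fare         0.048528443
cabin        0.318976005
embarked     0.022301869
The attributes with the highest information gain are 'name' and 'ticket'. Both have a very large number of distinct values (891 and 681 levels in the str() output), which by itself pushes the score up, so 'name' is examined more closely in the following steps.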
Determine whether the 'name' column contains duplicates
[codesyntax lang=“php”]anyDuplicated( as.character(train$name) )[/codesyntax]
Result: 0 # No duplicates
Group passengers by title (Mrs., Miss., Mr., Master.)
[codesyntax lang=“php”]titles <- NULL
mrs <- NULL
miss <- NULL
mr <- NULL
master <- NULL
other <- NULL
ages <- NULL
# Returns the title found in a name and appends the name to the matching global vector.
# The checks run in order, so "Mrs." is tested before "Mr.".
parseTitle <- function(name) {
  name <- as.character(name)
  if (length(grep("Mrs.", name)) > 0) {
    mrs <<- c(mrs, name)
    return("Mrs.")
  } else if (length(grep("Miss.", name)) > 0) {
    miss <<- c(miss, name)
    return("Miss.")
  } else if (length(grep("Mr.", name)) > 0) {
    mr <<- c(mr, name)
    return("Mr.")
  } else if (length(grep("Master.", name)) > 0) {
    master <<- c(master, name)
    return("Master.")
  } else {
    other <<- c(other, name)
    return("Other")
  }
}
# Collect the title and the age of every passenger
for (i in 1:nrow(train)) {
  titles <- c(titles, parseTitle(train[i, "name"]))
  ages <- c(ages, train[i, "age"])
}
table(titles)[/codesyntax]
Result:
titles
Master.   Miss.     Mr.    Mrs.   Other
     40     180     518     129      24
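The loop above can also be expressed without row-by-row iteration. The following is a minimal sketch of an alternative, not the method used in this post: it assumes every name has the form "Surname, Title. Given names" and takes the text between the comma and the first period as the title. The vectors names_toy and titles_toy are illustrative; the four names come from the head() output. Because rarer titles would each form their own category instead of falling into "Other", the counts on the full data can differ from the table above.

```r
# Four names taken from the head() output in this post.
names_toy <- c("Braund, Mr. Owen Harris",
               "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
               "Heikkinen, Miss. Laina",
               "Moran, Mr. James")

# Keep only the token between the comma and the first period (period included).
titles_toy <- sub("^[^,]*, *([^.]*\\.).*$", "\\1", names_toy)

print(titles_toy)
table(titles_toy)
```

sub() is vectorised, so the whole column is processed in a single call.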
Number of survivors and deceased by title
[codesyntax lang=“php”]table(titles, train$survived)[/codesyntax]
Result:
titles      0   1
  Master.  17  23
  Miss.    54 126
  Mr.     436  82
  Mrs.     27 102
  Other    15   9
Number of survivors and deceased by title (percentage)
[codesyntax lang=“php”]100/40*17 # 42.5%
100/40*23 # 57.5%[/codesyntax]
Result:
titles        0      1
  Master. 42.5%  57.5%
  Miss.   30.0%  70.0%
  Mr.     84.2%  15.8%
  Mrs.    20.9%  79.1%
  Other   62.5%  37.5%
Titles by class
[codesyntax lang=“php”]table(titles, train$survived, train$pclass)[/codesyntax]
Result: First class
titles      0   1
  Master.   0   3
  Miss.     2  44
  Mr.      70  38
  Mrs.      1  43
  Other     7   8
Second class
titles      0   1
  Master.   0   9
  Miss.     1  31
  Mr.      83   8
  Mrs.      5  38
  Other     8   1
Third class
titles      0   1
  Master.  17  11
  Miss.    51  51
  Mr.     283  36
  Mrs.     21  21
  Other     0   0
Age of passengers by title
[codesyntax lang=“php”]table(titles, ages)[/codesyntax]
Result: The oldest Master. is 12, the oldest Miss. is 63, the oldest Mr. is 80, the oldest Mrs. is 63.
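Reading the maximum age off a full titles-by-ages table means scanning a wide output by eye. A minimal sketch of a more direct route, assuming two parallel vectors like titles and ages from the loop above; here they are replaced by made-up toy values (titles_toy, ages_toy) so the example runs on its own:

```r
# Made-up toy values standing in for the titles and ages vectors.
titles_toy <- c("Mr.", "Master.", "Mr.", "Miss.", "Master.")
ages_toy   <- c(22, 4, NA, 26, 11)

# Maximum age per title; na.rm = TRUE skips passengers without a recorded age.
oldest <- tapply(ages_toy, titles_toy, max, na.rm = TRUE)
print(oldest)
```

Applied to the real vectors, the call would be tapply(ages, titles, max, na.rm = TRUE).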
