Storytelling – Kaggle Titanic Competition

```
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)
```

Now let's take a look and listen to what the data tell us, starting with the training data set:

```
str(train)
'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
```

```
str(test)
'data.frame':	418 obs. of  11 variables:
```

I love the Amelia package and how it lets us see the missing data in our information. We can see that around 20% of the ages are missing (177 of 891 in the training set).

```
summary(train$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.42   20.12   28.00   29.70   38.00   80.00     177
```
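The missing share can also be computed directly rather than read off a plot. This is a minimal base R sketch; the `toy` data frame and its values are made up for illustration, and only the 177 / 891 figure comes from the summary above.

```r
# Toy stand-in for the training data; the values are illustrative only.
toy <- data.frame(
  Age  = c(22, 38, NA, 35, NA),
  Fare = c(7.25, 71.28, 7.92, 53.10, 8.05)
)

# Share of missing values per column.
missing_share <- colMeans(is.na(toy))
print(missing_share)

# On the real training data the same idea gives 177 missing ages out of 891 rows.
age_missing_share <- 177 / 891
print(round(age_missing_share, 3))
```

On the real data, `colMeans(is.na(train))` would give the same per-column view for every variable at once.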

In our test data we can see more or less the same.

```
summary(test$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.17   21.00   27.00   30.27   39.00   76.00      86
```

In this case it is important to fill in the NA values so that the model fits better. You can do it with several methods, for example replacing the missing values with the average of the available values, in this case 29.70. Or you can do something more advanced, like predicting Age with the rpart algorithm:

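```
library(rpart)
summary(train$Age)
# The model formula was lost in extraction; the predictors below are an assumption.
modelage <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked,
                  data=train[!is.na(train$Age),], method="anova")
train$Age[is.na(train$Age)] <- predict(modelage, train[is.na(train$Age),])
```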
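For comparison, the simpler option mentioned above (filling the gaps with the mean of the observed values) can be sketched in a few lines of base R. The `age` vector here is illustrative, not taken from the competition data.

```r
# Illustrative ages with two missing values.
age <- c(22, 38, NA, 35, NA, 54)

# Mean of the observed ages only.
age_mean <- mean(age, na.rm = TRUE)

# Replace each NA with that mean and leave the observed values untouched.
age_filled <- ifelse(is.na(age), age_mean, age)
print(age_filled)
```

Mean imputation is quick, but it gives every missing passenger the same age, which is why a model-based fill such as rpart can be worth the extra step.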

Another interesting way to look at the data and the correlation between the variables is the corrgram package. We can see interesting information, for example that Fare is strongly correlated with Pclass, so the fare gives you an idea of where a passenger was located on the ship. Another clear relation is between Pclass and Survived: we can imagine that people in a higher class had a better chance of surviving. It is curious that Age is not strongly related to Survived, given that women and children were the first to leave the Titanic.
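The numbers behind a correlogram are just a correlation matrix, which base R can produce with `cor()`. The sketch below uses synthetic data (the `pclass` and `fare` vectors are simulated, not the Kaggle values) only to show the mechanics and the sign of a Pclass/Fare relationship.

```r
set.seed(1)
n <- 200

# Synthetic data: a higher class number (3rd class) gets a cheaper fare.
pclass <- sample(1:3, n, replace = TRUE)
fare <- 90 / pclass + rnorm(n, sd = 5)
toy <- data.frame(Pclass = pclass, Fare = fare)

# Pairwise correlations, ignoring rows with missing values.
cor_matrix <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```

The correlation is negative because class 1 is the most expensive. If the corrgram package is installed, `corrgram::corrgram(toy)` would draw the same matrix as a plot.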

```
prop.table(table(train$Survived))

        0         1
0.6161616 0.3838384
```

38% of the people survived. If we look at survival together with Pclass:

```
prop.table(table(train$Survived,train$Pclass))

             1          2          3
  0 0.08978676 0.10886644 0.41750842
  1 0.15263749 0.09764310 0.13355780
```
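Note that the table above gives shares of all 891 passengers, not survival rates within each class. To get the rate within a class, `prop.table()` takes a `margin` argument. In this sketch the counts are recovered by multiplying the proportions above by 891, so treat them as derived rather than read from the data.

```r
# Counts recovered from the proportions above (each proportion * 891).
counts <- matrix(c(80, 136, 97, 87, 372, 119), nrow = 2,
                 dimnames = list(Survived = c("0", "1"),
                                 Pclass = c("1", "2", "3")))

# Share of all passengers (same as the table above).
joint <- prop.table(counts)

# Share within each class: every column sums to 1.
by_class <- prop.table(counts, margin = 2)
print(round(by_class, 3))
```

Read this way, roughly 63% of first class survived while about 76% of third class died.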

Third-class passengers who died make up 41% of all passengers (third class was 55% of the passengers), while first-class survivors make up 15%.

You can find a lot of information inside the data, and if you try to understand the situation you can imagine yourself living through the tragedy and working out what happened. The posts by Trevor and Wehrley help to get a deeper understanding of the data and the predictions. In this post I will show you the analysis of two important algorithms, Random Forest and Boosting.

First of all, you can find a good explanation in the book An Introduction to Statistical Learning with Applications in R. It is free, and it gives you the theory plus R examples to better understand how the methods work. If you look at the forum of the Titanic competition, a lot of people try to solve it with RF.

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. This reduces the variance when we average the trees. We build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random selection of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors. A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p, that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors. If we try this algorithm with our data:

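```
library(randomForest)
set.seed(415)
# The model formula was lost in extraction; this one is an assumption
# (Title is the engineered variable discussed below).
fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title,
                    data=train, importance=TRUE, ntree=2000)
# Variable importance
varImpPlot(fit)
```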
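The split-candidate sampling that makes a random forest different from plain bagging can be sketched on its own. The predictor names below are an illustrative list, not necessarily the exact set used in the model; the point is only that each split draws a fresh subset of size m ≈ √p.

```r
# Illustrative predictor list (p = 9).
predictors <- c("Pclass", "Sex", "Age", "SibSp", "Parch",
                "Fare", "Embarked", "Title", "FamilySize")
p <- length(predictors)

# Typical choice for classification: m is about the square root of p.
m <- floor(sqrt(p))

set.seed(415)
# A fresh random subset of candidates is drawn for every split.
candidates_split1 <- sample(predictors, m)
candidates_split2 <- sample(predictors, m)
print(candidates_split1)
print(candidates_split2)
```

With 9 predictors this gives m = 3, which matches the `mtry=3` setting used with `cforest` further down.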

We can see the importance of the Title variable. I built it from the Name column to capture each passenger's title, which turns out to be significant. I ran RF with the number of trees that worked best; if you increase the number of trees further, the model does not improve the results.
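One way to build a Title variable like the one mentioned above is a regular expression over the Name column. This is a sketch of the idea, not necessarily the exact code used for the model; `extract_title` is a hypothetical helper, and the example names are for illustration.

```r
# Hypothetical helper: take the text between the comma and the first period.
extract_title <- function(name) {
  sub("^.*, *([^.]+)\\..*$", "\\1", name)
}

names_example <- c("Abbing, Mr. Anthony",
                   "Braund, Mr. Owen Harris",
                   "Heikkinen, Miss. Laina")

titles <- extract_title(names_example)
print(titles)
```

On the real data you would apply it as `train$Title <- extract_title(as.character(train$Name))` and then convert the result to a factor.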

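`Prediction <- predict(fit, test); submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction); write.csv(submit, file = "2000treeforest.csv", row.names = FALSE)  # right-hand sides of the two assignments are assumed; the originals were lost`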

If you submit this prediction you obtain 0.77512, which puts you at around position 1200 out of 1855 competitors.

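```
library(party)
set.seed(415)
fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title,  # formula assumed; original lost
               data = train, controls=cforest_unbiased(ntree=2000, mtry=3))
Prediction <- predict(fit, test, OOB=TRUE, type = "response")  # arguments assumed; original lost
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)  # assumed
write.csv(submit, file = "2000treesmtry3forest.csv", row.names = FALSE)
```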

If we make a submission we obtain 0.81340, which reaches as high as position 72 (although people with this same result go down to position 200, more or less).

This implementation of the random forest (and bagging) algorithm differs from the reference implementation in `randomForest` with respect to the base learners used and the aggregation scheme applied. Conditional inference trees, see `ctree`, are fitted to each of the `ntree` perturbed samples of the learning sample. Most of the hyperparameters in `ctree_control` regulate the construction of the conditional inference trees. Hyperparameters you might want to change are:

1. The number of randomly preselected variables `mtry`, which is fixed to the square root of the number of input variables.
2. The number of trees `ntree`. Use more trees if you have more variables.
3. The depth of the trees, regulated by `mincriterion`. Usually unstopped and unpruned trees are used in random forests. To grow large trees, set `mincriterion` to a small value.

The aggregation scheme works by averaging observation weights extracted from each of the `ntree` trees and NOT by averaging predictions directly as in `randomForest`. See Hothorn et al. (2004) and Meinshausen (2006) for a description. Predictions can be computed using `predict`. For observations with zero weights, predictions are computed from the fitted tree when `newdata = NULL`.

Ensembles of conditional inference trees have not yet been extensively tested, so this routine is meant for the expert user only and its current state is rather experimental. However, there are some things available in `cforest` that can’t be done with `randomForest`, for example fitting forests to censored response variables (see Hothorn et al., 2004, 2006a) or to multivariate and ordered responses. Using the rich `partykit` infrastructure allows additional functionality in `cforest`, such as parallel tree growing and probabilistic forecasting (for example via quantile regression forests). Also plotting of single trees from a forest is much easier now.

Unlike `cforest` in `party`, `cforest` in `partykit` is entirely written in R, which makes customisation much easier at the price of longer computing times. However, trees can be grown in parallel with this R-only implementation, which renders speed less of an issue. Note that the default values are different from those used in package `party`; most importantly, the default for `mtry` is now data-dependent. `predict(, type = "node")` replaces the `where` function and `predict(, type = "prob")` the `treeresponse` function.

Moreover, when predictors vary in their scale of measurement or number of categories, variable selection and computation of variable importance is biased in favor of variables with many potential cutpoints in `randomForest`, while in `cforest` unbiased trees and an adequate resampling scheme are used by default.

I have to make some improvements to this RF approach to better fit the model. I will do some testing and let you know about the results.

Now let’s take a look at boosting. Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification. We only discuss boosting for decision trees. Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model. Notably, each tree is built on a bootstrap data set, independent of the other trees. Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees.
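To make the sequential idea concrete, here is a small base R sketch of boosting for regression with one-split trees (stumps): each new stump is fitted to the residuals left by the previous ones and added with a small shrinkage factor. This is a simplified stand-in for what gbm does, not its implementation; the data are simulated and `fit_stump` / `predict_stump` are hypothetical helpers.

```r
# Fit a one-split regression stump on a single numeric predictor.
fit_stump <- function(x, r) {
  v <- sort(unique(x))
  cuts <- (head(v, -1) + tail(v, -1)) / 2          # midpoints between values
  sse <- sapply(cuts, function(cc) {
    left <- r[x <= cc]
    right <- r[x > cc]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

# Simulated data: a noisy sine curve.
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

lambda <- 0.1                  # shrinkage
B <- 200                       # number of trees
f_hat <- rep(0, length(y))     # current ensemble prediction
r <- y                         # current residuals
mse <- numeric(B)

for (b in seq_len(B)) {
  s <- fit_stump(x, r)                     # fit a stump to the residuals
  step <- lambda * predict_stump(s, x)     # shrink its contribution
  f_hat <- f_hat + step                    # add it to the ensemble
  r <- r - step                            # update the residuals
  mse[b] <- mean((y - f_hat)^2)
}

print(round(c(first = mse[1], last = mse[B]), 4))
```

The training error only goes down as trees are added, which is exactly why the number of trees and the shrinkage need to be tuned against data the model has not seen.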

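```
library(caret)
# Right-hand sides marked "assumed" were lost in extraction and are reconstructions.
Grid <- expand.grid(          # expand.grid() assumed
  n.trees = c(2000),
  interaction.depth = c(10) ,
  shrinkage = 0.001)

# Define the parameters for cross validation
fitControl <- trainControl(method = "cv", number = 10)   # assumed: 10-fold CV

# Initialize randomization seed
set.seed(1805)
# Model formula assumed
GBMmodel <- train(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title,
                  data = train,
                  method = "gbm",
                  trControl = fitControl,
                  verbose = TRUE,
                  tuneGrid = Grid,
                  metric = "ROC")
```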

The most interesting parameters are:

```
Grid <- expand.grid(          # expand.grid() assumed; the call was lost in extraction
  n.trees = c(2000),
  interaction.depth = c(10) ,
  shrinkage = 0.001)
```
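To compare several values per parameter instead of a single one, the same grid can hold every combination. This sketch uses base R's `expand.grid()`; the candidate values are illustrative, not recommendations.

```r
# Illustrative candidate values for each tuning parameter.
candidates <- expand.grid(
  n.trees = c(500, 1000, 2000),
  interaction.depth = c(1, 5, 10),
  shrinkage = c(0.001, 0.01, 0.1)
)

# One row per combination: 3 * 3 * 3 = 27 settings to evaluate.
print(nrow(candidates))
print(head(candidates))
```

Each row is one model that cross validation has to fit, so the grid grows quickly as you add values.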

You have to test to find the best value for each parameter. If we check the results with the training data:

```
> GBMpredTest = predict(GBMmodel, newdata = train)
> confusionMatrix(GBMpredTest, train$Survived)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 518  81
         1  31 261

               Accuracy : 0.8743
                 95% CI : (0.8507, 0.8954)
    No Information Rate : 0.6162
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7267
 Mcnemar's Test P-Value : 3.656e-06

            Sensitivity : 0.9435
            Specificity : 0.7632
         Pos Pred Value : 0.8648
         Neg Pred Value : 0.8938
             Prevalence : 0.6162
         Detection Rate : 0.5814
   Detection Prevalence : 0.6723
      Balanced Accuracy : 0.8533

       'Positive' Class : 0
```
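The headline statistics in this output follow directly from the four cells of the confusion matrix. A short base R check, using the counts reported above and treating class 0 as the positive class as caret does here:

```r
# Confusion matrix from the output above (rows = prediction, columns = reference).
cm <- matrix(c(518, 31, 81, 261), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # correct / all
sensitivity <- cm["0", "0"] / sum(cm[, "0"])    # true class 0 predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])    # true class 1 predicted as 1

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```

So the model is much better at recognising passengers who died (class 0) than passengers who survived.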

We have an accuracy of 87% and very good values for the other metrics. If we make the submission we obtain 0.81818, a little better than RF, which places us between positions 35 and 71 (all with the same result). But what happens if we use a larger shrinkage parameter, 0.1? These are the results:

```
> GBMpredTest = predict(GBMmodel, newdata = train)
> confusionMatrix(GBMpredTest, train$Survived)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 543   8
         1   6 334

               Accuracy : 0.9843
                 95% CI : (0.9738, 0.9914)
    No Information Rate : 0.6162
    P-Value [Acc > NIR] :
```

An incredible accuracy, more than 98.4%, and the rest of the metrics are close to 1, so we can conclude that the model fits the training data almost perfectly. But what happens with the test data? If we submit with this model the result is 0.71770. How is that possible? This is an example of overfitting: the model is nearly perfect for the training data, but the test data are a little different, so with this prediction we obtain a worse result.

Now the challenge is to understand how to obtain a score of 1. Maybe the difference lies in cleaning or adapting the data to the story in a better way, or maybe I have to test some improvements in the training parameters to better fit the model. At this point we can see the potential of data science, but this is only the tip of the iceberg.
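The gap between training accuracy and leaderboard score can be spotted before submitting by holding back part of the training data. The sketch below uses simulated data and a 1-nearest-neighbour classifier as a deliberately extreme stand-in (it is not the GBM model from this post): it memorises the training rows perfectly and then does visibly worse on the held-out rows.

```r
set.seed(42)
n <- 400

# Simulated data: the label depends on the first feature plus noise.
x <- matrix(rnorm(n * 2), ncol = 2)
y <- as.integer(x[, 1] + rnorm(n) > 0)

# Hold out 30% of the rows for validation.
idx <- sample(n, 0.7 * n)

# 1-nearest-neighbour: predict the label of the closest training row.
nn1 <- function(train_x, train_y, new_x) {
  apply(new_x, 1, function(p) {
    d <- colSums((t(train_x) - p)^2)   # squared distance to every training row
    train_y[which.min(d)]
  })
}

train_acc <- mean(nn1(x[idx, ], y[idx], x[idx, ]) == y[idx])
valid_acc <- mean(nn1(x[idx, ], y[idx], x[-idx, ]) == y[-idx])

print(c(train = train_acc, validation = valid_acc))
```

A large difference between the two numbers is the warning sign. With caret, the cross-validated accuracy reported by `train()` plays the same role as the validation figure here.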

3 Comments on “Storytelling – Kaggle Titanic Competition”

1. nice way to tell back the history 🙂

2. cgperal says:

I’ve just tested the SVM algorithm with the Titanic competition data.
In the book An Introduction to Statistical Learning with Applications in R http://www-bcf.usc.edu/~gareth/ISL/ , chapter 9, you have the details about how it works and how it does classification.
For training and testing I’ve used the kernlab package http://cran.r-project.org/web/packages/kernlab/index.html with the ksvm function for Support Vector Machines.

After tuning the training model with the tune function I obtain the best results with this configuration:

GBMmodel <- ksvm(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID,
data = train,
cost=1,
type="C-svc",
tol=0.001,
nu=0.1,
epsilon=0.001,
shrinking=TRUE)

But the most significant parameters are mainly kernel and type.

The results obtained in the Kaggle competition are below 0.80, so at the moment it is less effective than GBM.

Can you tell me how to improve these algorithms (RF, GBM and SVM)?

3. duvvurum says:

Thank you very much. I like the way you made this tutorial. It is a perfect combination of methodology and objective.
