In one of my previous posts I shared the following image: https://datasciencecgp.files.wordpress.com/2015/01/roadtodatascientist1.png . It shows an interesting roadmap to follow to become a Horizontal Data Scientist.
The next station on that roadmap is a deep look at the heart of inferential statistics: the Central Limit Theorem.
The central limit theorem has an interesting history. The first version of this theorem was postulated by the French-born mathematician Abraham de Moivre who, in a remarkable article published in 1733, used the normal distribution to approximate the distribution of the number of heads resulting from many tosses of a fair coin. This finding was far ahead of its time, and was nearly forgotten until the famous French mathematician Pierre-Simon Laplace rescued it from obscurity in his monumental work Théorie Analytique des Probabilités, which was published in 1812. Laplace expanded De Moivre’s finding by approximating the binomial distribution with the normal distribution. But as with De Moivre, Laplace’s finding received little attention in his own time. It was not until the nineteenth century was at an end that the importance of the central limit theorem was discerned, when, in 1901, Russian mathematician Aleksandr Lyapunov defined it in general terms and proved precisely how it worked mathematically. Nowadays, the central limit theorem is considered to be the unofficial sovereign of probability theory. (Source: Wikipedia)
Turning back to reality, the Central Limit Theorem tells us that, for a reasonably large sample size n, the sampling distribution of the mean (the distribution of the means of all possible samples of size n) is approximated by a Normal curve whose mean is mu, the mean of the population, and whose standard deviation is sigma, the standard deviation of the population, divided by the square root of the sample size n.
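To make that statement concrete, here is a minimal simulation sketch in Python, using only the standard library. The exponential population (mean 1, standard deviation 1), the sample size of 30, the 5000 repetitions, and the helper name sample_means are all illustrative assumptions, not something taken from the original post. The idea is to draw many samples from a clearly non-Normal population and check that their means cluster around mu with a spread close to sigma divided by the square root of n.

```python
import math
import random
import statistics


def sample_means(n, num_samples, rng):
    """Return the means of num_samples independent samples of size n.

    Assumed population (illustrative): exponential with rate 1,
    which is strongly skewed and has mean 1 and standard deviation 1.
    """
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]


mu, sigma = 1.0, 1.0      # population mean and standard deviation
n = 30                    # size of each sample
rng = random.Random(42)   # fixed seed so the run is repeatable
means = sample_means(n, 5000, rng)

# What the Central Limit Theorem predicts for the spread of the sample means.
predicted_sd = sigma / math.sqrt(n)

observed_mean = statistics.fmean(means)
observed_sd = statistics.stdev(means)

# For a Normal curve, roughly 68% of values fall within one standard deviation.
within_one_sd = sum(abs(m - mu) <= predicted_sd for m in means) / len(means)

print(f"mean of sample means: {observed_mean:.4f} (population mean {mu})")
print(f"sd of sample means:   {observed_sd:.4f} (predicted {predicted_sd:.4f})")
print(f"share within one sd:  {within_one_sd:.3f} (Normal curve: about 0.68)")
```

If the approximation holds, the three printed values should sit close to 1, to 1 divided by the square root of 30, and to 0.68. Rerunning with a much smaller n, such as 2, should show the skew of the population leaking back into the distribution of the means.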
As I said in the first post, I would like to enter some competitions as my level increases. Now it is time to start my Kaggle competitions. Titanic: Machine Learning from Disaster is one of the most helpful competitions for starting to learn about Data Science. On its Kaggle page you will find a lot of help, and you can learn how to get started with different tools: Python, Excel and R. In this blog I will do some storytelling using R, specifically with Random Forest and Boosting. In the near future I will post, in comments or in other posts, more algorithms such as Neural Networks or SVM to solve this or other competitions. But what do we know about the Titanic disaster?