Data Science with Python (I)
Posted: 21 May, 2017
Data Science is about making smart decisions on big or complex problems. That is why you need mature tools to manage these situations.
For example, Pandas is an excellent Data Analysis toolkit, and if you want to go deeper into Machine Learning, Scikit-learn is your tool: it offers many algorithms for applications such as Classification, used in image recognition or spam detection.
You can find specialized environments for developing in Python, such as Anaconda, which includes the IPython Notebook, or Python(x,y), a distribution for Windows users hosted on Google Code.
The main free, open source competitor to Python is the R Project. R is the most widely used programming language for Data Science, but Python is growing day by day and is now really close to R. You can find an interesting analysis with pros and cons, and its clear conclusion is that it depends on what you are looking for 🙂
Going back to Python, I must highlight that it is free to use. You will find the list of versions and licenses at https://docs.python.org/2/license.html .
Obviously we need to install a development environment first. There are two lines of Python to choose from, 2.7.x and 3.4.x. At the time of writing, Data Science mainly uses 2.7.x, so if you want to play with Data Science, please choose this option. As an environment I recommend you install Anaconda, which you can download at https://www.continuum.io/downloads .
As I mentioned before, you can also use the Python(x,y) development environment, hosted on Google Code, with the following characteristics:
- rapid prototyping, with IPython shell
- small/big projects developed using Spyder
- scientific purpose: experiment modeling, signal processing, …
Now that we have an IDE, we can start analyzing data with the different libraries. I will start with a short review of Pandas, which aims to be the most powerful and flexible open source data analysis / manipulation tool available in any language.
The Pandas site lists what it does well. Here is an interesting summary:
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes (possible to have multiple labels per tick)
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
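To make a few of these points concrete, here is a minimal sketch of missing data handling, column insertion and deletion, automatic alignment, group by, merging and date range generation. The data, column names and variable names are invented for illustration and do not come from the Pandas site.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for this illustration.
sales = pd.DataFrame({
    "country": ["ES", "ES", "FR", "FR"],
    "units": [10.0, np.nan, 7.0, 3.0],
})

# Missing data: NaN values are skipped by default in aggregations.
total_units = sales["units"].sum()

# Size mutability: insert two columns, then delete one of them.
sales["price"] = [2.0, 2.0, 3.0, 3.0]
sales["revenue"] = sales["units"] * sales["price"]
del sales["price"]

# Automatic alignment: Series are matched by label, not by position.
# The label "x" exists only in `a`, so the result for "x" is NaN.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
aligned = a + b

# Group by: split the rows by country, then sum the units in each group.
units_by_country = sales.groupby("country")["units"].sum()

# Merging: attach a country name to each row (left join on "country").
names = pd.DataFrame({"country": ["ES", "FR"], "name": ["Spain", "France"]})
merged = sales.merge(names, on="country", how="left")

# Time series: generate three consecutive daily dates.
days = pd.date_range("2017-05-21", periods=3, freq="D")

print(total_units)
print(aligned)
print(units_by_country)
print(merged)
print(days)
```

Each step uses only the basic DataFrame and Series API, so you can paste it into an IPython shell and inspect the intermediate objects one by one.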
As an example, here is the code for reading different types of files.
First of all you need to import the library: import pandas as pd
For reading txt files: country_table = pd.read_table("Country.txt")
For reading csv files: titanic = pd.read_csv("Titanic.csv")
To keep only the age column as a one-column DataFrame: X = titanic[["age"]]
For reading excel files: xls = pd.ExcelFile("Excel.xls")
parse_data = xls.parse("Sheet1", index_col=None, na_values=["NA"])
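If you do not have these files at hand, the following sketch shows the same reading calls on small in-memory texts. The contents of the two texts are invented stand-ins for Titanic.csv and Country.txt, and I am assuming Country.txt is tab-delimited. io.StringIO simply lets Pandas read a string as if it were a file.

```python
import io

import pandas as pd

# Hypothetical stand-in for "Titanic.csv"; the rows are invented.
csv_text = "name,age,survived\nAnn,29,1\nBob,NA,0\nCid,41,1\n"

# na_values tells Pandas which strings to treat as missing data (NaN).
titanic = pd.read_csv(io.StringIO(csv_text), na_values=["NA"])

# Double brackets return a one-column DataFrame,
# single brackets return a Series.
X = titanic[["age"]]
ages = titanic["age"]

# Hypothetical stand-in for "Country.txt", assumed to be tab-delimited.
tsv_text = "code\tname\nES\tSpain\nFR\tFrance\n"
country_table = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(X)
print(country_table)
```

With real files you would pass the file path instead of the io.StringIO object.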
I will show you more about Pandas in the next Data Science posts.
That's all, folks, for this first post about Python and Data Science. Next post coming soon.