HOWTO

More Python packages for Data Science

May 15, 2015

Table of contents

Analytics
Graph analytics
Time series
Natural Language Processing
Bayesian Inference
Web scraping
Data Visualization

There are a tremendous number of Python packages (https://pypi.python.org/pypi/?), devoted to all sorts of applications: from web development to data analysis to pretty much everything. We list here the packages we have found essential for data science.

The Basic Stack

There are six fundamental packages for data science in Python:

numpy
scipy
matplotlib
IPython
pandas
scikit-learn
If you have already done data science in Python, you probably already know them. If you are new to Python and need a quick introduction to these packages, you can check out our Getting started with Python (getting-started-with-python.html) post. We also list there tutorials and useful resources to help you get started.
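
To see how these packages work together, here is a minimal sketch (the file name data.csv and the columns x and y are hypothetical) that loads a dataset with pandas, fits a scikit-learn model, and plots the result with matplotlib and numpy:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# load a dataset into a dataframe (hypothetical file and column names)
df = pd.read_csv("data.csv")
X = df[["x"]].values
y = df["y"].values

# fit a simple linear regression model
model = LinearRegression().fit(X, y)

# plot the data and the fitted line
xs = np.sort(X, axis=0)
plt.scatter(X, y)
plt.plot(xs, model.predict(xs), color="red")
plt.show()
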
Analytics: Networkx, Nltk, Statsmodels, PyMC

Scikit-learn (http://scikit-learn.org/stable/index.html) is the main Python package for machine learning. It contains many unsupervised and supervised learning algorithms for discovering patterns in your data or building predictive models.
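
As a minimal sketch of this API (using the iris dataset bundled with scikit-learn; the particular models are arbitrary choices), the snippet below fits one supervised and one unsupervised model with the same fit/predict pattern:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target

# supervised learning: predict the species from the measurements
clf = RandomForestClassifier(n_estimators=50).fit(X, y)
predictions = clf.predict(X)

# unsupervised learning: discover clusters without using the labels
km = KMeans(n_clusters=3).fit(X)
clusters = km.labels_
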
However, besides scikit-learn, there are several other packages for more advanced, specific applications. Packages like networkx (https://networkx.github.io/) for graph data, nltk (http://www.nltk.org/) for text data, or statsmodels (http://statsmodels.sourceforge.net/) for temporal data nicely complement scikit-learn, either for feature engineering or even modeling. Also, some packages offer a different approach to data analysis and modeling, such as statsmodels for traditional statistical analysis, or PyMC (http://pymc-devs.github.io/pymc/README.html) for Bayesian inference.

Graph Analytics
Graph analytics is particularly useful for social network analysis: uncovering communities and finding central agents in the network.
Networkx (https://networkx.github.io/) is the most popular Python package for
graph analytics. It contains many functions for generating, analyzing and
drawing graphs.
However, networkx may not scale well for large graphs. For such graphs, you should also consider igraph (http://igraph.org/) (available in both R and Python), graph-tool (http://graph-tool.skewed.de/), or GraphLab Create (https://dato.com/products/create/quick-start-guide.html).

Sample code
This starter code illustrates how you can include networkx in your data
processing flow. As a starting point, you will generally have a dataframe
representing links in a network. A link could denote, for example, that the two
users are friends on Facebook.

You first need to convert the links dataframe into a graph. You can then, for
example, find the connected components of the graph, sorted by size. You can
also restrict your analysis to a subgraph, for instance the largest connected
component.
To find the most influential people in the network, you can explore several
centrality measures, such as degree, betweenness, and pagerank. Finally, you
can easily output the centrality measures in a dataframe, for further analyses.
# build graph from the links dataframe (columns: user1, user2)
import networkx as nx
import pandas as pd

g = nx.Graph()
g.add_edges_from(zip(links.user1, links.user2))
print nx.info(g)

# connected components sorted by size
cc = sorted(nx.connected_components(g), key=len, reverse=True)
print "number of connected components: ", len(cc)
print "size of largest connected component: ", len(cc[0])
print "size of second largest: ", len(cc[1])

# largest connected component
G = g.subgraph(cc[0])

# output centrality measures in a dataframe
centrality = pd.DataFrame({'user': G.nodes()})
centrality['degree'] = centrality.user.map(nx.degree(G))
centrality['pagerank'] = centrality.user.map(nx.pagerank(G))
centrality['betweenness'] = centrality.user.map(nx.betweenness_centrality(G))

Time Series Analysis And Forecasting


In many applications, predictions are affected by temporal factors: seasonality, an underlying trend, lags. The purpose of time-series analysis is to uncover such patterns in temporal data, and then build models upon them for forecasting.

Statsmodels (http://statsmodels.sourceforge.net/) is the main Python package for time-series analysis and forecasting. It nicely integrates with pandas time series. This package also contains many statistical tests, such as ANOVA or the t-test, used in traditional approaches to statistical data analysis.

Sample code
In the code below, taken from the examples section of statsmodels (http://statsmodels.sourceforge.net/devel/examples/index.html), we fit an auto-regressive model to the sunspots activity data, and use it for forecasting. We also plot the autocorrelation function, which reveals that values are correlated with past values.
import statsmodels.api as sm
import pandas as pd

# sunspots activity data
print sm.datasets.sunspots.NOTE
data = sm.datasets.sunspots.load().endog
dates = sm.tsa.datetools.dates_from_range('1700', '2008')
ts = pd.Series(data, index=dates)

# plot the autocorrelation function (acf)
sm.graphics.tsa.plot_acf(ts.values, lags=40)

# fit an AR model and forecast
ar_fitted = sm.tsa.AR(ts).fit(maxlag=9, method='mle', disp=-1)
ts_forecast = ar_fitted.predict(start='2008', end='2050')

Natural Language Processing


Analyzing text is a difficult and broad task. The nltk (http://www.nltk.org/) package is a very complete package for that purpose. It implements many tools useful for natural language processing and modeling, such as tokenization, stemming, and parsing, to name a few.
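
Sample code
Here is a minimal sketch of tokenization and stemming with nltk (it assumes the package is installed and the 'punkt' tokenizer data has been downloaded):
import nltk
from nltk.stem.porter import PorterStemmer

# split raw text into word tokens
text = "Dogs are running faster than cats."
tokens = nltk.word_tokenize(text)

# reduce each token to its stem
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
print tokens
print stems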

Bayesian Inference
Finally, PyMC (http://pymc-devs.github.io/pymc/README.html) is a Python
package devoted to Bayesian inference. This package allows you to easily
construct, fit, and analyze your probabilistic models.
If you are not familiar with Bayesian inference, we recommend the excellent Probabilistic Programming and Bayesian Methods for Hackers (http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/) by Cameron Davidson-Pilon. The book is entirely written as IPython notebooks, and contains lots of concrete examples using PyMC code.

Sample code
Here's one simple example, taken from the notebook. The purpose is to infer whether the user has changed their text-message behavior, based on a time series of text-message count data.

The first step in Bayesian inference is to propose a probabilistic model for the data. Here, for instance, it is assumed that the user changed their behavior at a time tau: before the event, they were sending messages at a rate lambda_1, and after the event, at a rate lambda_2. The Bayesian approach then allows you to infer the whole probability distribution of tau, lambda_1, and lambda_2 given the observed data, not just a single point estimate.

The probability distributions are usually obtained by Markov Chain Monte Carlo
sampling (http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo), as done in
the example code.
import pymc as pm
import numpy as np

# probabilistic model
# count_data is the numpy array that holds our daily txt-message counts
n_count_data = len(count_data)
alpha = 1.0 / count_data.mean()

lambda_1 = pm.Exponential("lambda_1", alpha)
lambda_2 = pm.Exponential("lambda_2", alpha)
tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data)

@pm.deterministic
def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
    out = np.zeros(n_count_data)
    out[:tau] = lambda_1   # lambda before tau is lambda_1
    out[tau:] = lambda_2   # lambda after (and including) tau is lambda_2
    return out

observation = pm.Poisson("obs", lambda_, value=count_data, observed=True)
model = pm.Model([observation, lambda_1, lambda_2, tau])

# MCMC sampling: 40000 iterations, discarding the first 10000 as burn-in
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000, 1)
lambda_1_samples = mcmc.trace('lambda_1')[:]
lambda_2_samples = mcmc.trace('lambda_2')[:]
tau_samples = mcmc.trace('tau')[:]

Web Scraping: Beautifulsoup, Urllib2, ...

Many websites contain a lot of interesting and useful data. Unfortunately, the data is rarely available in a nice tabular format to download! The data is only displayed, disseminated across the web page, or even dispatched on different pages.

Suppose you wish to retrieve data on the most popular movies of 2014 (http://www.imdb.com/year/2014), displayed on the IMDb site. Unfortunately, as you can see, the movie title, its rating and so on are disseminated across the web page. In Chrome, if you right-click on any element, such as the movie title, and select "Inspect element", you will see which part of the HTML code it corresponds to.

The goal of web scraping is to systematically recover data displayed on websites. Several Python packages are useful to this end.

First, requests (http://docs.python-requests.org/en/latest/) or urllib2 (https://docs.python.org/2/library/urllib2.html) allow you to retrieve the HTML content of the pages. You can then industrialize your browsing and systematically fetch the related content. Then, BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) or lxml (http://lxml.de/) allow you to efficiently parse the HTML content. If you understand the page structure, you can then easily get each datum displayed on the page.

Sample code
In the code below for instance, we first parse the HTML content to get the list of
all movies. Then for each movie, we retrieve its ranking number, title, outline,
rating and genre.
One of our data scientists used this kind of web scraping to build his personalized movie recommender system!

import urllib2
import pandas as pd
from bs4 import BeautifulSoup

# get the html content
url = "http://www.imdb.com/year/2014"
page = urllib2.urlopen(url).read()

# parse the HTML
soup = BeautifulSoup(page)

# find all movies
movies = soup.find("table", {"class": "results"}).findAll('tr')

# get information for each movie
records = []
for movie in movies:
    record = {}
    record['number'] = movie.find('td', {"class": "number"}).text
    record['title'] = movie.find('a')['title']
    record['rating'] = movie.find('div', {"class": "rating-list"})['title']
    record['outline'] = movie.find('span', {"class": "outline"}).text
    record['credit'] = movie.find('span', {"class": "credit"}).text
    record['genres'] = movie.find('span', {"class": "genre"}).text.split('|')
    records += [record]

# output in a dataframe
df = pd.DataFrame(records)

Visualization: Seaborn, Ggplot


The standard plotting package in Python is matplotlib (http://matplotlib.org/), which enables you to make simple plots rather easily. Matplotlib is also a very flexible plotting library. You can use it to make arbitrarily complex plots and customize them at will.

However, using matplotlib can be frustrating at times, for two reasons. First, matplotlib's default aesthetics are not especially attractive, and you may end up doing a lot of manual tweaking to get awesome-looking plots. Second, matplotlib is not well suited for exploratory data analysis, when you want to quickly analyze your data across several dimensions. Your code will often end up verbose and lengthy.

Fortunately, there are additional packages that make better visualizations easier to produce.

First, there are the plotting capabilities of pandas (http://pandas.pydata.org/pandas-docs/stable/visualization.html), the data manipulation package. This greatly simplifies exploratory data analysis, as you get visualizations straight from your dataframes.
import pandas as pd
# bar plot
df.plot(kind='bar')
# kernel density estimate
df.plot(kind='kde');
# scatter plot of A vs B, color given by D, and size by C
df.plot(kind='scatter', x='A', y='B', s=100*df['C'], c='D');

Second, seaborn (http://stanford.edu/%7Emwaskom/software/seaborn/) is a great way to enhance the aesthetics of your matplotlib visualizations. Simply add the following at the beginning of your notebook:
import seaborn as sns
sns.set()

and all your matplotlib plots will be much prettier! Seaborn also comes with better color palettes and utility functions for removing chartjunk, as well as a lot of very useful functions for exploratory data analysis, such as the clustermap (http://stanford.edu/%7Emwaskom/software/seaborn/examples/structured_heatmap.html), the pairplot (http://stanford.edu/%7Emwaskom/software/seaborn/examples/scatterplot_matrix.html), or the corrplot and lmplot used in the example below.

import seaborn as sns

# corrplot of the iris data (correlations over the numeric columns)
df = sns.load_dataset("iris")
sns.corrplot(df)

# faceted logistic regression of the titanic data
df = sns.load_dataset("titanic")
pal = dict(male="#6495ED", female="#F08080")
g = sns.lmplot("age", "survived", col="sex", hue="sex", data=df,
               palette=pal, y_jitter=.02, logistic=True)
g.set(xlim=(0, 80), ylim=(-.05, 1.05))

Be sure to check out the gallery (http://stanford.edu/%7Emwaskom/software/seaborn/examples/index.html) for many more examples.
And finally, there is the ggplot (https://github.com/yhat/ggplot) package, which is based on the R ggplot2 (http://ggplot2.org/) package. Built around the grammar of graphics, it allows you to build visualizations from a dataframe with a very clear syntax. For instance, this is how you can make a scatter plot of A vs B and add a trend line.
from ggplot import *
p = ggplot(aes(x='A', y='B'), data=df)
p + geom_point() + stat_smooth(color='blue')
