https://www.dataiku.com/learn/guide/academy/python-and-r/...
HOWTO
Table of contents
Analytics
Graph analytics
Science
Time series
Natural Language Processing
Bayesian Inference
Web scraping
Data Visualization
Related resources
Howto: Getting started with Python (getting-started-with-python.html)
numpy
scipy
matplotlib
IPython
pandas
scikit-learn
If you have already done data science in Python, you probably already know
them. If you are new to Python and need a quick introduction to these
packages, you can check out our Getting started with Python
(getting-started-with-python.html) post. There we also list tutorials and
useful resources to help you get started.
1/3/16 17:13
Graph Analytics
Graph analytics is particularly useful for social network analysis: uncovering
communities, finding central agents in the network.
Networkx (https://networkx.github.io/) is the most popular Python package for
graph analytics. It contains many functions for generating, analyzing and
drawing graphs.
However, networkx may not scale well to large graphs. For such graphs,
you should also consider igraph (http://igraph.org/) (available in R and Python),
graph-tool (http://graph-tool.skewed.de/) or GraphLab Create™
(https://dato.com/products/create/quick-start-guide.html).
Sample code
This starter code illustrates how you can include networkx in your data
processing flow. As a starting point, you will generally have a dataframe
representing links in a network. A link could denote, for example, that two
users are friends on Facebook.
You first need to convert the links dataframe into a graph. You can then, for
example, find the connected components of the graph, sorted by size. You can
also restrict your analysis to a subgraph, for instance the largest connected
component.
To find the most influential people in the network, you can explore several
centrality measures, such as degree, betweenness, and PageRank. Finally, you
can easily output the centrality measures to a dataframe for further analysis.
# build graph from links dataframe
import networkx as nx
g = nx.Graph()
g.add_edges_from(zip(links.user1, links.user2))
# print basic statistics: number of nodes and number of edges
print(g.number_of_nodes(), g.number_of_edges())
# connected components sorted by size
cc =
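To make concrete what the steps described above compute, here is a minimal,
dependency-free sketch in plain Python. It is not networkx's implementation:
the edge list `links` and the helpers `build_adjacency`, `connected_components`
and `degree_centrality` are hypothetical names introduced for illustration
only. networkx offers comparable functionality through `nx.connected_components`
and `nx.degree_centrality`.

```python
from collections import deque

# Hypothetical links: each pair means the two users are connected.
links = [("ann", "bob"), ("bob", "cat"), ("dan", "eve")]


def build_adjacency(edges):
    """Return an undirected graph as a dict: node -> set of neighbours."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def connected_components(adjacency):
    """Return the connected components as sets of nodes, largest first."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        component = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    return {node: len(nbs) / (n - 1) for node, nbs in adjacency.items()}


graph = build_adjacency(links)
components = connected_components(graph)
centrality = degree_centrality(graph)

print(components[0])                         # largest connected component
print(max(centrality, key=centrality.get))   # most connected user
```

The dictionary returned by `degree_centrality` maps each user to a score, so it
can be turned into a two-column table (user, centrality) for further analysis.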
Time Series
patterns in temporal data, and then build models upon them for forecasting.
Statsmodels (http://statsmodels.sourceforge.net/) is the main Python package
for time-series analysis and forecasting. It integrates nicely with pandas
time series. This package also contains many statistical tests, such as ANOVA or
the t-test, used in traditional approaches to statistical data analysis.
Sample code
In the code below, taken from the examples section of statsmodels
(http://statsmodels.sourceforge.net/devel/examples/index.html), we fit an
auto-regressive model to the sunspots activity data, and use it for forecasting.
We also plot the autocorrelation function, which reveals that values are
correlated with past values.
import numpy as np
import pandas as pd
import statsmodels.api as sm
# sunspots activity data
print(sm.datasets.sunspots.NOTE)
data = np.asarray(sm.datasets.sunspots.load().endog)
dates = sm.tsa.datetools.dates_from_range('1700', '2008')
ts = pd.Series(data, index=dates)
# plot the acf
sm.graphics.tsa.plot_acf(ts.values, lags=40)
# fit an AR model and forecast
# (sm.tsa.AR is the older statsmodels API; newer releases provide sm.tsa.AutoReg instead)
ar_fitted = sm.tsa.AR(ts).fit(maxlag=9, method='mle', disp=-1)
ts_forecast = ar_fitted.predict(start='2008', end='2050')
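To illustrate what an autocorrelation function and an auto-regressive fit
actually compute, here is a small sketch in plain Python. It is a
simplification under stated assumptions: it fits an AR(1) model by ordinary
least squares rather than the maximum-likelihood AR(9) fit used above, and the
helper names `autocorrelation`, `fit_ar1` and `forecast`, as well as the toy
series, are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev, nxt = series[:-1], series[1:]
    mean_prev = sum(prev) / len(prev)
    mean_next = sum(nxt) / len(nxt)
    cov = sum((p - mean_prev) * (q - mean_next) for p, q in zip(prev, nxt))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    c = mean_next - phi * mean_prev
    return c, phi


def forecast(last_value, c, phi, steps):
    """Iterate the fitted recurrence to predict the next `steps` values."""
    predictions = []
    value = last_value
    for _ in range(steps):
        value = c + phi * value
        predictions.append(value)
    return predictions


# Toy series generated by x[t] = 2 + 0.5 * x[t-1], without noise.
series = [0.0]
for _ in range(9):
    series.append(2.0 + 0.5 * series[-1])

c, phi = fit_ar1(series)
predictions = forecast(series[-1], c, phi, steps=3)
print(c, phi)
print(autocorrelation(series, 1))
print(predictions)
```

Because the toy series follows the recurrence exactly, the fit recovers its
coefficients up to floating-point error. On real data such as the sunspots
series, the fitted values are estimates and the forecast carries uncertainty.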
Bayesian Inference
Finally, PyMC (http://pymc-devs.github.io/pymc/README.html) is a Python
package devoted to Bayesian inference. This package allows you to easily
construct, fit, and analyze your probabilistic models.
If you are not familiar with Bayesian inference, we recommend the excellent
Probabilistic Programming and Bayesian Methods for Hackers
(http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/)
by Cameron Davidson-Pilon. The book is entirely written
as IPython notebooks, and contains lots of concrete examples using PyMC code.
Sample code
Here's one simple example, taken from the notebook. The purpose is to infer
whether a user has changed their text-messaging behavior, based on a time series
of text-message count data.
The first step in Bayesian inference is to propose a probabilistic model for the
data. Here, it is assumed that the user changed their behavior at a
time tau: before the event, they were sending messages at a rate lambda_1, and
after the event, at a rate lambda_2. The Bayesian approach then allows us to infer
the whole probability distribution of tau, lambda_1 and lambda_2 given the
observed data, not just a single estimate.
The probability distributions are usually obtained by Markov Chain Monte Carlo
sampling (http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo), as done in
the example code.
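import numpy as np
import pymc as pm
# count_data: NumPy array of text-message counts, assumed already loaded
n_count_data = len(count_data)
# probabilistic model (PyMC 2 API, as in the notebook)
alpha = 1.0 / count_data.mean()
lambda_1 = pm.Exponential("lambda_1", alpha)
lambda_2 = pm.Exponential("lambda_2", alpha)
tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data)

@pm.deterministic
def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
    # rate is lambda_1 before the switchpoint tau and lambda_2 after it
    out = np.zeros(n_count_data)
    out[:tau] = lambda_1
    out[tau:] = lambda_2
    return out

observation = pm.Poisson("obs", lambda_, value=count_data, observed=True)
model = pm.Model([observation, lambda_1, lambda_2, tau])
# MCMC sampling: 40000 iterations, 10000 burn-in, no thinning
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000, 1)
lambda_1_samples = mcmc.trace('lambda_1')[:]
lambda_2_samples = mcmc.trace('lambda_2')[:]
tau_samples = mcmc.trace('tau')[:]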
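For this particular model, the posterior distribution of tau can also be
computed exactly, which gives a useful cross-check on what the sampler is
estimating. The sketch below is not the notebook's method: it swaps MCMC for
analytic marginalization of the two rates under the same priors (exponential
with parameter alpha on each rate, uniform on tau). The function name
`switchpoint_posterior` and the toy `counts` are hypothetical.

```python
import math


def switchpoint_posterior(counts):
    """Exact posterior P(tau | counts) for tau = 0 .. len(counts).

    Assumes Poisson counts with rate lambda_1 before tau and lambda_2 after,
    Exponential(alpha) priors on both rates with alpha = 1 / mean(counts),
    and a uniform prior on tau. Requires at least one non-zero count.
    """
    n = len(counts)
    alpha = n / sum(counts)

    def log_marginal(total, length):
        # log of the integral over the rate of
        #   rate**total * exp(-length * rate) * alpha * exp(-alpha * rate),
        # which equals alpha * Gamma(total + 1) / (length + alpha)**(total + 1).
        # Factorial terms are omitted because they do not depend on tau.
        return (
            math.log(alpha)
            + math.lgamma(total + 1)
            - (total + 1) * math.log(length + alpha)
        )

    # Prefix sums give the total count on each side of every candidate tau.
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)

    log_post = []
    for tau in range(n + 1):
        before = prefix[tau]
        after = prefix[n] - before
        log_post.append(log_marginal(before, tau) + log_marginal(after, n - tau))

    # Normalize in a numerically stable way.
    peak = max(log_post)
    weights = [math.exp(v - peak) for v in log_post]
    total_weight = sum(weights)
    return [w / total_weight for w in weights]


# Toy data: 10 days at 2 messages per day, then 10 days at 10 per day.
counts = [2] * 10 + [10] * 10
posterior = switchpoint_posterior(counts)
most_likely_tau = max(range(len(posterior)), key=posterior.__getitem__)
print(most_likely_tau)
```

The result is a full probability distribution over tau rather than a single
estimate, which is the point made above; a histogram of `tau_samples` from the
sampler approximates the same distribution.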
Web Scraping
Many websites contain a lot of interesting and useful data. Unfortunately, the
data is rarely available in a nice tabular format to download! The data is only
displayed, scattered across the web page, or even spread over different
pages.
Suppose you wish to retrieve data on the most popular movies of 2014
(http://www.imdb.com/year/2014), displayed on the IMDb site. Unfortunately, as
you can see, the movie title, its rating and so on are scattered across the
web page. In Chrome, if you right-click on any element, such as the movie title,
and select "Inspect element", you will see which part of the HTML code it
corresponds to.
Sample code
In the code below for instance, we first parse the HTML content to get the list of
all movies. Then for each movie, we retrieve its ranking number, title, outline,
rating and genre.
One of our data scientists used this kind of web scraping to build his
personalized movie recommender system!
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup
# get the html content
url = "http://www.imdb.com/year/2014"
page = urllib.request.urlopen(url).read()
# parsing HTML
soup = BeautifulSoup(page, "html.parser")
# find all movies
movies = soup.find("table", {"class": "results"}).find_all('tr')
# get information for each movie
records = []
for movie in movies:
    record = {}
    record['number'] = movie.find('td', {"class": "number"}).text
    record['title'] = movie.find('a')['title']
    record['rating'] = movie.find('div', {"class": "rating-list"})['title']
    record['outline'] = movie.find('span', {"class": "outline"}).text
    record['credit'] = movie.find('span', {"class": "credit"}).text
    record['genres'] = movie.find('span', {"class": "genre"}).text.split('|')
    records.append(record)
# output in a dataframe
df = pd.DataFrame(records)
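The same "parse, then extract one record per table row" idea can be tried
offline with only the standard library. The sketch below is an illustration
under assumptions: the inline `HTML` snippet is made up (it is not IMDb's
actual markup), and `MovieTableParser` is a hypothetical helper built on
Python's `html.parser` module rather than on BeautifulSoup.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for a downloaded page.
HTML = """
<table class="results">
  <tr><td class="number">1.</td><td><a title="First Movie (2014)">First Movie</a></td></tr>
  <tr><td class="number">2.</td><td><a title="Second Movie (2014)">Second Movie</a></td></tr>
</table>
"""


class MovieTableParser(HTMLParser):
    """Collect one dict per <tr>, with the rank and the link's title."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None      # record being filled, or None outside a row
        self._in_number = False   # True while inside <td class="number">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self._current = {}
        elif tag == "td" and attrs.get("class") == "number":
            self._in_number = True
        elif tag == "a" and self._current is not None and "title" in attrs:
            self._current["title"] = attrs["title"]

    def handle_data(self, data):
        if self._in_number and self._current is not None:
            self._current["number"] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_number = False
        elif tag == "tr" and self._current is not None:
            self.records.append(self._current)
            self._current = None


parser = MovieTableParser()
parser.feed(HTML)
records = parser.records
print(records)
```

As in the listing above, `records` is a list of dictionaries, so it can be
passed directly to `pd.DataFrame(records)`.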
Data Visualization
and all your matplotlib plots will be much prettier! Seaborn also comes with
better color palettes and utility functions for removing chartjunk. Seaborn also
has a lot of very useful functions for exploratory data analysis, such as the
clustermap (http://stanford.edu/%7Emwaskom/software/seaborn/examples/structured_heatmap.html),
the pairplot (http://stanford.edu/%7Emwaskom/software/seaborn/examples/scatterplot_matrix.html),
or the corrplot and lmplot as in the example below.