# Python data analysis platform for beginners PDF training


# Python As A Data Analysis Platform

•  People who consider themselves programmers

•  People who write code on a daily basis

•  People who consider Python their primary language

•  People who write data driven applications

## My goals for this talk

•   Increase development of data driven applications using Python

•   Increase the number of Python based stories on HN front page

•   Introduce users to Python libraries for analyzing / visualizing data

# Let's Pick A Problem

Analyzing Weather Data

## Source of data

•   Temperature

•   Dew point

•   Sea level pressure

•   Station pressure

•   Max windspeed

•   Max wind gust

•   Max temp

•   Min temp

•   Precipitation

•   Snow depth

•   Visibility

•   Windspeed

## Storing Your Data: Transient Storage

### Holding in Numpy arrays

•   NumPy: N-dimensional homogeneous array implemented in C; fast and memory efficient

>>> import numpy as np

>>> a = np.random.randn(1000, 100) ; b = a[::2,:]

NumPy arrays are full featured: 60 methods out of the box (max, mean, conjugate, …), SciPy packages add many more, as do the Scikits projects (Statsmodels, TimeSeries, …).
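As a minimal sketch of the slicing shown above (the seeded generator and the extra variable names are illustrative assumptions), basic slicing returns a view onto the same memory rather than a copy, and reductions such as mean are available as array methods:

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded generator, an illustrative choice
a = rng.standard_normal((1000, 100))    # 1000 x 100 array of float64
b = a[::2, :]                           # every other row; a view, not a copy

shares_memory = np.shares_memory(a, b)  # b reuses the buffer owned by a
b[0, 0] = 42.0                          # writing through the view also changes a
col_means = a.mean(axis=0)              # one mean per column
```

Because b is a view, slicing a large array this way costs almost no extra memory.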

•   Structured arrays offer a labeling of fields

>>> dt = np.dtype([("Station name", "S10"), ("Elevation", np.float64), ("Lat", …)])

>>> arr = np.genfromtxt("", dtype=dt, …)

>>> print(arr["Station name"])
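The field labeling above can be sketched without any input file. The station records and the lowercase field names below are made up for illustration; they are not the talk's data:

```python
import numpy as np

# Field names and values are hypothetical, chosen only to show the mechanics.
dt = np.dtype([("station", "U10"), ("elevation", np.float64), ("lat", np.float64)])
stations = np.array(
    [("AUSTIN", 149.0, 30.27), ("DENVER", 1609.0, 39.74), ("MIAMI", 2.0, 25.76)],
    dtype=dt,
)

names = stations["station"]                    # access a whole column by label
high = stations[stations["elevation"] > 1000]  # boolean mask on one field
```

Each field behaves like an ordinary array, so masks built from one field can select complete records.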

### Big data: memmap’ed arrays

Memory mapping allows manipulating arrays of data that require more than the available RAM:

>>> import numpy as np

>>> from numpy import memmap, uint16

>>> image = memmap('', dtype=uint16, mode='r+', shape=(5,5), offset=header_size)

>>> mean_value = image.mean()

>>> scaled_img = image * .5

>>> np.multiply(image, .5, scaled_img)

(Figure: image is a 2D NumPy array with shape (5, 5) and dtype uint16, backed by the raw bytes of the file.)

Memory mapping is very efficient thanks to (1) OS caching and (2) the implementation of NumPy arrays, but it is typically 2-3x slower than working in memory.
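A self-contained sketch of the same idea follows. The temporary file, its 16-byte header, and the pixel values are assumptions made so the example can run; a real image file would supply its own header size:

```python
import os
import tempfile

import numpy as np

# Build a small raw file: a 16-byte header followed by 25 uint16 values.
path = os.path.join(tempfile.mkdtemp(), "image.raw")  # hypothetical file
header_size = 16
with open(path, "wb") as f:
    f.write(b"\x00" * header_size)
    f.write(np.arange(25, dtype=np.uint16).tobytes())

# Map the pixel data, skipping the header; nothing is read until it is accessed.
image = np.memmap(path, dtype=np.uint16, mode="r+", shape=(5, 5), offset=header_size)

mean_value = float(image.mean())                      # reduction over the mapped data
scaled_img = np.empty(image.shape, dtype=np.float64)
np.multiply(image, 0.5, out=scaled_img)               # result goes to a preallocated array

image[0, 0] = 100                                     # mode "r+" allows writing in place
image.flush()                                         # push the change back to the file
```

Passing out= to np.multiply avoids allocating a second temporary, which matters once the mapped array is larger than RAM.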

### Limitations of memmap

NumPy's memmap relies on Python's mmap module, which carries OS-dependent limitations:

>>> from numpy import memmap, uint8

>>> a = memmap('', dtype=uint8, mode='write', shape=(N,))

Responses (python2.7, MacOS with 8GB RAM, 11GB free HD):

| N | Mac OS (32bit python) | Win 7 & MacOS (64bit, 3Gb RAM) | Linux Ubuntu 11.04 (64bit, 3Gb RAM) |
|---|---|---|---|
| 10**9 | OK (du = 0.9G) | OK (du = 0.9G) | OK (du = 4K) |
| 3x10**9 | Overflow error | OK (du = 3G) | OK (du = 4K) |
| 10**13 | No space left on device | No space left on device | OK (du = 4K) |
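The sizes in these results can be checked with simple arithmetic. The sketch below only computes byte counts; reading the overflow as a signed 32-bit limit, and the 4K figures on Linux as sparse file creation, are assumptions that the slides do not state:

```python
import numpy as np

INT32_MAX = 2**31 - 1  # largest signed 32-bit integer


def memmap_nbytes(n, dtype=np.uint8):
    """Bytes needed on disk for a 1-D memmap holding n items of dtype."""
    return n * np.dtype(dtype).itemsize


def fits_signed_32bit(nbytes):
    """True if the byte count can be stored in a signed 32-bit integer."""
    return nbytes <= INT32_MAX


sizes = {n: memmap_nbytes(n) for n in (10**9, 3 * 10**9, 10**13)}
gib = {n: nbytes / 2**30 for n, nbytes in sizes.items()}  # sizes in GiB
```

Under these assumptions, 10**9 bytes is about 0.93 GiB, which matches the reported du of 0.9G, while 3x10**9 bytes no longer fits a signed 32-bit size and 10**13 bytes is roughly 9 TiB, far beyond the 11GB of free disk.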

### Holding data in Pandas I

Pandas (now version 0.7.1) offers thin wrappers around 1D, 2D, and 3D NumPy arrays.

Author: Wes McKinney, Lambda Foundry

•   axis labeling, for example using datetime steps, and nice representation in ipython

•   data alignment, data merge (incl. priorities for the various datasets),

•   management of missing data

•   MANY statistical tools (describe, moving average, covariance, correlation, …)

•   Easy visualization (line, bar chart, boxplot, …) with Matplotlib

>>> import numpy as np

>>> from pandas import *

>>> a = [12.3, 15.3, 14.6, np.nan, 17.1, 13.6]

>>> ts = Series(a, index=DateRange('1/1/2000', periods=6, offset=…), name="Temperature") # 1D

>>> df = DataFrame(ts) # 2D

>>> df['var'] = ts2  # Add another column

Access components: df.values (np.ndarray), df.index (pandas.Index)
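DateRange belongs to the pandas 0.7 API and is not available in current releases. A rough modern equivalent, assuming a daily frequency (which the printed dates suggest) and a random second column standing in for ts2, might look like this:

```python
import numpy as np
import pandas as pd

values = [12.3, 15.3, 14.6, np.nan, 17.1, 13.6]
index = pd.date_range("2000-01-01", periods=6, freq="D")  # daily steps, an assumption

ts = pd.Series(values, index=index, name="Temperature")   # 1D
df = pd.DataFrame(ts)                                     # 2D, one column named "Temperature"

# ts2 is not defined in the slides; random values are a placeholder.
rng = np.random.default_rng(0)
df["var"] = pd.Series(rng.standard_normal(6), index=index)

raw_values = df.values  # plain np.ndarray
labels = df.index       # pandas DatetimeIndex
```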

### Holding data in Pandas II

Pretty representation:

>>> print(ts)
2000-01-01  12.3
2000-01-02  15.3
2000-01-03  14.6
2000-01-04  NaN
2000-01-05  17.1
Name: Temperature

>>> print(df)
            Temperature  var
2000-01-01  12.3         -1.452
2000-01-02  15.3          1.851
2000-01-03  14.6         -0.09037
2000-01-04  NaN          -0.3942
2000-01-05  17.1          1.446

Data alignment, data reduction, missing value management

ts.align(ts2) ; ts.reindex(ts2.index) ; ts.groupby().apply()

ts.fillna(0.0) ; ts.dropna() ; ts.to_sparse()

>>>  HDFStore(), ExcelFile()
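The alignment and missing-value calls listed above can be sketched on two small series with overlapping dates. The values are invented for illustration, and to_sparse is left out because it has been removed from recent pandas releases:

```python
import numpy as np
import pandas as pd

ts = pd.Series(
    [12.3, 15.3, np.nan, 17.1],
    index=pd.date_range("2000-01-01", periods=4, freq="D"),
)
ts2 = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2000-01-03", periods=3, freq="D"),
)

left, right = ts.align(ts2)        # outer join on the index by default
reindexed = ts.reindex(ts2.index)  # ts on the dates of ts2, NaN where absent
filled = ts.fillna(0.0)            # replace missing values
dropped = ts.dropna()              # or discard them
monthly_mean = ts.groupby(ts.index.month).mean()  # data reduction by month
```

After align, both series share the union of the two indexes, so element-wise arithmetic between them lines up date by date.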

## Storing Your Data: Persistent Storage

### Some Options

Some universal file formats (built into the data structure):

-  txt, csv

-  binary (watch out!)

Some standard labeled file formats:

-  json: json

-  HDF: pytables, h5py, pyhdf

-  netCDF: netCDF4, (also .netcdf, .netcdf)

Some database options

-  SQL: sqlalchemy, sqlite3, mysql-python, psycopg…

-  NoSQL: couchdb, mongodb, cassandra, …
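Three of these options ship with the standard library. The sketch below round-trips the same two hypothetical records through CSV, JSON, and an in-memory SQLite database; the table and column names are illustrative assumptions:

```python
import csv
import io
import json
import sqlite3

records = [("AUSTIN", 30.5), ("DENVER", 12.1)]  # hypothetical (station, temp) rows

# CSV: plain text with no type information, so values come back as strings.
buf = io.StringIO()
csv.writer(buf).writerows(records)
csv_rows = list(csv.reader(io.StringIO(buf.getvalue())))

# JSON: labeled fields, numbers keep their type.
json_text = json.dumps([{"station": s, "temp": t} for s, t in records])
json_rows = json.loads(json_text)

# SQL: typed columns plus queries, here with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (station TEXT, temp REAL)")
conn.executemany("INSERT INTO obs VALUES (?, ?)", records)
max_temp = conn.execute("SELECT MAX(temp) FROM obs").fetchone()[0]
conn.close()
```

The CSV round trip returns "30.5" as text, which is one reason labeled or typed formats are often preferable for numeric data.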

### Storing data to HDF5

HDF5 files are the best way to store large datasets during and after processing.

FEATURES

•   HDF5 file format is self-describing: good for complex data objects

•   HDF5 files are portable: cross-platform, cross-language (C, C++, Fortran, Java)

•   HDF5 is optimized: direct access to parts of the file without parsing the entire contents.


PYTHON LIBRARIES

•   h5py - "thin wrapper" around the C HDF5 library.

•   PyTables - Provides some higher level abstractions and efficient tools for retrieval, compression and out-of-core functionalities.
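As a minimal sketch with h5py (the file name, dataset name, and attribute are illustrative assumptions), a dataset can be written with compression and metadata, then read back one slice at a time without loading the whole file:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "weather.h5")  # hypothetical file name
temps = np.linspace(10.0, 20.0, 1000)

# Write: the dataset carries its own shape, dtype, and attributes.
with h5py.File(path, "w") as f:
    dset = f.create_dataset("temperature", data=temps, compression="gzip")
    dset.attrs["units"] = "degC"

# Read: slicing the dataset pulls only the requested part from disk.
with h5py.File(path, "r") as f:
    dset = f["temperature"]
    units = dset.attrs["units"]
    chunk = dset[100:110]
    n_total = dset.shape[0]
```

The attribute travels with the data, which is what the self-describing property means in practice.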

FAST!

EFFICIENT!


### Out of core calcs w/ Pytables

FAST!

EFFICIENT


## Visualizing Data

I wonder if there is a way to see those stations on a map.

Multiple Cities

Source code at

## Comparing … Even More Data

### Scatter plot matrix
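A scatter plot matrix can be sketched with pandas and Matplotlib. The three columns below are random stand-ins for weather variables, not the data used in the talk, and the non-interactive backend is chosen only so the example runs without a display:

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; an assumption for headless runs

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "temperature": rng.normal(20, 5, 200),   # synthetic values
        "dew_point": rng.normal(10, 4, 200),
        "pressure": rng.normal(1013, 8, 200),
    }
)

# One scatter plot per pair of columns, histograms on the diagonal.
axes = scatter_matrix(df, diagonal="hist", figsize=(6, 6))
```

The result is a grid of Matplotlib axes, one per pair of columns; calling savefig on the underlying figure writes it to disk.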


Can I learn something from this data?

### Learning from data

•  Classify data into categories

•  Optimize a function with respect to input parameters

•  Create predictive model from data

Support vector machines

## Brief Interlude Into Classifiers

### Examples

•  Predict if a mail is spam or not

•  Sort incoming mail into folders

•  Predict if a transaction is fraudulent

•  Predict if a patient has a disease

### Feature vectors

| Mail # | Word1 | Word2 | Spam? |
|---|---|---|---|
| 1 | 0 | 1 | Y |
| 2 | 0 | 1 | Y |
| 3 | 1 | 0 | N |
| 4 | 1 | 1 | Y |
| 5 | 1 | 0 | N |
| 6 | 1 | 1 | N |
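The table above maps directly onto a feature matrix and a label vector. This sketch encodes Y as 1 and N as 0 (an encoding chosen here for illustration) and computes how often each word appears in spam:

```python
import numpy as np

# One row per mail, one column per word (Word1, Word2).
X = np.array([[0, 1], [0, 1], [1, 0], [1, 1], [1, 0], [1, 1]])
# Spam label per mail: 1 for Y, 0 for N.
y = np.array([1, 1, 0, 1, 0, 0])

spam_rate_word1 = y[X[:, 0] == 1].mean()  # share of spam among mails containing Word1
spam_rate_word2 = y[X[:, 1] == 1].mean()  # share of spam among mails containing Word2
```

Mails 4 and 6 have identical features but different labels, so no classifier can be perfect on this toy data.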

### Classifying data

Source: Berwick2003

### Support vectors

Source: Berwick2003

### Slide showing predictor

from sklearn.svm import SVR

clf = SVR(epsilon=0.2)

clf.fit(X, y)

pred = clf.predict(test)
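To make the predictor runnable end to end, the sketch below fits scikit-learn's SVR to a noisy sine curve. The synthetic training data and the test points are assumptions; the talk's weather inputs are not reproduced here:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)   # one feature, 80 samples
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=80)    # noisy target

clf = SVR(epsilon=0.2)   # errors smaller than epsilon are not penalized
clf.fit(X, y)

test = np.array([[1.0], [2.5], [4.0]])                 # unseen inputs
pred = clf.predict(test)                               # one prediction per row
```

A wider epsilon tolerates more noise at the cost of a coarser fit, so it is worth tuning alongside the kernel parameters.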


### Applications for this analysis

•  Impact of sales campaign

•  Effect of hiring star athlete

•  Effect of upgrading computer infrastructure

•  Predict stock prices ?

Source code repository

### Credits for talk

•  Jonathan Rocher – This talk builds upon his talk from PyCon

•  Naveen Michaud Agrawal – Wrote code for mapping weather stations

•  Chris Colbert – Helped debug several issues and gave Enaml advice

•  Sean Ross – Feedback on this talk

•  Urllib2

•  Requests

•  Paramiko

• Numpy

• Pandas

• Blist

• Bitarray

•  Chaco

•  Matplotlib

•  Networkx
