Python As A Data Analysis Platform

Question about audience

•  People who consider themselves programmers

•  People who write code on a daily basis

•  People who consider Python their primary language

•  People who write data driven applications 

My goals for this talk

•   Increase development of data driven applications using Python

•   Increase the number of Python based stories on HN front page

•   Introduce users to Python libraries for analyzing / visualizing data

Trajectory of successful project

Trajectory of unsuccessful project

Let's Pick A Problem

Analyzing Weather Data

Source of data

•   Temperature

•   Dew point

•   Sea level pressure

•   Station pressure

•  Max   windspeed

•  Max wind gust  

•  Max temp  

•  Min temp  

•  Precipitation  

•  Snow depth  

•   Visibility

•   Windspeed

Storing Your Data: Transient Storage

Holding in Numpy arrays

•   Numpy -> N-dimensional homogeneous array implemented in C: fast and memory efficient

>>>  a = np.random.randn(1000, 100) ; b = a[::2,:]

Numpy arrays are full featured: 60 methods out of the box (max, mean, conjugate, ...). SciPy packages add many more, as do the Scikits projects (Statsmodels, TimeSeries, ...).
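The slice `b = a[::2,:]` on the slide does not copy any data: basic slicing returns a view onto the same buffer, which is part of what makes Numpy memory efficient. A minimal sketch, written for current Python 3 and Numpy, with variable names taken from the slide:

```python
import numpy as np

a = np.random.randn(1000, 100)   # 1000 x 100 array of float64
b = a[::2, :]                    # every other row: a view, not a copy

shares_memory = np.shares_memory(a, b)   # True for basic slices
b[0, 0] = 42.0                           # writing through the view...
changed_in_a = a[0, 0] == 42.0           # ...also changes the original array

col_means = a.mean(axis=0)       # one of the built-in reduction methods
```

If an independent array is needed, `a[::2, :].copy()` breaks the link to the original buffer.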

•   Structured arrays offer a labeling of fields

 

>>> dt = np.dtype([("Station name", "S10"), ("Elevation", np.float), ("Lat", ...)])

>>> arr = genfromtxt("", dtype = dt, ...)

>>> print arr["Station name"]
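As a self-contained illustration of field labeling, the sketch below builds a small structured array in memory instead of reading a file. The station records are made-up sample values, and the type of the "Lat" field (elided on the slide) is assumed to be a float. It targets current Python 3 and Numpy.

```python
import numpy as np

# Field layout modeled on the slide; the "Lat" type is an assumption.
dt = np.dtype([("Station name", "U10"), ("Elevation", float), ("Lat", float)])

# Hypothetical records, for illustration only.
records = [
    ("Austin", 149.0, 30.27),
    ("Denver", 1609.0, 39.74),
    ("Boston", 43.0, 42.36),
]
arr = np.array(records, dtype=dt)

names = arr["Station name"]            # access a whole field by its label
high = arr[arr["Elevation"] > 100.0]   # boolean mask built from one field
mean_lat = arr["Lat"].mean()           # fields behave like ordinary arrays
```

Each field is a regular Numpy array view, so the usual methods and masks apply per column.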

Big data: memmap’ed arrays

Memory mapping makes it possible to manipulate arrays that need more memory than the available RAM:

>>> import numpy as np
>>> from numpy import memmap, uint16
>>> image = memmap('', dtype=uint16, mode='r+', shape=(5,5), offset=header_size)
>>> mean_value = image.mean()
>>> scaled_img = image * .5
>>> np.multiply(image, .5, scaled_img)

[Figure: the bits of the file on disk mapped to image, a 2D NumPy array (shape: 5, 5; dtype: uint16)]

Very efficient thanks to (1) OS caching and (2) the implementation of Numpy arrays (typically 2-3x slower than in memory).
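The sketch below shows one way to try this end to end. It writes a small throwaway file with a header, then maps only the image part. The file name, the 16-byte header size and the pixel values are assumptions made for the example, not values from the talk.

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "image.raw")  # hypothetical file
header_size = 16                                       # assumed header length in bytes

# Write a fake file: a 16-byte header followed by a 5x5 uint16 image.
pixels = np.arange(25, dtype=np.uint16).reshape(5, 5)
with open(path, "wb") as f:
    f.write(b"\x00" * header_size)
    f.write(pixels.tobytes())

# Map the image part of the file; data is paged in only when accessed.
image = np.memmap(path, dtype=np.uint16, mode="r+",
                  shape=(5, 5), offset=header_size)
mean_value = float(image.mean())

# Write the scaled result into a preallocated in-memory array.
scaled_img = np.empty(image.shape, dtype=np.float64)
np.multiply(image, 0.5, out=scaled_img)

image[0, 0] = 7   # with mode "r+", assignments modify the file
image.flush()     # push pending changes to disk
```

For data larger than RAM, the output of `np.multiply` could itself be a second memmap opened with mode "w+", so that neither operand has to fit in memory.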



Limitations of memmap

Numpy’s memmap module relies on python’s mmap which carries OS dependent limitations:

>>> from numpy import memmap, uint8
>>> a = memmap('', dtype=uint8, mode='write', shape=(N,))

Responses (python2.7, MacOS with 8GB RAM, 11GB free HD):

N            Mac OS (32bit python)     Win 7 & MacOS (64bit, 3Gb RAM)   Linux Ubuntu 11.04 (64bit, 3Gb RAM)
10**9        OK (du = 0.9G)            OK (du = 0.9G)                   OK (du = 4K)
3x10**9      Overflow error            OK (du = 3G)                     OK (du = 4K)
10**13       No space left on device   No space left on device          OK (du = 4K)

Holding data in Pandas I

Pandas (now version 0.7.1) offers thin wrappers around 1,2,3D Numpy arrays.

Author: Wes McKinney, Lambda Foundry

•   axis labeling, for example using datetime steps, and nice representation in ipython

•   data alignment, data merge (incl. priorities for the various datasets), 

•   management of missing data

•   MANY statistical tools (describe, moving average, covariance, correlation, ...)

•   Easy visualization (line, bar chart, boxplot, ...) with Matplotlib

>>> import numpy as np
>>> from pandas import *
>>> a = [12.3, 15.3, 14.6, np.nan, 17.1, 13.6]
>>> ts = Series(a, index = DateRange('1/1/2000', periods = 6, offset = ...), name = "Temperature") # 1D
>>> df = DataFrame(ts) # 2D
>>> df['var'] = ts2  # Add another column

Access components: df.values (np.ndarray), df.index (pandas.Index)
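The snippet on this slide targets Pandas 0.7.1. As a rough equivalent for current Pandas on Python 3, the sketch below assumes `pd.date_range` with a daily frequency in place of `DateRange`, and fills `ts2` with random numbers because the slide does not define it.

```python
import numpy as np
import pandas as pd

a = [12.3, 15.3, 14.6, np.nan, 17.1, 13.6]
index = pd.date_range("2000-01-01", periods=6, freq="D")  # daily steps (assumed)
ts = pd.Series(a, index=index, name="Temperature")        # 1D
df = pd.DataFrame(ts)                                     # 2D, one column

# ts2 is not defined on the slide; random values stand in for it here.
rng = np.random.default_rng(0)
ts2 = pd.Series(rng.standard_normal(6), index=index)
df["var"] = ts2                                           # add another column

values = df.values   # underlying np.ndarray, shape (6, 2)
idx = df.index       # a pandas DatetimeIndex
```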

Holding data in Pandas II

Pretty representation:

>>> print ts
2000-01-01  12.3
2000-01-02  15.3
2000-01-03  14.6
2000-01-04  NaN
2000-01-05  17.1
Name: Temperature

>>> print df
            Temperature  var
2000-01-01  12.3         -1.452
2000-01-02  15.3          1.851
2000-01-03  14.6         -0.09037
2000-01-04  NaN          -0.3942
2000-01-05  17.1          1.446

Data alignment, data reduction, missing value management:

ts.align(ts2) ; ts.reindex(ts2.index) ; ts.groupby().apply()

ts.fillna(0.0) ; ts.dropna() ; ts.to_sparse()
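To make these calls concrete, here is a small sketch on two short series with partly overlapping dates. The values are made up for the example, the grouping key (calendar month) is an assumption since the slide leaves `groupby()` empty, and the code targets current Pandas rather than 0.7.1.

```python
import numpy as np
import pandas as pd

ts = pd.Series([12.3, 15.3, np.nan, 17.1],
               index=pd.date_range("2000-01-01", periods=4, freq="D"))
ts2 = pd.Series([1.0, 2.0, 3.0],
                index=pd.date_range("2000-01-03", periods=3, freq="D"))

# Alignment: both results share the union of the two indexes (Jan 1 to Jan 5).
left, right = ts.align(ts2)
# Reindexing: ts looked up on ts2's dates; dates missing from ts become NaN.
on_ts2 = ts.reindex(ts2.index)

# Missing value management.
filled = ts.fillna(0.0)   # replace NaN with a constant
dropped = ts.dropna()     # or discard the NaN entries

# Data reduction: mean per calendar month (a single month in this sample).
monthly = ts.groupby(ts.index.month).mean()
```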

Loading data from/to files:

  >>> read_csv, read_table, ts.tofile, ts.to_csv

  >>>  HDFStore(), ExcelFile()

Persistent Storage

Some Options

Some universal file formats (built into the data-structure):

-  txt, csv

-  binary (watch out!)

Some standard labeled file formats:

-  json: json

-  HDF: pytables, h5py, pyhdf

-  netCDF: netCDF4, (also .netcdf, .netcdf)

Some database options

-  SQL: sqlalchemy, sqlite3, mysql-python, psycopg…

-  NoSQL: couchdb, mongodb, cassandra, …
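Of the database options listed, `sqlite3` ships with Python, so it is the easiest one to try. The sketch below stores a few made-up observations and lets the database compute a per-station mean. The table layout and the values are assumptions for illustration.

```python
import sqlite3

# Hypothetical observations: (station, date, temperature in degrees C).
rows = [
    ("Austin", "2000-01-01", 12.3),
    ("Austin", "2000-01-02", 15.3),
    ("Denver", "2000-01-01", -1.5),
]

conn = sqlite3.connect(":memory:")   # pass a file path for real persistence
conn.execute("CREATE TABLE weather (station TEXT, day TEXT, temp REAL)")
conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", rows)
conn.commit()

# Let the database do the reduction: mean temperature per station.
query = "SELECT station, AVG(temp) FROM weather GROUP BY station ORDER BY station"
means = dict(conn.execute(query).fetchall())
conn.close()
```

The `?` placeholders keep values out of the SQL string, which avoids quoting problems and SQL injection.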

Storing data to HDF5

HDF5 is the best way to store large datasets during/after processing.

FEATURES

•   HDF5 file format is self-describing: good for complex data objects

•   HDF5 files are portable: cross-platform, cross-language (C, C++, Fortran, Java)

•   HDF5 is optimized: direct access to parts of the file without parsing the entire contents.

PYTHON LIBRARIES

•   h5py - "thin wrapper" around the C HDF5 library.

•   PyTables - Provides some higher level abstractions and efficient tools for retrieval, compression and out-of-core functionalities.

Benchmarking Pytables

FAST!

EFFICIENT!

Out of core calcs w/ Pytables

FAST!

EFFICIENT

Visualizing Data

Wonder if there is a way to see those stations on a map.

   

Compare Weather From Multiple Cities

Plot weather data

 

Comparing … Even More Data

Scatter plot matrix

 

Can I learn something from this data?

Learning from data

•  Classify data into categories

•  Optimize a function with respect to input parameters

•  Create predictive model from data

Support vector machines

Brief Interlude Into Classifiers

Examples

•  Predict if a mail is spam or not

•  Sort incoming mail into folders

•  Predict if a transaction is fraudulent

•  Predict if a patient has a disease

Feature vectors

Mail #   Word1   Word2   Spam?
1        0       1       Y
2        0       1       Y
3        1       0       N
4        1       1       Y
5        1       0       N
6        1       1       N
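A table like this can be produced by checking each mail for a fixed list of words. The sketch below is one simple way to do that in plain Python. The vocabulary and the mail texts are hypothetical stand-ins for Word1, Word2 and the mails in the table.

```python
# Hypothetical vocabulary standing in for Word1 and Word2 in the table.
vocabulary = ["meeting", "prize"]


def to_feature_vector(text, vocab):
    """Return one 0/1 entry per vocabulary word: 1 if the word occurs in text."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocab]


# Made-up mails with their labels ("Y" = spam, "N" = not spam).
mails = [
    ("You won a prize", "Y"),
    ("Meeting moved to noon", "N"),
    ("Prize draw after the meeting", "Y"),
]

X = [to_feature_vector(text, vocabulary) for text, _ in mails]   # feature vectors
y = [1 if label == "Y" else 0 for _, label in mails]             # numeric labels
```

The resulting `X` and `y` have the shape a classifier expects: one row of features per mail and one label per row.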

Classifying data

 

Source: Berwick2003

Support vectors

 

Source: Berwick2003

Support Vector Regression

 

Scikits learn

 

Slide showing predictor

from sklearn.svm import SVR

clf = SVR(epsilon=0.2)

clf.fit(X, y)

pred = clf.predict(test)
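The slide does not show where `X`, `y` and `test` come from. The sketch below fills them with synthetic data (a noisy sine curve, which is an assumption made purely for illustration) so the predictor can be run as is with a current scikit-learn.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic training data: one feature per sample, noisy sine as the target.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 10.0, 80).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

clf = SVR(epsilon=0.2)   # default RBF kernel; epsilon as on the slide
clf.fit(X, y)            # learn from the training samples

test = np.array([[2.5], [7.0]])   # new inputs, same number of features as X
pred = clf.predict(test)          # one predicted value per test row
```

`X` must be two-dimensional (samples by features) even with a single feature, which is why the inputs are reshaped to a column.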

Learn from weather data

 

Applications for this analysis

•  Impact of sales campaign

•  Effect of hiring star athlete

•  Effect of upgrading computer infrastructure

•  Predict stock prices ?

Source code repository

Credits for talk

•  Jonathan Rocher – This talk builds upon his talk from PyCon

•  Naveen Michaud Agrawal – Wrote code for mapping weather stations

•  Chris Colbert – Helped debug several issues and gave Enaml advice

•  Sean Ross – Feedback on this talk

Network IO

•  Urllib2

•  Requests

•  Paramiko

Python data structures

• Numpy   

• Pandas

• Blist   

• Bitarray   

Python Visualization / Plotting

•  Chaco

•  Matplotlib

•  Networkx

