Python for Data Analysis
Andrew Henshaw
Georgia Tech Research Institute
• PowerPoint
• IPython (ipython --pylab=inline)
• Custom bridge (ipython2powerpoint)
• Python for Data Analysis
• Wes McKinney
• Lead developer of pandas
• Quantitative Financial Analyst
• Python library to provide data analysis features, similar to:
  • R
  • MATLAB
  • SAS
• Built on NumPy, SciPy, and to some extent, matplotlib
• Key components provided by pandas:
  • Series
  • DataFrame
• One-dimensional array-like object containing data and labels (or index)
• Lots of ways to build a Series
>>> import pandas as pd
>>> s = pd.Series(list('abcdef'))
>>> s
0 a
1 b
2 c
3 d
4 e
5 f
>>> s = pd.Series([2, 4, 6, 8])
>>> s
0 2
1 4
2 6
3 8
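As a rough illustration of the "lots of ways" point, the sketch below builds a Series from a list, a dict, a NumPy array, and a scalar. The variable names and the 'x', 'y', 'z' labels are made up for this example and are not from the talk.

```python
import numpy as np
import pandas as pd

# From a plain list: pandas supplies a default integer index 0..n-1.
from_list = pd.Series([2, 4, 6, 8])

# From a dict: the keys become the index labels.
from_dict = pd.Series({'b': 100, 'c': 150, 'd': 200})

# From a NumPy array, with an explicit index (hypothetical labels).
from_array = pd.Series(np.arange(3) * 1.5, index=['x', 'y', 'z'])

# From a scalar: the value is repeated for every index label.
from_scalar = pd.Series(0.0, index=['x', 'y', 'z'])
```

In every case the result pairs each value with a label, which is what separates a Series from a bare array.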
• A Series index can be specified
• Single values can be selected by index
• Multiple values can be selected with multiple indexes
>>> s = pd.Series([2, 4, 6, 8],
...               index=['f', 'a', 'c', 'e'])
>>> s
f 2
a 4
c 6
e 8
>>> s['a']
4
>>> s[['a', 'c']]
a 4
c 6
• Think of a Series as a fixed-length, ordered dict
• However, unlike a dict, index items don't have to be unique
>>> s2 = pd.Series(range(4),
...                index=list('abab'))
>>> s2
a 0
b 1
a 2
b 3
>>> s['a']
4
>>> s2['a']
a 0
a 2
>>> s2['a'][0]
0
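A small sketch of working with repeated labels, assuming a reasonably recent pandas. It uses loc and iloc rather than chained [] lookups, because recent releases discourage using an integer inside [] as a position when the index holds labels. The groupby step is one possible way to collapse duplicates, not something shown in the talk.

```python
import pandas as pd

s2 = pd.Series(range(4), index=list('abab'))

# The index reports whether its labels are unique.
has_unique_labels = s2.index.is_unique

# A repeated label selects every matching row, so the result is a Series.
matches = s2.loc['a']

# Positional access is unambiguous even when labels repeat.
first_match = matches.iloc[0]

# One way to collapse duplicates: aggregate the values per label.
totals = s2.groupby(level=0).sum()
```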
• Filtering
• NumPy-type operations on data
>>> s
f 2
a 4
c 6
e 8
>>> s[s > 4]
c 6
e 8
>>> s>4
f False
a False
c True
e True
>>> s*2
f 4
a 8
c 12
e 16
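The same filtering idea as a script, with two additions that are not in the talk: combining masks and applying a NumPy function. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 4, 6, 8], index=['f', 'a', 'c', 'e'])

# A comparison yields a boolean Series with the same index.
mask = s > 4

# Indexing with the mask keeps only the rows where it is True.
large = s[mask]

# Masks combine with & and |; the parentheses are required.
middle = s[(s > 2) & (s < 8)]

# NumPy ufuncs apply element-wise and preserve the index.
roots = np.sqrt(s)
```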
• pandas can accommodate incomplete data
>>> sdata = {'b':100, 'c':150, 'd':200}
>>> s = pd.Series(sdata)
>>> s
b 100
c 150
d 200
>>> s = pd.Series(sdata, list('abcd'))
>>> s
a NaN
b 100
c 150
d 200
>>> s*2
a NaN
b 200
c 300
d 400
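Once NaN values are present, the usual next step is to detect, drop, or fill them. The sketch below reuses the sdata dict from the slide; the choice of 0 as a fill value is arbitrary.

```python
import pandas as pd

sdata = {'b': 100, 'c': 150, 'd': 200}
s = pd.Series(sdata, index=list('abcd'))  # 'a' has no value, so it is NaN

missing = s.isna()          # boolean mask of the missing entries
present_only = s.dropna()   # drop the missing entries
filled = s.fillna(0)        # or replace them with a default value
total = s.sum()             # reductions skip NaN by default
```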
• Unlike in a NumPy ndarray, data is automatically aligned
>>> s2 = pd.Series([1, 2, 3],
...                index=['c', 'b', 'a'])
>>> s2
c 1
b 2
a 3
>>> s
a NaN
b 100
c 150
d 200
>>> s*s2
a NaN
b 200
c 150
d NaN
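To make the contrast with NumPy concrete, this sketch multiplies the same values by label and by position. The `mul(..., fill_value=1)` call is an extra, not from the talk, showing one way to avoid NaN for labels that are missing on one side.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 100, 150, 200], index=list('abcd'))
s2 = pd.Series([1, 2, 3], index=['c', 'b', 'a'])

# pandas matches values by label before multiplying.
by_label = s * s2

# On the raw arrays, NumPy pairs values purely by position.
by_position = s.to_numpy()[:3] * s2.to_numpy()

# The method form accepts fill_value for entries missing on one side.
filled = s.mul(s2, fill_value=1)
```

By label, 'c' gives 150 * 1; by position, the third elements give 150 * 3.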
• Spreadsheet-like data structure containing an ordered collection of columns
• Has both a row and column index
• Consider as dict of Series (with shared index)
>>> data = {'state': ['FL', 'FL', 'GA', 'GA', 'GA'],
...         'year': [2010, 2011, 2008, 2010, 2011],
...         'pop': [18.8, 19.1, 9.7, 9.7, 9.8]}
>>> frame = pd.DataFrame(data)
>>> frame
pop state year
0 18.8 FL 2010
1 19.1 FL 2011
2 9.7 GA 2008
3 9.7 GA 2010
4 9.8 GA 2011
>>> pop_data = {'FL': {2010: 18.8, 2011: 19.1},
...             'GA': {2008: 9.7, 2010: 9.7, 2011: 9.8}}
>>> pop = pd.DataFrame(pop_data)
>>> pop
FL GA
2008 NaN 9.7
2010 18.8 9.7
2011 19.1 9.8
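The "dict of Series with a shared index" view can be shown by building the same population table from two Series. This is a sketch; the names fl and ga are introduced here for illustration.

```python
import pandas as pd

fl = pd.Series({2010: 18.8, 2011: 19.1})
ga = pd.Series({2008: 9.7, 2010: 9.7, 2011: 9.8})

# The columns are aligned on the union of the two indexes.
pop = pd.DataFrame({'FL': fl, 'GA': ga})

rows = list(pop.index)      # row index: the years
cols = list(pop.columns)    # column index: the states

# FL has no 2008 entry, so that cell is NaN.
fl_2008_missing = pd.isna(pop.loc[2008, 'FL'])
```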
• Columns can be retrieved as a Series
  • dict notation
  • attribute notation
• Rows can be retrieved by position or by name (using the ix attribute at the time of this talk; later pandas versions replace ix with loc and iloc)
>>> frame['state']
0 FL
1 FL
2 GA
3 GA
4 GA
Name: state
>>> frame.describe
<bound method DataFrame.describe
of pop state year
0 18.8 FL 2010
1 19.1 FL 2011
2 9.7 GA 2008
3 9.7 GA 2010
4 9.8 GA 2011>
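A sketch of both column notations and both row lookups, using loc and iloc (the current accessors) with hypothetical row labels 'a' to 'e'. It also shows a pitfall of attribute notation: a column name that matches an existing DataFrame attribute, such as pop or describe, resolves to the method rather than the column.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['FL', 'FL', 'GA', 'GA', 'GA'],
                      'year': [2010, 2011, 2008, 2010, 2011],
                      'pop': [18.8, 19.1, 9.7, 9.7, 9.8]},
                     index=list('abcde'))  # hypothetical row labels

by_key = frame['state']     # dict notation
by_attr = frame.state       # attribute notation, same column

row_by_name = frame.loc['c']      # label-based row lookup
row_by_position = frame.iloc[2]   # position-based row lookup

# Attribute access collides with existing attributes:
# frame.pop is the DataFrame.pop method, not the 'pop' column.
pop_is_method = callable(frame.pop)
```

For this reason dict notation is the safer default when column names are not known in advance.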
• New columns can be added (by computation or direct assignment)
>>> import numpy as np
>>> frame['other'] = np.nan
>>> frame
pop state year other
0 18.8 FL 2010 NaN
1 19.1 FL 2011 NaN
2 9.7 GA 2008 NaN
3 9.7 GA 2010 NaN
4 9.8 GA 2011 NaN
>>> frame['calc'] = frame['pop'] * 2
>>> frame
pop state year other calc
0 18.8 FL 2010 NaN 37.6
1 19.1 FL 2011 NaN 38.2
2 9.7 GA 2008 NaN 19.4
3 9.7 GA 2010 NaN 19.4
4 9.8 GA 2011 NaN 19.6
>>> obj = pd.Series(['blue', 'purple', 'red'],
...                 index=[0, 2, 4])
>>> obj
0 blue
2 purple
4 red
>>> obj.reindex(range(4))
0 blue
1 NaN
2 purple
3 NaN
>>> obj.reindex(range(5), fill_value='black')
0 blue
1 black
2 purple
3 black
4 red
>>> obj.reindex(range(5), method='ffill')
0 blue
1 blue
2 purple
3 purple
4 red
• Reindexing creates a new object with the data conformed to a new index
>>> pop
FL GA
2008 NaN 9.7
2010 18.8 9.7
2011 19.1 9.8
>>> pop.sum()
FL 37.9
GA 29.2
>>> pop.mean()
FL 18.950000
GA 9.733333
>>> pop.describe()
FL GA
count 2.000000 3.000000
mean 18.950000 9.733333
std 0.212132 0.057735
min 18.800000 9.700000
25% 18.875000 9.700000
50% 18.950000 9.700000
75% 19.025000 9.750000
max 19.100000 9.800000
>>> pop
FL GA
2008 NaN 9.7
2010 18.8 9.7
2011 19.1 9.8
>>> pop < 9.8
FL GA
2008 False True
2010 False True
2011 False False
>>> pop[pop < 9.8] = 0
>>> pop
FL GA
2008 NaN 0.0
2010 18.8 0.0
2011 19.1 9.8
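The assignment `pop[pop < 9.8] = 0` changes pop in place. Where the original should be kept, `mask` and `where` give a non-mutating alternative. This is a sketch of that variation, not what the talk used.

```python
import pandas as pd

pop = pd.DataFrame({'FL': {2010: 18.8, 2011: 19.1},
                    'GA': {2008: 9.7, 2010: 9.7, 2011: 9.8}})

# mask() returns a new frame with values replaced where the condition holds.
zeroed = pop.mask(pop < 9.8, 0)

# where() is the complement: it keeps values where the condition holds
# and puts NaN everywhere else.
kept = pop.where(pop >= 9.8)
```

Comparisons against NaN are False, so the missing FL value for 2008 is left as NaN by mask.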
• pandas supports several ways to load data
• Text file data
  • read_csv
  • read_table
• Structured data (JSON, XML, HTML)
  • works well with existing libraries
• Excel (depends upon the xlrd and openpyxl packages)
• Database
  • pandas.io.sql module (read_frame at the time of this talk; later replaced by read_sql)
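A minimal sketch of text loading that runs without any file on disk: read_csv accepts a file-like object as well as a path. The inline text below is a hypothetical three-row excerpt, not the full tips data set.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """total_bill,tip,day
16.99,1.01,Sun
10.34,1.66,Sun
21.01,3.50,Sun
"""

# read_csv accepts a path or any file-like object.
tips_sample = pd.read_csv(io.StringIO(csv_text))

# The same data, tab-separated, parsed with an explicit separator.
tsv_text = csv_text.replace(',', '\t')
tips_from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')
```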
>>> tips = pd.read_csv('/users/ah6/Desktop/pandas talk/data/tips.csv')
>>> tips.loc[:2]  # tips.ix[:2] in the original talk; ix was removed in pandas 1.0
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
>>> party_counts = pd.crosstab(tips.day, tips['size'])
>>> party_counts
size 1 2 3 4 5 6
day
Fri 1 16 1 1 0 0
Sat 2 53 18 13 1 0
Sun 0 39 15 18 3 1
Thur 1 48 4 5 1 3
>>> sum_by_day = party_counts.sum(1).astype(float)
>>> party_pcts = party_counts.div(sum_by_day, axis=0)
>>> party_pcts
size 1 2 3 4 5 6
day
Fri 0.052632 0.842105 0.052632 0.052632 0.000000 0.000000
Sat 0.022989 0.609195 0.206897 0.149425 0.011494 0.000000
Sun 0.000000 0.513158 0.197368 0.236842 0.039474 0.013158
Thur 0.016129 0.774194 0.064516 0.080645 0.016129 0.048387
>>> party_pcts.plot(kind='bar', stacked=True)
<matplotlib.axes.AxesSubplot at 0x6bf2ed0>
>>> tips['tip_pct'] = tips['tip'] / tips['total_bill']
>>> tips['tip_pct'].hist(bins=50)
<matplotlib.axes.AxesSubplot at 0x6c10d30>
>>> tips['tip_pct'].describe()
count 244.000000
mean 0.160803
std 0.061072
min 0.035638
25% 0.129127
50% 0.154770
75% 0.191475
max 0.710345
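The row-normalization steps above can be tried on a small stand-in table. The six-row frame here is hypothetical data, not the tips file. The `normalize='index'` option is an alternative way to get the same percentages that the talk does not use.

```python
import pandas as pd

# Hypothetical stand-in for the tips data.
tips = pd.DataFrame({
    'day':  ['Fri', 'Fri', 'Sat', 'Sat', 'Sat', 'Sun'],
    'size': [2, 2, 2, 3, 3, 4],
})

# Use tips['size']: the attribute tips.size is the frame's element count.
party_counts = pd.crosstab(tips['day'], tips['size'])

# Divide each row by its own total so that every row sums to 1.
party_pcts = party_counts.div(party_counts.sum(axis=1), axis=0)

# crosstab can also normalize the rows directly.
same_pcts = pd.crosstab(tips['day'], tips['size'], normalize='index')
```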
• Data Aggregation
  • GroupBy
  • Pivot Tables
• Time Series
  • Periods/Frequencies
  • Operations with Time Series with Different Frequencies
  • Downsampling/Upsampling
  • Plotting with TimeSeries (auto-adjust scale)
• Advanced Analysis
  • Decile and Quartile Analysis
  • Signal Frontier Analysis
  • Future Contract Rolling
  • Rolling Correlation and Linear Regression
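For a first taste of three of the topics listed above (GroupBy, pivot tables, downsampling), here is a brief sketch. The sales table and the date range are hypothetical and chosen only to keep the arithmetic easy to check.

```python
import pandas as pd

# Hypothetical data for illustration.
sales = pd.DataFrame({
    'day':    ['Fri', 'Fri', 'Sat', 'Sat'],
    'time':   ['Lunch', 'Dinner', 'Lunch', 'Dinner'],
    'amount': [10.0, 30.0, 20.0, 40.0],
})

# GroupBy: split the rows by key, then aggregate each group.
mean_by_day = sales.groupby('day')['amount'].mean()

# Pivot table: one key on the rows, another on the columns.
table = sales.pivot_table(values='amount', index='day',
                          columns='time', aggfunc='sum')

# Time series: daily values downsampled to two-day totals.
ts = pd.Series(range(6),
               index=pd.date_range('2013-01-01', periods=6, freq='D'))
two_day_totals = ts.resample('2D').sum()
```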

pandas - Python Data Analysis
