Financial data analysis in Python with pandas

                  Wes McKinney
                   @wesmckinn


                    10/17/2011




@wesmckinn ()     Data analysis with pandas   10/17/2011   1 / 22
My background




   3 years as a quant hacker at AQR, now consultant / entrepreneur
   Math and statistics background with the zest of computer science
   Active in scientific Python community
   My blog: http://blog.wesmckinney.com
   Twitter: @wesmckinn




Bare essentials for financial research



    Fast time series functionality
         Easy data alignment
         Date/time handling
         Moving window statistics
         Resampling / frequency conversion
    Fast data access (SQL databases, flat files, etc.)
    Data visualization (plotting)
    Statistical models
         Linear regression
         Time series models: ARMA, VAR, ...
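
As a rough sketch of the time series items above (not taken from the talk), the following uses current pandas calls on made-up daily data. The variable names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical daily "prices": 14 days starting Monday 2011-10-03.
dates = pd.date_range("2011-10-03", periods=14, freq="D")
prices = pd.Series(np.arange(14, dtype=float), index=dates)

# Moving window statistic: 3-day rolling mean (first two entries are NaN).
rolling_mean = prices.rolling(window=3).mean()

# Frequency conversion: downsample daily data to weekly bins
# (weekly bins end on Sunday by default).
weekly_last = prices.resample("W").last()
weekly_mean = prices.resample("W").mean()

# Date/time handling: select a date range by label (both ends inclusive).
first_three = prices.loc["2011-10-03":"2011-10-05"]
```

The data covers exactly two Monday-to-Sunday weeks, so each resampled series has two entries.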




Would be nice to have




   Portfolio and risk analytics, backtesting
        Easy enough to write yourself, though most people do a bad job of it
   Portfolio optimization
        Most financial firms use a 3rd party library anyway
   Derivative pricing
        Can use QuantLib in most languages




What are financial firms using?



   HFT: a C++ and hardware arms race, a different topic
   Research
        Mainstream: R, MATLAB, Python, ...
        Econometrics: Stata, EViews, RATS, etc.
        Non-programmatic environments: ClariFI, Palantir, ...
   Production
        Popular: Java, C#, C++
        Less popular, but growing: Python
        Fringe: Functional languages (OCaml, Haskell, F#)




What are financial firms using?



   Many hybrid-language environments (e.g. Java/R, C++/R,
   C++/MATLAB, Python/C++)
        Which is the main implementation language?
        If main language is Java/C++, result is lower productivity and higher
        cost of prototyping new functionality
   Trends
        Banks and hedge funds are realizing that Java-based production
        systems can be replaced with 20% as much Python code (or less)
        MATLAB is being increasingly ditched in favor of Python. R and
        Python use for research generally growing




Python language



   Simple, expressive syntax
   Designed for readability, like “runnable pseudocode”
   Easy-to-use, powerful built-in types and data structures:
        Lists and tuples (fixed-size, immutable lists)
        Dicts (hash maps / associative arrays) and sets
   Everything’s an object, including functions
   “There should be one-- and preferably only one --obvious way to do it”
   “Batteries included”: great general purpose standard library
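
A minimal sketch of the built-in types listed above, using only the core language. The ticker names and quantities are made up for illustration.

```python
# Lists are mutable sequences; tuples are fixed-size and immutable.
tickers = ["AAA", "BBB"]
tickers.append("CCC")

point = (1.5, 2.5)
try:
    point[0] = 0.0          # tuples do not support item assignment
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

# Dicts map keys to values; sets keep only unique items.
positions = {"AAA": 100, "BBB": -50}
sectors = {"tech", "energy", "tech"}   # the duplicate collapses

# Functions are objects: they can be stored in a dict and passed around.
def double(x):
    return 2 * x

transforms = {"double": double, "negate": lambda x: -x}
scaled = {name: transforms["double"](qty) for name, qty in positions.items()}
```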




A simple example: quicksort


Pseudocode from Wikipedia:

function qsort(array)
    if length(array) < 2
        return array
    var list less, greater
    select and remove a pivot value pivot from array
    for each x in array
        if x < pivot then append x to less
        else append x to greater
    return concat(qsort(less), pivot, qsort(greater))




A simple example: quicksort

First try Python implementation:
def qsort(array):
    # Lists of length 0 or 1 are already sorted
    if len(array) < 2:
        return array

    less, greater = [], []

    # Use the first element as the pivot
    pivot, rest = array[0], array[1:]

    for x in rest:
        if x < pivot:
            less.append(x)
        else:
            greater.append(x)

    return qsort(less) + [pivot] + qsort(greater)



A simple example: quicksort



Use list comprehensions:
def qsort(array):
    if len(array) < 2:
        return array

    pivot, rest = array[0], array[1:]
    less = [x for x in rest if x < pivot]
    greater = [x for x in rest if x >= pivot]

    return qsort(less) + [pivot] + qsort(greater)




A simple example: quicksort




Heck, fit it onto one line!
qs = lambda r: (r if len(r) < 2
                else (qs([x for x in r[1:] if x < r[0]])
                      + [r[0]]
                      + qs([x for x in r[1:] if x >= r[0]])))

Though that’s starting to look like Lisp code...




A simple example: quicksort


A quicksort using NumPy arrays
import numpy as np

def qsort(array):
    # Expects a 1-D NumPy array
    if len(array) < 2:
        return array
    pivot, rest = array[0], array[1:]
    less = rest[rest < pivot]          # boolean-mask selection
    greater = rest[rest >= pivot]
    return np.r_[qsort(less), [pivot], qsort(greater)]

Of course no need for this when you can just do:

sorted_array = np.sort(array)




Python: drunk with power
This comic has way too much airtime, but: [comic image not reproduced]




Staples of Python for science: MINS




   (M) matplotlib: plotting and data visualization
   (I) IPython: rich interactive computing and development environment
   (N) NumPy: multi-dimensional arrays, linear algebra, FFTs, random
   number generation, etc.
   (S) SciPy: optimization, probability distributions, signal processing,
   ODEs, sparse matrices, ...
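
To give a flavor of the NumPy items listed above, here is a small NumPy-only sketch (matplotlib, IPython and SciPy are left out to keep it self-contained). The arrays and the seed are arbitrary illustrative choices.

```python
import numpy as np

# Multi-dimensional array and vectorized arithmetic (no explicit loops).
a = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
scaled = a * 10

# Linear algebra: solve the system a @ x = b.
x = np.linalg.solve(a, b)

# Random number generation with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
draws = rng.standard_normal(1000)

# FFT of a pure sine wave: the spectrum peaks at the sine's frequency bin.
t = np.arange(64)
signal = np.sin(2 * np.pi * 4 * t / 64)
peak_bin = int(np.argmax(np.abs(np.fft.rfft(signal))))
```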




Why did Python become popular in science?



   NumPy traces its roots to 1995
   Extremely easy to integrate C/C++/Fortran code
   Access fast low level algorithms in a high level, interpreted language
   The language itself
        “It fits in your head”
        “It [Python] doesn’t get in my way” - Robert Kern
   Python is good at all the things other scientific programming
   languages are not good at (e.g. networking, string processing, OOP)
   Liberal, BSD-style licensing: can use Python for commercial applications




Some exciting stuff in the last few years


    Cython
         “Augmented” Python language with type declarations, for generating
         compiled extensions
         C-like speedups with Python-like development time
    IPython: enhanced interactive Python interpreter
         The best research and software development env for Python
         An integrated parallel / distributed computing backend
         GUI console with inline plotting and a rich HTML notebook (more on
         this later)
    PyCUDA / PyOpenCL: GPU computing in Python
         Transformed Python overnight into one of the best languages for doing
         GPU computing




Where has Python historically been weak?



   Rich data structures for data analysis and statistics
         NumPy arrays, while powerful, feel distinctly “lower level” if you’re
         used to R’s data.frame
         pandas has filled this gap over the last 2 years
   Statistics libraries
         Nowhere near the depth of R’s CRAN repository
         statsmodels provides tested implementations of a lot of standard
         regression and time series models
         Turns out that most financial data analysis requires only fairly
         elementary statistical models
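
As one example of a "fairly elementary" model, a linear regression can be sketched with NumPy's least-squares solver alone. This is an illustrative stand-in rather than the statsmodels interface, and the data is made up so that the true coefficients are known.

```python
import numpy as np

# Made-up data: y = 1 + 2*x exactly, so the fit should recover (1, 2).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares via NumPy's least-squares solver.
coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef
```

A dedicated statistics library would add standard errors and diagnostics on top of the point estimates computed here.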




pandas library



    Began building at AQR in 2008, open-sourced late 2009
    Why
         R / MATLAB, while good for research / data analysis, are not suitable
         implementation languages for large-scale production systems
                (I personally don’t care for them for data analysis)
         Existing data structures for time series in R / MATLAB were too
         limited / not flexible enough for my needs
    Core idea: indexed data structures capable of storing heterogeneous
    data
    Etymology: panel data structures




pandas in a nutshell



    A clean axis indexing design to support fast data alignment, lookups,
    hierarchical indexing, and more
    High-performance data structures
         Series/TimeSeries: 1D labeled vector
         DataFrame: 2D spreadsheet-like structure
         Panel: 3D labeled array, collection of DataFrames
    SQL-like functionality: GroupBy, joining/merging, etc.
    Missing data handling
    Time series functionality
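
The features above can be sketched with current pandas as follows. Panel and the separate TimeSeries class are omitted because they no longer exist in current pandas releases. All table contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# Series: 1D labeled vector.
s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])

# Missing data handling: NaN is skipped by default in reductions.
total = s.sum()
filled = s.fillna(0.0)

# DataFrame: 2D table with labeled rows and columns (hypothetical trades).
trades = pd.DataFrame({
    "ticker": ["AAA", "BBB", "AAA", "BBB"],
    "qty": [10, 20, 30, 40],
})

# SQL-like GroupBy: total quantity per ticker.
qty_by_ticker = trades.groupby("ticker")["qty"].sum()

# SQL-like join: attach a sector to each trade.
sectors = pd.DataFrame({"ticker": ["AAA", "BBB"], "sector": ["tech", "energy"]})
merged = trades.merge(sectors, on="ticker", how="left")
```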




pandas design philosophy



   “Think outside the matrix”: stop thinking about shape and start
   thinking about indexes
   Indexing and data alignment are essential
   Fault-tolerance: save you from common blunders caused by coding
   errors (specifically misaligned data)
   Lift the best features of other data analysis environments (R,
   MATLAB, Stata, etc.) and make them better, faster
   Performance and usability equally important
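
A small sketch of the misalignment blunder mentioned above: two made-up series carry the same labels in a different order. Adding the raw arrays pairs values by position, while adding the pandas objects pairs them by label.

```python
import numpy as np
import pandas as pd

# Same labels, different order (illustrative values).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([30.0, 10.0, 20.0], index=["z", "x", "y"])

# NumPy adds by position, silently pairing x with z, y with x, z with y.
by_position = a.to_numpy() + b.to_numpy()

# pandas adds by label, so each value meets its true counterpart.
by_label = a + b
```

Here `by_position` mixes unrelated entries without any warning, which is the kind of error that index-based alignment is meant to prevent.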




The pandas killer feature: indexing



    Each axis has an index
    Automatic alignment between differently-indexed objects: makes it
    nearly impossible to accidentally combine misaligned data
    Hierarchical indexing provides an intuitive way of structuring and
    working with higher-dimensional data
    Natural way of expressing “group by” and join-type operations
    Better integrated and more flexible indexing than anything available
    in R or MATLAB
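
The indexing ideas above can be illustrated roughly like this, using current pandas and made-up values. The level names `date` and `ticker` are assumptions chosen for the example.

```python
import pandas as pd

# Differently-indexed objects: labels missing on either side become NaN.
left = pd.Series([1.0, 2.0], index=["a", "b"])
right = pd.Series([10.0, 20.0], index=["b", "c"])
combined = left + right

# Hierarchical index: (date, ticker) pairs on a single axis.
index = pd.MultiIndex.from_product(
    [["2011-10-14", "2011-10-17"], ["AAA", "BBB"]],
    names=["date", "ticker"],
)
returns = pd.Series([0.01, -0.02, 0.03, 0.04], index=index)

# Select all tickers for one date through the outer level.
one_day = returns.loc["2011-10-17"]

# "Group by" expressed through an index level.
mean_by_ticker = returns.groupby(level="ticker").mean()

# Pivot the inner level into columns: a 2D view of the same data.
table = returns.unstack("ticker")
```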




Tutorial time




                     To the IPython console!





More Related Content

What's hot

7. sequence and collaboration diagrams
7. sequence and collaboration diagrams7. sequence and collaboration diagrams
7. sequence and collaboration diagramsAPU
 
An introduction to Jupyter notebooks and the Noteable service
An introduction to Jupyter notebooks and the Noteable serviceAn introduction to Jupyter notebooks and the Noteable service
An introduction to Jupyter notebooks and the Noteable serviceJisc
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data AnalysisAndrew Henshaw
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLPAnuj Gupta
 
Graphics pipelining
Graphics pipeliningGraphics pipelining
Graphics pipeliningAreena Javed
 
Python pandas tutorial
Python pandas tutorialPython pandas tutorial
Python pandas tutorialHarikaReddy115
 
software project management
software project managementsoftware project management
software project managementdeep sharma
 
Software engineering a practitioners approach 8th edition pressman solutions ...
Software engineering a practitioners approach 8th edition pressman solutions ...Software engineering a practitioners approach 8th edition pressman solutions ...
Software engineering a practitioners approach 8th edition pressman solutions ...Drusilla918
 
Component and Deployment Diagram - Brief Overview
Component and Deployment Diagram - Brief OverviewComponent and Deployment Diagram - Brief Overview
Component and Deployment Diagram - Brief OverviewRajiv Kumar
 
R Programming: Introduction To R Packages
R Programming: Introduction To R PackagesR Programming: Introduction To R Packages
R Programming: Introduction To R PackagesRsquared Academy
 
Naive bayesian classification
Naive bayesian classificationNaive bayesian classification
Naive bayesian classificationDr-Dipali Meher
 
R Programming Language
R Programming LanguageR Programming Language
R Programming LanguageNareshKarela1
 
SAD06 - Use Case Diagrams
SAD06 - Use Case DiagramsSAD06 - Use Case Diagrams
SAD06 - Use Case DiagramsMichael Heron
 
Structured Vs, Object Oriented Analysis and Design
Structured Vs, Object Oriented Analysis and DesignStructured Vs, Object Oriented Analysis and Design
Structured Vs, Object Oriented Analysis and DesignMotaz Saad
 
Cs8092 computer graphics and multimedia unit 3
Cs8092 computer graphics and multimedia unit 3Cs8092 computer graphics and multimedia unit 3
Cs8092 computer graphics and multimedia unit 3SIMONTHOMAS S
 

What's hot (20)

Ch 11-component-level-design
Ch 11-component-level-designCh 11-component-level-design
Ch 11-component-level-design
 
7. sequence and collaboration diagrams
7. sequence and collaboration diagrams7. sequence and collaboration diagrams
7. sequence and collaboration diagrams
 
Asp.net file types
Asp.net file typesAsp.net file types
Asp.net file types
 
An introduction to Jupyter notebooks and the Noteable service
An introduction to Jupyter notebooks and the Noteable serviceAn introduction to Jupyter notebooks and the Noteable service
An introduction to Jupyter notebooks and the Noteable service
 
NUMPY
NUMPY NUMPY
NUMPY
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data Analysis
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLP
 
Graphics pipelining
Graphics pipeliningGraphics pipelining
Graphics pipelining
 
Python pandas tutorial
Python pandas tutorialPython pandas tutorial
Python pandas tutorial
 
software project management
software project managementsoftware project management
software project management
 
Software engineering a practitioners approach 8th edition pressman solutions ...
Software engineering a practitioners approach 8th edition pressman solutions ...Software engineering a practitioners approach 8th edition pressman solutions ...
Software engineering a practitioners approach 8th edition pressman solutions ...
 
Pandas
PandasPandas
Pandas
 
Component and Deployment Diagram - Brief Overview
Component and Deployment Diagram - Brief OverviewComponent and Deployment Diagram - Brief Overview
Component and Deployment Diagram - Brief Overview
 
R Programming: Introduction To R Packages
R Programming: Introduction To R PackagesR Programming: Introduction To R Packages
R Programming: Introduction To R Packages
 
Naive bayesian classification
Naive bayesian classificationNaive bayesian classification
Naive bayesian classification
 
R Programming Language
R Programming LanguageR Programming Language
R Programming Language
 
SAD06 - Use Case Diagrams
SAD06 - Use Case DiagramsSAD06 - Use Case Diagrams
SAD06 - Use Case Diagrams
 
Structured Vs, Object Oriented Analysis and Design
Structured Vs, Object Oriented Analysis and DesignStructured Vs, Object Oriented Analysis and Design
Structured Vs, Object Oriented Analysis and Design
 
Cs8092 computer graphics and multimedia unit 3
Cs8092 computer graphics and multimedia unit 3Cs8092 computer graphics and multimedia unit 3
Cs8092 computer graphics and multimedia unit 3
 
NLP in Cognitive Systems
NLP in Cognitive SystemsNLP in Cognitive Systems
NLP in Cognitive Systems
 

Similar to Python for Financial Data Analysis with pandas

SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 Andrey Vykhodtsev
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionDataWorks Summit
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonMOHITKUMAR1379
 
Data science-toolchain
Data science-toolchainData science-toolchain
Data science-toolchainJie-Han Chen
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Rattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageRattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageMajid Abdollahi
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
About "Apache Cassandra"
About "Apache Cassandra"About "Apache Cassandra"
About "Apache Cassandra"Jihyun Ahn
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterLinaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchAndrew Lowe
 
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...Gezim Sejdiu
 
Data Vault vs Data Lake: What's the difference?
Data Vault vs Data Lake: What's the difference?Data Vault vs Data Lake: What's the difference?
Data Vault vs Data Lake: What's the difference?Fru Louis
 

Similar to Python for Financial Data Analysis with pandas (20)

SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly Detection
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
Data science-toolchain
Data science-toolchainData science-toolchain
Data science-toolchain
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Rattle Graphical Interface for R Language
Rattle Graphical Interface for R LanguageRattle Graphical Interface for R Language
Rattle Graphical Interface for R Language
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
About "Apache Cassandra"
About "Apache Cassandra"About "Apache Cassandra"
About "Apache Cassandra"
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
 
Data Vault vs Data Lake: What's the difference?
Data Vault vs Data Lake: What's the difference?Data Vault vs Data Lake: What's the difference?
Data Vault vs Data Lake: What's the difference?
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 

More from Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Python for Financial Data Analysis with pandas

  • 1. Financial data analysis in Python with pandas Wes McKinney @wesmckinn 10/17/2011 @wesmckinn () Data analysis with pandas 10/17/2011 1 / 22
  • 2. My background
       3 years as a quant hacker at AQR, now consultant / entrepreneur
       Math and statistics background with the zest of computer science
       Active in scientific Python community
       My blog: http://blog.wesmckinney.com
       Twitter: @wesmckinn
  • 3. Bare essentials for financial research
       Fast time series functionality
           Easy data alignment
           Date/time handling
           Moving window statistics
           Resampling / frequency conversion
       Fast data access (SQL databases, flat files, etc.)
       Data visualization (plotting)
       Statistical models
           Linear regression
           Time series models: ARMA, VAR, ...
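A minimal sketch of the time series essentials listed on this slide, written against the current pandas API (`rolling`, `resample`), which differs from the 2011 API used in the talk. The price series, the flat benchmark, and all variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices on business days (synthetic data, fixed seed).
rng = np.random.default_rng(0)
dates = pd.date_range("2011-01-03", periods=60, freq="B")
prices = pd.Series(100 + rng.normal(0, 1, len(dates)).cumsum(), index=dates)

# Moving window statistic: 20-day rolling mean (NaN until the window fills).
rolling_mean = prices.rolling(window=20).mean()

# Frequency conversion: last observed price in each calendar week.
weekly_last = prices.resample("W").last()

# Data alignment: arithmetic matches on dates, not on position.
# The hypothetical benchmark covers only part of the date range,
# so dates it lacks come out as NaN in the result.
benchmark = pd.Series(100.0, index=dates[10:50])
excess = prices - benchmark
```

The point of the last step is that no manual date matching is needed: the subtraction lines the two series up by their index labels.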
  • 4. Would be nice to have
       Portfolio and risk analytics, backtesting
           Easy enough to write yourself, though most people do a bad job of it
       Portfolio optimization
           Most financial firms use a 3rd party library anyway
       Derivative pricing
           Can use QuantLib in most languages
  • 5. What are financial firms using?
       HFT: a C++ and hardware arms race, a different topic
       Research
           Mainstream: R, MATLAB, Python, ...
           Econometrics: Stata, EViews, RATS, etc.
           Non-programmatic environments: ClariFI, Palantir, ...
       Production
           Popular: Java, C#, C++
           Less popular, but growing: Python
           Fringe: Functional languages (OCaml, Haskell, F#)
  • 6. What are financial firms using?
       Many hybrid-language environments (e.g. Java/R, C++/R, C++/MATLAB, Python/C++)
           Which is the main implementation language?
           If the main language is Java/C++, the result is lower productivity and a higher cost of prototyping new functionality
       Trends
           Banks and hedge funds are realizing that Java-based production systems can be replaced with 20% as much Python code (or less)
           MATLAB is being increasingly ditched in favor of Python. R and Python use for research generally growing
  • 7. Python language
       Simple, expressive syntax
       Designed for readability, like “runnable pseudocode”
       Easy-to-use, powerful built-in types and data structures:
           Lists and tuples (fixed-size, immutable lists)
           Dicts (hash maps / associative arrays) and sets
       Everything’s an object, including functions
       “There should be one, and preferably only one way to do it”
       “Batteries included”: great general purpose standard library
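The built-in types named on this slide can be sketched in a few lines. This is only an illustration; the tickers, prices, and the helper names `apply_twice` and `double` are hypothetical.

```python
# Lists are mutable sequences; tuples are fixed-size and immutable.
tickers = ["AAPL", "MSFT"]
tickers.append("GOOG")
point = (1.5, 2.5)

# Dicts map keys to values; sets keep only unique items.
prices = {"AAPL": 400.0, "MSFT": 27.0}  # made-up numbers
seen = set(["AAPL", "AAPL", "GOOG"])

# Functions are objects: they can be stored, passed around, and returned.
def apply_twice(func, value):
    """Call func on value, then call func again on the result."""
    return func(func(value))

def double(x):
    return 2 * x

result = apply_twice(double, 3)
```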
  • 8. A simple example: quicksort
       Pseudocode from Wikipedia:

           function qsort(array)
               if length(array) < 2
                   return array
               var list less, greater
               select and remove a pivot value pivot from array
               for each x in array
                   if x < pivot then append x to less
                   else append x to greater
               return concat(qsort(less), pivot, qsort(greater))
  • 9. A simple example: quicksort
       First try Python implementation:

           def qsort(array):
               if len(array) < 2:
                   return array
               less, greater = [], []
               pivot, rest = array[0], array[1:]
               for x in rest:
                   if x < pivot:
                       less.append(x)
                   else:
                       greater.append(x)
               return qsort(less) + [pivot] + qsort(greater)
  • 10. A simple example: quicksort
       Use list comprehensions:

           def qsort(array):
               if len(array) < 2:
                   return array
               pivot, rest = array[0], array[1:]
               less = [x for x in rest if x < pivot]
               greater = [x for x in rest if x >= pivot]
               return qsort(less) + [pivot] + qsort(greater)
  • 11. A simple example: quicksort
       Heck, fit it onto one line!

           qs = lambda r: (r if len(r) < 2 else (qs([x for x in r[1:] if x < r[0]]) + [r[0]] + qs([x for x in r[1:] if x >= r[0]])))

       Though that’s starting to look like Lisp code...
  • 12. A simple example: quicksort
       A quicksort using NumPy arrays

           import numpy as np

           def qsort(array):
               if len(array) < 2:
                   return array
               pivot, rest = array[0], array[1:]
               # Boolean masks select the elements on each side of the pivot
               less = rest[rest < pivot]
               greater = rest[rest >= pivot]
               # np.r_ concatenates the pieces into a single array
               return np.r_[qsort(less), [pivot], qsort(greater)]

       Of course no need for this when you can just do:

           sorted_array = np.sort(array)
  • 13. Python: drunk with power
       This comic has way too much airtime but:
       [comic image]
  • 14. Staples of Python for science: MINS
       (M) matplotlib: plotting and data visualization
       (I) IPython: rich interactive computing and development environment
       (N) NumPy: multi-dimensional arrays, linear algebra, FFTs, random number generation, etc.
       (S) SciPy: optimization, probability distributions, signal processing, ODEs, sparse matrices, ...
  • 15. Why did Python become popular in science?
       NumPy traces its roots to 1995
       Extremely easy to integrate C/C++/Fortran code
           Access fast low level algorithms in a high level, interpreted language
       The language itself
           “It fits in your head”
           “It [Python] doesn’t get in my way” - Robert Kern
       Python is good at all the things other scientific programming languages are not good at (e.g. networking, string processing, OOP)
       Liberal BSD license: can use Python for commercial applications
  • 16. Some exciting stuff in the last few years
       Cython
           “Augmented” Python language with type declarations, for generating compiled extensions
           C-like speedups with Python-like development time
       IPython: enhanced interactive Python interpreter
           The best research and software development env for Python
           An integrated parallel / distributed computing backend
           GUI console with inline plotting and a rich HTML notebook (more on this later)
       PyCUDA / PyOpenCL: GPU computing in Python
           Transformed Python overnight into one of the best languages for doing GPU computing
  • 17. Where has Python historically been weak?
       Rich data structures for data analysis and statistics
           NumPy arrays, while powerful, feel distinctly “lower level” if you’re used to R’s data.frame
           pandas has filled this gap over the last 2 years
       Statistics libraries
           Nowhere near the depth of R’s CRAN repository
           statsmodels provides tested implementations of a lot of standard regression and time series models
           Turns out that most financial data analysis requires only fairly elementary statistical models
  • 18. pandas library
       Began building at AQR in 2008, open-sourced late 2009
       Why
           R / MATLAB, while good for research / data analysis, are not suitable implementation languages for large-scale production systems
               (I personally don’t care for them for data analysis)
           Existing data structures for time series in R / MATLAB were too limited / not flexible enough for my needs
       Core idea: indexed data structures capable of storing heterogeneous data
       Etymology: panel data structures
  • 19. pandas in a nutshell
       A clean axis indexing design to support fast data alignment, lookups, hierarchical indexing, and more
       High-performance data structures
           Series/TimeSeries: 1D labeled vector
           DataFrame: 2D spreadsheet-like structure
           Panel: 3D labeled array, collection of DataFrames
       SQL-like functionality: GroupBy, joining/merging, etc.
       Missing data handling
       Time series functionality
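As a rough illustration of the pieces named on this slide, here is a short sketch using the current pandas API (Panel and TimeSeries have since been removed from pandas, so they are not shown). The trade data, sectors, and variable names are invented for the example.

```python
import numpy as np
import pandas as pd

# Series: a 1D labeled vector.
s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])

# Missing data handling: NaN is skipped by default, or can be filled.
total = s.sum()
filled = s.fillna(0.0)

# DataFrame: a 2D table whose columns may hold different types.
trades = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "AAPL", "MSFT"],
    "qty": [100, 200, 50, 25],
})

# SQL-like GroupBy: total quantity per ticker.
qty_by_ticker = trades.groupby("ticker")["qty"].sum()

# SQL-like join: attach a (hypothetical) sector to each trade.
sectors = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "sector": ["Tech", "Tech"]})
merged = trades.merge(sectors, on="ticker", how="left")
```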
  • 20. pandas design philosophy
       “Think outside the matrix”: stop thinking about shape and start thinking about indexes
       Indexing and data alignment are essential
       Fault-tolerance: save you from common blunders caused by coding errors (specifically misaligned data)
       Lift the best features of other data analysis environments (R, MATLAB, Stata, etc.) and make them better, faster
       Performance and usability equally important
  • 21. The pandas killer feature: indexing
       Each axis has an index
       Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data
       Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data
       Natural way of expressing “group by” and join-type operations
       Better integrated and more flexible indexing than anything available in R or MATLAB
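The alignment and hierarchical indexing described on this slide can be sketched as follows, using the current pandas API. The labels and return values are made up for illustration.

```python
import pandas as pd

# Two Series with overlapping but different labels.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])

# Addition aligns on labels; labels present in only one operand give NaN.
total = a + b

# Hierarchical indexing: two index levels on a single axis.
idx = pd.MultiIndex.from_tuples(
    [("AAPL", 2010), ("AAPL", 2011), ("MSFT", 2010), ("MSFT", 2011)],
    names=["ticker", "year"],
)
returns = pd.Series([0.5, 0.25, -0.125, 0.0], index=idx)  # made-up values

# Selecting on the outer level yields a Series indexed by the inner level.
aapl = returns.loc["AAPL"]

# "Group by" expressed through an index level.
mean_by_ticker = returns.groupby(level="ticker").mean()
```

Because `a + b` matches on labels rather than position, the values for "y" and "z" are combined correctly even though they sit at different positions in the two Series.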
  • 22. Tutorial time
       To the IPython console!