Wes McKinney @wesmckinn
SHARED INFRASTRUCTURE FOR DATA SCIENCE
WES MCKINNEY @WESMCKINN
Rice Data Science Conference | October 2017
ME
IMPORTANT LEGAL INFORMATION
• The information presented here is offered for informational purposes only and should not be used for any other purpose
(including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes
only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any
offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of
Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at
any time.
• Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such
copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for
identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright
or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma,
nor vice versa.
• Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
THINKING ON THE LAST 10 YEARS
2007 2017
CLOSED SOURCE OPEN SOURCE
Shared front-ends
for data science
THE NEXT 10 YEARS AND BEYOND
2017 2027 …
THE AI ARMS RACE
CHANGING HARDWARE LANDSCAPE
DISK
PROCESSING
MEMORY
DATA SCIENCE LANGUAGE “SILOS”
FRONT-END
PYTHON R JVM JULIA …
WHAT’S IN A SILO?
• STORAGE / DATA ACCESS
• DATA STRUCTURES / IN-MEMORY FORMATS
• GENERAL COMPUTE ENGINE(S)
• ADVANCED ANALYTICS
In Python: pandas, NumPy, scikit-learn
RENOVATING PANDAS
MAKING THE SILOS “SMALLER”
FRONT-END
PYTHON R JVM JULIA
?
…
PROGRAMMING LANGUAGES
AS USER INTERFACES
GRAPHIC: Iceberg under sea (only top part visible to naked eye)
SAME ANALYSIS, DIFFERENT IMPLEMENTATION
R
df <- read_csv(…)
df %>% group_by(…) %>% summarise(…)
PYTHON
df = read_csv(…)
df.groupby(…).aggregate(…)
A SHARED RUNTIME FOR DATA SCIENCE
FRONT-END
PYTHON R JVM JULIA
SHARED DATA SCIENCE RUNTIME
…
FROM IDEA TO ACTION
PART 1: STANDARD IN-MEMORY FORMAT
Non-Portable Data Frames: R, PYTHON, JVM, …
PORTABLE DATA FRAME
PART 2: ZERO COPY INTERCHANGE
R PYTHON JVM
SHARED MEMORY + STANDARD MEMORY FORMATS
…
PART 3: HIGH PERFORMANCE DATA ACCESS
Storage Formats / Databases: BINARY COLUMNAR, CSV, SQL, …
PORTABLE DATA FRAME
PART 4: FLEXIBLE COMPUTATION ENGINE
• Zero-overhead User-defined Functions
• Portable Operator “Graphs”
• “Embeddable” in Larger Systems
APACHE ARROW
Language-agnostic Data Frame Format
Zero-Copy Interchange
Simple, fast data interchange
GRAPHIC: Without Arrow vs. With Arrow
Big picture Arrow goals
• Cache-efficient columnar memory: optimized for CPU affinity and SIMD / parallel processing, O(1) random value access
• Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based and streaming binary formats
• Complex schema support: Flat and nested data types
• Main implementations in C++ and Java: with integration tests
• Bindings / implementations for C, Python, Ruby, JavaScript in various stages of development
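To make the "columnar memory" and "zero-copy" goals concrete, here is a minimal sketch in plain Python. It is not the Arrow format or the Arrow API: `Float64Column` is a hypothetical toy class that stores one contiguous value buffer plus a validity bitmap (a set bit marks a non-null slot), and uses a standard-library `memoryview` to hand out the buffer without copying it.

```python
from array import array


class Float64Column:
    """Toy columnar array (hypothetical): contiguous values + validity bitmap."""

    def __init__(self, values):
        # Null slots get a placeholder 0.0 in the contiguous value buffer.
        self.values = array("d", [0.0 if v is None else v for v in values])
        # One bit per slot, least-significant bit first; 1 means "valid".
        self.validity = bytearray((len(values) + 7) // 8)
        for i, v in enumerate(values):
            if v is not None:
                self.validity[i // 8] |= 1 << (i % 8)

    def __len__(self):
        return len(self.values)

    def __getitem__(self, i):
        # O(1) random access: one bit test plus one indexed read.
        if not (self.validity[i // 8] >> (i % 8)) & 1:
            return None
        return self.values[i]

    def buffer(self):
        # A memoryview exposes the same memory; no bytes are copied.
        return memoryview(self.values)


col = Float64Column([1.5, None, 3.0])
view = col.buffer()      # a second "consumer" of the same memory
col.values[2] = 4.0      # a write through the column ...
shared = view[2]         # ... is visible through the view: 4.0
```

Because the consumer reads the producer's buffer directly, handing data over costs no serialization step, which is the property the zero-copy bullet refers to.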
BUILDING THE ARROW FORMAT
• “Superset” of representations supported by R, pandas, SQL engines
• Optimized for CPU cache affinity
• ASF Governance: Open + Transparent Community Project
FEATHER: MINIMALIST ARROW ON DISK
Some Arrow OSS Users
Feather Format
Ray Project
FROM ARROW TO PANDAS2
Logical Operator Graphs
(a + b).log()
GRAPHIC: operator graph, a and b feed Add, which feeds Log
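A minimal sketch of how an expression such as `(a + b).log()` can build a graph instead of computing right away. The class names (`Expr`, `Column`, `Add`, `Log`) and the `evaluate` method are hypothetical, chosen for illustration; they are not the pandas2 or Arrow API.

```python
import math


class Expr:
    """Base class for nodes in a hypothetical logical operator graph."""

    def __add__(self, other):
        return Add(self, other)

    def log(self):
        return Log(self)


class Column(Expr):
    """Leaf node: refers to an input column by name."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, data):
        return data[self.name]

    def __repr__(self):
        return self.name


class Add(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, data):
        lhs, rhs = self.left.evaluate(data), self.right.evaluate(data)
        return [x + y for x, y in zip(lhs, rhs)]

    def __repr__(self):
        return f"Add({self.left!r}, {self.right!r})"


class Log(Expr):
    def __init__(self, child):
        self.child = child

    def evaluate(self, data):
        return [math.log(x) for x in self.child.evaluate(data)]

    def __repr__(self):
        return f"Log({self.child!r})"


a, b = Column("a"), Column("b")
expr = (a + b).log()   # builds the graph; nothing is computed yet
print(expr)            # Log(Add(a, b))
result = expr.evaluate({"a": [1.0, 2.0], "b": [0.0, 1.0]})
```

Keeping the graph separate from its execution is what lets an engine inspect and rewrite it before running anything.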
Terminology
• Kernel functions: atomic units of computation
• Operator nodes: input/output types, operator parallelism properties
Parallel Execution of Operator Graphs
GRAPHIC: a, b → ADD → tmp → LOG → out
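One way to picture parallel execution of the ADD → LOG graph is to split the inputs into chunks and run the same kernels on each chunk independently. The sketch below uses the standard-library `ThreadPoolExecutor`; the function names and the chunk size are assumptions for illustration, and threads running pure Python in CPython will not give a real multicore speedup, so this only shows the scheduling shape.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def add_log_kernel(chunk):
    """Run ADD then LOG on one chunk of the inputs."""
    a_chunk, b_chunk = chunk
    tmp = [x + y for x, y in zip(a_chunk, b_chunk)]   # ADD -> tmp
    return [math.log(t) for t in tmp]                 # LOG -> out


def chunks(a, b, size):
    """Yield aligned slices of both input columns."""
    for start in range(0, len(a), size):
        yield a[start:start + size], b[start:start + size]


def parallel_add_log(a, b, chunk_size=2, workers=2):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so chunks stay aligned.
        parts = pool.map(add_log_kernel, chunks(a, b, chunk_size))
        return [value for part in parts for value in part]


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [1.0, 1.0, 1.0, 1.0, 1.0]
out = parallel_add_log(a, b)
```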
Some Optimization strategies
• Multicore scheduling
• Elimination of temporaries
• Operator fusion / pipelining
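As a small illustration of temporary elimination and operator fusion, the sketch below evaluates `(a + b).log()` two ways: once with a materialized intermediate, and once in a single fused pass. The function names `unfused` and `fused` are hypothetical; a real engine would apply this rewrite to the operator graph rather than to hand-written functions.

```python
import math


def unfused(a, b):
    # Two kernels and one temporary: tmp = a + b, then out = log(tmp).
    tmp = [x + y for x, y in zip(a, b)]
    return [math.log(t) for t in tmp]


def fused(a, b):
    # One pass over the inputs: add and log are applied per element,
    # so no intermediate list is allocated.
    return [math.log(x + y) for x, y in zip(a, b)]


a = [1.0, 2.0, 3.0]
b = [0.0, 1.0, 2.0]
same = unfused(a, b) == fused(a, b)   # True: same values, one fewer buffer
```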
Arrow-optimized data connectors
Arrow in-memory format
Logical Data Frame Expression Graphs
Parallel Dataflow Execution Engine
Python user API, DataFrame semantics, User-defined functions
pandas2
Apache Arrow
BUILDING THE FUTURE
THANK YOU
WES MCKINNEY @WESMCKINN
Apache Arrow: http://arrow.apache.org

Finology Group – Insurtech Innovation Award 2024
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Shared Infrastructure for Data Science

  • 1. SHARED INFRASTRUCTURE FOR DATA SCIENCE Wes McKinney @wesmckinn Rice Data Science Conference | October 2017
  • 3. IMPORTANT LEGAL INFORMATION • The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. • Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. • Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
  • 4. THINKING ON THE LAST 10 YEARS 2007 2017
  • 7. THE NEXT 10 YEARS AND BEYOND 2017 2027 …
  • 8. THE AI ARMS RACE
  • 10. DATA SCIENCE LANGUAGE “SILOS” FRONT-END PYTHON R JVM JULIA …
  • 11. WHAT’S IN A SILO? STORAGE / DATA ACCESS, DATA STRUCTURES / IN-MEMORY FORMATS, GENERAL COMPUTE ENGINE(S), ADVANCED ANALYTICS
  • 12. WHAT’S IN A SILO? STORAGE / DATA ACCESS, DATA STRUCTURES / IN-MEMORY FORMATS, GENERAL COMPUTE ENGINE(S), ADVANCED ANALYTICS pandas NumPy pandas NumPy pandas scikit-learn
  • 14.
  • 15. MAKING THE SILOS “SMALLER” FRONT-END PYTHON R JVM JULIA ? …
  • 17. GRAPHIC: Iceberg under sea (only top part visible to naked eye)
  • 18. SAME ANALYSIS, DIFFERENT IMPLEMENTATION R: df <- read_csv(…); df %>% group_by(…) %>% summarise(…) PYTHON: df = read_csv(…); df.groupby(…).aggregate(…)
  • 19. A SHARED RUNTIME FOR DATA SCIENCE FRONT-END PYTHON R JVM JULIA … SHARED DATA SCIENCE RUNTIME
  • 20. FROM IDEA TO ACTION
  • 21. PART 1: STANDARD IN-MEMORY FORMAT R PYTHON JVM PORTABLE DATA FRAME Non-Portable Data Frames …
  • 22. PART 2: ZERO COPY INTERCHANGE R PYTHON JVM SHARED MEMORY + STANDARD MEMORY FORMATS …
  • 23. PART 3: HIGH PERFORMANCE DATA ACCESS BINARY COLUMNAR CSV SQL PORTABLE DATA FRAME Storage Formats / Databases …
  • 24. PART 4: FLEXIBLE COMPUTATION ENGINE • Zero-overhead User-defined Functions • Portable Operator “Graphs” • “Embeddable” in Larger Systems
  • 25. APACHE ARROW Language-agnostic Data Frame Format Zero-Copy Interchange
  • 26. Simple, fast data interchange Without Arrow With Arrow
  • 27. Big picture Arrow goals • Cache-efficient columnar memory: optimized for CPU affinity and SIMD / parallel processing, O(1) random value access • Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based and streaming binary formats • Complex schema support: Flat and nested data types • Main implementations in C++ and Java: with integration tests • Bindings / implementations for C, Python, Ruby, JavaScript in various stages of development
  • 28. BUILDING THE ARROW FORMAT • “Superset” of representations supported by R, pandas, SQL engines • Optimized for CPU cache affinity • ASF Governance: Open + Transparent Community Project
  • 30. Some Arrow OSS Users Feather Format Ray Project
  • 31. FROM ARROW TO PANDAS2
  • 32. Logical Operator Graphs (a + b).log() Log Add a b
  • 33. Terminology • Kernel functions: atomic units of computation • Operator nodes: input/output types, operator parallelism properties
  • 34. Parallel Execution of Operator Graphs a b ADD LOG tmp out
  • 35. Some Optimization Strategies • Multicore scheduling • Elimination of temporaries • Operator fusion / pipelining
  • 36. Arrow-optimized data connectors Arrow in-memory format Logical Data Frame Expression Graphs Parallel Dataflow Execution Engine Python user API, DataFrame semantics, User-defined functions pandas2 Apache Arrow
  • 38.
  • 39. THANK YOU Wes McKinney @wesmckinn Apache Arrow: http://arrow.apache.org
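
Slides 32 to 35 describe the expression (a + b).log() as a logical operator graph built from kernel functions, and list elimination of temporaries and operator fusion among the optimization strategies. The following is a minimal sketch of that idea in plain Python, not the pandas2 or Apache Arrow API: the names Column, Op, add_kernel, log_kernel, evaluate_naive, and evaluate_fused are all hypothetical, and columns are assumed to be plain Python lists of floats. Multicore scheduling is not shown.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


# Kernel functions: atomic, elementwise units of computation (hypothetical names).
def add_kernel(x: float, y: float) -> float:
    return x + y


def log_kernel(x: float) -> float:
    return math.log(x)


@dataclass(frozen=True)
class Column:
    """Leaf node of the graph: a reference to a named input column."""
    name: str


@dataclass(frozen=True)
class Op:
    """Operator node: a kernel applied elementwise to its input nodes."""
    kernel: Callable[..., float]
    inputs: Tuple["Node", ...]


Node = Union[Column, Op]
Table = Dict[str, List[float]]


def evaluate_naive(node: Node, data: Table) -> List[float]:
    """Run one operator at a time; each Op materializes a full temporary list."""
    if isinstance(node, Column):
        return list(data[node.name])
    args = [evaluate_naive(child, data) for child in node.inputs]
    return [node.kernel(*values) for values in zip(*args)]


def compile_fused(node: Node) -> Callable[[Dict[str, float]], float]:
    """Fuse the graph into a single per-row function."""
    if isinstance(node, Column):
        name = node.name
        return lambda row: row[name]
    children = [compile_fused(child) for child in node.inputs]
    kernel = node.kernel
    return lambda row: kernel(*(child(row) for child in children))


def evaluate_fused(node: Node, data: Table) -> List[float]:
    """Single pass over the rows; no intermediate column such as 'tmp' is built."""
    fused = compile_fused(node)
    names = list(data)
    rows = (dict(zip(names, values)) for values in zip(*data.values()))
    return [fused(row) for row in rows]


# Build the graph for (a + b).log(): Log -> Add -> (a, b).
a, b = Column("a"), Column("b")
graph = Op(log_kernel, (Op(add_kernel, (a, b)),))

data = {"a": [1.0, 2.0, 3.0], "b": [0.0, 1.0, 2.0]}
naive_out = evaluate_naive(graph, data)
fused_out = evaluate_fused(graph, data)
```

In the naive path the result of ADD is stored as a whole column before LOG runs, which corresponds to the "tmp" buffer on slide 34. The fused path computes log(a + b) row by row, so that temporary never exists. A real engine would presumably apply the same idea to compiled, vectorized kernels over chunks of columnar memory rather than to Python lambdas.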