www.twosigma.com
Memory Interoperability for
Analytics and Machine Learning
March 26, 2017
All Rights Reserved
Wes McKinney @wesmckinn
ScaledML @ Stanford
March 25, 2017
Me
• Currently: Software Architect at Two Sigma Investments
• Creator of Python pandas project
• PMC member for Apache Arrow and Apache Parquet
• Author of Python for Data Analysis
• Other Python projects: Ibis, Feather, statsmodels
Important Legal Information
The information presented here is offered for informational purposes only and should not be used for
any other purpose (including, without limitation, the making of investment decisions). Examples
provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing
herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest;
tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments,
LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any
time.
Some of the images, logos or other material used herein may be protected by copyright and/or
trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the
material and are used purely for identification and comment as fair use under international copyright
and/or trademark laws. Use of such image, copyright or trademark does not imply any association with
such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
This talk
• Benefits of interoperable data and metadata
• Challenges to sharing memory between runtime environments
• Apache Arrow: Purpose and C++ architecture
• Opportunities for collaboration
• Example application: pandas 2.0
Changing hardware landscape
• Intel has released its first production 3D XPoint SSD
• Reportedly 1000x faster than NAND and less expensive than RAM
• RAM and shared memory / mmap performance are converging
Changing software landscape
• Next-gen ML / AI frameworks (TensorFlow, Torch, etc.)
• DIY open source architectures for machine learning in production
  • Streaming / batch data processing pipelines
  • Data cleaning and feature engineering
  • Model fitting / scoring / serving
“Zero-copy” memory interfaces
• Enable computational tools to process a dataset without additional serialization or transfer to a different memory space
• Can do random access on a dataset that does not fit in RAM
• Another interpretation: reading a dataset is a metadata-only conversion
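The "metadata-only" reading above can be illustrated with a memory-mapped file. This is a minimal sketch, assuming NumPy is installed; the file name `values.bin` and the variable names are illustrative and not from the talk.

```python
import mmap
import os
import tempfile

import numpy as np

# Write one million float64 values to a file as raw bytes.
values = np.arange(1_000_000, dtype=np.float64)
path = os.path.join(tempfile.mkdtemp(), "values.bin")  # illustrative name
values.tofile(path)

with open(path, "rb") as f:
    # Map the file into the address space; pages are loaded lazily on access.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# "Reading" is metadata-only: a dtype plus a pointer into the mapping.
view = np.frombuffer(mapped, dtype=np.float64)

# Random access touches only the pages that hold the requested element.
element = view[123_456]
shares_memory = not view.flags.owndata  # the array does not own its bytes
```

No bytes are deserialized here: the array is a typed window onto the mapping, so the same approach can serve a file larger than RAM.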
Challenges to zero-copy memory sharing
• Cross-language issues
  • Type metadata + logical types
  • Byte/bit-level memory layout
• Language-specific issues
  • In-memory data structures
  • Memory allocation and sharing constructs
What is pandas?
• Popular in-memory data manipulation tool for Python
• Focused on tabular datasets (“data frames”)
• Sprawling codebase spanning multiple areas
  • IO for many data formats
  • Array manipulations / data preparation
  • OLAP-style analytics
• Internals implemented using NumPy array objects
NumPy
• Tensor memory model ("ndarray") for numeric data
• Strided, homogeneously-typed, byte-addressable memory
• APL-inspired semantics
• Zero-copy construction from compatible memory layouts
• Computational tools support both strided and contiguous memory access
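Zero-copy construction and strided views can be shown in a few lines. This is a sketch assuming NumPy; the variable names are illustrative.

```python
import numpy as np

# A plain 1D block of bytes holding twelve int32 values.
raw = bytearray(np.arange(12, dtype=np.int32).tobytes())

# Zero-copy construction: the ndarray wraps the existing buffer.
flat = np.frombuffer(raw, dtype=np.int32)

# Reshape and transpose only rewrite shape/strides metadata.
matrix = flat.reshape(3, 4)   # strides (16, 4): contiguous rows
transposed = matrix.T         # strides (4, 16): strided access

# A write through one view is visible through all of them.
flat[0] = 99
```

All three arrays refer to the same twelve integers in `raw`; only the shape and strides differ.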
pandas: Technical debt + Architectural issues
• A tensor library like NumPy is an awkward fit for pandas use cases
• Multidimensionality + strided memory access complicate algorithms
• Lack of built-in missing value support
• Weak on native string, variable length, or nested types
• pandas is at its core an “in-memory columnar” problem, similar to analytical SQL engines
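The missing-value point can be made concrete. The sketch below, assuming NumPy 1.17 or later, contrasts the NaN-sentinel workaround with a separate bit-packed validity mask of the kind columnar formats such as Arrow use. The helper `is_valid` is a hypothetical name for illustration.

```python
import numpy as np

# Integer data with a missing entry: NumPy has no integer null, so the
# usual workaround is a float64 array with NaN as a sentinel.
as_float = np.array([1, 2, np.nan, 4])   # dtype becomes float64

# Columnar alternative: keep int64 values and track nulls separately
# in a bit-packed validity mask (1 = valid, 0 = null).
values = np.array([1, 2, 0, 4], dtype=np.int64)
valid = np.array([True, True, False, True])
bitmap = np.packbits(valid, bitorder="little")   # one bit per value

def is_valid(bitmap: np.ndarray, i: int) -> bool:
    """Check bit i of a least-significant-bit-first validity bitmap."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)

total = int(values[valid].sum())   # null-aware sum over valid slots only
```

The integer column keeps its type, and the null information costs one bit per value.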
Thesis: Tensors and Tables
• Two data structures are best suited for zero-copy sharing
  • Tensors: N-dimensional, homogeneously typed arrays
  • Tables: column-oriented, heterogeneously typed
• These data structures can be defined using common memory and metadata primitives
Observations
• A Tensor is semantically a multidimensional view of a 1D block of memory
• Writing computational code targeting arbitrary tensors is much more difficult than targeting 1D contiguous arrays
• Tensors of non-fixed size types (e.g. strings) occur less frequently
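The first observation amounts to an addressing formula: the byte offset of an element is the sum of each index times its stride. This is a sketch assuming NumPy; `element_offset` is a hypothetical helper, not a library function.

```python
import numpy as np

def element_offset(index, strides):
    """Byte offset of one element: sum of index[k] * strides[k]."""
    return sum(i * s for i, s in zip(index, strides))

tensor = np.arange(24, dtype=np.int16).reshape(2, 3, 4)
block = tensor.tobytes()   # copy of the underlying 1D block, for inspection

# Strides are (24, 8, 2) bytes, so element (1, 2, 3) sits at byte 46.
offset = element_offset((1, 2, 3), tensor.strides)
value = np.frombuffer(block, dtype=np.int16, count=1, offset=offset)[0]
```

Code written for arbitrary tensors has to carry this indexing arithmetic everywhere, which is part of why 1D contiguous arrays are the easier target.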
Apache Arrow
• github.com/apache/arrow
• Collaboration among a broad set of OSS projects around language-agnostic shared data structures
• Initial focus
  • In-memory columnar tables
  • Canonical metadata
  • Interoperability between JVM and native code (C/C++) ecosystems
High performance data interchange
[Figure: data interchange between systems, "Today" vs. "With Arrow". Source: Apache Arrow]
What does Apache Arrow give you?
• Cache-efficient columnar memory: optimized for CPU affinity and SIMD /
parallel processing, O(1) random value access
• Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based
and streaming binary formats
• Complex schema support: Flat and nested data types
• Main implementations in C++ and Java: with integration tests
• Bindings / implementations for C, Python, Ruby, JavaScript in various stages of development
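O(1) random access holds even for variable-length values when a contiguous data buffer is paired with an offsets buffer. The sketch below mimics that layout with NumPy and plain bytes; it is an illustration of the idea, not Arrow's API, and `get` is a hypothetical helper.

```python
import numpy as np

strings = ["arrow", "", "pandas", "numpy"]

# Columnar layout for variable-length values: one contiguous data buffer
# plus an offsets buffer with len(strings) + 1 entries.
data = "".join(strings).encode("utf-8")
lengths = [len(s.encode("utf-8")) for s in strings]
offsets = np.concatenate(([0], np.cumsum(lengths))).astype(np.int32)

def get(i: int) -> str:
    """O(1) random access: two offset lookups and one slice."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")
```

Value `i` spans `offsets[i]` to `offsets[i + 1]`, so no scan over earlier values is needed.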
Arrow in C++
• Reusable memory management and IO subsystem for native code applications
• Layered in multiple components
  • Memory management
  • Type metadata / schemas
  • Array / Table containers
  • IO interfaces
  • Zero-copy IPC / messaging
Arrow C++: Memory management
• arrow::Buffer
  • RAII-based memory lifetime with std::shared_ptr<Buffer>
  • arrow::MemoryMappedBuffer: for memory maps
• arrow::MemoryPool
  • Abstract memory allocator for tracking all allocations
Arrow C++: Type metadata
• arrow::DataType
  • Base class for fixed size, variable size, and nested datatypes
• arrow::Field
  • Type + name + additional metadata
• arrow::Schema
  • Collection of fields
Arrow C++: Array / Table containers
• arrow::Array
  • 1-dimensional columnar arrays: Int32Array, ListArray, StructArray, etc.
  • Support for dictionary-encoded arrays
• arrow::RecordBatch
  • Collection of equal-length arrays
• arrow::Column
  • Logical table “column” as chunked array
• arrow::Table
  • Collection of columns
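Dictionary encoding, mentioned above for arrays, stores each distinct value once and replaces the column with small integer indices. This is a minimal sketch of the idea using NumPy rather than the Arrow C++ classes; the variable names are illustrative.

```python
import numpy as np

column = np.array(["red", "green", "red", "blue", "green", "red"])

# Dictionary encoding: each distinct value is stored once, and the
# column itself becomes an array of small integer indices.
dictionary, indices = np.unique(column, return_inverse=True)
indices = indices.astype(np.int8)

decoded = dictionary[indices]   # round-trips to the original values
```

Note that `np.unique` sorts the dictionary, which an encoder is not required to do; it is used here only because it produces the dictionary and indices in one call.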
Arrow C++: IO interfaces
• arrow::{InputStream, OutputStream}
• arrow::RandomAccessFile
  • Abstract file interface
• arrow::MemoryMappedFile
  • Zero-copy reads to arrow::Buffer
• Specific implementations for OS files, HDFS, etc.
Arrow C++: Messaging / IPC
• Metadata read/write using Google’s FlatBuffers library
• Encapsulated Message type
• Write record batches, read with zero-copy
• arrow::{FileWriter, FileReader}
• Random access / “batch” binary format
• arrow::{StreamWriter, StreamReader}
• Streaming binary format
In development: arrow::Tensor
• Targeting interoperability with memory layouts as used in NumPy,
TensorFlow, Torch, or other standard tensor-based frameworks
• data: arrow::Buffer
• shape: dimension sizes
• strides: memory ordering
• Zero-copy reads using Arrow’s shared memory tools
• Support Tensor math libraries for C++ like xtensor
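The data / shape / strides triple is enough to reconstruct a tensor view on the consumer side. The sketch below uses NumPy's `np.ndarray` constructor over a plain `bytes` object standing in for an arrow::Buffer; it illustrates the metadata, not the arrow::Tensor API, which was still in development at the time of the talk.

```python
import numpy as np

# The three pieces of tensor metadata, as plain Python values.
data = np.arange(6, dtype=np.float32).tobytes()   # stand-in for an arrow::Buffer
shape = (2, 3)
row_major = (12, 4)      # strides in bytes for C ordering
column_major = (4, 8)    # strides in bytes for Fortran ordering

# np.ndarray wraps an existing buffer without copying it.
c_view = np.ndarray(shape=shape, dtype=np.float32, buffer=data, strides=row_major)
f_view = np.ndarray(shape=shape, dtype=np.float32, buffer=data, strides=column_major)
```

The same six floats read as `[[0, 1, 2], [3, 4, 5]]` under row-major strides and `[[0, 2, 4], [1, 3, 5]]` under column-major strides, which is why strides have to travel with the buffer.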
Example use: Ray ML framework from Berkeley RISELab
Source: https://arxiv.org/abs/1703.03924
• Shared memory-based object store
• Zero-copy tensor reads using Arrow libraries
Example use: pandas 2.0
• In-planning rearchitecture of pandas’s internals
• libpandas — largely Python-agnostic C++11 library
• Decoupling pandas data structures from NumPy tensors
• Support analytics targeting native Arrow memory
• Multicore / parallel algorithms
• Leverage latest SIMD intrinsics
• Lazy-loading DataFrames from primary input formats
  • CSV, JSON, HDF5, Apache Parquet
Other examples
• Spark integration (SPARK-13534)
• Weld integration (ARROW-649)
Thank you
• Building code and community around
  • IO subsystems
  • Metadata
  • Data structures and in-memory formats

Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 

More from Wes McKinney (10)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Memory Interoperability in Analytics and Machine Learning

  • 1. Memory Interoperability for Analytics and Machine Learning
      Wes McKinney (@wesmckinn), ScaledML @ Stanford, March 25, 2017
      www.twosigma.com, March 26, 2017, All Rights Reserved
  • 2. Me
      • Currently: Software Architect at Two Sigma Investments
      • Creator of Python pandas project
      • PMC member for Apache Arrow and Apache Parquet
      • Author of Python for Data Analysis
      • Other Python projects: Ibis, Feather, statsmodels
  • 3. Important Legal Information
      The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time.
      Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
      Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
  • 4. This talk
      • Benefits of interoperable data and metadata
      • Challenges to sharing memory between runtime environments
      • Apache Arrow: Purpose and C++ architecture
      • Opportunities for collaboration
      • Example application: pandas 2.0
  • 5. Changing hardware landscape
      • Intel has released first production 3D XPoint SSD
      • Reported 1000x faster than NAND, less expensive than RAM
      • Convergence between RAM vs. shared memory / mmap performance
  • 6. Changing software landscape
      • Next-gen ML / AI frameworks (TensorFlow, Torch, etc.)
      • DIY open source architectures for machine learning in production
      • Streaming / batch data processing pipelines
      • Data cleaning and feature engineering
      • Model fitting / scoring / serving
  • 7. “Zero-copy” memory interfaces
      • Enables computational tools to process a dataset without any additional serialization, or transfer to a different memory space
      • Can do random access on a dataset that does not fit in RAM
      • Another interpretation: reading a dataset is a metadata-only conversion
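The zero-copy idea on slide 7 can be illustrated with a minimal Python sketch. It is not taken from the talk: it assumes a plain `bytearray` as a stand-in for a dataset already in memory and uses the standard library's `memoryview` to show that a view shares the underlying bytes while an explicit copy does not.

```python
# Minimal sketch: a memoryview exposes an existing buffer without copying it.
data = bytearray(b"abcdefgh")        # stand-in for a dataset already in memory
view = memoryview(data)              # wraps the buffer; no bytes are copied
window = view[2:6]                   # slicing a memoryview is also copy-free

data[2] = ord("X")                   # mutate the underlying buffer
first_byte_seen_by_view = window[0]  # the view observes the change

copied = bytes(data[2:6])            # by contrast, this is an independent copy
data[3] = ord("Y")                   # visible through window, not through copied
```

Creating `window` only records a start, a length, and a pointer into `data`, which is one way to read the slide's "metadata-only conversion".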
  • 8. Challenges to zero-copy memory sharing
      • Cross-language issues
      • Type metadata + logical types
      • Byte/bit-level memory layout
      • Language-specific issues
      • In-memory data structures
      • Memory allocation and sharing constructs
  • 9. What is pandas?
      • Popular in-memory data manipulation tool for Python
      • Focused on tabular datasets (“data frames”)
      • Sprawling codebase spanning multiple areas
      • IO for many data formats
      • Array manipulations / data preparation
      • OLAP-style analytics
      • Internals implemented using NumPy array objects
  • 10. NumPy
      • Tensor memory model ("ndarray") for numeric data
      • Strided, homogeneously-typed, byte-addressable memory
      • APL-inspired semantics
      • Zero-copy construction from compatible memory layouts
      • Computational tools support both strided and contiguous memory access
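As a rough illustration of "zero-copy construction from compatible memory layouts", the sketch below reinterprets a contiguous byte buffer as a 2 x 3 grid of integers. To stay within the standard library it uses `memoryview.cast` rather than NumPy itself, so it is an analogy for what an ndarray does, not NumPy's own API.

```python
import struct

# Six native integers packed into one contiguous byte buffer.
raw = bytearray(struct.pack("6i", 0, 1, 2, 3, 4, 5))

# Reinterpret the same bytes as a 2 x 3 row-major grid; nothing is copied.
grid = memoryview(raw).cast("i", shape=[2, 3])

rows = grid.tolist()   # nested-list rendering of the grid
element = grid[1, 2]   # row 1, column 2

# Writing to the original buffer is visible through the view,
# which shows that both names refer to the same memory.
struct.pack_into("i", raw, 0, 42)
updated = grid[0, 0]
```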
  • 11. pandas: Technical debt + Architectural issues
      • Tensor library like NumPy awkward fit for pandas use cases
      • Multidimensionality + strided memory access complicated algorithms
      • Lack of built-in missing value support
      • Weak on native string, variable length, or nested types
      • pandas at core an “in-memory columnar” problem, similar to analytical SQL engines
  • 12. Thesis: Tensors and Tables
      • 2 data structures best suited for zero-copy sharing
      • Tensors: N-dimensional, homogeneously-typed arrays
      • Tables: Column-oriented, heterogeneously typed
      • These data structures can be defined using common memory and metadata primitives
  • 13. Observations
      • A Tensor is semantically a multidimensional view of a 1D block of memory
      • Writing computational code targeting arbitrary tensors is much more difficult than 1D contiguous arrays
      • Tensors of non-fixed size types (e.g. strings) occur less frequently
  • 14. Apache Arrow
      • github.com/apache/arrow
      • Collaboration amongst broad set of OSS projects around language-agnostic shared data structures
      • Initial focus
      • In-memory columnar tables
      • Canonical metadata
      • Interoperability between JVM and native code (C/C++) ecosystem
  • 15. High performance data interchange
      [Figure: data interchange today vs. with Arrow. Source: Apache Arrow]
  • 16. What does Apache Arrow give you?
      • Cache-efficient columnar memory: optimized for CPU affinity and SIMD / parallel processing, O(1) random value access
      • Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based and streaming binary formats
      • Complex schema support: Flat and nested data types
      • Main implementations in C++ and Java: with integration tests
      • Bindings / implementations for C, Python, Ruby, Javascript in various stages of development
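To make "O(1) random value access" concrete for variable-length data, here is a simplified sketch of a string column stored as one contiguous data buffer plus an offsets buffer, the general shape Arrow uses for variable-size values. The helper names `build_string_column` and `get_value` are hypothetical, and the sketch omits null handling entirely.

```python
from array import array


def build_string_column(values):
    """Pack strings into one data buffer plus an offsets buffer."""
    offsets = array("i", [0])
    data = bytearray()
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))   # end of this value = start of the next
    return offsets, data


def get_value(offsets, data, i):
    """Constant-time lookup: two offset reads, then one slice of the buffer."""
    start, stop = offsets[i], offsets[i + 1]
    return bytes(memoryview(data)[start:stop]).decode("utf-8")


offsets, data = build_string_column(["arrow", "", "pandas"])
third = get_value(offsets, data, 2)
```

Because every value is located by arithmetic on the offsets, no per-value object or pointer chasing is needed, which is part of what makes the layout cache-friendly.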
  • 17. Arrow in C++
      • Reusable memory management and IO subsystem for native code applications
      • Layered in multiple components
      • Memory management
      • Type metadata / schemas
      • Array / Table containers
      • IO interfaces
      • Zero-copy IPC / messaging
  • 18. Arrow C++: Memory management
      • arrow::Buffer
      • RAII-based memory lifetime with std::shared_ptr<Buffer>
      • arrow::MemoryMappedBuffer: for memory maps
      • arrow::MemoryPool
      • Abstract memory allocator for tracking all allocations
  • 19. Arrow C++: Type metadata
      • arrow::DataType
      • Base class for fixed size, variable size, and nested datatypes
      • arrow::Field
      • Type + name + additional metadata
      • arrow::Schema
      • Collection of fields
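The relationship between the three metadata classes on slide 19 can be sketched with Python dataclasses. This is only a hypothetical mirror of the concepts (type, field, schema): the class layout, `ListType`, and `field_by_name` are assumptions made for illustration and are not the Arrow C++ or Python API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataType:
    """Base description of a type, e.g. "int32" or "utf8"."""
    name: str


@dataclass
class ListType(DataType):
    """A nested type: a list whose items share one value type."""
    value_type: DataType


@dataclass
class Field:
    """Type + name + additional metadata."""
    name: str
    type: DataType
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Schema:
    """Ordered collection of fields."""
    fields: List[Field]

    def field_by_name(self, name: str) -> Field:
        for f in self.fields:
            if f.name == name:
                return f
        raise KeyError(name)


int32 = DataType("int32")
schema = Schema([
    Field("id", int32),
    Field("scores", ListType("list", int32), metadata={"unit": "points"}),
])
```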
  • 20. Arrow C++: Array / Table containers
      • arrow::Array
      • 1-dimensional columnar arrays: Int32Array, ListArray, StructArray, etc.
      • Support for dictionary-encoded arrays
      • arrow::RecordBatch
      • Collection of equal-length arrays
      • arrow::Column
      • Logical table “column” as chunked array
      • arrow::Table
      • Collection of columns
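The phrase "logical table column as chunked array" can be sketched as follows: several separately stored chunks are presented as one logical sequence, and a logical index is mapped to a (chunk, position) pair. `ChunkedColumn` is a hypothetical Python class written for this illustration, with plain lists standing in for arrays; it is not the arrow::Column interface.

```python
from bisect import bisect_right
from itertools import accumulate


class ChunkedColumn:
    """One logical column backed by several separately stored chunks."""

    def __init__(self, name, chunks):
        self.name = name
        self.chunks = list(chunks)
        # Running totals of chunk lengths, e.g. [3, 5, 6] for sizes 3, 2, 1.
        self._ends = list(accumulate(len(c) for c in self.chunks))

    def __len__(self):
        return self._ends[-1] if self._ends else 0

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        # First chunk whose running total exceeds i holds the value.
        chunk_index = bisect_right(self._ends, i)
        start = self._ends[chunk_index - 1] if chunk_index else 0
        return self.chunks[chunk_index][i - start]


column = ChunkedColumn("x", [[1, 2, 3], [4, 5], [6]])
values = [column[i] for i in range(len(column))]
```

Chunking of this kind lets a table be assembled from record batches that arrive separately without first concatenating them into one allocation.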
  • 21. Arrow C++: IO interfaces
      • arrow::{InputStream, OutputStream}
      • arrow::RandomAccessFile
      • Abstract file interface
      • arrow::MemoryMappedFile
      • Zero-copy reads to arrow::Buffer
      • Specific implementations for OS files, HDFS, etc.
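A memory-mapped, zero-copy read can be approximated in Python with the standard `mmap` module. The sketch below writes a small temporary file, maps it, and reads a byte range through a `memoryview`; the file contents and offsets are made up for the example, and it does not use Arrow's own arrow::MemoryMappedFile.

```python
import mmap
import os
import tempfile

# Write a small file to map: the byte pattern 0..255 repeated 16 times.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(256)) * 16)
    path = f.name

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mapped)   # view over the mapping, no read() into a buffer
    chunk = view[1000:1004]     # random access by offset, still no copy
    values = list(chunk)        # copying happens only here, for 4 bytes
    # Views must be released before the mapping can be closed.
    chunk.release()
    view.release()
    mapped.close()

os.remove(path)
```

Only the pages that are actually touched need to be loaded by the operating system, which is what allows random access to data larger than RAM.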
  • 22. Arrow C++: Messaging / IPC
      • Metadata read/write using Google’s Flatbuffers library
      • Encapsulated Message type
      • Write record batches, read with zero-copy
      • arrow::{FileWriter, FileReader}
      • Random access / “batch” binary format
      • arrow::{StreamWriter, StreamReader}
      • Streaming binary format
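The general idea of a streaming binary format that can be read with zero-copy is sketched below using simple length-prefixed framing. This is deliberately not the Arrow wire format (which carries Flatbuffers metadata); the 4-byte little-endian length prefix and the helper names `write_message` and `read_messages` are assumptions chosen for the illustration.

```python
import io
import struct


def write_message(stream, payload):
    """Write one message: 4-byte little-endian length, then the payload."""
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)


def read_messages(buffer):
    """Yield each payload as a memoryview slice of the received buffer."""
    view = memoryview(buffer)
    pos = 0
    while pos < len(view):
        (size,) = struct.unpack_from("<I", view, pos)
        pos += 4
        yield view[pos:pos + size]   # a slice of the buffer, not a copy
        pos += size


out = io.BytesIO()
for batch in (b"batch-0", b"batch-one"):
    write_message(out, batch)

received = [bytes(m) for m in read_messages(out.getvalue())]
```

The reader only parses the small length prefix; the payload bytes stay where they were received, which is the property the slide calls reading with zero-copy.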
  • 23. In development: arrow::Tensor
      • Targeting interoperability with memory layouts as used in NumPy, TensorFlow, Torch, or other standard tensor-based frameworks
      • data: arrow::Buffer
      • shape: dimension sizes
      • strides: memory ordering
      • Zero-copy reads using Arrow’s shared memory tools
      • Support Tensor math libraries for C++ like xtensor
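The data / shape / strides triple describes how a multidimensional index maps onto a flat buffer: the byte offset of an element is the sum of index times stride over all dimensions. The sketch below works this out for a 2 x 3 tensor of 8-byte floats; `row_major_strides` and `element_offset` are hypothetical helper names, not Arrow functions.

```python
import struct


def row_major_strides(shape, itemsize):
    """Byte strides for a C-ordered (row-major) layout."""
    strides = []
    step = itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def element_offset(index, strides):
    """Byte offset of an element: sum of index[k] * strides[k]."""
    return sum(i * s for i, s in zip(index, strides))


# data: one flat buffer holding six 8-byte floats.
data = struct.pack("6d", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
shape = (2, 3)
strides = row_major_strides(shape, 8)

value = struct.unpack_from("d", data, element_offset((1, 2), strides))[0]

# A transposed view needs no data movement: reverse shape and strides.
t_strides = strides[::-1]
t_value = struct.unpack_from("d", data, element_offset((2, 1), t_strides))[0]
```

Here element (1, 2) of the original and element (2, 1) of the transposed view resolve to the same byte offset, which is why reshaping and transposing can be metadata-only operations.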
  • 24. Example use: Ray ML framework from Berkeley RISELab
      • Shared memory-based object store
      • Zero-copy tensor reads using Arrow libraries
      Source: https://arxiv.org/abs/1703.03924
  • 25. Example use: pandas 2.0
      • In-planning rearchitecture of pandas’s internals
      • libpandas: largely Python-agnostic C++11 library
      • Decoupling pandas data structures from NumPy tensors
      • Support analytics targeting native Arrow memory
      • Multicore / parallel algorithms
      • Leverage latest SIMD intrinsics
      • Lazy-loading DataFrames from primary input formats
      • CSV, JSON, HDF5, Apache Parquet
  • 26. Other examples
      • Spark integration (SPARK-13534)
      • Weld integration (ARROW-649)
  • 27. Thank you
      • Building code and community around
      • IO subsystems
      • Metadata
      • Data structures and in-memory formats