
Rust is for "Big Data"


Presentation given at the Boulder/Denver Rust Meetup on 4/11/18.

  1. Rust is for “Big Data” (Andy Grove, Boulder/Denver Rust Meetup, 4/11/18)
  2. About Me
     • I’ve been a software engineer for ~30 years
     • 20 of those years were spent using Java, plus some management/founder roles
     • In my day job I mostly work with Scala, Spark, Parquet, Kudu, Thrift, and HDFS
     • Yay! I’m a Big Data Engineer™
     • I have been learning Rust in my spare time, on and off, over the past couple of years
     • One of my goals for 2018 was to become proficient in Rust, so I decided to take on a substantial project
  3. What’s wrong with Spark/JVM?
     • Spark is actually pretty neat, but …
     • Garbage collection overheads can be huge
     • OutOfMemory errors are common
     • Java serialization is inefficient, even with Kryo
     • Expensive up-front query planning and code generation make it inefficient for interactive queries and small data sets
     • Difficult to configure, monitor, and debug
     • Generally row-oriented, even when working with columnar data sources
  4. A typical day in Spark-land …
  5. Let’s build something better!
     • Rust > JVM:
       • Raw performance of compiled code
       • Efficient memory usage
       • Predictable memory usage
       • No serialization overhead to map raw bytes to Rust structs
       • Access to hardware (SIMD, DMA, etc.)
  6. Keep Calm and Keep Columnar
     • Column-oriented > row-oriented
     • Just load the columns you need from disk (efficient projections)
     • “a > b” and “a + b” become vectorized operations that can take advantage of SIMD (Single Instruction, Multiple Data); a small sketch follows this slide
     • Apache Arrow is a standardized columnar in-memory format for zero-copy data interchange between systems
     • Apache Parquet is a columnar file format with efficient per-column encoding and compression
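
As a minimal illustration (not from the slides) of why columnar layouts vectorize well: when a column is a contiguous slice of one primitive type, expressions like “a + b” and “a > b” become tight loops the compiler can auto-vectorize with SIMD. The column names and values below are hypothetical.

```rust
/// Computes a hypothetical "a + b" projection over two columns.
/// Each column is a contiguous buffer of the same primitive type,
/// so the optimizer can auto-vectorize this loop with SIMD.
fn add_columns(a: &[f64], b: &[f64]) -> Vec<f64> {
    a.iter().zip(b.iter()).map(|(x, y)| x + y).collect()
}

/// Evaluates a hypothetical "a > b" predicate, producing a boolean
/// selection vector instead of branching row by row.
fn gt_columns(a: &[f64], b: &[f64]) -> Vec<bool> {
    a.iter().zip(b.iter()).map(|(x, y)| x > y).collect()
}

fn main() {
    let a = vec![1.0, 5.0, 3.0];
    let b = vec![2.0, 4.0, 3.0];
    println!("{:?}", add_columns(&a, &b)); // [3.0, 9.0, 6.0]
    println!("{:?}", gt_columns(&a, &b));  // [false, true, false]
}
```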
  7. DataFusion
     • DataFusion is a proof-of-concept of a modern distributed compute platform, implemented in Rust
     • The programming model is similar to Apache Spark (DataFrame and SQL APIs)
     • Apache Arrow is used for the core memory model
     • Apache Parquet is partially supported (read-only, no support for nested types yet)
     • CSV is supported too (where there is Big Data, there is CSV)
     • etcd is used for coordination between nodes
     • Kubernetes/Docker deployment model (planned)
  8. Arrow Memory Layout
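
This slide is a diagram that is not reproduced in the transcript. As a hedged illustration of the layout it describes, the sketch below uses the current arrow crate (which postdates the version used in the talk): an Arrow array is a contiguous values buffer plus a validity bitmap for nulls. The array contents are made up.

```rust
use arrow::array::{Array, Int32Array};

fn main() {
    // An Arrow primitive array stores values in one contiguous buffer
    // and tracks nulls in a separate validity bitmap, rather than
    // using sentinel values.
    let col = Int32Array::from(vec![Some(1), None, Some(3)]);

    assert_eq!(col.len(), 3);
    assert_eq!(col.null_count(), 1);
    assert!(col.is_null(1));
    assert_eq!(col.value(0), 1);
}
```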
  9. Source code example
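
The code on this slide is not captured in the transcript. As a stand-in, here is a minimal sketch of registering a CSV table and running SQL against it using the present-day DataFusion API (SessionContext, register_csv, sql), which postdates and differs from the 0.2.x API shown in the talk; the file path and column names are assumptions.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Register a CSV file as a table, then query it with SQL.
    // "locations.csv" is a hypothetical file with lat/lng columns.
    let ctx = SessionContext::new();
    ctx.register_csv("locations", "locations.csv", CsvReadOptions::new())
        .await?;

    // The ST_Point/ST_AsText UDFs from the benchmark would have to be
    // registered with the context before they could appear in the query.
    let df = ctx.sql("SELECT lat, lng FROM locations").await?;
    df.show().await?;
    Ok(())
}
```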
  10. First Benchmark
     • Simple job to convert lat/lng pairs into ESRI WKT (well-known text) format
     • SELECT ST_AsText(ST_Point(lat, lng)) FROM locations
     • Reads from a CSV file
     • Calls two UDFs and creates one UDT (a sketch of the conversion follows this slide)
     • Writes results to a CSV file
     • Single thread, single core
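
As a hedged illustration of the per-row work this benchmark performs (not the benchmark’s actual UDF implementation), one point-to-WKT conversion might look like this; the argument order mirrors the ST_Point(lat, lng) call in the query above.

```rust
/// Formats a lat/lng pair as a WKT point string.
/// Illustrative stand-in for the ST_AsText(ST_Point(lat, lng)) expression.
fn point_to_wkt(lat: f64, lng: f64) -> String {
    format!("POINT ({} {})", lat, lng)
}

fn main() {
    assert_eq!(point_to_wkt(40.0, -105.3), "POINT (40 -105.3)");
}
```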
  11. Detailed Results (throughput, rows/second)

      # Rows | DataFusion 0.2.6 | Apache Spark 2.2.1 |   Ratio
      10^1   |           18,191 |                2.4 | 7,523.8
      10^2   |           47,489 |                437 |   108.7
      10^3   |          607,057 |              3,731 |   162.7
      10^4   |          820,819 |             32,258 |    25.4
      10^5   |          957,025 |            181,159 |     5.3
      10^6   |        1,044,030 |            256,213 |     4.1
      10^7   |          797,224 |            268,853 |     3.0
      10^8   |        1,026,443 |            271,022 |     3.8
      10^9   |          958,960 |            282,576 |     3.4
  12. Thanks!
     • Resources:
       • DataFusion: https://datafusion.rs/
       • My blog: https://andygrove.io
       • Apache Arrow: https://arrow.apache.org/
     • Contact me:
       • LinkedIn: https://www.linkedin.com/in/andygrove/
       • Twitter: @andygrove73
       • Email: andygrove73@gmail.com
