
Rust is for "Big Data"


Presentation given at the Boulder/Denver Rust Meetup on 4/11/18.

  1. Rust is for “Big Data” (Andy Grove, Boulder/Denver Rust Meetup, 4/11/18)
  2. About Me
     • I’ve been a software engineer for ~30 years
     • 20 of those years were spent using Java, plus some management/founder roles
     • In my day job I mostly work with Scala, Spark, Parquet, Kudu, Thrift, and HDFS
     • Yay! I’m a Big Data Engineer™
     • I have been learning Rust in my spare time, on and off, over the past couple of years
     • One of my goals for 2018 was to become proficient in Rust, so I decided to take on a substantial project
  3. What’s wrong with Spark/JVM?
     • Spark is actually pretty neat, but …
     • Garbage collection overheads can be huge
     • OutOfMemory errors are common
     • Java serialization is inefficient, even with Kryo
     • Expensive up-front query planning and code generation make it inefficient for interactive queries and small data sets
     • Difficult to configure, monitor, and debug
     • Generally row-oriented, even when working with columnar data sources
  4. A typical day in Spark-land …
  5. Let’s build something better!
     • Rust > JVM:
       • Raw performance of compiled code
       • Efficient memory usage
       • Predictable memory usage
       • No serialization overhead to map raw bytes to Rust structs
       • Access to hardware (SIMD, DMA, etc.)
  6. Keep Calm and Keep Columnar
     • Column-oriented > row-oriented
     • Just load the columns you need from disk (efficient projections)
     • “a > b” and “a + b” become vectorized operations that can take advantage of SIMD (Single Instruction, Multiple Data); a small sketch follows this slide
     • Apache Arrow is a standardized columnar in-memory format for zero-copy data interchange between systems
     • Apache Parquet is a columnar file format with efficient per-column encoding and compression
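
As a minimal illustration (not from the slides) of why columnar layouts vectorize well: when a column is a contiguous slice of one primitive type, expressions like “a + b” and “a > b” become tight loops the compiler can auto-vectorize with SIMD. The column names and values below are hypothetical.

```rust
/// Computes a hypothetical "a + b" projection over two columns.
/// Each column is a contiguous buffer of the same primitive type,
/// so the optimizer can auto-vectorize this loop with SIMD.
fn add_columns(a: &[f64], b: &[f64]) -> Vec<f64> {
    a.iter().zip(b.iter()).map(|(x, y)| x + y).collect()
}

/// Evaluates a hypothetical "a > b" predicate, producing a boolean
/// selection vector instead of branching row by row.
fn gt_columns(a: &[f64], b: &[f64]) -> Vec<bool> {
    a.iter().zip(b.iter()).map(|(x, y)| x > y).collect()
}

fn main() {
    let a = vec![1.0, 5.0, 3.0];
    let b = vec![2.0, 4.0, 3.0];
    println!("{:?}", add_columns(&a, &b)); // [3.0, 9.0, 6.0]
    println!("{:?}", gt_columns(&a, &b));  // [false, true, false]
}
```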
  7. DataFusion
     • DataFusion is a proof-of-concept of a modern distributed compute platform, implemented in Rust
     • The programming model is similar to Apache Spark (DataFrame and SQL APIs)
     • Apache Arrow is used for the core memory model
     • Apache Parquet is partially supported (read-only, no support for nested types yet)
     • CSV is supported too (where there is Big Data, there is CSV)
     • etcd is used for coordination between nodes
     • Kubernetes/Docker deployment model (planned)
  8. Arrow Memory Layout
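
This slide is a diagram that is not reproduced in the transcript. As a hedged illustration of the layout it describes, the sketch below uses the current arrow crate (which postdates the version used in the talk): an Arrow array is a contiguous values buffer plus a validity bitmap for nulls. The array contents are made up.

```rust
use arrow::array::{Array, Int32Array};

fn main() {
    // An Arrow primitive array stores values in one contiguous buffer
    // and tracks nulls in a separate validity bitmap, rather than
    // using sentinel values.
    let col = Int32Array::from(vec![Some(1), None, Some(3)]);

    assert_eq!(col.len(), 3);
    assert_eq!(col.null_count(), 1);
    assert!(col.is_null(1));
    assert_eq!(col.value(0), 1);
}
```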
  9. Source code example
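
The code on this slide is not captured in the transcript. As a stand-in, here is a minimal sketch of registering a CSV table and running SQL against it using the present-day DataFusion API (SessionContext, register_csv, sql), which postdates and differs from the 0.2.x API shown in the talk; the file path and column names are assumptions.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Register a CSV file as a table, then query it with SQL.
    // "locations.csv" is a hypothetical file with lat/lng columns.
    let ctx = SessionContext::new();
    ctx.register_csv("locations", "locations.csv", CsvReadOptions::new())
        .await?;

    // The ST_Point/ST_AsText UDFs from the benchmark would have to be
    // registered with the context before they could appear in the query.
    let df = ctx.sql("SELECT lat, lng FROM locations").await?;
    df.show().await?;
    Ok(())
}
```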
  10. First Benchmark
     • Simple job to convert lat/lng pairs into ESRI WKT (well-known text) format
     • SELECT ST_AsText(ST_Point(lat, lng)) FROM locations
     • Reads from a CSV file
     • Calls two UDFs and creates one UDT (a sketch of the conversion follows this slide)
     • Writes results to a CSV file
     • Single thread, single core
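
As a hedged illustration of the per-row work this benchmark performs (not the benchmark’s actual UDF implementation), one point-to-WKT conversion might look like this; the argument order mirrors the ST_Point(lat, lng) call in the query above.

```rust
/// Formats a lat/lng pair as a WKT point string.
/// Illustrative stand-in for the ST_AsText(ST_Point(lat, lng)) expression.
fn point_to_wkt(lat: f64, lng: f64) -> String {
    format!("POINT ({} {})", lat, lng)
}

fn main() {
    assert_eq!(point_to_wkt(40.0, -105.3), "POINT (40 -105.3)");
}
```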
  11. Detailed Results (throughput, rows/second)

      # Rows | DataFusion 0.2.6 | Apache Spark 2.2.1 |   Ratio
      10^1   |           18,191 |                2.4 | 7,523.8
      10^2   |           47,489 |                437 |   108.7
      10^3   |          607,057 |              3,731 |   162.7
      10^4   |          820,819 |             32,258 |    25.4
      10^5   |          957,025 |            181,159 |     5.3
      10^6   |        1,044,030 |            256,213 |     4.1
      10^7   |          797,224 |            268,853 |     3.0
      10^8   |        1,026,443 |            271,022 |     3.8
      10^9   |          958,960 |            282,576 |     3.4
  12. Thanks!
     • Resources:
       • DataFusion: https://datafusion.rs/
       • My blog: https://andygrove.io
       • Apache Arrow: https://arrow.apache.org/
     • Contact me:
       • LinkedIn: https://www.linkedin.com/in/andygrove/
       • Twitter: @andygrove73
       • Email: andygrove73@gmail.com
