Rust & Apache Arrow @ RMS

Learn how RMS is using Rust and Apache Arrow / DataFusion to reduce TCO and increase performance of analytical queries on "Big Data".

  1. Apache Arrow @ RMS, March 2019
  2. The World's Leading Catastrophe Risk Modeling Company. From earthquakes, hurricanes, and floods to terrorism and infectious diseases, RMS helps financial institutions and public agencies understand, quantify, and manage risk.
  3. So what do we actually do?
     ● Models
       ○ We have complex models for various types of risk
         ■ Fire, flood, earthquakes, etc.
       ○ Our customers run our models against their portfolios of risk items (e.g. properties) to understand financial impact
       ○ The models produce a lot of data
     ● Interactive Queries
       ○ Insurance analysts are similar to data scientists
       ○ Lots of result data to slice, dice, and visualize
       ○ Low-latency analytics on relatively large datasets
         ■ Too much for a SQL database but not PB scale
  4. [Slide with no transcribed text]
  5. RMS Datastore Stack: intelligent query parsing, rewriting, and routing. Cost-based optimizations. Ability to use different query engines depending on use case or size of data set.
  6. Query Service 1.0
     ● Native Query Execution
       ○ Scala code, using Apache Arrow and Parquet libraries
       ○ Column-based file readers with projection push-down
       ○ Row-based query execution
       ○ Apache Arrow for the type system
     ● Performance
       ○ Order-of-magnitude improvements compared to Spark for some use cases
       ○ Slower than Spark for other use cases (larger data sets, JOINs, etc.)
     ● SQL Interface
       ○ Apache Hive for our internal SQL dialect
       ○ Apache Hive protocol for compatibility with ODBC/JDBC drivers
       ○ REST API for integration with microservices
  7. Query Service Conclusions & Next Steps
     ● The Query Service was successful
       ○ Reduced TCO (fewer Spark nodes required)
       ○ Improved performance for interactive queries
     ● In my spare time I had been working on an open source project called DataFusion
       ○ DataFusion started out as a generic Rust query engine
       ○ I felt that Rust was much better suited to this than the JVM
       ○ I learned a lot more about Apache Arrow and the benefits of columnar processing
     ● So how could we leverage this at RMS?
       ○ I donated the initial Rust implementation of Apache Arrow and later donated DataFusion as well
  8. Why Columnar?
  9. Row vs Column
     Source code available: https://github.com/andygrove/row-vs-col-rs
     Compares:
     ● Rust Vec<Row>
     ● Rust Vec<Column>
     ● Rust Vec<Array> // Apache Arrow
     Columnar benefits:
     ● Cache pipelining
     ● SIMD (single instruction, multiple data)
     ● GPU vectorized processing
     Chart note: higher is better.
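
The row-versus-column comparison on this slide can be illustrated with a minimal sketch using only the Rust standard library. The `Row` and `Columns` types and their field names are hypothetical and are not taken from the linked benchmark repository; the point is only that an aggregate over one field reads a single contiguous buffer in the columnar layout, while the row layout walks every record.

```rust
/// Row-oriented layout: each record keeps all of its fields together.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone)]
pub struct Row {
    pub id: u32,
    pub name: String,
    pub value: f64,
}

/// Column-oriented layout: each field lives in its own contiguous vector.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Default)]
pub struct Columns {
    pub ids: Vec<u32>,
    pub names: Vec<String>,
    pub values: Vec<f64>,
}

/// Transposes a slice of rows into the columnar layout.
pub fn to_columns(rows: &[Row]) -> Columns {
    let mut cols = Columns::default();
    for row in rows {
        cols.ids.push(row.id);
        cols.names.push(row.name.clone());
        cols.values.push(row.value);
    }
    cols
}

/// Sums `value` by visiting every row; the unrelated `id` and `name`
/// fields sit between consecutive values in memory.
pub fn sum_rows(rows: &[Row]) -> f64 {
    rows.iter().map(|r| r.value).sum()
}

/// Sums `value` from one contiguous buffer, touching only the data
/// the aggregate actually needs.
pub fn sum_column(cols: &Columns) -> f64 {
    cols.values.iter().sum()
}
```

Both functions return the same result; any speed difference comes from memory layout, which is what the benchmark on the slide measures.
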
  10. Apache Arrow
  11. Apache Arrow
      ● Standardized language-independent columnar memory format
        ○ for flat and hierarchical data
        ○ organized for efficient analytic operations on modern hardware
          ■ Vectorized processing, SIMD, GPU
      ● Implementations available for many programming languages
        ○ C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust
      ● Zero-copy interprocess communication
        ○ IPC metadata defined in FlatBuffers format
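
As a rough illustration of what a columnar memory format looks like, the sketch below models a nullable integer column as a contiguous value buffer plus a validity bitmap (one bit per slot, least-significant bit first), which is the general shape Arrow uses for primitive arrays. `Int64Column` is a hypothetical, simplified type written for this example; it is not the Arrow Rust API and ignores details such as buffer alignment and padding.

```rust
/// Simplified, hypothetical stand-in for an Arrow-style primitive array.
pub struct Int64Column {
    /// Contiguous value buffer; null slots hold a placeholder value.
    values: Vec<i64>,
    /// One bit per slot, least-significant bit first; 1 means "valid".
    validity: Vec<u8>,
}

impl Int64Column {
    /// Builds the two buffers from a slice of optional values.
    pub fn from_options(items: &[Option<i64>]) -> Self {
        let mut values = Vec::with_capacity(items.len());
        let mut validity = vec![0u8; (items.len() + 7) / 8];
        for (i, item) in items.iter().enumerate() {
            match item {
                Some(v) => {
                    values.push(*v);
                    validity[i / 8] |= 1u8 << (i % 8);
                }
                None => values.push(0),
            }
        }
        Int64Column { values, validity }
    }

    /// Number of slots, including nulls.
    pub fn len(&self) -> usize {
        self.values.len()
    }

    /// Checks the validity bit for slot `i`.
    pub fn is_valid(&self, i: usize) -> bool {
        (self.validity[i / 8] & (1u8 << (i % 8))) != 0
    }

    /// Returns the value at slot `i`, or `None` if the slot is null.
    pub fn get(&self, i: usize) -> Option<i64> {
        if self.is_valid(i) {
            Some(self.values[i])
        } else {
            None
        }
    }

    /// Counts null slots by subtracting the number of set validity bits.
    pub fn null_count(&self) -> usize {
        let set: usize = self.validity.iter().map(|b| b.count_ones() as usize).sum();
        self.len() - set
    }

    /// Sums the valid slots only.
    pub fn sum(&self) -> i64 {
        (0..self.len())
            .filter(|&i| self.is_valid(i))
            .map(|i| self.values[i])
            .sum()
    }
}
```

Because the layout is just two flat buffers, a consumer in another language could in principle read the same bytes without converting them, which is the idea behind the zero-copy bullet above.
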
  12. Apache Arrow
      ● Computational libraries
        ○ C++ libraries that leverage LLVM (donated by Dremio)
        ○ NVIDIA CUDA support
      ● Query Engines
        ○ Ursa Labs initiative
          ■ C++ query engine
        ○ DataFusion
          ■ Rust query engine
  13. Apache Arrow
      ● 3 years as a top-level project
      ● Project Management Committee (PMC) members work for ...
        ○ Cloudera, Databricks, DataStax, Dremio, Hortonworks, Looker, MapR, RMS, RStudio, Salesforce, Twitter, UC Berkeley RISELab, Ursa Labs, WeWork, Workday
      ● Committers work for ...
        ○ Amazon, CERN, Google, IBM
      ● Also many individual contributors
      ● Companies providing financial support (via Ursa Labs)
        ○ NVIDIA, ODSC, RStudio, Two Sigma
  14. Huge overhead converting between different data formats and duplicating data.
  15. Zero-copy data access: exchange metadata and pointers to Arrow arrays.
  16. DataFusion: Rust-native in-memory query engine for Apache Arrow
  17. Why Rust
      ● See https://www.rust-lang.org/ for detailed information
      ● My take
        ○ Speed of C++ with the safety of Java
        ○ Memory efficient (no GC)
        ○ Predictable performance
        ○ Lower TCO
        ○ Forces you to think about what you are doing
          ■ Thread safety has to be explicit
          ■ Memory management has to be explicit
        ○ The compiler acts as a peer reviewer … tough but fair
  18. DataFusion current functionality
      ● SQL query planner and optimizer
      ● Supported SQL features
        ○ Projection (SELECT)
        ○ Selection (WHERE)
        ○ Aggregates (MIN, MAX, SUM)
      ● Expressions
        ○ Identifiers (column names)
        ○ Literal values
      ● Operators
        ○ Arithmetic (+, -, *, /, %)
        ○ Comparison (<, <=, =, >=, >, !=, etc.)
        ○ Binary (AND, OR)
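
To make the expression and operator list above more concrete, here is a minimal sketch of how column references, literals, arithmetic, and comparisons can be evaluated a whole column at a time. All names here (`Expr`, `ArithOp`, `CmpOp`, `Batch`, `eval`, `compare`, `filter`) are hypothetical and chosen for this example; this is not DataFusion's actual API, and it only handles `i64` columns without nulls.

```rust
/// Arithmetic operators supported by this sketch.
#[derive(Debug, Clone, Copy)]
pub enum ArithOp { Add, Sub, Mul }

/// Comparison operators supported by this sketch.
#[derive(Debug, Clone, Copy)]
pub enum CmpOp { Lt, LtEq, Eq, GtEq, Gt, NotEq }

/// A tiny expression tree: column reference, literal, or arithmetic.
pub enum Expr {
    Column(usize),
    Literal(i64),
    Arith(Box<Expr>, ArithOp, Box<Expr>),
}

/// A batch of equally sized i64 columns (hypothetical type).
pub struct Batch {
    pub columns: Vec<Vec<i64>>,
}

impl Batch {
    /// Row count, taken from the first column (0 for an empty batch).
    pub fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |c| c.len())
    }
}

/// Evaluates an expression for every row, producing one output column.
pub fn eval(expr: &Expr, batch: &Batch) -> Vec<i64> {
    match expr {
        Expr::Column(index) => batch.columns[*index].clone(),
        Expr::Literal(value) => vec![*value; batch.num_rows()],
        Expr::Arith(left, op, right) => {
            let lhs = eval(left, batch);
            let rhs = eval(right, batch);
            lhs.iter().zip(rhs.iter()).map(|(a, b)| match op {
                ArithOp::Add => a + b,
                ArithOp::Sub => a - b,
                ArithOp::Mul => a * b,
            }).collect()
        }
    }
}

/// Evaluates a comparison for every row, producing a boolean mask (WHERE).
pub fn compare(left: &Expr, op: CmpOp, right: &Expr, batch: &Batch) -> Vec<bool> {
    let lhs = eval(left, batch);
    let rhs = eval(right, batch);
    lhs.iter().zip(rhs.iter()).map(|(a, b)| match op {
        CmpOp::Lt => a < b,
        CmpOp::LtEq => a <= b,
        CmpOp::Eq => a == b,
        CmpOp::GtEq => a >= b,
        CmpOp::Gt => a > b,
        CmpOp::NotEq => a != b,
    }).collect()
}

/// Keeps only the values whose mask entry is true.
pub fn filter(values: &[i64], mask: &[bool]) -> Vec<i64> {
    values.iter().zip(mask.iter()).filter(|(_, keep)| **keep).map(|(v, _)| *v).collect()
}
```

With two columns `[1, 2, 3]` and `[10, 20, 30]`, evaluating `col0 * 2 + col1` yields `[12, 24, 36]`, and filtering by `col1 > 15` keeps `[24, 36]`. A production engine would avoid the intermediate clones and would dispatch on data type.
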
  19. [Slide with no transcribed text]
  20. [Slide with no transcribed text]
  21. Demo Time: PoC of a Rust-based Query Service using Apache Arrow
  22. Benchmarks!
      SELECT riskitem_occupancyId, occupancy_occupancyName, SUM(risk_totalTIV)
      FROM ContractPrimaryRealPropertyView_1234
      GROUP BY riskitem_occupancyId, occupancy_occupancyName
      ● Spark
        ○ Running in local mode
        ○ Parquet files on local SSD
        ○ Cached DataFrames
      ● DataFusion
        ○ Arrow format “MemTable”
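
The benchmark query is a grouped sum. As a minimal sketch of what such a query computes over columnar input, the function below performs a hash aggregation with the standard library's `HashMap`. The function name and signature are hypothetical, the parameter names simply mirror the columns in the SQL above, and this is not how DataFusion or Spark implement the operator.

```rust
use std::collections::HashMap;

/// Computes SUM(total_tiv) grouped by (occupancy_id, occupancy_name)
/// over three equally sized columns. Output is sorted by group key so
/// the result is deterministic.
pub fn sum_tiv_by_occupancy<'a>(
    occupancy_id: &[u32],
    occupancy_name: &[&'a str],
    total_tiv: &[f64],
) -> Vec<(u32, &'a str, f64)> {
    assert!(
        occupancy_id.len() == occupancy_name.len() && occupancy_id.len() == total_tiv.len(),
        "all columns must have the same number of rows"
    );

    // One accumulator per distinct group key.
    let mut groups: HashMap<(u32, &'a str), f64> = HashMap::new();
    for ((id, name), tiv) in occupancy_id
        .iter()
        .zip(occupancy_name.iter())
        .zip(total_tiv.iter())
    {
        *groups.entry((*id, *name)).or_insert(0.0) += *tiv;
    }

    // Flatten the map into rows and sort by the grouping columns.
    let mut out: Vec<(u32, &'a str, f64)> = groups
        .into_iter()
        .map(|((id, name), sum)| (id, name, sum))
        .collect();
    out.sort_by(|a, b| (a.0, a.1).cmp(&(b.0, b.1)));
    out
}
```

Only the three referenced columns are read, which is one reason a wide table suits a columnar engine: the remaining columns are never touched.
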
  23. Benchmark Results
      ● EC2 c5.18xlarge instance
        ○ 72 vCPUs
        ○ 144 GB memory
        ○ SSD (100 IOPS / 3000 burst)
      ● Data set: 5MM risk items
        ○ Wide table (~600 columns)
        ○ ~16 GB on disk
      Chart note: higher is better.
  24. DataFusion Roadmap
      Apache Arrow is a “do-ocracy” where the individual contributors get to decide the roadmap, but here are some things that I am planning to work on:
      ● DataFrame-style API for building logical query plans, as an alternative to SQL
      ● Parallel Query Execution (threads, partitions)
      ● Support for more data sources (Parquet, JSON)
      ● More complete SQL support (joins, subqueries, columnar UDFs)
      ● Distributed Execution
        ○ Distributed query planner & optimizer
        ○ Kubernetes & Docker deployment model
        ○ Apache Arrow Flight protocol for streaming data between nodes
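
The "threads, partitions" roadmap item can be sketched with scoped threads from the standard library (available since Rust 1.63): split a column into partitions, aggregate each partition on its own thread, then combine the partial results. `parallel_sum` is a hypothetical helper written for this example and does not reflect how DataFusion schedules work.

```rust
use std::thread;

/// Sums a column by splitting it into up to `partitions` contiguous
/// chunks and summing each chunk on its own scoped thread.
pub fn parallel_sum(values: &[i64], partitions: usize) -> i64 {
    // Guard against zero partitions and zero-length chunks.
    let partitions = partitions.max(1);
    let chunk_len = ((values.len() + partitions - 1) / partitions).max(1);

    thread::scope(|scope| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = values
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().sum::<i64>()))
            .collect();

        // Combine the per-partition partial sums.
        handles
            .into_iter()
            .map(|h| h.join().expect("partition worker panicked"))
            .sum::<i64>()
    })
}
```

Scoped threads let each worker borrow its slice directly, so no data is copied and the compiler checks that the borrowed column outlives the workers.
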
  25. Want to contribute?
      ● Great time to get involved!
        ○ The code base is still relatively small
          ■ Core Arrow library is 6k LOC
          ■ DataFusion is 4k LOC
        ○ Small number of regular contributors
        ○ Where to start?
          ■ https://cwiki.apache.org/confluence/display/ARROW/Rust+JIRA+Dashboard
        ○ Try adding DataFusion as a crate dependency
  26. Thanks! Questions?
      Contact Details
        ▪ @AndyGrove73
        ▪ andy.grove@rms.com
        ▪ https://www.linkedin.com/in/andygrove
      Arrow Resources:
        ▪ @ApacheArrow
        ▪ https://arrow.apache.org
        ▪ https://github.com/apache/arrow