Building a data lake is a daunting task. The promise of a virtual data lake is to provide the advantages of a data lake without consolidating all data into a single repository. With Apache Arrow and Dremio, companies can, for the first time, build virtual data lakes that provide full access to data no matter where it is stored and no matter what size it is.
3. The demands for data are growing rapidly
Increasing demands
Reporting
New products
Forecasting
Threat detection
BI
Machine Learning
Segmenting
Fraud prevention
4. Your analysts are hungry for data
SQL
But your data is everywhere
And it’s not in the shape they need
5. Today you engineer data flows and reshaping
Data Staging
• Custom ETL
• Fragile transforms
• Slow moving
SQL
6. Today you engineer data flows and reshaping
Data Staging
Data Warehouse
• $$$
• High overhead
• Proprietary lock in
• Custom ETL
• Fragile transforms
• Slow moving
SQL
7. Today you engineer data flows and reshaping
Data Staging
Data Warehouse
Cubes, BI Extracts & Aggregation Tables
• Data sprawl
• Governance issues
• Slow to update
• $$$
• High overhead
• Proprietary lock in
• Custom ETL
• Fragile transforms
• Slow moving
SQL
9. How Can We Tackle This Age-Old Problem?
Direct access to data
In-memory, GPU, …
Columnar
Distributed
10. Apache Arrow: Process & Move Data Fast
• Top-level Apache project as of Feb 2016
• Collaboration among many open source projects around shared needs
• Three components:
• Language-independent columnar data structures
• Implementations available for C++, Java, Python
• Metadata for describing schemas/record batches
• Protocol for moving data between processes without serialization overhead (see the sketch below)
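A minimal sketch of these pieces using the Python implementation (pyarrow): columnar arrays, a schema, and a record batch. Column names and values are illustrative, borrowed from the clickstream example later in the deck.

```python
import pyarrow as pa

# Language-independent columnar data structures
session_ids = pa.array([1331246351, 1331246352, 1331246353], type=pa.int64())
urls = pa.array(["/home", "/cart", "/checkout"], type=pa.string())

# Metadata describing the schema
schema = pa.schema([("session_id", pa.int64()), ("url", pa.string())])

# A record batch: a chunk of rows stored column by column under one schema
batch = pa.RecordBatch.from_arrays([session_ids, urls], schema=schema)
print(batch.num_rows, batch.schema)
```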
11. High-Performance Data Interchange
Today:
• Each system has its own internal memory format
• 70–80% of CPU is wasted on serialization and deserialization
• Similar functionality is implemented in multiple projects
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., a Parquet-to-Arrow reader), as sketched below
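A minimal sketch of the shared-functionality point, assuming pyarrow: one Parquet-to-Arrow reader serves any consumer that understands Arrow memory. The file path is hypothetical.

```python
import pyarrow.parquet as pq

# Parquet is decoded directly into Arrow columnar memory
table = pq.read_table("clickstream.parquet")

# Any Arrow-aware consumer can use the same table; the pandas conversion
# proceeds column-at-a-time rather than row by row
df = table.to_pandas()
```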
12. Data is Organized in Record Batches
[Diagram: a schema followed by a sequence of record batches. The streaming format writes the schema and then record batches in sequence; the file format additionally records the schema and file layout.]
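A minimal sketch of the two layouts in the diagram, assuming pyarrow's IPC APIs; file names and data are hypothetical.

```python
import pyarrow as pa

batch = pa.record_batch({"session_id": [1, 2, 3], "clicks": [10, 4, 7]})

# Streaming format: the schema, then record batches written in sequence
with pa.OSFile("clicks.arrows", "wb") as sink:
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        for _ in range(3):
            writer.write_batch(batch)

# File format: the same layout plus a footer describing the schema and file
# layout, so record batches can be located without scanning the whole file
with pa.OSFile("clicks.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, batch.schema) as writer:
        writer.write_batch(batch)
```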
13. Each Record Batch is Columnar
[Diagram: the query SELECT * FROM clickstream WHERE session_id = 1331246351 evaluated against a traditional row-oriented memory buffer versus an Arrow columnar memory buffer on an Intel CPU.]
Arrow leverages the data parallelism (SIMD) in modern Intel CPUs.
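A minimal sketch of why the columnar layout helps the query on this slide, assuming pyarrow's compute kernels; the data is illustrative. The filter scans one contiguous session_id column, which vectorized (SIMD-friendly) kernels can process in bulk.

```python
import pyarrow as pa
import pyarrow.compute as pc

clickstream = pa.table({
    "session_id": [1331246351, 1331246352, 1331246351],
    "url": ["/home", "/cart", "/checkout"],
})

# SELECT * FROM clickstream WHERE session_id = 1331246351
mask = pc.equal(clickstream["session_id"], 1331246351)
result = clickstream.filter(mask)
print(result.to_pydict())
```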
15. Fast Import of Arrow in Pandas & R
Credit: Wes McKinney, Two Sigma
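A minimal sketch of the pandas round trip being benchmarked here, assuming pyarrow; the DataFrame is illustrative.

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"session_id": [1, 2, 3], "clicks": [10, 4, 7]})

table = pa.Table.from_pandas(df)   # pandas -> Arrow
df_back = table.to_pandas()        # Arrow -> pandas, converted column-at-a-time
```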
16. Fast Export of Arrow in Spark
• Legacy export from Spark to Pandas (toPandas) was extremely slow
• Row-by-row conversion from Spark driver to Python memory
• SPARK-13534 introduced an Arrow-based implementation
• Wes McKinney (Two Sigma), Bryan Cutler (IBM), Li Jin (Two Sigma), and Yin Xusen (IBM)
• Set spark.sql.execution.arrow.enable = True (see the sketch after the results below)
Results (legacy toPandas vs. Arrow-based):
Clock time: 12.5 s → 1.89 s (6.6x faster)
Deserialization: 88% of the time → 1% of the time
Peak memory usage: 8x dataset size → 2x dataset size
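A minimal sketch of enabling the Arrow-backed toPandas path from SPARK-13534 in PySpark. The configuration key on the slide is an early name; released Spark 2.3/2.4 uses spark.sql.execution.arrow.enabled and Spark 3.x uses spark.sql.execution.arrow.pyspark.enabled, so adjust for your version. The DataFrame is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas").getOrCreate()

# Turn on the Arrow-based conversion (key name varies by Spark version; see above)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(1_000_000).selectExpr("id", "id % 7 AS bucket")

# Columnar transfer via Arrow instead of row-by-row serialization to Python
pdf = df.toPandas()
```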
20. Thank You
• Apache Arrow community
• Strata organizers
• Get involved
• Subscribe to the Arrow ASF lists
• Contribute to the Arrow project
• Want to learn more about Dremio?
• tshiran@dremio.com
Editor's Notes
BI assumes single relational database, but…
Data in non-relational technologies
Data fragmented across many systems
Massive scale and velocity
Data is the business, and…
Era of impatient smartphone natives
Rise of self-service BI
Accelerating time to market
Because of the complexity of modern data and increasing demands for data, IT gets crushed in the middle:
Slow or non-responsive IT
“Shadow Analytics”
Data governance risk
Elusive data engineers
Immature software
Competing strategic initiatives
Here’s the problem everyone is trying to solve today.
You have consumers of data with their favorite tools. BI products like Tableau, PowerBI, Qlik, as well as data science tools like Python, R, Spark, and SQL.
Then you have all your data, in a mix of relational, NoSQL, Hadoop, and cloud like S3.
So how are you going to get the data to the people asking for it?
Here’s how everyone tries to solve it:
First you move the data out of the operational systems into a staging area, that might be Hadoop, or one of the cloud file systems like S3 or Azure Blob Store.
You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile – when the sources change, the scripts have to change too.
Then you move the data into a data warehouse. This could be Redshift, Teradata, Vertica, or other products. These are all proprietary, and they take DBA experts to make them work. And to move the data here you write another set of scripts.
But what we see with many customers is that the performance here isn’t sufficient for their needs, and so …
You build cubes and aggregation tables to get the performance your users are asking for. And to do this you build another set of scripts.
In the end you’re left with something like this picture. You may have more layers, the technologies may be different, but you’re probably living with something like this. And nobody likes this – it’s expensive, the data movement is slow, it’s hard to change.
But worst of all, you’re left with a dynamic where every time a consumer of the data wants a new piece of data:
They open a ticket with IT
IT begins an engineering project to build another set of pipelines, over several weeks or months