Optimising geospatial queries with dynamic file pruning
Matthew Slack
Principal Data Architect, Wejo
Agenda
▪ Wejo
▪ Typical use cases
▪ Data lake indexing strategies
▪ Dynamic file pruning
▪ Z-ordering strategies
▪ Measuring effectiveness of file pruning
▪ Optimizing queries to activate file pruning
Wejo
Processing data from over 15m vehicles
▪ Data arrives from multiple OEM feeds, some streamed in real time and others micro-batched on ignition off
▪ 15m+ vehicles
▪ 18 billion data points per day
▪ 0.4 trillion per month
▪ Peak ~900,000 per second
▪ 5 PB of data (and growing)
Wejo and Databricks
[Architecture diagram: OEM feeds are ingested across on-prem, AWS, Google Cloud and Microsoft Azure into a core of stream and batch jobs (transform, filter & aggregate, insight) feeding the data lake; egress paths serve raw assets, data aggregates, bespoke data-science outputs and insights, the Adept stream, sample generation and preview, the BI datamart and dashboards, and a customer portal. Supporting functions: incident management, change and release, 24x7 monitoring and alerting, SecOps/DevOps, and infosec compliance and tooling. The data lake supports data analysis, data science, and data governance.]
▪ Databricks underpins our ad-hoc analysis of data for data science
▪ We use it to populate both the datamart for BI and the derived data used in WIM
▪ Introduction of Delta Lake to support geospatial workloads and CCPA and GDPR
Typical use cases
▪ Traffic intelligence
▪ Map matching
▪ Anomaly detection
▪ Journey intelligence
▪ Origin-destination analysis
All require both geospatial and temporal indexing
Spark geospatial support
▪ Many existing libraries and frameworks: Magellan, GeoPandas, GeoMesa, and many others…
▪ However…
  • Most are designed to efficiently join geospatial datasets that are already loaded to the cluster
  • They do not optimize the reading of data from underlying storage
  • GeoMesa can be configured to utilise the underlying partitioning on disk, but this is complex to set up
Data Lake Indexing Strategies (Recap)
Data lake partitioning
▪ Choice of partition columns in data lakes is always a trade-off between complexity and selectivity
▪ Partitioning in data lakes is usually done using a hierarchical directory structure, for example:
  • year=2020 / month=11 / day=17, day=18
  • ingest_date=2020-11-17 / ingest_hour=10, ingest_hour=11
  • country=US / state=NY, state=TX
  • ingest_date=2020-11-17 / state=NY
▪ Aim for fewer than ~10k partitions, and more than tens of MB per partition
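As a minimal sketch of the ingest_date/ingest_hour layout used later in the talk (the source path, output path, and presence of those columns are illustrative assumptions), a partitioned write looks like:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
// illustrative source; assumes ingest_date and ingest_hour columns exist
val df = spark.read.json("s3://example-bucket/landing/")
df.write
  .partitionBy("ingest_date", "ingest_hour")  // writes ingest_date=…/ingest_hour=… directories
  .parquet("s3://example-bucket/connected-car/")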
Dynamic partition pruning
▪ Allows partitions to be dynamically skipped at query runtime
▪ Partitions that do not match the query filters are excluded by the optimizer
▪ Significantly improved in Spark 3.0
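For illustration (the table names are hypothetical), dynamic partition pruning lets a query like the following skip every ingest_date partition that has no match in the dimension table:

spark.sql("""
  SELECT f.*
  FROM connected_car f
  JOIN dates_of_interest d
    ON f.ingest_date = d.ingest_date
  WHERE d.study = 'austin'
""")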
Data skipping, via effective partitioning strategies, combined with columnar file formats such as Parquet and ORC, was historically the main lever for optimizing a data lake
Dynamic File Pruning
Dynamic file pruning overview
▪ Introduced with Databricks Delta Lake in early 2020
▪ Allows files to be dynamically skipped within a partition
▪ Files that do not match the query filters are excluded by the optimizer
▪ Databricks collects metadata on a subset of columns in all files added to the dataset, stored in the _delta_log folder
▪ Relies on the data having been sorted into files beforehand
Anatomy of _delta_log/00000000000000000000.json
▪ Per-file min/max statistics are recorded in the "add" entries of the JSON transaction log; the number of columns indexed is controlled by the delta.dataSkippingNumIndexedCols table property
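A hedged example of tuning this property (the table path is illustrative; by default Delta collects statistics on the first 32 columns):

spark.sql("""
  ALTER TABLE delta.`s3://example-bucket/connected-car-delta/`
  SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '8')
""")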
Test environment
▪ Test cluster
  • Spark 2.4.5, Databricks Runtime 6.6
  • 10 × i3.4xlarge executors
  • Auto scaling off
  • CLEAR CACHE before each test
▪ Input dataset
  • One day of connected car data (6th March)
  • Nested Parquet in AWS S3
  • Over 16 billion rows, ~40 columns
  • 2491 files, ~1.4 TB data, ~0.5 GB per file
▪ Partition strategy
  • ingest_date, ingest_hour
Naïve example on un-optimized Parquet
▪ Generate all geohashes that cover the Austin polygon
▪ Spark scans all files that match the partition filters
▪ Just over 97M datapoints are covered by the geohashes
However…
▪ Slow… over 5 minutes
▪ Datapoints are spread randomly across all of the 2491 input files
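A hedged sketch of this naïve query: coveringGeohashes stands in for whatever polygon-to-geohash covering routine is used (the talk does not name one), and the path and geohash column are illustrative assumptions:

import org.apache.spark.sql.functions._

// hypothetical helper: returns geohash prefixes covering the polygon
val austinHashes: Seq[String] = coveringGeohashes(austinPolygon, precision = 5)

val points = spark.read.parquet("s3://example-bucket/connected-car/")
  .where(col("ingest_date") === "2020-03-06")  // partition filter only
  .where(substring(col("geohash"), 1, 5).isin(austinHashes: _*))
points.count()  // ~97M matching rows, but every one of the 2491 files is scanned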
"Z-Ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms to dramatically reduce the amount of data that needs to be read."
— Databricks documentation
Z-Ordering the data
▪ Convert to Delta, then OPTIMIZE
▪ New dataset
  • 6233 files, ~0.8 TB data, ~128 MB per file
▪ Partition strategy
  • ingest_date, ingest_hour
▪ Z-ordered columns
  • longitude, latitude
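A minimal sketch of those two steps (the path and partition column types are illustrative):

spark.sql("""
  CONVERT TO DELTA parquet.`s3://example-bucket/connected-car/`
  PARTITIONED BY (ingest_date STRING, ingest_hour INT)
""")

spark.sql("""
  OPTIMIZE delta.`s3://example-bucket/connected-car/`
  ZORDER BY (longitude, latitude)
""")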
Co-location of geospatial data
[Chart: spread of data across files, after Z-ORDER by geohash]
Measuring file pruning
▪ input_file_name
▪ EXPLAIN
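Two quick checks, sketched here against the points DataFrame from earlier (names illustrative):

import org.apache.spark.sql.functions.input_file_name

// 1. how many underlying files did the query actually read?
val filesRead = points.select(input_file_name()).distinct().count()

// 2. inspect the physical plan: look at the file scan and its pushed filters
points.explain(true)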
Measuring file pruning – continued

// Uses Databricks-internal classes (the DeltaTable extractor and
// PreparedDeltaFileIndex); WARNING: this info will not be available >= DBR 7.0
df.queryExecution.optimizedPlan.collect {
  case DeltaTable(prepared: PreparedDeltaFileIndex) =>
    val stat = prepared.preparedScan
    // partitionReadBytes: bytes matched by the partition filters alone, so
    // skippingPct is the share of those bytes skipped by file-level pruning
    val skippingPct = 100 -
      (stat.scanned.bytesCompressed.get.toDouble /
        partitionReadBytes.get) * 100
    skippingPct
}
Candidate z-order columns
▪ Longitude/latitude
▪ Geohash
▪ Zipcode
▪ State
▪ H3 (Uber - https://eng.uber.com/h3/)
▪ S2 (Google - https://s2geometry.io/)
Because geospatial columns are strongly correlated with one another, z-ordering by one geospatial column WILL also enable dynamic file pruning for queries that filter on a different geospatial column. For example, z-ordering on geohash will also improve performance for queries on zipcode or state.
Comparing performance
Each cell shows the query time and the percentage of data skipped, for each combination of z-ordered columns (rows) and query filter type (columns); — marks combinations with no reported result, and the placement of cells in short rows is inferred from the slide order.

| Columns Z-Ordered | INNER JOIN (with broadcast) | geohash range | h3 range | longitude/latitude range | state | zipcode range (10) | make/model range |
| longitude, latitude | 5.55 mins / 0% | 33s / 82% | 30s / 81% | 12s / 93% | 26s / 74% | 8s / 97% | 1.6 mins / 0% |
| h3 | — | 27s / 80% | 14s / 96% | 18s / 92% | 38s / 68% | 11s / 95% | 1.7 mins / 0% |
| geohash | — | 12s / 98% | 16s / 84% | 16s / 92% | 20s / 76% | 7s / 97% | 1.3 mins / 0% |
| make/model | — | 1.4 mins / 0% | 1.4 mins / 0% | 1.4 mins / 0% | — | — | 20s / 84% |
| geohash, longitude, latitude | — | 27s / 90% | 23s / 82% | 23s / 91% | — | — | — |
| geohash, make/model | — | 17s / 83% | 28s / 55% | 35s / 58% | 43s / 38% | 15s / 84% | 36s / 61% |
| s2 | — | 17s / 83% | 36s / 82% | 22s / 91% | 30s / 73% | 11s / 96% | 1.8 mins / 0% |
Spot the difference
Two superficially identical queries can differ in whether they activate dynamic file pruning. Check the query plan!
Query caveats
▪ Only certain query filters will activate file pruning
Activates:
• Simple filters, e.g. =, >, <, BETWEEN
• IN (with up to 10 values)
• INNER JOIN, but it must be BROADCAST and the join column must be included in the WHERE clause
Doesn't activate:
• RLIKE
• IN (with over 10 values)
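Hedged illustrations of these caveats (df, the geohash values, and tenHashes are placeholders):

import spark.implicits._

// activates: simple range predicate on the z-ordered column
df.where($"geohash" >= "9v6k" && $"geohash" < "9v6m")

// activates: IN-list with 10 or fewer values
df.where($"geohash".isin(tenHashes: _*))  // tenHashes: a Seq of at most 10 geohashes

// does NOT activate: regex predicate
df.where($"geohash".rlike("^9v6[km]"))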
Skipping the OPTIMIZE step
▪ OPTIMIZE is effective but expensive
▪ Requires an additional step in your data pipelines
▪ repartitionByRange
  • works for batch and stream
  • does not ZORDER, but does shuffle datapoints into files based on the selected columns
  • can use a column such as geohash, which is already a z-order over the data
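A sketch of this write path, assuming a hypothetical geohash UDF (the partition count and paths are also illustrative); because a geohash is itself a Z-order curve over longitude/latitude, range-partitioning by it leaves each file covering a compact geographic area:

// geohashUdf is a hypothetical UDF computing a geohash from lat/lon
df.withColumn("geohash", geohashUdf($"latitude", $"longitude"))
  .repartitionByRange(6000, $"geohash")  // range-shuffles rows into geohash-ordered files
  .write
  .partitionBy("ingest_date", "ingest_hour")
  .parquet("s3://example-bucket/connected-car-sorted/")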
Process overview
▪ Importing data: import in stream or batch → process → add geohash column (an implicit Z-ORDER) → repartitionByRange (requires a shuffle) → write to parquet (non-delta table) → import to delta (does not require a shuffle)
▪ Querying data: convert all geospatial queries to a set of covering geohashes → look up into the table using a BROADCAST join (see the sketch below)
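The query side, sketched under the same assumptions as earlier (coveringGeohashes is a hypothetical helper; carData and the column names are illustrative):

import spark.implicits._
import org.apache.spark.sql.functions.broadcast

// covering geohashes for the query polygon
val hashes = coveringGeohashes(queryPolygon, precision = 5).toDF("geohash")

val result = carData
  .join(broadcast(hashes), Seq("geohash"))  // broadcast inner join on the z-ordered column
  .where($"geohash".isNotNull)              // keep the join column in a WHERE clause (see caveats)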
Up to a 100× reduction in query read times from object storage (S3).
With the right query adjustments, dynamic file pruning is a very effective tool for optimizing geospatial queries that read from object storage such as S3.
Feedback
Your feedback is important to us. Don't forget to rate and review the sessions.