SlideShare a Scribd company logo
1 of 46
Download to read offline
A Day in the life of a Druid
Architect
Benjamin Hopp
Senior Solutions Architect @ Imply
ben@imply.io
San Francisco Airport Marriott Waterfront
Real-Time Analytics at Scale
https://www.druidsummit.org/
What do I do?
Productionalization
Implementation
Recommendation
Education
Ask a lot of Questions
● What is the use-case?
○ Is it a good fit for druid?
● Who are the stakeholders?
○ End users - running queries
○ Data Engineers - ingesting data
○ Cluster Administrators - managing services
● How are they using the cluster?
● Where is the data coming from?
● What are the issues or concerns?
● Where does druid fit in the technology stack?
When to use Druid
6
Search
platform
OLAP
● Real-time ingestion
● Flexible schema
● Full text search
● Batch ingestion
● Efficient storage
● Fast analytic queries
Timeseries
database
● Optimized for
time-based datasets
● Time-based functions
When NOT to use Druid
7
OLTP
Individual
record
update/delet
e
Big join
operations
Where Druid fits in
8
Data lakes
Message buses
Raw data Storage Analyze Application
Cluster Evaluation
Druid Architecture
Pick your servers
Data NodesD
● Large-ish
● Scales with size of data and query
volume
● Lots of cores, lots of memory, fast NVMe
disk
Query NodesQ
● Medium-ish
● Scales with concurrency and # of Data
nodes
● Typically CPU bound
Master NodesM
● Small-ish Nodes
● Coordinator scales with # of segments
● Overlord scales with # of supervisors and
tasks
Configure for MAXIMUM PERFORMANCE
Data NodesD
● Enable Cache
● Heap/maxDirectMemory size
● druid.processing.buffer.sizeBytes
● druid.processing.numMergeBuffers
● druid.processing.numThreads
Query NodesQ
● Disable Caching
● Heap/maxDirectMemory size
● druid.broker.http.numConnections
● druid.processing.numMergeBuffers
● druid.processing.numThreads
Master NodesM ● Heap Size
Data Evaluation
Unified Console
Optimize segment size
Ideally 300 - 700 mb (~ 5 million rows)
To control segment size
● Alter segment granularity
● Specify partition spec
● Use Automatic Compaction
Controlling Segment Size
● Number of Tasks - Keep to lowest number that supports max
ingestion rate.
● Segment Granularity - Increase if only 1 file per segment and <
200MB
"segmentGranularity": "HOUR"
● Max Rows Per Segment - Increase if a single segment is <
200MB
"maxRowsPerSegment": 5000000
Compaction
● Combines small segments into larger segments
● Useful for late-arriving data
● Task submitted to Overlord
{
"type" : "compact",
"dataSource" : "wikipedia",
"interval" : "2017-01-01/2018-01-01"
}
Rollup
● Pre-aggregation at ingestion
time
● Saves space, better
compression
● Query performance boost
Rollup
timestamp page city count sum_added sum_deleted
2011-01-01T00:00:00Z
Justin
Bieber
SF 3 50 61
2011-01-01T00:00:00Z Ke$ha LA 2 46 53
2011-01-01T00:00:00Z Miley
Cyrus
DC 4 198 88
timestamp page city added deleted
2011-01-01T00:01:35Z
Justin
Bieber
SF 10 5
2011-01-01T00:03:45Z
Justin
Bieber
SF 25 37
2011-01-01T00:05:62Z
Justin
Bieber
SF 15 19
2011-01-01T00:06:33Z Ke$ha LA 30 45
2011-01-01T00:08:51Z Ke$ha LA 16 8
2011-01-01T00:09:17Z
Miley
Cyrus
DC 75 10
2011-01-01T00:11:25Z
Miley
Cyrus
DC 11 25
2011-01-01T00:23:30Z
Miley
Cyrus
DC 22 12
2011-01-01T00:49:33Z
Miley
Cyrus
DC 90 41
Summarize with data sketches
timestamp page city count
sum_
added
sum_
deleted userid_sketch
2011-01-01T00:00:00Z
Justin
Bieber
SF 3 50 61 sketch_obj
2011-01-01T00:00:00Z Ke$ha LA 2 46 53 sketch_obj
2011-01-01T00:00:00Z Miley
Cyrus
DC 4 198 88 sketch_obj
timestamp page userid city added deleted
2011-01-01T00:01:3
5Z
Justin
Bieber
user11 SF 10 5
2011-01-01T00:03:4
5Z
Justin
Bieber
user22 SF 25 37
2011-01-01T00:05:6
2Z
Justin
Bieber
user11 SF 15 19
2011-01-01T00:06:3
3Z
Ke$ha user33 LA 30 45
2011-01-01T00:08:5
1Z
Ke$ha user33 LA 16
8
2011-01-01T00:09:1
7Z
Miley
Cyrus
user11 DC 75 10
2011-01-01T00:11:2
5Z
Miley
Cyrus
user44 DC 11 25
2011-01-01T00:23:3
0Z
Miley
Cyrus
user44 DC 22 12
2011-01-01T00:49:3
3Z
Miley
Cyrus
user55 DC 90 41
Choose column types carefully
String column
indexed
fast aggregation
fast grouping
Numeric column
indexed
fast aggregation
fast grouping
Partitioning beyond time
● Druid always partitions by time
● Decide which dimension to
partition on… next
● Partition by some dimension you
often filter on
● Improves locality, compression,
storage size, query performance
Query Evaluation
Decisions based on data!
Use Druid SQL
● Easier to learn/more familiar
● Will attempt to make intelligent query type choices (timeseries
vs topN vs groupBy)
● There are some limitations - such as multi-value dimensions,
not all aggregations are supported
Explain Plan
EXPLAIN PLAN FOR
SELECT channel, sum(added)
FROM wikipedia
WHERE commentLength >= 50
GROUP BY channel
ORDER BY sum(added) desc
LIMIT 3
Pick your query carefully
● TimeBoundary - Returns min/max timestamp for given interval.
● Timeseries - When you don’t want to group by dimension
● TopN - When you want to group by a single dimension
○ Approximate if > 1000 dimension values
● GroupBy - Least performant/most flexible
● Scan - For returning streaming raw data
○ Perfect ordering not preserved
● Select - For returning paginated raw data
● Search - Returns dimensions that match text search
Using Lookups
● Use lookups when you have dimensions that change to avoid
re-indexing data
● Lookups are key/value pairs stored on every node.
● Loaded via file or JDBC connection to external database
● Lookups are loaded into the java heap size, so large lookups
need larger heaps
Stay in touch
29
@druidio
https://imply.io
https://druid.apache.org/
Ben Hopp
Benjamin.hopp@imply.io
LinkedIn: benhopp
@implydata
roadmap and community update
Ben Hopp
ben@imply.io
Apache Druid 0.17.0
Druid 0.17.0
Our first release as a top-level Apache project!
3
Druid 0.17.0 Highlights
● Native batch - binary inputs & more
○ Supports non-binary formats such as ORC, Parquet, and Avro
○ Native batch tasks can now read from HDFS
○ Single-dimension range partitioning for parallel native batch
● Compaction improvements
○ Parallel index task split hints and parallel auto-compaction
○ Stateful auto-compaction
● Parallel query merge on brokers
○ Broker can now optionally merge query results in parallel using multiple threads.
4
Druid 0.17.0 Highlights
● ...and More!
○ Improved SQL-compatible null handling
○ New dropwizard emitter which supports counter, gauge, meter, timer and histogram
metric types
○ Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in the system
tables in a new sys.supervisors table
○ Fast historical start with deferred loading of segments until query time
○ New readiness and self-discovery resources
○ Task assignment based on MiddleManager categories
○ Security updates
5
Apache Druid 0.16.0
Druid 0.16.0
Over 350 new features from 50 contributors!
Released September 2019.
7
Druid 0.16.0 Highlights
● Native parallel batch shuffle
○ Two-phase shuffle system allows for ‘perfect rollup’ and partitioning on dimensions
● Query vectorization phase one
○ Allows queries to be sped up by reducing the number of method calls
● Indexer process
○ An alternative to the MiddleManager + Peon task execution system which is easier to
configure and deploy
● Improved web console
○ Kafka & Kinesis support!
○ Point-and-click reindexing
8
Druid 0.17.0
Our first release as a top-level Apache project!
Coming soon (really soon).
9
Druid 0.17.0 Highlights
● Native batch - binary inputs & more
○ Supports non-binary formats such as ORC, Parquet, and Avro
○ Native batch tasks can now read from HDFS
○ Single-dimension range partitioning for parallel native batch
● Compaction improvements
○ Parallel index task split hints and parallel auto-compaction
○ Stateful auto-compaction
● Parallel query merge on brokers
○ Broker can now optionally merge query results in parallel using multiple threads.
10
Druid 0.17.0 Highlights
● ...and More!
○ Improved SQL-compatible null handling
○ New dropwizard emitter which supports counter, gauge, meter, timer and histogram
metric types
○ Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in the system
tables in a new sys.supervisors table
○ Fast historical start with deferred loading of segments until query time
○ New readiness and self-discovery resources
○ Task assignment based on MiddleManager categories
○ Security updates
11
…and beyond!!
…and beyond!!
A selection of items planned for future 2020 Druid releases.
13
…and beyond!!
● SQL Joins
○ A multi-phase project to add full SQL Join support to Druid. Coming up first -
sub-queries and lookups
● Windowed aggregations
○ For example, moving average and cumulative sum aggregations.
● Dynamic query prioritization & laning
○ Mix ‘heavy’ and ‘light’ workloads in the same cluster without heavy workloads blocking
light ones.
● Extended query vectorization support
○ Richer support for query vectorization against more query types
14
Download
Druid community site (new): https://druid.apache.org/
Imply distribution: https://imply.io/get-started
15
Contribute
16
https://github.com/apache/druid
Stay in touch
17
@druidio
Join the community!
http://druid.io/community
Free training hosted by Imply!
https://imply.io/druid-days
Follow the Druid project on Twitter!

More Related Content

What's hot

Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingSpark Summit
 
The Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInThe Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInCarl Steinbach
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoSadayuki Furuhashi
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInAmy W. Tang
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Spark Summit
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerMichael Spector
 
Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Taro L. Saito
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Zekeriya Besiroglu
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonSpark Summit
 
Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...
Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...
Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...Codemotion
 
July 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using DruidJuly 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using DruidYahoo Developer Network
 
August meetup - All about Apache Druid
August meetup - All about Apache Druid August meetup - All about Apache Druid
August meetup - All about Apache Druid Imply
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 
Monitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadogMonitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadogSeth Rosenblum
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...Modern Data Stack France
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016kbajda
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaTreasure Data, Inc.
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkItai Yaffe
 

What's hot (20)

Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
 
tdtechtalk20160330johan
tdtechtalk20160330johantdtechtalk20160330johan
tdtechtalk20160330johan
 
The Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInThe Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedIn
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...
Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...
Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
July 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using DruidJuly 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using Druid
 
August meetup - All about Apache Druid
August meetup - All about Apache Druid August meetup - All about Apache Druid
August meetup - All about Apache Druid
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Monitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadogMonitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadog
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with Rocana
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using Spark
 

Similar to A Day in the Life of a Druid Architect

Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding HadoopAhmed Ossama
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Dipti Borkar
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ IndixRajesh Muppalla
 
Scaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in CloudScaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in CloudChangshu Liu
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...MongoDB
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
 
Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020Piotr Findeisen
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupBlake Irvine
 
Geospatial data platform at Uber
Geospatial data platform at UberGeospatial data platform at Uber
Geospatial data platform at UberDataWorks Summit
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...Marcin Bielak
 
Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Zhenxiao Luo
 

Similar to A Day in the Life of a Druid Architect (20)

Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ Indix
 
Scaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in CloudScaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in Cloud
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020
 
Presto@Uber
Presto@UberPresto@Uber
Presto@Uber
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
 
Geospatial data platform at Uber
Geospatial data platform at UberGeospatial data platform at Uber
Geospatial data platform at Uber
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019
 

More from Itai Yaffe

Mastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingMastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingItai Yaffe
 
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationSolving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationItai Yaffe
 
Lessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsLessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsItai Yaffe
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Itai Yaffe
 
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Itai Yaffe
 
Evaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesEvaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesItai Yaffe
 
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsData Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsItai Yaffe
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your DataItai Yaffe
 
Data Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesData Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesItai Yaffe
 
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Itai Yaffe
 
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidDevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidItai Yaffe
 
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Itai Yaffe
 
Scalable Incremental Index for Druid
Scalable Incremental Index for DruidScalable Incremental Index for Druid
Scalable Incremental Index for DruidItai Yaffe
 
Funnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidFunnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidItai Yaffe
 
The benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerThe benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerItai Yaffe
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Itai Yaffe
 
Scheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureScheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureItai Yaffe
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentItai Yaffe
 
Serverless data processing built for internet SCALE
Serverless data processing built for internet SCALEServerless data processing built for internet SCALE
Serverless data processing built for internet SCALEItai Yaffe
 
Ask me anything - Women in Big Data Israel
Ask me anything - Women in Big Data IsraelAsk me anything - Women in Big Data Israel
Ask me anything - Women in Big Data IsraelItai Yaffe
 

More from Itai Yaffe (20)

Mastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingMastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data Processing
 
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationSolving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
 
Lessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsLessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark Applications
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?
 
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"
 
Evaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesEvaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening Notes
 
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsData Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management Monoliths
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your Data
 
Data Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesData Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening Notes
 
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
 
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidDevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
 
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
 
Scalable Incremental Index for Druid
Scalable Incremental Index for DruidScalable Incremental Index for Druid
Scalable Incremental Index for Druid
 
Funnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidFunnel Analysis with Spark and Druid
Funnel Analysis with Spark and Druid
 
The benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerThe benefits of running Spark on your own Docker
The benefits of running Spark on your own Docker
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?
 
Scheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureScheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructure
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless Environment
 
Serverless data processing built for internet SCALE
Serverless data processing built for internet SCALEServerless data processing built for internet SCALE
Serverless data processing built for internet SCALE
 
Ask me anything - Women in Big Data Israel
Ask me anything - Women in Big Data IsraelAsk me anything - Women in Big Data Israel
Ask me anything - Women in Big Data Israel
 

Recently uploaded

Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
在线办理UM毕业证迈阿密大学毕业证成绩单留信学历认证
在线办理UM毕业证迈阿密大学毕业证成绩单留信学历认证在线办理UM毕业证迈阿密大学毕业证成绩单留信学历认证
在线办理UM毕业证迈阿密大学毕业证成绩单留信学历认证nhjeo1gg
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
办理(UC毕业证书)英国坎特伯雷大学毕业证成绩单原版一比一
办理(UC毕业证书)英国坎特伯雷大学毕业证成绩单原版一比一办理(UC毕业证书)英国坎特伯雷大学毕业证成绩单原版一比一
办理(UC毕业证书)英国坎特伯雷大学毕业证成绩单原版一比一F La
 

Recently uploaded (20)

Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
在线办理UM毕业证迈阿密大学毕业证成绩单留信学历认证
在线办理UM毕业证迈阿密大学毕业证成绩单留信学历认证在线办理UM毕业证迈阿密大学毕业证成绩单留信学历认证
在线办理UM毕业证迈阿密大学毕业证成绩单留信学历认证
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
办理(UC毕业证书)英国坎特伯雷大学毕业证成绩单原版一比一
办理(UC毕业证书)英国坎特伯雷大学毕业证成绩单原版一比一办理(UC毕业证书)英国坎特伯雷大学毕业证成绩单原版一比一
办理(UC毕业证书)英国坎特伯雷大学毕业证成绩单原版一比一
 

A Day in the Life of a Druid Architect

  • 1. A Day in the life of a Druid Architect Benjamin Hopp Senior Solutions Architect @ Imply ben@imply.io
  • 2. San Francisco Airport Marriott Waterfront Real-Time Analytics at Scale https://www.druidsummit.org/
  • 3.
  • 4. What do I do? Productionalization Implementation Recommendation Education
  • 5. Ask a lot of Questions ● What is the use-case? ○ Is it a good fit for druid? ● Who are the stakeholders? ○ End users - running queries ○ Data Engineers - ingesting data ○ Cluster Administrators - managing services ● How are they using the cluster? ● Where is the data coming from? ● What are the issues or concerns? ● Where does druid fit in the technology stack?
  • 6. When to use Druid 6 Search platform OLAP ● Real-time ingestion ● Flexible schema ● Full text search ● Batch ingestion ● Efficient storage ● Fast analytic queries Timeseries database ● Optimized for time-based datasets ● Time-based functions
  • 7. When NOT to use Druid 7 OLTP Individual record update/delet e Big join operations
  • 8. Where Druid fits in 8 Data lakes Message buses Raw data Storage Analyze Application
  • 11. Pick your servers Data NodesD ● Large-ish ● Scales with size of data and query volume ● Lots of cores, lots of memory, fast NVMe disk Query NodesQ ● Medium-ish ● Scales with concurrency and # of Data nodes ● Typically CPU bound Master NodesM ● Small-ish Nodes ● Coordinator scales with # of segments ● Overlord scales with # of supervisors and tasks
  • 12. Configure for MAXIMUM PERFORMANCE Data NodesD ● Enable Cache ● Heap/maxDirectMemory size ● druid.processing.buffer.sizeBytes ● druid.processing.numMergeBuffers ● druid.processing.numThreads Query NodesQ ● Disable Caching ● Heap/maxDirectMemory size ● druid.broker.http.numConnections ● druid.processing.numMergeBuffers ● druid.processing.numThreads Master NodesM ● Heap Size
  • 15. Optimize segment size Ideally 300 - 700 mb (~ 5 million rows) To control segment size ● Alter segment granularity ● Specify partition spec ● Use Automatic Compaction
  • 16. Controlling Segment Size ● Number of Tasks - Keep to lowest number that supports max ingestion rate. ● Segment Granularity - Increase if only 1 file per segment and < 200MB "segmentGranularity": "HOUR" ● Max Rows Per Segment - Increase if a single segment is < 200MB "maxRowsPerSegment": 5000000
  • 17. Compaction ● Combines small segments into larger segments ● Useful for late-arriving data ● Task submitted to Overlord { "type" : "compact", "dataSource" : "wikipedia", "interval" : "2017-01-01/2018-01-01" }
  • 18. Rollup ● Pre-aggregation at ingestion time ● Saves space, better compression ● Query performance boost
  • 19. Rollup timestamp page city count sum_added sum_deleted 2011-01-01T00:00:00Z Justin Bieber SF 3 50 61 2011-01-01T00:00:00Z Ke$ha LA 2 46 53 2011-01-01T00:00:00Z Miley Cyrus DC 4 198 88 timestamp page city added deleted 2011-01-01T00:01:35Z Justin Bieber SF 10 5 2011-01-01T00:03:45Z Justin Bieber SF 25 37 2011-01-01T00:05:62Z Justin Bieber SF 15 19 2011-01-01T00:06:33Z Ke$ha LA 30 45 2011-01-01T00:08:51Z Ke$ha LA 16 8 2011-01-01T00:09:17Z Miley Cyrus DC 75 10 2011-01-01T00:11:25Z Miley Cyrus DC 11 25 2011-01-01T00:23:30Z Miley Cyrus DC 22 12 2011-01-01T00:49:33Z Miley Cyrus DC 90 41
  • 20. Summarize with data sketches timestamp page city count sum_ added sum_ deleted userid_sketch 2011-01-01T00:00:00Z Justin Bieber SF 3 50 61 sketch_obj 2011-01-01T00:00:00Z Ke$ha LA 2 46 53 sketch_obj 2011-01-01T00:00:00Z Miley Cyrus DC 4 198 88 sketch_obj timestamp page userid city added deleted 2011-01-01T00:01:3 5Z Justin Bieber user11 SF 10 5 2011-01-01T00:03:4 5Z Justin Bieber user22 SF 25 37 2011-01-01T00:05:6 2Z Justin Bieber user11 SF 15 19 2011-01-01T00:06:3 3Z Ke$ha user33 LA 30 45 2011-01-01T00:08:5 1Z Ke$ha user33 LA 16 8 2011-01-01T00:09:1 7Z Miley Cyrus user11 DC 75 10 2011-01-01T00:11:2 5Z Miley Cyrus user44 DC 11 25 2011-01-01T00:23:3 0Z Miley Cyrus user44 DC 22 12 2011-01-01T00:49:3 3Z Miley Cyrus user55 DC 90 41
  • 21. Choose column types carefully String column indexed fast aggregation fast grouping Numeric column indexed fast aggregation fast grouping
  • 22. Partitioning beyond time ● Druid always partitions by time ● Decide which dimension to partition on… next ● Partition by some dimension you often filter on ● Improves locality, compression, storage size, query performance
  • 25. Use Druid SQL ● Easier to learn/more familiar ● Will attempt to make intelligent query type choices (timeseries vs topN vs groupBy) ● There are some limitations - such as multi-value dimensions, not all aggregations are supported
  • 26. Explain Plan EXPLAIN PLAN FOR SELECT channel, sum(added) FROM wikipedia WHERE commentLength >= 50 GROUP BY channel ORDER BY sum(added) desc LIMIT 3
  • 27. Pick your query carefully ● TimeBoundary - Returns min/max timestamp for given interval. ● Timeseries - When you don’t want to group by dimension ● TopN - When you want to group by a single dimension ○ Approximate if > 1000 dimension values ● GroupBy - Least performant/most flexible ● Scan - For returning streaming raw data ○ Perfect ordering not preserved ● Select - For returning paginated raw data ● Search - Returns dimensions that match text search
  • 28. Using Lookups ● Use lookups when you have dimensions that change to avoid re-indexing data ● Lookups are key/value pairs stored on every node. ● Loaded via file or JDBC connection to external database ● Lookups are loaded into the java heap size, so large lookups need larger heaps
  • 29. Stay in touch 29 @druidio https://imply.io https://druid.apache.org/ Ben Hopp Benjamin.hopp@imply.io LinkedIn: benhopp @implydata
  • 30. roadmap and community update Ben Hopp ben@imply.io
  • 32. Druid 0.17.0 Our first release as a top-level Apache project! 3
  • 33. Druid 0.17.0 Highlights ● Native batch - binary inputs & more ○ Supports non-binary formats such as ORC, Parquet, and Avro ○ Native batch tasks can now read from HDFS ○ Single-dimension range partitioning for parallel native batch ● Compaction improvements ○ Parallel index task split hints and parallel auto-compaction ○ Stateful auto-compaction ● Parallel query merge on brokers ○ Broker can now optionally merge query results in parallel using multiple threads. 4
  • 34. Druid 0.17.0 Highlights ● ...and More! ○ Improved SQL-compatible null handling ○ New dropwizard emitter which supports counter, gauge, meter, timer and histogram metric types ○ Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in the system tables in a new sys.supervisors table ○ Fast historical start with deferred loading of segments until query time ○ New readiness and self-discovery resources ○ Task assignment based on MiddleManager categories ○ Security updates 5
  • 36. Druid 0.16.0 Over 350 new features from 50 contributors! Released September 2019. 7
  • 37. Druid 0.16.0 Highlights ● Native parallel batch shuffle ○ Two-phase shuffle system allows for ‘perfect rollup’ and partitioning on dimensions ● Query vectorization phase one ○ Allows queries to be sped up by reducing the number of method calls ● Indexer process ○ An alternative to the MiddleManager + Peon task execution system which is easier to configure and deploy ● Improved web console ○ Kafka & Kinesis support! ○ Point-and-click reindexing 8
  • 38. Druid 0.17.0 Our first release as a top-level Apache project! Coming soon (really soon). 9
  • 39. Druid 0.17.0 Highlights ● Native batch - binary inputs & more ○ Supports non-binary formats such as ORC, Parquet, and Avro ○ Native batch tasks can now read from HDFS ○ Single-dimension range partitioning for parallel native batch ● Compaction improvements ○ Parallel index task split hints and parallel auto-compaction ○ Stateful auto-compaction ● Parallel query merge on brokers ○ Broker can now optionally merge query results in parallel using multiple threads. 10
  • 40. Druid 0.17.0 Highlights ● ...and More! ○ Improved SQL-compatible null handling ○ New dropwizard emitter which supports counter, gauge, meter, timer and histogram metric types ○ Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in the system tables in a new sys.supervisors table ○ Fast historical start with deferred loading of segments until query time ○ New readiness and self-discovery resources ○ Task assignment based on MiddleManager categories ○ Security updates 11
  • 42. …and beyond!! A selection of items planned for future 2020 Druid releases. 13
  • 43. …and beyond!! ● SQL Joins ○ A multi-phase project to add full SQL Join support to Druid. Coming up first - sub-queries and lookups ● Windowed aggregations ○ For example, moving average and cumulative sum aggregations. ● Dynamic query prioritization & laning ○ Mix ‘heavy’ and ‘light’ workloads in the same cluster without heavy workloads blocking light ones. ● Extended query vectorization support ○ Richer support for query vectorization against more query types 14
  • 44. Download Druid community site (new): https://druid.apache.org/ Imply distribution: https://imply.io/get-started 15
  • 46. Stay in touch 17 @druidio Join the community! http://druid.io/community Free training hosted by Imply! https://imply.io/druid-days Follow the Druid project on Twitter!