SlideShare a Scribd company logo
1 of 42
Download to read offline
SPARK SUMMIT
EUROPE2016
FROM SINGLE-TENANT HADOOP
TO 3000 TENANTS IN APACHE SPARK
RUBEN PULIDO
BEHAR VELIQI
IBM WATSON ANALYTICS FOR SOCIAL MEDIA
IBM GERMANY R&D LAB
WHAT IS WATSON ANALYTICS FOR SOCIAL MEDIA
PREVIOUS ARCHITECTURE
THOUGHT PROCESS TOWARDS MULTITENANCY
NEW ARCHITECTURE
LESSONS LEARNED
WATSON ANALYTICS A data analytics solution for business users in the cloud
users
WATSON ANALYTICS FOR SOCIAL MEDIA
◉ Part of Watson Analytics
◉ Allows users to
harvest and analyze
social media content
◉ To understand how
brands, products,
services, social issues,
…
are perceived
Cognos BI
Text Analytics
BigInsights
DB2
Evolving Topics
End to End Orchestration
Content is
stored in
HDFS…
…analyzed
within our
components...
…stored in
HDFS
again...
…and loaded
into DB2
Our previous architecture:
a single-tenant “big data” ETL batch pipeline
Photo by GhislainBonneau,gbphotodidactical.ca
What people think about Social Media data rates
What Stream Processing is really good at
Our requirement:
Efficiently processing
“trickle feeds” of data
What the size of a customer specific
social media data-feed really is
WATSON ANALYTICS FOR SOCIAL MEDIA
PREVIOUS ARCHITECTURE
THOUGHT PROCESS TOWARDS MULTITENANCY
NEW ARCHITECTURE
LESSONS LEARNED
Step 1: Revisit our analytics workload
Author
features
detection
Concept
level
sentiment
detection
Concept
detection
Everyone I ask says my PhoneA
is way better than my brother’s PhoneB
Everyone I ask says my PhoneAis way better than my brother’s PhoneB
PhoneA:
Positive
Everyone I ask says my PhoneA is way better than my brother’s PhoneB
PhoneB:
Negative
My wife’s PhoneA died yesterday.
My son wants me to get him PhoneB.
Author: Is Married = true
Author: Has Children = true
Creation
of author
profiles
Is Married = true
Has Children = true
Author:
Author: Is Married = true
Author: Has Children = true
Step 2: Assign the right processing
model for the right workload
Document-level analytics
Concept detection
Sentiment detection
Extracting author features
◉ Can easily work on a data stream
◉ User gets additional value from
every document
Collection-level analytics
Consolidation of author profiles
Influence detection
…
◉ Most algorithms need a batch of
documents (or all documents)
◉ User value is getting the “big picture”
◉ Document-level analytics makes sense on a
stream…
◉ ….document analysis for a single tenant often
does not justify stream processing…
◉ …but what if we process all documents for
all tenants in a single system?
Step 3: Turn “trickle feeds” into a
stream through multitenancy
Step 4: Address Multitenancy Requirements
◉ Minimize “switching costs” when analyzing data
for different tenants
−Separated text analytics steps into:
− tenant-specific
− language-specific
−Optimized tenant-specific steps for “low latency” switching
◉ Avoid storing tenant state in processing components
−Using Zookeeper as distributed configuration store
WATSON ANALYTICS FOR SOCIAL MEDIA
PREVIOUS ARCHITECTURE
THOUGHT PROCESS TOWARDS MULTITENANCY
NEW ARCHITECTURE
LESSONS LEARNED
ANALYSIS PIPELINE
COLLECTIONSERVICE
ANALYSIS PIPELINE
DATA FETCHING
DOCUMENT
ANALYSIS
EXPORT
AUTHORANALYSIS
DATASETCREATION
COLLECTIONSERVICE
DATA FETCHING
DOCUMENT
ANALYSIS
EXPORT
AUTHORANALYSIS
DATASETCREATION
COLLECTIONSERVICE
Stream Processing Batch Processing
DATA FETCHING
DOCUMENT
ANALYSIS
EXPORT
AUTHORANALYSIS
DATASETCREATION
COLLECTIONSERVICE
urls
raw
analyzed
Kafka
HDFS
HTTP
HDFS HDFS
Stream Processing Batch Processing
DATA FETCHING
DOCUMENT
ANALYSIS
EXPORT
AUTHORANALYSIS
DATASETCREATION
COLLECTIONSERVICE
urls
raw
analyzed
Kafka
HDFS
HTTP
HDFS HDFS
Stream Processing Batch Processing
ORCHESTRATION
DATA FETCHING
DOCUMENT
ANALYSIS
EXPORT
AUTHORANALYSIS
DATASETCREATION
COLLECTIONSERVICE
urls
raw
analyzed
Kafka
HDFS
HTTP
HDFS HDFS
Stream Processing Batch Processing
master master master
ZK ZK ZK
worker broker worker broker
worker broker worker broker
worker broker worker broker
HDFS
HDFS
HDFS
Orchestration Tier Data Processing Tier Storage Tier
worker broker worker broker
worker broker worker broker
worker broker worker broker
HDFS
HDFS
HDFS
Data Processing Tier Storage Tier
master master master
ZK ZK ZK
JOB-1
JOB-2
JOB-N
… status
errorCode
# fetched
# analyzed
…
Orchestration Tier
Orchestration Tier Storage Tier
master master master
ZK ZK ZK
HDFS
HDFS
HDFS
worker broker worker broker
Data Processing Tier
Orchestration Tier Storage Tier
master master master
ZK ZK ZK
HDFS
HDFS
HDFS
en de fr es … en de fr es …
en de fr es …
en de fr es … en de fr es …
worker broker worker broker
worker broker worker broker
worker broker worker broker
Data Processing Tier
• Expensive in
Initialization
• High memory usage
à Not appropriate for
. Caching
• Stable across tenants
• One analysis engine
• per language
• per Spark
worker
à created at
. deployment time
• Processing threads
co-use these engines
LANGUANGE
SPECIFIC ANALYSIS
en de fr es …
Orchestration Tier Storage Tier
master master master
ZK ZK ZK
HDFS
HDFS
HDFS
en de fr es … en de fr es …
en de fr es …
en de fr es … en de fr es …
worker broker worker broker
worker broker worker broker
worker broker worker broker
Data Processing Tier
• Low memory usage
compared to language
analysis engines
à LRU-cache of 100
. tenant specific
. analysis engines
. per Spark worker
• Very low tenant-
switching overhead
TENANT
SPECIFIC ANALYSIS
• Expensive in
Initialization
• High memory usage
à Not appropriate for
. Caching
• Stable across tenants
• One analysis engine
• per language
• per Spark
worker
à created at
. deployment time
• Processing threads
co-use these engines
LANGUANGE
SPECIFIC ANALYSIS
en de fr es …
READ
…
raw
READ
…
raw
ANALYZE
create view Parent as
extract regex /
(my | our)
(sons?|daughters?|kids?|baby( boy|girl)?|babies|child|children)
/
with flags 'CASE_INSENSITIVE' on D.text as match from Document D
AQL
READ
…
raw
ANALYZE
WRITEanalyzed
status = RUNNING
errorCode = NONE
# fetched = 5000
# analyzed = 4000
…
◉ More than 3000 tenants
◉ Between 150-300 analysis jobs per day
◉ Between 25K-4M documents analyzed per job
◉ 3 data centers in Toronto, Dallas, Amsterdam
◉ Cluster of 10 VMs: each with 16 Cores and 32G RAM
◉ Text-Analytics Throughput:
◉ Full Stack: from Tokenization to Sentiment / Behavior
◉ Mixed Sources: ~60K documents per minute
◉ Tweets only: 150K Tweets per minute
DEPLOYMENT
ACTIVE
REST-API
GitHub CI
Uber
JAR
Deployment
Container
UPLOAD TO HDFS
GRACEFUL SHUTDOWN
START APPS
…
offsetSettingStrategy: "keepWithNewPartitions",
segment: {
app: “fetcher”,
spark.cores.max: 10,
spark.executor.memory: "3G",
spark.streaming.stopGracefullyOnShutdown: true,
…
}
…
WATSON ANALYTICS FOR SOCIAL MEDIA
PREVIOUS ARCHITECTURE
THOUGHT PROCESS TOWARDS MULTITENANCY
NEW ARCHITECTURE
LESSONS LEARNED
◉ “Driver Only” – No Stream or Batch Processing
◉ Uses Spark’s Fail-Over mechanisms
◉ No implementation of custom master election necessary
ORCHESTRATION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
…
// update progress on zookeeper
// trigger the batchphases
KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
◉ wrong workload for Spark`s
micro-batch based stream-processing
◉ Hard to find appropriate scheduling interval
◉ builds high scheduling delays
◉ hard to distribute equally across the cluster
◉ No benefit of Spark settings like
maxRatePerPartition or backpressure.enabled
◉ sometimes leading to OOM
KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
NON-SPARK APPLICATION
◉ Run the fetcher as separate application
outside of Spark – “long running” thread
◉ Fetch documents from social media providers
◉ Write to raw documents to Kafka
◉ Let the pipeline analyze these documents
◉ backpressure.enabled = true
◉ wrong workload for Spark`s
micro-batch based stream-processing
◉ Hard to find appropriate scheduling interval
◉ builds high scheduling delays
◉ hard to distribute equally across the cluster
◉ No benefit of Spark settings like
maxRatePerPartition or backpressure.enabled
◉ sometimes leading to OOM
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
Job-ID
• Jobs don’tinfluence each other
• Run fully in parallel
• Cluster is not fully/equally utilized
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
Job-ID
“Random” + All URLsAt Once
• Jobs don’tinfluence each other
• Run fully in parallel
• Cluster is not fully/equally utilized
• Cluster is fully utilized
• Workload equally distributed
• Jobs run sequentially
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
worker broker worker broker
worker broker worker broker
JOB-1
JOB-2
Job-ID
”Random” + “Next” URLs
• Jobs don’tinfluence each other
• Run fully in parallel
• Cluster is not fully/equally utilized
• Cluster is fully utilized
• Workload equally distributed
• Jobs run sequentially
• Cluster is fully utilized
• Workload equally distributed
• Jobs run in parallel
“Random” + All URLsAt Once
◉ Grouping happens in User Memory region
◉ When a large batch of documents is processed:
◉ Spark does does not spill to disk
◉ OOM errors
à configure a small streaming batch interval or
. set maxRatePerPartition to limit the events of the batch
OOM
Write documents for each job to a separate HDFS directory.
First version: Group documents using Scala collections API
User
Memory
Region
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
◉ Group and write job documents through DataFrame APIs
◉ Uses the Execution Memory region for computations
◉ Spark can prevent OOM errors by borrowing space from the
Storage memory region or by spilling to disk if needed.
Spark
Execution
Memory
Region
No OOM
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
Write documents for each job to a separate HDFS directory.
Second version: Let Spark do the grouping and write to HDFS
ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
worker worker
worker worker
JOB-1
JOB-2
FIFO SCHEDULER
◉ First job gets available resources
while its stages have tasks to
launch
◉ Big jobs “block” smaller jobs
started later
worker worker
worker worker
JOB-1
JOB-2
◉ Tasks between jobs assigned
“round robin”
◉ Smaller jobs submitted while a
big job is running can start
receiving resources right away
FAIR SCHEDULER
Within-Application Job Scheduling
◉ Watson Analytics for Social Media allows harvesting and analyzing
social media data
◉ Becoming a real-time multi-tenant analytics pipeline required:
– Splitting analytics into tenant-specific and language-specific
– Aggregating all tenants trickle-feeds into a single stream
– Ensuring low-latency context switching between tenants
– Removing state from processing components
◉ New pipeline based on Spark, Kafka and Zookeeper
– robust
– fulfills latency and throughputrequirements
– additional analytics
SUMMARY
SPARK SUMMIT
EUROPE2016
THANK YOU.
RUBEN PULIDO
www.linkedin.com/in/ruben-pulido
@_rubenpulido
www.watsonanalytics.comFREE
TRIAL
BEHAR VELIQI
www.linkedin.com/in/beharveliqi
@beveliqi

More Related Content

What's hot

Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-PatternsDouglas Moore
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveDataWorks Summit/Hadoop Summit
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop OperationsOwen O'Malley
 
Apache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningApache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningKai Wähner
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018Roy Russo
 
Dev nexus 2017
Dev nexus 2017Dev nexus 2017
Dev nexus 2017Roy Russo
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server TalkEvan Chan
 
Cascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading
 
Tale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench ToolsTale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench ToolsSATOSHI TAGOMORI
 
GNW03: Stream Processing with Apache Kafka by Gwen Shapira
GNW03: Stream Processing with Apache Kafka by Gwen ShapiraGNW03: Stream Processing with Apache Kafka by Gwen Shapira
GNW03: Stream Processing with Apache Kafka by Gwen Shapiragluent.
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksData Con LA
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDataWorks Summit
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JJosh Patterson
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Vinoth Chandar
 
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Chris Fregly
 

What's hot (20)

Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Apache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningApache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep Learning
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018
 
Dev nexus 2017
Dev nexus 2017Dev nexus 2017
Dev nexus 2017
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
 
Cascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop World
 
Tale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench ToolsTale of ISUCON and Its Bench Tools
Tale of ISUCON and Its Bench Tools
 
GNW03: Stream Processing with Apache Kafka by Gwen Shapira
GNW03: Stream Processing with Apache Kafka by Gwen ShapiraGNW03: Stream Processing with Apache Kafka by Gwen Shapira
GNW03: Stream Processing with Apache Kafka by Gwen Shapira
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4J
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 
Hadoop to spark_v2
Hadoop to spark_v2Hadoop to spark_v2
Hadoop to spark_v2
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived
 
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
 

Viewers also liked

Rigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceRigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceCloudera, Inc.
 
Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Kai Sasaki
 
STAC Summit 2014 - Building a multitenant Big Data infrastructure
STAC Summit 2014 - Building a multitenant Big Data infrastructureSTAC Summit 2014 - Building a multitenant Big Data infrastructure
STAC Summit 2014 - Building a multitenant Big Data infrastructureGord Sissons
 
Building and Managing Scalable Applications on AWS: 1 to 500K users
Building and Managing Scalable Applications on AWS: 1 to 500K usersBuilding and Managing Scalable Applications on AWS: 1 to 500K users
Building and Managing Scalable Applications on AWS: 1 to 500K usersAmazon Web Services
 
Multi tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafkaMulti tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafkaTodd Palino
 
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...Amazon Web Services
 
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHarmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHBaseCon
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsMulti-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsDataWorks Summit
 
Strata Hadoop Hopsworks
Strata Hadoop HopsworksStrata Hadoop Hopsworks
Strata Hadoop HopsworksJim Dowling
 
Data Warehouse Optimization
Data Warehouse OptimizationData Warehouse Optimization
Data Warehouse OptimizationCloudera, Inc.
 

Viewers also liked (12)

Rigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceRigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase Performance
 
Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0
 
STAC Summit 2014 - Building a multitenant Big Data infrastructure
STAC Summit 2014 - Building a multitenant Big Data infrastructureSTAC Summit 2014 - Building a multitenant Big Data infrastructure
STAC Summit 2014 - Building a multitenant Big Data infrastructure
 
Building and Managing Scalable Applications on AWS: 1 to 500K users
Building and Managing Scalable Applications on AWS: 1 to 500K usersBuilding and Managing Scalable Applications on AWS: 1 to 500K users
Building and Managing Scalable Applications on AWS: 1 to 500K users
 
Multi tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafkaMulti tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafka
 
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
 
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHarmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsMulti-tenant, Multi-cluster and Multi-container Apache HBase Deployments
Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments
 
Managing a Multi-Tenant Data Lake
Managing a Multi-Tenant Data LakeManaging a Multi-Tenant Data Lake
Managing a Multi-Tenant Data Lake
 
Strata Hadoop Hopsworks
Strata Hadoop HopsworksStrata Hadoop Hopsworks
Strata Hadoop Hopsworks
 
Data Warehouse Optimization
Data Warehouse OptimizationData Warehouse Optimization
Data Warehouse Optimization
 

Similar to Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop to 3000 tenants in Apache Spark

Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"Pinar Alper
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLArnab Biswas
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020Timothy Spann
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017Monal Daxini
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in SparkSnappyData
 
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmGenomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmDmitri Zimine
 
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...confluent
 
Stream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesStream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesVladimír Schreiner
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...ScyllaDB
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Jason Dai
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopJim Dowling
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...Amazon Web Services
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problemsAbhishek Gupta
 

Similar to Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop to 3000 tenants in Apache Spark (20)

Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in Spark
 
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmGenomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
 
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
 
Stream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesStream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data Pipelines
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS r...
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problems
 
Data Science
Data ScienceData Science
Data Science
 

Recently uploaded

Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROmotivationalword821
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfYashikaSharma391629
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 

Recently uploaded (20)

Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTRO
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 

Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop to 3000 tenants in Apache Spark

  • 1. SPARK SUMMIT EUROPE2016 FROM SINGLE-TENANT HADOOP TO 3000 TENANTS IN APACHE SPARK RUBEN PULIDO BEHAR VELIQI IBM WATSON ANALYTICS FOR SOCIAL MEDIA IBM GERMANY R&D LAB
  • 2. WHAT IS WATSON ANALYTICS FOR SOCIAL MEDIA PREVIOUS ARCHITECTURE THOUGHT PROCESS TOWARDS MULTITENANCY NEW ARCHITECTURE LESSONS LEARNED
  • 3. WATSON ANALYTICS A data analytics solution for business users in the cloud users WATSON ANALYTICS FOR SOCIAL MEDIA ◉ Part of Watson Analytics ◉ Allows users to harvest and analyze social media content ◉ To understand how brands, products, services, social issues, … are perceived
  • 4. Cognos BI Text Analytics BigInsights DB2 Evolving Topics End to End Orchestration Content is stored in HDFS… …analyzed within our components... …stored in HDFS again... …and loaded into DB2 Our previous architecture: a single-tenant “big data” ETL batch pipeline
  • 5. Photo by GhislainBonneau,gbphotodidactical.ca What people think about Social Media data rates What Stream Processing is really good at
  • 6. Our requirement: Efficiently processing “trickle feeds” of data What the size of a customer specific social media data-feed really is
  • 7. WATSON ANALYTICS FOR SOCIAL MEDIA PREVIOUS ARCHITECTURE THOUGHT PROCESS TOWARDS MULTITENANCY NEW ARCHITECTURE LESSONS LEARNED
  • 8. Step 1: Revisit our analytics workload Author features detection Concept level sentiment detection Concept detection Everyone I ask says my PhoneA is way better than my brother’s PhoneB Everyone I ask says my PhoneAis way better than my brother’s PhoneB PhoneA: Positive Everyone I ask says my PhoneA is way better than my brother’s PhoneB PhoneB: Negative My wife’s PhoneA died yesterday. My son wants me to get him PhoneB. Author: Is Married = true Author: Has Children = true Creation of author profiles Is Married = true Has Children = true Author: Author: Is Married = true Author: Has Children = true
  • 9. Step 2: Assign the right processing model for the right workload Document-level analytics Concept detection Sentiment detection Extracting author features ◉ Can easily work on a data stream ◉ User gets additional value from every document Collection-level analytics Consolidation of author profiles Influence detection … ◉ Most algorithms need a batch of documents (or all documents) ◉ User value is getting the “big picture”
  • 10. ◉ Document-level analytics makes sense on a stream… ◉ ….document analysis for a single tenant often does not justify stream processing… ◉ …but what if we process all documents for all tenants in a single system? Step 3: Turn “trickle feeds” into a stream through multitenancy
  • 11. Step 4: Address Multitenancy Requirements ◉ Minimize “switching costs” when analyzing data for different tenants −Separated text analytics steps into: − tenant-specific − language-specific −Optimized tenant-specific steps for “low latency” switching ◉ Avoid storing tenant state in processing components −Using Zookeeper as distributed configuration store
  • 12. WATSON ANALYTICS FOR SOCIAL MEDIA PREVIOUS ARCHITECTURE THOUGHT PROCESS TOWARDS MULTITENANCY NEW ARCHITECTURE LESSONS LEARNED
  • 20. master master master ZK ZK ZK worker broker worker broker worker broker worker broker worker broker worker broker HDFS HDFS HDFS Orchestration Tier Data Processing Tier Storage Tier
  • 21. worker broker worker broker worker broker worker broker worker broker worker broker HDFS HDFS HDFS Data Processing Tier Storage Tier master master master ZK ZK ZK JOB-1 JOB-2 JOB-N … status errorCode # fetched # analyzed … Orchestration Tier
  • 22. Orchestration Tier Storage Tier master master master ZK ZK ZK HDFS HDFS HDFS worker broker worker broker Data Processing Tier
  • 23. Orchestration Tier Storage Tier master master master ZK ZK ZK HDFS HDFS HDFS en de fr es … en de fr es … en de fr es … en de fr es … en de fr es … worker broker worker broker worker broker worker broker worker broker worker broker Data Processing Tier • Expensive in Initialization • High memory usage à Not appropriate for . Caching • Stable across tenants • One analysis engine • per language • per Spark worker à created at . deployment time • Processing threads co-use these engines LANGUANGE SPECIFIC ANALYSIS en de fr es …
  • 24. Orchestration Tier Storage Tier master master master ZK ZK ZK HDFS HDFS HDFS en de fr es … en de fr es … en de fr es … en de fr es … en de fr es … worker broker worker broker worker broker worker broker worker broker worker broker Data Processing Tier • Low memory usage compared to language analysis engines à LRU-cache of 100 . tenant specific . analysis engines . per Spark worker • Very low tenant- switching overhead TENANT SPECIFIC ANALYSIS • Expensive in Initialization • High memory usage à Not appropriate for . Caching • Stable across tenants • One analysis engine • per language • per Spark worker à created at . deployment time • Processing threads co-use these engines LANGUANGE SPECIFIC ANALYSIS en de fr es …
  • 26. READ … raw ANALYZE create view Parent as extract regex / (my | our) (sons?|daughters?|kids?|baby( boy|girl)?|babies|child|children) / with flags 'CASE_INSENSITIVE' on D.text as match from Document D AQL
  • 27. READ … raw ANALYZE WRITEanalyzed status = RUNNING errorCode = NONE # fetched = 5000 # analyzed = 4000 …
  • 28. ◉ More than 3000 tenants ◉ Between 150-300 analysis jobs per day ◉ Between 25K-4M documents analyzed per job ◉ 3 data centers in Toronto, Dallas, Amsterdam ◉ Cluster of 10 VMs: each with 16 Cores and 32G RAM ◉ Text-Analytics Throughput: ◉ Full Stack: from Tokenization to Sentiment / Behavior ◉ Mixed Sources: ~60K documents per minute ◉ Tweets only: 150K Tweets per minute
  • 29. DEPLOYMENT ACTIVE REST-API GitHub CI Uber JAR Deployment Container UPLOAD TO HDFS GRACEFUL SHUTDOWN START APPS … offsetSettingStrategy: "keepWithNewPartitions", segment: { app: “fetcher”, spark.cores.max: 10, spark.executor.memory: "3G", spark.streaming.stopGracefullyOnShutdown: true, … } …
  • 30. WATSON ANALYTICS FOR SOCIAL MEDIA PREVIOUS ARCHITECTURE THOUGHT PROCESS TOWARDS MULTITENANCY NEW ARCHITECTURE LESSONS LEARNED
  • 31. ◉ “Driver Only” – No Stream or Batch Processing ◉ Uses Spark’s Fail-Over mechanisms ◉ No implementation of custom master election necessary ORCHESTRATION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS … // update progress on zookeeper // trigger the batchphases
  • 32. KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
  • 33. KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS ◉ wrong workload for Spark`s micro-batch based stream-processing ◉ Hard to find appropriate scheduling interval ◉ builds high scheduling delays ◉ hard to distribute equally across the cluster ◉ No benefit of Spark settings like maxRatePerPartition or backpressure.enabled ◉ sometimes leading to OOM
  • 34. KAFKA EVENT à HTTP REQUEST à DOCUMENT BATCH ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS NON-SPARK APPLICATION ◉ Run the fetcher as separate application outside of Spark – “long running” thread ◉ Fetch documents from social media providers ◉ Write to raw documents to Kafka ◉ Let the pipeline analyze these documents ◉ backpressure.enabled = true ◉ wrong workload for Spark`s micro-batch based stream-processing ◉ Hard to find appropriate scheduling interval ◉ builds high scheduling delays ◉ hard to distribute equally across the cluster ◉ No benefit of Spark settings like maxRatePerPartition or backpressure.enabled ◉ sometimes leading to OOM
  • 35. ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS worker broker worker broker worker broker worker broker JOB-1 JOB-2 Job-ID • Jobs don’tinfluence each other • Run fully in parallel • Cluster is not fully/equally utilized
  • 36. ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS worker broker worker broker worker broker worker broker JOB-1 JOB-2 worker broker worker broker worker broker worker broker JOB-1 JOB-2 Job-ID “Random” + All URLsAt Once • Jobs don’tinfluence each other • Run fully in parallel • Cluster is not fully/equally utilized • Cluster is fully utilized • Workload equally distributed • Jobs run sequentially
  • 37. ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS worker broker worker broker worker broker worker broker JOB-1 JOB-2 worker broker worker broker worker broker worker broker JOB-1 JOB-2 worker broker worker broker worker broker worker broker JOB-1 JOB-2 Job-ID ”Random” + “Next” URLs • Jobs don’tinfluence each other • Run fully in parallel • Cluster is not fully/equally utilized • Cluster is fully utilized • Workload equally distributed • Jobs run sequentially • Cluster is fully utilized • Workload equally distributed • Jobs run in parallel “Random” + All URLsAt Once
  • 38. ◉ Grouping happens in User Memory region ◉ When a large batch of documents is processed: ◉ Spark does does not spill to disk ◉ OOM errors à configure a small streaming batch interval or . set maxRatePerPartition to limit the events of the batch OOM Write documents for each job to a separate HDFS directory. First version: Group documents using Scala collections API User Memory Region ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS
  • 39. ◉ Group and write job documents through DataFrame APIs ◉ Uses the Execution Memory region for computations ◉ Spark can prevent OOM errors by borrowing space from the Storage memory region or by spilling to disk if needed. Spark Execution Memory Region No OOM ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS Write documents for each job to a separate HDFS directory. Second version: Let Spark do the grouping and write to HDFS
  • 40. ORCHESTRAION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS worker worker worker worker JOB-1 JOB-2 FIFO SCHEDULER ◉ First job gets available resources while its stages have tasks to launch ◉ Big jobs “block” smaller jobs started later worker worker worker worker JOB-1 JOB-2 ◉ Tasks between jobs assigned “round robin” ◉ Smaller jobs submitted while a big job is running can start receiving resources right away FAIR SCHEDULER Within-Application Job Scheduling
  • 41. ◉ Watson Analytics for Social Media allows harvesting and analyzing social media data ◉ Becoming a real-time multi-tenant analytics pipeline required: – Splitting analytics into tenant-specific and language-specific – Aggregating all tenants trickle-feeds into a single stream – Ensuring low-latency context switching between tenants – Removing state from processing components ◉ New pipeline based on Spark, Kafka and Zookeeper – robust – fulfills latency and throughputrequirements – additional analytics SUMMARY
  • 42. SPARK SUMMIT EUROPE2016 THANK YOU. RUBEN PULIDO www.linkedin.com/in/ruben-pulido @_rubenpulido www.watsonanalytics.comFREE TRIAL BEHAR VELIQI www.linkedin.com/in/beharveliqi @beveliqi