Spark Streaming-as-a-Service with Kafka and YARN
Jim Dowling
KTH Royal Institute of Technology, Stockholm
Senior Researcher, SICS
CEO, Logical Clocks AB
Spark Streaming-as-a-Service in Sweden
• SICS ICE: datacenter research environment
• Hopsworks: Spark/Flink/Kafka/TensorFlow/Hadoop-as-a-service
– Built on Hops Hadoop (www.hops.io)
– >130 active users
Hadoop is not a cool kid anymore!
Hadoop’s Evolution
2009 2016
?
Tiny Brain
(NameNode, ResourceMgr)
Huge Body (DataNodes)
Build out Hadoop’s Brain with External
Weakly Consistent MetaData Services
Google-Glass Approach to Intelligence
NameNodes
NDB
HDFS Client
DataNodes
>37x Capacity
>16x Throughput
HopsFS
Larger Brains => Bigger, Faster*
16x performance on Spotify workload
*Usenix FAST 2017, HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases
Hopsworks
• Projects
– Datasets/Files
– Topics
– Jobs/Notebooks
Hadoop
• Clusters
• Users
• Jobs/Applications
• Files
• ACLs
• Sys Admins
• Kerberos
Larger Brains => More Intelligent*
*HMGA2 gene mutations correlated with increased intracranial volume as well as enhanced IQ.
http://newsroom.ucla.edu/releases/international-team-uncovers-new-231989
User-Friendly Concepts
http://www.ibtimes.co.uk/embargoed-8pm-25th-jan-size-matters-brain-size-relative-body-size-indicates-animals-ability-1539994
YARN Spark Streaming Support
• Apache Kafka
• ELK Stack
– Real-time Logs
• Grafana/InfluxDB
– Monitoring
Hopsworks
YARN aggregates logs on job completion
http://mkuthan.github.io/blog/2016/09/30/spark-streaming-on-yarn/
Kafka Self-Service UI
Manage & Share
• Topics
• ACLs
• Avro Schemas
Logs
Elasticsearch,
Logstash,
Kibana
(ELK Stack)
Monitoring/Alerting
InfluxDB
and
Grafana
metrics.properties: StreamingMetrics.streaming.lastReceivedBatch_records == 0
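The alert condition above fires when a batch receives no records. A minimal Java sketch of such a check, assuming the per-batch values of lastReceivedBatch_records have already been read back from the metrics store as a list. The class and method names are hypothetical and not part of Spark, Grafana, InfluxDB or Hopsworks:

```java
import java.util.Arrays;
import java.util.List;

public class StalledStreamCheck {

    /**
     * Returns true when the most recent `window` samples of
     * lastReceivedBatch_records are all zero, i.e. the stream looks stalled.
     * Returns false when there are fewer than `window` samples.
     */
    public static boolean isStalled(List<Long> recordsPerBatch, int window) {
        if (window <= 0 || recordsPerBatch.size() < window) {
            return false; // not enough data to decide
        }
        int from = recordsPerBatch.size() - window;
        for (long records : recordsPerBatch.subList(from, recordsPerBatch.size())) {
            if (records != 0) {
                return false; // at least one recent batch received data
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical samples: the last three batches received nothing.
        List<Long> samples = Arrays.asList(120L, 0L, 0L, 0L);
        System.out.println("stalled: " + isStalled(samples, 3));
    }
}
```

Requiring several consecutive empty batches, rather than one, is a design choice in this sketch to avoid alerting on a single quiet interval.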
Zeppelin for Prototyping Streaming Apps
[https://github.com/knockdata/spark-highcharts]
Debugging Spark with Dr. Elephant
• Analyzes Spark jobs for errors and common problems using pluggable heuristics
• Doesn’t show killed jobs
• No online support for streaming apps yet
Integration as Microservices in Hopsworks
• Project-based Multi-tenancy
• Self-Service UI
• Simplifying Spark Streaming Apps
Projects in Hopsworks
[Diagram: projects Proj-All, Proj-X and Proj-42, with a Topic, a Shared Topic, the dataset /Projs/My/Data and CompanyDB]
User roles
Data Owner
- Import/Export data
- Manage Membership
- Share DataSets, Topics
Data Scientist
- Write and Run code
Self-Service Administration – No Administrator Needed
Notebooks, Data sharing and Quotas
• Zeppelin Notebooks in HDFS, Jobs launcher UI.
• Sharing is not Copying
– Datasets/Topics
• Per-Project quotas
– Storage in HDFS
– CPU in YARN (Uber-style Pricing)
Dynamic roles
alice@gmail.com
ProjectA
Authenticate
ProjectB
HopsFS
YARN
Kafka
SSL/TLS
Certificates
Secure
Impersonation
ProjectA__alice
ProjectB__alice
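The project-specific identities on this slide (ProjectA__alice, ProjectB__alice) combine a project name and a user name with a double underscore. A minimal Java sketch of that naming scheme, assuming the short user name ("alice") is already known and that neither part may itself contain the separator. The class and method names are hypothetical, not the Hopsworks implementation:

```java
public class ProjectUser {

    // Separator seen in the slide's examples, e.g. "ProjectA__alice".
    private static final String SEPARATOR = "__";

    /** Builds the project-specific user name for a (project, user) pair. */
    public static String projectUsername(String project, String user) {
        if (project.contains(SEPARATOR) || user.contains(SEPARATOR)) {
            // Assumption of this sketch: keep the mapping unambiguous.
            throw new IllegalArgumentException("names must not contain " + SEPARATOR);
        }
        return project + SEPARATOR + user;
    }

    /** Extracts the project part from a project-specific user name. */
    public static String projectOf(String projectUsername) {
        int idx = projectUsername.indexOf(SEPARATOR);
        if (idx < 0) {
            throw new IllegalArgumentException("not a project-specific user name");
        }
        return projectUsername.substring(0, idx);
    }

    public static void main(String[] args) {
        System.out.println(projectUsername("ProjectA", "alice"));
        System.out.println(projectOf("ProjectB__alice"));
    }
}
```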
Look Ma, no Kerberos
• Each project-specific user is issued an SSL/TLS (X.509) certificate for both authentication and encryption.
• Services are also issued SSL/TLS certificates.
– Same root CA as user certs
Simplifying Spark Streaming Apps
• Spark Streaming Applications need to know
– Credentials
• Hadoop, Kafka, InfluxDB, Logstash
– Endpoints
• Kafka Broker, Kafka SchemaRegistry, ResourceManager,
NameNode, InfluxDB, Logstash
• The HopsUtil API hides this complexity.
– Location/security transparent Spark applications
Secure Streaming App with Kafka
Developer
1. Discover: Schema Registry and Kafka/InfluxDB/ELK Endpoints
2. Create: Kafka Properties file with certs and broker details
3. Create: Producer/Consumer using Kafka Properties
4. Download: the Schema for the Topic from the Schema Registry
5. Distribute: X.509 certs to all hosts on the cluster
6. Cleanup securely
These steps are replaced by calls to the HopsUtil API
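For comparison, step 2 done by hand might look roughly like the sketch below. The property keys are the standard Kafka client settings for SSL; the broker address, file paths and password are placeholders, and the class name is hypothetical:

```java
import java.util.Properties;

public class KafkaSslProps {

    /** Builds client properties for a TLS-secured Kafka connection. */
    public static Properties build(String brokers, String keystorePath,
                                   String truststorePath, String password) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", brokers);
        props.setProperty("security.protocol", "SSL");
        // Client certificate and private key (the project user's X.509 cert).
        props.setProperty("ssl.keystore.location", keystorePath);
        props.setProperty("ssl.keystore.password", password);
        props.setProperty("ssl.key.password", password);
        // Root CA used to verify the brokers' certificates.
        props.setProperty("ssl.truststore.location", truststorePath);
        props.setProperty("ssl.truststore.password", password);
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values, for illustration only.
        Properties props = build("broker1:9092", "/tmp/keystore.jks",
                                 "/tmp/truststore.jks", "changeit");
        props.list(System.out);
    }
}
```

Every value passed to build() is something the application would otherwise have to discover or have distributed to it, which is the complexity the slide says HopsUtil hides.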
Operations
https://github.com/hopshadoop/hops-kafka-examples
Streaming Producer in Hopsworks
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
String topic = HopsUtil.getTopic(); //Optional
SparkProducer producer = HopsUtil.getSparkProducer();
Map<String, String> message = …
producer.produce(message);
Streaming Consumer in Hopsworks
JavaStreamingContext jssc =
    new JavaStreamingContext(sparkConf, Durations.seconds(2));
String topic = HopsUtil.getTopic(); //Optional
String consumerGroup = HopsUtil.getConsumerGroup(); //Optional
SparkConsumer consumer = HopsUtil.getSparkConsumer(jssc);
JavaInputDStream<ConsumerRecord<String, byte[]>> messages =
consumer.createDirectStream();
jssc.start();
Less code to write
https://github.com/hopshadoop/hops-kafka-examples
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
props.put(SCHEMA_REGISTRY_URL, restApp.restConnect);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
    org.apache.kafka.common.serialization.StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    io.confluent.kafka.serializers.KafkaAvroSerializer.class);
// Wait for the partition leader to acknowledge each write.
props.put(ProducerConfig.ACKS_CONFIG, "1");
// Hard-coded certificate location and passwords.
props.put("ssl.keystore.location", "/var/ssl/kafka.client.keystore.jks");
props.put("ssl.keystore.password", "test1234");
props.put("ssl.key.password", "test1234");
String userSchema =
    "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\"," +
    "\"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}";
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(userSchema);
GenericRecord avroRecord = new GenericData.Record(schema);
avroRecord.put("name", "testUser");
// The value is an Avro record, so the producer is typed <String, Object>.
Producer<String, Object> producer = new KafkaProducer<>(props);
ProducerRecord<String, Object> message =
    new ProducerRecord<>("topicName", avroRecord);
producer.send(message);
Lots of Hard-Coded Endpoints Here!
SparkProducer producer = HopsUtil.getSparkProducer();
Map<String, String> message = …
producer.produce(message);
Massively Simplified Code for Secure Spark Streaming/Kafka
Distributing Certs for Spark Streaming
Components: Alice@gmail.com, Hopsworks, Distributed Database, YARN Private LocalResources, Spark Streaming App, HopsUtil
1. Launch Spark Job
2. Get certs, service endpoints
3. YARN Job, config
4. Materialize certs
5. Read Certs
6. Get Schema
7. Consume/Produce
8. Read ACLs for authentication
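Materializing a certificate on a worker host and removing it again afterwards could be sketched as follows. This is a standalone illustration using only the JDK's file API: it does not use YARN's resource localization, the class and method names are hypothetical, and overwriting before deletion is a best-effort measure rather than a guarantee of secure erasure:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CertMaterializer {

    /** Writes the certificate bytes to a new file in a fresh temporary directory. */
    public static Path materialize(String fileName, byte[] certBytes) {
        try {
            Path dir = Files.createTempDirectory("certs");
            Path file = dir.resolve(fileName);
            Files.write(file, certBytes);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Overwrites the file with zeros, then deletes it and its directory. */
    public static void cleanup(Path file) {
        try {
            if (Files.exists(file)) {
                Files.write(file, new byte[(int) Files.size(file)]);
                Files.delete(file);
            }
            // The directory was created by materialize() and is empty by now.
            Files.deleteIfExists(file.getParent());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a real keystore.
        Path keystore = materialize("keystore.jks", new byte[] {1, 2, 3});
        System.out.println("materialized: " + Files.exists(keystore));
        cleanup(keystore);
        System.out.println("still present: " + Files.exists(keystore));
    }
}
```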
Multi-Tenant IoT Scenario
[Diagram: Sensor Nodes, Field Gateway, IoT Cloud Platform (Ingestion, Storage, Analysis); tenants ACME, Evil Corp and DontBeEvil Corp]
IoT Scenario
Customers: ACME, DontBeEvil Corp, Evil Corp
Clouds: AWS, Google Cloud, Oracle Cloud
User Apps control IoT Devices
IoT Company: Analyze Data, Data Services for Clients
Cloud-Native Analytics Solution
ACME
S3
[Authorization]
GCS
Oracle
IoT Company
Each customer needs its own Analytics Infrastructure
Hopsworks Solution using Projects
Spark Streaming App
IoT Company Project
Gateway Topic
ACME Project
ACME Topic
ACME Dataset
Data Stream
Analytics Reports
Hopsworks Solution
ACME
Spark Streaming App
[Authorized]
ACME Dataset
ACME Topic
ACME Analytics Reports
Spark Batch Job
ACME Project
Karamel/Chef for Automated Installation
Google Compute Engine
BareMetal
DEMO
Hops Roadmap
• HopsFS
– HA support for Multi-Data-Center
– Small files, 2-Level Erasure Coding
• HopsYARN
– Tensorflow with isolated GPUs
• Hopsworks
– P2P Dataset Sharing
– Jupyter, Presto, Hive
Summary
• Hops is a new distribution of Hadoop
– Tinker-friendly and open-source.
• Hopsworks provides first-class support for
Spark-Streaming-as-a-Service
– With support services like Kafka, ELK Stack,
Zeppelin, Grafana/InfluxDB.
Hops Team
Active:
Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Roberto
Bampi, Fabio Buso, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid,
Robin Andersso, ArunaKumari Yedurupaka, Tobias Johansson, August Bonds, Tiago Brito, Filotas
Siskos.
Alumni:
Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan
Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali
Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt,
Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler,
Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Hops
Thank You.
We totally understand it’s going to be
America First Spark Streaming first, but
can we take this chance to say
Hopsworks second!
http://www.hops.io
@hopshadoop
Hops

More Related Content

What's hot

Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
Spark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael NitschingerSpark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael NitschingerSpark Summit
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng ShiDatabricks
 
Streaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_VirenderStreaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_Virendervithakur
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...Spark Summit
 
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Spark Summit
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Spark Summit
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaSpark Summit
 
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van NiekerkAPACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van NiekerkSpark Summit
 
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...Spark Summit
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark Summit
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark OverviewairisData
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit
 
Scalable Scientific Computing with Dask
Scalable Scientific Computing with DaskScalable Scientific Computing with Dask
Scalable Scientific Computing with DaskUwe Korn
 
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimDatabricks
 

What's hot (20)

Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Spark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael NitschingerSpark Summit EU talk by Michael Nitschinger
Spark Summit EU talk by Michael Nitschinger
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 
Streaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_VirenderStreaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_Virender
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
 
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
 
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van NiekerkAPACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
 
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
 
Scalable Scientific Computing with Dask
Scalable Scientific Computing with DaskScalable Scientific Computing with Dask
Scalable Scientific Computing with Dask
 
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
 

Similar to Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by Jim Dowling

Multi-tenant Flink as-a-service with Kafka on Hopsworks
Multi-tenant Flink as-a-service with Kafka on HopsworksMulti-tenant Flink as-a-service with Kafka on Hopsworks
Multi-tenant Flink as-a-service with Kafka on HopsworksJim Dowling
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Flink Forward
 
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Evention
 
Secure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in HopsworksSecure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in HopsworksTheofilos Kakantousis
 
Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AITorsten Steinbach
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesAmazon Web Services
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaHelen Rogers
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopJim Dowling
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN Jim Dowling
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Timothy Spann
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaDatabricks
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networkspbelko82
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Precisely
 
AWS re:Invent 2016 - Scality's Open Source AWS S3 Server
AWS re:Invent 2016 - Scality's Open Source AWS S3 ServerAWS re:Invent 2016 - Scality's Open Source AWS S3 Server
AWS re:Invent 2016 - Scality's Open Source AWS S3 ServerScality
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Marcel Krcah
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkTimothy Spann
 

Similar to Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by Jim Dowling (20)

Multi-tenant Flink as-a-service with Kafka on Hopsworks
Multi-tenant Flink as-a-service with Kafka on HopsworksMulti-tenant Flink as-a-service with Kafka on Hopsworks
Multi-tenant Flink as-a-service with Kafka on Hopsworks
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
 
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
 
Secure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in HopsworksSecure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
 
Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AI
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon Elisha
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 
AWS re:Invent 2016 - Scality's Open Source AWS S3 Server
AWS re:Invent 2016 - Scality's Open Source AWS S3 ServerAWS re:Invent 2016 - Scality's Open Source AWS S3 Server
AWS re:Invent 2016 - Scality's Open Source AWS S3 Server
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by Jim Dowling

  • 1. Spark Streaming-as-a-Service with Kafka and YARN Jim Dowling KTH Royal Institute of Technology, Stockholm Senior Researcher, SICS CEO, Logical Clocks AB
  • 2. Spark Streaming-as-a-Service in Sweden • SICS ICE: datacenter research environment • Hopsworks: Spark/Flink/Kafka/Tensorflow/Hadoop-as-a-service – Built on Hops Hadoop (www.hops.io) – >130 active users
  • 3. Hadoop is not a cool kid anymore!
  • 5. Hadoop’s Evolution 2009 2016 ? Tiny Brain (NameNode, ResourceMgr) Huge Body (DataNodes)
  • 6. Build out Hadoop’s Brain with External Weakly Consistent MetaData Services Google-Glass Approach to Intelligence
  • 8. Larger Brains => Bigger, Faster* 16x Performance on Spotify Workload *Usenix FAST 2017, HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases
  • 9. Hopsworks • Projects – Datasets/Files – Topics – Jobs/Notebooks Hadoop • Clusters • Users • Jobs/Applications • Files • ACLs • Sys Admins • Kerberos Larger Brains => More Intelligent* *HMGA2 gene mutations correlated with increased intracranial volume as well as enhanced IQ. http://newsroom.ucla.edu/releases/international-team-uncovers-new-231989 User-Friendly Concepts http://www.ibtimes.co.uk/embargoed-8pm-25th-jan-size-matters-brain-size-relative-body-size-indicates-animals-ability-1539994
  • 10. YARN Spark Streaming Support • Apache Kafka • ELK Stack – Real-time Logs • Grafana/InfluxDB – Monitoring Hopsworks YARN aggregates logs on job completion http://mkuthan.github.io/blog/2016/09/30/spark-streaming-on-yarn/
  • 11. Kafka Self-Service UI Manage & Share • Topics • ACLs • Avro Schemas
  • 14. Zeppelin for Prototyping Streaming Apps [https://github.com/knockdata/spark-highcharts]
  • 15. Debugging Spark with Dr. Elephant • Analyzes Spark Jobs for errors and common problems using pluggable heuristics • Doesn’t show killed jobs • No online support for streaming apps yet
  • 16. Integration as Microservices in Hopsworks • Project-based Multi-tenancy • Self-Service UI • Simplifying Spark Streaming Apps
  • 17. Projects in Hopsworks: Proj-42, Proj-X, Proj-All, Shared Topic, Topic, /Projs/My/Data, CompanyDB
  • 18. User roles Data Owner - Import/Export data - Manage Membership - Share DataSets, Topics Data Scientist - Write and Run code Self-Service Administration – No Administrator Needed
  • 19. Notebooks, Data sharing and Quotas • Zeppelin Notebooks in HDFS, Jobs launcher UI. • Sharing is not Copying – Datasets/Topics • Per-Project quotas – Storage in HDFS – CPU in YARN (Uber-style Pricing)
  • 21. Look Ma, no Kerberos • Each project-specific user is issued an SSL/TLS (X.509) certificate for both authentication and encryption. • Services are also issued SSL/TLS certificates. – Same root CA as user certs
  • 22. Simplifying Spark Streaming Apps • Spark Streaming Applications need to know – Credentials • Hadoop, Kafka, InfluxDB, Logstash – Endpoints • Kafka Broker, Kafka SchemaRegistry, ResourceManager, NameNode, InfluxDB, Logstash • The HopsUtil API hides this complexity. – Location/security transparent Spark applications
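One way to picture the "location transparent" idea on this slide is a small lookup object that the platform fills in at launch, so application code asks for a service by name instead of hard-coding hosts. The sketch below is only an illustration under stated assumptions: the class `PlatformEndpointsSketch`, the `platform.endpoint.*` key names, and the example host are all hypothetical and are not taken from HopsUtil.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of an endpoint lookup; not the HopsUtil implementation. */
final class PlatformEndpointsSketch {

    /** The services the slide lists as endpoints an application needs to know. */
    enum Service { KAFKA_BROKER, SCHEMA_REGISTRY, RESOURCE_MANAGER, NAMENODE, INFLUXDB, LOGSTASH }

    private final Map<Service, String> endpoints;

    private PlatformEndpointsSketch(Map<Service, String> endpoints) {
        this.endpoints = endpoints;
    }

    /** Hypothetical key naming, for example "platform.endpoint.kafka_broker". */
    static String keyFor(Service service) {
        return "platform.endpoint." + service.name().toLowerCase(Locale.ROOT);
    }

    /** Builds the lookup once from configuration assumed to be injected at job launch. */
    static PlatformEndpointsSketch fromJobConfig(Map<String, String> jobConfig) {
        Map<Service, String> resolved = new EnumMap<>(Service.class);
        for (Service service : Service.values()) {
            String value = jobConfig.get(keyFor(service));
            if (value != null && !value.trim().isEmpty()) {
                resolved.put(service, value.trim());
            }
        }
        return new PlatformEndpointsSketch(resolved);
    }

    /** Returns the endpoint, failing fast if none was supplied for the service. */
    String endpoint(Service service) {
        String value = endpoints.get(service);
        if (value == null) {
            throw new IllegalStateException("No endpoint configured for " + service);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> jobConfig = new HashMap<>();
        jobConfig.put(keyFor(Service.KAFKA_BROKER), "broker-1.example.com:9092");
        PlatformEndpointsSketch endpoints = fromJobConfig(jobConfig);
        System.out.println(endpoints.endpoint(Service.KAFKA_BROKER));
    }
}
```

With a lookup like this, moving a broker only changes what the platform injects; the application code stays the same.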
  • 23. Secure Streaming App with Kafka [slide labels: Developer, Operations]
      1. Discover: Schema Registry and Kafka/InfluxDB/ELK Endpoints
      2. Create: Kafka Properties file with certs and broker details
      3. Create: Producer/Consumer using Kafka Properties
      4. Download: the Schema for the Topic from the Schema Registry
      5. Distribute: X.509 certs to all hosts on the cluster
      6. Cleanup securely
      These steps are replaced by calls to the HopsUtil API
      https://github.com/hopshadoop/hops-kafka-examples
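To make step 2 of the manual procedure concrete, here is a minimal sketch of the client properties a developer would otherwise assemble by hand for a TLS-secured Kafka connection. The property keys are standard Kafka client settings; the class name `KafkaSslPropertiesSketch`, the single shared password, and the paths and host in `main` are illustrative assumptions only.

```java
import java.util.Properties;

/** Hypothetical helper that assembles Kafka client properties for an SSL connection. */
final class KafkaSslPropertiesSketch {

    static Properties build(String brokers, String keystorePath,
                            String truststorePath, String password) {
        Properties props = new Properties();
        // Broker list and transport security
        props.setProperty("bootstrap.servers", brokers);
        props.setProperty("security.protocol", "SSL");
        // Client certificate and private key (authentication)
        props.setProperty("ssl.keystore.location", keystorePath);
        props.setProperty("ssl.keystore.password", password);
        props.setProperty("ssl.key.password", password);
        // Trusted CA certificates (used to verify the brokers)
        props.setProperty("ssl.truststore.location", truststorePath);
        props.setProperty("ssl.truststore.password", password);
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("broker-1.example.com:9092",
                "/tmp/example/keystore.jks", "/tmp/example/truststore.jks", "changeit");
        // Print the non-secret settings only
        props.stringPropertyNames().stream()
                .sorted()
                .filter(key -> !key.endsWith("password"))
                .forEach(key -> System.out.println(key + "=" + props.getProperty(key)));
    }
}
```

Every value here is something the application has to discover and keep in sync with the cluster, which is the burden the slide says the HopsUtil API removes.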
  • 24. Streaming Producer in HopsWorks
      JavaSparkContext jsc = new JavaSparkContext(sparkConf);
      String topic = HopsUtil.getTopic(); // Optional
      SparkProducer producer = HopsUtil.getSparkProducer();
      Map<String, String> message = …
      producer.produce(message);
  • 25. Streaming Consumer in HopsWorks
      JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
      String topic = HopsUtil.getTopic(); // Optional
      String consumerGroup = HopsUtil.getConsumerGroup(); // Optional
      SparkConsumer consumer = HopsUtil.getSparkConsumer(jssc);
      JavaInputDStream<ConsumerRecord<String, byte[]>> messages = consumer.createDirectStream();
      jssc.start();
  • 26. Less code to write https://github.com/hopshadoop/hops-kafka-examples
      Properties props = new Properties();
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
      props.put(SCHEMA_REGISTRY_URL, restApp.restConnect);
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          org.apache.kafka.common.serialization.StringSerializer.class);
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          io.confluent.kafka.serializers.KafkaAvroSerializer.class);
      props.put("producer.type", "sync");
      props.put("serializer.class", "kafka.serializer.StringEncoder");
      props.put("request.required.acks", "1");
      props.put("ssl.keystore.location", "/var/ssl/kafka.client.keystore.jks");
      props.put("ssl.keystore.password", "test1234");
      props.put("ssl.key.password", "test1234");
      String userSchema = "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\","
          + "\"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}";
      Schema.Parser parser = new Schema.Parser();
      Schema schema = parser.parse(userSchema);
      GenericRecord avroRecord = new GenericData.Record(schema);
      avroRecord.put("name", "testUser");
      Producer<String, Object> producer = new KafkaProducer<>(props);
      ProducerRecord<String, Object> message = new ProducerRecord<>("topicName", avroRecord);
      producer.send(message);
      Lots of Hard-Coded Endpoints Here!
      SparkProducer producer = HopsUtil.getSparkProducer();
      Map<String, String> message = …
      producer.produce(message);
      Massively Simplified Code for Secure Spark Streaming/Kafka
  • 27. Distributing Certs for Spark Streaming
      Components: Alice@gmail.com, Hopsworks, Distributed Database, YARN Private LocalResources, Spark Streaming App, HopsUtil
      1. Launch Spark Job
      2. Get certs, service endpoints
      3. YARN Job, config
      4. Materialize certs
      5. Read Certs
      6. Get Schema
      7. Consume, Produce
      8. Read ACLs for authentication
  • 28. Multi-Tenant IoT Scenario Sensor Node (×6) Field Gateway Storage Analysis Ingestion ACME Evil Corp IoT Cloud Platform DontBeEvil Corp
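The "materialize certs" and "read certs" steps amount to writing credential bytes into a private location that only the job's user can read, and removing them afterwards. The following is a rough sketch of that idea using only the JDK, assuming a POSIX file system; the class `CertMaterializerSketch` and its methods are hypothetical, it writes to the default temp directory rather than a YARN-managed one, and plain deletion is a simplification of whatever secure cleanup the platform performs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileAttribute;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

/** Hypothetical sketch: write credential bytes to an owner-only file, then remove it. */
final class CertMaterializerSketch {

    /** Creates a temp file readable and writable by the owner only, and writes the bytes. */
    static Path materialize(byte[] keystoreBytes) throws IOException {
        Set<PosixFilePermission> ownerOnly = PosixFilePermissions.fromString("rw-------");
        FileAttribute<Set<PosixFilePermission>> attr =
                PosixFilePermissions.asFileAttribute(ownerOnly);
        Path file = Files.createTempFile("keystore-", ".jks", attr);
        Files.write(file, keystoreBytes);
        return file;
    }

    /** Removes the materialized file; this is a plain delete, not a secure wipe. */
    static void cleanup(Path file) throws IOException {
        Files.deleteIfExists(file);
    }

    public static void main(String[] args) throws IOException {
        Path file = materialize(new byte[] {1, 2, 3});
        try {
            System.out.println(Files.getPosixFilePermissions(file));
        } finally {
            cleanup(file);
        }
    }
}
```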
  • 29. IoT Scenario ACME DontBeEvil Corp Evil-Corp AWS Google Cloud Oracle Cloud User Apps control IoT Devices IoT Company: Analyze Data, Data Services for Clients ACME DontBeEvil Corp Evil Corp
  • 30. Cloud-Native Analytics Solution ACME S3 [Authorization] GCS Oracle IoT Company Each customer needs its own Analytics Infrastructure Spark Streaming App
  • 31. Hopsworks Solution using Projects IoT Company Project Gateway Topic ACME Project ACME Topic ACME Dataset Data Stream Analytics Reports
  • 32. Hopsworks Solution ACME Spark Streaming App [Authorized] ACME Dataset ACME Topic ACME Analytics Reports Spark Batch Job ACME Project
  • 33. Karamel/Chef for Automated Installation Google Compute Engine BareMetal
  • 34. DEMO
  • 35. Hops Roadmap • HopsFS – HA support for Multi-Data-Center – Small files, 2-Level Erasure Coding • HopsYARN – Tensorflow with isolated GPUs • Hopsworks – P2P Dataset Sharing – Jupyter, Presto, Hive
  • 36. Summary • Hops is a new distribution of Hadoop – Tinker-friendly and open-source. • Hopsworks provides first-class support for Spark-Streaming-as-a-Service – With support services like Kafka, ELK Stack, Zeppelin, Grafana/InfluxDB.
  • 37. Hops Team Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Roberto Bampi, Fabio Buso, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, Robin Andersso, ArunaKumari Yedurupaka, Tobias Johansson, August Bonds, Tiago Brito, Filotas Siskos. Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
  • 38. Thank You. We totally understand it’s going to be America First Spark Streaming first, but can we take this chance to say Hopsworks second! http://www.hops.io @hopshadoop