Couchbase Data Pipeline
Stream your Operational Data w/ Apache Spark &
Kafka into Hadoop using Couchbase
©2016 Couchbase Inc. 2
Agenda
• Couchbase Overview
• Couchbase Data Pipeline
• Kafka
– Demo - https://github.com/couchbase/couchbase-kafka-connector
• Spark
– Demo - https://github.com/justinmichaels006/CouchbaseSpark
• Q&A

Couchbase Overview
Couchbase is the company behind Couchbase Server & Couchbase Mobile
• Open source JSON database
• Founded 2010
• 400+ enterprise customers globally
Some of our customers:
Couchbase Server can be deployed as:
Document database, Key-value store, Distributed cache

Couchbase Overview
• The first NoSQL database that enables you to develop with agility and operate at any scale.
[Diagram labels: Managed Cache; Key-Value Store; Document Database; Embedded Database; Sync Management]

Couchbase Architecture
• Data Service – builds and maintains distributed secondary indexes (MapReduce Views)
• Indexing Engine – builds and maintains Global Secondary Indexes
• Query Engine – plans, coordinates, and executes queries against either Global or Distributed indexes
• Cluster Manager – configuration, heartbeat, statistics, RESTful management interface

Storing And Retrieving Documents

Online Linear Scalability
[Diagram: five nodes (Couchbase Server 1 to Couchbase Server 5), each holding ACTIVE and REPLICA shards; shards numbered 1 to 9 are spread across the nodes; operations labelled READ/WRITE/UPDATE]

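The shard layout in the diagram can be sketched as follows. This is a minimal illustration, not Couchbase's actual vBucket map: the shard count (9, matching the numbered shards in the figure), the CRC32-based key hashing, and the round-robin placement of active and replica copies are all assumptions made for the example.

```python
import zlib

NUM_SHARDS = 9  # assumption: mirrors the nine numbered shards in the figure


def shard_for_key(key):
    """Map a document key to a shard number in 1..NUM_SHARDS (illustrative hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS + 1


def build_shard_map(servers):
    """Place each shard's active copy on one server and its replica on the next."""
    shard_map = {}
    for shard in range(1, NUM_SHARDS + 1):
        active = servers[(shard - 1) % len(servers)]
        replica = servers[shard % len(servers)]
        shard_map[shard] = {"active": active, "replica": replica}
    return shard_map


def route(key, shard_map):
    """Reads, writes and updates go to the server holding the active copy."""
    return shard_map[shard_for_key(key)]["active"]


# The key-to-shard mapping stays fixed; only the shard-to-server map changes
# when the cluster grows from three to five nodes.
three_nodes = build_shard_map(["server1", "server2", "server3"])
five_nodes = build_shard_map(["server1", "server2", "server3", "server4", "server5"])
print(route("user::1001", three_nodes))
print(route("user::1001", five_nodes))
```

In this model, adding servers only requires rebuilding the shard-to-server map; clients keep hashing keys to the same shard numbers.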
Cross Datacenter Replication (XDCR)
Unidirectional or Bidirectional Replication
Unidirectional
• Hot spare / Disaster Recovery
• Development/Testing copies
• Connectors (Solr, Elasticsearch)
• Integrate to custom consumer
Bidirectional
• Multiple Active Masters
• Disaster Recovery
• Datacenter Locality
Couchbase Data Pipeline
DCP
• Database Change Protocol
– Internal standard for handling changes since Couchbase Server 3.x
– Clients: Intra-Cluster Replication, Indexing, XDCR
• Mutation
– Event raised when a document is created, updated, or deleted
– Each mutation that occurs in a vBucket has a sequence number
• Core of the 2.x Java SDK
– Can consume DCP streams
– Important: API not yet exposed, but used to implement the Connectors provided by Couchbase
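To make the mutation and sequence-number idea concrete, here is a small self-contained model of a change feed. It is not the DCP wire protocol or the Java SDK API (which the slide notes is not yet exposed); the names `Mutation`, `VBucketLog`, and `stream_since` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class Mutation:
    """One change event: a creation, update, or delete of a document."""
    vbucket: int
    seqno: int               # sequence number, increasing within one vBucket
    key: str
    value: Optional[dict]    # None models a delete


class VBucketLog:
    """Hypothetical in-memory change log for a single vBucket."""

    def __init__(self, vbucket: int) -> None:
        self.vbucket = vbucket
        self._mutations: List[Mutation] = []

    def apply(self, key: str, value: Optional[dict]) -> Mutation:
        """Record a mutation and assign it the next sequence number."""
        mutation = Mutation(self.vbucket, len(self._mutations) + 1, key, value)
        self._mutations.append(mutation)
        return mutation

    def stream_since(self, last_seqno: int) -> Iterator[Mutation]:
        """Yield every mutation with a sequence number above last_seqno."""
        for mutation in self._mutations:
            if mutation.seqno > last_seqno:
                yield mutation


log = VBucketLog(vbucket=7)
log.apply("user::1", {"name": "Ada"})       # creation
log.apply("user::1", {"name": "Ada L."})    # update
log.apply("user::1", None)                  # delete

# A consumer that already processed sequence number 1 resumes from there.
resumed = [m.seqno for m in log.stream_since(last_seqno=1)]
print(resumed)
```

In this model a consumer only has to remember the last sequence number it processed per vBucket in order to pick up where it left off.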

Couchbase Data Pipeline
• Couchbase is primarily an online operational NoSQL datastore: low latency, scalable
• Data Source
– Example: Pulling user profiles into Hadoop for deep analytics
• Data Sink
– Example: Training machine learning models that are then cached / served from Couchbase
[Diagram labels: NoSQL; Spark; Hadoop; Web; Mobile; IoT; Analytics; Discovery; Prediction]

Couchbase Data Pipeline
                     | Couchbase                 | Spark                                | Hadoop (Hive)
Use cases            | Operational; Web / Mobile | Analytics; Machine Learning          | Analytics; Machine Learning
Processing mode      | Online; Ad Hoc            | Ad Hoc; Batch; Streaming (+/-)       | Batch; Ad Hoc (+/-)
Low latency =        | < 1ms ops                 | Seconds                              | Minutes
Performance          | Highly predictable        | Variable                             | Variable
Users are typically… | Millions of customers     | 100’s of analysts or data scientists | 100’s of analysts or data scientists
                     | Memory-centric            | Memory-centric                       | Disk-centric
Big data =           | 10s of Terabytes          | Petabytes (?)                        | Petabytes
Grouping: OPERATIONAL (Couchbase) | ANALYTICAL (Spark, Hadoop)

Couchbase Data Pipeline
[Diagram labels: Operational Velocity; Analytical Volume; Analytical Velocity]

Couchbase Data Pipeline
[Diagram: Batch Layer, Serving Layer, Speed Layer. Labels: New Data Stream; All Data; Precompute Views (Map Reduce); Batch Recompute; Batch Views; Process Stream; Incremental Views; Real-Time Data; Real-Time Increment; Real-Time Views; Partial Aggregate (x3); Merge; Merged View. Annotation: Couchbase Hadoop Connector (Sqoop)]

Couchbase Data Pipeline
[Diagram: the same Batch Layer / Serving Layer / Speed Layer figure as on the previous slide, annotated with: Couchbase Hadoop Connector (Sqoop); Stream / Data Ingestion; Store; Incremental Data / Stream processing; Serving merged results / responses]

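The merge step between the batch, speed, and serving layers can be sketched in a few lines. This is an illustrative model only: the page-view events, the function names, and the choice of summing counters are assumptions for the example, not something shown on the slides.

```python
from collections import Counter


def batch_view(all_events):
    """Batch layer: recompute the view from scratch over all stored data."""
    return Counter(event["page"] for event in all_events)


def realtime_increment(view, event):
    """Speed layer: fold one new event into the incremental view."""
    view[event["page"]] += 1
    return view


def merged_view(batch, realtime):
    """Serving layer: merge the batch view with the real-time view."""
    return batch + realtime


# Data that the batch layer has already stored and processed.
stored = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
batch = batch_view(stored)

# Events that arrived after the last batch recompute.
realtime = Counter()
for new_event in [{"page": "/home"}, {"page": "/checkout"}]:
    realtime_increment(realtime, new_event)

print(dict(merged_view(batch, realtime)))
```

Once the next batch recompute has absorbed the new events, the real-time view for that period can be discarded and rebuilt from later events.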
Couchbase Connectors
[Diagram labels: up to 10^10 application users; NoSQL Database; 10^1 - 10^2 data scientists / engineers; Kafka; Hadoop; Spark; Elasticsearch; Storm; DCP; XDCR; Sqoop]
PayPal Use Case
[Diagram labels: Complex Event Processing; Real Time Repository; Perpetual Store; Analytical DB; Business Intelligence; Monitoring; Chat/Voice System; Batch Track; Real-Time Track; Dashboard]
© 2015 PayPal Inc. All rights reserved. Confidential and proprietary.
http://developer.couchbase.com/documentation/server/4.1/connectors/kafka-2.0/kafka-intro.html
https://github.com/paypal/couchbasekafka
[Diagram labels: DCP Stream; Couchbase Kafka Adapter {DCP Client + Kafka Producer}; Camus, MR Jobs; numbered steps [1] to [7]]
Kafka Connector

Kafka
• Publish-Subscribe System
– A model that describes how publishers distribute information to multiple subscribers, which must sign up to receive that data
• Message Queue System
– Messages are put (stored) in a queue until the recipient can retrieve them
• Specifics
– Commit log based
– Distributed & partitioned
– Failover mechanism
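As a rough illustration of "commit log based" and "distributed & partitioned", the sketch below models a topic as a set of append-only logs. It is a toy model, not the Kafka client API: the `Topic` class, its methods, and the CRC32-based key partitioning are assumptions made for the example.

```python
import zlib


class Topic:
    """Toy model of a partitioned, append-only commit log."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a message; messages with the same key share a partition."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append((key, value))
        offset = len(log) - 1  # position of the message within its partition
        return partition, offset

    def read(self, partition, offset):
        """Read from a given offset; reading never removes messages."""
        return self.partitions[partition][offset:]


topic = Topic("profile-changes", num_partitions=3)
p1, o1 = topic.append("user::1", "created")
p2, o2 = topic.append("user::1", "updated")

# Both messages for the same key sit in one partition, in append order.
print(p1 == p2, o1, o2)
print(topic.read(p1, 0))
```

Because reads are by offset and do not delete anything, several independent readers can consume the same partition at their own pace.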

Kafka
• ZooKeeper
– Coordination between Kafka Brokers
– Stores coordination data, status information, configurations
• Broker
– One or more services that process messages
– Can store messages
– Failover: Leader vs. Follower
• Topic
– Distributed/partitioned message queue
• Consumer
– Applications/processes/threads that are subscribed to the topic
– Can be grouped in order to process messages in parallel
• Producer
– Publishes data/messages to the topic
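The note that consumers "can be grouped in order to process messages in parallel" can be pictured as dividing a topic's partitions among the members of a group. The round-robin assignment below is a simplified assumption for illustration; Kafka's real assignment and rebalancing logic is more involved, and the function name is hypothetical.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Six partitions shared by a group of three consumers.
print(assign_partitions(6, ["c1", "c2", "c3"]))

# With one consumer fewer, the same partitions are spread over the remaining members.
print(assign_partitions(6, ["c1", "c2"]))
```

Each partition has exactly one owner within the group, so members work on disjoint parts of the topic.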

Kafka
• Data broker w/ publish / subscribe system
• Decouples producers of data from consumers
• Massively scalable
• Messages queued until the recipient can retrieve them
[Diagram labels: Producer (x4); Kafka; Consumer (x3); HDFS. Note: Couchbase can be consumer and producer]

Couchbase/Kafka Use Cases
• Polyglot Processing
– Activities
– Events
– Monitoring Metrics
– Sensor Data
• Typical Use Cases
– Messaging: Decouple data processing from data producers
– Log Aggregation: A log as a stream of messages
– Stream Processing: Consume data from one topic and put the filtered/transformed data into another one
– Click Stream Analysis: Page views/searches as real-time publish-subscribe feeds
– Real-time Data Integration: Extract from Couchbase, transform, and load data in real time

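The stream-processing use case (consume from one topic, write filtered or transformed data to another) can be sketched with plain Python lists standing in for topics. The event shape and the helper names are hypothetical; a real pipeline would use Kafka producer and consumer clients instead.

```python
def is_document_change(event):
    """Filter: keep only document changes, drop everything else."""
    return event["type"] in {"mutation", "deletion"}


def to_summary(event):
    """Transform: reduce an event to the fields downstream consumers need."""
    return {"key": event["key"], "deleted": event["type"] == "deletion"}


def process(source_topic, sink_topic):
    """Consume from one topic and publish the filtered, transformed result."""
    for event in source_topic:
        if is_document_change(event):
            sink_topic.append(to_summary(event))
    return sink_topic


source = [
    {"type": "mutation", "key": "user::1"},
    {"type": "snapshot", "key": ""},
    {"type": "deletion", "key": "user::2"},
]
sink = process(source, [])
print(sink)
```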
Couchbase Kafka Connector
Available Now: 2.0 GA
• Kafka Producer or Consumer
• Stream events
• Filters
• Transform events
• Sample Producer & Consumer
Code: https://github.com/couchbase/couchbase-kafka-connector/
Monthly Releases
• Kafka Connect (Apache Kafka 0.9)
• Merge code for Storm connector
• ???
Issues: https://issues.couchbase.com/projects/KAFKAC
Docs: http://developer.couchbase.com/documentation/server/4.1/connectors/kafka-1.2/kafka-intro.html

Learn More - Couchbase Kafka Connector
Confluent’s Ewen Cheslack-Postava at Couchbase Connect 2015
• https://youtu.be/fFPVwYKUTHs
Couchbase and Kafka - Up and Running in 10 Minutes
• http://blog.couchbase.com/2015/november/kafka-and-couchbase-up-and-running-in-10-minutes
Product docs
• http://developer.couchbase.com/documentation/server/4.1/connectors/kafka-1.2/kafka-intro.html
Avalon Consulting blog and GitHub repo
• http://blogs.avalonconsult.com/blog/big-data/purchase-transaction-alerting-with-couchbase-and-kafka/
• https://github.com/Avalon-Consulting-LLC/couchbase-kafka
Couchbase Spark Connector

Spark
[Diagram labels: DCP; KV; N1QL; Views]
Source: Databricks https://www.brighttalk.com/webcast/12891/196891

Spark
• Fast and general engine for big data processing with libraries for advanced analytics
Spark Core:
• task scheduling
• memory management
• fault recovery
• interacting with storage systems

Spark
• Resilient Distributed Dataset: Core Spark data abstraction
– Distributed collection of elements
– Read-only (immutable)
– Fault tolerant: Can recover from loss of a partition
    • Re-computed, not stored
• Operations
– Transformation: Lazy operation creating new RDD
    • e.g. map(), filter()
– Action: Return a result or save it
    • e.g. take(), saveAsTextFile()

Spark
Create: Read a log file
sc.textFile("server.log")
Filter: Keep lines starting with "ERROR"
Filter: Keep lines containing "system5"
Action: Write result to storage
• RDD is created by either:
– Loading an external dataset or RDD
• RDD is transformed:
– Result: A new RDD
– Sequence of transformations
• Data is eventually extracted
– By an action on an RDD

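The distinction between lazy transformations and actions can be shown without a cluster. The class below is a hypothetical, single-machine stand-in for an RDD (it is not the Spark API) that replays the log example from this slide; the sample log lines are invented for illustration.

```python
class MiniRDD:
    """Hypothetical stand-in for an RDD: an immutable, lazily evaluated collection."""

    def __init__(self, source, steps=()):
        self._source = source      # callable returning an iterable of elements
        self._steps = steps        # recorded transformations (the lineage)

    def filter(self, predicate):
        """Transformation: returns a new MiniRDD, nothing is computed yet."""
        return MiniRDD(self._source, self._steps + (("filter", predicate),))

    def map(self, func):
        """Transformation: returns a new MiniRDD, nothing is computed yet."""
        return MiniRDD(self._source, self._steps + (("map", func),))

    def collect(self):
        """Action: runs the recorded transformations and returns the result."""
        items = iter(self._source())
        for kind, func in self._steps:
            items = filter(func, items) if kind == "filter" else map(func, items)
        return list(items)


reads = []  # records each time the source is actually read


def read_log():
    reads.append("read")
    return [
        "ERROR system5 disk full",
        "INFO system5 started",
        "ERROR system2 timeout",
    ]


lines = MiniRDD(read_log)
errors = lines.filter(lambda line: line.startswith("ERROR"))
system5_errors = errors.filter(lambda line: "system5" in line)

print(len(reads))                 # the source has not been read yet
print(system5_errors.collect())   # the action triggers the computation
```

Calling `collect()` a second time reads the source again, which mirrors the "re-computed, not stored" behaviour described for RDDs.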
DataFrames (SparkSQL)
• Distributed collection of data organized in named columns
• DataFrame = “RDD + Schema”
• Perform SQL queries on top of your data
From https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

Datasets
• Typesafe programming on top of DataFrames
• Initial support in Spark 1.6, likely to be extended in 2.0
• Higher performance & less memory usage
• Encoding/Decoding for semi-structured data
https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html

Couchbase Spark Connector
[Diagram labels: data scientists & data engineers; application users; DCP]
Features
• Create RDDs from KV, N1QL, Views
• Create DStreams from DCP feeds
• Persist RDDs and DStreams
• Support for DataFrames and SparkSQL

Couchbase Spark Use Case: Data Analysis
• Query across data in many systems using one language & runtime
– Separate Couchbase clusters for workload isolation
– Results streamed back as needed to support applications
[Diagram labels: Operational Data Store; XDCR; RDBMS; S3; HDFS; Elastic]

Couchbase Spark Use Case: Machine Learning
• Data scientists train machine learning models
– Load results into Couchbase so end users can interact with them online
– Recommendations for content and products, fraud detection, or spam filtering
[Diagram labels: DCP; Hadoop; Machine Learning Models; Data Warehouse; Historical Data]

Couchbase Spark Connector 1.2
Upcoming Release (planned): 1.2
• Spark 1.6 Support (including Datasets)
• Bugfixes
• Enhanced Java APIs
• Updated DCP Functionality
https://github.com/couchbaselabs/couchbase-spark-connector
https://issues.couchbase.com/projects/SPARKC

Learn More - Couchbase Spark Connector
GitHub Repo
• https://github.com/couchbaselabs/couchbase-spark-connector
Spark with Couchbase to Electrify your Data Processing
• https://youtu.be/sBnAf7gAfLc
Docs
• http://developer.couchbase.com/documentation/server/current/connectors/spark-1.0/spark-intro.html
Avalon Consulting blog and GitHub repo (Market Basket Analysis)
• http://blogs.avalonconsult.com/blog/big-data/combining-operational-and-analytical-big-data-using-couchbase-and-spark-a-market-basket-analysis-example/
• https://github.com/Avalon-Consulting-LLC/couchbase-spark-mba
Q&A
justin@couchbase.com
Dev Guide

Anatomy of a Spark Application
[Diagram labels: Driver Program; SparkContext; Cluster Manager; Spark Worker; Executor; Task; Couchbase Data Node (x2); Couchbase (x3)]

Connection Management
Creating RDDs
Persisting RDDs
RDD N1QL Query
Spark SQL - Schema
Spark SQL - DataFrame Query
Demo of Dataset (Spark 1.6)
Spark Streaming with DCP

More Related Content

What's hot

Introduction to Kafka
Introduction to KafkaIntroduction to Kafka
Introduction to KafkaAkash Vacher
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedInGuozhang Wang
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloJoe Stein
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Exampleconfluent
 
Kafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internalsKafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internalsAyyappadas Ravindran (Appu)
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
 
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020confluent
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Gwen (Chen) Shapira
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
A Journey through the JDKs (Java 9 to Java 11)
A Journey through the JDKs (Java 9 to Java 11)A Journey through the JDKs (Java 9 to Java 11)
A Journey through the JDKs (Java 9 to Java 11)Markus GĂĽnther
 
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache KafkaI Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache KafkaJay Kreps
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedInDiscover Pinterest
 
PaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at YelpPaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at YelpNathan Handler
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf
 
Make 2016 your year of SMACK talk
Make 2016 your year of SMACK talkMake 2016 your year of SMACK talk
Make 2016 your year of SMACK talkDataStax Academy
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesJen Aman
 
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...JAXLondon2014
 

What's hot (20)

Introduction to Kafka
Introduction to KafkaIntroduction to Kafka
Introduction to Kafka
 
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
 
Kafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internalsKafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internals
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
 
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
A Journey through the JDKs (Java 9 to Java 11)
A Journey through the JDKs (Java 9 to Java 11)A Journey through the JDKs (Java 9 to Java 11)
A Journey through the JDKs (Java 9 to Java 11)
 
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache KafkaI Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache Kafka
 
kafka
kafkakafka
kafka
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
PaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at YelpPaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at Yelp
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7
 
Make 2016 your year of SMACK talk
Make 2016 your year of SMACK talkMake 2016 your year of SMACK talk
Make 2016 your year of SMACK talk
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
 

Viewers also liked

Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...Data Con LA
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...Data Con LA
 
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Data Con LA
 
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...Data Con LA
 
Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Big Data Day LA 2016 Keynote - Reynold Xin/ DatabricksBig Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Big Data Day LA 2016 Keynote - Reynold Xin/ DatabricksData Con LA
 
Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...
Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...
Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...Data Con LA
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Data Con LA
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
C* Summit 2013: Big Data Analytics – Realize the Investment from Your Big Dat...
C* Summit 2013: Big Data Analytics – Realize the Investment from Your Big Dat...C* Summit 2013: Big Data Analytics – Realize the Investment from Your Big Dat...
C* Summit 2013: Big Data Analytics – Realize the Investment from Your Big Dat...DataStax Academy
 
Hybrid architecture integrateduserviewdata-peyman_mohajerian
Hybrid architecture integrateduserviewdata-peyman_mohajerianHybrid architecture integrateduserviewdata-peyman_mohajerian
Hybrid architecture integrateduserviewdata-peyman_mohajerianData Con LA
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad ranaData Con LA
 
VoltDB Big Data Camp LA 2014 - Scott Jar
VoltDB  Big Data Camp LA 2014 - Scott JarVoltDB  Big Data Camp LA 2014 - Scott Jar
VoltDB Big Data Camp LA 2014 - Scott JarData Con LA
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaData Con LA
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Data Con LA
 
Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...
Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...
Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...Data Con LA
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
Auto driving car
Auto driving carAuto driving car
Auto driving carsalehin riad
 
Big Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate University
Big Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate UniversityBig Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate University
Big Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate UniversityData Con LA
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 

Viewers also liked (20)

Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
 
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
 
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
Big Data Day LA 2016/ Big Data Track - Rapid Analytics @ Netflix LA (Updated ...
 
Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Big Data Day LA 2016 Keynote - Reynold Xin/ DatabricksBig Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks
 
Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...
Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...
Big Data Day LA 2016/ Data Science Track - The Right Tool for the Job: Guidel...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
C* Summit 2013: Big Data Analytics – Realize the Investment from Your Big Dat...
C* Summit 2013: Big Data Analytics – Realize the Investment from Your Big Dat...C* Summit 2013: Big Data Analytics – Realize the Investment from Your Big Dat...
C* Summit 2013: Big Data Analytics – Realize the Investment from Your Big Dat...
 
Hybrid architecture integrateduserviewdata-peyman_mohajerian
Hybrid architecture integrateduserviewdata-peyman_mohajerianHybrid architecture integrateduserviewdata-peyman_mohajerian
Hybrid architecture integrateduserviewdata-peyman_mohajerian
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
VoltDB Big Data Camp LA 2014 - Scott Jar
VoltDB  Big Data Camp LA 2014 - Scott JarVoltDB  Big Data Camp LA 2014 - Scott Jar
VoltDB Big Data Camp LA 2014 - Scott Jar
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
 
Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...
Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...
Big Data Day LA 2016/ Big Data Track - Real Time Analytics with Druid - Guill...
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Auto driving car
Auto driving carAuto driving car
Auto driving car
 
Big Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate University
Big Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate UniversityBig Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate University
Big Data Day LA 2016 Keynote - Tom Horan/ Claremont Graduate University
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 

Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couchbase

  • 1. Couchbase Data Pipeline: Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couchbase
  • 2. Agenda (©2016 Couchbase Inc.)
      - Couchbase Overview
      - Couchbase Data Pipeline
      - Kafka
          - Demo: https://github.com/couchbase/couchbase-kafka-connector
      - Spark
          - Demo: https://github.com/justinmichaels006/CouchbaseSpark
      - Q&A
  • 3. Couchbase Overview
      Couchbase is the company behind Couchbase Server & Couchbase Mobile
      - Open source JSON database
      - Founded 2010
      - 400+ enterprise customers globally
      Some of our customers: (customer logos)
      Couchbase Server can be deployed as:
      - Document database
      - Key-value store
      - Distributed cache
  • 4. Couchbase Overview
      The first NoSQL database that enables you to develop with agility and operate at any scale.
      - Managed Cache
      - Key-Value Store
      - Document Database
      - Embedded Database
      - Sync Management
  • 5. Couchbase Architecture
      - Data Service: builds and maintains distributed secondary indexes (MapReduce Views)
      - Indexing Engine: builds and maintains Global Secondary Indexes
      - Query Engine: plans, coordinates, and executes queries against either Global or Distributed indexes
      - Cluster Manager: configuration, heartbeat, statistics, RESTful management interface
  • 6. Storing And Retrieving Documents
  • 7. Online Linear Scalability
      [Diagram: Couchbase Servers 1 to 5, each holding ACTIVE and REPLICA shards (shards 1 to 9 spread across the servers); client traffic labelled READ/WRITE/UPDATE]
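The shard layout in the diagram above can be approximated with a small sketch. It is a toy model only: the round-robin placement, the server names, and the helper functions `place_shards` and `active_counts` are assumptions for illustration and are not Couchbase's actual placement or rebalance algorithm. It shows the two properties the diagram suggests: a shard's replica lives on a different server than its active copy, and adding servers lowers the number of active shards per server.

```python
def place_shards(num_shards, servers):
    """Assign each shard an active and a replica server (round-robin, hypothetical)."""
    if len(servers) < 2:
        raise ValueError("need at least two servers to keep a replica elsewhere")
    layout = {}
    for shard in range(1, num_shards + 1):
        active = servers[(shard - 1) % len(servers)]
        replica = servers[shard % len(servers)]  # always the next server, so never the active one
        layout[shard] = {"active": active, "replica": replica}
    return layout


def active_counts(layout):
    """Count how many active shards each server holds."""
    counts = {}
    for placement in layout.values():
        counts[placement["active"]] = counts.get(placement["active"], 0) + 1
    return counts


# Nine shards on three servers, then on five servers after scaling out.
three = place_shards(9, ["server1", "server2", "server3"])
five = place_shards(9, ["server1", "server2", "server3", "server4", "server5"])
```

With three servers each one holds three active shards; with five servers no server holds more than two.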
  • 8. Cross Datacenter Replication (XDCR): Unidirectional or Bidirectional Replication
      - Unidirectional
          - Hot spare / Disaster Recovery
          - Development/Testing copies
          - Connectors (Solr, Elasticsearch)
          - Integrate to custom consumer
      - Bidirectional
          - Multiple Active Masters
          - Disaster Recovery
          - Datacenter Locality
  • 10. DCP
      - Database Change Protocol
          - Internal standard for handling changes since Couchbase Server 3.x
          - Clients: Intra-Cluster Replication, Indexing, XDCR
      - Mutation
          - An event raised when a document is created, updated, or deleted
          - Each mutation that occurs in a vBucket has a sequence number
      - Core of the 2.x Java SDK
          - Can consume DCP streams
          - Important: the API is not yet exposed, but it is used to implement the connectors provided by Couchbase
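The idea of per-vBucket sequence numbers can be sketched as follows. This is a toy in-memory model, not the DCP wire protocol or any Couchbase SDK API: the names `Mutation`, `ToyChangeFeed`, `record`, and `stream` are hypothetical. It shows why a sequence number per vBucket is useful: a consumer that remembers the last sequence number it saw can resume a stream from exactly that point.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    vbucket: int
    seqno: int
    key: str
    kind: str  # "create", "update", or "delete"


class ToyChangeFeed:
    """Keeps an ordered mutation log per vBucket (illustrative only)."""

    def __init__(self):
        self._log = defaultdict(list)  # vbucket id -> ordered list of mutations

    def record(self, vbucket, key, kind):
        # Sequence numbers start at 1 and increase independently in each vBucket.
        seqno = len(self._log[vbucket]) + 1
        mutation = Mutation(vbucket, seqno, key, kind)
        self._log[vbucket].append(mutation)
        return mutation

    def stream(self, vbucket, after_seqno=0):
        """Yield the mutations of one vBucket whose seqno is greater than after_seqno."""
        for mutation in self._log[vbucket]:
            if mutation.seqno > after_seqno:
                yield mutation


feed = ToyChangeFeed()
feed.record(0, "user::1", "create")
feed.record(0, "user::1", "update")
feed.record(1, "user::2", "create")
feed.record(0, "user::1", "delete")

# A consumer that already processed seqno 1 of vBucket 0 resumes from there.
resumed = list(feed.stream(0, after_seqno=1))
```

Here `resumed` holds the update and the delete of `user::1`, with sequence numbers 2 and 3.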
  • 11. Couchbase Data Pipeline
      - Couchbase is primarily an online operational NoSQL datastore: low latency, scalable
      - Data Source
          - Example: pulling user profiles into Hadoop for deep analytics
      - Data Sink
          - Example: training machine learning models that are then cached / served from Couchbase
      [Diagram labels: Web, Mobile, IoT; NoSQL, Spark, Hadoop; Analytics, Discovery, Prediction]
  • 12. Couchbase Data Pipeline
      Comparison of Couchbase (OPERATIONAL) with Spark and Hadoop (Hive) (ANALYTICAL):
      - Use cases: Couchbase: Operational, Web / Mobile; Spark: Analytics, Machine Learning; Hadoop (Hive): Analytics, Machine Learning
      - Processing mode: Couchbase: Online, Ad Hoc; Spark: Ad Hoc, Batch, Streaming (+/-); Hadoop (Hive): Batch, Ad Hoc (+/-)
      - Low latency =: Couchbase: < 1 ms ops; Spark: Seconds; Hadoop (Hive): Minutes
      - Performance: Couchbase: Highly predictable; Spark: Variable; Hadoop (Hive): Variable
      - Users are typically: Couchbase: Millions of customers; Spark: 100s of analysts or data scientists; Hadoop (Hive): 100s of analysts or data scientists
      - Couchbase: Memory-centric; Spark: Memory-centric; Hadoop (Hive): Disk-centric
      - Big data =: Couchbase: 10s of Terabytes; Spark: Petabytes (?); Hadoop (Hive): Petabytes
  • 13. Couchbase Data Pipeline
      [Diagram labels: Operational Velocity, Analytical Velocity, Analytical Volume]
  • 14. Couchbase Data Pipeline
      [Diagram: a New Data Stream feeds three layers]
      - Batch Layer: All Data, Precompute Views (Map Reduce), Batch Recompute, Batch Views
      - Speed Layer: Real-Time Data, Process Stream, Incremental Views, Real-Time Increment, Real-Time Views
      - Serving Layer: Partial Aggregates, Merge, Merged View
      - Couchbase Hadoop Connector (Sqoop)
  • 15. Couchbase Data Pipeline
      [Diagram repeated from slide 14 (Batch Layer, Speed Layer, Serving Layer; Couchbase Hadoop Connector (Sqoop)), with added annotations]
      - Stream / Data Ingestion
      - Store Incremental Data / Stream processing
      - Serving merged results / responses
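The batch, speed, and serving layers named on the slide above can be illustrated with a minimal sketch. The page-view counting task, the event shape, and the names `batch_view`, `RealTimeView`, and `merged_view` are assumptions chosen for illustration; the slides do not prescribe an implementation. The sketch assumes the speed layer only sees events that arrived after the last batch recompute, so the two views can simply be added.

```python
from collections import Counter


def batch_view(all_events):
    """Batch layer: recompute page-view counts from the full data set."""
    return Counter(event["page"] for event in all_events)


class RealTimeView:
    """Speed layer: counts updated incrementally as new events arrive."""

    def __init__(self):
        self.counts = Counter()

    def increment(self, event):
        self.counts[event["page"]] += 1


def merged_view(batch, realtime):
    """Serving layer: merge the batch view with the real-time view per key."""
    return batch + realtime.counts


# Events already covered by the last batch recompute.
historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
batch = batch_view(historical)

# Events that arrived since then and are only known to the speed layer.
speed = RealTimeView()
speed.increment({"page": "/home"})
speed.increment({"page": "/checkout"})

merged = merged_view(batch, speed)
```

After the next batch recompute the real-time view would be reset, because its events are then part of the batch view.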
  • 16. ©2016 Couchbase Inc. 16 Couchbase Connectors up to 10^10 application users; 10^1 - 10^2 data scientists / engineers; NoSQL Database; Kafka, Hadoop, Spark, Elasticsearch, Storm, Sqoop; DCP, XDCR
  • 18. © 2015 PayPal Inc. All rights reserved. Confidential and proprietary. PayPal Use Case http://developer.couchbase.com/documentation/server/4.1/connectors/kafka-2.0/kafka-intro.html https://github.com/paypal/couchbasekafka 18 Camus, MR Jobs DCP Stream Couchbase Kafka Adapter {DCP Client + Kafka Producer} [1] [2] [3] [4] [5] [6] [7]
  • 20. ©2016 Couchbase Inc. 20 Kafka  Publish-Subscribe System – Model in which Publishers distribute information to multiple Subscribers, which must sign up to receive that data  Message Queue System – Messages are stored in a queue until the recipient can retrieve them  Specifics – Commit-log based – Distributed & partitioned – Failover mechanism
  • 21. ©2016 Couchbase Inc. 21 Kafka  ZooKeeper – Coordination between Kafka Brokers – Stores coordination data, status information, configurations  Broker – One or more services that process messages – Can store messages – Failover: Leader vs. Follower  Topic – Distributed/partitioned message queue  Consumer – Applications/processes/threads that are subscribed to the topic – Can be grouped in order to process messages in parallel  Producer – Publishes data/messages to the topic
  • 22. ©2016 Couchbase Inc. 22 Kafka  Data broker w/ publish / subscribe system  Decouples producers of data from consumers  Massively scalable  Messages queued until the recipient can retrieve them Consumer HDFS Kafka Consumer Producer Consumer Producer Producer Producer Couchbase can be both consumer and producer
  • 23. ©2016 Couchbase Inc. 23 Couchbase/Kafka Use Cases  Polyglot Processing – Activities – Events – Monitoring Metrics – Sensor Data  Typical Use Cases – Messaging: Decouple data processing from data producers – Log Aggregation: A log as a stream of messages – Stream Processing: Consume data from one topic and put the filtered/transformed data into another one – Click Stream Analysis: Page views/searches as real-time publish-subscribe feeds – Real-time Data Integration: Extract from Couchbase, transform, and load data in real time
  • 24. ©2016 Couchbase Inc. 24 Couchbase Kafka Connector Available Now: 2.0 GA  Kafka Producer or Consumer  Stream events  Filters  Transform events  Sample Producer & Consumer Code: https://github.com/couchbase/couchbase-kafka-connector/ Monthly Releases  Kafka Connect (Apache Kafka 0.9)  Merge code for Storm connector  ??? Issues: https://issues.couchbase.com/projects/KAFKAC Docs: http://developer.couchbase.com/documentation/server/4.1/connectors/kafka-1.2/kafka-intro.html
  • 25. ©2016 Couchbase Inc. 25 Learn More - Couchbase Kafka Connector Confluent’s Ewen Cheslack-Postava at Couchbase Connect 2015  https://youtu.be/fFPVwYKUTHs Couchbase and Kafka - Up and Running in 10 Minutes  http://blog.couchbase.com/2015/november/kafka-and-couchbase-up-and-running-in-10-minutes Product docs  http://developer.couchbase.com/documentation/server/4.1/connectors/kafka-1.2/kafka-intro.html Avalon Consulting blog and Github repo  http://blogs.avalonconsult.com/blog/big-data/purchase-transaction-alerting-with-couchbase-and-kafka/  https://github.com/Avalon-Consulting-LLC/couchbase-kafka
  • 27. ©2016 Couchbase Inc. 27 Spark DCP KV N1QL Views Source: Databricks https://www.brighttalk.com/webcast/12891/196891
  • 28. ©2016 Couchbase Inc. 28 Spark  Fast and general engine for big data processing with libraries for advanced analytics Spark Core: • task scheduling • memory management • fault recovery • interacting with storage systems
  • 29. ©2016 Couchbase Inc. 29 Spark  Resilient Distributed Dataset: Core Spark data abstraction – Distributed collection of elements – Read-only (immutable) – Fault tolerant: Can recover from loss of a partition • Re-computed, not stored  Operations – Transformation: Lazy operation creating a new RDD • e.g. map(), filter() – Action: Return a result or save it • e.g. take(), save()
  • 30. ©2016 Couchbase Inc. 30 Spark Create: Read a log file sc.textFile("server.log") Filter: Keep lines starting with "ERROR" Filter: Keep lines containing "system5" Action: Write result to storage  RDD is created by either: – Loading an external dataset or RDD  RDD is transformed: – Result: A new RDD – Sequence of transformations  Data is eventually extracted – By an action on an RDD
  • 31. ©2016 Couchbase Inc. 31 DataFrames (SparkSQL) • Distributed collection of data organized in named columns  DataFrame = “RDD + Schema” • Perform SQL Queries on top of your data From https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
  • 32. ©2016 Couchbase Inc. 32 Datasets • Typesafe programming on top of DataFrames • Initial support in Spark 1.6, likely to be extended in 2.0 • Higher performance & less memory usage • Encoding/Decoding for semi-structured data https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
  • 33. ©2016 Couchbase Inc. 34 Couchbase Spark Connector data scientists & data engineers application users DCP Features • Create RDDs from KV, N1QL, Views • Create DStreams from DCP feeds • Persist RDDs and DStreams • Support for DataFrames and SparkSQL
  • 34. ©2016 Couchbase Inc. 35 Couchbase Spark Use Case: Data Analysis  Query across data in many systems using one language & runtime – Separate Couchbase clusters for workload isolation – Results streamed back as needed to support applications Operational Data Store XDCR RDBMS S3 HDFS Elastic
  • 35. ©2016 Couchbase Inc. 36 Couchbase Spark Use Case: Machine Learning  Data scientists train machine learning models – Load results into Couchbase so end users can interact with them online – Recommendations for content and products, fraud detection or spam filter DCP Hadoop Machine Learning Models Data Warehouse Historical Data
  • 36. ©2016 Couchbase Inc. 37 Couchbase Spark Connector 1.2 Upcoming Release (planned): 1.2 • Spark 1.6 Support (including Datasets) • Bugfixes • Enhanced Java APIs • Updated DCP Functionality github.com/couchbaselabs/couchbase-spark-connector https://issues.couchbase.com/projects/SPARKC
  • 37. ©2016 Couchbase Inc. 38 Learn More - Couchbase Spark Connector Github Repo  https://github.com/couchbaselabs/couchbase-spark-connector Spark with Couchbase to Electrify your Data Processing  https://youtu.be/sBnAf7gAfLc Docs  http://developer.couchbase.com/documentation/server/current/connectors/spark-1.0/spark-intro.html Avalon Consulting blog and Github repo (Market Basket Analysis)  http://blogs.avalonconsult.com/blog/big-data/combining-operational-and-analytical-big-data-using-couchbase-and-spark-a-market-basket-analysis-example/  https://github.com/Avalon-Consulting-LLC/couchbase-spark-mba
  • 40. ©2016 Couchbase Inc. 41 Couchbase Data Node Data Node Spark Worker Anatomy of a Spark Application Driver Program SparkContext Cluster Manager Couchbase Couchbase Executor Task
  • 41. ©2016 Couchbase Inc. 42 Connection Management
  • 42. ©2016 Couchbase Inc. 43 Connection Management
  • 43. ©2016 Couchbase Inc. 44 Creating RDDs
  • 44. ©2016 Couchbase Inc. 45 Persisting RDDs
  • 45. ©2016 Couchbase Inc. 46 RDD N1QL Query
  • 46. ©2016 Couchbase Inc. 47 Spark SQL - Schema
  • 47. ©2016 Couchbase Inc. 48 Spark SQL – Dataframe Query
  • 48. ©2016 Couchbase Inc. 49 Demo of Dataset (Spark 1.6)
  • 49. ©2016 Couchbase Inc. 50 Demo of Dataset (Spark 1.6)
  • 50. ©2016 Couchbase Inc. 51 Demo of Dataset (Spark 1.6)
  • 51. ©2016 Couchbase Inc. 52 Spark Streaming with DCP
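
The RDD pipeline on slide 30 (create from a log file, filter twice, then run an action) can be sketched without a Spark cluster. The block below is a plain-Python stand-in and not the Spark API: ToyRDD, read_log, and the sample log lines are hypothetical names and data invented for illustration. It only aims to show the two properties the slides name: transformations are lazy, and results are re-computed from the source rather than stored.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable and lazily evaluated."""

    def __init__(self, source: Callable[[], Iterable[str]]):
        # Keep a recipe for producing the data, not the data itself.
        self._source = source

    def filter(self, predicate: Callable[[str], bool]) -> "ToyRDD":
        # Transformation: returns a new ToyRDD and does no work yet.
        parent = self._source
        return ToyRDD(lambda: (line for line in parent() if predicate(line)))

    def collect(self) -> List[str]:
        # Action: runs the whole chain of transformations from the source.
        return list(self._source())


# Invented sample data standing in for the contents of "server.log".
LOG_LINES = [
    "ERROR system5 disk full",
    "INFO system5 started",
    "ERROR system2 timeout",
    "ERROR system5 replica lost",
]

evaluated: List[str] = []  # records every line actually read from the source


def read_log() -> Iterable[str]:
    for line in LOG_LINES:
        evaluated.append(line)
        yield line


lines = ToyRDD(read_log)                                    # Create
errors = lines.filter(lambda l: l.startswith("ERROR"))      # Filter
system5_errors = errors.filter(lambda l: "system5" in l)    # Filter

read_before_action = len(evaluated)  # still 0: nothing has been read yet
result = system5_errors.collect()    # Action: triggers the computation
```

Calling collect() a second time reads the source again, which mirrors the "re-computed, not stored" note on slide 29. In real Spark the equivalent chain would start from sc.textFile("server.log") as shown on the slide.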

Editor's Notes

  1. KEY POINT: COUCHBASE HAS YOU COVERED FOR YOUR GENERAL PURPOSE DB NEEDS. FROM CACHING TO KV STORE, TO JSON DOCUMENT STORE, TO MOBILE APPS. NO OTHER NOSQL DB VENDOR HAS THIS BREADTH AND DEPTH OF TECHNOLOGY The purpose of this slide is to discuss the high level concepts of Couchbase, and if the SE wants to discuss what parts of Couchbase make up each concept. It is not to go over specific technologies like N1QL, ODBC, etc
  2. KEY POINT: Frame the conversation
  3. KEY POINT: GROUND THE USER IN HOW OBJECTS RELATE TO BUCKETS AND HOW THOSE ARE SPREAD ACROSS THE CLUSTER, AS WELL AS HOW THINGS IN THE CLUSTER ARE STACKED PHYSICALLY. Talk to the audience about how documents move in the application at a high level, and about the relation between data buckets and how they are spread evenly across the cluster. Remember that vBuckets are not in this diagram, but that is on purpose. That comes later and might confuse people at this point. Going over this slide now sets you up for the vBucket discussion later in the presentation. Convey what a bucket is: a logical key space with its own set of server resources, queues, etc. Make sure to stress that one can have multiple buckets, but should not create them as freely as one would tables or schemas in a relational database. An example of when to split data into different buckets is using Views across different object types. For example, if you have JSON documents and base64-encoded XML documents in the same bucket, a view or index will have to process the XML documents even though it will never need them. So it would be better to put the XML into another bucket so the views and indexes only look at the JSON data they will actually be indexing.
  4. Application has a single logical connection to the cluster (client object). Multiple nodes can be added or removed at once. One-click operation. Incremental movement of active and replica vBuckets and data. Client library updated via cluster map. Fully online operation, no downtime or loss of performance.
  5. Use cases are totally different: Spark is an execution engine, not a database. A prime use case for Hadoop is as a low-cost data warehouse, which is not a good use case for Couchbase or Spark. Latency: everyone says real time, but what do they mean? For an operational system, this means extremely fast (in-memory) reads and extremely fast (log append) writes. For Couchbase, that means completing millions of ops / second (these are gets / sets) at latencies of under 1 ms; compare the LinkedIn figures from Jerry Franz’s session, tuned to LinkedIn’s specific workload: 75% writes (sets + incr) / 25% reads; 13-byte values and 25-byte keys on average; 2.5 billion items (+ 1 replica); 600 Gbytes of RAM / 3 Tbytes of disk in use on average; average store latency ~ 0.4 milliseconds; 99th percentile store latency ~ 2.5 milliseconds; average get latency ~ 0.8 milliseconds; 99th percentile get latency ~ 8 milliseconds. In general, Spark is just better at Hadoop’s core use cases than Hadoop (note, I’m not talking about HDFS). Spark is much better for highly iterative algorithms and interactive queries, which is important given that the majority of jobs work on 100 GB or less of data (small big data). Spark's scale is smaller than that of MapReduce-based solutions on Hadoop, but that’s OK: “[T]he majority of real-world analytics jobs process less than 100GB of input, but popular infrastructures such as Hadoop/MapReduce were originally designed for petascale processing.” http://www.msr-waypoint.com/pubs/204499/a20-appuswamy.pdf This is especially convenient for people with a development background who like to run "stuff" (ad hoc queries) on data in Hadoop/HDFS. This removes the need to know about the underlying Hadoop layer; they can just think of it as data.
  6. - Improve the animation for data handling – refresh existing sales deck and deep dives.
  7. Using Couchbase as the high-performance, low-latency, scalable data store to support personalized interactions. Couchbase, as a real-time operational database, may be generating real-time data to feed into both the batch and real-time layers. In some use cases, Couchbase is used to perform real-time processing and analytics (M/R views). Some customers are using Couchbase as the data store for stream processing. Email from Michael on streaming data from CB to Spark via Kafka: "Well, Kafka is one way, but it also uses DCP under the covers. I actually need to make some changes to the DCP implementation in the Java client, and my plan is to have DCP support in dp2 (a month later or so). So once we are GA, there will be a 100% way to stream data directly into a DStream (Spark Streaming). And of course you can easily implement simple polling of, let’s say, a view, and grab the full docs that match, for example, a time interval."
  8. Work people do in these systems: training ML models, ETL / data wrangling, aggregations, reporting / BI. Kafka is a data multiplexer: some people are still going to want to do this, but it’s designed for higher-latency applications with a known high complexity (e.g. eBay, with many different consumers for information). Traditional data warehouse: definitely will be a different programming language. How do you make sense of the data feed? You get into the problem that making changes on one side introduces tons of complexity on the other. Downsides: maturity is not 100% on the Spark side, still in active development on the Couchbase side. KV / N1QL
  9. The data generated by users is published to Apache Kafka. Next, it’s pulled into Apache Storm for real time analysis and processing as well as into Hadoop. Finally, Storm writes the data to Couchbase Server for real-time access by LivePerson agents while the data in Hadoop is eventually accessed via HP Vertica and MicroStrategy for offline business intelligence and analysis.
  10. High-throughput distributed messaging system. Massively scalable. Simple, elegant: basically a giant distributed ordered commit log. Like a manifold for your data processing: decouples the data producers from the data consumers and allows for asynchronicity. Automatic recovery from failures. Use cases: offline & online processing of activities, events, monitoring, and sensor data. Couchbase can be either a producer or a consumer in Kafka terms. Couchbase as the Master Database: react to changes happening in the bucket by updating data somewhere else. Triggers/Event Handling: handle events like deletions/expirations externally (e.g. expiration & replicated session tokens). Real-time Data Integration: extract from Couchbase, transform, and load data in real time. Real-time Data Processing: extract from a bucket, process in real time, and load to another bucket.
  11. Fully transparent cluster and bucket management, including direct access if needed
  12. Fully transparent cluster and bucket management, including direct access if needed
  13. Spark Core: Automatic Cluster and Resource Management; Creating and Persisting RDDs; Java, Scala APIs. Spark SQL: Easy JSON handling and querying; Tight N1QL Integration. Spark Streaming: Persisting DStreams; DCP source (experimental).
  14. 29
  15. 30
  16. 33
  17. Analyze other
  18. Analyze other
  19. Fully transparent cluster and bucket management, including direct access if needed
  20. Fully transparent cluster and bucket management, including direct access if needed
  21. Spark Application: a user program built on Spark. It includes the driver program and the executors on the cluster. Application jar: a jar containing the user's application along with its dependencies. Driver program: the process running the main() function of the application and creating the SparkContext. Executor: a process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
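
Note 10 describes Kafka as "basically a giant distributed ordered commit log" that decouples producers from consumers. A minimal in-memory sketch of that idea follows. It is not the Kafka client API: ToyTopic, ToyConsumer, and the document keys are hypothetical, and hashing keys with CRC32 is an assumption made for illustration rather than Kafka's actual partitioner.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (key, value)


class ToyTopic:
    """Hypothetical model of a partitioned, append-only topic."""

    def __init__(self, num_partitions: int):
        self.partitions: List[List[Message]] = [[] for _ in range(num_partitions)]

    def publish(self, key: str, value: str) -> Tuple[int, int]:
        # The same key always maps to the same partition (assumed CRC32 hash),
        # so events for one key keep their relative order.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int) -> List[Message]:
        # Reading never removes messages; the log is only appended to.
        return self.partitions[partition][offset:]


class ToyConsumer:
    """Hypothetical consumer that tracks its own offset per partition."""

    def __init__(self, topic: ToyTopic):
        self.topic = topic
        self.offsets: Dict[int, int] = defaultdict(int)

    def poll(self) -> List[Message]:
        messages: List[Message] = []
        for partition in range(len(self.topic.partitions)):
            batch = self.topic.read(partition, self.offsets[partition])
            self.offsets[partition] += len(batch)
            messages.extend(batch)
        return messages


topic = ToyTopic(num_partitions=2)
early = ToyConsumer(topic)

# A producer (for example, a feed of document changes) publishes two events.
topic.publish("user::1", "created")
topic.publish("user::2", "created")
first_batch = early.poll()

topic.publish("user::1", "updated")
second_batch = early.poll()  # only the event published since the last poll

# A consumer that joins later still sees the full history.
late = ToyConsumer(topic)
late_batch = late.poll()
user1_events = [value for key, value in late_batch if key == "user::1"]
```

Because each consumer keeps its own offsets, the producer never needs to know who is reading or how far along they are, which is the decoupling the note refers to.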