963

Real-Time Big Data with Apache Kafka,
Spark Streaming, Scala, and Elasticsearch
By
S Annu Ahmed(122N1A0573)
V Indu Priyanka(122N1A0532)
S Ravindra(122N1A0572)
M Imran Basha(122N1A0556)
P B Sravanthi(122N1A0558)
B Baby Likhitha(122N1A0514)

Contents:
• Overview of project
• What is Big Data?
• Hadoop
• Apache Kafka
• Scala
• Spark Streaming
• Elasticsearch

What is Data?
“A set of values that may be qualitative or quantitative in
nature”
What is Big Data?
“Data so large and voluminous that it overwhelms the
existing data storage and processing infrastructure is said to
be big enough to be called big data”
What is Real-Time Big Data?
“Hadoop is engineered for big data analytics, but it's
not real time. NoSQL is engineered for real-time big data,
but it's operational rather than analytical. NoSQL together
with Hadoop is the key to real-time big data”

Parameters to use big data:
• Huge amount of data
• Complex data that includes a lot of unstructured data
• Speed at which data is generated

What We Need?
• Fault tolerance
• Failure detection
• Fast: low latency, distributed, data locality
• Masterless, decentralized cluster membership
• Data centers
• Partition awareness
• Elasticity
• Parallelism

Hadoop
Apache Hadoop is an open-source framework for distributed storage
and processing of large data sets on commodity hardware.
[Figure: Hadoop ecosystem]

Let’s recall basic concepts of
Messaging System

Point to Point Messaging
(Queue)

Publish-Subscribe Messaging
(Topic)
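
The difference between the two messaging models can be sketched in a few lines. This is a minimal in-memory illustration, not how any real message broker is implemented; the class names `PointToPointQueue` and `PublishSubscribeTopic` are hypothetical and exist only for this example.

```python
from collections import deque


class PointToPointQueue:
    """Queue model: each message is delivered to exactly one receiver."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Whoever asks first gets the message; it is then gone from the queue.
        return self._messages.popleft() if self._messages else None


class PublishSubscribeTopic:
    """Topic model: each message is delivered to every subscriber."""

    def __init__(self):
        self._inboxes = {}

    def subscribe(self, name):
        self._inboxes[name] = []
        return self._inboxes[name]

    def publish(self, message):
        # Every subscriber receives its own copy of the message.
        for inbox in self._inboxes.values():
            inbox.append(message)


queue = PointToPointQueue()
queue.send("m1")
queue.send("m2")
worker_a = queue.receive()  # one worker takes the first message
worker_b = queue.receive()  # another worker takes the second

topic = PublishSubscribeTopic()
inbox_a = topic.subscribe("a")
inbox_b = topic.subscribe("b")
topic.publish("m1")
topic.publish("m2")
```

With the queue, the two workers split the messages between them; with the topic, both inboxes end up holding both messages.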

Apache Kafka: Overview
• An Apache project initially developed at LinkedIn
• Distributed publish-subscribe messaging system
• Designed for processing real-time activity stream data, e.g.
logs and metrics collection
• Written in Scala
Features
• Persistent messaging
• High throughput
• Supports both queue and topic semantics
• Uses ZooKeeper to form a cluster of nodes
(producer/consumer/broker), and many more…

Real-time transfer
[Figure: a Producer, Kafka Brokers, and ZooKeeper, with Consumer1 and Consumer2 in Group1 and Consumer3 and Consumer4 in Group2. Diagram labels: Queue Topology, Topic Topology, Update Consumed Message Offset.]
The broker does not push messages to consumers; each consumer polls messages from the broker.
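
The pull model and the consumed-offset bookkeeping can be sketched as follows. This is a simplified single-partition simulation written for this deck, not the Kafka client API: the `Broker` class, its `poll` method, and the per-group offset table are assumptions made for illustration. Consumers in the same group share one offset, so they divide the messages (queue semantics), while a second group has its own offset and sees every message (topic semantics).

```python
class Broker:
    """Toy broker: an append-only log plus one read offset per consumer group."""

    def __init__(self):
        self.log = []      # messages are kept after being read (persistent log)
        self.offsets = {}  # group name -> index of the next message to read

    def append(self, message):
        self.log.append(message)

    def poll(self, group, max_messages=1):
        # The consumer asks for messages; the broker never pushes them.
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_messages]
        # Record how far this group has consumed.
        self.offsets[group] = start + len(batch)
        return batch


broker = Broker()
for i in range(4):
    broker.append(f"event-{i}")

group1_consumer1 = broker.poll("Group1", 2)  # first half of the log
group1_consumer2 = broker.poll("Group1", 2)  # continues where consumer1 stopped
group2_consumer3 = broker.poll("Group2", 4)  # a new group starts from offset 0
```

Because the log is not deleted on read, a group that falls behind can simply poll again later from its stored offset.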

About Apache Spark
• Initially started at UC Berkeley in 2009
• Fast, general-purpose cluster computing system
• Claimed to be up to 10x faster on disk and up to 100x faster in memory than Hadoop MapReduce
• Popular for running iterative machine learning algorithms
• Provides high-level APIs in
  • Java
  • Scala
  • Python
• Integrates with Hadoop and its ecosystem and can read existing Hadoop data

So Why Spark?
[Figures: Hadoop execution flow; Spark execution flow]
• Most machine learning algorithms are iterative, because each iteration
can improve the results
• With a disk-based approach, each iteration’s output is written to disk, which
makes it slow
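
The cost of the disk-based approach can be illustrated with a toy iterative computation. The sketch below is not Hadoop or Spark code; it uses Newton's method for a square root as a stand-in for an iterative algorithm, and the function names and the JSON state file are assumptions made for this example. Both versions compute the same answer, but the disk-based one pays a read and a write on every iteration.

```python
import json
import os
import tempfile


def improve(estimate, target=2.0):
    """One refinement step (Newton's method for the square root of target)."""
    return 0.5 * (estimate + target / estimate)


def run_on_disk(iterations, workdir):
    """Persist the intermediate result to disk after every iteration."""
    path = os.path.join(workdir, "state.json")
    with open(path, "w") as f:
        json.dump({"estimate": 1.0}, f)
    disk_round_trips = 0
    for _ in range(iterations):
        with open(path) as f:
            estimate = json.load(f)["estimate"]
        with open(path, "w") as f:
            json.dump({"estimate": improve(estimate)}, f)
        disk_round_trips += 1
    with open(path) as f:
        return json.load(f)["estimate"], disk_round_trips


def run_in_memory(iterations):
    """Keep the intermediate result in memory between iterations."""
    estimate = 1.0
    for _ in range(iterations):
        estimate = improve(estimate)
    return estimate


with tempfile.TemporaryDirectory() as workdir:
    disk_result, round_trips = run_on_disk(5, workdir)
memory_result = run_in_memory(5)
```

On a real cluster the intermediate data is far larger than a single number, so the per-iteration disk round trip dominates the running time.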

Resilient Distributed Dataset (RDD)
• Fault tolerant: can be recomputed from any point of failure
• Created through transformations (map, filter, ...) on data or on other
RDDs
• Immutable
• Partitioned
  • You can indicate which RDD elements to partition across machines
  based on a key in each record
• Can be reused
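
The properties listed above can be sketched with a small single-machine stand-in. `MiniRDD` is a hypothetical class written for this deck and is not the Spark API: it only shows that transformations return new objects instead of modifying the original, that each object remembers its parent so the result can be recomputed, and that records can be placed in partitions by key (integer keys are assumed here).

```python
class MiniRDD:
    """Toy dataset that records its lineage instead of storing results."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # base data (only set on the root)
        self._parent = parent        # the MiniRDD this one was derived from
        self._transform = transform  # how to derive rows from the parent

    def map(self, fn):
        # Returns a new MiniRDD; the current one is left unchanged.
        return MiniRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, predicate):
        return MiniRDD(parent=self,
                       transform=lambda rows: [r for r in rows if predicate(r)])

    def collect(self):
        # Nothing is cached: every call recomputes from the lineage,
        # which is what makes recovery after a failure possible.
        if self._parent is None:
            return list(self._source)
        return self._transform(self._parent.collect())

    def partition_by_key(self, num_partitions):
        # Assumes (key, value) records with integer keys.
        partitions = [[] for _ in range(num_partitions)]
        for key, value in self.collect():
            partitions[key % num_partitions].append((key, value))
        return partitions


base = MiniRDD(source=[(1, "a"), (2, "b"), (3, "c"), (4, "d")])
evens = base.filter(lambda kv: kv[0] % 2 == 0)
upper = evens.map(lambda kv: (kv[0], kv[1].upper()))

first = upper.collect()
again = upper.collect()  # recomputed from base through filter and map
partitions = base.partition_by_key(2)
```

Here `base` is reused for two different derived datasets, and it still holds its original records afterwards.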

Spark Streaming
Makes it easy to build scalable, fault-tolerant
streaming applications.
• Ease of use
• Fault tolerance
• Combines streaming with batch and interactive
queries
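
Spark Streaming processes a live stream as a sequence of small batches, which is why the same code style works for both streaming and batch jobs. The sketch below imitates that idea in plain Python; it is not the Spark Streaming API, and the helper names, the one-second batch interval, and the sample events are assumptions made for illustration.

```python
from collections import Counter


def micro_batches(events, batch_interval):
    """Group (timestamp, line) events into consecutive fixed-width time windows."""
    batches = {}
    for timestamp, line in events:
        window = int(timestamp // batch_interval)
        batches.setdefault(window, []).append(line)
    # Windows with no events are simply absent in this toy version.
    return [batches[window] for window in sorted(batches)]


def word_count(lines):
    """Ordinary batch logic, applied unchanged to each small batch."""
    return Counter(word for line in lines for word in line.split())


# Hypothetical stream: (arrival time in seconds, text line)
events = [
    (0.2, "kafka spark"),
    (0.9, "spark"),
    (1.4, "scala spark"),
    (2.1, "kafka"),
]

per_batch = [word_count(batch) for batch in micro_batches(events, 1.0)]
running_total = sum(per_batch, Counter())
```

Each element of `per_batch` is the result for one interval, and `running_total` combines them the way a batch job over the whole data set would.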


Why Scala?
• Functional
• Object-oriented programming
• On the JVM
• Static typing: easier to control performance

Scala
• Scala was created by Martin Odersky, who began designing it
in 2001 and released the first version in 2003
• Scala is a language that addresses the major needs of
the modern developer.
• It is a statically typed, mixed-paradigm JVM language
with a succinct, elegant, and flexible syntax, a
sophisticated type system, and idioms that promote
scalability from small, interpreted scripts to large,
sophisticated applications.

Scala (continued)
• Scala is compelling because it feels like a dynamically
typed scripting language, thanks to its succinct syntax and
type inference.
• Yet Scala gives you all the benefits of static typing, a
modern object model, functional programming, and an
advanced type system.
• Scala's aim of providing advanced constructs for the
abstraction and composition of components is shared by
several recent research efforts.

What is Elasticsearch?
• In short, it can be thought of as “search engine software”
• It gives you a realistic way to run your own search engine
service (like Bing or Google) over private, sensitive, or
confidential data and documents that you don’t want on the public web
• A valuable extra capability for your company, enterprise, app, startup, or client
• Elasticsearch is an open-source, distributed application that runs on
top of Lucene; it is written in Java and exposes a REST API
• Apache Lucene is regarded as one of the best open-source search
libraries and holds its own even when
compared against the most expensive commercial alternatives
• Very fast search
Where did Elasticsearch come from?
• Originally there was a search project called Compass,
which was primarily worked on by Shay Banon (@kimchy)
• Compass also relied on Lucene, but was not distributed
• kimchy decided to write Elasticsearch to be distributed from the
start, so you could say it was built with the cloud in mind
• Add more servers and they cooperate: they know
how to split up the workload among themselves (search
queries can be resource intensive and expensive in terms of
memory and disk requirements)
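
One common way for a cluster to split up the workload is to route each document to a shard by hashing its identifier, so every node can work out where a document lives without asking a central coordinator. The sketch below shows that hash-modulo idea only. It uses `zlib.crc32` as a stand-in hash, not the hash function Elasticsearch itself uses, and the document IDs and shard count are made up.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard deterministically from the document ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def assign(doc_ids, num_shards):
    """Distribute documents across shards."""
    shards = {shard: [] for shard in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


# Hypothetical documents and shard count.
doc_ids = [f"doc-{i}" for i in range(12)]
layout = assign(doc_ids, num_shards=3)
```

The same ID always maps to the same shard, so indexing and lookups agree on where a document is stored, and a search can be run on all shards in parallel and the results merged.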

Elasticsearch is an advanced distributed application
• It has some very useful properties and abilities when it
comes to operations that involve many nodes
• It scales extremely gracefully
• It has its own optimized binary protocol and forms its
own “internal network”
• …as long as you know what you are doing when it
comes to configuration
• It is open source
