BIG DATA PROCESSING WITH
APACHE SPARK
December 9, 2015
LBS College of Engineering
www.sarithdivakar.info | www.csegyan.org
WHAT IS BIG DATA?
Terabytes of Data
Petabytes of Data
Exabytes of Data
Yottabytes of Data
Brontobytes of Data
Geobytes of Data
WHERE DOES BIG DATA COME FROM?
Huge amounts of data are created every day!
It comes from us!
Non-digitized processes become digitized
Digital India
 Programme to transform India into a digitally empowered
society and knowledge economy
EXAMPLES OF DIGITIZATION
Online banking
Online shopping
E-learning
Emails
Social media
Decreasing cost of storage & data-capture technology
opens up a new era of data revolution
TRENDS IN BIG DATA
 Digitalization of virtually everything: e.g. One’s personal life
DATA TYPES
Structured
Database, Data warehouse, Enterprise systems
Unstructured
Analog data, GPS tracking, Audio/Video streams, Text
files
Semi-Structured
XML, Email, EDI
KEY ENABLERS OF BIG DATA
Increase in storage
capacities
Increase in processing
power
Availability of Data
FEATURES OF BIG DATA GENERATED
Digitally generated
Passively produced
Automatically collected
Geographically or temporally
trackable
Continuously analyzed
DIMENSIONS OF BIG DATA
 Volume: Every minute, 72 hours of video are uploaded to
YouTube
 Variety: Excel tables & databases (Structured), Pure text,
photo, audio, video, web, GPS data, sensor data, documents,
sms, etc. New data formats for new applications
 Velocity: Batch processing not possible as data is streamed.
 Veracity/variability: Uncertainty inherent within some type of
data
 Value: Economic/business value of different data may vary
CHALLENGES IN BIG DATA
Capture
Storage
Search
Sharing
Transfer
Analysis
Visualization
NEED FOR BIG DATA ANALYTICS
Big Data needs to be captured, stored, organized and
analyzed
It is Large & Complex
Cannot manage with current methodologies or data
mining tools
THEN:
Data warehousing, data mining & database technologies
Did not analyze email, PDF and video files
Worked with huge amounts of data
Prediction based on data
NOW:
Analyzing semi-structured and unstructured data
Access and store all the huge data created
BIG DATA ANALYTICS
Big Data analytics refers to tools
and methodologies that aim to
transform massive quantities of
raw data into “data about data”
for analytical purposes.
Discovery of meaningful
patterns in data
Used for decision making
EXCAVATING HIDDEN TREASURES FROM
BIG DATA
Insights into data can provide business advantage
Some key early indications can mean fortunes to
business
More precise analysis with more data
Integrate Big Data with traditional data: Enhance
business intelligence analysis
UNSTRUCTURED DATA TYPES
Email and other forms of electronic
communication
Web-based content (click streams,
social media)
Digitized audio and video
Machine-generated data (RFID,
GPS, sensor-generated data, log
files) and IoT
APPLICATIONS OF BIG DATA ANALYSIS
 Business: Customer personalization, customer needs
 Technology: Reduce process time
 Health: DNA mining to detect hereditary diseases
 Smart cities: Cities with good economic development and high
quality of life could be analyzed
 Oil and Gas: Analyzing sensor-generated data for production
optimization, cost management, risk management, and healthy and
safe drilling
 Telecommunications: Network analytics and optimization from
device, sensor and GPS data to enhance social and promotional services
OPPORTUNITIES BIG DATA OFFERS
Early warning
Real-time awareness
Real-time feedback
CHALLENGES IN BIG DATA
Heterogeneity and incompleteness
Scale
Timeliness
Privacy
Human collaboration
BIG DATA AND CLOUD: CONVERGING
TECHNOLOGIES
Big data: Extracting value out of
“variety, velocity and volume”
from unstructured information
available
Cloud: On demand, elastic,
scalable pay per use self service
model
ANSWER THESE BEFORE MOVING TO BIG
DATA ANALYSIS
Do you have an effective big data problem?
Can the business benefit from using Big
Data?
Do your data volumes really require these
distributed mechanisms?
TECHNOLOGY TO HANDLE BIG DATA
Google was the first company to effectively use big data
Engineers at Google created massively distributed
systems
Collected and analyzed massive collections of web
pages & relationships between them and created
“Google Search Engine” capable of querying billions of
pages
FIRST GENERATION OF DISTRIBUTED
SYSTEMS
Proprietary
Custom Hardware and
software
Centralized data
Hardware based fault
recovery
E.g. Teradata, Netezza, etc.
SECOND GENERATION OF DISTRIBUTED
SYSTEMS
Open source
Commodity hardware
Distributed data
Software based fault recovery
E.g. Hadoop, HPCC
APACHE HADOOP
Apache Hadoop is a framework that allows for the
distributed processing of large data sets across clusters
of commodity computers using a simple programming
model.
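The "simple programming model" is MapReduce: a map step that emits key/value pairs and a reduce step that aggregates them per key. As a rough illustration (not from the slides), here is a minimal word-count pair of scripts in the style of Hadoop Streaming, which pipes text through stdin/stdout; the names mapper.py and reducer.py are placeholders:
# mapper.py - emit (word, 1) for every word read from stdin
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))
# reducer.py - Hadoop delivers keys sorted, so all counts for one word arrive together
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))  # flush previous word
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))          # flush last word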
HADOOP – KEY CHARACTERISTICS
HADOOP CORE COMPONENTS
HDFS ARCHITECTURE
SECONDARY NAMENODE
HADOOP CLUSTER ARCHITECTURE
HADOOP ECOSYSTEM
HADOOP CLUSTER MODES
MAP REDUCE PROGRAMMING
MAP REDUCE FLOW
EXISTING HADOOP CUSTOMERS
HADOOP VERSIONS
WHY DO WE NEED A NEW GENERATION?
A lot has changed since 2000
Both hardware and software have gone through changes
Big data has become a necessity now
Let's look at what changed over the decade
CHANGES IN HARDWARE
State of hardware in 2000 | State of hardware now
Disk was cheap, so disk was the primary source of data | RAM is the king
Network was costly, so data locality mattered | RAM is the primary source of data, with disk as fallback
RAM was very costly | Network is speedier
Single-core machines were dominant | Multi-core machines are commonplace
SHORTCOMINGS OF THE SECOND GENERATION
Batch processing is the primary objective
Not designed to change depending upon use cases
Tight coupling between API and runtime
Do not exploit new hardware capabilities
Too complex
MAPREDUCE LIMITATIONS
 If you wanted to do something complicated, you would have to
string together a series of MapReduce jobs and execute them in
sequence.
 Each of those jobs has high latency, and none can start until
the previous job has finished completely.
 The job output from each step has to be stored in the
distributed file system before the next step can begin.
 Hence, this approach tends to be slow due to replication & disk
storage.
HADOOP VS SPARK
HADOOP | SPARK
Stores data on disk | Stores data in memory (RAM)
Commodity hardware can be utilized | Needs high-end systems with more RAM
Uses replication to achieve fault tolerance | Uses different data storage models to achieve fault tolerance (e.g. RDD)
Speed of processing is lower due to disk reads/writes | Up to 100x faster than Hadoop
Supports only Java & R | Supports Java, Python, R, Scala, etc.; ease of programming is high
Everything is just Map and Reduce | Supports Map, Reduce, SQL, Streaming, etc.
Data should be in HDFS | Data can be in HDFS, Cassandra, HBase or S3; runs on Hadoop, Cloud, Mesos or standalone
THIRD GENERATION DISTRIBUTED
SYSTEMS
 Handle both batch processing and real-time workloads
 Exploit RAM as much as disk
 Multi-core aware
 Do not reinvent the wheel
 They use
 HDFS for storage
 Apache Mesos /YARN for distribution
 Plays well with Hadoop
APACHE SPARK
Open source Big Data processing framework
Apache Spark started as a research project at UC
Berkeley in the AMPLab (whose team later founded
Databricks), which focuses on big data analytics.
Open sourced in early 2010.
Many of the ideas behind the system are presented in
various research papers.
SPARKTIMELINE
SPARK FEATURES
Spark gives us a comprehensive, unified framework
Manage big data processing requirements with a variety
of data sets
 Diverse in nature (text data, graph data etc)
 Source of data (batch v. real-time streaming data).
Spark lets you quickly write applications in Java, Scala,
or Python.
DIRECTED ACYCLIC GRAPH (DAG)
Spark allows programmers to
develop complex, multi-step data
pipelines using the directed acyclic
graph (DAG) pattern.
It also supports in-memory data
sharing across DAGs, so that
different jobs can work with the
same data.
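As a minimal PySpark sketch of this (assuming a hypothetical log file logs.txt whose lines start with a host name), a chain of transformations forms the DAG, and caching lets a second job reuse the same intermediate data:
errors = sc.textFile('logs.txt') \
           .filter(lambda line: 'ERROR' in line) \
           .cache()                          # keep the filtered data in memory
errors.count()                               # first job: materializes and caches 'errors'
by_host = errors.map(lambda line: (line.split(' ')[0], 1)) \
                .reduceByKey(lambda a, b: a + b)
by_host.collect()                            # second job: reuses the cached data, no re-read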
UNIFIED PLATFORM FOR BIG DATA APPS
WHY UNIFICATION MATTERS?
Good for developers: one platform to learn
Good for users: take apps everywhere
Good for distributors: more apps
UNIFICATION BRINGS ONE ABSTRACTION
All the different processing systems in Spark share the same
abstraction, called the RDD
RDD stands for Resilient Distributed Dataset
As they share the same abstraction, you can mix and match
different kinds of processing in the same application
SPAM DETECTION
RUNS EVERYWHERE
You can run Spark on top of any distributed system
It can run on
Hadoop 1.x
Hadoop 2.x
Apache Mesos
Its own cluster
It's just a user-space
library
SMALL AND SIMPLE
 Apache Spark is highly
modular
 The original version
contained only 1600 lines of
Scala code
 The Apache Spark API is
extremely simple compared
to the Java API of M/R
 API is concise and consistent
SPARK ARCHITECTURE
DATA STORAGE
Spark uses the HDFS file system for data storage.
 It works with any Hadoop compatible data source
including HDFS, HBase, Cassandra, etc.
API
The API enables application developers to create
Spark-based applications using a standard API interface.
Spark provides API for Scala, Java, and Python
programming languages.
RESOURCE MANAGEMENT
Spark can be deployed as a standalone server, or it can
run on a distributed computing framework like Mesos or
YARN
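In practice the choice is expressed through the master URL handed to SparkContext. A sketch of the common forms (host names are placeholders; 'yarn-client' is the Spark 1.x style):
from pyspark import SparkContext
sc = SparkContext('local[4]', 'myapp')               # standalone, 4 local cores
# sc = SparkContext('spark://master:7077', 'myapp')  # Spark standalone cluster
# sc = SparkContext('mesos://master:5050', 'myapp')  # Apache Mesos
# sc = SparkContext('yarn-client', 'myapp')          # Hadoop YARN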
SPARK RUNNING ARCHITECTURE
SPARK RUNNING ARCHITECTURE
Connects to a cluster manager which allocates resources
across applications
Acquires executors on cluster nodes – worker processes
to run computations and store data
Sends app code to the executors
Sends tasks for the executors to run
SPARK RUNNING ARCHITECTURE
sc = new SparkContext
f = sc.textFile(“…”)
f.filter(…)
.count()
...
[Diagram: your program drives a Spark client (app master) holding the RDD graph, scheduler, block tracker and shuffle tracker; through the cluster manager it acquires Spark workers, each running task threads and a block manager, on top of HDFS, HBase, …]
SCHEDULING PROCESS
RDD Objects: build the operator DAG
DAGScheduler: splits the graph into stages of tasks and
submits each stage as ready (agnostic to operators)
TaskScheduler: launches the TaskSet via the cluster
manager and retries failed or straggling tasks (doesn't
know about stages)
Worker: executes tasks in threads and stores and serves
blocks through its block manager
RDD - RESILIENT DISTRIBUTED DATASET
 Resilient Distributed Datasets (RDD) are the primary abstraction
in Spark – a fault-tolerant collection of elements that can be
operated on in parallel
 A big collection of data having the following properties
 Immutable
 Lazy evaluated
 Cacheable
 Type inferred
RDD - RESILIENT DISTRIBUTED DATASET –
TWO TYPES
Parallelized collections – take an existing Scala
collection and run functions on it in parallel
Hadoop datasets / files – run functions on each record of
a file in Hadoop distributed file system or any other
storage system supported by Hadoop
SPARK COMPONENTS & ECOSYSTEM
 Spark driver (context)
 Spark DAG scheduler
 Cluster management
systems
 YARN
 Apache Mesos
 Data sources
 In memory
 HDFS
 NoSQL
ECOSYSTEM OF HADOOP & SPARK
CONTRIBUTORS PER MONTH TO SPARK
SPARK – STACK OVERFLOW ACTIVITY
IN MEMORY
In Spark, you can cache HDFS data in the main memory of
worker nodes
Spark analysis can be executed directly on in-memory
data
Shuffling can also be done from memory
Fault tolerant
INTEGRATION WITH HADOOP
No separate storage layer
Integrates well with HDFS
Can run on Hadoop 1.0 and Hadoop 2.0 YARN
Excellent integration with ecosystem projects like
Apache Hive, HBase etc
MULTI LANGUAGE API
Written in Scala, but the API is not limited to it
Offers an API in
Scala
Java
Python
You can also do SQL using SparkSQL
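A minimal SparkSQL sketch using the Spark 1.x SQLContext (the rows here are made-up sample data, reusing the names from the key/value example):
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
rows = sc.parallelize([Row(name='nithin', age=25), Row(name='appu', age=40)])
df = sqlContext.createDataFrame(rows)
df.registerTempTable('people')          # expose the DataFrame to SQL (Spark 1.x API)
sqlContext.sql('SELECT name FROM people WHERE age > 30').collect()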
SPARK – OPEN SOURCE ECOSYSTEM
SPARK SORT RECORD
PYTHON EXAMPLES
READ
f = open('demo.txt','r')
data = f.read()
print(data)
WRITE
f = open('demo.txt','a')
f.write('I am trying to write a file')
f.close()
RDD CREATION – FROM COLLECTIONS
A = range(1,100000)             # Creating a collection
print(A)
raw_data = sc.parallelize(A)    # Creating an RDD from the collection
raw_data.count()                # Count the number of elements loaded
raw_data.take(5)                # View the sample data
RDD CREATION – FROM FILES
import urllib                   # Getting the data file
f = urllib.urlretrieve("https://sparksarith.azurewebsites.net/Sarith/test.csv", "tv.csv")
data_file = "./tv.csv"
raw_data = sc.textFile(data_file)   # Creating an RDD from a file
raw_data.count()                # Count the number of lines in the loaded file
raw_data.take(5)                # View the sample data
IMMUTABILITY
 Immutability means once created, it never changes
 Big data is by default immutable in nature
 Immutability helps to
 Parallelize
 Cache
 const int a = 0 // immutable
 int b = 0; // mutable
 b++ // in place (update)
 c = a + 1 // copy
 Immutability is about the value, not the reference
IMMUTABILITY IN COLLECTIONS
Mutable:
var collection = [1,2,4,5]
for (i = 0; i < collection.length; i++) {
  collection[i] += 1;
}
Uses a loop for updating; collection is updated in place
Immutable:
val collection = [1,2,4,5]
val newCollection = collection.map(value => value + 1)
Uses a transformation for change; creates a new copy of
the collection and leaves the original intact
CHALLENGES OF IMMUTABILITY
Immutability is great for parallelism but not good for
space
Doing multiple transformations results in
Multiple copies of data
Multiple passes over data
In big data, multiple copies and multiple passes will have
poor performance characteristics.
LET’S GET LAZY
Laziness means not
computing a transformation till
it's needed
Laziness defers evaluation
Laziness allows separating
execution from evaluation
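A minimal PySpark sketch of this behaviour: the two transformations below return immediately without touching the data, and only the action at the end triggers computation:
rdd = sc.parallelize(range(1, 1000001))
doubled = rdd.map(lambda x: x * 2)            # no computation happens here
evens = doubled.filter(lambda x: x % 4 == 0)  # still nothing computed
evens.count()                                 # action: the whole chain runs now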
LAZINESS AND IMMUTABILITY
You can be lazy only if the underlying data is
immutable
You cannot combine transformations if a transformation
has side effects
So combining laziness and immutability gives better
performance and distributed processing.
CHALLENGES OF LAZINESS
Laziness poses challenges in terms of data types
If laziness defers execution, determining the type of a
variable becomes challenging
If we can't determine the right type, semantic issues
can creep in
Running big data programs and getting semantic errors
is not fun.
TRANSFORMATIONS
 Transformations are operations on an RDD that return a new RDD
 By using the map transformation in Spark, we can apply a function to every
element in our RDD
 Collect will get all the elements in the RDD into memory to work with them
csv_data = raw_data.map(lambda x : x.split(","))
all_data = csv_data.collect()
all_data
len(all_data)
SET OPERATIONS ON RDD
 Spark supports many of the operations we have on mathematical sets, such as
union and intersection, even when the RDDs themselves are not properly sets
 Union of RDDs doesn't remove duplicates
a=[1,2,3,4,5]
b=[1,2,3,6]
dist_a = sc.parallelize(a)
dist_b = sc.parallelize(b)
subtract_data = dist_a.subtract(dist_b)
subtract_data.take(10)
union_data=dist_a.union(dist_b)
union_data.take(10)
[1, 2, 3, 4, 5, 1, 2, 3, 6]
distinct_data=union_data.distinct()
distinct_data.take(10)
[2, 4, 6, 1, 3, 5]
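intersection, mentioned above, works the same way; a minimal sketch continuing with dist_a and dist_b:
common_data = dist_a.intersection(dist_b)   # unlike union, removes duplicates (needs a shuffle)
common_data.take(10)                        # e.g. [1, 2, 3]; order may vary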
KEY-VALUE PAIRS - RDD
 Spark provides specific functions to deal with RDDs whose elements are key/value
pairs
 They are commonly used for grouping and aggregations
data = ['nithin,25','appu,40','anil,20','nithin,35','anil,30','anil,50']
raw_data = sc.parallelize(data)
raw_data.collect()
key_value = raw_data.map(lambda line:(line.split(',')[0],int(line.split(',')[1])))
grouped_data = key_value.reduceByKey(lambda x,y:x+y)
grouped_data.collect()
grouped_data.keys().collect()
grouped_data.values().collect()
sorted_data = grouped_data.sortByKey()
sorted_data.collect()
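A common next step (a hypothetical extension, not on the slides) is averaging per key instead of summing, by carrying (sum, count) pairs through reduceByKey:
sum_count = key_value.mapValues(lambda v: (v, 1)) \
                     .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
averages = sum_count.mapValues(lambda p: p[0] / float(p[1]))
averages.collect()   # e.g. [('nithin', 30.0), ('appu', 40.0), ('anil', 33.3...)]; order may vary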
CACHING
 Immutable data allows you to cache data for a long time
 Lazy transformations allow data to be recreated on failure
 Transformations can also be saved
 Caching data improves execution engine performance
raw_data.cache()
raw_data.collect()
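A rough way to see the effect is to time the same action twice (a sketch; wall-clock numbers will vary):
import time
t0 = time.time(); raw_data.count(); print(time.time() - t0)  # first pass reads and caches
t0 = time.time(); raw_data.count(); print(time.time() - t0)  # second pass is served from memory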
SAVING YOUR DATA
 saveAsTextFile(path) is used for storing the RDD on your hard disk
 Path is a directory, and Spark will output multiple files under that directory. This
allows Spark to write output from multiple nodes
raw_data.saveAsTextFile('opt')
SPARK EXECUTION MODEL
Create DAG of RDDs to represent computation
Create logical execution plan for DAG
Schedule and execute individual tasks
STEP 1: CREATE RDDS
DEPENDENCY TYPES
"Narrow" deps: map, filter; union; join with inputs
co-partitioned
"Wide" (shuffle) deps: groupByKey; join with inputs not
co-partitioned
STEP 2: CREATE EXECUTION PLAN
Pipeline as much as possible
Split into “stages” based on
need to reorganize data
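One way to inspect the stage split is an RDD's lineage dump, toDebugString(); a sketch reusing key_value from the key-value example, where reduceByKey introduces the shuffle boundary:
summed = key_value.reduceByKey(lambda x, y: x + y)
print(summed.toDebugString())   # the ShuffledRDD line marks where one stage ends and the next begins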
STEP 3: SCHEDULE TASKS
Split each stage into tasks
A task is data +
computation
Execute all tasks within a
stage before moving on
SPARK SUBMIT
from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')
raw_data = sc.textFile("./bigdata.txt")
shows = raw_data.map(lambda line: (line.split(',')[4],1))
shows.take(5)
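The script above can also be launched outside the shell with spark-submit (assuming it is saved as, say, myscript.py):
./bin/spark-submit myscript.py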
STEP 3: SCHEDULE TASKS
WHO IS USING SPARK
SPARK INSTALLATION
 INSTALL JDK
 sudo apt-get install openjdk-7-jdk
 INSTALL SCALA
 sudo apt-get install scala
 INSTALLING MAVEN
 wget http://mirrors.sonic.net/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
 tar -zxf apache-maven-3.3.3-bin.tar.gz
 sudo cp -R apache-maven-3.3.3 /usr/local
 sudo ln -s /usr/local/apache-maven-3.3.3/bin/mvn /usr/bin/mvn
 mvn -v
SPARK INSTALLATION
 INSTALLING GIT
 sudo apt-get install git
 CLONE SPARK PROJECT FROM GITHUB
 git clone https://github.com/apache/spark.git
 cd spark
 BUILD SPARK PROJECT
 build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
 For starting the Spark cluster - ./sbin/start-all.sh
 For starting the shell - ./bin/pyspark
REFERENCES
 1. “Data Mining and Data Warehousing”, M. Sudheep Elayidom, SOE, CUSAT
 2. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory
Cluster Computing”. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur
Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica.
NSDI 2012. April 2012. Best Paper Award.
 3. “What is Big Data”, https://www-01.ibm.com/software/in/data/bigdata/
 4. “Apache Hadoop”, https://hadoop.apache.org/
 5. “Apache Spark”, http://spark.apache.org/
 6. “Spark: Cluster Computing with Working Sets”. Matei Zaharia, Mosharaf
Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June
2010.
CREDITS
 Dr. M Sudheep Elayidom, Associate Professor, Div Of Computer Science & Engg,
SOE, CUSAT
 Nithin K Anil, Quantiphi, Mumbai, Maharashtra, India
 Lija Mohan, Div Of Computer Science & Engg, SOE, CUSAT