Spark @ Hackathon
Madhukara Phatak
Zinnia Systems
@madhukaraphatak

1
Hackathon

“Hackathon is a programming ritual,
usually done at night, where
programmers solve problems overnight
which may otherwise take years”
- Anonymous

2
Where / When
Dec 6 and 7, 2013
At BigData Conclave
Hosted by Flutura
Solving real-world problem(s) in 24 hours
No restriction on the tools to be used
Deliverables: results, working code and visualization

3
What?
Predict the global energy demand for the next year using the
energy usage data available for the last four years, to
enable utility companies to handle the energy demand
effectively.
Per-minute usage data collected from smart meters
From 2008 to 2012
Around 133 MB uncompressed
2,75,000 records
Around 25,000 records with missing data

4
Questions to be answered
What would be the energy consumption for the next day?
What would be the week-wise energy consumption for the next
one year?
What would be the household’s peak-time load (peak time is
between 7 AM and 10 AM) for the next month?
During weekdays
During weekends
Assuming there was a full day of outage, calculate the
revenue loss for a particular day next year by finding the
Average Revenue Per Day (ARPD) of the household using
the given tariff plan.
Can you identify the device usage patterns?
5
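The peak-time and ARPD questions above can be sketched on a local collection. This is only an illustration, not the team's code: the `Reading` shape, the flat `pricePerKwh` tariff and the choice to treat 10:00 itself as outside the peak window are all assumptions made for the example.

```scala
import java.time.{DayOfWeek, LocalDateTime}

object UsageQuestions {
  // Hypothetical record shape; the hackathon data set is not described in this detail.
  final case class Reading(time: LocalDateTime, kwh: Double)

  // Assumption: peak time covers 07:00 up to, but not including, 10:00.
  def isPeak(r: Reading): Boolean = {
    val hour = r.time.getHour
    hour >= 7 && hour < 10
  }

  def isWeekend(r: Reading): Boolean = {
    val day = r.time.getDayOfWeek
    day == DayOfWeek.SATURDAY || day == DayOfWeek.SUNDAY
  }

  /** Total peak-time load, returned as (weekdays, weekends). */
  def peakLoad(readings: Seq[Reading]): (Double, Double) = {
    val (weekend, weekday) = readings.filter(isPeak).partition(isWeekend)
    (weekday.map(_.kwh).sum, weekend.map(_.kwh).sum)
  }

  /** Average revenue per day under a flat, made-up tariff (price per kWh). */
  def arpd(readings: Seq[Reading], pricePerKwh: Double): Double = {
    val days = readings.map(_.time.toLocalDate).distinct.size
    readings.map(_.kwh).sum * pricePerKwh / days
  }
}
```

Under these assumptions, the revenue loss for a full day of outage is simply the ARPD value.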
Who
Four-member team from Zinnia Systems
Chose Spark/Scala for solving these problems
Everyone except me was new to Scala and Spark
Able to solve the problems on time
Won first prize at the hackathon

6
Why Spark?
Faster prototyping
Excellent in-memory performance
Uses Akka
Able to run 2.5 million concurrent actors in 1 GB of RAM
Easy to debug
Excellent integration with IntelliJ and Eclipse
Little code to write: 500 lines of code
Something new to try at a hackathon

7
Solution
Uses only core Spark
Uses the Geometric Brownian motion algorithm for prediction
Complete code available on GitHub
https://github.com/zinniasystems/spark-energy-prediction
Under the Apache license
Blog series
http://zinniasystems.com/blog/2013/12/12/predicting-global-energy-demand-using-spark-part-1/

8
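The slides do not say how Geometric Brownian motion was applied, so the following is only a generic sketch of the technique on a local sequence, not the project's implementation. It assumes a strictly positive series with a unit time step; `GbmSketch`, `estimate` and `simulate` are names invented for the example.

```scala
import scala.util.Random

object GbmSketch {
  /** Log returns ln(x(t+1) / x(t)) of a strictly positive series. */
  def logReturns(series: Seq[Double]): Seq[Double] =
    series.zip(series.tail).map { case (prev, next) => math.log(next / prev) }

  /** Mean and population standard deviation of the log returns. */
  def estimate(series: Seq[Double]): (Double, Double) = {
    val returns = logReturns(series)
    val mean = returns.sum / returns.size
    val variance = returns.map(r => (r - mean) * (r - mean)).sum / returns.size
    (mean, math.sqrt(variance))
  }

  /**
   * One simulated path of `steps` future values.
   * Each step multiplies the previous value by exp(mean + sigma * Z),
   * where Z is a standard normal draw.
   */
  def simulate(last: Double, mean: Double, sigma: Double, steps: Int, rng: Random): Vector[Double] =
    Iterator
      .iterate(last)(value => value * math.exp(mean + sigma * rng.nextGaussian()))
      .drop(1) // skip the starting value itself
      .take(steps)
      .toVector
}
```

Averaging many simulated paths, each with its own random draws, is one common way to turn this into a point forecast.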
Embrace Scala
Scala is the JVM language in which Spark is implemented.
Though Spark provides Java and Python APIs, the Scala API
feels more natural.
If you are coming from Pig, you will feel at home in the Scala API.
The Spark code base is small, so knowing Scala helps you
peek at the Spark source whenever you need to.
Excellent REPL support.

9
Go functional
Spark encourages you to use functional idioms over
object-oriented ones
Some of the functional idioms available are
Closures
Function chaining
Lazy evaluation
Example: standard deviation, which needs
Sum((x_i - Mean) * (x_i - Mean))
Map/Reduce way
Map calculates (x_i - Mean) * (x_i - Mean)
Reduce does the sum
10
Spark way
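import org.apache.spark.rdd.RDD

// Functional style: map each value to its squared deviation, then reduce to a sum.
private def standDeviation(inputRDD: RDD[Double], mean: Double): Double = {
  val sum = inputRDD
    .map(value => (value - mean) * (value - mean))
    .reduce((firstValue, secondValue) => firstValue + secondValue)
  // Population standard deviation: square root of the mean squared deviation.
  math.sqrt(sum / inputRDD.count())
}

// Imperative style, presumably the version to avoid: each task mutates its
// own copy of `sum`, so the driver's value is not reliably updated.
private def standDeviationImperative(inputRDD: RDD[Double], mean: Double): Double = {
  var sum = 0.0
  inputRDD.foreach(value => sum += (value - mean) * (value - mean))
  math.sqrt(sum / inputRDD.count())
}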
11
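The same map-then-reduce shape can be tried without a cluster. The sketch below uses a plain Scala `Seq` instead of an RDD, which is an assumption made so the example runs standalone; `StdDevSketch` is a name invented here and the input must be non-empty.

```scala
object StdDevSketch {
  /** Population standard deviation via map (squared deviation) and reduce (sum). */
  def stdDev(values: Seq[Double]): Double = {
    val mean = values.sum / values.size
    val sumOfSquares = values
      .map(value => (value - mean) * (value - mean)) // the closure captures `mean`
      .reduce(_ + _)                                 // chained directly onto map
    math.sqrt(sumOfSquares / values.size)
  }
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5, the squared deviations sum to 32, and the result is the square root of 32 / 8, which is 2.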
Use Tuples
Map/Reduce is restricted to key/value pairs
Representing data such as grouped data is too difficult in
key/value pairs
Writables are too much work to develop and maintain
There was a Tuple Map/Reduce effort at some point
Spark (Scala) has tuples built in.

12
Tuple Example
Aggregating data over an hour
def hourlyAggregator(inputRDD: RDD[Record]): RDD[((String, Long), Record)]
The resulting RDD has a tuple as its key, which combines the date and
the hour of the day.
These tuples can be sent as input to other functions
Aggregating data over a day
def dailyAggregation(inputRDD: RDD[((String, Long), Record)]): RDD[(Date, Record)]

13
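How a tuple key flows from one aggregation into the next can be sketched with local collections. The `Record` below is a simplified stand-in with invented fields, and the sketch returns plain maps of totals rather than RDDs of records, so it shows the idea rather than the signatures above.

```scala
object TupleKeys {
  // Hypothetical, simplified record; the project's Record is not shown in the slides.
  final case class Record(date: String, hour: Long, usage: Double)

  /** Sums usage per (date, hour); the tuple itself is the key. */
  def hourly(records: Seq[Record]): Map[(String, Long), Double] =
    records
      .groupBy(record => (record.date, record.hour))
      .map { case (key, group) => key -> group.map(_.usage).sum }

  /** Re-keys the hourly totals by date only and sums again. */
  def daily(hourlyTotals: Map[(String, Long), Double]): Map[String, Double] =
    hourlyTotals.toSeq
      .groupBy { case ((date, _), _) => date }
      .map { case (date, entries) => date -> entries.map(_._2).sum }
}
```

No wrapper class is needed for the composite key: pattern matching on `((date, _), _)` takes the tuple apart where the next stage needs only part of it.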
Use Lazy evaluation
Map/Reduce does not embrace lazy evaluation. The output of
every job has to be written to HDFS
HDFS is the only way to share data in Map/Reduce
Spark differs
Every operation other than an action is lazily evaluated
Write only critical data to disk; cache other intermediate data
in memory
Be careful when you use actions. Try to delay calling
actions as late as possible.
Refer to Main.scala

14
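The effect of lazy evaluation and caching can be observed locally with a Scala collection view, used here as a rough analogy for RDD transformations and not as Spark code. `LazyDemo` and its counter are invented for the example; the mutable counter is only acceptable because everything runs in a single local thread.

```scala
object LazyDemo {
  // Counts how often the mapping function actually runs.
  var evaluations = 0

  /** Builds a lazy pipeline; nothing is computed until a result is demanded. */
  def squared(values: Seq[Int]) =
    values.view.map { value =>
      evaluations += 1
      value * value
    }
}
```

Calling `LazyDemo.squared(Seq(1, 2, 3))` leaves the counter untouched. Each `.sum` on the result re-runs the mapping, much as each action recomputes an uncached RDD. Materializing the view once with `.toVector` plays the role of caching: later sums on that vector do no further mapping work.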
Thank you

15
The code is available in EnergyUsagePrediction.scala


Spark at-hackthon8jan2014

  • 1. Spark @ Hackathon Madhukara Phatak Zinnia Systems @madhukaraphatak 1
  • 2. Hackathon “ Hackathon is a programming ritual , usually done in night, where programmers solve problems overnight which may take years of time otherwise” -Anonymous 2
  • 3. Where / When Dec 6 and 7 th , 2013 At BigData Conclave Hosted by Flutura Solving real world problem(s) in 24 hours No restriction on tools to be used Deliverables : Results, Working code and Visualization 3
  • 4. What? Predict the global energy demand for next year using the energy usage data available for last four years, in order to enable utility companies , effectively handle the energy demand. Every minute usage data collected from smart meters From 2008- 2012 Around 133 mb uncompressed 2,75,000 records Around 25,000 missing data records 4
  • 5. Questions to be answered What would be the energy consumption for the next day? What would be week wise Energy consumption for the next one year? What would be household’s peak time load (Peak time is between 7 AM to 10 AM) for the next month. During Weekdays During Weekends Assuming there was full day of outage, Calculate the Revenue Loss for a particular day next year by finding the Average Revenue per day (ARPD) of the household using given tariff plan Can you identify the device usage patterns? 5
  • 6. Who: Four-member team from Zinnia Systems. Chose Spark/Scala for solving these problems. Everyone except me was new to Scala and Spark. Able to solve the problems on time. Won first prize at the hackathon.
  • 7. Why Spark? Faster prototyping. Excellent in-memory performance. Uses Akka (able to run 2.5 million concurrent actors in 1 GB of RAM). Easy to debug. Excellent integration with IntelliJ and Eclipse. Little to code: 500 lines of code. Something new to try at a hackathon.
  • 8. Solution: Uses only core Spark. Uses the Geometric Brownian motion algorithm for prediction. Complete code available on GitHub at https://github.com/zinniasystems/spark-energy-prediction under the Apache license. Blog series: http://zinniasystems.com/blog/2013/12/12/predicting-global-energy-demand-using-spark-part-1/
  • 9. Embrace Scala: Scala is the JVM language in which Spark is implemented. Though Spark provides Java and Python APIs, the Scala API feels more natural. If you are coming from Pig, you will feel at home in the Scala API. The Spark source base is small, so knowing Scala helps you peek at the Spark source whenever needed. Excellent REPL support.
  • 10. Go functional: Spark encourages you to use functional idioms over object-oriented ones. Some of the functional idioms available are closures, function chaining, and lazy evaluation. Example: standard deviation, which needs Sum((xi - Mean)*(xi - Mean)). The Map/Reduce way: map calculates (xi - Mean)*(xi - Mean) for each value; reduce does the sum.
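  • 11. Spark way: map computes the squared deviations and reduce sums them.
      import org.apache.spark.rdd.RDD
      private def standDeviation(inputRDD: RDD[Double], mean: Double): Double = {
        val sum = inputRDD
          .map(value => (value - mean) * (value - mean))
          .reduce((firstValue, secondValue) => firstValue + secondValue)
        // Assumed final step (the slide stops at the sum): population standard deviation.
        math.sqrt(sum / inputRDD.count())
      }
    The slide also shows an imperative version that accumulates into a local variable. As written it does not compile (sum is a val, and map returns an RDD rather than a Double), and even with a var the updates would be applied to copies of the variable inside the tasks, not to the one on the driver:
      private def standDeviation(inputRDD: RDD[Double], mean: Double): Double = {
        val sum = 0; inputRDD.map(value => sum += (value - mean) * (value - mean)) }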
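The map-then-reduce idea from slides 10 and 11 can be tried without a cluster. The following is a minimal sketch, assuming a plain Scala `Seq[Double]` as a stand-in for `RDD[Double]` (the `map` and `reduce` calls have the same shape on both). The object name `StdDevSketch` and the final square-root step are illustrative assumptions, not taken from the project.

```scala
object StdDevSketch {
  // Population standard deviation via map then reduce.
  // Assumption: `values` is non-empty and stands in for an RDD[Double].
  def standardDeviation(values: Seq[Double], mean: Double): Double = {
    val sumOfSquares = values
      .map(value => (value - mean) * (value - mean)) // map: squared deviation per value
      .reduce((first, second) => first + second)     // reduce: add them up
    math.sqrt(sumOfSquares / values.size)
  }

  def main(args: Array[String]): Unit = {
    val readings = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
    val mean = readings.sum / readings.size
    println(standardDeviation(readings, mean))
  }
}
```

No variable is mutated anywhere in the function, which is what lets the same expression be distributed across workers when the `Seq` is swapped for an RDD.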
  • 12. Use tuples: Map/Reduce is restricted to key/value pairs. Representing data such as grouped data is too difficult with key/value pairs, and Writables are too much work to develop and maintain. There was a Tuple Map/Reduce effort at some point in time. Spark (Scala) has tuples built in.
  • 13. Tuple example. Aggregating data over an hour: def hourlyAggregator(inputRDD: RDD[Record]): RDD[((String, Long), Record)]. The resulting RDD has a tuple as its key, which combines the date and the hour of the day. These tuples can be sent as input to other functions. Aggregating data over a day: def dailyAggregation(inputRDD: RDD[((String, Long), Record)]): RDD[(Date, Record)].
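The slides show only the signatures of the two aggregators. As a rough sketch of how a tuple key flows from one function into the next, the example below uses plain Scala collections instead of RDDs. The `Record` fields, the use of a `String` date in place of `Date`, and the summing logic are all assumptions made for illustration; the project's real `Record` class is not shown on the slides.

```scala
object TupleKeySketch {
  // Hypothetical record shape; the project's actual Record is not shown on the slides.
  final case class Record(date: String, hour: Long, usage: Double)

  // Sum usage per (date, hour). The tuple itself is the key.
  def hourlyAggregator(records: Seq[Record]): Map[(String, Long), Record] =
    records
      .groupBy(record => (record.date, record.hour))
      .map { case (key, group) =>
        key -> Record(key._1, key._2, group.map(_.usage).sum)
      }

  // Consume the tuple-keyed output and collapse it to one total per date.
  // The hour field of the daily Record is a placeholder (0).
  def dailyAggregation(hourly: Map[(String, Long), Record]): Map[String, Record] =
    hourly.toSeq
      .groupBy { case ((date, _), _) => date }
      .map { case (date, entries) =>
        date -> Record(date, 0L, entries.map(_._2.usage).sum)
      }
}
```

Because the key is an ordinary tuple, no custom key class (the equivalent of a Hadoop Writable) has to be written or maintained.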
  • 14. Use lazy evaluation: Map/Reduce does not embrace lazy evaluation. The output of every job has to be written to HDFS, and HDFS is the only way to share data in Map/Reduce. Spark differs: every operation other than an action is lazily evaluated. Write only critical data to disk and cache other intermediate data in memory. Be careful when you use actions; try to delay calling them as late as possible. Refer to Main.scala.
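The difference between a lazy transformation and an action can be observed locally. This is only an analogy, assuming a Scala `Iterator` in place of an RDD: `map` on an iterator is deferred in the same way an RDD transformation is, and `sum` plays the role of the action that forces the work. The object name `LazySketch` and the counter are illustrative, and the mutable counter is used only because everything here runs in a single JVM.

```scala
object LazySketch {
  // Returns (evaluations before the "action", result, evaluations after it).
  def demo(): (Int, Double, Int) = {
    var evaluated = 0 // local demonstration only; do not mutate driver state from Spark closures
    val readings = Iterator(1.0, 2.0, 3.0)

    // "Transformation": nothing is computed yet.
    val doubled = readings.map { value => evaluated += 1; value * 2 }
    val before = evaluated

    // "Action": forces every element through the map function.
    val total = doubled.sum
    (before, total, evaluated)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

The counter stays at zero until the terminal call runs, which is why delaying actions lets a whole chain of transformations be planned and executed together.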
  • The code from slide 11 is available in EnergyUsagePrediction.scala.