SlideShare a Scribd company logo
1 of 26
Frustration-Reduced
Spark
DataFrames and the Spark Time-Series Library
Ilya Ganelin
Why are we here?
 Spark for quick and easy batch ETL (no streaming)
 Actually using data frames
 Creation
 Modification
 Access
 Transformation
 Time Series analysis
 https://github.com/cloudera/spark-timeseries
 http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-
for-analyzing-time-series-data-with-apache-spark/
Why Spark?
 Batch/micro-batch processing of large datasets
 Easy to use, easy to iterate, wealth of common
industry-standard ML algorithms
 Super fast if properly configured
 Bridges the gap between the old (SQL, single machine
analytics) and the new (declarative/functional
distributed programming)
Why not Spark?
 Breaks easily with poor usage or improperly specified
configs
 Scaling up to larger datasets 500GB -> TB scale
requires deep understanding of internal configurations,
garbage collection tuning, and Spark mechanisms
 While there are lots of ML algorithms, a lot of them
simply don’t work, don’t work at scale, or have poorly
defined interfaces / documentation
Scala
 Yes, I recommend Scala
 Python API is underdeveloped, especially for ML Lib
 Java (until Java 8) is a second class citizen as far as
convenience vs. Scala
 Spark is written in Scala – understanding Scala helps you
navigate the source
 Can leverage the spark-shell to rapidly prototype new
code and constructs
 http://www.scala-
lang.org/docu/files/ScalaByExample.pdf
Why DataFrames?
 Iterate on datasets MUCH faster
 Column access is easier
 Data inspection is easier
 groupBy, join, are faster due to under-the-hood
optimizations
 Some chunks of ML Lib now optimized to use data
frames
Why not DataFrames?
 RDD API is still much better developed
 Getting data into DataFrames can be clunky
 Transforming data inside DataFrames can be clunky
 Many of the algorithms in ML Lib still depend on RDDs
Creation
 Read in a file with an embedded header
 http://tinyurl.com/zc5jzb2
 Create a DF
 Option A – Map schema to Strings, convert to Rows
 Option B – default type (case-classes or tuples)
DataFrame Creation
 Option C – Define the schema explicitly
 Check your work with df.show()
DataFrame Creation
Column Manipulation
 Selection
 GroupBy
 Confusing! You get a GroupedData object, not an RDD or
DataFrame
 Use agg or built-ins to get back to a DataFrame.
 Can convert to RDD with dataFrame.rdd
Custom Column Functions
 Option A: Add a column with a custom function:
 http://stackoverflow.com/questions/29483498/append-a-
column-to-data-frame-in-apache-spark-1-3
 Option B: Match the Row, get explicit names (yields
RDD, not DataFrame!)
Row Manipulation
 Filter
 Range:
 Equality:
 Column functions
Joins
 Option A (inner join)
 Option B (explicit)
 Join types: “inner”, “outer”, “left_outer”, “right_outer”,
“leftsemi”
 DataFrame joins benefit from Tungsten optimizations
Null Handling
 Built in support for handling nulls in data frames.
 Drop, fill, replace
 https://spark.apache.org/docs/1.6.0/api/java/org/apache
/spark/sql/DataFrameNaFunctions.html
Spark-TS
 https://github.com/cloudera/spark-timeseries
 Uses Java 8 ZonedDateTime as of 0.2 release:
Dealing with timestamps
Why Spark TS?
 Each row of the TimeSeriesRDD is a keyed vector of
doubles (indexed by your time index)
 Easily and efficiently slice data-sets by time:
 Generate statistics on the data:
Why Spark TS?
 Feature generation
 Moving averages over time
 Outlier detection (e.g. daily activity > 2 std-dev from
moving average)
 Constant time lookups in RDD by time vs. default O(m),
where m is the partition size
What doesn’t work?
 Cannot have overlapping entries per time index, e.g. data
with identical date time (e.g. same day for DayFrequency)
 If time zones are not aligned in your data, data may not
show up in the RDD
 Limited input format: must be built from two columns => Key
(K) and a Double
 Documentation/Examples not up to date with v0.2
 0.2 version => There will be bugs
 But it’s open source! Go fix them 
How do I use it?
 Download the binary (version 0.2 with dependencies)
 http://tinyurl.com/z6oo823
 Add it as a jar dependency when launching Spark:
 spark-shell --jars sparkts-0.2.0-SNAPSHOT-jar-with-
dependencies_ilya_0.3.jar
What else?
 Save your work => Write completed datasets to file
 Work on small data first, then go to big data
 Create test data to capture edge cases
 LMGTFY
By popular demand:
screen spark-shell
--driver-memory 100g 
--num-executors 60 
--executor-cores 5 
--master yarn-client 
--conf "spark.executor.memory=20g” 
--conf "spark.io.compression.codec=lz4" 
--conf "spark.shuffle.consolidateFiles=true" 
--conf "spark.dynamicAllocation.enabled=false" 
--conf "spark.shuffle.manager=tungsten-sort" 
--conf "spark.akka.frameSize=1028" 
--conf "spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m -
XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 -
XX:+UseParNewGC -XX:+UseConcMarkSweepGC 
-XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -
XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts -
XX:+UseCompressedOops"
Any Spark on YARN
 E.g. Deploy Spark 1.6 on CDH 5.4
 Download your Spark binary to the cluster and untar
 In $SPARK_HOME/conf/spark-env.sh:
 export
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/co
nf
 This tells Spark where Hadoop is deployed, this also gives it the
link it needs to run on YARN
 export SPARK_DIST_CLASSPATH=$(/usr/bin/hadoop
classpath)
 This defines the location of the Hadoop binaries used at run
time
References
 http://spark.apache.org/docs/latest/programming-guide.html
 http://spark.apache.org/docs/latest/sql-programming-guide.html
 http://tinyurl.com/leqek2d (Working With Spark, by Ilya Ganelin)
 http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-
spark-jobs-part-1/ (by Sandy Ryza)
 http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-
spark-jobs-part-2/ (by Sandy Ryza)
 http://www.slideshare.net/ilganeli/frustrationreduced-spark-
dataframes-and-the-spark-timeseries-library

More Related Content

What's hot

Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and DatasetKazuaki Ishizaki
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalDatabricks
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...Spark Summit
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureRussell Spitzer
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache SparkJosef Adersberger
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksLegacy Typesafe (now Lightbend)
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalystTakuya UESHIN
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying SparkDatabricks
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousJen Aman
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallSpark Summit
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 

What's hot (20)

Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalyst
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying Spark
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 

Similar to Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library

Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsRavindra kumar
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkIke Ellis
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesDatabricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptxDori Waldman
 
Oracle Database 12c "New features"
Oracle Database 12c "New features" Oracle Database 12c "New features"
Oracle Database 12c "New features" Anar Godjaev
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 

Similar to Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library (20)

Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
 
Oracle Database 12c "New features"
Oracle Database 12c "New features" Oracle Database 12c "New features"
Oracle Database 12c "New features"
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 

Recently uploaded

Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueBhangaleSonal
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfRagavanV2
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoordharasingh5698
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapRishantSharmaFr
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfrs7054576148
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringmulugeta48
 

Recently uploaded (20)

Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdf
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 

Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library

  • 1. Frustration-Reduced Spark DataFrames and the Spark Time-Series Library Ilya Ganelin
  • 2. Why are we here?  Spark for quick and easy batch ETL (no streaming)  Actually using data frames  Creation  Modification  Access  Transformation  Time Series analysis  https://github.com/cloudera/spark-timeseries  http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library- for-analyzing-time-series-data-with-apache-spark/
  • 3. Why Spark?  Batch/micro-batch processing of large datasets  Easy to use, easy to iterate, wealth of common industry-standard ML algorithms  Super fast if properly configured  Bridges the gap between the old (SQL, single machine analytics) and the new (declarative/functional distributed programming)
  • 4. Why not Spark?  Breaks easily with poor usage or improperly specified configs  Scaling up to larger datasets 500GB -> TB scale requires deep understanding of internal configurations, garbage collection tuning, and Spark mechanisms  While there are lots of ML algorithms, a lot of them simply don’t work, don’t work at scale, or have poorly defined interfaces / documentation
  • 5. Scala  Yes, I recommend Scala  Python API is underdeveloped, especially for ML Lib  Java (until Java 8) is a second class citizen as far as convenience vs. Scala  Spark is written in Scala – understanding Scala helps you navigate the source  Can leverage the spark-shell to rapidly prototype new code and constructs  http://www.scala- lang.org/docu/files/ScalaByExample.pdf
  • 6. Why DataFrames?  Iterate on datasets MUCH faster  Column access is easier  Data inspection is easier  groupBy, join, are faster due to under-the-hood optimizations  Some chunks of ML Lib now optimized to use data frames
  • 7. Why not DataFrames?  RDD API is still much better developed  Getting data into DataFrames can be clunky  Transforming data inside DataFrames can be clunky  Many of the algorithms in ML Lib still depend on RDDs
  • 8. Creation  Read in a file with an embedded header  http://tinyurl.com/zc5jzb2
  • 9.  Create a DF  Option A – Map schema to Strings, convert to Rows  Option B – default type (case-classes or tuples) DataFrame Creation
  • 10.  Option C – Define the schema explicitly  Check your work with df.show() DataFrame Creation
  • 11. Column Manipulation  Selection  GroupBy  Confusing! You get a GroupedData object, not an RDD or DataFrame  Use agg or built-ins to get back to a DataFrame.  Can convert to RDD with dataFrame.rdd
  • 12. Custom Column Functions  Option A: Add a column with a custom function:  http://stackoverflow.com/questions/29483498/append-a- column-to-data-frame-in-apache-spark-1-3  Option B: Match the Row, get explicit names (yields RDD, not DataFrame!)
  • 13. Row Manipulation  Filter  Range:  Equality:  Column functions
  • 14. Joins  Option A (inner join)  Option B (explicit)  Join types: “inner”, “outer”, “left_outer”, “right_outer”, “leftsemi”  DataFrame joins benefit from Tungsten optimizations
  • 15. Null Handling  Built in support for handling nulls in data frames.  Drop, fill, replace  https://spark.apache.org/docs/1.6.0/api/java/org/apache /spark/sql/DataFrameNaFunctions.html
  • 16.
  • 19. Why Spark TS?  Each row of the TimeSeriesRDD is a keyed vector of doubles (indexed by your time index)  Easily and efficiently slice data-sets by time:  Generate statistics on the data:
  • 20. Why Spark TS?  Feature generation  Moving averages over time  Outlier detection (e.g. daily activity > 2 std-dev from moving average)  Constant time lookups in RDD by time vs. default O(m), where m is the partition size
  • 21. What doesn’t work?  Cannot have overlapping entries per time index, e.g. data with identical date time (e.g. same day for DayFrequency)  If time zones are not aligned in your data, data may not show up in the RDD  Limited input format: must be built from two columns => Key (K) and a Double  Documentation/Examples not up to date with v0.2  0.2 version => There will be bugs  But it’s open source! Go fix them 
  • 22. How do I use it?  Download the binary (version 0.2 with dependencies)  http://tinyurl.com/z6oo823  Add it as a jar dependency when launching Spark:  spark-shell --jars sparkts-0.2.0-SNAPSHOT-jar-with- dependencies_ilya_0.3.jar
  • 23. What else?  Save your work => Write completed datasets to file  Work on small data first, then go to big data  Create test data to capture edge cases  LMGTFY
  • 24. By popular demand: screen spark-shell --driver-memory 100g --num-executors 60 --executor-cores 5 --master yarn-client --conf "spark.executor.memory=20g” --conf "spark.io.compression.codec=lz4" --conf "spark.shuffle.consolidateFiles=true" --conf "spark.dynamicAllocation.enabled=false" --conf "spark.shuffle.manager=tungsten-sort" --conf "spark.akka.frameSize=1028" --conf "spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m - XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 - XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 - XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts - XX:+UseCompressedOops"
  • 25. Any Spark on YARN  E.g. Deploy Spark 1.6 on CDH 5.4  Download your Spark binary to the cluster and untar  In $SPARK_HOME/conf/spark-env.sh:  export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/co nf  This tells Spark where Hadoop is deployed, this also gives it the link it needs to run on YARN  export SPARK_DIST_CLASSPATH=$(/usr/bin/hadoop classpath)  This defines the location of the Hadoop binaries used at run time
  • 26. References  http://spark.apache.org/docs/latest/programming-guide.html  http://spark.apache.org/docs/latest/sql-programming-guide.html  http://tinyurl.com/leqek2d (Working With Spark, by Ilya Ganelin)  http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache- spark-jobs-part-1/ (by Sandy Ryza)  http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache- spark-jobs-part-2/ (by Sandy Ryza)  http://www.slideshare.net/ilganeli/frustrationreduced-spark- dataframes-and-the-spark-timeseries-library