SlideShare a Scribd company logo
1 of 45
Demi Ben-Ari
10/2015
About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
BS’c Computer Science – Academic College Tel-Aviv Yaffo
In the Past:
Software Team Leader & Senior Java Software Engineer,
Missile defense and Alert System - “Ofek” unit - IAF
Agenda
 What is Spark?
 Spark Infrastructure and Basics
 Spark Features and Suite
 Development with Spark
 Conclusion
What is Spark?
Efficient Usable
 General execution
graphs
 In-memory storage
 Rich APIs in Java,
Scala, Python
 Interactive shell
Fast and Expressive Cluster Computing
Engine Compatible with Apache Hadoop
What is Spark?
 Apache Spark is a general-purpose, cluster
computing framework
 Spark does computation In Memory & on
Disk
 Apache Spark has low level and high level
APIs
About Spark project
 Spark was founded at UC Berkeley and the
main contributor is “Databricks”.
 Interactive shell Spark in Scala and Python
◦ (spark-shell, pyspark)
 Currently stable in version 1.5
Spark Philosophy
 Make life easy and productive for data
scientists
 Well documented, expressive API’s
 Powerful domain specific libraries
 Easy integration with storage systems
 … and caching to avoid data movement
 Predictable releases, stable API’s
 Stable release each 3 months
Unified Tools Platform
Unified Tools Platform
Spark
SQL
GraphX
MLlib
Machine
Learning
Spark
Streamin
g
Spark Core
Spark Core Features
 Distributed In memory Computation
 Stand alone and Local Capabilities
 History server for Spark UI
 Resource management Integration
 Unified job submission tool
Spark Contributors
 Highly active open source community
(09/2015)
◦ https://github.com/apache/spark/
 https://www.openhub.net/p/apache-spark
Spark Petabyte Sort
Basic Terms
 Cluster (Master, Slaves)
 Driver
 Executors
 Spark Context
 RDD – Resilient Distributed Dataset
Resilient Distributed Datasets
Resilient Distributed Datasets
Spark execution engine
 Spark uses lazy evaluation
 Runs the code only when it encounters
an action operation
 There is no need to design and write a
single complex map-reduce job.
 In Spark we can write smaller and
manageable operations
◦ Spark will group operations together
Spark execution engine
 Serializes your code to the executors
◦ Can choose your serialization method
(Java serialization, Kryo)
 In Java - functions are specified as
objects that implement one of Spark’s
Function interfaces.
◦ Can use the same method of
implementation in Scala and Python as
well.
Spark Execution - UI
Persistence layers for Spark
 Distributed system
◦ Hadoop (HDFS)
◦ Local file system
◦ Amazon S3
◦ Cassandra
◦ Hive
◦ Hbase
 File formats
◦ Text file
 CSV, TSV, Plain Text
◦ Sequence File
◦ AVRO
◦ Parquet
History Server
 Can be run on all Spark deployments,
◦ Stand Alone, YARN, Mesos
 Integrates both with YARN and Mesos
 In Yarn / Mesos, run history server as
a daemon.
Job Submission Tool
 ./bin/spark-submit <app-jar> 
--class my.main.Class
--name myAppName
--master local[4]
--master spark://some-cluster
Multi Language API Support
 Scala
 Java
 Python
 Clojure
Spark Shell
 YouTube – Word Count Example
Cassandra & Spark
 Cassandra cluster
◦ Bare metal vs. On the cloud
 DSE – DataStax Enterprise
◦ Cassandra & Spark in each node
 Vs
◦ Separate Cassandra and Spark clusters
Development with Spark
Where do I start from?!
 Download spark as a package
◦ Run it on “local” mode (no need of a real
cluster)
◦ “spark-ec2” scripts to ramp-up a Stand Alone
mode cluster
◦ Amazon Elastic Map Reduce (EMR)
 Yarn vs. Mesos vs. Stand Alone
Running Environments
 Development – Testing – Production
◦ Don’t you need more?
◦ Be as flexible as you can
 Cluster Utilization
◦ Unified Cluster for all environments
 Vs.
◦ Cluster per Environment
 (Cluster per Data Center)
 Configuration
◦ Local Files vs. Distributed
Saving and Maintaining the
Data Local File System – Not effective in a distributed
environment
 HDFS
◦ Might be very Expensive
◦ Locality Rules – Spark + HDFS node + Same machine
 S3
◦ High latency and pretty slow but low costs
 Cassandra
◦ Rigid data model
◦ Very fast and depends on the Volume of the data can be
DevOps – Keep It Simple,
Stupid Linux
◦ Bash scripts
◦ Crontab
 Automation via Jenkins
 Continuous Deployment – with every GIT push
Dev Testing
Live
Staging
Production
Daily ManualAutomaticAutomatic
Build Automation
 Maven
◦ Sonatype Nexus artifact management
 -
◦ Deploy and Script generation scripts
◦ Per Environment Testing
◦ Data Validation
◦ Scheduled Tasks
Workflow Management
 Oozie – Very hard to integrate with Spark
◦ XML configuration based and not that convenient
 Azkaban (Haven’t tried it)
 Chosen:
◦ Luigi
◦ Crontab + Jenkins (KISS again)
Testing
Dev Testing
Live
Staging
Production
Testing
 Unit
◦ JUnit tests that run on the Spark “Functions”
 End to End
◦ Simulate the full execution of an application on a
single JVM (local mode) – Real input, Real output
 Functional
◦ Stand alone application
◦ Running on the cluster
◦ Minimal coverage – Shows working data flow
Logging
 Runs by default log4j (slf4j)
 How to log correctly:
◦ Separate logs for different applications
◦ Driver and Executors log to different locations
◦ Yarn logging also exists (Might find problems there too)
 ELK Stack (Logstash - ElasticSearch – Kibana)
◦ By Logstash Shippers (Intrusive) or UDP Socket Appender (Log4j2)
◦ DO NOT use the regular TCP Log4J appender
Reporting and Monitoring
 Graphite
◦ Online application metrics
 Grafana
◦ Good Graphite visualization
 Jenkins - Monitoring
◦ Scheduled tests
◦ Validate result set of the applications
◦ Hung or stuck applications
◦ Failed application
Reporting and Monitoring
 Grafana + Graphite - Example
Summary
Cluster
Dev Testing
Live
Staging
ProductionEnv
ELK
Data Flow
Extern
al Data
Source
s
Analytics Layers Data Output
Conclusion
 Spark is a popular and very powerful
distributed in memory computation
framework
 Broadly used and has lots of contributors
 Leading tool in the new world of Petabytes
of unexplored data in the world
Questions?
Thanks,
Resources and Contact
 Demi Ben-Ari
◦ LinkedIn
◦ Twitter: @demibenari
◦ Blog: http://progexc.blogspot.com/
◦ Email: demi.benari@gmail.com
◦ “Big Things” Community
 Meetup, YouTube, Facebook, Twitter

More Related Content

What's hot

Managing (Schema) Migrations in Cassandra
Managing (Schema) Migrations in CassandraManaging (Schema) Migrations in Cassandra
Managing (Schema) Migrations in CassandraDataStax Academy
 
Webinar: DataStax Training - Everything you need to become a Cassandra Rockstar
Webinar: DataStax Training - Everything you need to become a Cassandra RockstarWebinar: DataStax Training - Everything you need to become a Cassandra Rockstar
Webinar: DataStax Training - Everything you need to become a Cassandra RockstarDataStax
 
AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration
AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration  AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration
AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration Amazon Web Services
 
Migrating and Running DBs on Amazon RDS for Oracle
Migrating and Running DBs on Amazon RDS for OracleMigrating and Running DBs on Amazon RDS for Oracle
Migrating and Running DBs on Amazon RDS for OracleMaris Elsins
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkJeremy Beard
 
AWS Webcast - Migrating to RDS Oracle
AWS Webcast - Migrating to RDS OracleAWS Webcast - Migrating to RDS Oracle
AWS Webcast - Migrating to RDS OracleAmazon Web Services
 
Virtualizing Apache Spark and Machine Learning with Justin Murray
Virtualizing Apache Spark and Machine Learning with Justin MurrayVirtualizing Apache Spark and Machine Learning with Justin Murray
Virtualizing Apache Spark and Machine Learning with Justin MurrayDatabricks
 
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...DataStax
 
Advanced data migration techniques for Amazon RDS
Advanced data migration techniques for Amazon RDSAdvanced data migration techniques for Amazon RDS
Advanced data migration techniques for Amazon RDSTom Laszewski
 
AWS Cloud SAA Relational Database presentation
AWS Cloud SAA Relational Database presentationAWS Cloud SAA Relational Database presentation
AWS Cloud SAA Relational Database presentationTATA LILIAN SHULIKA
 
Migrate Oracle database to Amazon RDS
Migrate Oracle database to Amazon RDSMigrate Oracle database to Amazon RDS
Migrate Oracle database to Amazon RDSJesus Guzman
 
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayCassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayDataStax Academy
 
Docker y azure container service
Docker y azure container serviceDocker y azure container service
Docker y azure container serviceFernando Mejía
 
Cassandra Summit 2014: Deploying Cassandra for Call of Duty
Cassandra Summit 2014: Deploying Cassandra for Call of DutyCassandra Summit 2014: Deploying Cassandra for Call of Duty
Cassandra Summit 2014: Deploying Cassandra for Call of DutyDataStax Academy
 
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag JambhekarC* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag JambhekarDataStax Academy
 
Application Development with Apache Cassandra as a Service
Application Development with Apache Cassandra as a ServiceApplication Development with Apache Cassandra as a Service
Application Development with Apache Cassandra as a ServiceWSO2
 
Making (Almost) Any Database Faster and Cheaper with Caching
Making (Almost) Any Database Faster and Cheaper with CachingMaking (Almost) Any Database Faster and Cheaper with Caching
Making (Almost) Any Database Faster and Cheaper with CachingAmazon Web Services
 
What's New in Amazon RDS for Open Source and Commercial Databases
What's New in Amazon RDS for Open Source and Commercial DatabasesWhat's New in Amazon RDS for Open Source and Commercial Databases
What's New in Amazon RDS for Open Source and Commercial DatabasesAmazon Web Services
 
Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013
Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013
Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013Amazon Web Services
 
Cloud Computing101 Azure, updated june 2017
Cloud Computing101 Azure, updated june 2017Cloud Computing101 Azure, updated june 2017
Cloud Computing101 Azure, updated june 2017Fernando Mejía
 

What's hot (20)

Managing (Schema) Migrations in Cassandra
Managing (Schema) Migrations in CassandraManaging (Schema) Migrations in Cassandra
Managing (Schema) Migrations in Cassandra
 
Webinar: DataStax Training - Everything you need to become a Cassandra Rockstar
Webinar: DataStax Training - Everything you need to become a Cassandra RockstarWebinar: DataStax Training - Everything you need to become a Cassandra Rockstar
Webinar: DataStax Training - Everything you need to become a Cassandra Rockstar
 
AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration
AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration  AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration
AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration
 
Migrating and Running DBs on Amazon RDS for Oracle
Migrating and Running DBs on Amazon RDS for OracleMigrating and Running DBs on Amazon RDS for Oracle
Migrating and Running DBs on Amazon RDS for Oracle
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache Spark
 
AWS Webcast - Migrating to RDS Oracle
AWS Webcast - Migrating to RDS OracleAWS Webcast - Migrating to RDS Oracle
AWS Webcast - Migrating to RDS Oracle
 
Virtualizing Apache Spark and Machine Learning with Justin Murray
Virtualizing Apache Spark and Machine Learning with Justin MurrayVirtualizing Apache Spark and Machine Learning with Justin Murray
Virtualizing Apache Spark and Machine Learning with Justin Murray
 
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
 
Advanced data migration techniques for Amazon RDS
Advanced data migration techniques for Amazon RDSAdvanced data migration techniques for Amazon RDS
Advanced data migration techniques for Amazon RDS
 
AWS Cloud SAA Relational Database presentation
AWS Cloud SAA Relational Database presentationAWS Cloud SAA Relational Database presentation
AWS Cloud SAA Relational Database presentation
 
Migrate Oracle database to Amazon RDS
Migrate Oracle database to Amazon RDSMigrate Oracle database to Amazon RDS
Migrate Oracle database to Amazon RDS
 
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayCassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
 
Docker y azure container service
Docker y azure container serviceDocker y azure container service
Docker y azure container service
 
Cassandra Summit 2014: Deploying Cassandra for Call of Duty
Cassandra Summit 2014: Deploying Cassandra for Call of DutyCassandra Summit 2014: Deploying Cassandra for Call of Duty
Cassandra Summit 2014: Deploying Cassandra for Call of Duty
 
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag JambhekarC* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
 
Application Development with Apache Cassandra as a Service
Application Development with Apache Cassandra as a ServiceApplication Development with Apache Cassandra as a Service
Application Development with Apache Cassandra as a Service
 
Making (Almost) Any Database Faster and Cheaper with Caching
Making (Almost) Any Database Faster and Cheaper with CachingMaking (Almost) Any Database Faster and Cheaper with Caching
Making (Almost) Any Database Faster and Cheaper with Caching
 
What's New in Amazon RDS for Open Source and Commercial Databases
What's New in Amazon RDS for Open Source and Commercial DatabasesWhat's New in Amazon RDS for Open Source and Commercial Databases
What's New in Amazon RDS for Open Source and Commercial Databases
 
Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013
Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013
Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013
 
Cloud Computing101 Azure, updated june 2017
Cloud Computing101 Azure, updated june 2017Cloud Computing101 Azure, updated june 2017
Cloud Computing101 Azure, updated june 2017
 

Viewers also liked

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your EyesDemi Ben-Ari
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnJosef A. Habdank
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetupjlacefie
 
The Ultimate Logging Architecture - You KNOW you want it!
The Ultimate Logging Architecture - You KNOW you want it!The Ultimate Logging Architecture - You KNOW you want it!
The Ultimate Logging Architecture - You KNOW you want it!Michele Leroux Bustamante
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in SparkSnappyData
 
Scala presentation by Aleksandar Prokopec
Scala presentation by Aleksandar ProkopecScala presentation by Aleksandar Prokopec
Scala presentation by Aleksandar ProkopecLoïc Descotte
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangSpark Summit
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...DataWorks Summit
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosPaco Nathan
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overviewABC Talks
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Sciencesarith divakar
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerSachin Aggarwal
 

Viewers also liked (20)

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetup
 
Scala in practice
Scala in practiceScala in practice
Scala in practice
 
The Ultimate Logging Architecture - You KNOW you want it!
The Ultimate Logging Architecture - You KNOW you want it!The Ultimate Logging Architecture - You KNOW you want it!
The Ultimate Logging Architecture - You KNOW you want it!
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in Spark
 
Scala presentation by Aleksandar Prokopec
Scala presentation by Aleksandar ProkopecScala presentation by Aleksandar Prokopec
Scala presentation by Aleksandar Prokopec
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache Mesos
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overview
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 

Similar to Spark 101 - First steps to distributed computing

Spark to Production @Windward
Spark to Production @WindwardSpark to Production @Windward
Spark to Production @WindwardDemi Ben-Ari
 
Spark in the Maritime Domain
Spark in the Maritime DomainSpark in the Maritime Domain
Spark in the Maritime DomainDemi Ben-Ari
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark exampleShidrokhGoudarzi1
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - InstallationMartin Zapletal
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAmazon Web Services
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Brandon O'Brien
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Apache Spark e AWS Glue
Apache Spark e AWS GlueApache Spark e AWS Glue
Apache Spark e AWS GlueLaercio Serra
 

Similar to Spark 101 - First steps to distributed computing (20)

Spark to Production @Windward
Spark to Production @WindwardSpark to Production @Windward
Spark to Production @Windward
 
Spark in the Maritime Domain
Spark in the Maritime DomainSpark in the Maritime Domain
Spark in the Maritime Domain
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - Installation
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 
Spark
SparkSpark
Spark
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Apache Spark e AWS Glue
Apache Spark e AWS GlueApache Spark e AWS Glue
Apache Spark e AWS Glue
 

More from Demi Ben-Ari

Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-AriThinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-AriDemi Ben-Ari
 
CTO Management Tool Box - Demi Ben-Ari at Panorays
CTO Management Tool Box - Demi Ben-Ari at PanoraysCTO Management Tool Box - Demi Ben-Ari at Panorays
CTO Management Tool Box - Demi Ben-Ari at PanoraysDemi Ben-Ari
 
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...Demi Ben-Ari
 
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...Demi Ben-Ari
 
CTO Management ToolBox - Demi Ben-Ari -- Panorays
CTO Management ToolBox - Demi Ben-Ari -- PanoraysCTO Management ToolBox - Demi Ben-Ari -- Panorays
CTO Management ToolBox - Demi Ben-Ari -- PanoraysDemi Ben-Ari
 
All I Wanted Is to Found a Startup - Demi Ben-Ari - Panorays
All I Wanted Is to Found a Startup - Demi Ben-Ari - PanoraysAll I Wanted Is to Found a Startup - Demi Ben-Ari - Panorays
All I Wanted Is to Found a Startup - Demi Ben-Ari - PanoraysDemi Ben-Ari
 
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - PanoraysHacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - PanoraysDemi Ben-Ari
 
Community, Unifying the Geeks to Create Value - Demi Ben-Ari
Community, Unifying the Geeks to Create Value - Demi Ben-AriCommunity, Unifying the Geeks to Create Value - Demi Ben-Ari
Community, Unifying the Geeks to Create Value - Demi Ben-AriDemi Ben-Ari
 
Apache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysApache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysDemi Ben-Ari
 
Know the Startup World - Demi Ben-Ari - Ofek Alumni
Know the Startup World - Demi Ben-Ari - Ofek AlumniKnow the Startup World - Demi Ben-Ari - Ofek Alumni
Know the Startup World - Demi Ben-Ari - Ofek AlumniDemi Ben-Ari
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
 
Know the Startup World - Demi Ben Ari - Ofek Alumni
Know the Startup World - Demi Ben Ari - Ofek AlumniKnow the Startup World - Demi Ben Ari - Ofek Alumni
Know the Startup World - Demi Ben Ari - Ofek AlumniDemi Ben-Ari
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Demi Ben-Ari
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Demi Ben-Ari
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Demi Ben-Ari
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
 
Bootstrapping a Tech Community - Demi Ben-Ari
Bootstrapping a Tech Community - Demi Ben-AriBootstrapping a Tech Community - Demi Ben-Ari
Bootstrapping a Tech Community - Demi Ben-AriDemi Ben-Ari
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysDemi Ben-Ari
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Demi Ben-Ari
 

More from Demi Ben-Ari (20)

Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-AriThinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
 
CTO Management Tool Box - Demi Ben-Ari at Panorays
CTO Management Tool Box - Demi Ben-Ari at PanoraysCTO Management Tool Box - Demi Ben-Ari at Panorays
CTO Management Tool Box - Demi Ben-Ari at Panorays
 
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
 
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...
Hacker vs company, Cloud Cyber Security Automated with Kubernetes - Demi Ben-...
 
CTO Management ToolBox - Demi Ben-Ari -- Panorays
CTO Management ToolBox - Demi Ben-Ari -- PanoraysCTO Management ToolBox - Demi Ben-Ari -- Panorays
CTO Management ToolBox - Demi Ben-Ari -- Panorays
 
All I Wanted Is to Found a Startup - Demi Ben-Ari - Panorays
All I Wanted Is to Found a Startup - Demi Ben-Ari - PanoraysAll I Wanted Is to Found a Startup - Demi Ben-Ari - Panorays
All I Wanted Is to Found a Startup - Demi Ben-Ari - Panorays
 
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - PanoraysHacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - Panorays
 
Community, Unifying the Geeks to Create Value - Demi Ben-Ari
Community, Unifying the Geeks to Create Value - Demi Ben-AriCommunity, Unifying the Geeks to Create Value - Demi Ben-Ari
Community, Unifying the Geeks to Create Value - Demi Ben-Ari
 
Apache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysApache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - Panorays
 
Know the Startup World - Demi Ben-Ari - Ofek Alumni
Know the Startup World - Demi Ben-Ari - Ofek AlumniKnow the Startup World - Demi Ben-Ari - Ofek Alumni
Know the Startup World - Demi Ben-Ari - Ofek Alumni
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
 
Know the Startup World - Demi Ben Ari - Ofek Alumni
Know the Startup World - Demi Ben Ari - Ofek AlumniKnow the Startup World - Demi Ben Ari - Ofek Alumni
Know the Startup World - Demi Ben Ari - Ofek Alumni
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
 
Bootstrapping a Tech Community - Demi Ben-Ari
Bootstrapping a Tech Community - Demi Ben-AriBootstrapping a Tech Community - Demi Ben-Ari
Bootstrapping a Tech Community - Demi Ben-Ari
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"
 

Recently uploaded

Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 

Recently uploaded (20)

Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 

Spark 101 - First steps to distributed computing

  • 2. About me Demi Ben-Ari Senior Software Engineer at Windward Ltd. BS’c Computer Science – Academic College Tel-Aviv Yaffo In the Past: Software Team Leader & Senior Java Software Engineer, Missile defense and Alert System - “Ofek” unit - IAF
  • 3.
  • 4. Agenda  What is Spark?  Spark Infrastructure and Basics  Spark Features and Suite  Development with Spark  Conclusion
  • 5. What is Spark? Efficient Usable  General execution graphs  In-memory storage  Rich APIs in Java, Scala, Python  Interactive shell Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop
  • 6. What is Spark?  Apache Spark is a general-purpose, cluster computing framework  Spark does computation In Memory & on Disk  Apache Spark has low level and high level APIs
  • 7. About Spark project  Spark was founded at UC Berkeley and the main contributor is “Databricks”.  Interactive shell Spark in Scala and Python ◦ (spark-shell, pyspark)  Currently stable in version 1.5
  • 8. Spark Philosophy  Make life easy and productive for data scientists  Well documented, expressive API’s  Powerful domain specific libraries  Easy integration with storage systems  … and caching to avoid data movement  Predictable releases, stable API’s  Stable release each 3 months
  • 11. Spark Core Features  Distributed In memory Computation  Stand alone and Local Capabilities  History server for Spark UI  Resource management Integration  Unified job submission tool
  • 12.
  • 13. Spark Contributors  Highly active open source community (09/2015) ◦ https://github.com/apache/spark/  https://www.openhub.net/p/apache-spark
  • 15. Basic Terms  Cluster (Master, Slaves)  Driver  Executors  Spark Context  RDD – Resilient Distributed Dataset
  • 18. Spark execution engine  Spark uses lazy evaluation  Runs the code only when it encounters an action operation  There is no need to design and write a single complex map-reduce job.  In Spark we can write smaller and manageable operations ◦ Spark will group operations together
  • 19. Spark execution engine  Serializes your code to the executors ◦ Can choose your serialization method (Java serialization, Kryo)  In Java - functions are specified as objects that implement one of Spark’s Function interfaces. ◦ Can use the same method of implementation in Scala and Python as well.
  • 21. Persistence layers for Spark  Distributed system ◦ Hadoop (HDFS) ◦ Local file system ◦ Amazon S3 ◦ Cassandra ◦ Hive ◦ Hbase  File formats ◦ Text file  CSV, TSV, Plain Text ◦ Sequence File ◦ AVRO ◦ Parquet
  • 22.
  • 23.
  • 24. History Server  Can be run on all Spark deployments, ◦ Stand Alone, YARN, Mesos  Integrates both with YARN and Mesos  In Yarn / Mesos, run history server as a daemon.
  • 25. Job Submission Tool  ./bin/spark-submit <app-jar> --class my.main.Class --name myAppName --master local[4] --master spark://some-cluster
  • 26. Multi Language API Support  Scala  Java  Python  Clojure
  • 27. Spark Shell  YouTube – Word Count Example
  • 28. Cassandra & Spark  Cassandra cluster ◦ Bare metal vs. On the cloud  DSE – DataStax Enterprise ◦ Cassandra & Spark in each node  Vs ◦ Separate Cassandra and Spark clusters
  • 30. Where do I start from?!  Download spark as a package ◦ Run it on “local” mode (no need of a real cluster) ◦ “spark-ec2” scripts to ramp-up a Stand Alone mode cluster ◦ Amazon Elastic Map Reduce (EMR)  Yarn vs. Mesos vs. Stand Alone
  • 31. Running Environments  Development – Testing – Production ◦ Don’t you need more? ◦ Be as flexible as you can  Cluster Utilization ◦ Unified Cluster for all environments  Vs. ◦ Cluster per Environment  (Cluster per Data Center)  Configuration ◦ Local Files vs. Distributed
  • 32. Saving and Maintaining the Data Local File System – Not effective in a distributed environment  HDFS ◦ Might be very Expensive ◦ Locality Rules – Spark + HDFS node + Same machine  S3 ◦ High latency and pretty slow but low costs  Cassandra ◦ Rigid data model ◦ Very fast and depends on the Volume of the data can be
  • 33. DevOps – Keep It Simple, Stupid Linux ◦ Bash scripts ◦ Crontab  Automation via Jenkins  Continuous Deployment – with every GIT push Dev Testing Live Staging Production Daily ManualAutomaticAutomatic
  • 34. Build Automation  Maven ◦ Sonatype Nexus artifact management  - ◦ Deploy and Script generation scripts ◦ Per Environment Testing ◦ Data Validation ◦ Scheduled Tasks
  • 35. Workflow Management  Oozie – Very hard to integrate with Spark ◦ XML configuration based and not that convenient  Azkaban (Haven’t tried it)  Chosen: ◦ Luigi ◦ Crontab + Jenkins (KISS again)
  • 37. Testing  Unit ◦ JUnit tests that run on the Spark “Functions”  End to End ◦ Simulate the full execution of an application on a single JVM (local mode) – Real input, Real output  Functional ◦ Stand alone application ◦ Running on the cluster ◦ Minimal coverage – Shows working data flow
  • 38. Logging  Runs by default log4j (slf4j)  How to log correctly: ◦ Separate logs for different applications ◦ Driver and Executors log to different locations ◦ Yarn logging also exists (Might find problems there too)  ELK Stack (Logstash - ElasticSearch – Kibana) ◦ By Logstash Shippers (Intrusive) or UDP Socket Appender (Log4j2) ◦ DO NOT use the regular TCP Log4J appender
  • 39. Reporting and Monitoring  Graphite ◦ Online application metrics  Grafana ◦ Good Graphite visualization  Jenkins - Monitoring ◦ Scheduled tests ◦ Validate result set of the applications ◦ Hung or stuck applications ◦ Failed application
  • 40. Reporting and Monitoring  Grafana + Graphite - Example
  • 43. Conclusion  Spark is a popular and very powerful distributed in memory computation framework  Broadly used and has lots of contributors  Leading tool in the new world of Petabytes of unexplored data in the world
  • 45. Thanks, Resources and Contact  Demi Ben-Ari ◦ LinkedIn ◦ Twitter: @demibenari ◦ Blog: http://progexc.blogspot.com/ ◦ Email: demi.benari@gmail.com ◦ “Big Things” Community  Meetup, YouTube, Facebook, Twitter

Editor's Notes

  1. Generalize the map/reduce framework
  2. Jenkins - Monitoring Scheduled tests Validate result set of the applications Hung or stuck applications Failed application