Introduction to Spark
Carol McDonald
Introduction to Spark, dataframes, machine learning
1.
© 2014 MapR Technologies
An Overview of Apache Spark
2.
Find my presentation and other related resources here: http://events.mapr.com/JacksonvilleJava (you can find this link in the event’s page at meetup.com)
• Today’s presentation
• Whiteboard & demo videos
• Free on-demand training
• Free eBooks
• Free Hadoop sandbox
• And more…
3.
Agenda
• MapReduce refresher
• What is Spark?
• The difference with Spark
• Examples and resources
4.
MapReduce Refresher
5.
MapReduce: A Programming Model
• MapReduce: Simplified Data Processing on Large Clusters (published 2004)
• Parallel and distributed algorithm:
  – Data locality
  – Fault tolerance
  – Linear scalability
6.
The Hadoop Strategy (http://developer.yahoo.com/hadoop/tutorial/module4.html)
• Distribute data (share nothing)
• Distribute computation (parallelization without synchronization)
• Tolerate failures (no single point of failure)
[Diagram: mapping processes and reducing processes running in parallel on Nodes 1–3]
7.
Distribute Data: HDFS
• HDFS splits large data files into chunks (64 MB)
• Chunks are replicated across the cluster
• The NameNode holds location metadata; user processes go through it for metadata access
• DataNodes store and retrieve the physical data
[Diagram: user process, NameNode, and DataNodes communicating over the network]
8.
Distribute Computation
[Diagram: a MapReduce program reads from data sources, runs on the Hadoop cluster, and produces a result]
9.
MapReduce Execution and Data Flow
• Map – loads data and defines (key, value) pairs
• Shuffle – sorts and collects data by key
• Reduce – receives each (key, list) to process and output
[Diagram: files are loaded from HDFS via InputFormat/RecordReaders into splits, mapped to intermediate (k, v) pairs, partitioned and exchanged between nodes in the shuffle, sorted, reduced, and written back to HDFS via OutputFormat]
10.
MapReduce Example: Word Count
Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax
Map output: (the, 1), (time, 1), (has, 1), (come, 1), (and, 1), …
Shuffle and sort: and → [1, 1, 1]; come → [1, 1, 1]; has → [1, 1]; the → [1, 1, 1]; time → [1, 1, 1, 1]; …
Reduce output: (and, 3), (come, 3), (has, 2), (the, 3), (time, 4), …
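The map/shuffle/reduce flow on this slide can be sketched in plain Python. This is a single-process illustration of the data flow only, not Hadoop code; the helper names are made up for the example:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the line
    return [(word.lower().strip('",.'), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle and sort: group all values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    # Reduce: sum the list of counts for each word
    return {word: sum(values) for word, values in groups.items()}

lines = ['the time has come the walrus said', 'the time is now']
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
```

In real Hadoop the three phases run on different nodes and the shuffle moves data over the network; here they are just three function calls over lists.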
11.
Tolerate Failures
• Failures are expected and managed gracefully
• If a DataNode fails, the NameNode will locate a replica
• If a MapReduce task fails, the JobTracker will schedule another one
12.
MapReduce Processing Model
• For complex work, chain jobs together
  – Use a higher-level language or DSL that does this for you
13.
Typical MapReduce Workflows
[Diagram: Job 1's maps and reduces write a SequenceFile to HDFS, which becomes the input to Job 2, and so on through the last job]
Iteration is slow because each job writes its data to disk and the next job reads it back
14.
Iterations with In-Memory Caching
• Data partitions are read from RAM instead of disk
[Diagram: a chain of steps sharing cached data in memory]
15.
MapReduce Design Patterns
• Summarization – inverted index, counting
• Filtering – top ten, distinct
• Aggregation
• Data organization – partitioning
• Join – join data sets
• Metapattern – job chaining
16.
Inverted Index Example
Input documents:
• alice.txt: "The time has come," the Walrus said
• macbeth.txt: tis time to do it
Map output: (time, alice.txt), (has, alice.txt), (come, alice.txt), …, (tis, macbeth.txt), (time, macbeth.txt), (do, macbeth.txt), …
Reduce output: come → (alice.txt); do → (macbeth.txt); has → (alice.txt); time → (alice.txt, macbeth.txt); …
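The inverted index pattern follows the same map/reduce shape as word count, with the file name as the value instead of a count. A minimal pure-Python sketch of the idea (not Hadoop code; the two documents are taken from the slide):

```python
from collections import defaultdict

def map_phase(filename, text):
    # Map: emit (word, filename) for each word in the document
    return [(word.lower().strip('",.'), filename) for word in text.split()]

def reduce_phase(pairs):
    # Reduce: collect the set of files that contain each word
    index = defaultdict(set)
    for word, filename in pairs:
        index[word].add(filename)
    return index

docs = {'alice.txt': '"The time has come," the Walrus said',
        'macbeth.txt': 'tis time to do it'}
pairs = [p for name, text in docs.items() for p in map_phase(name, text)]
index = reduce_phase(pairs)
```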
17.
MapReduce: The Good
• Optimized to read and write large data files quickly
  – Processes data on the node that stores it
  – Minimizes movement of the disk head
• Scalable
• Built-in fault tolerance
• Developer focuses on map/reduce, not infrastructure
• Simple(?) API
18.
MapReduce: The Bad
• Optimized for disk IO
  – Doesn’t leverage memory well
  – Iterative algorithms go through the disk IO path again and again
• Primitive API
  – Simple abstraction
  – Key/value in, key/value out
19.
Free Hadoop MapReduce On-Demand Training
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
20.
What is Hive?
• Data warehouse on top of Hadoop
  – Analytical querying of data without programming
• SQL-like execution for Hadoop
  – SQL evaluates to MapReduce code
  – Submits jobs to your cluster
21.
Hive Architecture
[Diagram: the command line interface, web interface, JDBC, ODBC, and Thrift server connect through the Hive Driver (compiler, optimizer, executor) to Hadoop (MapReduce + HDFS, with DataNode + TaskTracker, JobTracker, NameNode); the schema metadata is stored in the Hive metastore; a Hive table definition can map to an HBase table such as trades_tall]
22.
Hive HBase – External Table
SQL evaluates to MapReduce code:
SELECT AVG(price) FROM trades WHERE key LIKE "AMZN";
• Projection – SELECT price
• Selection – WHERE key LIKE …
• Aggregation – AVG(price)
Example rows:
key            | cf1:price | cf1:vol
AMZN_986186008 | 12.34     | 1000
AMZN_986186007 | 12.00     | 50
23.
Hive HBase – Hive Query
SQL evaluates to MapReduce code:
SELECT AVG(price) FROM trades WHERE key LIKE "GOOG";
[Diagram: queries pass through the parser, planner, and execution stages against HBase tables]
24.
Hive Query Plan
EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
        TableScan
        Filter Operator
          predicate: (key like 'GOOG%') (type: boolean)
        Select Operator
        Group By Operator
      Reduce Operator Tree:
        Group By Operator
        Select Operator
        File Output Operator
25.
Hive Query Plan – (2)
hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";
[Diagram: map() tasks scan the trades table, filter on key LIKE 'GOOG%', and select price; reduce() tasks group and compute the aggregation avg(price), producing the output column col0]
26.
What is a Directed Acyclic Graph (DAG)?
• Graph – vertices (points) and edges (lines)
• Directed – edges go only in a single direction
• Acyclic – no looping
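The "acyclic" property can be checked mechanically: a directed graph is a DAG exactly when a depth-first search finds no back edge. A small illustrative Python check (not Spark code; the coloring scheme is the standard white/gray/black DFS bookkeeping):

```python
def is_acyclic(edges):
    """Return True if the directed graph given as (src, dst) edges is a DAG."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on current path / done
    color = {node: WHITE for node in graph}

    def dfs(node):
        color[node] = GRAY
        for succ in graph[node]:
            if color[succ] == GRAY:          # back edge: a cycle
                return False
            if color[succ] == WHITE and not dfs(succ):
                return False
        color[node] = BLACK
        return True

    return all(dfs(n) for n in graph if color[n] == WHITE)
```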
27.
Hive Query Plan – Map Reduce Execution
[Diagram: an unoptimized plan runs the operator tree (table scans, reduce sinks, joins, aggregations, file sink) as three MapReduce jobs; the optimizer collapses it into a single job]
28.
Free HBase On-Demand Training (includes Hive and MapReduce with HBase)
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
29.
Apache Spark
30.
Unified Platform
• Spark (general execution engine)
• Libraries on top: Spark SQL, Spark Streaming (streaming), MLlib (machine learning), GraphX (graph computation)
• Cluster managers: Hadoop YARN, Mesos
• Distributed file systems: HDFS, MapR-FS, S3, …
31.
Why Apache Spark?
• Distributed parallel cluster computing
  – Scalable, fault tolerant
• Programming framework
  – APIs in Java, Scala, Python
  – Less code than MapReduce
• Fast
  – Runs computations in memory
  – Tasks are threads
32.
Spark Use Cases
• Iterative algorithms on large amounts of data
• Some example use cases:
  – Anomaly detection
  – Classification
  – Predictions
  – Recommendations
33.
Why Iterative Algorithms
• Algorithms that need iterations:
  – Clustering (K-Means, Canopy, …)
  – Gradient descent (e.g., logistic regression, matrix factorization)
  – Graph algorithms (e.g., PageRank, LineRank, components, paths, reachability, centrality)
  – Alternating Least Squares (ALS)
  – Graph communities / dense sub-components
  – Inference (belief propagation)
  – …
34.
Data Sources
• Local files
• S3
• Hadoop Distributed File System – any Hadoop InputFormat
• HBase
• Other NoSQL data stores
35.
How Spark Works
36.
Spark Programming Model
• The driver program creates a SparkContext and builds RDD operations:
  sc = new SparkContext
  rdd = sc.textFile("hdfs://…")
  rdd.map(…)
• The SparkContext connects to the cluster; tasks run on worker nodes
37.
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• Read-only collection of elements
38.
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• Read-only collection of elements
• Operated on in parallel
• Cached in memory
  – Or on disk
• Fault tolerant
39.
Working With RDDs
textFile = sc.textFile("SomeFile.txt")
40.
Working With RDDs – Transformations
Transformations build a new RDD from an existing one:
linesRDD = sc.textFile("LogFile.txt")
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
41.
Working With RDDs – Transformations and Actions
linesRDD = sc.textFile("LogFile.txt")
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
Actions return a value to the driver:
linesWithErrorRDD.count()   # 6
linesWithErrorRDD.first()   # first error line
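The transformation/action split can be mimicked in plain Python with a generator: building the generator costs nothing (like a transformation), and only consuming it does work (like an action). This is a local analogy, not PySpark, and the log lines are invented for the example:

```python
# A tiny local model of transformations (lazy) vs. actions (eager).
log_lines = ['INFO start', 'ERROR disk full', 'INFO ok', 'ERROR timeout']

# "Transformation": builds a recipe; nothing is computed yet
lines_with_error = (line for line in log_lines if 'ERROR' in line)

# "Actions": force evaluation and return values to the caller
errors = list(lines_with_error)
count = len(errors)    # like linesWithErrorRDD.count()
first = errors[0]      # like linesWithErrorRDD.first()
```

Real RDD transformations are lazy in the same spirit: Spark records the lineage of operations and only runs them when an action such as count() or first() is called.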
Example: Spark Word Count in Scala
// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
// Transform into pairs and count.
val counts = words.map(word => (word, 1)).reduceByKey{ case (x, y) => x + y }
// Save the word counts back out to a text file.
counts.saveAsTextFile(outputFile)
Sample input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax…" Sample output: the, 20 … time, 4 … and, 12 …
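To see what each stage of the word-count pipeline produces, here is a plain-Python sketch of the same three steps (no Spark required; the input lines are invented toy data):

```python
# Plain-Python sketch of the word-count pipeline: flatMap -> map -> reduceByKey.
lines = ['the time has come the walrus said',
         'of shoes and ships and sealing wax']

# flatMap: split every line into words and flatten into one list
words = [w for line in lines for w in line.split(' ')]

# map: pair each word with the count 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the 1s for each distinct word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts['the'])  # 2
print(counts['and'])  # 2
```

In Spark the same logic runs partition by partition across the cluster, with a shuffle grouping equal keys before the reduce.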
Example: Spark Word Count in Scala (lineage, step 1)
// Load input data.
val input = sc.textFile(inputFile)
textFile produces a HadoopRDD, mapped to a MapPartitionsRDD over the file's partitions.
Example: Spark Word Count in Scala (lineage, step 2)
// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
Lineage so far: HadoopRDD → MapPartitionsRDD (textFile) → MapPartitionsRDD (flatMap)
flatMap is a one-to-many mapping: flatMap(line => line.split(" ")) turns each line into zero or more words, e.g. "Ships and wax" → "Ships", "and", "wax" (an RDD[String], wordsRDD).
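The one-to-many behavior is easy to show in plain Python (a sketch, not the Spark API itself):

```python
# flatMap sketch: apply f to each element, then flatten all results
# into a single collection.
def flat_map(f, xs):
    return [y for x in xs for y in f(x)]

words = flat_map(lambda line: line.split(' '), ['Ships and wax', 'Ships'])
print(words)  # ['Ships', 'and', 'wax', 'Ships']
```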
Example: Spark Word Count in Scala (lineage, step 3)
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
// Transform into pairs
val counts = words.map(word => (word, 1))
Lineage so far: HadoopRDD → MapPartitionsRDD → MapPartitionsRDD → MapPartitionsRDD
map is a one-to-one mapping: map(word => (word, 1)) turns "Ships", "and", "wax" into (Ships,1), (and,1), (wax,1) (a PairRDD[String, Int]).
Example: Spark Word Count in Scala (lineage, step 4)
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey{ case (x, y) => x + y }
Lineage so far: HadoopRDD → MapPartitionsRDD → MapPartitionsRDD → MapPartitionsRDD → ShuffledRDD
reduceByKey combines the values for each key with the given function: reduceByKey{ case (x, y) => x + y }, e.g. (and,1), (and,1) → (and,2) and (wax,1), (wax,1) → (wax,2). Signature on a PairRDD: reduceByKey(func: (V, V) => V): RDD[(K, V)]
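A plain-Python sketch of the same semantics: group the pairs by key, then fold each group with the (x, y) => x + y function from the slide (toy data):

```python
# reduceByKey sketch: group (key, value) pairs by key, then combine
# each group's values pairwise with the reduce function.
from functools import reduce

pairs = [('and', 1), ('wax', 1), ('and', 1), ('wax', 1), ('and', 1)]

grouped = {}
for k, v in pairs:
    grouped.setdefault(k, []).append(v)

counts = {k: reduce(lambda x, y: x + y, vs) for k, vs in grouped.items()}
print(counts)  # {'and': 3, 'wax': 2}
```

In Spark the grouping happens during the shuffle, and partial sums are combined on the map side before crossing the network.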
Example: Spark Word Count in Scala (lineage, step 5)
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey{ case (x, y) => x + y }
val countArray = counts.collect()
collect is an action: it returns the results to the driver as an Array.
Demo: Interactive Shell • Iterative development – Cache those RDDs – Open the shell and ask questions • We have all wished we could do this with MapReduce – Compile/save your code for scheduled jobs later • Scala: spark-shell • Python: pyspark
Components of Execution
Spark: RDD DAG → physical execution plan. The RDD graph for the word count — sc.textFile(…) (HadoopRDD) → flatMap (MapPartitionsRDD) → map (MapPartitionsRDD) → reduceByKey (ShuffledRDD) → collect — is compiled into a physical plan of stages: Stage 1 (the narrow transformations) and Stage 2 (after the shuffle).
Physical execution plan → stages and tasks. Each stage (Stage 1, Stage 2) is split into a task set, one task per RDD partition; the task scheduler ships each task to an executor thread on a worker node, placed near the HDFS blocks its partition reads, with partitions held in the block cache.
Summary of Components • Task: unit of execution • Stage: group of tasks – Based on partitions of the RDD – Tasks run in parallel • DAG: logical graph of RDD operations • RDD: parallel dataset with partitions
How a Spark Application Runs on a Hadoop Cluster. The driver program on the client node creates the SparkContext (sc = new SparkContext; rDD = sc.textFile("hdfs://…"); rDD.map); the SparkContext requests resources from the YARN Resource Manager (coordinated through ZooKeeper); YARN Node Managers start executors on the worker nodes; each executor runs tasks in threads against cached RDD partitions built from local HDFS blocks.
Deploying Spark – Cluster Manager Types • Standalone mode • Mesos • YARN • EC2 • GCE
RDD Transformations and Actions. Transformations define a new RDD: map, filter, sample, union, groupByKey, reduceByKey, join, cache, … Actions return a value to the driver: reduce, collect, count, save, lookupKey, …
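Transformations are lazy: they only describe a computation, and nothing runs until an action asks for a value. Python generators give the same flavor in miniature (a sketch with invented log lines, not Spark itself):

```python
# Lazy "transformation" vs eager "action", sketched with a generator.
log = ['INFO start', 'ERROR disk full', 'ERROR timeout', 'INFO done']

# "Transformation": builds a recipe; no line has been examined yet
errors = (line for line in log if line.startswith('ERROR'))

# "Action": forces evaluation and returns a value to the caller
count = sum(1 for _ in errors)
print(count)  # 2
```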
MapR Tutorial: Getting Started with Spark on the MapR Sandbox • https://www.mapr.com/products/mapr-sandbox-hadoop/tutorials/spark-tutorial
MapR Blog: Getting Started with the Spark Web UI • https://www.mapr.com/blog/getting-started-spark-web-ui
Example: Log Mining
Example: Log Mining – load error messages from a log into memory, then interactively search for various patterns. (Based on slides from Pat McDonough.)
Setup: the driver program coordinates three workers.
lines = spark.textFile("hdfs://...")
lines = spark.textFile("hdfs://...") creates the base RDD.
errors = lines.filter(lambda s: s.startswith("ERROR"))
errors is a transformed RDD.
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
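The logic of the log-mining query can be sketched in plain Python: keep the ERROR lines, pull out the tab-separated message field, and count matches for a term (the log lines here are invented for illustration):

```python
# Log-mining sketch: filter -> map -> cache -> filter/count, in plain Python.
log = ['ERROR\t10:01\tmysql connection refused',
       'INFO\t10:02\tstartup complete',
       'ERROR\t10:03\tphp fatal error',
       'ERROR\t10:04\tmysql timeout']

errors = [s for s in log if s.startswith('ERROR')]
# field 2 of each tab-separated line; Spark would keep this list in memory
# across queries via messages.cache()
messages = [s.split('\t')[2] for s in errors]
mysql_count = sum(1 for m in messages if 'mysql' in m)
print(mysql_count)  # 2
```

The point of cache() in the Spark version is that the second query ("php") reuses the in-memory messages instead of re-reading HDFS.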
count() is an action: it triggers execution.
The file's HDFS blocks (Block 1, Block 2, Block 3) sit on the three worker nodes.
The driver ships tasks to each worker.
Each worker reads its local HDFS block.
Each worker processes its data and caches its messages partition (Cache 1, Cache 2, Cache 3).
Workers send their partial results back to the driver.
A second query – messages.filter(lambda s: "php" in s).count() – can now reuse the cache.
The driver ships tasks for the new query.
Workers process from cache – no HDFS read this time.
Results return to the driver.
Cache your data → faster results.
DataFrames
Scenario: Online Auction Data. Columns: AuctionID, Bid, Bid Time, Bidder, Bidder Rate, Open Bid, Price, Item, Days To Live. Sample rows: 8213034705, 95, 2.927373, jake7870, 0, 95, 117.5, xbox, 3 | 8213034705, 115, 2.943484, davidbresler2, 1, 95, 117.5, xbox, 3 | 8213034705, 100, 2.951285, gladimacowgirl, 58, 95, 117.5, xbox, 3 | 8213034705, 117.5, 2.998947, daysrus, 95, 95, 117.5, xbox, 3
Creating DataFrames
// Define the schema using a case class
case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String, bidderrate: Int, openbid: Float, price: Float, item: String, daystolive: Int)
Creating DataFrames
// Create the RDD
val auctionRDD = sc.textFile("/user/user01/data/ebay.csv").map(_.split(","))
// Map the data to the Auction class
val auctions = auctionRDD.map(a => Auction(a(0), a(1).toFloat, a(2).toFloat, a(3), a(4).toInt, a(5).toFloat, a(6).toFloat, a(7), a(8).toInt))
Creating DataFrames
// Convert the RDD to a DataFrame
val auctionsDF = auctions.toDF()
// To see the data
auctionsDF.show()
Output (first rows): auctionid | bid | bidtime | bidder | bidderrate | openbid | price | item | daystolive — 8213034705 | 95.0 | 2.927373 | jake7870 | 0 | 95.0 | 117.5 | xbox | 3; 8213034705 | 115.0 | 2.943484 | davidbresler2 | 1 | 95.0 | 117.5 | xbox | 3; …
Inspecting Data: Examples
// To see the schema
auctionsDF.printSchema()
root
|-- auctionid: string (nullable = true)
|-- bid: float (nullable = false)
|-- bidtime: float (nullable = false)
|-- bidder: string (nullable = true)
|-- bidderrate: integer (nullable = true)
|-- openbid: float (nullable = false)
|-- price: float (nullable = false)
|-- item: string (nullable = true)
|-- daystolive: integer (nullable = true)
Inspecting Data: Examples
// Total number of bids
val totbids = auctionsDF.count()
// How many bids per auction and item?
auctionsDF.groupBy("auctionid", "item").count.show
Output: auctionid | item | count — 3016429446 palm 10; 8211851222 xbox 28; 3014480955 palm 12; 8214279576 xbox 4; 3014844871 palm 18; 3014012355 palm 35; 1641457876 cartier 2
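The groupBy-then-count can be sketched in plain Python with a Counter over (auctionid, item) keys; the rows here are invented, with the column meaning taken from the slide's schema:

```python
# groupBy("auctionid", "item").count sketch: count rows per composite key.
from collections import Counter

rows = [('8213034705', 'xbox'), ('8213034705', 'xbox'),
        ('3016429446', 'palm'), ('8213034705', 'xbox')]

bids_per_item = Counter(rows)
print(bids_per_item[('8213034705', 'xbox')])  # 3
```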
Querying with SQL
// Register the DataFrame as a table
auctionsDF.registerTempTable("auction")
val results = sqlContext.sql("SELECT auctionid, MAX(price) AS maxprice FROM auction GROUP BY item, auctionid")
results.show()
Output: auctionid | maxprice — 3019326300 207.5; 8213060420 120.0; …
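The GROUP BY / MAX query amounts to keeping a running maximum per key; a plain-Python sketch with invented bids:

```python
# SELECT auctionid, MAX(price) ... GROUP BY auctionid, sketched as a
# dict of running maxima.
bids = [('3019326300', 120.0), ('3019326300', 207.5),
        ('8213060420', 120.0), ('8213060420', 95.0)]

max_price = {}
for auctionid, price in bids:
    if auctionid not in max_price or price > max_price[auctionid]:
        max_price[auctionid] = price

print(max_price['3019326300'])  # 207.5
```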
The Physical Plan for DataFrames
DataFrame Execution Plan
// Print the physical plan to the console
auctionsDF.select("auctionid").distinct.explain()
== Physical Plan ==
Distinct false
Exchange (HashPartitioning [auctionid#0], 200)
Distinct true
Project [auctionid#0]
PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data • https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data
Machine Learning
Collaborative Filtering with Spark • Recommend items (the "filtering") • Based on user preference data (the "collaborative")
Alternating Least Squares • Approximates the sparse user-item rating matrix • as the product of two dense matrices: a user factor matrix and an item factor matrix
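The core idea can be shown with a minimal rank-1 ALS sketch in plain Python: fix one factor vector, solve least squares for the other in closed form, and alternate. This is a toy on a dense 2×2 matrix; MLlib's ALS additionally handles sparsity, regularization, and higher rank:

```python
# Minimal rank-1 alternating-least-squares sketch: approximate R as an
# outer product u * v by alternately solving for u and v in closed form.
R = [[1.0, 2.0],
     [2.0, 4.0]]          # exactly rank 1, so the fit can be near-perfect

u = [1.0, 1.0]            # "user" factors
v = [1.0, 1.0]            # "item" factors

for _ in range(10):
    # fix v, least-squares update for each u[i]
    u = [sum(R[i][j] * v[j] for j in range(2)) / sum(x * x for x in v)
         for i in range(2)]
    # fix u, least-squares update for each v[j]
    v = [sum(R[i][j] * u[i] for i in range(2)) / sum(x * x for x in u)
         for j in range(2)]

error = sum((R[i][j] - u[i] * v[j]) ** 2 for i in range(2) for j in range(2))
```

For a rank-1 matrix the reconstruction error drops to (numerically) zero; on real, sparse rating data the factors instead capture latent taste dimensions.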
Parse Input
// parse input of the form UserID::MovieID::Rating
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
// create an RDD of Rating objects
val ratingsRDD = ratingText.map(parseRating).cache()
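The same parser in Python, for comparison (a sketch mirroring the Scala parseRating above; the sample line is invented):

```python
# Parse "UserID::MovieID::Rating" into a (user, movie, rating) tuple.
def parse_rating(line):
    user, movie, rating = line.split('::')
    return int(user), int(movie), float(rating)

r = parse_rating('4169::1193::5.0')
print(r)  # (4169, 1193, 5.0)
```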
ML Cross-Validation Process: split the data into a training set and a test set; build the model on the training set; test the model on the test set; loop train → test.
Create Model
// Randomly split the ratings RDD into a training RDD (80%) and a test RDD (20%)
val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L)
val trainingRatingsRDD = splits(0).cache()
val testRatingsRDD = splits(1).cache()
// build an ALS user-product matrix model with rank=20, iterations=10
val model = new ALS().setRank(20).setIterations(10).run(trainingRatingsRDD)
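A seeded shuffle-and-cut gives the flavor of the 80/20 split in plain Python. Note this is only an approximation of randomSplit: Spark's split is per-element and probabilistic, so the real subset sizes vary slightly around 80/20:

```python
# 80/20 train/test split sketch with a fixed seed, like the 0L in the
# Scala randomSplit call.
import random

ratings = list(range(100))      # stand-in for 100 rating records
rng = random.Random(0)
shuffled = ratings[:]
rng.shuffle(shuffled)

cut = int(len(shuffled) * 0.8)
training, test = shuffled[:cut], shuffled[cut:]
print(len(training), len(test))  # 80 20
```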
Get Predictions
// Get the top 5 movie predictions for user 4169
val topRecsForUser = model.recommendProducts(4169, 5)
// get (user, product) pairs from the test data
val testUserProductRDD = testRatingsRDD.map {
  case Rating(user, product, rating) => (user, product)
}
// get predictions for the (user, product) pairs
val predictionsForTestRDD = model.predict(testUserProductRDD)
Test Model
// prepare predictions for comparison
val predictionsKeyedByUserProductRDD = predictionsForTestRDD.map {
  case Rating(user, product, rating) => ((user, product), rating)
}
// prepare test data for comparison
val testKeyedByUserProductRDD = testRatingsRDD.map {
  case Rating(user, product, rating) => ((user, product), rating)
}
// Join the test ratings with the predictions
val testAndPredictionsJoinedRDD = testKeyedByUserProductRDD.join(predictionsKeyedByUserProductRDD)
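The keyed join pairs each actual rating with its prediction on the (user, product) key; a dict-based sketch with invented ratings:

```python
# Join actual and predicted ratings on the (user, product) key, producing
# (actual, predicted) pairs as in the Scala join above.
test = {(1, 10): 5.0, (1, 20): 1.0}
predictions = {(1, 10): 4.2, (1, 20): 2.1}

joined = {k: (test[k], predictions[k]) for k in test if k in predictions}
print(joined[(1, 10)])  # (5.0, 4.2)
```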
Test Model
// find predictions that badly overshoot a low actual rating
val falsePositives = testAndPredictionsJoinedRDD.filter {
  case ((user, product), (ratingT, ratingP)) => ratingT <= 1 && ratingP >= 4
}
falsePositives.take(2)
// Array[((Int, Int), (Double, Double))] = ((3842,2858),(1.0,4.106488210964762)), ((6031,3194),(1.0,4.790778049100913))
Test Model
// Evaluate the model using Mean Absolute Error (MAE) between test ratings and predictions
val meanAbsoluteError = testAndPredictionsJoinedRDD.map {
  case ((user, product), (testRating, predRating)) => Math.abs(testRating - predRating)
}.mean()
// meanAbsoluteError: Double = 0.7244940545944053
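MAE is just the average absolute gap over the joined (actual, predicted) pairs; in plain Python, with invented numbers:

```python
# Mean Absolute Error over joined (actual, predicted) rating pairs.
joined = [(5.0, 4.5), (1.0, 2.0), (3.0, 3.0), (4.0, 3.5)]

mae = sum(abs(actual - pred) for actual, pred in joined) / len(joined)
print(mae)  # 0.5
```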
Soon to Come • Spark On Demand Training – https://www.mapr.com/services/mapr-academy/ • Blogs and tutorials: – Movie recommendations with collaborative filtering – Spark Streaming
Machine Learning Blog • https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark
Examples and Resources
Coming Soon: Free Spark On Demand Training • https://www.mapr.com/services/mapr-academy/
Find this presentation and other related resources here: http://events.mapr.com/JacksonvilleJava (you can find this link on the event's page at meetup.com) – today's presentation, whiteboard & demo videos, free on-demand training, free eBooks, free Hadoop sandbox, and more…
Spark on MapR • Certified Spark distribution • Fully supported and packaged by MapR in partnership with Databricks – mapr-spark package with Spark, Shark, and Spark Streaming today – PySpark, GraphX, and MLlib soon • YARN integration – Spark can allocate resources from the cluster when needed
References • Spark web site: http://spark.apache.org/ • Databricks: https://databricks.com/ • Spark on MapR: http://www.mapr.com/products/apache-spark • Spark SQL and DataFrame Guide • Apache Spark vs. MapReduce – Whiteboard Walkthrough • Learning Spark – O'Reilly book
Q&A – Engage with us! @mapr | maprtech | kbotzum@mapr.com | MapR | maprtech | mapr-technologies