
Unlocking Your Hadoop Data with Apache Spark and CDH5

The Seattle Spark/Mesos Meetup group shares the presentation from its recent event, showcasing real-world implementations of Spark within the context of your big data infrastructure.

The session is demo-heavy and slide-light, focusing on getting your development environment up and running, including configuration issues, SparkSQL vs. Hive, and more.

To learn more about the Seattle meetup: http://www.meetup.com/Seattle-Spark-Meetup/members/21698691/


Unlocking Your Hadoop Data with Apache Spark and CDH5

  1. Seattle Spark Meetup
  2. Next Sessions
     • Currently planning
       – New sessions
         • Joint Seattle Spark Meetup and Seattle Mesos User Group meeting
         • Joint Seattle Spark Meetup and Cassandra meeting
         • Mobile marketing scenario with Tune
       – Repeats requested for:
         • Deep Dive into Spark, Tachyon, and Mesos Internals (Eastside)
         • Spark at eBay: Troubleshooting the everyday issues (Seattle)
         • This session (Eastside)
  3. Unlocking your Hadoop data with Apache Spark and CDH5
     Denny Lee, Steven Hastings
     Data Sciences Engineering
  4. Purpose
     • Showcasing Apache Spark scenarios within your Hadoop cluster
     • Utilizing some Concur Data Sciences Engineering scenarios
     • Special guests:
       – Tableau Software: showcasing Tableau connecting to Apache Spark
       – Cloudera: how to configure and deploy Spark on YARN via Cloudera Manager
  5. Agenda
     • Configuring and deploying Spark on YARN with Cloudera Manager
     • Quick primer on our expense receipt scenario
     • Connecting to Spark on your CDH5.1 cluster
     • Quick demos
       – Pig vs. Hive
       – SparkSQL
     • Tableau connecting to SparkSQL
     • Deep dive demo
       – MLlib: SVD
  6. Take Picture of Receipt
     Sample OCR output from the receipt image (recognition errors preserved):
     601 108th Ave. NE Uel lcvue, WA 98004 Chantanee Thai Restaurant & Bar' www.chantanee.com TABLE: B 6 - 2 Guests Your Server was Jerry 4/14/2014 1;14:02 PM Sequence #0000052 ID #0281727 Subtotal $40-00 Tota] Taxes $380 Grand Tota1 $43-80 Credit Purchase Name BC Type : Amex 00 Num : xxxx xxxx xxxx 2000 Approval : 544882 Server : Jerry Ticket Name : B 6 15% $6.00 I agree to pay the amount shown above. Visit us i Payment Amount: 2.5% .15 1 O .00 n BotheH-Pen Thai
  7. Help Choose Expense Type
  8. Gateway Node
     Hadoop gateway node:
     - Can connect to HDFS
     - Can execute Hadoop jobs on the cluster
     - Can execute Spark on the cluster, OR locally for ad hoc work
     - Can set up multiple VMs
  9. Connecting…
     spark-shell --master spark://$masternode:7077 --executor-memory 1G --total-executor-cores 16

     --master: specifies the master node; if using a gateway node, you can instead run locally to test
     --executor-memory: limits the amount of memory you use; otherwise you'll use as much as you can (you should set defaults)
     --total-executor-cores: limits the number of cores you use; otherwise you'll use as many as you can (you should set defaults)
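     The slide's advice to set defaults can go in spark-defaults.conf; a minimal sketch (the property names are standard Spark configuration, but the values here are illustrative, not from the deck):

         # Illustrative defaults so an ad hoc shell doesn't grab the whole cluster
         spark.executor.memory    1g
         spark.cores.max          16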
  10. Connecting… and resources
  11. RowCount: Pig
      A = LOAD '/user/hive/warehouse/dennyl.db/sample_ocr/000000_0' USING TextLoader as (line:chararray);
      B = group A all;
      C = foreach B generate COUNT(A);
      dump C;
  12. RowCount: Spark
      val ocrtxt = sc.textFile("/user/hive/warehouse/dennyl.db/sample_ocr/000000_0")
      ocrtxt.count
  13. RowCount: Pig vs. Spark
      Row count against 1+ million categorized receipts:

      Query   Pig       Spark
      1       0:00:41   0:00:02
      2       0:00:42   0:00:00.5
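      A plausible reason the second Spark run drops to half a second is in-memory caching, though the deck doesn't show an explicit call; a minimal sketch, assuming the ocrtxt RDD from slide 12:

          ocrtxt.cache()   // mark the RDD to be kept in memory once computed
          ocrtxt.count     // run 1: reads from HDFS and populates the cache
          ocrtxt.count     // run 2: served from memory, hence the sub-second time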
  14. RowCount: Spark Stages
  15. WordCount: Pig
      -- A is the relation loaded on slide 11
      B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
      C = group B by word;
      D = foreach C generate COUNT(B), group;
      E = group D all;
      F = foreach E generate COUNT(D);
      dump F;
  16. WordCount: Spark
      val wc = ocrtxt.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
      wc.count
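      As a usage sketch beyond what the deck shows, the same pair RDD can surface the most frequent words:

          // Swap to (count, word), sort descending by count, and print the top 10
          wc.map { case (word, count) => (count, word) }
            .sortByKey(ascending = false)
            .take(10)
            .foreach(println)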
  17. WordCount: Pig vs. Spark
      Word count against 1+ million categorized receipts:

      Query   Pig       Spark
      1       0:02:09   0:00:38
      2       0:02:07   0:00:02
  18. SparkSQL: Querying
      // Utilize SQLContext
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      import sqlContext.createSchemaRDD

      // Configure the case class
      case class ocrdata(company_code: String, user_id: Long, date: Long,
        category_desc: String, category_key: String, legacy_key: String,
        legacy_desc: String, vendor: String, amount: Double)
  19. SparkSQL: Querying (2)
      // Extract data
      val ocr = sc.textFile("/$HDFS_Location")
        .map(_.split("\t"))
        .map(m => ocrdata(m(0), m(1).toLong, m(2).toLong, m(3), m(4),
          m(5), m(6), m(7), m(8).toDouble))

      // For Spark 1.0.2
      ocr.registerAsTable("ocr")
      // For Spark 1.1.0+
      ocr.registerTempTable("ocr")
  20. SparkSQL: Querying (3)
      // Write a SQL statement
      val blah = sqlContext.sql("SELECT company_code, user_id FROM ocr")

      // Show the first 10 rows
      blah.map(a => a(0) + ", " + a(1)).collect().take(10).foreach(println)
  21. Oops!
      14/11/15 09:55:35 ERROR scheduler.TaskSetManager: Task 16.0:0 failed 4 times; aborting job
      14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Cancelling stage 16
      14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Stage 16 was cancelled
      14/11/15 09:55:35 INFO scheduler.DAGScheduler: Failed to run collect at <console>:22
      14/11/15 09:55:35 WARN scheduler.TaskSetManager: Task 136 was killed.
      14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all completed, from pool
      org.apache.spark.SparkException: Job aborted due to stage failure: Task 16.0:0 failed 4 times, most recent failure: Exception failure in TID 135 on host $server$: java.lang.NumberFormatException: For input string: "N"
          sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
          java.lang.Double.parseDouble(Double.java:540)
          scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
  22. Let's find the error
      // Incorrect configuration; let me try to find "N"
      val errors = ocrtxt.filter(line => line.contains("N"))

      // Error count (71053 lines)
      errors.count

      // Look at some of the data
      errors.take(10).foreach(println)
  23. Solution
      // Issue: the [amount] field contains \N, the NULL marker generated by Hive

      // Case class (original)
      case class ocrdata(company_code: String, user_id: Long, ... amount: Double)

      // Case class (fixed)
      case class ocrdata(company_code: String, user_id: Long, ... amount: String)
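      An alternative sketch (an assumption, not shown in the deck) keeps amount numeric but parses defensively, so Hive's \N marker becomes a missing value instead of an exception:

          import scala.util.Try

          // Hive text exports encode NULL as "\N"; treat it, and anything else
          // that fails to parse, as None rather than throwing
          def parseAmount(s: String): Option[Double] = Try(s.toDouble).toOption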
  24. Re-running the Query
      14/11/16 18:43:32 INFO scheduler.DAGScheduler: Stage 10 (collect at <console>:22) finished in 7.249 s
      14/11/16 18:43:32 INFO spark.SparkContext: Job finished: collect at <console>:22, took 7.268298566 s
      -1978639384, 20156192
      -1978639384, 20164613
      542292324, 20131109
      -598558319, 20128132
      1369654093, 20130970
      -1351048937, 20130846
  25. SparkSQL: By Category (Count)
      // Query
      val blah = sqlContext.sql("SELECT category_desc, COUNT(1) FROM ocr GROUP BY category_desc")
      blah.map(a => a(0) + ", " + a(1)).collect().take(10).foreach(println)

      // Results
      14/11/16 18:46:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
      14/11/16 18:46:12 INFO spark.SparkContext: Job finished: collect at <console>:22, took 4.275620339 s
      Category 1, 25
      Category 2, 97
      Category 3, 37
  26. SparkSQL: via Sum(Amount)
      // Query
      val blah = sqlContext.sql("SELECT category_desc, sum(amount) FROM ocr GROUP BY category_desc")
      blah.map(a => a(0) + ", " + a(1)).collect().take(10).foreach(println)

      // Results
      14/11/16 18:46:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
      14/11/16 18:46:12 INFO spark.SparkContext: Job finished: collect at <console>:22, took 4.275620339 s
      Category 1, 2000
      Category 2, 10
      Category 3, 1800
  27. Connecting Tableau to SparkSQL
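      The Tableau connection was shown live rather than on slides. One common route (an assumption here, not confirmed by the deck) is Spark SQL's Thrift JDBC/ODBC server, which ships with Spark 1.1+ and which Tableau can reach through a Hive-compatible ODBC driver:

          ./sbin/start-thriftserver.sh --master spark://$masternode:7077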
  28. Diving into MLlib / SVD for the Expense Receipt Scenario
  29. Overview
      • Re-intro to Expense Receipt Prediction
      • SVD
        – What is it?
        – Why do I care?
      • Demo
        – Basics (get the data)
        – Data wrangling
        – Compute SVD
  30. Expense Receipts (Problems)
      • We want to guess the expense type based on the words on the receipt
      • The receipt × word matrix is sparse
      • Some words are likely to be found together
      • Some words are actually the same word
  31. SVD (Singular Value Decomposition)
      • Been around a while
        – But still popular and useful
      • Matrix factorization
        – Intuition: rotate your view of the data
      • Data can be well approximated by fewer features
        – And you can get an idea of how good the approximation is
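      For reference, the factorization behind the slide: for an $m \times n$ matrix $A$,

          $$ A \approx A_k = U_k \Sigma_k V_k^{\top} $$

      where $U_k$ ($m \times k$) and $V_k$ ($n \times k$) have orthonormal columns and $\Sigma_k$ is diagonal, holding the top $k$ singular values $\sigma_1 \ge \dots \ge \sigma_k \ge 0$. $A_k$ is the best rank-$k$ approximation of $A$ in the Frobenius norm, and the size of the discarded singular values tells you how good the approximation is.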
  32. Demo: Overview
      Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD
  33. Basics
      val rawlines = sc.textFile("/user/stevenh/subocr/subset.dat")
      val ocrRecords = rawlines map { rawline =>
        rawline.split("\t")
      } filter { line =>
        line.length == 10 && line(8) != ""
      } zipWithIndex() map { case (lineItems, lineIdx) =>
        OcrRecord(lineIdx, lineItems(0), lineItems(1).toLong, lineItems(4), lineItems(8).toDouble, lineItems(9))
      }

      zipWithIndex() lets you give each record in your RDD an incrementing integer index.
  34. Tokenize Records
      val splitRegex = new scala.util.matching.Regex("\r\n")
      val wordRegex = new scala.util.matching.Regex("[a-zA-Z0-9_]+")
      val recordWords = ocrRecords flatMap { rec =>
        val s1 = splitRegex.replaceAllIn(rec.ocrText, "")
        val s2 = wordRegex.findAllIn(s1)
        for { s <- s2 } yield (rec.recIdx, s.toLowerCase)
      }

      Keep track of which record each word came from.
  35. Demo: Overview
      Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD
  36. Group data by Record and Word
      // Count occurrences of each (recIdx, word) pair
      val wordCounts = recordWords groupBy { t => t }
      // Re-key by word: (word, (recIdx, word, count))
      val wordsByRecord = wordCounts map { gr =>
        (gr._1._2, (gr._1._1, gr._1._2, gr._2.size))
      }
      // Assign each distinct word an integer index: (word, wordIdx)
      val uniqueWords = wordsByRecord groupBy { t => t._2._2 } zipWithIndex() map { gr =>
        (gr._1._1, gr._2)
      }
  37. Join Record, Word Data
      // join combines 2-tuple RDDs on the first value of the tuple
      val preJoined = wordsByRecord join uniqueWords
      val joined = preJoined map { pj =>
        RecordWord(pj._2._1._1, pj._2._2, pj._2._1._2, pj._2._1._3.toDouble)
      }

      Now we have data for each non-zero word/record combo.
  38. Demo: Overview
      Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD
  39. Generate Word x Record Matrix
      val ncols = ocrRecords.count().toInt
      val nrows = uniqueWords.count().toLong
      // Note: RDD[Vector] here is a Spark Vector, not a Scala Vector
      val vectors: RDD[Vector] = joined groupBy { t => t.wordIdx } map { gr =>
        val indices = for { x <- gr._2 } yield x.recIdx.toInt
        val data = for { x <- gr._2 } yield x.n
        new SparseVector(ncols, indices.toArray, data.toArray)
      }
      val rowMatrix = new RowMatrix(vectors, nrows, ncols)
  40. Demo: Overview
      Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD
  41. Compute SVD
      val svd = rowMatrix.computeSVD(5, computeU = true)

      • Ironically, in Spark v1.0 computeSVD is limited by an operation which must complete on a single node…
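      A brief usage sketch of what computeSVD returns (U, s, and V are the fields of MLlib's SingularValueDecomposition result):

          // Singular values, largest first; how quickly they decay suggests
          // how well a rank-5 approximation captures the data
          println(svd.s)

          // Left singular vectors as a distributed RowMatrix (available because computeU = true)
          val U = svd.U
          // Right singular vectors as a local Matrix
          val V = svd.V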
  42. Spark / SVD References
      • Distributing the Singular Value Decomposition with Spark
        – Spark-SVD gist
        – Twitter / Databricks blog post
      • Spotting Topics with the Singular Value Decomposition
  43. Now do something with the data!
  44. Upcoming Conferences
      • Strata + Hadoop World
        – http://strataconf.com/big-data-conference-ca-2015/public/content/home
        – San Jose, Feb 17-20, 2015
      • Spark Summit East
        – http://spark-summit.org/east
        – New York, March 18-19, 2015
      • Ask for a copy of "Learning Spark"
        – http://www.oreilly.com/pub/get/spark
