
Unlocking Your Hadoop Data with Apache Spark and CDH5

The Seattle Spark/Mesos Meetup group shares the presentation from its recent event, showcasing real-world implementations of Spark within the context of your big data infrastructure.

The session is demo-heavy and slide-light, focusing on getting your development environment up and running, including configuration issues, SparkSQL vs. Hive, and more.

To learn more about the Seattle meetup: http://www.meetup.com/Seattle-Spark-Meetup/members/21698691/


Unlocking Your Hadoop Data with Apache Spark and CDH5

  1. Seattle Spark Meetup
  2. Next Sessions
     • Currently planning
       – New sessions
         • Joint Seattle Spark Meetup and Seattle Mesos User Group meeting
         • Joint Seattle Spark Meetup and Cassandra meeting
         • Mobile marketing scenario with Tune
       – Repeats requested for:
         • Deep Dive into Spark, Tachyon, and Mesos Internals (Eastside)
         • Spark at eBay: Troubleshooting the everyday issues (Seattle)
         • This session (Eastside)
  3. Unlocking your Hadoop data with Apache Spark and CDH5
     Denny Lee, Steven Hastings
     Data Sciences Engineering
  4. Purpose
     • Showcasing Apache Spark scenarios within your Hadoop cluster
     • Utilizing some Concur Data Sciences Engineering scenarios
     • Special guests:
       – Tableau Software: showcasing Tableau connecting to Apache Spark
       – Cloudera: how to configure and deploy Spark on YARN via Cloudera Manager
  5. Agenda
     • Configuring and deploying Spark on YARN with Cloudera Manager
     • Quick primer on our expense receipt scenario
     • Connecting to Spark on your CDH5.1 cluster
     • Quick demos
       – Pig vs. Hive
       – SparkSQL
     • Tableau connecting to SparkSQL
     • Deep dive demo
       – MLlib: SVD
  6. Take Picture of Receipt
     Sample OCR output from the receipt image (recognition errors preserved):
     601 108th Ave. NE Uel lcvue, WA 98004 Chantanee Thai Restaurant & Bar' www.chantanee.com TABLE: B 6 - 2 Guests Your Server was Jerry 4/14/2014 1;14:02 PM Sequence #0000052 ID #0281727 Subtotal $40-00 Tota] Taxes $380 Grand Tota1 $43-80 Credit Purchase Name BC Type : Amex 00 Num : xxxx xxxx xxxx 2000 Approval : 544882 Server : Jerry Ticket Name : B 6 15% $6.00 I agree to pay the amount shown above. Visit us i Payment Amount: 2.5% .15 1 O .00 n BotheH-Pen Thai
  7. Help Choose Expense Type
  8. Gateway Node
     Hadoop gateway node:
     - Can connect to HDFS
     - Can execute Hadoop jobs on the cluster
     - Can execute Spark on the cluster, OR locally for ad hoc work
     - Can set up multiple VMs
  9. Connecting…
     spark-shell --master spark://$masternode:7077 --executor-memory 1G --total-executor-cores 16

     --master: specifies the master node; if using a gateway node, you can instead run locally to test
     --executor-memory: limits the amount of memory you use; otherwise you'll use as much as you can (you should set defaults)
     --total-executor-cores: limits the number of cores you use; otherwise you'll use as many as you can (you should set defaults)
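     The slide's advice to set defaults can go in spark-defaults.conf; a minimal sketch (the property names are standard Spark configuration, but the values here are illustrative, not from the deck):

         # Illustrative defaults so an ad hoc shell doesn't grab the whole cluster
         spark.executor.memory    1g
         spark.cores.max          16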
  10. Connecting… and resources
  11. RowCount: Pig
      A = LOAD '/user/hive/warehouse/dennyl.db/sample_ocr/000000_0' USING TextLoader as (line:chararray);
      B = group A all;
      C = foreach B generate COUNT(A);
      dump C;
  12. RowCount: Spark
      val ocrtxt = sc.textFile("/user/hive/warehouse/dennyl.db/sample_ocr/000000_0")
      ocrtxt.count
  13. RowCount: Pig vs. Spark
      Row count against 1+ million categorized receipts:

      Query   Pig       Spark
      1       0:00:41   0:00:02
      2       0:00:42   0:00:00.5
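      A plausible reason the second Spark run drops to half a second is in-memory caching, though the deck doesn't show an explicit call; a minimal sketch, assuming the ocrtxt RDD from slide 12:

          ocrtxt.cache()   // mark the RDD to be kept in memory once computed
          ocrtxt.count     // run 1: reads from HDFS and populates the cache
          ocrtxt.count     // run 2: served from memory, hence the sub-second time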
  14. RowCount: Spark Stages
  15. WordCount: Pig
      -- A is the relation loaded on slide 11
      B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
      C = group B by word;
      D = foreach C generate COUNT(B), group;
      E = group D all;
      F = foreach E generate COUNT(D);
      dump F;
  16. WordCount: Spark
      val wc = ocrtxt.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
      wc.count
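      As a usage sketch beyond what the deck shows, the same pair RDD can surface the most frequent words:

          // Swap to (count, word), sort descending by count, and print the top 10
          wc.map { case (word, count) => (count, word) }
            .sortByKey(ascending = false)
            .take(10)
            .foreach(println)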
  17. WordCount: Pig vs. Spark
      Word count against 1+ million categorized receipts:

      Query   Pig       Spark
      1       0:02:09   0:00:38
      2       0:02:07   0:00:02
  18. SparkSQL: Querying
      // Utilize SQLContext
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      import sqlContext.createSchemaRDD

      // Configure the case class
      case class ocrdata(company_code: String, user_id: Long, date: Long,
        category_desc: String, category_key: String, legacy_key: String,
        legacy_desc: String, vendor: String, amount: Double)
  19. SparkSQL: Querying (2)
      // Extract data
      val ocr = sc.textFile("/$HDFS_Location")
        .map(_.split("\t"))
        .map(m => ocrdata(m(0), m(1).toLong, m(2).toLong, m(3), m(4),
          m(5), m(6), m(7), m(8).toDouble))

      // For Spark 1.0.2
      ocr.registerAsTable("ocr")
      // For Spark 1.1.0+
      ocr.registerTempTable("ocr")
  20. SparkSQL: Querying (3)
      // Write a SQL statement
      val blah = sqlContext.sql("SELECT company_code, user_id FROM ocr")

      // Show the first 10 rows
      blah.map(a => a(0) + ", " + a(1)).collect().take(10).foreach(println)
  21. Oops!
      14/11/15 09:55:35 ERROR scheduler.TaskSetManager: Task 16.0:0 failed 4 times; aborting job
      14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Cancelling stage 16
      14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Stage 16 was cancelled
      14/11/15 09:55:35 INFO scheduler.DAGScheduler: Failed to run collect at <console>:22
      14/11/15 09:55:35 WARN scheduler.TaskSetManager: Task 136 was killed.
      14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all completed, from pool
      org.apache.spark.SparkException: Job aborted due to stage failure: Task 16.0:0 failed 4 times, most recent failure: Exception failure in TID 135 on host $server$: java.lang.NumberFormatException: For input string: "N"
          sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
          java.lang.Double.parseDouble(Double.java:540)
          scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
  22. Let's find the error
      // Incorrect configuration; let me try to find "N"
      val errors = ocrtxt.filter(line => line.contains("N"))

      // Error count (71053 lines)
      errors.count

      // Look at some of the data
      errors.take(10).foreach(println)
  23. Solution
      // Issue: the [amount] field contains \N, the NULL marker generated by Hive

      // Case class (original)
      case class ocrdata(company_code: String, user_id: Long, ... amount: Double)

      // Case class (fixed)
      case class ocrdata(company_code: String, user_id: Long, ... amount: String)
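      An alternative sketch (an assumption, not shown in the deck) keeps amount numeric but parses defensively, so Hive's \N marker becomes a missing value instead of an exception:

          import scala.util.Try

          // Hive text exports encode NULL as "\N"; treat it, and anything else
          // that fails to parse, as None rather than throwing
          def parseAmount(s: String): Option[Double] = Try(s.toDouble).toOption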
  24. Re-running the Query
      14/11/16 18:43:32 INFO scheduler.DAGScheduler: Stage 10 (collect at <console>:22) finished in 7.249 s
      14/11/16 18:43:32 INFO spark.SparkContext: Job finished: collect at <console>:22, took 7.268298566 s
      -1978639384, 20156192
      -1978639384, 20164613
      542292324, 20131109
      -598558319, 20128132
      1369654093, 20130970
      -1351048937, 20130846
  25. SparkSQL: By Category (Count)
      // Query
      val blah = sqlContext.sql("SELECT category_desc, COUNT(1) FROM ocr GROUP BY category_desc")
      blah.map(a => a(0) + ", " + a(1)).collect().take(10).foreach(println)

      // Results
      14/11/16 18:46:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
      14/11/16 18:46:12 INFO spark.SparkContext: Job finished: collect at <console>:22, took 4.275620339 s
      Category 1, 25
      Category 2, 97
      Category 3, 37
  26. SparkSQL: via Sum(Amount)
      // Query
      val blah = sqlContext.sql("SELECT category_desc, sum(amount) FROM ocr GROUP BY category_desc")
      blah.map(a => a(0) + ", " + a(1)).collect().take(10).foreach(println)

      // Results
      14/11/16 18:46:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
      14/11/16 18:46:12 INFO spark.SparkContext: Job finished: collect at <console>:22, took 4.275620339 s
      Category 1, 2000
      Category 2, 10
      Category 3, 1800
  27. Connecting Tableau to SparkSQL
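      The Tableau connection was shown live rather than on slides. One common route (an assumption here, not confirmed by the deck) is Spark SQL's Thrift JDBC/ODBC server, which ships with Spark 1.1+ and which Tableau can reach through a Hive-compatible ODBC driver:

          ./sbin/start-thriftserver.sh --master spark://$masternode:7077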
  28. Diving into MLlib / SVD for the Expense Receipt Scenario
  29. Overview
      • Re-intro to Expense Receipt Prediction
      • SVD
        – What is it?
        – Why do I care?
      • Demo
        – Basics (get the data)
        – Data wrangling
        – Compute SVD
  30. Expense Receipts (Problems)
      • We want to guess the expense type based on the words on the receipt
      • The receipt × word matrix is sparse
      • Some words are likely to be found together
      • Some words are actually the same word
  31. SVD (Singular Value Decomposition)
      • Been around a while
        – But still popular and useful
      • Matrix factorization
        – Intuition: rotate your view of the data
      • Data can be well approximated by fewer features
        – And you can get an idea of how good the approximation is
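      For reference, the factorization behind the slide: for an $m \times n$ matrix $A$,

          $$ A \approx A_k = U_k \Sigma_k V_k^{\top} $$

      where $U_k$ ($m \times k$) and $V_k$ ($n \times k$) have orthonormal columns and $\Sigma_k$ is diagonal, holding the top $k$ singular values $\sigma_1 \ge \dots \ge \sigma_k \ge 0$. $A_k$ is the best rank-$k$ approximation of $A$ in the Frobenius norm, and the size of the discarded singular values tells you how good the approximation is.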
  32. Demo: Overview
      Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD
  33. Basics
      val rawlines = sc.textFile("/user/stevenh/subocr/subset.dat")
      val ocrRecords = rawlines map { rawline =>
        rawline.split("\t")
      } filter { line =>
        line.length == 10 && line(8) != ""
      } zipWithIndex() map { case (lineItems, lineIdx) =>
        OcrRecord(lineIdx, lineItems(0), lineItems(1).toLong, lineItems(4), lineItems(8).toDouble, lineItems(9))
      }

      zipWithIndex() lets you give each record in your RDD an incrementing integer index.
  34. Tokenize Records
      val splitRegex = new scala.util.matching.Regex("\r\n")
      val wordRegex = new scala.util.matching.Regex("[a-zA-Z0-9_]+")
      val recordWords = ocrRecords flatMap { rec =>
        val s1 = splitRegex.replaceAllIn(rec.ocrText, "")
        val s2 = wordRegex.findAllIn(s1)
        for { s <- s2 } yield (rec.recIdx, s.toLowerCase)
      }

      Keep track of which record each word came from.
  35. Demo: Overview
      Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD
  36. Group data by Record and Word
      // Count occurrences of each (recIdx, word) pair
      val wordCounts = recordWords groupBy { t => t }
      // Re-key by word: (word, (recIdx, word, count))
      val wordsByRecord = wordCounts map { gr =>
        (gr._1._2, (gr._1._1, gr._1._2, gr._2.size))
      }
      // Assign each distinct word an integer index: (word, wordIdx)
      val uniqueWords = wordsByRecord groupBy { t => t._2._2 } zipWithIndex() map { gr =>
        (gr._1._1, gr._2)
      }
  37. Join Record, Word Data
      // join combines 2-tuple RDDs on the first value of the tuple
      val preJoined = wordsByRecord join uniqueWords
      val joined = preJoined map { pj =>
        RecordWord(pj._2._1._1, pj._2._2, pj._2._1._2, pj._2._1._3.toDouble)
      }

      Now we have data for each non-zero word/record combo.
  38. Demo: Overview
      Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD
  39. Generate Word x Record Matrix
      val ncols = ocrRecords.count().toInt
      val nrows = uniqueWords.count().toLong
      // Note: RDD[Vector] here is a Spark Vector, not a Scala Vector
      val vectors: RDD[Vector] = joined groupBy { t => t.wordIdx } map { gr =>
        val indices = for { x <- gr._2 } yield x.recIdx.toInt
        val data = for { x <- gr._2 } yield x.n
        new SparseVector(ncols, indices.toArray, data.toArray)
      }
      val rowMatrix = new RowMatrix(vectors, nrows, ncols)
  40. Demo: Overview
      Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD
  41. Compute SVD
      val svd = rowMatrix.computeSVD(5, computeU = true)

      • Ironically, in Spark v1.0 computeSVD is limited by an operation which must complete on a single node…
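      A brief usage sketch of what computeSVD returns (U, s, and V are the fields of MLlib's SingularValueDecomposition result):

          // Singular values, largest first; how quickly they decay suggests
          // how well a rank-5 approximation captures the data
          println(svd.s)

          // Left singular vectors as a distributed RowMatrix (available because computeU = true)
          val U = svd.U
          // Right singular vectors as a local Matrix
          val V = svd.V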
  42. Spark / SVD References
      • Distributing the Singular Value Decomposition with Spark
        – Spark-SVD gist
        – Twitter / Databricks blog post
      • Spotting Topics with the Singular Value Decomposition
  43. Now do something with the data!
  44. Upcoming Conferences
      • Strata + Hadoop World
        – http://strataconf.com/big-data-conference-ca-2015/public/content/home
        – San Jose, Feb 17-20, 2015
      • Spark Summit East
        – http://spark-summit.org/east
        – New York, March 18-19, 2015
      • Ask for a copy of "Learning Spark"
        – http://www.oreilly.com/pub/get/spark
