Spark Programming
Data Infrastructure Team
엄태욱
2015-10-29
• SlideShare: http://goo.gl/yWFglI
• Github: https://goo.gl/zGnNtt
• Spark Package Download: http://goo.gl/vvpcrd (275.6MB)
http://spark.apache.org/
http://spark.apache.org/
• Developed in 2009 at UC Berkeley AMPLab
• Open sourced in 2010
Spark’s Goal
Support batch, streaming, and interactive computations
in a unified framework
http://strataconf.com/stratany2013/public/schedule/detail/30959
Spark Streaming
(Stream processing)
Spark SQL
(SQL/HQL)
MLlib
(Machine learning)
GraphX
(Graph computation)
Spark (Core Execution Engine)
Mesos Standalone YARN
Unified Platform
File System (Local, HDFS, S3, Tachyon) Data Store (Hive, HBase, Cassandra, …)
BlinkDB
(Approximate SQL)
Berkeley Data Analytics Stack
https://amplab.cs.berkeley.edu/software/
To validate our hypothesis
that specialized frameworks provide value over general ones,
we have also built a new framework
on top of Mesos called Spark,
optimized for iterative jobs
where a dataset is reused in many parallel operations,
and shown that Spark can outperform Hadoop by 10x
in iterative machine learning workloads.
http://spark.apache.org/docs/latest/cluster-overview.html
• Cluster Manager
• external service for acquiring resources on the cluster
• Standalone, Mesos, YARN
• Worker node
• Any node that can run application code in the cluster
• Application
• driver program + executors
• SparkContext
• application session
• connection to a cluster
http://spark.apache.org/docs/latest/cluster-overview.html
• Driver Program
• Process running the main()
• create SparkContext
• Executor
• process launched on a worker node
• Each application has its own executors
• Long-running and runs many small tasks
• keeps data in memory/disk storage
• Task
• Unit of work that will be sent to one executor
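To make these pieces concrete, a submission might look like the following (a sketch with hypothetical class and jar names): the driver runs main(), the cluster manager (here YARN) grants resources, and the requested executors are launched on worker nodes to run the tasks.
# hypothetical class/jar names; 4 executors x 2 cores each on YARN
$ bin/spark-submit \
    --master yarn-cluster \
    --class com.example.WordCount \
    --num-executors 4 \
    --executor-cores 2 \
    --executor-memory 4g \
    wordcount-assembly.jar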
http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/Spark-Survey-2015-Report.pdf
http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/Spark-Survey-2015-Report.pdf
Databricks Scala Guide
https://github.com/databricks/scala-style-guide
$ git clone https://github.com/apache/spark.git
$ git tag -l
$ git checkout tags/v1.5.1
Package Build
for CDH 5.4.7
$ JAVA_HOME=$(/usr/libexec/java_home -v 1.8) ./make-distribution.sh \
  --name hadoop2.6.0-cdh5.4.7 --tgz --with-tachyon \
  -Pyarn -Phive -Phive-thriftserver -Phadoop-2.6 \
  -Dhadoop.version=2.6.0-cdh5.4.7 -DskipTests
spark-1.5.1-bin-hadoop2.6.0-cdh5.4.7.tgz
for Local Test
$ JAVA_HOME=$(/usr/libexec/java_home -v 1.8) ./make-distribution.sh \
  --name hadoop2.4 --tgz -Phadoop-2.4 -Pyarn -Phive -DskipTests
spark-1.5.1-bin-hadoop2.4.tgz
Pre-built Package
http://mirror.apache-kr.org/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.4.tgz
(275.6MB)
http://spark.apache.org/downloads.html
Shell in Local Mode
REPL(Read-Eval-Print Loop) = Interactive Shell
import org.apache.spark.{SparkContext, SparkConf}

val sc = new SparkContext("local[*]", "Spark shell", new SparkConf())
scala>
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@36c783ca
The shell automatically creates the SparkContext (sc) as above and then shows the prompt
• Advantages
• local testing/debugging and unit tests, which are hard with distributed processing
• lazy evaluation: syntax can be checked before any data is actually processed
• Disadvantages
• memory limits
• differences in jar loading (YARN/Mesos, shell vs. submit mode)
$ cd ../
$ tar zxvf spark/spark-1.5.1-bin-spark-hadoop2.4.tgz
$ cd spark-1.5.1-bin-spark-hadoop2.4/
$ bin/spark-shell --master local[*]
Run Spark Shell
$ cat init.script
import java.lang.Runtime

println(s"cores = ${Runtime.getRuntime.availableProcessors}")
$ SPARK_PRINT_LAUNCH_COMMAND=1 bin/spark-shell --master local[*] -i init.script
Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/
bin/java -cp /Users/taewook/spark-1.5.1-bin-spark-hadoop2.4/conf/:/Users/taewook/
spark-1.5.1-bin-spark-hadoop2.4/lib/spark-assembly-1.5.1-hadoop2.4.0.jar:/Users/
taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/Users/
taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/Users/
taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar -
Dscala.usejavacp=true -Xms1g -Xmx1g org.apache.spark.deploy.SparkSubmit --master
local[*] --class org.apache.spark.repl.Main --name Spark shell spark-shell -i
init.script
========================================
...
Type :help for more information.
...
Spark context available as sc.
...
Loading init.script...
import java.lang.Runtime
cores = 4
scala>
scala> :load init.script
Loading init.script...
import java.lang.Runtime
cores = 4
Run Spark Shell
:paste         enter paste mode: all input up to ctrl-D compiled together
:cp <path>     add a jar or directory to the classpath
:history [num] show the history (optional num is the number of commands to show)
~/.spark_history
The java launch command shows the classpath, JVM options, and so on
Convenient for initialization commands worth running up front, such as function/class definitions
Web UI
http://localhost:4040/
sc.getConf.getAll.foreach(println)
In local mode the shell runs a single executor (= the driver),
and uses as many cores as local[N] specifies
An RDD is a read-only collection of objects
partitioned across a set of machines
that can be rebuilt if a partition is lost.
• Read-only = Immutable
• Parallelism ➔ distributed processing
• Can be cached for a long time ➔ better performance
• Transformation for change
• Repeated data copies ➔ lower performance, wasted space
• Overcome with laziness
The core abstraction of Spark
An RDD is a read-only collection of objects
partitioned across a set of machines
that can be rebuilt if a partition is lost.
Resilient Distributed Dataset
• Partitioned = Distributed
• more partitions = more parallelism
Resilient Distributed Dataset
The core abstraction of Spark
An RDD is a read-only collection of objects
partitioned across a set of machines
that can be rebuilt if a partition is lost.
Resilient Distributed Dataset
• can be rebuilt = Resilient
• recover from lost data partitions
• by data lineage
• can be cached
• a shorter lineage means faster recovery
The core abstraction of Spark
An RDD is a read-only collection of objects
partitioned across a set of machines
that can be rebuilt if a partition is lost.
• Similar to the Scala Collection API, plus distributed data operations
• map(), filter(), reduce(), count(), foreach(), …
• http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
• Transformations
• return a new RDD
• deterministic: re-running after a failure always gives the same result
• lazy evaluation
• Actions
• return a final value (some other data type)
• execution actually starts from the first RDD (or from the cached RDD if caching is used)
RDD Operations
http://training.databricks.com/workshop/sparkcamp.pdf
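A minimal sketch of this split, assuming the sc provided by spark-shell and the bundled README.md:
val lines  = sc.textFile("README.md")           // transformation: returns an RDD[String], nothing runs yet
val sparky = lines.filter(_.contains("Spark"))  // transformation: returns a new RDD, still lazy
val n      = sparky.count()                     // action: runs the job, returns a Long to the driver
val sample = sparky.take(3)                     // action: returns Array[String] to the driver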
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
http://www.supergloo.com/fieldnotes/apache-spark-examples-of-transformations/
http://www.supergloo.com/fieldnotes/apache-spark-examples-of-actions/
RDD API Examples
• Laziness = Lazily constructed
• deferred computation: operations are postponed until they are needed
• separates evaluation from execution
• only minimal error checking before execution
• no need to store intermediate RDD results
• Intermediate RDDs not materialized
• Immutability & Laziness
• Immutability ➔ makes laziness possible
• no side effects, so transformations can be combined
• combine node steps into “stages” (optimization)
• ➔ better performance, distributed processing becomes possible
Resilient Distributed Dataset
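Laziness also means a mistake inside a transformation only surfaces when an action forces evaluation. A small sketch in spark-shell:
val nums   = sc.parallelize(1 to 10)
val broken = nums.map(x => x / (x - 5))   // accepted: the map is only recorded in the lineage
// broken.collect()                       // only now would the job run and hit the ArithmeticException (at x = 5)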
Creating RDDs
• parallelizing a collection
• loads everything into the memory of a single driver machine
• for prototyping and testing only
• loading an external data set
• reads from an external source
• sc.textFile(): file://, hdfs://, s3n://
• sc.hadoopFile(), sc.newAPIHadoopFile()
• sqlContext.sql(), JdbcRDD(), …
scala> val numbers = sc.parallelize(1 to 10)

numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]
at parallelize at <console>:21
scala> val textFile = sc.textFile("README.md")

textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2]
at textFile at <console>:21
scala> val numbers = sc.parallelize(1 to 10)

numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]
at parallelize at <console>:21
scala> numbers.partitions.length
res0: Int = 4


scala> numbers.glom().collect()
res1: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5),
Array(6, 7), Array(8, 9, 10))
scala> val numbersWith2Partitions = sc.parallelize(1 to 10, 2)

numbersWith2Partitions: org.apache.spark.rdd.RDD[Int] =
ParallelCollectionRDD[2] at parallelize at <console>:21
scala> numbersWith2Partitions.partitions.length

res2: Int = 2
scala> numbersWith2Partitions.glom().collect()
res3: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(6, 7,
8, 9, 10))
Partitions
numbers.mapPartitionsWithIndex()
https://github.com/deanwampler/spark-workshop
Partitions
• Logical division of data
• e.g., a chunk (block) of an HDFS file
• location aware
• basic unit of parallelism
• RDD is just collection of partitions
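A sketch of the mapPartitionsWithIndex() call mentioned earlier, which makes the partition layout visible (assuming the sc from spark-shell):
val numbers = sc.parallelize(1 to 10)
numbers.mapPartitionsWithIndex { (idx, iter) =>
  // the function runs once per partition; idx is the partition number
  iter.map(n => s"partition $idx -> $n")
}.collect().foreach(println)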
Word Count
val textFile = sc.textFile("README.md", 4)


val words = textFile.flatMap(line => line.split("[\\s]+"))
val realWords = words.filter(_.nonEmpty)


val wordTuple = realWords.map(word => (word, 1))


val groupBy = wordTuple.groupByKey(2)


val wordCount = groupBy.mapValues(value => value.reduce(_ + _))


wordCount.collect().sortBy(-_._2)
Word Count
val textFile = sc.textFile("README.md", 4)
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21


val words = textFile.flatMap(line => line.split("[\\s]+"))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:23
val realWords = words.filter(_.nonEmpty)
realWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at filter at <console>:25


val wordTuple = realWords.map(word => (word, 1))
wordTuple: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:27


val groupBy = wordTuple.groupByKey(2)
groupBy: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[5] at groupByKey at
<console>:29


val wordCount = groupBy.mapValues(value => value.reduce(_ + _))
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at mapValues at
<console>:31


wordCount.collect().sortBy(-_._2)
res0: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10),
(##,8), (run,7), (can,6), (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ...
wordCount.saveAsTextFile("wordcount.txt")
• The main program is executed on the Spark Driver
• Transformations are executed on the Spark Workers
• Actions may transfer data from the Workers to the Driver
collect(), countByKey(), countByValue(), collectAsMap() ➔
• bounded output: count(), take(N)
• unbounded output: saveAsTextFile()
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
Keep the code that runs on executors (on workers) separate from the code that runs on the driver
The driver can receive executor data only through actions and accumulators (a small accumulator sketch follows)
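A minimal accumulator sketch with the Spark 1.x API: the counter is updated on the executors and read back on the driver, whereas an ordinary local variable captured in the closure would never make it back.
val blankLines = sc.accumulator(0)              // created on the driver
sc.textFile("README.md").foreach { line =>
  if (line.isEmpty) blankLines += 1             // incremented on the executors
}
println(s"blank lines = ${blankLines.value}")   // .value is read on the driver after the action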
Data Lineage of RDD
scala> wordCount.toDebugString
res1: String =
(2) MapPartitionsRDD[6] at mapValues at <console>:31 []
| ShuffledRDD[5] at groupByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
scala> wordCount.dependencies.head.rdd
res2: org.apache.spark.rdd.RDD[_] = ShuffledRDD[5] at groupByKey
at <console>:29
scala> textFile.dependencies.head.rdd
res3: org.apache.spark.rdd.RDD[_] = README.md HadoopRDD[0] at
textFile at <console>:21
scala> textFile.dependencies.head.rdd.dependencies
res4: Seq[org.apache.spark.Dependency[_]] = List()
Every RDD tracks its parent RDDs ➔ the basis of DAG scheduling and recovery
Data Lineage of RDD
scala> wordCount.toDebugString
res1: String =
(2) MapPartitionsRDD[6] at mapValues at <console>:31 []
| ShuffledRDD[5] at groupByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
[Diagram: the lineage forms a graph of stages. Stage 0 pipelines the narrow steps textFile → flatMap → filter → map; the shuffle boundary at groupByKey starts Stage 1 (groupByKey → mapValues), whose parent is Stage 0.]
Directed Acyclic Graph
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Minimize wide dependencies
RDD Dependencies
[Diagram: narrow dependencies need only local references and pipeline N steps into 1 stage (= 1 task); wide dependencies require network communication (shuffle), e.g. cogroup(), join(), groupByKey(), reduceByKey(), combineByKey(), …]
Quiz
[Quiz diagram: a DAG built from hadoopFile, textFile, map, filter, groupByKey, and two joins — how many stages?]
Schedule & Execute tasks
$ bin/spark-shell --master local[3]
...
val textFile = sc.textFile("README.md", 4)
...
val groupBy = wordTuple.groupByKey(2)
val groupBy = wordTuple.groupByKey(2)
groupBy: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[5] at groupByKey at
<console>:29


scala> groupBy.collect()
res5: Array[(String, Iterable[Int])] = Array((package,CompactBuffer(1)), (this,CompactBuffer(1)
), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-
version),CompactBuffer(1)), (Because,CompactBuffer(1)), (Python,CompactBuffer(1, 1)),
(cluster.,CompactBuffer(1)), (its,CompactBuffer(1)), ([run,CompactBuffer(1)),
(general,CompactBuffer(1, 1)), (YARN,,CompactBuffer(1)), (have,CompactBuffer(1)), (pre-
built,CompactBuffer(1)), (locally.,CompactBuffer(1)), (locally,CompactBuffer(1, 1)),
(changed,CompactBuffer(1)), (sc.parallelize(1,CompactBuffer(1)), (only,CompactBuffer(1)),
(several,CompactBuffer(1)), (learning,,CompactBuffer(1)), (basic,CompactBuffer(1)),
(first,CompactBuffer(1)), (This,CompactBuffer(1, 1)), (documentation,CompactBuffer(1, 1, 1)),
(Confi...
• HashMap within each partition
• no map-side aggregation (=combiner of MapReduce)
• single key-value pair must fit in memory (Out Of Memory/Disk)
groupByKey()
rdd.groupByKey().mapValues(value => value.reduce(func))
= rdd.reduceByKey(func)
When a replacement is possible, prefer reduceByKey, aggregateByKey, foldByKey, or combineByKey over groupByKey
(a per-key average with aggregateByKey is sketched below)
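For example, a per-key average can be written with aggregateByKey so partial (sum, count) pairs are merged map-side instead of shipping every value, as groupByKey would. A sketch with a hypothetical (dept, age) pair RDD:
val ages = sc.parallelize(Seq(("eng", 30), ("eng", 40), ("ops", 25)))
val avgByDept = ages
  .aggregateByKey((0L, 0))(                      // zero value per key: (sum, count)
    (acc, age) => (acc._1 + age, acc._2 + 1),    // fold a value into the partition-local accumulator
    (a, b) => (a._1 + b._1, a._2 + b._2))        // merge accumulators across partitions
  .mapValues { case (sum, count) => sum.toDouble / count }
avgByDept.collect()   // Array((eng,35.0), (ops,25.0)) -- order may vary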
scala> val wordCountReduceByKey = wordTuple.reduceByKey(_ + _, 2)
wordCountReduceByKey: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:29
scala> wordCountReduceByKey.toDebugString
res6: String =
(2) ShuffledRDD[7] at reduceByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
scala> wordCountReduceByKey.collect().sortBy(-_._2)
res7: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10), (##,8), (run,7), (can,6)
, (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ...
reduceByKey()
scala> sc.setLogLevel("INFO")
scala> wordCountReduceByKey.collect().sortBy(-_._2)
...
... INFO DAGScheduler: Got job 3 (collect at <console>:32) with 2 output partitions
... INFO DAGScheduler: Final stage: ResultStage 7(collect at <console>:32)
... INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 6)
... INFO DAGScheduler: Missing parents: List()
... INFO DAGScheduler: Submitting ResultStage 7 (ShuffledRDD[7] at reduceByKey at <console>:29), which has no missing
parents
... INFO MemoryStore: ensureFreeSpace(2328) called with curMem=109774, maxMem=555755765
... INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 2.3 KB, free 529.9 MB)
... INFO MemoryStore: ensureFreeSpace(1378) called with curMem=112102, maxMem=555755765
... INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1378.0 B, free 529.9 MB)
... INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:60479 (size: 1378.0 B, free: 530.0 MB)
... INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:861
... INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 7 (ShuffledRDD[7] at reduceByKey at <console>:29)
...
res9: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10), (##,8), (run,7), (can,6)
, (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ...
Skip
scala> sc.setLogLevel("INFO")
scala> val textFile = sc.textFile("README.md", 4)
scala> val words = textFile.flatMap(line => line.split("[\\s]+"))
scala> val realWords = words.filter(_.nonEmpty)
scala> val wordTuple = realWords.map(word => (word, 1))
scala> wordTuple.cache()


scala> val groupBy = wordTuple.groupByKey(2)
scala> val wordCount = groupBy.mapValues(value => value.reduce(
_ + _))
scala> wordCount.toDebugString
res2: String =
(2) MapPartitionsRDD[6] at mapValues at <console>:31 []
| ShuffledRDD[5] at groupByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
scala> wordCount.collect().sortBy(-_._2)
...
... INFO BlockManagerInfo: Added rdd_4_0 in memory on localhost:60641 (size: 9.9 KB, free: 530.0 MB)
... INFO BlockManagerInfo: Added rdd_4_1 in memory on localhost:60641 (size: 9.5 KB, free: 530.0 MB)
... INFO BlockManagerInfo: Added rdd_4_2 in memory on localhost:60641 (size: 10.7 KB, free: 530.0 MB)
... INFO BlockManagerInfo: Added rdd_4_3 in memory on localhost:60641 (size: 8.0 KB, free: 530.0 MB)
...
Cache
scala> val wordCountReduceByKey = wordTuple.reduceByKey(_ + _, 2)
scala> wordCountReduceByKey.toDebugString

res4: String =
(2) ShuffledRDD[7] at reduceByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| CachedPartitions: 4; MemorySize: 38.1 KB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
scala> wordCountReduceByKey.collect().sortBy(-_._2)
...
... INFO BlockManager: Found block rdd_4_0 locally
... INFO BlockManager: Found block rdd_4_1 locally
... INFO BlockManager: Found block rdd_4_2 locally
... INFO BlockManager: Found block rdd_4_3 locally
...
Cache
• Spark is fast because it specializes in iterative computation, but that is meaningless without caching
• without a cache, everything is recomputed from the first RDD
• LRU (Least Recently Used) eviction policy
• no need to cache in memory if the data will not be reused soon
• manual release: RDD.unpersist() (see the sketch after this slide)
• the default StorageLevel (MEMORY_ONLY) keeps data deserialized in memory
• more memory, less CPU
Cache
StorageLevel.MEMORY_ONLY
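If the default MEMORY_ONLY does not fit, a different StorageLevel can be set explicitly with persist(); a small sketch:
import org.apache.spark.storage.StorageLevel

val realWords = sc.textFile("README.md").flatMap(_.split("[\\s]+")).filter(_.nonEmpty)
realWords.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spilling to disk if needed
realWords.count()         // the first action materializes and caches the partitions
realWords.countByValue()  // reuses the cached partitions
realWords.unpersist()     // manual release (otherwise LRU eviction applies)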
val textFile = sc.textFile("README.md", 4)
val words = textFile.flatMap(line => line.split("[\\s]+"))

val realWords = words.filter(_.nonEmpty)

realWords.cache()

val wordTuple = realWords.map(word => (word, 1))

wordTuple.cache()



val groupBy = wordTuple.groupByKey(2)

val wordCount = groupBy.mapValues(value => value.reduce(_ + _))

wordCount.collect().sortBy(-_._2)



val wordCountReduceByKey = wordTuple.reduceByKey(_ + _, 2)

wordCountReduceByKey.collect().sortBy(-_._2)



realWords.countByValue().toArray.sortBy(-_._2)
scala.collection.Map[String,Long]
Word Count
4 partitions
• Drawbacks of the RDD API
• most data is structured

(JSON, CSV, Avro, Parquet, ORC, Hive ...)
• functional transformations are not intuitive
• accessing fields as ._1, ._2 is awkward and error-prone
val data = sc.textFile("people.tsv").map(line => line.split("\t"))

data.map(row => (row(0), (row(1).toLong, 1)))

.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

.map { case (dept, values: (Long, Int)) => (dept, values._1 / values._2) }

.collect()

sqlContext.table("people")

.groupBy('name)

.avg('age)

.collect()
DataFrame
DataFrame
• Distributed collection of rows organized into named columns.
• inspired by DataFrame in R and Pandas in Python
• RDD with schema (org.apache.spark.sql.SchemaRDD before v1.3)
• Python, Scala, Java, and R (via SparkR)
• Making Spark accessible to everyone (including people who are more familiar with RDBMSs or R)
• data scientists, engineers, statisticians, ...
http://www.slideshare.net/databricks/2015-0616-spark-summit
DataFrame Operations
scala> val df = sqlContext.read.json("examples/src/main/resources/people.json")

df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
scala> df.explain(true)
== Parsed Logical Plan ==
Relation[age#3L,name#4] JSONRelation[file:/.../examples/src/main/resources/people.json]
== Analyzed Logical Plan ==
age: bigint, name: string
Relation[age#3L,name#4] JSONRelation[file:/.../examples/src/main/resources/people.json]
== Optimized Logical Plan ==
Relation[age#3L,name#4] JSONRelation[file:/.../examples/src/main/resources/people.json]
== Physical Plan ==
Scan JSONRelation[file:/.../examples/src/main/resources/people.json][age#3L,name#4]
Code Generation: true
DataFrame Operations
scala> df.select(df("name"), df("age") + 1).show()
+-------+---------+
| name|(age + 1)|
+-------+---------+
|Michael| null|
| Andy| 31|
| Justin| 20|
+-------+---------+


scala> df.filter(df("age") > 21).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

scala> df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
|null| 1|
| 19| 1|
| 30| 1|
+----+-----+
df.select('name, 'age + 1).show()
df.filter('age > 21).show()
df.groupBy('age).count().show()
DataFrame API
• How to use the API
• http://spark.apache.org/docs/latest/sql-programming-guide.html
• http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science
• https://github.com/yu-iskw/spark-dataframe-introduction/blob/master/doc/dataframe-introduction.md
• Books lag behind; check the source code for the latest details
• Example Code
• /spark/examples/src/main/scala/org/apache/spark/examples/sql/RDDRelation.scala
• TestSuite
• /spark/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
http://www.slideshare.net/databricks/2015-0616-spark-summit
Consolidates the optimization work in one place: the DataFrame
Same performance for all languages
http://www.slideshare.net/databricks/spark-dataframes-simple-and-fast-analytics-on-structured-data-at-spark-summit-2015
• Simple tasks easy with DataFrame API
• Complex tasks possible with RDD API
scala> import sqlContext.implicits._

scala> case class Person(name: String, age: Int)


scala> val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p =>
Person(p(0), p(1).trim.toInt)).toDF()
people: org.apache.spark.sql.DataFrame = [name: string, age: int]


scala> people.registerTempTable("people")


scala> val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
teenagers: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> teenagers.show()
+------+---+
| name|age|
+------+---+
|Justin| 19|
+------+---+


scala> val teenagersRdd = teenagers.rdd

teenagersRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[8] at rdd at <console>:24
scala> teenagersRdd.toDebugString
res2: String =
(2) MapPartitionsRDD[8] at rdd at <console>:24 []
| MapPartitionsRDD[7] at rdd at <console>:24 []
| MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:26 []
| MapPartitionsRDD[3] at map at <console>:26 []
| MapPartitionsRDD[2] at map at <console>:26 []
| MapPartitionsRDD[1] at textFile at <console>:26 []
| examples/src/main/resources/people.txt HadoopRDD[0] at textFile at <console>:26 []


scala> teenagersRdd.collect()
res3: Array[org.apache.spark.sql.Row] = Array([Justin,19])
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
https://twitter.com/pwendell/status/649993257414340608
Questions?
questions.foreach( answer(_) )
• Apache Spark User List
• http://apache-spark-user-list.1001560.n3.nabble.com/
• Devops Advanced Class
• http://training.databricks.com/devops.pdf
• Intro to Apache Spark
• http://training.databricks.com/workshop/sparkcamp.pdf
• Apache Spark Tutorial
• http://cdn.liber118.com/workshop/fcss_spark.pdf
• Anatomy of RDD : Deep dive into Spark RDD abstraction
• http://www.slideshare.net/datamantra/anatomy-of-rdd
• A Deeper Understanding of Spark’s Internals
• https://spark-summit.org/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf
• Scala and the JVM for Big Data: Lessons from Spark
• https://deanwampler.github.io/polyglotprogramming/papers/ScalaJVMBigData-SparkLessons.pdf
• Lightning Fast Big Data Analytics with Apache Spark
• http://www.virdata.com/wp-content/uploads/Spark-Devoxx2014.pdf
References
Appendix
Quiz
[Quiz diagram: the same DAG of hadoopFile, textFile, map, filter, groupByKey, and two joins]
4 Stages!
http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/Spark-Survey-2015-Report.pdf
PySpark
http://spark.apache.org/docs/latest/api/python/
+
Jupyter Notebook
https://jupyter.org/
• Number of partitions
• when reading very large files, use coalesce(N) to reduce the number of partitions (and executors in use) — see the sketch after this list
• one executor can then handle several partitions without a repartition
• for long-running CPU-heavy work, increase the number of partitions (and executors in use) with repartition(N)
• Number of executors
• at least twice the maximum number of partitions in the job
• Executor memory
• using at most 75% of the machine's memory is recommended
• minimum heap size of 8GB
• maximum heap size should not exceed 40GB (watch GC behavior)
• memory usage depends on the StorageLevel and the serialization format
• G1GC settings
• https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
• Miscellaneous tuning
• http://spark.apache.org/docs/latest/tuning.html
Tips
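A sketch of the two partition-count adjustments above (the path is hypothetical): coalesce(N) shrinks the partition count without a shuffle, repartition(N) shuffles to increase it.
val logs  = sc.textFile("hdfs:///logs/2015/10/*")   // may produce thousands of partitions
val fewer = logs.coalesce(100)                      // shrink without a shuffle, e.g. before writing output
val wider = logs.repartition(400)                   // full shuffle to raise parallelism for CPU-heavy steps
println(s"${logs.partitions.length} -> ${fewer.partitions.length} / ${wider.partitions.length}")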
• Instead of computing min, max, sum, and mean in separate passes, get them all at once with stats() (see the sketch after this list)
• count, sum, min, max, mean, stdev, variance, sampleStdev, sampleVariance
• How to spot shuffle problems
• in the Web UI, look for stages/partitions that take a long time or have large input/output
• KryoSerializer
• /conf/spark-defaults.conf
• "spark.serializer", "org.apache.spark.serializer.KryoSerializer"
• recommended together with StorageLevel.MEMORY_ONLY_SER
• Task not serializable: java.io.NotSerializableException
• create and use the object inside the function that runs on the executor
• https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
Tips
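A sketch of the stats() and Kryo tips above; the serializer can also be set on a SparkConf instead of spark-defaults.conf:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("tips-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val latencies = sc.parallelize(Seq(12.0, 7.5, 31.2, 8.8))
val s = latencies.stats()   // one pass instead of separate min/max/sum/mean jobs
println(s"count=${s.count} mean=${s.mean} stdev=${s.stdev} min=${s.min} max=${s.max}")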

More Related Content

What's hot

Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...thelabdude
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solrthelabdude
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in sparkPeng Cheng
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & SparkMatthias Niehoff
 
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Spark Summit
 
Webinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsWebinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsLucidworks
 
Faster Data Analytics with Apache Spark using Apache Solr
Faster Data Analytics with Apache Spark using Apache SolrFaster Data Analytics with Apache Spark using Apache Solr
Faster Data Analytics with Apache Spark using Apache SolrChitturi Kiran
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsCheng Lian
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat SheetHortonworks
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with ScalaHimanshu Gupta
 
DataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New WorldDataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New WorldDatabricks
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 

What's hot (20)

Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
 
How to build your query engine in spark
How to build your query engine in sparkHow to build your query engine in spark
How to build your query engine in spark
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Webinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsWebinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data Analytics
 
Faster Data Analytics with Apache Spark using Apache Solr
Faster Data Analytics with Apache Spark using Apache SolrFaster Data Analytics with Apache Spark using Apache Solr
Faster Data Analytics with Apache Spark using Apache Solr
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
 
Hadoop on osx
Hadoop on osxHadoop on osx
Hadoop on osx
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
DataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New WorldDataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New World
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 

Similar to Spark Programming

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to SparkKyle Burke
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizationsGal Marder
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraHandaru Sakti
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupxPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupRadu Chilom
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkMartin Toshev
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 

Similar to Spark Programming (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Spark core
Spark coreSpark core
Spark core
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data Era
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupxPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 

Recently uploaded

ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 

Recently uploaded (20)

ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 

Spark Programming

  • 1. Programming Data Infrastructure Team 엄태욱 2015-10-29 • SlideShare: http://goo.gl/yWFglI • Github: https://goo.gl/zGnNtt • Spark Package Download: http://goo.gl/vvpcrd (275.6MB)
  • 2. http://spark.apache.org/ http://spark.apache.org/ • Developed in 2009 at UC Berkeley AMPLab • Open sourced in 2010
  • 3. Spark’s Goal Support batch, streaming, and interactive computations in a unified framework http://strataconf.com/stratany2013/public/schedule/detail/30959
  • 4. Spark Streaming (Stream processing) Spark SQL (SQL/HQL) MLlib (Machine learning) GraphX (Graph computation) Spark (Core Execution Engine) Mesos Standalone YARN Unified Platform File System (Local, HDFS, S3, Tachyon) Data Store (Hive, HBase, Cassandra, …) BlinkDB (Approximate SQL) Berkeley Data Analytics Stack https://amplab.cs.berkeley.edu/software/
  • 5. To validate our hypothesis that specialized frameworks provide value over general ones, we have also built a new framework on top of Mesos called Spark, optimized for iterative jobs where a dataset is reused in many parallel operations, and shown that Spark can outperform Hadoop by 10x in iterative machine learning workloads.
  • 6. http://spark.apache.org/docs/latest/cluster-overview.html • Cluster Manager • external service for acquiring resources on the cluster • Standalone, Mesos, YARN • Worker node • Any node that can run application code in the cluster • Application • driver program + executors • SparkContext • application session • connection to a cluster
  • 7. http://spark.apache.org/docs/latest/cluster-overview.html • Driver Program • Process running the main() • create SparkContext • Executor • process launched on a worker node • Each application has its own executors • Long-running and runs many small tasks • keeps data in memory/disk storage • Task • Unit of work that will be sent to one executor
  • 10. $ git clone https://github.com/apache/spark.git $ git tag -l $ git checkout tags/v1.5.1 Package Build for CDH 5.4.7 $ JAVA_HOME=$(/usr/libexec/java_home -v 1.8) ./make-distribution.sh --name hadoop2.6.0-chd5.4.7 --tgz --with-tachyon -Pyarn -Phive -Phive-thriftserver -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.7 -DskipTest spark-1.5.1-bin-hadoop2.6.0-chd5.4.7.tgz for Local Test $ JAVA_HOME=$(/usr/libexec/java_home -v 1.8) ./make-distribution.sh --name hadoop2.4 --tgz -Phadoop-2.4 -Pyarn -Phive -DskipTest spark-1.5.1-bin-hadoop2.4.tgz
  • 12. Shell in Local Mode REPL(Read-Eval-Print Loop) = Interactive Shell import org.apache.spark.{SparkContext, SparkConf}
 val sc = new SparkContext("local[*]", "Spark shell", new SparkConf()) scala> scala> sc res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@36c783ca 자동으로 다음과 같이 SparkContext(sc) 생성한 후 shell prompt • 장점 • 분산처리에서 어려운 local test/debugging, unit test • lazy 처리: 데이터 처리 없이 syntax 먼저 검증 가능 • 단점 • 메모리 한계 • jar loading 차이(YARN/Mesos Shell/Submit Mode 차이)
  • 13. $ cd ../ $ tar zxvf spark/spark-1.5.1-bin-spark-hadoop2.4.tgz $ cd spark-1.5.1-bin-spark-hadoop2.4/ $ bin/spark-shell --master local[*] Run Spark Shell
  • 14. $ cat init.script import java.lang.Runtime
 println(s"cores = ${Runtime.getRuntime.availableProcessors}") $ SPARK_PRINT_LAUNCH_COMMAND=1 bin/spark-shell --master local[*] -i init.script Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/ bin/java -cp /Users/taewook/spark-1.5.1-bin-spark-hadoop2.4/conf/:/Users/taewook/ spark-1.5.1-bin-spark-hadoop2.4/lib/spark-assembly-1.5.1-hadoop2.4.0.jar:/Users/ taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/Users/ taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/Users/ taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar - Dscala.usejavacp=true -Xms1g -Xmx1g org.apache.spark.deploy.SparkSubmit --master local[*] --class org.apache.spark.repl.Main --name Spark shell spark-shell -i init.script ======================================== ... Type :help for more information. ... Spark context available as sc. ... Loading init.script... import java.lang.Runtime cores = 4 scala> scala> :load init.script Loading init.script... import java.lang.Runtime cores = 4 Run Spark Shell :paste      enter paste mode: all input up to   ctrl-D compiled together :cp <path>  add a jar or directory to the classpath :history [num] show the history (optional num is commands to show) ~/.spark_history java 실행 명령으로 classpath, JVM 옵션 등 확인 함수/클래스 정의 등 미리 실행하면 편리한 초기화 명령 수행
  • 16. An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partitionis lost.
  • 17. • Read-only = Immutable • Parallelism ➔ 분산 처리 • 오랫동안 Caching 가능 ➔ 성능 • Transformation for change • 데이터 복사 반복 ➔ 성능➡, 공간 낭비 • Laziness로 극복 Core abstraction in the core of Spark An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partitionis lost. Resilient Distributed Dataset
  • 18. • Partitioned = Distributed • more partiton = more parallelism Resilient Distributed Dataset Core abstraction in the core of Spark An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partitionis lost.
  • 19. Resilient Distributed Dataset • can be rebuilt = Resilient • recover from lost data partitions • by data lineage • can be cached • lineage 짧게 줄여서 더 빠르게 복구 Core abstraction in the core of Spark An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partitionis lost.
  • 20. • Scala Collection API와 비슷 + 분산 데이터 연산 • map(), filter(), reduce(), count(), foreach(), … • http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD • Transformations • return a new RDD • deterministic - 실패해서 재실행해도 결과 항상 같음 • lazy evaluation • Actions • return a final value(some other data type) • 첫 RDD부터 실제 실행(Caching 하면 cache된 RDD부터 실행) RDD Operations http://training.databricks.com/workshop/sparkcamp.pdf
  • 22. • Laziness = Lazily constructed • 지연 계산 - 필요할 때까지 연산 미루기 • evaluation과 execution의 분리 • 실행 전에 최소한의 오류 검사 • 중간 RDD 결과값 저장 불필요 • Intermediate RDDs not materialized • Immutability & Laziness • Immutability ➔ Laziness 가능 • side-effect 없어 transformation들 combine 가능 • combine node steps into “stages” (최적화) • ➔ 성능 , 분산 처리 가능 Resilient Distributed Dataset
  • 23. Creating RDDs • parallelizing a collection • 한 대의 driver 장비의 메모리에 모두 올림 • for only prototyping, testing • loading an external data set • 외부 소스로부터 읽기 • sc.textFile(): file://, hdfs://, s3n:// • sc.hadoopFile(), sc.newAPIHadoopFile() • sqlContext.sql(), JdbcRDD(), … scala> val numbers = sc.parallelize(1 to 10)
 numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21 scala> val textFile = sc.textFile("README.md")
 textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at textFile at <console>:21
  • 24. scala> val numbers = sc.parallelize(1 to 10)
 numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21 scala> numbers.partitions.length res0: Int = 4 
 scala> numbers.glom().collect() res1: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5), Array(6, 7), Array(8, 9, 10)) scala> val numbersWith2Partitions = sc.parallelize(1 to 10, 2)
 numbersWith2Partitions: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:21 scala> numbersWith2Partitions.partitions.length
 res2: Int = 2 scala> numbersWith2Partitions.glom().collect() res3: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(6, 7, 8, 9, 10)) Partitions numbers.mapPartitionsWithIndex()
  • 25. https://github.com/deanwampler/spark-workshop Partitions • Logical division of data • chunk of HDFS • location aware • basic unit of parallelism • RDD is just collection of partitions
  • 26. Word Count val textFile = sc.textFile("README.md", 4) 
 val words = textFile.flatMap(line => line.split("[s]+")) val realWords = words.filter(_.nonEmpty) 
 val wordTuple = realWords.map(word => (word, 1)) 
 val groupBy = wordTuple.groupByKey(2) 
 val wordCount = groupBy.mapValues(value => value.reduce(_ + _)) 
 wordCount.collect().sortBy(-_._2)
  • 27. Word Count val textFile = sc.textFile("README.md", 4) textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21 
 val words = textFile.flatMap(line => line.split("[s]+")) words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:23 val realWords = words.filter(_.nonEmpty) realWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at filter at <console>:25 
 val wordTuple = realWords.map(word => (word, 1)) wordTuple: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:27 
 val groupBy = wordTuple.groupByKey(2) groupBy: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[5] at groupByKey at <console>:29 
 val wordCount = groupBy.mapValues(value => value.reduce(_ + _)) wordCount: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at mapValues at <console>:31 
 wordCount.collect().sortBy(-_._2) res0: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10), (##,8), (run,7), (can,6), (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ... wordCount.saveAsTextFile("wordcount.txt")
  • 28. • Main program are executed on the Spark Driver • Transformations are executed on the Spark Workers • Actions may transfer from the Workers to the Driver collect(), countByKey(), countByValue(), collectAsMap() ➔ • bounded output: count(), take(N) • unbounded output: saveAsTextFile() http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs executor(on worker) 동작 영역과 driver 동작 영역 구분 driver에서는 action과 accumulator 외에 executor의 데이터 받을 수 없음
  • 29. Data Lineage of RDD scala> wordCount.toDebugString res1: String = (2) MapPartitionsRDD[6] at mapValues at <console>:31 [] | ShuffledRDD[5] at groupByKey at <console>:29 [] +-(4) MapPartitionsRDD[4] at map at <console>:27 [] | MapPartitionsRDD[3] at filter at <console>:25 [] | MapPartitionsRDD[2] at flatMap at <console>:23 [] | MapPartitionsRDD[1] at textFile at <console>:21 [] | README.md HadoopRDD[0] at textFile at <console>:21 [] scala> wordCount.dependencies.head.rdd res2: org.apache.spark.rdd.RDD[_] = ShuffledRDD[5] at groupByKey at <console>:29 scala> textFile.dependencies.head.rdd res3: org.apache.spark.rdd.RDD[_] = README.md HadoopRDD[0] at textFile at <console>:21 scala> textFile.dependencies.head.rdd.dependencies res4: Seq[org.apache.spark.Dependency[_]] = List() 모든 RDD는 부모 RDD 추적 ➔ DAG Scheduling과 복구의 기본
  • 30. Data Lineage of RDD scala> wordCount.toDebugString res1: String = (2) MapPartitionsRDD[6] at mapValues at <console>:31 [] | ShuffledRDD[5] at groupByKey at <console>:29 [] +-(4) MapPartitionsRDD[4] at map at <console>:27 [] | MapPartitionsRDD[3] at filter at <console>:25 [] | MapPartitionsRDD[2] at flatMap at <console>:23 [] | MapPartitionsRDD[1] at textFile at <console>:21 [] | README.md HadoopRDD[0] at textFile at <console>:21 [] flatMap filter map groupByKey mapValues Step Stage textFile textFile Nil Stage 1 Stage 0 Parent shuffle boundary
  • 32. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf Wide Dependency 최소화 RDD Dependencies Local Reference Network Communication (Shuffle) Narrow Dependencies N steps ➔ 1 stage = 1 task cogroup() join() groupByKey() reduceByKey() combineByKey() …
  • 34. Schedule & Execute tasks $ bin/spark-shell --master local[3] ... val textFile = sc.textFile("README.md", 4) ... val groupBy = wordTuple.groupByKey(2)
  • 35. val groupBy = wordTuple.groupByKey(2) groupBy: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[5] at groupByKey at <console>:29 
scala> groupBy.collect() res5: Array[(String, Iterable[Int])] = Array((package,CompactBuffer(1)), (this,CompactBuffer(1) ), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop- version),CompactBuffer(1)), (Because,CompactBuffer(1)), (Python,CompactBuffer(1, 1)), (cluster.,CompactBuffer(1)), (its,CompactBuffer(1)), ([run,CompactBuffer(1)), (general,CompactBuffer(1, 1)), (YARN,,CompactBuffer(1)), (have,CompactBuffer(1)), (pre- built,CompactBuffer(1)), (locally.,CompactBuffer(1)), (locally,CompactBuffer(1, 1)), (changed,CompactBuffer(1)), (sc.parallelize(1,CompactBuffer(1)), (only,CompactBuffer(1)), (several,CompactBuffer(1)), (learning,,CompactBuffer(1)), (basic,CompactBuffer(1)), (first,CompactBuffer(1)), (This,CompactBuffer(1, 1)), (documentation,CompactBuffer(1, 1, 1)), (Confi... • HashMap within each partition • no map-side aggregation (= the combiner of MapReduce) • all values for a single key must fit in memory (risk of Out Of Memory / spilling to disk) groupByKey() rdd.groupByKey().mapValues(value => value.reduce(func)) = rdd.reduceByKey(func) Where a replacement is possible, prefer reduceByKey, aggregateByKey, foldByKey, or combineByKey over groupByKey
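A minimal sketch of those replacements, reusing realWords from the word-count example; the (sum, count) aggregation is only an illustration of aggregateByKey, not part of the original slides:

val pairs = realWords.map(word => (word, 1))

// Same result as groupByKey(2).mapValues(_.reduce(_ + _)), but with map-side combining:
val counts = pairs.reduceByKey(_ + _, 2)

// aggregateByKey when the result type differs from the value type,
// e.g. (sum, count) per key, which a later mapValues could turn into an average:
val sumAndCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),       // merge a value into the per-partition accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))       // merge accumulators across partitions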
  • 36. scala> val wordCountReduceByKey = wordTuple.reduceByKey(_ + _, 2) wordCountReduceByKey: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:29 scala> wordCountReduceByKey.toDebugString res6: String = (2) ShuffledRDD[7] at reduceByKey at <console>:29 [] +-(4) MapPartitionsRDD[4] at map at <console>:27 [] | MapPartitionsRDD[3] at filter at <console>:25 [] | MapPartitionsRDD[2] at flatMap at <console>:23 [] | MapPartitionsRDD[1] at textFile at <console>:21 [] | README.md HadoopRDD[0] at textFile at <console>:21 [] scala> wordCountReduceByKey.collect().sortBy(-_._2) res7: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10), (##,8), (run,7), (can,6) , (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ... reduceByKey() scala> sc.setLogLevel("INFO") scala> wordCountReduceByKey.collect().sortBy(-_._2) ... ... INFO DAGScheduler: Got job 3 (collect at <console>:32) with 2 output partitions ... INFO DAGScheduler: Final stage: ResultStage 7(collect at <console>:32) ... INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 6) ... INFO DAGScheduler: Missing parents: List() ... INFO DAGScheduler: Submitting ResultStage 7 (ShuffledRDD[7] at reduceByKey at <console>:29), which has no missing parents ... INFO MemoryStore: ensureFreeSpace(2328) called with curMem=109774, maxMem=555755765 ... INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 2.3 KB, free 529.9 MB) ... INFO MemoryStore: ensureFreeSpace(1378) called with curMem=112102, maxMem=555755765 ... INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1378.0 B, free 529.9 MB) ... INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:60479 (size: 1378.0 B, free: 530.0 MB) ... INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:861 ... INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 7 (ShuffledRDD[7] at reduceByKey at <console>:29) ... res9: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10), (##,8), (run,7), (can,6) , (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ...
  • 37. Skip
 • 38. scala> sc.setLogLevel("INFO") scala> val textFile = sc.textFile("README.md", 4) scala> val words = textFile.flatMap(line => line.split("[\\s]+")) scala> val realWords = words.filter(_.nonEmpty) scala> val wordTuple = realWords.map(word => (word, 1)) scala> wordTuple.cache() 
 scala> val groupBy = wordTuple.groupByKey(2) scala> val wordCount = groupBy.mapValues(value => value.reduce( _ + _)) scala> wordCount.toDebugString res2: String = (2) MapPartitionsRDD[6] at mapValues at <console>:31 [] | ShuffledRDD[5] at groupByKey at <console>:29 [] +-(4) MapPartitionsRDD[4] at map at <console>:27 [] | MapPartitionsRDD[3] at filter at <console>:25 [] | MapPartitionsRDD[2] at flatMap at <console>:23 [] | MapPartitionsRDD[1] at textFile at <console>:21 [] | README.md HadoopRDD[0] at textFile at <console>:21 [] scala> wordCount.collect().sortBy(-_._2) ... ... INFO BlockManagerInfo: Added rdd_4_0 in memory on localhost:60641 (size: 9.9 KB, free: 530.0 MB) ... INFO BlockManagerInfo: Added rdd_4_1 in memory on localhost:60641 (size: 9.5 KB, free: 530.0 MB) ... INFO BlockManagerInfo: Added rdd_4_2 in memory on localhost:60641 (size: 10.7 KB, free: 530.0 MB) ... INFO BlockManagerInfo: Added rdd_4_3 in memory on localhost:60641 (size: 8.0 KB, free: 530.0 MB) ... Cache
  • 39. scala> val wordCountReduceByKey = wordTuple.reduceByKey(_ + _, 2) scala> wordCountReduceByKey.toDebugString
res4: String = (2) ShuffledRDD[7] at reduceByKey at <console>:29 [] +-(4) MapPartitionsRDD[4] at map at <console>:27 [] | CachedPartitions: 4; MemorySize: 38.1 KB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B | MapPartitionsRDD[3] at filter at <console>:25 [] | MapPartitionsRDD[2] at flatMap at <console>:23 [] | MapPartitionsRDD[1] at textFile at <console>:21 [] | README.md HadoopRDD[0] at textFile at <console>:21 [] scala> wordCountReduceByKey.collect().sortBy(-_._2) ... ... INFO BlockManager: Found block rdd_4_0 locally ... INFO BlockManager: Found block rdd_4_1 locally ... INFO BlockManager: Found block rdd_4_2 locally ... INFO BlockManager: Found block rdd_4_3 locally ... Cache • Spark is fast because it is built for iterative computation, but that advantage disappears without caching • without a cache, every job recomputes from the first RDD • LRU (Least-Recently-Used) eviction policy • if an RDD is not reused soon, there is no need to cache it in memory • manual cache release: RDD.unpersist() • the default StorageLevel (MEMORY_ONLY) keeps objects deserialized in memory: trades memory for CPU
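A minimal sketch of explicit persistence; the choice of MEMORY_ONLY_SER and the val name are illustrative (a fresh RDD is used because an already-cached RDD cannot change its storage level):

import org.apache.spark.storage.StorageLevel

val tuples = realWords.map(word => (word, 1))
tuples.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized in memory: smaller footprint, more CPU
tuples.count()                                  // the first action materializes the cached partitions

// ... reuse tuples in later jobs without recomputing the lineage ...

tuples.unpersist()                              // release explicitly instead of waiting for LRU eviction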
  • 40. Cache
 • 42. val textFile = sc.textFile("README.md", 4) val words = textFile.flatMap(line => line.split("[\\s]+"))
 val realWords = words.filter(_.nonEmpty)
 realWords.cache()
 val wordTuple = realWords.map(word => (word, 1))
 wordTuple.cache()
 
 val groupBy = wordTuple.groupByKey(2)
 val wordCount = groupBy.mapValues(value => value.reduce(_ + _))
 wordCount.collect().sortBy(-_._2)
 
 val wordCountReduceByKey = wordTuple.reduceByKey(_ + _, 2)
 wordCountReduceByKey.collect().sortBy(-_._2)
 
realWords.countByValue().toArray.sortBy(-_._2) // countByValue() returns scala.collection.Map[String,Long]
Word Count
 • 44. • Drawbacks of the RDD API • most data is structured
(JSON, CSV, Avro, Parquet, ORC, Hive ...) • functional transformations are not intuitive • accessing fields as ._1, ._2 is inconvenient and error-prone val data = sc.textFile("people.tsv").map(line => line.split("\t"))
 data.map(row => (row(0), (row(1).toLong, 1)))
 .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
 .map { case (dept, values: (Long, Int)) => (dept, values._1 / values._2) }
 .collect()
 sqlContext.table("people")
 .groupBy('name)
 .avg('age)
 .collect() DataFrame
 • 45. DataFrame • Distributed collection of rows organized into named columns. • inspired by DataFrame in R and Pandas in Python • RDD with schema (org.apache.spark.sql.SchemaRDD before v1.3) • Python, Scala, Java, and R (via SparkR) • Making Spark accessible to everyone (including people who are more at home with relational databases or R) • data scientists, engineers, statisticians, ... http://www.slideshare.net/databricks/2015-0616-spark-summit
  • 46. DataFrame Operations scala> val df = sqlContext.read.json("examples/src/main/resources/people.json")
 df: org.apache.spark.sql.DataFrame = [age: bigint, name: string] scala> df.printSchema() root |-- age: long (nullable = true) |-- name: string (nullable = true) scala> df.show() +----+-------+ | age| name| +----+-------+ |null|Michael| | 30| Andy| | 19| Justin| +----+-------+ scala> df.explain(true) == Parsed Logical Plan == Relation[age#3L,name#4] JSONRelation[file:/.../examples/src/main/resources/people.json] == Analyzed Logical Plan == age: bigint, name: string Relation[age#3L,name#4] JSONRelation[file:/.../examples/src/main/resources/people.json] == Optimized Logical Plan == Relation[age#3L,name#4] JSONRelation[file:/.../examples/src/main/resources/people.json] == Physical Plan == Scan JSONRelation[file:/.../examples/src/main/resources/people.json][age#3L,name#4] Code Generation: true
  • 47. DataFrame Operations scala> df.select(df("name"), df("age") + 1).show() +-------+---------+ | name|(age + 1)| +-------+---------+ |Michael| null| | Andy| 31| | Justin| 20| +-------+---------+ 
 scala> df.filter(df("age") > 21).show() +---+----+ |age|name| +---+----+ | 30|Andy| +---+----+
 scala> df.groupBy("age").count().show() +----+-----+ | age|count| +----+-----+ |null| 1| | 19| 1| | 30| 1| +----+-----+ df.select('name, 'age + 1).show() df.filter('age > 21).show() df.groupBy('age).count().show()
 • 48. DataFrame API • API usage • http://spark.apache.org/docs/latest/sql-programming-guide.html • http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science • https://github.com/yu-iskw/spark-dataframe-introduction/blob/master/doc/dataframe-introduction.md • books are still scarce; check the source code for the latest details • Example Code • /spark/examples/src/main/scala/org/apache/spark/examples/sql/RDDRelation.scala • TestSuite • /spark/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
  • 50. Same performance for all languages http://www.slideshare.net/databricks/spark-dataframes-simple-and-fast-analytics-on-structured-data-at-spark-summit-2015
  • 51. • Simple tasks easy with DataFrame API • Complex tasks possible with RDD API scala> import sqlContext.implicits._
 scala> case class Person(name: String, age: Int) 
 scala> val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF() people: org.apache.spark.sql.DataFrame = [name: string, age: int] 
scala> people.registerTempTable("people") 
scala> val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19") teenagers: org.apache.spark.sql.DataFrame = [name: string, age: int] 
 scala> teenagers.show() +------+---+ | name|age| +------+---+ |Justin| 19| +------+---+ 
 scala> val teenagersRdd = teenagers.rdd
 teenagersRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[8] at rdd at <console>:24 scala> teenagersRdd.toDebugString res2: String = (2) MapPartitionsRDD[8] at rdd at <console>:24 [] | MapPartitionsRDD[7] at rdd at <console>:24 [] | MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:26 [] | MapPartitionsRDD[3] at map at <console>:26 [] | MapPartitionsRDD[2] at map at <console>:26 [] | MapPartitionsRDD[1] at textFile at <console>:26 [] | examples/src/main/resources/people.txt HadoopRDD[0] at textFile at <console>:26 [] 
 scala> teenagersRdd.collect() res3: Array[org.apache.spark.sql.Row] = Array([Justin,19])
  • 54. • Apache Spark User List • http://apache-spark-user-list.1001560.n3.nabble.com/ • Devops Advanced Class • http://training.databricks.com/devops.pdf • Intro to Apache Spark • http://training.databricks.com/workshop/sparkcamp.pdf • Apache Spark Tutorial • http://cdn.liber118.com/workshop/fcss_spark.pdf • Anatomy of RDD : Deep dive into Spark RDD abstraction • http://www.slideshare.net/datamantra/anatomy-of-rdd • A Deeper Understanding of Spark’s Internals • https://spark-summit.org/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf • Scala and the JVM for Big Data: Lessons from Spark • https://deanwampler.github.io/polyglotprogramming/papers/ScalaJVMBigData-SparkLessons.pdf • Lightning Fast Big Data Analytics with Apache Spark • http://www.virdata.com/wp-content/uploads/Spark-Devoxx2014.pdf References
 • 58. • Number of partitions • when reading very large files, use coalesce(N) to reduce the number of executors • without a repartition, one executor processes several partitions • for long CPU-heavy computations, increase the number of executors with repartition(N) • Number of executors • at least 2x the maximum partition count within a job • Executor memory • using at most 75% of the machine's memory is recommended • minimum heap size 8GB • keep the maximum heap size under 40GB (watch GC) • memory usage depends on the StorageLevel and the serialization format • G1GC settings • https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html • general tuning • http://spark.apache.org/docs/latest/tuning.html Tips
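A rough sketch of the partition-count tips above (the path and the partition counts are placeholders, not recommendations):

val raw = sc.textFile("hdfs:///data/huge-input/*", 2000)   // may produce far more partitions than needed

// Shrink without a shuffle, e.g. before writing output or when tasks are tiny:
val fewer = raw.coalesce(200)

// Grow (with a shuffle) before a long CPU-bound transformation:
val wider = raw.repartition(800)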
 • 59. • Instead of computing min, max, sum, and mean in separate passes, get them all at once with stats() • count, sum, min, max, mean, stdev, variance, sampleStdev, sampleVariance • How to spot shuffle problems • in the Web UI, look for stages/partitions that run long or have large input/output • KryoSerializer • /conf/spark-defaults.conf • "spark.serializer", "org.apache.spark.serializer.KryoSerializer" • recommended together with StorageLevel.MEMORY_ONLY_SER • Task not serializable: java.io.NotSerializableException • fix by creating and using such objects inside the functions that run on executors (so they are not shipped from the driver) • https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html Tips
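Two small sketches of these tips: stats() computes all summary statistics of an RDD[Double] in one pass, and the Kryo setting can also be applied on a SparkConf before the SparkContext is created (the numbers below are illustrative):

val latencies = sc.parallelize(Seq(12.0, 7.5, 31.2, 3.3))
val s = latencies.stats()   // org.apache.spark.util.StatCounter
println(s"count=${s.count} sum=${s.sum} min=${s.min} max=${s.max} mean=${s.mean} stdev=${s.stdev} variance=${s.variance}")

// Equivalent of the spark-defaults.conf entry, set programmatically on a SparkConf:
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")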