Spark is an open-source cluster computing framework. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark supports batch, streaming, and interactive computations in a unified framework. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across a cluster for parallel processing. RDDs support transformations like map and filter that return new RDDs and actions that return values to the driver program.
3. Spark’s Goal
Support batch, streaming, and interactive computations
in a unified framework
http://strataconf.com/stratany2013/public/schedule/detail/30959
5. To validate our hypothesis
that specialized frameworks provide value over general ones,
we have also built a new framework
on top of Mesos called Spark,
optimized for iterative jobs
where a dataset is reused in many parallel operations,
and shown that Spark can outperform Hadoop by 10x
in iterative machine learning workloads.
6. http://spark.apache.org/docs/latest/cluster-overview.html
• Cluster Manager
• external service for acquiring resources on the cluster
• Standalone, Mesos, YARN
• Worker node
• Any node that can run application code in the cluster
• Application
• driver program + executors
• SparkContext
• application session
• connection to a cluster
7. http://spark.apache.org/docs/latest/cluster-overview.html
• Driver Program
• Process running the main()
• create SparkContext
• Executor
• process launched on a worker node
• Each application has its own executors
• Long-running process that executes many small tasks
• keeps data in memory/disk storage
• Task
• Unit of work that will be sent to one executor
12. Shell in Local Mode
REPL (Read-Eval-Print Loop) = Interactive Shell
import org.apache.spark.{SparkContext, SparkConf}
val sc = new SparkContext("local[*]", "Spark shell", new SparkConf())
scala>
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@36c783ca
The shell automatically creates a SparkContext (sc) as above and then shows the prompt
• Advantages
• local testing/debugging and unit tests, which are hard in distributed processing
• lazy evaluation: syntax can be checked before any data is actually processed
• Disadvantages
• memory limits
• jar loading differs by mode (YARN/Mesos, shell vs. submit)
13. $ cd ../
$ tar zxvf spark/spark-1.5.1-bin-spark-hadoop2.4.tgz
$ cd spark-1.5.1-bin-spark-hadoop2.4/
$ bin/spark-shell --master local[*]
Run Spark Shell
14. $ cat init.script
import java.lang.Runtime
println(s"cores = ${Runtime.getRuntime.availableProcessors}")
$ SPARK_PRINT_LAUNCH_COMMAND=1 bin/spark-shell --master local[*] -i init.script
Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/
bin/java -cp /Users/taewook/spark-1.5.1-bin-spark-hadoop2.4/conf/:/Users/taewook/
spark-1.5.1-bin-spark-hadoop2.4/lib/spark-assembly-1.5.1-hadoop2.4.0.jar:/Users/
taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/Users/
taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/Users/
taewook/spark-1.5.1-bin-spark-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar -
Dscala.usejavacp=true -Xms1g -Xmx1g org.apache.spark.deploy.SparkSubmit --master
local[*] --class org.apache.spark.repl.Main --name Spark shell spark-shell -i
init.script
========================================
...
Type :help for more information.
...
Spark context available as sc.
...
Loading init.script...
import java.lang.Runtime
cores = 4
scala>
scala> :load init.script
Loading init.script...
import java.lang.Runtime
cores = 4
Run Spark Shell
:paste enter paste mode: all input up to ctrl-D is compiled together
:cp <path> add a jar or directory to the classpath
:history [num] show the history (the optional num is how many commands to show)
history is kept in ~/.spark_history
The java launch command reveals the classpath, JVM options, and so on
Running initialization commands such as function/class definitions up front is convenient
16. An RDD is a read-only collection of objects
partitioned across a set of machines
that can be rebuilt if a partition is lost.
17. • Read-only = Immutable
• Parallelism ➔ distributed processing
• can stay cached for a long time ➔ performance
• Transformation for change
• repeated copying of data ➔ worse performance, wasted space
• overcome with laziness
Core abstraction at the core of Spark
An RDD is a read-only collection of objects
partitioned across a set of machines
that can be rebuilt if a partition is lost.
Resilient Distributed Dataset
18. • Partitioned = Distributed
• more partitions = more parallelism
Resilient Distributed Dataset
Core abstraction at the core of Spark
An RDD is a read-only collection of objects
partitioned across a set of machines
that can be rebuilt if a partition is lost.
19. Resilient Distributed Dataset
• can be rebuilt = Resilient
• recovers lost data partitions
• by replaying the data lineage
• can be cached
• caching shortens the lineage, so recovery is faster
Core abstraction at the core of Spark
An RDD is a read-only collection of objects
partitioned across a set of machines
that can be rebuilt if a partition is lost.
20. • Similar to the Scala Collection API, plus operations on distributed data
• map(), filter(), reduce(), count(), foreach(), …
• http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
• Transformations
• return a new RDD
• deterministic: re-running after a failure always yields the same result
• lazy evaluation
• Actions
• return a final value (some other data type)
• execution actually starts from the first RDD (or from the cached RDD, if cached)
RDD Operations
http://training.databricks.com/workshop/sparkcamp.pdf
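The distinction in one short sketch (not from the deck; any RDD will do): transformations come back immediately as new RDDs, while the action at the end returns a plain value to the driver.
val lines = sc.textFile("README.md")               // source RDD[String], nothing read yet
val sparkLines = lines.filter(_.contains("Spark")) // transformation: returns a new RDD
val n: Long = sparkLines.count()                   // action: returns a value to the driver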
22. • Laziness = Lazily constructed
• deferred computation: operations are postponed until they are needed
• separates evaluation from execution
• minimal error checking happens before execution
• no need to store intermediate RDD results
• Intermediate RDDs are not materialized
• Immutability & Laziness
• Immutability ➔ makes Laziness possible
• no side effects, so transformations can be combined
• combined steps become "stages" (an optimization)
• ➔ performance, enables distributed processing
Resilient Distributed Dataset
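A minimal sketch of that pipelining (not from the deck): two chained narrow transformations only build lineage; nothing runs, and the intermediate RDD is never materialized, until the action fires.
val doubled = sc.parallelize(1 to 10).map(_ * 2) // no job yet, just lineage
val big = doubled.filter(_ > 10)                 // still no job
println(big.toDebugString)                       // shuffle-free lineage = a single stage
println(big.collect().mkString(","))             // the action runs map+filter in one pass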
23. Creating RDDs
• parallelizing a collection
• loads everything into the memory of the single driver machine
• for prototyping and testing only
• loading an external data set
• reads from an external source
• sc.textFile(): file://, hdfs://, s3n://
• sc.hadoopFile(), sc.newAPIHadoopFile()
• sqlContext.sql(), JdbcRDD(), …
scala> val numbers = sc.parallelize(1 to 10)
numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]
at parallelize at <console>:21
scala> val textFile = sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2]
at textFile at <console>:21
24. scala> val numbers = sc.parallelize(1 to 10)
numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]
at parallelize at <console>:21
scala> numbers.partitions.length
res0: Int = 4
scala> numbers.glom().collect()
res1: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5),
Array(6, 7), Array(8, 9, 10))
scala> val numbersWith2Partitions = sc.parallelize(1 to 10, 2)
numbersWith2Partitions: org.apache.spark.rdd.RDD[Int] =
ParallelCollectionRDD[2] at parallelize at <console>:21
scala> numbersWith2Partitions.partitions.length
res2: Int = 2
scala> numbersWith2Partitions.glom().collect()
res3: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(6, 7,
8, 9, 10))
Partitions
numbers.mapPartitionsWithIndex()
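The mapPartitionsWithIndex() hint above, expanded into a small sketch (not in the deck) that shows which elements landed in which partition without glom():
val tagged = numbers.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(n => s"partition $idx -> $n")  // tag each element with its partition index
}
tagged.collect().foreach(println)          // e.g. "partition 0 -> 1", "partition 0 -> 2", ...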
26. Word Count
val textFile = sc.textFile("README.md", 4)
val words = textFile.flatMap(line => line.split("[\\s]+"))
val realWords = words.filter(_.nonEmpty)
val wordTuple = realWords.map(word => (word, 1))
val groupBy = wordTuple.groupByKey(2)
val wordCount = groupBy.mapValues(value => value.reduce(_ + _))
wordCount.collect().sortBy(-_._2)
27. Word Count
val textFile = sc.textFile("README.md", 4)
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
val words = textFile.flatMap(line => line.split("[\\s]+"))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:23
val realWords = words.filter(_.nonEmpty)
realWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at filter at <console>:25
val wordTuple = realWords.map(word => (word, 1))
wordTuple: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:27
val groupBy = wordTuple.groupByKey(2)
groupBy: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[5] at groupByKey at
<console>:29
val wordCount = groupBy.mapValues(value => value.reduce(_ + _))
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at mapValues at
<console>:31
wordCount.collect().sortBy(-_._2)
res0: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10),
(##,8), (run,7), (can,6), (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ...
wordCount.saveAsTextFile("wordcount.txt")
28. • The main program is executed on the Spark Driver
• Transformations are executed on the Spark Workers
• Actions may transfer data from the Workers to the Driver
collect(), countByKey(), countByValue(), collectAsMap()
➔ • bounded output: count(), take(N)
• unbounded output: saveAsTextFile()
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
Keep code that runs on executors (on workers) separate from code that runs on the driver
Apart from actions and accumulators, the driver cannot receive data from executors
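A classic sketch of that boundary (not from the deck, using the Spark 1.x accumulator API): a driver-side var mutated inside a transformation only changes each task's own copy, while an accumulator is merged back to the driver.
var localCounter = 0
val acc = sc.accumulator(0)          // Spark 1.x accumulator API
sc.parallelize(1 to 100).foreach { n =>
  localCounter += 1                  // mutates the task's own deserialized copy
  acc += 1                           // accumulators are merged back on the driver
}
println(localCounter)                // still 0 on the driver
println(acc.value)                   // 100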
29. Data Lineage of RDD
scala> wordCount.toDebugString
res1: String =
(2) MapPartitionsRDD[6] at mapValues at <console>:31 []
| ShuffledRDD[5] at groupByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
scala> wordCount.dependencies.head.rdd
res2: org.apache.spark.rdd.RDD[_] = ShuffledRDD[5] at groupByKey
at <console>:29
scala> textFile.dependencies.head.rdd
res3: org.apache.spark.rdd.RDD[_] = README.md HadoopRDD[0] at
textFile at <console>:21
scala> textFile.dependencies.head.rdd.dependencies
res4: Seq[org.apache.spark.Dependency[_]] = List()
Every RDD tracks its parent RDDs ➔ the foundation of DAG scheduling and recovery
30. Data Lineage of RDD
scala> wordCount.toDebugString
res1: String =
(2) MapPartitionsRDD[6] at mapValues at <console>:31 []
| ShuffledRDD[5] at groupByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
[Diagram: the lineage above drawn as steps and stages. Stage 0: textFile ➔ flatMap ➔ filter ➔ map (parent: Nil). Stage 1: groupByKey ➔ mapValues (parent: Stage 0). The shuffle is the boundary between the two stages.]
34. Schedule & Execute tasks
$ bin/spark-shell --master local[3]
...
val textFile = sc.textFile("README.md", 4)
...
val groupBy = wordTuple.groupByKey(2)
35. val groupBy = wordTuple.groupByKey(2)
groupBy: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[5] at groupByKey at
<console>:29
scala> groupBy.collect()
res5: Array[(String, Iterable[Int])] = Array((package,CompactBuffer(1)), (this,CompactBuffer(1)
), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-
version),CompactBuffer(1)), (Because,CompactBuffer(1)), (Python,CompactBuffer(1, 1)),
(cluster.,CompactBuffer(1)), (its,CompactBuffer(1)), ([run,CompactBuffer(1)),
(general,CompactBuffer(1, 1)), (YARN,,CompactBuffer(1)), (have,CompactBuffer(1)), (pre-
built,CompactBuffer(1)), (locally.,CompactBuffer(1)), (locally,CompactBuffer(1, 1)),
(changed,CompactBuffer(1)), (sc.parallelize(1,CompactBuffer(1)), (only,CompactBuffer(1)),
(several,CompactBuffer(1)), (learning,,CompactBuffer(1)), (basic,CompactBuffer(1)),
(first,CompactBuffer(1)), (This,CompactBuffer(1, 1)), (documentation,CompactBuffer(1, 1, 1)),
(Confi...
• HashMap within each partition
• no map-side aggregation (the combiner, in MapReduce terms)
• a single key-value pair (a key with all its values) must fit in memory (risk of OOM / spill to disk)
groupByKey()
rdd.groupByKey().mapValues(value => value.reduce(func))
= rdd.reduceByKey(func)
When a substitute works, prefer reduceByKey, aggregateByKey, foldByKey, or combineByKey over groupByKey
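For reference, a hedged sketch (not in the deck) of how those alternatives look for the same word count; all four aggregate map-side, unlike groupByKey:
val c1 = wordTuple.reduceByKey(_ + _)
val c2 = wordTuple.foldByKey(0)(_ + _)
val c3 = wordTuple.aggregateByKey(0)(_ + _, _ + _)              // (seqOp, combOp)
val c4 = wordTuple.combineByKey((v: Int) => v,                  // createCombiner
                                (acc: Int, v: Int) => acc + v,  // mergeValue
                                (a: Int, b: Int) => a + b)      // mergeCombiners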
36. scala> val wordCountReduceByKey = wordTuple.reduceByKey(_ + _, 2)
wordCountReduceByKey: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:29
scala> wordCountReduceByKey.toDebugString
res6: String =
(2) ShuffledRDD[7] at reduceByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
scala> wordCountReduceByKey.collect().sortBy(-_._2)
res7: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10), (##,8), (run,7), (can,6)
, (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ...
reduceByKey()
scala> sc.setLogLevel("INFO")
scala> wordCountReduceByKey.collect().sortBy(-_._2)
...
... INFO DAGScheduler: Got job 3 (collect at <console>:32) with 2 output partitions
... INFO DAGScheduler: Final stage: ResultStage 7(collect at <console>:32)
... INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 6)
... INFO DAGScheduler: Missing parents: List()
... INFO DAGScheduler: Submitting ResultStage 7 (ShuffledRDD[7] at reduceByKey at <console>:29), which has no missing
parents
... INFO MemoryStore: ensureFreeSpace(2328) called with curMem=109774, maxMem=555755765
... INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 2.3 KB, free 529.9 MB)
... INFO MemoryStore: ensureFreeSpace(1378) called with curMem=112102, maxMem=555755765
... INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1378.0 B, free 529.9 MB)
... INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:60479 (size: 1378.0 B, free: 530.0 MB)
... INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:861
... INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 7 (ShuffledRDD[7] at reduceByKey at <console>:29)
...
res9: Array[(String, Int)] = Array((the,21), (Spark,14), (to,14), (for,12), (a,10), (and,10), (##,8), (run,7), (can,6)
, (is,6), (on,6), (also,5), (in,5), (of,5), (with,4), (if,4), ...
38. scala> sc.setLogLevel("INFO")
scala> val textFile = sc.textFile("README.md", 4)
scala> val words = textFile.flatMap(line => line.split("[\\s]+"))
scala> val realWords = words.filter(_.nonEmpty)
scala> val wordTuple = realWords.map(word => (word, 1))
scala> wordTuple.cache()
scala> val groupBy = wordTuple.groupByKey(2)
scala> val wordCount = groupBy.mapValues(value => value.reduce(_ + _))
scala> wordCount.toDebugString
res2: String =
(2) MapPartitionsRDD[6] at mapValues at <console>:31 []
| ShuffledRDD[5] at groupByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
scala> wordCount.collect().sortBy(-_._2)
...
... INFO BlockManagerInfo: Added rdd_4_0 in memory on localhost:60641 (size: 9.9 KB, free: 530.0 MB)
... INFO BlockManagerInfo: Added rdd_4_1 in memory on localhost:60641 (size: 9.5 KB, free: 530.0 MB)
... INFO BlockManagerInfo: Added rdd_4_2 in memory on localhost:60641 (size: 10.7 KB, free: 530.0 MB)
... INFO BlockManagerInfo: Added rdd_4_3 in memory on localhost:60641 (size: 8.0 KB, free: 530.0 MB)
...
Cache
39. scala> val wordCountReduceByKey = wordTuple.reduceByKey(_ + _, 2)
scala> wordCountReduceByKey.toDebugString
res4: String =
(2) ShuffledRDD[7] at reduceByKey at <console>:29 []
+-(4) MapPartitionsRDD[4] at map at <console>:27 []
| CachedPartitions: 4; MemorySize: 38.1 KB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
| MapPartitionsRDD[3] at filter at <console>:25 []
| MapPartitionsRDD[2] at flatMap at <console>:23 []
| MapPartitionsRDD[1] at textFile at <console>:21 []
| README.md HadoopRDD[0] at textFile at <console>:21 []
scala> wordCountReduceByKey.collect().sortBy(-_._2)
...
... INFO BlockManager: Found block rdd_4_0 locally
... INFO BlockManager: Found block rdd_4_1 locally
... INFO BlockManager: Found block rdd_4_2 locally
... INFO BlockManager: Found block rdd_4_3 locally
...
Cache
• Spark is fast because it is optimized for iterative computation, but without caching that advantage is lost
• with no cache, everything is recomputed from the first RDD
• LRU (Least-Recently-Used) eviction policy
• no need to cache an RDD in memory unless it will be reused soon
• manual cache release: RDD.unpersist()
• the default StorageLevel (MEMORY_ONLY) keeps objects deserialized in memory
• more memory, less CPU (serialized levels trade the other way)
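A short sketch of those knobs (not from the deck; an RDD's storage level can be set only once, so this uses a fresh RDD rather than the already-cached wordTuple):
import org.apache.spark.storage.StorageLevel

val pairs = realWords.map(word => (word, 1))
pairs.persist(StorageLevel.MEMORY_ONLY_SER) // serialized: less memory, more CPU
pairs.count()                               // the first action materializes the cache
pairs.unpersist()                           // manual release instead of waiting for LRU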
45. DataFrame
• Distributed collection of rows organized into named columns.
• inspired by DataFrame in R and Pandas in Python
• RDD with schema (org.apache.spark.sql.SchemaRDD before v1.3)
• Python, Scala, Java, and R (via SparkR)
• Making Spark accessible to everyone (including people who only know RDBs or R)
• data scientists, engineers, statisticians, ...
http://www.slideshare.net/databricks/2015-0616-spark-summit
48. DataFrame API
• How to use the API
• http://spark.apache.org/docs/latest/sql-programming-guide.html
• http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science
• https://github.com/yu-iskw/spark-dataframe-introduction/blob/master/doc/dataframe-introduction.md
• Books are scarce; for the latest details, read the source code
• Example Code
• /spark/examples/src/main/scala/org/apache/spark/examples/sql/RDDRelation.scala
• TestSuite
• /spark/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
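As a taste of the API those links cover, a hedged sketch (not in the deck) using the Spark 1.5-era DataFrameReader on the bundled example data:
val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.printSchema()                   // inferred schema: name, age
df.select("name").show()
df.filter(df("age") > 21).show()   // Column expressions instead of SQL strings
df.groupBy("age").count().show()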
50. Same performance for all languages
http://www.slideshare.net/databricks/spark-dataframes-simple-and-fast-analytics-on-structured-data-at-spark-summit-2015
51. • Simple tasks easy with DataFrame API
• Complex tasks possible with RDD API
scala> import sqlContext.implicits._
scala> case class Person(name: String, age: Int)
scala> val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p =>
Person(p(0), p(1).trim.toInt)).toDF()
people: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> people.registerTempTable("people")
scala> val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
teenagers: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> teenagers.show()
+------+---+
| name|age|
+------+---+
|Justin| 19|
+------+---+
scala> val teenagersRdd = teenagers.rdd
teenagersRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[8] at rdd at <console>:24
scala> teenagersRdd.toDebugString
res2: String =
(2) MapPartitionsRDD[8] at rdd at <console>:24 []
| MapPartitionsRDD[7] at rdd at <console>:24 []
| MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:26 []
| MapPartitionsRDD[3] at map at <console>:26 []
| MapPartitionsRDD[2] at map at <console>:26 []
| MapPartitionsRDD[1] at textFile at <console>:26 []
| examples/src/main/resources/people.txt HadoopRDD[0] at textFile at <console>:26 []
scala> teenagersRdd.collect()
res3: Array[org.apache.spark.sql.Row] = Array([Justin,19])
54. • Apache Spark User List
• http://apache-spark-user-list.1001560.n3.nabble.com/
• Devops Advanced Class
• http://training.databricks.com/devops.pdf
• Intro to Apache Spark
• http://training.databricks.com/workshop/sparkcamp.pdf
• Apache Spark Tutorial
• http://cdn.liber118.com/workshop/fcss_spark.pdf
• Anatomy of RDD : Deep dive into Spark RDD abstraction
• http://www.slideshare.net/datamantra/anatomy-of-rdd
• A Deeper Understanding of Spark’s Internals
• https://spark-summit.org/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf
• Scala and the JVM for Big Data: Lessons from Spark
• https://deanwampler.github.io/polyglotprogramming/papers/ScalaJVMBigData-SparkLessons.pdf
• Lightning Fast Big Data Analytics with Apache Spark
• http://www.virdata.com/wp-content/uploads/Spark-Devoxx2014.pdf
References
58. • Number of partitions (see the sketch after this list)
• when reading a very large file, use coalesce(N) to cut the number of executors used
• one executor then handles several partitions without a repartition
• for long-running CPU-bound work, raise the number of executors with repartition(N)
• Number of executors
• at least twice the job's maximum partition count
• Executor memory
• using at most 75% of the machine's memory is recommended
• minimum heap size: 8GB
• keep the maximum heap under 40GB (watch GC behavior)
• memory usage depends on the StorageLevel and the serialization format
• G1GC configuration
• https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
• Tuning in general
• http://spark.apache.org/docs/latest/tuning.html
Tips
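A minimal sketch of the coalesce/repartition tip (not from the deck; the input path is hypothetical):
val big = sc.textFile("hdfs:///path/to/huge-input") // hypothetical path, many partitions
val fewer = big.coalesce(64)                        // shrink partition count, no full shuffle
val wider = fewer.repartition(256)                  // full shuffle to raise parallelism for CPU-heavy work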
59. • Don't compute min, max, sum, and mean in separate passes; get them all at once with stats() (see the sketch after this list)
• count, sum, min, max, mean, stdev, variance, sampleStdev, sampleVariance
• How to spot shuffle problems
• in the Web UI, look for stages/partitions that run long or have large input/output
• KryoSerializer
• /conf/spark-defaults.conf
• "spark.serializer", "org.apache.spark.serializer.KryoSerializer"
• recommended together with StorageLevel.MEMORY_ONLY_SER
• Task not serializable: java.io.NotSerializableException
• create and use the object inside the function that runs on the executor
• https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
Tips
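The stats() tip as a sketch (not from the deck); on an RDD[Double] it returns a StatCounter in a single pass:
val nums = sc.parallelize(1 to 100).map(_.toDouble)
val s = nums.stats()  // one job: count, mean, sum, min, max, stdev, variance, ...
println(s"count=${s.count} mean=${s.mean} min=${s.min} max=${s.max} stdev=${s.stdev}")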