ML Pipelines with Apache Spark and Apache Beam - Ottawa Reactive Meetup, August 16, 2018

An introduction to ML pipelines with Apache Spark and Apache Beam, presented at the Ottawa Reactive Meetup on August 16, 2018.

ML Pipelines with Apache Spark & a little Apache Beam
Ottawa Reactive Meetup 2018
Hella-Legit

Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC (think committer with tenure)
● Contributor to a lot of other projects (including BEAM)
● Previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● Co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare: http://www.slideshare.net/hkarau
● Linkedin: https://www.linkedin.com/in/holdenkarau
● Github: https://github.com/holdenk
● Related Spark videos: http://bit.ly/holdenSparkVideos

Who do I think you all are?
● Nice people*
● Mostly engineers
● Familiar with one of Java, Scala, or Python
● May or may not know Apache Spark
Amanda

What is in store for our adventure?
● Why train models on distributed systems
○ Have big data? It’s better than down sampling (more data often wins over better algorithms)
○ Have small data? Enjoy fast hyper parameter tuning and taking an early coffee break (but not too long)
● Why Apache Spark and Apache Beam in general
○ Answer questions in minutes instead of days* (*some of the time)
● “Classic” ML pipelines & then some deep learning
Ada Doglace

What is out of scope for today:
● The details backing these algorithms
○ If you ask I’ll just say “gradient descent” and run away after throwing a smoke bomb on the floor.
● Questions about stack traces (j/k)
Ada Doglace

So why might you be doing this?
● Maybe you’ve built a system with “hand tuned” weights
● Your static list of [X, Y, Z] no longer cuts it
● Your system is overwhelmed with abuse & your budget for handling it is less than an intern
● You want a new job and ML sounds nicer than Perl on the resume nowadays
Amanda

Why did I get into this?
● I built a few search systems
● We spent a lot of time… guessing… I mean tuning
● We hired some smart people from Google
● Added ML magic
● Things went downhill from there (and then uphill at another local maximum later)

What tools are we going to use today?
● Apache Spark - Model training (plus fits into your ETL)
● emacs/vim - Looking at random output
● spark-testing-base - You still need unit tests
● (sort of) spark-validation - Validating your jobs
● csv files - Hey, at least it’s not XML
● XML - Ahhh crap
Demos will be in Scala (but I’ll try and avoid the odd things like _s) & you can stop me if it’s confusing.
Mohammed Mustafa

What is Spark?
● General purpose distributed system
○ Built in Scala with an FP inspired API
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when your data is too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets (a quick DataFrame sketch follows)

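For a taste of the second abstraction: the word count later in the deck uses the RDD API, so here is a hedged Scala sketch of the same thing on DataFrames. It assumes a SparkSession named spark and the same src input path as the word count slide; none of this is from the deck itself.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

val spark = SparkSession.builder().appName("wordcount-df").getOrCreate()
import spark.implicits._

// Same word count as the RDD version, but on the DataFrame abstraction.
val lines = spark.read.text(src)                                   // a single "value" column
val words = lines.select(explode(split($"value", " ")).as("word")) // one row per word
val wordCount = words.groupBy("word").count()
wordCount.show()
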
When we say distributed we mean...

Why people come to Spark:
“Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?”
dougwoods

Why people come to Spark:
“My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...”
brownpau

Plus a little magic :)
Steven Saus

The different pieces of Spark
(diagram of the Spark stack: Apache Spark core with APIs in Scala, Java, Python, & R; SQL, DataFrames & Datasets; Structured Streaming; Spark ML; MLLib; Streaming; Bagel & GraphX; GraphFrames)
Paul Hudson

Required: Word count (in Python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile("output")
Photo By: Will Keightley

Companion notebook funtimes:
● Small companion Jupyter notebook to explore with:
○ Python: http://bit.ly/hkMLExample
○ Scala: http://bit.ly/sparkMLScalaNB
● If you want to use it you will need access to Apache Spark
○ Install from http://spark.apache.org
○ Or get access to one of the online notebook environments (Google Dataproc, DataBricks Cloud, Microsoft Spark HDInsights Cluster Notebook, etc.)
David DeHetre

Transformers, Estimators and Pipelines
● Transformers transform a DataFrame into another DataFrame
● Estimators can be trained on a DataFrame to produce a transformer (a minimal sketch follows)
● Pipelines chain together multiple transformers and estimators
A.Davey

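To make that distinction concrete, here is a minimal Scala sketch (not from the deck) using the same StringIndexer we build later, with df standing in for the DataFrame we load in a couple of slides:

import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}

// An Estimator: it has to see the data before it can do anything useful.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("category-index")

// fit() scans the DataFrame and returns a Transformer (a fitted model).
val indexerModel: StringIndexerModel = indexer.fit(df)

// A Transformer is just a DataFrame -> DataFrame function; no training involved.
val indexed = indexerModel.transform(df)
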
Let’s start with loading some data
● Genuine big data, doesn’t fit on a floppy disk
○ It’s ok if your inputs do fit on a floppy disk, buuuuut more data generally works better
● In all seriousness, it’s not a bad practice to down-sample first while you’re building your pipeline so you can find your errors fast (Spark pipelines discarded some type information -- sorry!) - see the sampling sketch below
Jess Johnson

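One low-effort way to do that down-sampling during development, as a sketch (the 1% fraction and the seed are arbitrary, df is the DataFrame we load next):

// Work against a small, cheap sample while wiring the pipeline together;
// switch back to the full df once the stages behave.
val devDf = df.sample(withReplacement = false, fraction = 0.01, seed = 42)
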
Loading with sparkSQL & spark-csv
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option(“key”, “value”)
○ spark-csv ones we will use are header & inferSchema
● format(“formatName”)
○ built in formats include parquet, jdbc, etc. Today we will use com.databricks.spark.csv
● load(“path”)
Jess Johnson

Loading with sparkSQL & spark-csv
val df = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("resources/adult.data")
Jess Johnson

Let’s explore training a Decision Tree
● Step 1: Data loading (done!)
● Step 2a: Data prep (produce features, remove complete garbage, etc.)
● Step 2b: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict

Data prep / cleaning
● We need to predict a double (can be 0.0, 1.0, but type must be double)
● We need to train with a vector of features**
** There is work to allow images and other things too.
Huang Yun Chung

Data prep / cleaning continued
// Combines a list of double input features into a vector
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "education-num"))
  .setOutputCol("features")
// String indexer converts a set of strings into doubles
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("category-index")
// Can be used to combine pipeline components together
val pipeline = new Pipeline()
  .setStages(Array(assembler, indexer))
Huang Yun Chung

So a bit more about that pipeline
● Each of our previous components has a “fit” & “transform” stage
● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
val model = pipeline.fit(df)
val prepared = model.transform(df)
Andrey

What does our pipeline look like so far?
Input Data -> Assembler -> Input Data + Vectors -> StringIndexer -> Input Data + Cat ID + Vectors
(The Assembler is a regular transformer - no fitting required. The StringIndexer, while not an ML learning algorithm, still needs to be fit.)

Let's train a model on our prepared data:
# Specify model
dt = DecisionTreeClassifier(labelCol="category-index", featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)

And predict the results on the same data:
pipeline_model.transform(df).select("prediction", "category-index").take(20)

What does our tree look like?
val tree = pipeline_model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println(tree.toDebugString)

What does our tree look like? ooooh
[info] If (feature 1 <= 12.5)
[info]  If (feature 0 <= 33.5)
[info]   If (feature 0 <= 26.5)
[info]    If (feature 0 <= 23.5)
[info]     If (feature 0 <= 21.5)
[info]      Predict: 0.0
[info]     Else (feature 0 > 21.5)
Win G

And predict the results on new data
// Option 1: Add empty/place-holder label data
import org.apache.spark.sql.functions.lit
pipeline_model.transform(df.withColumn("category", lit("dne"))).take(20)

I guess that looks ok? Let’s serve it!
● Waaaaaait - why is evaluate only on a dataframe?
● Ewwww - embedding Spark local mode in our webapp
○ The jar conflicts: they burn! & the performance will burn “later” :p
Options:
● See if your company has a model server, write an export function to match that glorious 90s C/C++ code base
● Write our own serving code & copy n’ paste the predict code (a save/reload sketch follows)
● Use someone else’s copy n’ paste project
Ambernectar 13

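If you can live with Spark staying in the scoring path (batch scoring, a small internal service, etc.), the lowest-tech version of “roll your own” is just persisting the fitted pipeline with Spark ML’s built-in writer and reloading it elsewhere. A sketch, with the path and newDf as placeholders:

import org.apache.spark.ml.PipelineModel

// Persist the whole fitted pipeline (assembler, fitted indexer, fitted tree).
pipeline_model.write.overwrite().save("/models/census-dt/v1")

// ...elsewhere, in whatever (Spark-equipped) process does the predicting:
val served = PipelineModel.load("/models/census-dt/v1")
val scored = served.transform(newDf)
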
But wait, Spark has PMML support right?
● Spark 2.4 timeframe for pipelines -- I’m sorry (but the code’s in master)
● Limited support in both models & general data prep (see the mllib export sketch below)
○ No general whole pipeline export yet either
● Serving options: write your own, license something, or AGPL code

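For a feel of what the limited support looks like today: some of the older org.apache.spark.mllib models mix in PMMLExportable and can be written out directly. A sketch only, with made-up data and path, and note this does nothing for ml Pipelines:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Only a handful of mllib models (KMeans, linear models, SVM, ...) support this.
val points = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(5.0, 6.0)))
val kmeansModel = KMeans.train(points, 2, 20)

// Writes the model out as PMML XML (String / OutputStream variants also exist).
kmeansModel.toPMML("/tmp/kmeans.pmml")
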
The state of serving is generally a mess
● One project which aims to improve this is KubeFlow
○ Goal is unifying training & serving experiences
● Despite the name, targeting more than just TensorFlow
● Doesn’t work with Spark yet, but it’s on my TODO list.

Pipeline API has many models (swapping one in is sketched below):
● org.apache.spark.ml.classification
○ BinaryLogisticRegressionClassification, DecisionTreeClassification, GBTClassifier, etc.
● org.apache.spark.ml.regression
○ DecisionTreeRegression, GBTRegressor, IsotonicRegression, LinearRegression, etc.
● org.apache.spark.ml.recommendation
○ ALS
carterse

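Because they all implement the same Estimator interface, trying a different algorithm is mostly a one-stage swap. A sketch reusing the deck’s assembler and indexer; the random forest and its hyperparameters are just an example:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier

// Same assembler & indexer stages as before, different final estimator.
val rf = new RandomForestClassifier()
  .setLabelCol("category-index")
  .setFeaturesCol("features")
  .setNumTrees(50)

val rfPipelineModel = new Pipeline()
  .setStages(Array(assembler, indexer, rf))
  .fit(df)
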
It’s not always a standalone microservice:
● Linear regression is awesome because I can “serve”* it inside as an embedding in my elasticsearch / solr query
● Batch prediction is pretty OK too for some things (sketch below)
○ Videos you may be interested in, etc.
● Sometimes hybrid systems
○ Off-line expensive models + on-line inexpensive models
○ At this point you should probably hire a data scientist though

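Batch prediction with the fitted pipeline is just a transform plus a write; a minimal sketch where tonightsDf, the id column, and the output path are all placeholders:

// Score tonight's batch and park the results where the serving layer can read them.
pipeline_model.transform(tonightsDf)
  .select("id", "prediction")
  .write
  .mode("overwrite")
  .parquet("/warehouse/predictions/2018-08-16")
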
Cross-validation
because saving a test set is effort
● Automagically* fit your model params
○ Things like max tree depth, min info gain, and other regularization.
● Because thinking is effort
● org.apache.spark.ml.tuning has the tools
● If you’re going to use this for auto-tuning please please save a test set
● Otherwise your models will look awesome and perform like a Ford Pinto
Jonathan Kotta

Cross-validation
because saving a test set is effort
// ParamGridBuilder constructs an Array of parameter combinations.
// (nb here is the estimator being tuned, e.g. a NaiveBayes stage; for our decision
// tree you would grid over dt.maxDepth instead, and CrossValidator also needs an
// Evaluator via setEvaluator before fit() will run.)
val paramGrid: Array[ParamMap] = new ParamGridBuilder()
  .addGrid(nb.smoothing, Array(0.1, 0.5, 1.0, 2.0))
  .build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
val cvModel = cv.fit(df)
val bestModel = cvModel.bestModel
Jonathan Kotta

False sense of security:
● A/B test please even if CV says many many $s
● Rank based things can have training bias with previous orders
● Non-displayed options: unlikely to be chosen
● Sometimes can find previous formulaic corrections
● Sometimes we can “experimentally” determine
● Other times we just hope it’s better than nothing
● Try and make sure your ML isn’t evil or re-encoding human biases but stronger

TensorFlowOnSpark, everyone loves mnist!
cluster = TFCluster.run(sc, mnist_dist_dataset.map_fun, args,
                        args.cluster_size, num_ps, args.tensorboard,
                        TFCluster.InputMode.SPARK)
if args.mode == "train":
    cluster.train(dataRDD, args.epochs)
Lida

Enter: TF.Transform
● For pre-processing of your data
○ e.g. where you spend 90% of your dev time anyways
● Integrates into serving time :D
● OSS
● Runs on top of Apache Beam, but current release not yet outside of GCP
○ On master this can run on Flink, but probably has bugs currently.
○ Please don’t use this in production today unless you’re on GCP/Dataflow
Kathryn Yengel

DO NOT USE THIS IN PRODUCTION TODAY
● I’m serious, I don’t want to die or cause the next financial meltdown with software I’m a part of
● By “today” I mean August 15 2018, but it’s probably going to not be great for at least a “little while”
Vladimir Pustovit
Tambako The Jaguar

Ooor from the chicago taxi data...
for key in taxi.DENSE_FLOAT_FEATURE_KEYS:
  # Preserve this feature as a dense float, setting nan's to the mean.
  outputs[key] = transform.scale_to_z_score(inputs[key])

for key in taxi.VOCAB_FEATURE_KEYS:
  # Build a vocabulary for this feature.
  outputs[key] = transform.string_to_int(
      inputs[key], top_k=taxi.VOCAB_SIZE, num_oov_buckets=taxi.OOV_SIZE)

for key in taxi.BUCKET_FEATURE_KEYS:
  outputs[key] = transform.bucketize(inputs[key],

Defining a Transform processing function
def preprocessing_fn(inputs):
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_int = tft.string_to_int(s)
    return {
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        's_int': s_int}

Analyzers vs Transforms
● Analyzers (e.g. mean, stddev, quantiles): reduce over a full pass of the data; implemented as a distributed data pipeline
● Transforms (e.g. normalize, multiply, bucketize): instance-to-instance (don’t change the batch dimension); pure TensorFlow

(diagram: the Analyze phase runs the analyzers - mean, stddev, quantiles - over the data and bakes their results into constant tensors, leaving normalize, multiply, and bucketize as a pure TensorFlow graph)

Some common use-cases...
● Scale to ...: tft.scale_to_z_score
● Bag of Words / N-Grams: tf.string_split, tft.string_to_int, tft.ngrams
● Bucketization: tft.quantiles, tft.apply_buckets
● Feature Crosses: tf.string_join, tft.string_to_int

BEAM Beyond the JVM: Current release
● Non JVM BEAM doesn’t work outside of Google’s environment yet
● tl;dr: uses grpc / protobuf
○ Similar to the common design but with more efficient representations (often)
● But exciting new plans to unify the runners and ease the support of different languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
● If this is exciting, you can come join me on making BEAM work in Python 3
○ Yes, we still don’t have that :(
○ But we’re getting closer & you can come join us on BEAM-2874 :D
Emma

BEAM Beyond the JVM: Master w/ experiments
(diagram: the portability support matrix for non-JVM runners on master - every entry is marked “*ish”)
Nick

So what does that look like?
(diagram: the driver talks over grpc to worker 1 through worker K, each running user code in a Docker container)

Updating your model
● The real world changes
● Online learning (streaming) is super cool, but hard to version
● Iterative batches: automatically train on new data, deploy the model, and A/B test (a small sketch follows)
● But A/B testing isn’t enough -- bad data can result in wrong or even illegal results (ask me after a bud light lime)

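A hedged sketch of the iterative-batch flavour: refit the same pipeline definition on fresh data and save the result to a versioned path so the A/B test can point at incumbent vs candidate. thisWeeksDf, the path, and a Scala dt stage (the classifier was defined in Python earlier in the deck) are all assumptions:

import org.apache.spark.ml.Pipeline

// Retrain the same pipeline definition on this week's data...
val retrained = new Pipeline()
  .setStages(Array(assembler, indexer, dt))
  .fit(thisWeeksDf)

// ...and version the artifact so serving / the A/B test can pick old vs new.
retrained.write.overwrite().save("/models/census-dt/2018-08-16")
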
So why should you test & validate?
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark

Validation
● For now, checking file sizes & execution time seems like the most common best practice (from survey; a size-check sketch follows)
● spark-validator is still in early stages and not ready for production use, but an interesting proof of concept
● Doesn’t need to be done in your Spark job (can be done in your scripting language of choice with whatever job control system you are using)
● Sometimes your rules will misfire and you’ll need to manually approve a job - that is ok!
● Remember those property tests? Could be great validation rules!
Photo by: Paul Schadler

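As an illustration of the file-size flavour of that check, a minimal Scala sketch using the Hadoop FileSystem API; the output path and the 1 GB threshold are made up for the example:

import org.apache.hadoop.fs.{FileSystem, Path}

// After the job finishes, refuse to publish output that is suspiciously small.
val fs = FileSystem.get(sc.hadoopConfiguration)
val outputBytes = fs.getContentSummary(new Path("/warehouse/output/2018-08-16")).getLength

if (outputBytes < 1L * 1024 * 1024 * 1024) {  // last good run was a few GB
  throw new Exception(s"Output is only $outputBytes bytes - not publishing, go look at the job")
}
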
Using a Spark accumulator for validation:
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
val records = input.map{ x =>
  if (isValid(x)) ok += 1 else bad += 1
  // Actual parse logic here
}
// An action (e.g. count, save, etc.)
if (bad.value > 0.1 * ok.value) {
  throw new Exception("bad data - do not use results")
  // Optional cleanup
}
// Mark as safe
P.S: If you are interested in this check out spark-validator (still early stages).
Found Animals Foundation

Validating records read matches our expectations:
val vc = new ValidationConf(tempPath, "1", true,
  List[ValidationRule](
    new AbsoluteSparkCounterValidationRule("recordsRead", Some(30), Some(1000)))
)
val sqlCtx = new SQLContext(sc)
val v = Validation(sc, sqlCtx, vc)
// Business logic goes here
assert(v.validate(5) === true)
Photo by Dvortygirl

Common ML Specific Validation
● Number of iterations
○ Did I converge super quickly or slowly compared to last time? Could indicate junk data.
● CV model performance versus previous run (a comparison sketch follows)
● Performance on a “fixed” test set (periodically manually refresh)
● Shadow run model on input stream - % of failures or missing results

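One way to wire up the “versus previous run” check, as a hedged sketch: testDf, where previousAccuracy comes from, and the 5% tolerance are all placeholders for whatever your job control system stores.

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Score a held-out set with the freshly trained pipeline.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("category-index")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val currentAccuracy = evaluator.evaluate(pipeline_model.transform(testDf))

// previousAccuracy comes from wherever the last run recorded its metrics.
if (currentAccuracy < previousAccuracy - 0.05) {
  throw new Exception(
    s"Accuracy dropped from $previousAccuracy to $currentAccuracy - not shipping this model")
}
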
(book covers shown: Learning Spark; Fast Data Processing with Spark (out of date); Fast Data Processing with Spark, 2nd edition; Advanced Analytics with Spark; Spark in Action; High Performance Spark; Learning PySpark)

High Performance Spark!
You can buy it today on Amazon.ca.
Not a lot of ML focus but some!
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print rather than e-book.

What about the code lab?
● https://github.com/holdenk/spark-intro-ml-pipeline-workshop
Chocodyno

k thnx bye :)
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
I need to give a testing talk next month, help a “friend” out. Will tweet results “eventually” @holdenkarau
Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca) if you feel comfortable doing so :)
Give feedback on this presentation: http://bit.ly/holdenTalkFeedback
