I strongly believe in the combination of Apache Spark with Java. In this tutorial, prepared for NCDevCon, we go through the basics of Spark and two examples: a basic ingestion and an analytics example based on joins and group by. Follow me @jgperrin.
5. Agenda
๏ What is Spark?
๏ What can I do with Spark?
๏ What is a Spark app, anyway?
๏ Install a bunch of software
๏ A first example
๏ Understand what just happened
๏ Another example, slightly more complex, because you are now ready
๏ But now, sincerely what just happened?
๏ More and more examples (time permitting!)
6. Caution
First time I am doing a hands-on tutorial
Tons of content
Unknown crowd
Unknown setting
12. An Analytics Operating System?
Hardware
OS
Distributed OS
Analytics OS
Apps
14. Use Cases
๏ NCEatery.com
๏ Restaurant analytics
๏ 1.57×10^21 datapoints analyzed
๏ (they are hiring!)
๏ General compute
๏ Distributed data transfer
๏ IBM
๏ DSX (Data Science Experience)
๏ Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/
๏ Z
๏ Data wrangling solution
15. What Does a Typical App Look Like?
Connect to the cluster
Load Data
Do something with the data
Share the results
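
The four steps above can be sketched in Java as follows. This is only a minimal illustration, not code from the tutorial repository: the class name TypicalSparkApp, the doSomething helper, the word list, and the "local[*]" master (Spark running inside the JVM instead of on a real cluster) are all assumptions made for the example. It needs the spark-sql dependency on the classpath.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical class, for illustration only.
public class TypicalSparkApp {

  // Step 3: "do something with the data".
  // Here: keep only the words longer than 4 characters.
  static Dataset<Row> doSomething(Dataset<Row> df) {
    return df.filter("length(word) > 4");
  }

  public static void main(String[] args) {
    // Step 1: connect to the cluster.
    // "local[*]" is an assumption: it runs Spark in-process on all cores.
    SparkSession spark = SparkSession.builder()
        .appName("Typical app sketch")
        .master("local[*]")
        .getOrCreate();

    // Step 2: load data. In-memory sample data keeps the sketch
    // self-contained; a real app would read files or a database.
    List<String> words = Arrays.asList("spark", "java", "dataset", "row");
    Dataset<Row> df =
        spark.createDataset(words, Encoders.STRING()).toDF("word");

    // Step 3: do something with the data.
    Dataset<Row> result = doSomething(df);

    // Step 4: share the results (here, simply print them).
    result.show();

    spark.stop();
  }
}
```

In a real deployment, the master URL would usually come from the launcher or configuration rather than being hard-coded.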
18. Java Development Tools
๏ Java JDK 1.8
๏ http://bit.ly/javadk8
๏ Eclipse Oxygen
๏ http://bit.ly/eclipseo2
๏ Other nice-to-haves
๏ Maven
๏ SourceTree or git (command line)
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
http://www.eclipse.org/downloads/eclipse-packages/
19. Get the Code
๏ GitHub
๏ http://bit.ly/SparkJavaCookbookCode
https://github.com/jgperrin/net.jgp.labs.spark
git clone https://github.com/jgperrin/net.jgp.labs.spark.git
20. Getting Deeper
๏ Go to net.jgp.labs.spark.l000_ingestion.l000_csv
๏ Open CsvToDatasetApp.java
๏ Right click, Run As, Java Application
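
For readers without the repository at hand, here is a rough sketch of what a CSV-to-dataset ingestion can look like. It is not the content of CsvToDatasetApp.java: the class name CsvIngestionSketch, the readCsv helper, the temporary file, and its "id,title" columns are hypothetical, and the "header" and "inferSchema" options are just one reasonable choice.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical class, for illustration only.
public class CsvIngestionSketch {

  // Reads a CSV file into a dataset of rows (a dataframe).
  static Dataset<Row> readCsv(SparkSession spark, String path) {
    return spark.read()
        .format("csv")
        .option("header", "true")       // first line holds column names
        .option("inferSchema", "true")  // let Spark guess column types
        .load(path);
  }

  public static void main(String[] args) throws IOException {
    // Made-up sample data, written to a temporary file so the
    // sketch does not depend on any file from the repository.
    Path csv = Files.createTempFile("books", ".csv");
    Files.write(csv,
        Arrays.asList("id,title", "1,Title A", "2,Title B"),
        StandardCharsets.UTF_8);

    SparkSession spark = SparkSession.builder()
        .appName("CSV ingestion sketch")
        .master("local[*]") // assumption: local mode, no cluster
        .getOrCreate();

    Dataset<Row> df = readCsv(spark, csv.toString());
    df.printSchema(); // shows the inferred column types
    df.show();        // prints the first rows

    spark.stop();
  }
}
```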
29. Basic Analytics
๏ Go to net.jgp.labs.spark.l200_join.l030_count_books
๏ Open AuthorsAndBooksCountBooksApp.java
๏ Right click, Run As, Java Application
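
The join and group by idea behind this example can be sketched as below. Again, this is an illustration under assumptions, not the code of AuthorsAndBooksCountBooksApp.java: the class CountBooksSketch, the in-memory authors and books data, and the column names (id, name, authorId, bookCount) are invented. The sketch uses a left join and counts book ids, so an author without any book still appears, with a count of 0.

```java
import static org.apache.spark.sql.functions.count;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical class, for illustration only.
public class CountBooksSketch {

  // Made-up authors: (id, name).
  static Dataset<Row> authors(SparkSession spark) {
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.IntegerType, false),
        DataTypes.createStructField("name", DataTypes.StringType, false) });
    List<Row> rows = Arrays.asList(
        RowFactory.create(1, "Author A"),
        RowFactory.create(2, "Author B"),
        RowFactory.create(3, "Author C"));
    return spark.createDataFrame(rows, schema);
  }

  // Made-up books: (id, authorId).
  static Dataset<Row> books(SparkSession spark) {
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.IntegerType, false),
        DataTypes.createStructField("authorId", DataTypes.IntegerType, false) });
    List<Row> rows = Arrays.asList(
        RowFactory.create(10, 1),
        RowFactory.create(11, 1),
        RowFactory.create(12, 2));
    return spark.createDataFrame(rows, schema);
  }

  // Left join authors to books, then count books per author.
  // count(books.id) ignores nulls, so authors without books get 0.
  static Dataset<Row> countBooksPerAuthor(Dataset<Row> authors,
                                          Dataset<Row> books) {
    return authors
        .join(books, authors.col("id").equalTo(books.col("authorId")), "left")
        .groupBy(authors.col("id"), authors.col("name"))
        .agg(count(books.col("id")).as("bookCount"))
        .orderBy("id");
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Count books sketch")
        .master("local[*]") // assumption: local mode, no cluster
        .getOrCreate();

    countBooksPerAuthor(authors(spark), books(spark)).show();

    spark.stop();
  }
}
```

Counting a column from the books side, rather than calling count() on the grouped data directly, is a deliberate choice here: after a left join, a plain row count would report 1 for an author with no books.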
35. A (Big) Data Scenario
Data → Ingestion → Raw Data → Data Quality → Pure Data → Transformation → Rich Data → Load/Publish → Data
36. What You Learned
๏ Big Data is easier than one could think
๏ Java is the way to go (or Python)
๏ New vocabulary for using Spark
๏ You have a friend to help (ok, me)
๏ Spark is fun
37. Going Further
๏ Run more code from the examples (I add some weekly)
๏ Contact me @jgperrin
๏ Join the Spark User mailing list
๏ Get help from Stack Overflow
๏ Watch for my book on Spark + Java to come!