29. WHAT IS SPARK MLLIB?
Machine Learning algorithms on Spark
Analyze data to extract insights
30. WHAT IS MACHINE LEARNING?
Technique        Question
Regression       What will revenue be next month?
Classification   Is the tumor cancerous or benign?
Clustering       Which customers are similar to each other?
Recommendation   Which movie will you like?
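A minimal sketch of the Regression row, written in plain Python rather than MLlib so it runs anywhere. The revenue figures and the fit_line helper are hypothetical, chosen only to illustrate fitting a trend line and extrapolating one month ahead.

```python
# Hypothetical monthly revenue figures (made up for illustration).
revenue = [100.0, 110.0, 120.0, 130.0]
months = list(range(1, len(revenue) + 1))


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(months, revenue)
next_month = len(revenue) + 1
prediction = slope * next_month + intercept
print(f"Predicted revenue for month {next_month}: {prediction:.1f}")
```

The other rows follow the same pattern: learn from past data, then answer the question for new data.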
32. WHAT IS THE DIFFERENCE BETWEEN REAL-TIME AND BATCH?
Term        Means                          Example
Real-Time   Process data when it arrives   Reject credit card transaction
Batch       Process data periodically      Flag suspicious transaction at night
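The two rows can be sketched side by side on the credit card example. This is an illustration only: the transactions, the LIMIT threshold, and both function names are assumptions, not part of any real fraud system.

```python
from collections import defaultdict

# Hypothetical transactions: (transaction_id, card, amount).
transactions = [
    ("t1", "card-A", 40.0),
    ("t2", "card-B", 9500.0),
    ("t3", "card-A", 60.0),
    ("t4", "card-A", 4950.0),
]

LIMIT = 5000.0  # assumed threshold, used for both checks


def process_real_time(txn):
    """Real-time: decide on a single transaction the moment it arrives."""
    _, _, amount = txn
    return "reject" if amount > LIMIT else "accept"


def process_batch(txns):
    """Batch: nightly pass over the day's transactions, flagging cards by total spend."""
    totals = defaultdict(float)
    for _, card, amount in txns:
        totals[card] += amount
    return sorted(card for card, total in totals.items() if total > LIMIT)


# Real-time path: one decision per incoming transaction.
decisions = [process_real_time(t) for t in transactions]

# Batch path: runs later over everything that was accepted during the day.
accepted = [t for t, d in zip(transactions, decisions) if d == "accept"]
flagged = process_batch(accepted)

print(decisions)
print(flagged)
```

The real-time check sees one transaction at a time, so it catches the single large payment. The batch job sees the whole day, so it can catch a card whose individual payments each looked fine.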
44. WHY LAMBDA ARCHITECTURE?
How can we see both historical trends and what is happening right now?
How can we show bestsellers from this year and from the last hour?
45. WHAT IS LAMBDA ARCHITECTURE?
A Big Data system that can handle both batch and real-time processing
Uses historical data as well as real-time data
Best of both worlds
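A minimal sketch of the idea, using the bestsellers question as the example. It assumes a precomputed batch view and a small real-time view that are merged at query time; the item names, counts, and function names are all hypothetical.

```python
from collections import Counter

# Batch side: sales counts precomputed periodically from historical data
# (hypothetical numbers).
batch_view = Counter({"book-a": 1200, "book-b": 900, "book-c": 50})

# Real-time side: counts for sales that arrived after the last batch run.
realtime_view = Counter()


def on_sale(item):
    """Update the real-time view as each sale arrives."""
    realtime_view[item] += 1


def merged_view():
    """Answer queries from historical and recent data combined."""
    return batch_view + realtime_view


def bestseller(view):
    """Return the item with the highest count in the given view."""
    return view.most_common(1)[0][0]


# Simulated stream of recent sales.
for item in ["book-c", "book-c", "book-a", "book-c"]:
    on_sale(item)

print("Bestseller overall:", bestseller(merged_view()))
print("Bestseller recently:", bestseller(realtime_view))
```

When the next batch run finishes, its output would replace batch_view and the real-time view would be reset, so recent data is counted only once.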
48. BATCH REVIEW
Technology   Description
Hadoop       Cluster operating system
HDFS         Stores petabytes of data on 100s or 1000s of machines
MapReduce    Processes data in HDFS
Hive         SQL queries run as MapReduce jobs
Pig          Pig Latin scripts run as MapReduce jobs
Spark        Faster MapReduce
Spark SQL    SQL queries run on Spark
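To make the MapReduce rows concrete, here is a single-machine sketch of the map, shuffle, and reduce phases for a word count, roughly what a SQL GROUP BY with COUNT turns into. It is plain Python, not the Hadoop API; the input lines and function names are assumptions.

```python
from collections import defaultdict

# Hypothetical input records; in Hadoop these would be lines of files in HDFS.
lines = ["spark hive pig", "hive spark", "spark"]


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a real cluster the map and reduce functions run in parallel on many machines, and the shuffle moves data between them over the network.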
49. REAL-TIME REVIEW
Technology            Description
HBase                 Fast NoSQL database on top of HDFS
Kafka                 Queues incoming data into cluster
Spark Streaming       Process in real-time
Lambda Architecture   Combines real-time and batch