2. Madhukara Phatak
• Consults in Bigdata and FP
• Works with Spark, Hadoop and ecosystem
• Training on Bigdata
• @madhukaraphatak
• http://www.madhukaraphatak.com
3. How many of you?
• Own a smartphone?
• Want to know when the next phone is coming to market?
• Want to know when the next version of an existing phone is coming to market?
• Want the specs and prices of a new phone?
– Months before the phone is released
• Data from multiple sources aggregated in one place
4. Rumor Engine
• A practical implementation of machine learning to solve the phone rumor problem.
• Built in 3 months
– Learning machine learning
– Learning Spark
– Idea
– Implementation
– Release
5. My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
6. Big data at work
• Worked for a BSS/OSS product company
• Big data is normal in telecom
• CDRs (call data records) run to around 3 TB for companies like Airtel
• Needed a solution for processing over 6 months of data
• Started this work around 4 years ago
7. Hadoop Saga
• Hadoop was the default choice
• Challenges in the ecosystem in India
• Hype vs reality
• Work
– Building the ML library Nectar
– Working with companies to build Hadoop expertise and solutions
– POCs
8. My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
9. Machine Learning in Hadoop
• Apache Mahout was the choice, but it is too hard to map to any new requirements
• The Map/Reduce implementation suffered from poor speed and high complexity
• Accuracy of the results is often poor
• We set out to build our own and realized it was too much overhead to build even the simplest things
10. ML and Map Reduce
• M/R forgets everything once one operation is done
• Everything has to go through HDFS, which is slower because of disk overhead
• Mahout tried for a long time to make it as fast as possible, but more or less gave up
• At Zinnia, we moved on to aggregation- and KPI-based solutions rather than pure ML
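A minimal sketch of the first two points, in plain Python rather than Hadoop. Two map/reduce jobs are chained, and the second job can only see what the first one wrote to disk. The temporary directory stands in for HDFS, and every name here (`run_job`, `words_by_frequency`, the mappers and reducers) is hypothetical and invented for illustration.

```python
import json
import os
import tempfile
from collections import defaultdict


def run_job(input_path, output_path, mapper, reducer):
    """Run one map/reduce job: read records from disk, write results to disk."""
    groups = defaultdict(list)
    with open(input_path) as src:
        for line in src:
            for key, value in mapper(json.loads(line)):
                groups[key].append(value)
    with open(output_path, "w") as dst:
        for key, values in groups.items():
            dst.write(json.dumps(reducer(key, values)) + "\n")
    # Nothing survives in memory: the next job has to re-read output_path.


def word_mapper(record):
    for word in record.split():
        yield word, 1


def sum_reducer(key, values):
    return [key, sum(values)]


def freq_mapper(record):
    word, count = record
    yield count, word


def collect_reducer(key, values):
    return [key, sorted(values)]


def words_by_frequency(lines):
    """Job 1 counts words; job 2 groups words by their count."""
    workdir = tempfile.mkdtemp()  # stands in for HDFS (assumption)
    raw = os.path.join(workdir, "raw.jsonl")
    counts = os.path.join(workdir, "counts.jsonl")
    grouped = os.path.join(workdir, "grouped.jsonl")
    with open(raw, "w") as f:
        for line in lines:
            f.write(json.dumps(line) + "\n")
    run_job(raw, counts, word_mapper, sum_reducer)          # writes to disk
    run_job(counts, grouped, freq_mapper, collect_reducer)  # reads it back
    with open(grouped) as f:
        return {count: words for count, words in map(json.loads, f)}
```

An iterative ML algorithm chains many such jobs, so the disk round trip between them is paid on every iteration.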
11. My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
13. Search for New Language
• Statically typed (Enterprise stack)
• Runs on JVM
• Ability to use Java libraries
• Functional programming
• Type inference
• REPL
14. My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
15. Scala
• Statically typed
• Type inference
• Functional programming and OO built in
• Parallelism built in
• REPL
• Scalable language
16. Search for Functional Bigdata
• Pig attempted on Hadoop
• Tuple Map/Reduce
• JavaScript API for Hadoop
• Why functional bigdata?
17. Big data platform requirement
• Immutability support
• Transformation not CRUD
• Built in laziness
• Concise API
• Type inference
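The first three requirements can be sketched in a few lines of plain Python. This is an illustration only, not Spark's API: `Pipeline` is a hypothetical class whose `map` and `filter` return a new immutable pipeline (transformation, not in-place update), and nothing is read or computed until `collect()` is called (laziness).

```python
from typing import Callable, Iterable, NamedTuple


class Pipeline(NamedTuple):
    """An immutable description of a computation; nothing runs until collect()."""
    source: Callable[[], Iterable]
    steps: tuple = ()

    def map(self, fn):
        # Returns a new pipeline; the original is never modified.
        return Pipeline(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        return Pipeline(self.source, self.steps + (("filter", pred),))

    def collect(self):
        # Only here is the source read and the recorded steps applied.
        data = iter(self.source())
        for kind, fn in self.steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


if __name__ == "__main__":
    numbers = Pipeline(lambda: range(1, 7))
    even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
    print(even_squares.collect())
```

Because each transformation only records a step, a platform built this way is free to plan the whole chain before touching any data.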
18. Java and Hadoop (Productivity)
• No laziness
– Every Map/Reduce operation needs to write its output to HDFS
• Java allows CRUD-like variable assignments, but these fail in distributed mode
• The type of each key/value pair has to be declared; there is no way to skip it
• Lots of boilerplate code for closures
19. Apache Spark
• Apache Spark is a framework for lightning-fast cluster computing.
• Built by AMPLab and now Databricks.
• Competitor to M/R of Hadoop
• Runs on Hadoop 1.0 and Hadoop 2.0 (YARN)
• Written in Scala
20. Spark and ML
• Built for iterative programs, a.k.a. ML
• Support for intermediate result caching
• Support for in-memory processing
• Remembers data across jobs, not just within a job
• There is suddenly interest in Bigdata ML again, as with Spark it is finally possible to run ML that is both fast and accurate
• Mahout is moving on to Spark
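Why caching matters for iterative programs can be shown with a small sketch in plain Python (not Spark code). `CountingSource` is a hypothetical stand-in for an expensive read such as HDFS, and `train_logistic` is an illustrative batch gradient descent for a one-feature logistic regression. With `cache=True` the data is read once and reused by every pass; with `cache=False` it is re-read on every pass, as a chain of Map/Reduce jobs would do.

```python
import math


class CountingSource:
    """Stands in for an expensive read (e.g. from HDFS); counts how often it runs."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._rows)


def train_logistic(source, iterations=50, rate=0.5, cache=True):
    """Fit y ~ sigmoid(w * x + b) by batch gradient descent on (x, y) rows."""
    w, b = 0.0, 0.0
    cached = source.load() if cache else None  # one read, reused by every pass
    for _ in range(iterations):
        rows = cached if cache else source.load()  # no cache: re-read each pass
        grad_w = grad_b = 0.0
        for x, y in rows:
            err = 1.0 / (1.0 + math.exp(-(w * x + b))) - y
            grad_w += err * x
            grad_b += err
        w -= rate * grad_w / len(rows)
        b -= rate * grad_b / len(rows)
    return w, b


if __name__ == "__main__":
    rows = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # made-up toy data
    cached_src, uncached_src = CountingSource(rows), CountingSource(rows)
    train_logistic(cached_src, cache=True)
    train_logistic(uncached_src, cache=False)
    print(cached_src.reads, uncached_src.reads)
```

Both runs produce the same model; only the number of reads differs, and the gap grows with the number of iterations.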
21. My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
22. Learning Machine learning
• Coursera
• Examples in Octave
• Porting examples from Octave to Spark
• https://github.com/zinniasystems/spark-ml-class
• Uses
– MLLib
– JBlas
– Breeze
23. My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
24. MLLib
• Standard Spark library for machine learning
• Built into Spark
• Very small code base – 1200 lines of Scala code
• 40x – 100x faster than Mahout
• Supports
– Linear and Logistic regression
– SVM
– Recommender systems
25. Mahout vs MLLib
• Mahout has more algorithms than MLLib
• MLLib has less code than Mahout (1200 lines of Scala vs >20,000 lines of Java code)
• Much improved performance and accuracy
• Mahout recognizes this and is moving to a Spark backend for its next release
26. My Journey
• Hadoop
• Mahout and Nectar
• JavaScript
• Scala and Spark
• Coursera
• MLLib
• Rumor Engine
27. Rumor Engine
• Crawls blog data
• Currently 12 blogs every day, with more to be added in future
• Naïve Bayes to classify
• Uses single-node Spark for prediction
• MLLib
• Has <200 lines of actual application Scala code.
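To make the classification step concrete, here is a minimal multinomial Naïve Bayes text classifier in plain Python. It is a sketch of the technique only: the Rumor Engine itself uses MLLib, and the function names, labels ("rumor", "other") and training sentences below are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: iterable of (text, label) pairs. Returns a model dict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return {"labels": label_counts, "words": word_counts, "vocab": vocab}


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["labels"].items():
        counts = model["words"][label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)  # log prior
        for word in text.lower().split():
            # Laplace (add-one) smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    training = [  # hypothetical blog snippets
        ("new phone leaked specs and price", "rumor"),
        ("leaked photos of upcoming phone", "rumor"),
        ("how to cook pasta at home", "other"),
        ("best pasta recipes for dinner", "other"),
    ]
    model = train_naive_bayes(training)
    print(classify(model, "leaked phone price"))
```

Training is a single counting pass over the data, which is one reason Naïve Bayes is a convenient first choice for text crawled from blogs.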
28. ML-Scale challenges
• Choosing an algorithm
• Accuracy of algorithm implementation
• Modeling when data is noisy and big
• Faster sampling
• Real time processing
• Accuracy vs Performance
29. ML-People challenges
• Hard to find Data scientists
• Unique combination of skills – Programming
at scale and Maths.
• Mathematical reasoning and practicality of
implementation.