2. About Me
Principal Architect at Glassbeam
Founded two startups
Passionate about building products,
big data analytics, and machine
learning
www.linkedin.com/in/mohammedguller
@MohammedGuller
Available on Amazon
4. CPU Trend
CPU clock speed plateaued around 2004
Individual cores are no longer getting much faster
Trend is to add more cores/CPU and more CPUs/system
5. Challenges
Multi-threaded programs are required to utilize all cores in a machine
Writing multi-threaded programs is hard
Tools provided by traditional languages are primitive
Problems such as deadlocks, livelocks, starvation, and race
conditions are difficult to avoid and detect
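As a minimal sketch of the kind of primitive tool the slide refers to, the Python example below guards a shared counter with a lock. The names `state`, `lock`, and `add_many` are made up for illustration. Without the lock, the read-modify-write on the counter could interleave between threads and lose updates, which is a race condition.

```python
import threading

state = {"count": 0}          # shared mutable state (hypothetical example)
lock = threading.Lock()

def add_many(n):
    """Increment the shared counter n times, holding the lock for each update."""
    for _ in range(n):
        with lock:            # serializes the read-modify-write below
            state["count"] += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(state["count"])         # 40000
```

The programmer has to remember to take the lock at every access, and nothing in the language checks that this was done, which is part of why such bugs are hard to avoid and detect.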
6. Functional Programming (FP)
Based on lambda calculus, a theory developed in the 1930s
Program composed of functions
– Executed by evaluating expressions
Functions are first-class citizens
– Can be passed as an argument to another function
– Can be returned by another function
– Can be defined inside another function
– Can be defined as an unnamed literal similar to a string literal
Functions do not have side effects
– Always return the same output for a given input
– Order of execution is not important
Discourages mutable variables
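The properties listed above can be illustrated with a short Python sketch. Python is not a purely functional language, and the names `compose`, `apply_twice`, and `increment` are chosen here only for illustration.

```python
def compose(f, g):
    """Return a new function that applies g first, then f."""
    def composed(x):                  # a function defined inside another function
        return f(g(x))
    return composed                   # a function returned by another function

def apply_twice(f, x):
    """Take a function as an argument and apply it twice."""
    return f(f(x))

increment = lambda n: n + 1           # an unnamed function literal bound to a name
double_then_increment = compose(increment, lambda n: n * 2)

print(apply_twice(increment, 5))      # 7
print(double_then_increment(10))      # 21
```

None of these functions reads or modifies anything outside its arguments, so each call with the same input gives the same output.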
7. Benefits of Functional Programming
Makes it easier to write multi-threaded programs
Improves developer productivity
Enables better quality code
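One way to see the first benefit: a function without side effects can be handed to several threads without any locking. The sketch below uses Python's standard `concurrent.futures` module; `square` and `numbers` are illustrative names. It is about safety rather than speed, since threads in standard CPython do not run CPU-bound code in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    """Pure function: the result depends only on n and nothing shared is modified."""
    return n * n

numbers = list(range(10))

sequential = [square(n) for n in numbers]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, numbers))   # map returns results in input order

print(parallel == sequential)   # True
```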
13. 5x More Connected Things Than People by 2020
Internet of Things: network of objects embedded with software for
collecting and exchanging data over the Internet
14. Big Data Challenges
Storage
– Traditional SAN and NAS storage devices are expensive
Processing
– Traditional RDBMS were not designed to handle big data
How to get value out of data
How to do it economically
15. Open-source Big Data Storage Technologies
Distributed File Systems
– HDFS
NoSQL data stores
– Cassandra
– HBase
– MongoDB
– Druid
– ElasticSearch
– SolrCloud
16. How Much Data Can a Standard Server Process
Scale shown: 100 GB, 1 TB, 10 TB, 100 TB
18. Scale-up
Use a more powerful high-end server
– Faster CPU
– Faster Disk
– More CPUs
– More memory
Proprietary
Expensive
Limited scalability
19. Scale-out
Use a cluster of commodity servers
Inexpensive
Economical to scale
Preferred architecture
20. Challenges With Scale-out Architecture
Writing a distributed application is even harder than writing a
multi-threaded one
Many details involved
– Split a workload into chunks that can be distributed across a cluster
– Schedule compute resources among different jobs
– Manage inter-node communication
– Handle network and node failures
Hardware failures are more common at a cluster level
– Probability of a single node failing is very low
– Probability of at least one node failing in a cluster of thousands of
nodes is very high
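The cluster-level failure claim can be checked with a little arithmetic. The sketch below assumes nodes fail independently and uses a made-up per-node failure probability of 0.1% over some time window. Both are assumptions for illustration, not figures from the talk.

```python
def prob_any_failure(p_single, nodes):
    """Probability that at least one of `nodes` independent nodes fails."""
    return 1 - (1 - p_single) ** nodes

p = 0.001  # hypothetical per-node failure probability in a given time window
for n in (1, 100, 1000, 5000):
    print(n, round(prob_any_failure(p, n), 3))
```

Under these assumptions a single node almost never fails, while a 1,000-node cluster sees at least one failure in roughly 63% of windows and a 5,000-node cluster in more than 99%.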
21. Getting Value Out of Data
Traditional analytics / BI
Machine Learning
– Predictive analytics
– Train software to do human tasks
22. Traditional Analytics / BI
What happened
– Revenue growth for the last month/quarter/year
– Customer growth for the last month/quarter/year
Why it happened
– Why profit dropped
– Why sales dropped
Other insights
– What is the country-wise breakdown of people downloading an app
– How much time people spend in an app
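A question like the country-wise breakdown is a plain aggregation over past records. As a small sketch, with a made-up download log standing in for real data:

```python
from collections import Counter

# Hypothetical download log: (user_id, country) pairs.
downloads = [
    ("u1", "US"), ("u2", "IN"), ("u3", "US"),
    ("u4", "DE"), ("u5", "IN"), ("u6", "US"),
]

by_country = Counter(country for _, country in downloads)
total = sum(by_country.values())

# Print each country with its download count and share of the total.
for country, count in by_country.most_common():
    print(f"{country}: {count} ({count / total:.0%})")
```

The answer describes what has already happened, which is what distinguishes this kind of analytics from prediction.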
23. Predictive Analytics
Ask software to predict
– What product will a customer most likely buy
– What ad will a visitor most likely click
– What movies/songs/books will a customer like
– What are the chances that a patient will have a heart attack
More interesting and valuable than traditional analytics
24. Train Software To Do Human Tasks
Image classification
– Facebook
– Flickr
Voice recognition and natural
language processing
– Siri
Body movement recognition
– Xbox Kinect
Self-driving car
– Google car
Medical diagnosis
Anomaly detection
– Fraudulent transaction
– Security attack
28. Spark Benefits
Makes it easy to write distributed data processing applications
– Expressive API
Takes care of the messy details of distributed computing
Allows developers to just focus on the business logic
– Same code works on a single computer or a cluster of nodes
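To make the split between business logic and plumbing concrete, here is a word count written in plain Python. It is not Spark's API. The helper names (`split_into_chunks`, `count_words`, `merge`) are invented for this sketch, and everything runs in one process. In a distributed framework, the chunking and merging steps are the kind of detail the framework would handle, leaving something like `count_words` as the part the developer writes.

```python
from collections import Counter
from functools import reduce

def split_into_chunks(items, num_chunks):
    """Plumbing: split a list into at most num_chunks roughly equal slices."""
    size = max(1, -(-len(items) // num_chunks))   # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]

def count_words(lines):
    """Business logic: count the words in one chunk of lines."""
    return Counter(word for line in lines for word in line.lower().split())

def merge(a, b):
    """Plumbing: combine two partial counts into one."""
    return a + b

lines = ["big data", "big clusters", "data data"]
partials = [count_words(chunk) for chunk in split_into_chunks(lines, 2)]
totals = reduce(merge, partials, Counter())

print(totals["data"], totals["big"])   # 3 2
```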
29. Integrated Libraries for a Variety of Tasks
Spark Core
Spark SQL
GraphX
Spark Streaming
MLlib & Spark ML
30. Spark is Fast
In-memory computation
Advanced Directed Acyclic Graph (DAG) execution engine
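The idea behind a DAG of work can be sketched with Python's standard `graphlib` module: each stage lists the stages it depends on, and a valid execution order follows from the graph. The stage names are hypothetical, and this shows only the dependency-ordering idea, not how Spark's own scheduler is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: each stage maps to the set of stages it depends on.
stages = {
    "load_logs": set(),
    "load_users": set(),
    "parse": {"load_logs"},
    "join": {"parse", "load_users"},
    "aggregate": {"join"},
}

# static_order() yields every stage after all of its dependencies.
order = list(TopologicalSorter(stages).static_order())
print(order)
```

Stages with no dependency between them, such as `load_logs` and `load_users` here, could run at the same time, and intermediate results can stay in memory between dependent stages.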
36. Opportunities
Big data will only get bigger
– Everything will be data driven
– New data-driven applications will be invented
– Data will enable us to solve extremely difficult problems
Spark and other big data technologies are rapidly evolving
Strong demand for people who know how to store, process and
get value out of big data