The world we live in generates massive amounts of data. These data come from many sources: people's activities on the internet, mobile activity, IoT devices and sensors, and real-world systems like power grids and factories. Big data is the collection of technologies used to make sense of that data.
A primary goal of big data is to derive useful and actionable insights from large or otherwise challenging data collections. The goal is to run the transformation from data, to information, to knowledge, and finally to insights. This ranges from calculating simple analytics like the mean, maximum, and median, to deriving an overall understanding of the data by building models, and finally to making predictions from the data. In some cases we can afford to wait while data are collected and processed, while in other cases we need to know the outputs right away. MapReduce has been the de facto standard for data processing, and we will start our discussion from there. However, that is only one side of the problem. Other technologies like Apache Spark and Apache Drill are gaining ground, as are real-time processing technologies like stream processing and complex event processing. Finally, there is a lot of work on porting decision technologies like machine learning into the big data landscape. This talk discusses big data processing in general and looks at each of these technologies, comparing and contrasting them.
Introduction to Big Data: making sense of the World around Us
1.
2. Image credit, CC licence, http://ansem315.deviantart.com/art/Asimov-Foundation-395188263
• Predict crime before it happens?
• Which is hard!
• Asimov’s “Foundation” talks about
mathematical models to predict the
future.
• We are entering an era where these are no longer just science fiction
3. e.g. Targeted Marketing
• Assume mass emails to 1M people, a reaction rate of 1%, and a $2 cost per email
– That costs $2M and reaches 10k people
• Now suppose that by looking at demographics (e.g. where they live, say using decision trees) you can find 250K people with a reaction rate of 6%
– That costs $500K and reaches 15k people (the arithmetic is sketched below)
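A quick back-of-envelope check of the figures above; a sketch in Python, where the campaign() helper is made up for illustration:
def campaign(people, reaction_rate, cost_per_email=2.0):
    # returns (total cost in $, expected number of responders)
    return people * cost_per_email, int(people * reaction_rate)
mass_cost, mass_reach = campaign(1_000_000, 0.01)      # $2M, 10k responders
target_cost, target_reach = campaign(250_000, 0.06)    # $500K, 15k responders
print(mass_cost, mass_reach, target_cost, target_reach)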
4. A day in your life
Think about a day in your life:
– What is the best road to take?
– Would there be any bad weather?
– How to invest my money?
– How is my health?
There are many decisions you could make better, if only you could access and process the relevant data.
http://www.flickr.com/photos/kcolwell/5512461652/ CC licence
5.
6. Internet of Things
• Currently the physical world and the software world are detached
• The Internet of Things promises to bridge this
– It is about sensors and actuators everywhere
– In your fridge, in your blanket, in your chair, in your carpet… yes, even in your socks
– Umbrellas that light up when rain is expected, and medicine cups
7. What can we do with Big Data?
• Optimize (the world is inefficient)
– 30% of food is wasted from farm to plate
– GE 1% initiative (http://goo.gl/eYC0QE )
• 1% saving in trains can save 2B/ year
• 1% in US healthcare is 20B/ year
• In contrast, Sri Lanka total exports 9B/ year.
• Save lives
– Weather, Disease identification, Personalized
treatment
• Technology advancement
– Most high-tech research is done via simulations
9. Why Big Data is hard?
• How to store? Assuming 1TB per machine, it takes 1,000 machines to store 1PB
• How to move? Even on a 10Gb/s network, it takes at least ~13 minutes to copy 1TB and ~9 days to copy 1PB, and much longer at realistic throughput
• How to search? Assuming each record is 1KB and one machine can process 1,000 records per second, it needs roughly 11.5 CPU days to process 1TB and over 30 CPU years to process 1PB
• Big data needs distributed systems (the arithmetic is sketched below)
http://www.susanica.com/photo/9
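The estimates above are idealized, line-rate numbers and can be reproduced with a few lines of arithmetic; a minimal sketch in Python, assuming the slide's figures (1TB per machine, a 10Gb/s link, 1KB records, 1,000 records/sec per machine):
TB, PB = 10**12, 10**15                                       # bytes
link_bps = 10 * 10**9                                         # 10 Gb/s
print(PB / TB, "machines to hold 1PB at 1TB each")            # 1000
print(8 * TB / link_bps / 60, "minutes to copy 1TB")          # ~13 minutes
print(8 * PB / link_bps / 86400, "days to copy 1PB")          # ~9 days
print(TB / 1000 / 1000 / 86400, "CPU days to scan 1TB")       # ~11.6 days
print(PB / 1000 / 1000 / (86400 * 365), "CPU years for 1PB")  # ~31.7 years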
12. MapReduce/ Hadoop
• First introduced by Google,
and used as the programming
model for their systems
• Implemented by open-source projects like Apache Hadoop and Spark
• Users write two functions: map and reduce
• The framework handles the details like distributed processing, fault tolerance, load balancing, etc.
• Widely used, and one of the catalysts of big data
void map(ctx, k, line) {
  (player, speed) = split(line, ',');
  ctx.emit(player, speed);
}
void reduce(ctx, player, speeds[]) {
  ctx.emit(player, avg(speeds));
}
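The map/reduce pseudocode above can be imitated locally in plain Python to see the flow (map each line to a (player, speed) pair, group by key, reduce each group); this is only an illustration, not the Hadoop API, and the sample lines are made up:
from collections import defaultdict
def map_fn(line):                       # emit a (player, speed) pair per line
    player, speed = line.split(",")
    return player, float(speed)
def reduce_fn(player, speeds):          # average all speeds seen for a player
    return player, sum(speeds) / len(speeds)
lines = ["bob,9.5", "ann,8.1", "bob,10.5"]
grouped = defaultdict(list)             # stands in for the framework's shuffle/group step
for line in lines:
    player, speed = map_fn(line)
    grouped[player].append(speed)
print([reduce_fn(p, s) for p, s in grouped.items()])   # [('bob', 10.0), ('ann', 8.1)]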
14. Apache Spark
• A new programming model built on functional programming concepts
• Can be much faster for iterative use cases
• Performance: using Spark on 206 EC2 machines, 100 TB of on-disk data was sorted in 23 minutes. The previous world record, set by Hadoop MapReduce, used 2,100 machines and took 72 minutes (about 3x faster with roughly 10x fewer machines, i.e. on the order of a 30x per-machine improvement).
15. Calculating Avg Speed with Spark
• Map the file to a virtual variable (an RDD); this does not load the data
• Then apply lambda functions
file = spark.textFile("hdfs://… speed.data")
pairs = file.map(fnSplit2Pair)                    // (player, speed) pairs
totals = pairs.reduceByKey((a, b) => a + b)       // total speed per player
counts = pairs.mapValues(v => 1).reduceByKey((a, b) => a + b)
avgSpeeds = totals.join(counts).mapValues(t => t._1 / t._2)
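For comparison, roughly the same computation in PySpark (a sketch only; the HDFS path and app name are placeholders, and the input is assumed to be "player,speed" lines):
from pyspark import SparkContext
sc = SparkContext(appName="AvgSpeed")                         # placeholder app name
lines = sc.textFile("hdfs:///data/speed.data")                # placeholder path
pairs = lines.map(lambda l: (l.split(",")[0], float(l.split(",")[1])))
totals = pairs.reduceByKey(lambda a, b: a + b)                # total speed per player
counts = pairs.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b)
avg_speed = totals.join(counts).mapValues(lambda tc: tc[0] / tc[1])
print(avg_speed.collect())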
16. What if you could freeze time!!
• Most analytics solutions run overnight, as batch jobs
• Think about how you buy something: research a bit and buy; often overnight is too late
• But not all trends take time
– People change their minds
– Trends move fast
– React to what the customer is doing (do not let them move away)
• At CEP speeds of ~400k events/sec, if a person spent one second per event, it would take more than 4 days to work through one second's worth of events!
17. Real-time Analytics
• The idea is to process data as it is received, in a streaming fashion
• Used when we need
– Very fast output
– Lots of events (a few 100k to millions)
– Processing without storing (e.g. too much data to keep)
• Two main technologies
– Stream Processing (e.g. Storm, http://storm-project.net/)
– Complex Event Processing (CEP), e.g. http://wso2.com/products/complex-event-processor/
18. Complex Event Processing (CEP)
• Sees inputs as event streams, which are queried with an SQL-like language
• Supports filters, windows, joins, patterns, and sequences
define partition “playerPartition” as PlayerDataStream.pid;
from PlayerDataStream#win.time(1m)
select pid, avg(speed) as avgSpeed
insert into AvgSpeedStream
using partition playerPartition;
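The query above emits a per-player average speed over a 1-minute sliding time window. A CEP engine maintains such windows incrementally; the following is only a rough Python sketch of the window idea (event fields and names are assumed), not Siddhi itself:
from collections import defaultdict, deque
WINDOW = 60.0                                   # seconds, like #win.time(1m)
windows = defaultdict(deque)                    # per-player (timestamp, speed) events
def on_event(pid, ts, speed):
    w = windows[pid]
    w.append((ts, speed))
    while w and ts - w[0][0] > WINDOW:          # expire events older than 1 minute
        w.popleft()
    print("AvgSpeedStream:", pid, sum(s for _, s in w) / len(w))
on_event("p1", 0.0, 5.0)
on_event("p1", 30.0, 7.0)                       # -> average of 5 and 7
on_event("p1", 120.0, 9.0)                      # earlier events expired -> 9.0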
19. DEBS Grand Challenge
• Event Processing
challenge
• A real football game, with sensors in the players' shoes and in the ball
• Events arrive at 15 kHz
• Event format
– Sensor ID, TS, x, y, z, v,
a
• Queries
– Running Stats
– Ball Possession
– Heat Map of Activity
– Shots at Goal
20. Example: Detect Ball Possession
• Possession is the time from when a player hits the ball until someone else hits it or it goes out of the field
• See demo,
http://goo.gl/VW6xQN
from Ball#window.length(1) as b join
Players#window.length(1) as p
unidirectional
on debs:getDistance(b.x, b.y, b.z,
p.x, p.y, p.z) < 1000
and b.a > 55
select ...
insert into hitStream
from old = hitStream,
     b = hitStream[old.pid != pid],
     n = hitStream[b.pid == pid]*,
     (e1 = hitStream[b.pid != pid]
      or e2 = ballLeavingHitStream)
select ...
insert into BallPossessionStream
http://www.flickr.com/photos/glennharper/146164820/
22. Machine Learning Tools
• R – a programming language for statistical computing (the most widely used)
• Weka – a Java machine learning library (single node)
• Scikit-learn – a very easy to use Python library
• Scalable
– Mahout : MapReduce implementation of Machine
learning algorithms
– MLBase (based on Spark)
– Others: GraphLab, VW, 0xData
• PMML (Predictive Model Markup Language)
– Lets you port models between languages
24. Curious Case of Missing Data
http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, Pic from
http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/
• WWII: returned aircraft and data on where they were hit
• How would you add armour?
25. Challenges due to the Nature of Big Data
• Lack of Control Experiment
– Often countered with A/B testing in the field
– Hard to prove causality
• Does the data come from a representative sample?
• Privacy
– Security
– Randomized techniques (see http://goo.gl/sLfKIb )
27. Making Sense of Data
• Hindsight (to know what happened)
– Basic analytics + visualizations (min, max, average, histograms, distributions …)
• Oversight (to know what is happening and to fix it)
– Realtime analytics
• Insight (to understand why)
– Pattern mining, clustering
• Foresight (to predict what is coming)
– Neural networks, classification, recommendation
28. Hindsight (What happened?)
• Analytics Implemented with
MapReduce or Queries
– Min, Max, average,
correlation, histograms
– Might join or group data in
many ways
– Heatmaps, temporal trends
• Key Performance indicators
(KPIs)
– Avg time for a ticket for
customer service
– Profit per square foot for retail
• Data is often presented with some visualizations
http://www.flickr.com/photos/isriya/2967310333/
29. Drill Down
• The idea is to let users drill down and explore a view of the data
– E.g. find the customers, regions, or times of year that are responsible for most of the revenue
• With OLAP, users define 3 (or more) dimensions, and the tool lets them explore the cube, looking at only a subset of the data at a time
– E.g. tools: Mondrian, Apache Drill
http://pixabay.com, by Steven Iodice
30. Usecase: Planning
• Urban Planning
– People distribution
– Mobility
– Waste Management
– E.g. see
http://goo.gl/jPujmM
• Market Research
– Buying Patterns
– Sentiments
31. Oversight (What is happening?)
• Realtime analytics
• Realtime visualizations
• Alarms (find problems) and
action recommendations
– Classification
– Anomaly detection
• Drill down and look at
historical data as before.
32. Oversight : Usecases
• Preprocessing: Correlations, filtering, transformations
• Tracking - follow some related entity’s state (such as in
space, time or process status).
– e.g. the location of airline baggage or vehicles, tracking wildlife
• Respond to emergencies
– E.g. plan maintenance before aircraft lands
• Detect trends – event sequences, missing events, thresholds, outliers, complex trends such as a triple bottom, etc.
– (e.g. algorithmic trading, SLA monitoring, system management)
• Building Profiles – extract info, relationships (e.g.
targeted marketing)
– Marketing, Recommendations
33. Insight (Understanding Why?)
• Pattern Mining – find frequent
associations (e.g. Market Basket),
frequent sequences
• Clustering
• Graph Analysis
• Knowledge Discovery
• Correlations between features and finding principal components
• Simulations, complex system modeling, matching a statistical distribution
34. Usecase 1: Clustering
• Clustering groups similar items together, e.g. k-means (a small sketch follows this slide)
• Applications
– Similar documents,
Genes, Medical
imaging, similar
people, customers
– Crime Analysis
– Compare chemical
compounds
– Social network
analysis
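As a small illustration of the idea, a k-means sketch with scikit-learn on made-up 2-D points (the data and the choice of two clusters are arbitrary):
import numpy as np
from sklearn.cluster import KMeans
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],     # one rough group
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])    # another rough group
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)                                       # cluster assigned to each point
print(kmeans.cluster_centers_)                              # the two group centres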
35. Usecase 2: Graph Analytics
• Types of graphs: social and communication networks, biological networks, maps, the Web, Semantic Web/ontologies
• Problems
– Counting triangles (influence)
– Find hubs and authorities (key people, pages)
– Finding shortest paths and minimum spanning trees (routing Internet traffic and UPS trucks)
– Modularity – strength of a community / centrality
– Clique and subgraph detection (a toy sketch of a couple of these follows this slide)
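A couple of the problems above (triangle counts, shortest paths) tried on a toy graph with networkx; a single-machine sketch, not the distributed graph tooling a big data project would use:
import networkx as nx
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),       # a triangle
                  ("c", "d"), ("d", "e")])
print(nx.triangles(G))                                      # triangles touching each node
print(nx.shortest_path(G, "a", "e"))                        # ['a', 'c', 'd', 'e']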
36. Usecase 3: Modeling Solar System
• The PDEs are not analytically solvable
• So we use simulations
• Other examples: weather
38. Prediction Technologies
• Trying to build a model for the
data
• Predict the next values in a sequence
– Regression, neural networks, Markov and Hidden Markov Models
• Classification
– Decision Trees, SVMs, Graphical
Models
• Finding Anomalies
– Markov Chains, outliers in a
distribution
• Recommendations
http://misterbijou.blogspot.com/2010_09_01_archive.html
39. Usecase 1: Electricity Demand Forecast
• Find trends and
cycles (e.g. ARIMA)
• Use regression to build a model from earlier data (a small sketch follows this slide)
• Predict based on
the model
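A minimal sketch of the "regression on earlier data" step: fit a linear model on the previous 24 hours of synthetic demand and predict the next hour. Real demand forecasting (ARIMA with trend and seasonality, as the slide notes) is considerably more involved:
import numpy as np
from sklearn.linear_model import LinearRegression
hours = np.arange(24 * 14)                                   # two weeks of synthetic hourly data
demand = 100 + 20 * np.sin(2 * np.pi * hours / 24) + np.random.normal(0, 2, hours.size)
LAG = 24                                                     # use the previous day as features
X = np.array([demand[i - LAG:i] for i in range(LAG, demand.size)])
y = demand[LAG:]
model = LinearRegression().fit(X, y)
print("next-hour forecast:", model.predict(demand[-LAG:].reshape(1, -1))[0])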
40. Usecase 2: Predictive Maintenance
• The idea is to fix the problem before it breaks, avoiding expensive downtime
– Airplanes, turbines,
windmills
– Construction Equipment
– Car, Golf carts
• How
– Anomaly detection (deviation from normal operation); a simple sketch follows this slide
– Match against known error patterns
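The "deviation from normal operation" check can be sketched as a simple z-score threshold over a sensor reading; real predictive maintenance uses richer models, and the readings below are made up:
import numpy as np
readings = np.array([0.51, 0.49, 0.52, 0.50, 0.48, 0.53, 0.97, 0.50])   # made-up vibration readings
z = (readings - readings.mean()) / readings.std()
for i, (r, score) in enumerate(zip(readings, z)):
    if abs(score) > 2:                                       # more than 2 std devs from normal
        print(f"reading {i}: {r} looks anomalous (z={score:.1f})")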
43. Big Data Projects are
• Access to data is the main asset
– Data owners set the terms
• Involve many organizations
– Data owners rarely have the expertise to make use of the data
• Multi-domain
– Retain and teach cross-domain people
• Complicated, and built on a lot of open-source tools
– Do not reinvent the wheel; let go of NIH (not invented here)
• Mathematical
– Math, advanced algorithms, statistical methods, machine learning. Brush up your math!!
• Distributed
– Learn a bit of distributed systems