Image credit, CC licence, http://ansem315.deviantart.com/art/Asimov-Foundation-395188263
• Can we predict crime before it happens?
• That is hard!
• Asimov’s “Foundation” talks about
mathematical models that predict the
future.
• We are entering an era where such
ideas are no longer just science fiction
e.g. Targeted
Marketing
• Assume mass emails to 1M
people, a response rate of 1%,
and a cost of $2 per email.
– Then the cost is $2M and the
reach is 10K people.
• Let's say that by looking at
demographics (e.g. where
they live), e.g. by using
decision trees, you can find
250K people with a response
rate of 6%.
– Then the cost is $500K and the
reach is 15K people.
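The campaign arithmetic above can be checked with a minimal sketch. The function name `campaign` and the extra "cost per responder" figure are illustrative additions, not from the slides.

```python
def campaign(recipients, response_rate, cost_per_email):
    """Return (total cost, expected responders, cost per responder)."""
    cost = recipients * cost_per_email
    responders = recipients * response_rate
    return cost, responders, cost / responders


# Numbers taken from the slide: mass mailing vs. a targeted segment.
mass = campaign(1_000_000, 0.01, 2.0)
targeted = campaign(250_000, 0.06, 2.0)

for name, (cost, responders, per_responder) in [("mass", mass), ("targeted", targeted)]:
    print(f"{name}: cost=${cost:,.0f} reach={responders:,.0f} cost/responder=${per_responder:,.2f}")
```

Under these assumptions the targeted campaign costs a quarter as much and still reaches more responders.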
A day in your life
• Think about a day in your life.
– What is the best road to take?
– Will there be any bad weather?
– How should I invest my money?
– How is my health?
• There are many decisions that
you could make better if only you
could access the data and process
it.
http://www.flickr.com/photos/kcolwell/5512461652/ CC licence
Internet of Things
• Currently the physical world and
the software world are
detached
• The Internet of Things promises
to bridge this
– It is about sensors and
actuators everywhere
– In your fridge, in your
blanket, in your chair, in your
carpet... Yes, even in your
socks
– Umbrellas that light up when
there is rain, and medicine
cups
What can we do with Big Data?
• Optimize (the world is inefficient)
– 30% of food is wasted from farm to plate
– GE 1% initiative (http://goo.gl/eYC0QE )
• A 1% saving in trains can save 2B/year
• 1% in US healthcare is 20B/year
• In contrast, Sri Lanka's total exports are 9B/year.
• Save lives
– Weather, disease identification, personalized
treatment
• Technology advancement
– Most high-tech research is done via simulations
Big Data Architecture
Why is Big Data hard?
• How to store? Assuming 1 TB of disk
per computer, it takes 1000 computers
to store 1 PB
• How to move? Assuming a 10 Gb/s
network, it takes about 13 minutes to
copy 1 TB, or about 9 days to copy
1 PB, even at full line rate
• How to search? Assuming each
record is 1 KB and one machine can
process 1000 records per second, it
needs about 278 CPU hours (11.6 days)
to process 1 TB and about 32 CPU years
to process 1 PB
• Big data needs distributed systems
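The store, move, and search estimates can be recomputed from the stated assumptions with a short sketch. Decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes), the full 10 Gb/s line rate, and all function names are assumptions made here for illustration.

```python
TB = 10**12  # bytes, decimal units (assumption)
PB = 10**15


def computers_needed(data_bytes, disk_bytes_per_computer=TB):
    """Machines needed to hold the data, assuming one 1 TB disk each."""
    return -(-data_bytes // disk_bytes_per_computer)  # ceiling division


def copy_seconds(data_bytes, link_bits_per_sec=10 * 10**9):
    """Time to push the data through one link at full line rate."""
    return data_bytes * 8 / link_bits_per_sec


def scan_seconds(data_bytes, record_bytes=1000, records_per_sec=1000):
    """Time for one machine to touch every record once."""
    return data_bytes / record_bytes / records_per_sec


for label, size in [("1 TB", TB), ("1 PB", PB)]:
    print(label,
          f"computers={computers_needed(size)}",
          f"copy={copy_seconds(size) / 3600:.2f} h",
          f"scan={scan_seconds(size) / 86400:.1f} CPU days")
```

Real transfers rarely reach line rate, so the copy times are lower bounds.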
http://www.susanica.com/photo/9
Tools for Processing
Data
Big data Processing Technologies
Landscape
MapReduce/ Hadoop
• First introduced by Google,
and used as the programming
model for their systems
• Implemented by open source
projects like Apache Hadoop
and Spark
• Users write two functions:
map and reduce
• The framework handles the
details like distributed
processing, fault tolerance,
load balancing etc.
• Widely used, and one of
the catalysts of Big Data
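// Pseudocode: each input line is "player,speed"
void map(ctx, k, line){
    (player, speed) = split(line, ',');
    ctx.emit(player, speed);       // key = player, value = speed
}
// Called once per player with all of that player's speeds
void reduce(ctx, player, speeds[]){
    ctx.emit(player, avg(speeds));
}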
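To make the map and reduce roles concrete, here is a single-process sketch of the same average-speed job. The `run_mapreduce` driver is a hypothetical stand-in for what a framework such as Hadoop does across machines (grouping values by key between the two phases). It is not Hadoop's API.

```python
from collections import defaultdict


def map_fn(line):
    """Emit (player, speed) for one 'player,speed' input line."""
    player, speed = line.split(",")
    yield player.strip(), float(speed)


def reduce_fn(player, speeds):
    """Receive every speed emitted for one player; emit the average."""
    return player, sum(speeds) / len(speeds)


def run_mapreduce(lines, map_fn, reduce_fn):
    """Toy driver: map every line, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


lines = ["p1,4.0", "p2,6.0", "p1,8.0"]
result = run_mapreduce(lines, map_fn, reduce_fn)
print(result)
```

In a real cluster the grouping step (the shuffle) moves data between machines, which is where the framework's fault tolerance and load balancing matter.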
MapReduce (Contd.)
Apache Spark
• New programming
model built on
functional
programming concepts
• Can be much faster for
iterative use cases
• Performance: using Spark on 206 EC2 machines,
100 TB of data on disk was sorted in 23 minutes. The
previous world record, set by Hadoop
MapReduce, used 2100 machines and took 72
minutes (i.e. about 3X faster on about 10X fewer
machines, roughly a 30X speedup per machine)
Calculating Avg Speed with Spark
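// Map the data to a virtual variable; nothing is loaded yet
val file = spark.textFile("hdfs://… speed.data")
val pairs = file.map(fnSplit2Pair)                 // (player, speed)
val tot = pairs.reduceByKey((a, b) => a + b)       // sum of speeds per player
val count = pairs.mapValues(_ => 1).reduceByKey((a, b) => a + b)
val avgSpeed = tot.join(count).mapValues { case (t, c) => t / c }
• Map data to a virtual variable, which does not
load the data
• Then apply lambda functions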
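The per-key average can also be computed in one pass by reducing `(sum, count)` pairs. The sketch below imitates that idea in plain Python. `reduce_by_key` is a local helper written for illustration, not the Spark API, and the sample lines are made up.

```python
def reduce_by_key(pairs, fn):
    """Combine all values that share a key using the binary function fn."""
    out = {}
    for key, value in pairs:
        out[key] = fn(out[key], value) if key in out else value
    return out


def split_to_pair(line):
    """Turn 'player,speed' into (player, speed)."""
    player, speed = line.split(",")
    return player, float(speed)


lines = ["p1,4.0", "p2,6.0", "p1,8.0", "p2,2.0"]
pairs = [split_to_pair(line) for line in lines]

# Carry (sum, count) together so a single reduction yields both.
sum_count = reduce_by_key(
    ((player, (speed, 1)) for player, speed in pairs),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
avg_speed = {player: total / n for player, (total, n) in sum_count.items()}
print(avg_speed)
```

Carrying the count alongside the sum works because both are associative, which is what lets a framework combine partial results from different machines in any order.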
What if you can freeze time!!
• Most solutions run overnight
• Think about how you buy
something: you research a bit and
buy. Often overnight is too late.
• But not all trends take time
– People change their minds
– Trends move fast
– React to what the customer is doing
(do not let them move away)
• At a CEP speed of 400K events/sec, if each
event took a second, it would take about
4.6 days to get through one second of
real-world events!!
Real-time Analytics
• The idea is to process data as it is
received, in a streaming fashion
• Used when we need
– Very fast output
– Lots of events (a few 100K to millions)
– Processing without storing (e.g. too
much data)
• Two main technologies
– Stream Processing (e.g. Storm,
http://storm-project.net/ )
– Complex Event Processing (CEP)
http://wso2.com/products/complex-event-processor/
Complex Event Processing (CEP)
• Sees inputs as event streams, which are queried with
an SQL-like language
• Supports Filters, Windows, Join, Patterns and
Sequences
define partition “playerPartition” as PlayerDataStream.pid;
from PlayerDataStream#win.time(1m)
select pid, avg(speed) as avgSpeed
insert into AvgSpeedStream
using partition playerPartition;
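What the query above asks for (a per-player average over a one-minute sliding time window) can be sketched in plain Python. This illustrates the windowing idea only, not how the CEP engine implements it. The class name, the timestamps in seconds, and the assumption that events arrive in time order are all illustrative.

```python
from collections import defaultdict, deque


class TimeWindowAvg:
    """Per-player average speed over a sliding time window (default 60 s)."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.events = defaultdict(deque)  # pid -> deque of (ts, speed)
        self.sums = defaultdict(float)    # pid -> running sum inside the window

    def on_event(self, pid, ts, speed):
        """Add one event, expire old ones, and emit (pid, avgSpeed)."""
        queue = self.events[pid]
        queue.append((ts, speed))
        self.sums[pid] += speed
        # Drop events that fell out of the window; assumes ts never decreases.
        while queue[0][0] <= ts - self.window:
            _, old_speed = queue.popleft()
            self.sums[pid] -= old_speed
        return pid, self.sums[pid] / len(queue)


window = TimeWindowAvg(60.0)
for pid, ts, speed in [("p1", 0, 4.0), ("p1", 30, 8.0), ("p2", 31, 3.0), ("p1", 70, 2.0)]:
    print(window.on_event(pid, ts, speed))
```

Keeping a running sum makes each event cost O(1) amortized, which matters at the event rates the slides mention.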
DEBS Grand Challenge
• Event Processing
challenge
• Real football game, with
sensors in the players'
shoes and the ball
• Events at 15 kHz
• Event format
– Sensor ID, TS, x, y, z, v, a
• Queries
– Running Stats
– Ball Possession
– Heat Map of Activity
– Shots at Goal
Example: Detect Ball Possession
• Possession is the time from when a
player hits the ball until
someone else hits it or it
goes out of the ground
• See demo,
http://goo.gl/VW6xQN
from Ball#window.length(1) as b join
Players#window.length(1) as p
unidirectional
on debs: getDistance(b.x,b.y,b.z,
p.x, p.y, p.z) < 1000
and b.a > 55
select ...
insert into hitStream
from old = hitStream ,
b = hitStream [old. pid != pid ],
n= hitStream[b.pid == pid]*,
( e1 = hitStream[b.pid != pid ]
or e2= ballLeavingHitStream)
select ...
insert into BallPossessionStream
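The two queries above can be approximated in plain Python: first decide whether a ball event counts as a hit, then turn the stream of hits into possession spans. This is a simplified sketch, not the demo's implementation. The thresholds 1000 and 55 come from the query, while the dictionary and tuple event layouts and the function names are assumptions.

```python
import math


def is_hit(ball, player, max_dist=1000.0, min_accel=55.0):
    """A hit: the ball is near the player and is accelerating sharply."""
    return math.dist(ball["pos"], player["pos"]) < max_dist and ball["a"] > min_accel


def possessions(events):
    """Turn time-ordered (ts, kind, pid) events into (pid, start, end) spans.

    kind is 'hit' (pid = the player) or 'out' (ball left the ground, pid = None).
    A possession still open when the events end is not emitted.
    """
    spans = []
    owner, start = None, None
    for ts, kind, pid in events:
        if kind == "hit" and pid == owner:
            continue  # same player hits again: possession continues
        if owner is not None:
            spans.append((owner, start, ts))  # ended by another hit or by 'out'
        owner, start = (pid, ts) if kind == "hit" else (None, None)
    return spans


events = [(0, "hit", "A"), (2, "hit", "A"), (5, "hit", "B"), (9, "out", None), (12, "hit", "A")]
print(possessions(events))
```

Unlike the pattern query, this sketch also starts a possession at the very first hit in the stream.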
http://www.flickr.com/photos/glennharper/146164820/
Lambda Architecture
Machine Learning Tools
• R – programming language for statistical
computing (most widely used)
• Weka – Java machine learning library (single
node)
• Scikit-learn – very easy to use Python library
• Scalable
– Mahout: MapReduce implementation of machine
learning algorithms
– MLBase (based on Spark)
– Others: GraphLab, VW, 0xData
• PMML (Predictive Model Markup Language)
– Lets you port models between languages
Solving the Problem
Curious Case of Missing Data
http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, Pic from
http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/
• WW II: returned
aircraft, and data
on where they
were hit
• How would you
add armour?
Challenges due to Nature of Bigdata
• Lack of controlled experiments
– Often countered with A/B testing in the field
– Hard to prove causality
• Does the data come from a representative sample?
• Privacy
– Security
– Randomized techniques (see http://goo.gl/sLfKIb )
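As an illustration of A/B testing in the field, the sketch below compares the conversion rates of two variants with a two-proportion z-test. The choice of test, the function name, and the sample counts are assumptions made here, since the slides do not prescribe a method.

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test; returns (z, two-sided p-value).

    Assumes large samples and a pooled rate strictly between 0 and 1.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


z, p_value = ab_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z={z:.2f} p={p_value:.4f}")
```

Randomly assigning users to the two variants is what separates this from a purely observational comparison.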
Big data lifecycle
• Get the data, clean up
Making Sense of Data
• Hindsight (to know what
happened)
– Basic analytics + visualizations
(min, max, average, histogram,
distributions …)
• Oversight (to know what is
happening and fix it)
– Realtime analytics
• Insight
– Pattern mining, clustering
• Foresight
– Neural networks,
classification,
recommendation
Hindsight (What happened?)
• Analytics implemented with
MapReduce or queries
– Min, max, average,
correlation, histograms
– Might join or group data in
many ways
– Heatmaps, temporal trends
• Key Performance Indicators
(KPIs)
– Average time per ticket for
customer service
– Profit per square foot for
retail
• Data is often presented with
some visualizations
http://www.flickr.com/photos/isriya/2967310333/
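The basic hindsight analytics listed above (min, max, average, correlation, histogram) fit in a few lines of plain Python. The ticket data below is made up for illustration, and the helper names are hypothetical.

```python
from collections import Counter
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def histogram(values, bin_width):
    """Count values per bin; each bin is labelled by its lower edge."""
    return Counter(int(v // bin_width) * bin_width for v in values)


# Hypothetical KPI data: minutes to resolve each ticket, and replies exchanged.
ticket_minutes = [12, 45, 33, 8, 51, 27]
ticket_replies = [1, 4, 3, 1, 5, 2]

print(min(ticket_minutes), max(ticket_minutes), mean(ticket_minutes))
print(round(pearson(ticket_minutes, ticket_replies), 3))
print(sorted(histogram(ticket_minutes, 20).items()))
```

At scale the same aggregates would be expressed as MapReduce jobs or queries, but the definitions do not change.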
Drill Down
• The idea is to let users drill
down and explore a view of
the data
– E.g. find the customers, region,
time of year etc. that are
responsible for most
revenue
• With OLAP, users define
3 (or more) dimensions, and
the tool lets them explore the
cube and look at only a
subset of the data.
– E.g. tools: Mondrian, Apache
Drill
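Drilling down amounts to aggregating a measure over chosen dimensions while filtering on others. The sketch below shows that on a tiny in-memory fact table. The `roll_up` helper, the column names, and the rows are hypothetical. Real OLAP tools precompute and index such cubes.

```python
from collections import defaultdict


def roll_up(facts, dims, measure="revenue", where=None):
    """Sum `measure` grouped by `dims`, keeping only rows that match `where`."""
    totals = defaultdict(float)
    for row in facts:
        if where and any(row[key] != value for key, value in where.items()):
            continue
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


facts = [
    {"region": "EU", "quarter": "Q1", "customer": "a", "revenue": 10.0},
    {"region": "EU", "quarter": "Q2", "customer": "b", "revenue": 5.0},
    {"region": "US", "quarter": "Q1", "customer": "a", "revenue": 7.0},
]

print(roll_up(facts, ["region"]))                            # top-level view
print(roll_up(facts, ["quarter"], where={"region": "EU"}))   # drill into EU
```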
http://pixabay.com, by Steven Iodice
Usecase: Planning
• Urban Planning
– People distribution
– Mobility
– Waste Management
– E.g. see
http://goo.gl/jPujmM
• Market Research
– Buying Patterns
– Sentiments
Oversight (What is
happening?)
• Realtime analytics
• Realtime visualizations
• Alarms (find problems) and
action recommendations
– Classification
– Anomaly detection
• Drill down and look at
historical data as before.
Oversight : Usecases
• Preprocessing: Correlations, filtering, transformations
• Tracking – follow the state of some entity (e.g. in
space, time or process status).
– e.g. location of airline baggage, vehicles, tracking wildlife
• Respond to emergencies
– E.g. plan maintenance before an aircraft lands
• Detect trends – event sequences, missing events,
thresholds, outliers, complex trends such as triple bottom, etc.
– (e.g. algorithmic trading, SLA, system management)
• Building profiles – extract info and relationships (e.g.
targeted marketing)
– Marketing, recommendations
Insight (Understanding Why?)
• Pattern Mining – find frequent
associations (e.g. Market Basket),
frequent sequences
• Clustering
• Graph Analysis
• Knowledge Discovery
• Correlations between features and finding principal
components
• Simulations, complex system modeling, matching a
statistical distribution
Usecase 1: Clustering
• Clustering groups
similar items together
(e.g. KMeans)
• Applications
– Similar documents,
Genes, Medical
imaging, similar
people, customers
– Crime Analysis
– Compare chemical
compounds
– Social network
analysis
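KMeans, the example named above, alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. Below is a minimal pure-Python sketch. The starting centroids are passed in explicitly to keep it deterministic, whereas real libraries choose them for you.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: returns (final centroids, clusters from the last assignment)."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```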
Usecase 2: Graph Analytics
• Types of graphs: social,
communication, biological
networks, maps, Web,
Semantic Web/ontologies
• Problems
– Counting triangles (influence)
– Find hubs and authorities (key people, pages)
– Finding shortest paths and minimum spanning tree
(Routing Internet traffic and UPS trucks)
– Modularity - strength of community / Centrality
– Graph, clique, sub graph detection
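Counting triangles, the first problem listed, can be done by intersecting the neighbour sets of each edge's endpoints. This single-machine sketch assumes an undirected graph given as an edge list with comparable node labels. Large graphs would need a distributed variant.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) edge pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Count each triangle once, as the ordered triple u < v < w.
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total


edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(count_triangles(edges))
```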
Usecase 3: Modeling Solar System
• The equations of motion cannot be solved in closed form
• Simulations
• Other examples: Weather
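When the equations cannot be solved directly, the system is advanced in small time steps instead. The toy sketch below integrates one planet around a fixed sun with the velocity Verlet scheme. The units (GM = 1), the single-planet setup, and the choice of integrator are simplifying assumptions, far from a real solar-system model.

```python
import math


def accel(x, y, gm=1.0):
    """Gravitational acceleration toward a sun fixed at the origin."""
    r3 = (x * x + y * y) ** 1.5
    return -gm * x / r3, -gm * y / r3


def simulate(x, y, vx, vy, dt, steps):
    """Advance the planet `steps` times using velocity Verlet."""
    ax, ay = accel(x, y)
    for _ in range(steps):
        vx += 0.5 * dt * ax   # half kick
        vy += 0.5 * dt * ay
        x += dt * vx          # drift
        y += dt * vy
        ax, ay = accel(x, y)
        vx += 0.5 * dt * ax   # second half kick
        vy += 0.5 * dt * ay
    return x, y, vx, vy


# Circular orbit of radius 1: the period is 2*pi in these units.
dt = 0.001
steps = round(2 * math.pi / dt)
x, y, vx, vy = simulate(1.0, 0.0, 0.0, 1.0, dt, steps)
print(x, y)
```

After one full period the planet should be back near its starting point, which gives a simple sanity check on the step size.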
Foresight (Predict)
• Build a Model
– Weather, Economic models
• Predict the future values
– Electricity load, traffic, demand,
sales
• Classification
– Spam detection, Group users,
Sentiment analysis
• Find anomalies
– Fraud, Predictive maintenance
• Recommendations
– Targeted advertising, product
recommendations
Prediction Technologies
• Trying to build a model for the
data
• Predict next values in a
sequence
– Regression, neural networks,
Markov and Hidden Markov
Models
• Classification
– Decision trees, SVMs, graphical
models
• Finding anomalies
– Markov chains, outliers in a
distribution
• Recommendations
http://misterbijou.blogspot.com/2010_09_01_archive.html
Usecase 1: Electricity Demand Forecast
• Find trends and
cycles (e.g. ARIMA)
• Use regression to
build a model using
earlier data
• Predict based on
the model
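The "fit on earlier data, then predict" steps can be sketched with ordinary least squares on a time index. This captures only a linear trend, so cycles would need something like the ARIMA models mentioned above. The demand figures and function names are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def predict(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical daily demand (MWh) for days 0..3; forecast day 4.
days = [0, 1, 2, 3]
demand = [10.0, 12.0, 14.0, 16.0]
model = fit_line(days, demand)
print(model, predict(model, 4))
```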
Usecase 2: Predictive Maintenance
• The idea is to fix the problem
before it breaks, avoiding
expensive downtime
– Airplanes, turbines,
windmills
– Construction equipment
– Cars, golf carts
• How
– Anomaly detection (deviation
from normal operation)
– Match against known error
patterns
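One simple form of anomaly detection is to flag readings that sit far from a baseline of normal operation. The sketch below uses a z-score rule. The 3-sigma threshold, the function name, and the sensor values are assumptions, and production systems typically use richer models.

```python
from statistics import mean, stdev


def anomalies(baseline, readings, threshold=3.0):
    """Return (index, value) for readings more than `threshold` std devs from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return [(i, r) for i, r in enumerate(readings) if abs(r - mu) > threshold * sigma]


# Hypothetical vibration readings: a normal period, then live data.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9]
readings = [10.1, 9.7, 12.5, 10.0]
print(anomalies(baseline, readings))
```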
Usecase 3: Targeted Marketing
Outline
Big Data Projects are
• Access to data is the main asset
– Data owners set the terms
• Involve many organizations
– Data owners rarely have the expertise to make use of
the data
• Multi-domain
– Retain and teach cross-domain people
• Complicated and built on a lot of open source tools
– Do not reinvent the wheel; let go of NIH (not invented here)
• Mathematical
– Math, advanced algorithms, statistical methods,
machine learning. Brush up your math!!
• Distributed
– Learn a bit of distributed systems
Questions?

Introduction to Big Data: making sense of the World around Us

  • 1.
  • 2. Image cedit, CC licence, http://ansem315.deviantart.com/art/Asimov- Foundation-395188263 • Predict crime before it happens? • Which is hard! • Asimov’s “Foundation” talks about mathematical models to predict the future. • We are entering that Era where above are not just science fictions
  • 3. e.g. Targeted Marketing • Assume mass emails to 1M people, reaction rate of 1%, 2$ cost per email. – Then cost 2M$ and reach of 10k people. • Lets say that looking at demographics (e.g. where they live), you can find 250K people with reaction rate of 6%, then (e.g. by using decision trees) • Then cost 500K$ and reach of 15k people.
  • 4. A day in your life  Think about a day in your life? – What is the best road to take? – Would there be any bad weather? – How to invest my money? – How is my health?  There are many decisions that you can do better if only you can access the data and process them. http://www.flickr.com/photos/kcolwell/5 512461652/ CC licence
  • 5.
  • 6. Internet of Things • Currently physical world and software worlds are detached • Internet of things promises to bridge this – It is about sensors and actuators everywhere – In your fridge, in your blanket, in your chair, in your carpet.. Yes even in your socks – Umbrella that light up when there is rain and medicine cups
  • 7. What can we do with Big Data? • Optimize (World is inefficient) – 30% food wasted farm to plate – GE 1% initiative (http://goo.gl/eYC0QE ) • 1% saving in trains can save 2B/ year • 1% in US healthcare is 20B/ year • In contrast, Sri Lanka total exports 9B/ year. • Save lives – Weather, Disease identification, Personalized treatment • Technology advancement – Most high tech research are done via simulations
  • 9. Why is Big Data hard? • How to store? Assuming 1 TB per computer, it takes 1000 computers to store 1 PB • How to move? Assuming a 10Gb network, it takes 2 hours to copy 1 TB, or 83 days to copy 1 PB • How to search? Assuming each record is 1 KB and one machine can process 1000 records per sec, it needs about 277 CPU hours (roughly 11.6 days) to process 1 TB and about 32 CPU years to process 1 PB • Big data needs distributed systems http://www.susanica.com/photo/9
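A back-of-the-envelope sketch of the move and search estimates, assuming decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes) and the record size and processing rate stated on the slide. The helper names are illustrative. Note that a 10 Gb/s link at its full nominal rate would copy 1 TB in about 13 minutes; a figure of 2 hours per TB corresponds to an effective throughput of roughly 1.1 Gb/s.

```python
TB = 10**12  # bytes, decimal units (an assumption)
PB = 10**15


def copy_seconds(size_bytes, link_bits_per_sec):
    """Time to push size_bytes through a link at the given bit rate."""
    return size_bytes * 8 / link_bits_per_sec


def scan_seconds(size_bytes, record_bytes=1000, records_per_sec=1000):
    """Time for one machine to process every record once."""
    return size_bytes / record_bytes / records_per_sec


copy_tb_ideal = copy_seconds(TB, 10e9)        # seconds at nominal 10 Gb/s
implied_rate = TB * 8 / (2 * 3600)            # bits/s if 1 TB takes 2 hours
copy_pb_days = 2 * 1000 / 24                  # 2 h per TB scaled to 1 PB
scan_tb_hours = scan_seconds(TB) / 3600
scan_pb_years = scan_seconds(PB) / (3600 * 24 * 365)

print(f"1 TB at nominal 10 Gb/s: {copy_tb_ideal / 60:.1f} minutes")
print(f"rate implied by 2 h/TB:  {implied_rate / 1e9:.2f} Gb/s")
print(f"1 PB at 2 h/TB:          {copy_pb_days:.0f} days")
print(f"scan 1 TB:               {scan_tb_hours:.0f} CPU hours")
print(f"scan 1 PB:               {scan_pb_years:.1f} CPU years")
```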
  • 11. Big data Processing Technologies Landscape
  • 12. MapReduce/ Hadoop • First introduced by Google, and used as the programming model for their systems • Implemented by open source projects like Apache Hadoop and Spark • Users write two functions: map and reduce • The framework handles details like distributed processing, fault tolerance, load balancing etc. • Widely used, and one of the catalysts of Big Data void map(ctx, k, line){ (player, speed) = split(line, ','); ctx.emit(player, speed); } void reduce(ctx, player, speeds[]){ ctx.emit(player, avg(speeds)); }
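To make the map/reduce split concrete, here is a minimal single-process sketch in Python. It only imitates the programming model (map, group by key, reduce); `run_mapreduce` is a hypothetical helper and not a Hadoop API, and the input lines are made up.

```python
from collections import defaultdict


def map_fn(line):
    """Emit one (player, speed) pair per input line of the form 'player,speed'."""
    player, speed = line.split(",")
    yield player.strip(), float(speed)


def reduce_fn(player, speeds):
    """Receive all speeds for one player and emit the average."""
    return player, sum(speeds) / len(speeds)


def run_mapreduce(lines):
    # "Shuffle" phase: group every emitted value under its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce phase: one call per key.
    return dict(reduce_fn(key, values) for key, values in groups.items())


result = run_mapreduce(["p1,4.0", "p2,6.0", "p1,8.0"])
print(result)
```

In a real framework the grouping step runs across machines, which is the part the user does not have to write.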
  • 14. Apache Spark • New programming model built on functional programming concepts • Can be much faster for iterative use cases • Performance: using Spark on 206 EC2 machines, 100 TB of data on disk was sorted in 23 minutes. The previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes (i.e. about 3X faster with about 10X fewer machines, roughly a 30X gain per machine)
  • 15. Calculating Avg Speed with Spark file = spark.textFile("hdfs://… speed.data"); pairs = file.map(fnSplit2Pair); tot = pairs.reduceByKey((a, b) => a + b); count = pairs.mapValues(v => 1).reduceByKey((a, b) => a + b); avgSpeed = tot.join(count).mapValues(p => p._1 / p._2); • Map the data to a virtual variable (a lazily evaluated dataset), which does not load the data • Then apply lambda functions
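A per-key average cannot be reduced directly, because the average of two averages is not the overall average. A common approach is to reduce (sum, count) pairs and divide at the end. The sketch below imitates that in plain Python; `reduce_by_key` is a hypothetical stand-in for the idea behind Spark's `reduceByKey`, not the Spark API itself, and the data is made up.

```python
def reduce_by_key(pairs, fn):
    """Combine all values that share a key using the binary function fn."""
    out = {}
    for key, value in pairs:
        out[key] = fn(out[key], value) if key in out else value
    return out


lines = ["p1,4.0", "p2,6.0", "p1,8.0", "p2,9.0", "p2,3.0"]

# Map each line to (player, (speed, 1)) so sums and counts travel together.
pairs = []
for line in lines:
    player, speed = line.split(",")
    pairs.append((player, (float(speed), 1)))

# The combining function is associative, so partial results can be merged
# in any order, which is what allows it to run in parallel.
totals = reduce_by_key(pairs, lambda a, b: (a[0] + b[0], a[1] + b[1]))
avg_speed = {player: total / count for player, (total, count) in totals.items()}
print(avg_speed)
```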
  • 16. What if you could freeze time!! • Most solutions run overnight • Think how you would buy something: you research a bit and buy, so overnight is often too late. • Not all trends take time – People change their minds – Trends move fast – React to what the customer is doing (do not let them move away) • At a CEP speed of 400K events/sec, if each event took a second, one second of the real world would take about 4.6 days to pass!!
  • 17. Real-time Analytics • The idea is to process data as it is received, in a streaming fashion • Used when we need – Very fast output – Lots of events (a few 100K to millions) – Processing without storing (e.g. too much data) • Two main technologies – Stream Processing (e.g. Storm, http://storm-project.net/ ) – Complex Event Processing (CEP) http://wso2.com/products/complex-event-processor/
  • 18. Complex Event Processing (CEP) • Sees inputs as event streams, which are queried with an SQL-like language • Supports filters, windows, joins, patterns and sequences define partition “playerPartition” as PlayerDataStream.pid; from PlayerDataStream#win.time(1m) select pid, avg(speed) as avgSpeed insert into AvgSpeedStream using partition playerPartition;
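What the query above computes can be sketched in Python as a per-player sliding time window. This is an illustration of the idea only, not how a CEP engine is implemented; the class name `WindowedAverage` is hypothetical, and the sketch assumes events for each player arrive in timestamp order and that old events are evicted only when a new event for that player arrives.

```python
from collections import defaultdict, deque


class WindowedAverage:
    """Average speed per player over the last `window_seconds` of events."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.events = defaultdict(deque)   # pid -> deque of (timestamp, speed)
        self.sums = defaultdict(float)     # pid -> running sum inside the window

    def on_event(self, ts, pid, speed):
        queue = self.events[pid]
        queue.append((ts, speed))
        self.sums[pid] += speed
        # Evict events that have fallen out of the time window.
        while queue[0][0] <= ts - self.window:
            _, old_speed = queue.popleft()
            self.sums[pid] -= old_speed
        return pid, self.sums[pid] / len(queue)


win = WindowedAverage(window_seconds=60.0)
first = win.on_event(0.0, "p1", 2.0)
second = win.on_event(30.0, "p1", 4.0)
third = win.on_event(70.0, "p1", 6.0)   # the event at t=0 has expired by now
print(first, second, third)
```

Keeping a running sum means each event is handled in constant amortized time, which matters at hundreds of thousands of events per second.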
  • 19. DEBS Grand Challenge • Event processing challenge • Real football game, with sensors in the players' shoes and in the ball • Events at 15 kHz • Event format – Sensor ID, TS, x, y, z, v, a • Queries – Running Stats – Ball Possession – Heat Map of Activity – Shots at Goal
  • 20. Example: Detect Ball Possession • Possession is the time from when a player hits the ball until someone else hits it or it goes out of the ground • See demo, http://goo.gl/VW6xQN from Ball#window.length(1) as b join Players#window.length(1) as p unidirectional on debs:getDistance(b.x, b.y, b.z, p.x, p.y, p.z) < 1000 and b.a > 55 select ... insert into hitStream from old = hitStream, b = hitStream[old.pid != pid], n = hitStream[b.pid == pid]*, (e1 = hitStream[b.pid != pid] or e2 = ballLeavingHitStream) select ... insert into BallPossessionStream http://www.flickr.com/photos/glennharper/146164820/
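The two-step logic of the queries above (detect hits, then turn a sequence of hits into possession intervals) can be sketched in Python as follows. This is a simplified illustration under assumptions: the input tuple layouts and function names are hypothetical, the thresholds 1000 and 55 are taken from the query, and the original `select ...` clauses are elided so the exact outputs of the real queries are not reproduced.

```python
import math


def detect_hits(samples, max_dist=1000.0, min_accel=55.0):
    """Yield (ts, pid) when the ball accelerates sharply close to a player.

    samples: iterable of (ts, ball_xyz, ball_accel, {pid: player_xyz}).
    """
    for ts, ball, accel, players in samples:
        if accel <= min_accel:
            continue
        for pid, pos in players.items():
            if math.dist(ball, pos) < max_dist:
                yield ts, pid


def possession(events):
    """Total possession time per player.

    events: time-ordered tuples ('hit', ts, pid) or ('out', ts, None).
    Possession runs from a player's first hit until another player hits
    the ball or the ball leaves the ground.
    """
    totals = {}
    holder, since = None, None
    for kind, ts, pid in events:
        if holder is not None and (kind == "out" or pid != holder):
            totals[holder] = totals.get(holder, 0) + (ts - since)
            holder = None
        if kind == "hit" and holder is None:
            holder, since = pid, ts
    return totals


samples = [
    (0, (0, 0, 0), 60, {"A": (100, 0, 0), "B": (5000, 0, 0)}),
    (1, (0, 0, 0), 10, {"A": (100, 0, 0)}),
]
hits = list(detect_hits(samples))
totals = possession([("hit", 0, "A"), ("hit", 2, "A"), ("hit", 5, "B"), ("out", 9, None)])
print(hits, totals)
```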
  • 22. Machine Learning Tools • R – programming language for statistical computing (most widely used) • Weka – Java machine learning library (single node) • Scikit-learn – very easy to use Python library • Scalable – Mahout: MapReduce implementation of machine learning algorithms – MLBase (based on Spark) – Others: GraphLab, VW, 0xData • PMML (Predictive Model Markup Language) – Lets you port models between languages
  • 24. Curious Case of Missing Data http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, Pic from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/ • WW II: returned aircraft and data on where they were hit • How would you add armour?
  • 25. Challenges due to the Nature of Big Data • Lack of controlled experiments – Often countered with A/B testing in the field – Hard to prove causality • Does the data come from a representative sample? • Privacy – Security – Randomized techniques (see http://goo.gl/sLfKIb )
  • 26. Big data lifecycle • Get the data, clean up
  • 27. Making Sense of Data • Hindsight (to know what happened) – Basic analytics + visualizations (min, max, average, histogram, distributions …) • Oversight (to know what is happening and fix it) – Realtime analytics • Insight – Pattern mining, Clustering • Foresight – Neural networks, Classification, Recommendation
  • 28. Hindsight (What happened?) • Analytics implemented with MapReduce or queries – Min, max, average, correlation, histograms – Might join or group data in many ways – Heatmaps, temporal trends • Key Performance Indicators (KPIs) – Avg time for a customer service ticket – Profit per square foot for retail • Data is often presented with some visualizations http://www.flickr.com/photos/isriya/2967310333/
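As a small illustration of such hindsight analytics, the sketch below computes min, max, mean and a histogram for a KPI like ticket handling time. The data values and the `summarize` helper are made up for illustration; at scale the same aggregates would be computed with MapReduce jobs or queries rather than in memory.

```python
import statistics
from collections import Counter


def summarize(values, bin_width):
    """Basic descriptive statistics plus a fixed-width histogram."""
    histogram = Counter((v // bin_width) * bin_width for v in values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "histogram": dict(sorted(histogram.items())),  # bin start -> count
    }


# Hypothetical handling times (minutes) for customer service tickets.
ticket_minutes = [12, 45, 7, 30, 18, 52, 25]
summary = summarize(ticket_minutes, bin_width=20)
print(summary)
```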
  • 29. Drill Down • The idea is to let users drill down and explore a view of the data – E.g. find the customers, region, time of year etc. that are responsible for most revenue • With OLAP, users define 3 (or more) dimensions, and the tool lets them explore the cube and look at only a subset of the data. – E.g. tools: Mondrian, Apache Drill http://pixabay.com, by Steven Iodice
  • 30. Usecase: Planning • Urban Planning – People distribution – Mobility – Waste Management – E.g. see http://goo.gl/jPujmM • Market Research – Buying Patterns – Sentiments
  • 31. Oversight (What is happening?) • Realtime analytics • Realtime visualizations • Alarms (find problems) and action recommendations – Classification – Anomaly detection • Drill down and look at historical data as before.
  • 32. Oversight: Usecases • Preprocessing: correlations, filtering, transformations • Tracking – follow some related entity’s state (such as in space, time or process status) – e.g. location of airline baggage, vehicles, tracking wildlife • Respond to emergencies – E.g. plan maintenance before an aircraft lands • Detect trends – event sequences, missing events, thresholds, outliers, complex trends such as a triple bottom etc. – (e.g. algorithmic trading, SLA, system management) • Building profiles – extract info, relationships (e.g. targeted marketing) – Marketing, Recommendations
  • 33. Insight (Understanding Why?) • Pattern Mining – find frequent associations (e.g. Market Basket), frequent sequences • Clustering • Graph Analysis • Knowledge Discovery • Correlations between features and finding principal components • Simulations, complex system modeling, matching a statistical distribution
  • 34. Usecase 1: Clustering • Clustering => group similar items together. (e.g. KMeans) • Applications – Similar documents, Genes, Medical imaging, similar people, customers – Crime Analysis – Compare chemical compounds – Social network analysis
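A minimal sketch of k-means in plain Python, to show the assign-then-update loop behind the idea of grouping similar items. It assumes points are tuples of numbers, uses a fixed iteration count instead of a convergence test, and picks initial centroids at random from the data; the sample points are made up. For real work a library such as scikit-learn would normally be used.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Return (centroids, clusters) after a fixed number of k-means rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```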
  • 35. Usecase 2: Graph Analytics • Types of graphs: social, communication, biological networks, maps, Web, Semantic Web/ontologies • Problems – Counting triangles (influence) – Finding hubs and authorities (key people, pages) – Finding shortest paths and minimum spanning trees (routing Internet traffic and UPS trucks) – Modularity (strength of community) / centrality – Graph, clique, sub graph detection
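As one concrete example from the list above, triangle counting on an undirected graph can be sketched as follows. This is a simple in-memory version that assumes node labels can be ordered (e.g. integers); the edge lists are made up, and large graphs would need a distributed approach.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) edge pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Common neighbours close a triangle; requiring w > v counts
                # each triangle exactly once (as u < v < w).
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total


k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
triangle_with_tail = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(count_triangles(k4), count_triangles(triangle_with_tail))
```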
  • 36. Usecase 3: Modeling the Solar System • PDEs not solvable analytically • Simulations • Other examples: Weather
  • 37. Foresight (Predict) • Build a Model – Weather, Economic models • Predict the future values – Electricity load, traffic, demand, sales • Classification – Spam detection, Group users, Sentiment analysis • Find anomalies – Fraud, Predictive maintenance • Recommendations – Targeted advertising, product recommendations
  • 38. Prediction Technologies • Trying to build a model for the data • Predict next values in a sequence – Regression, Neural networks, Markov and Hidden Markov Models • Classification – Decision Trees, SVMs, Graphical Models • Finding Anomalies – Markov Chains, outliers in a distribution • Recommendations http://misterbijou.blogspot.com/2010_09_01_archive.html
  • 39. Usecase 1: Electricity Demand Forecast • Find trends and cycles (e.g. ARIMA) • Use regression to build a model using earlier data • Predict based on the model
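The "build a model from earlier data, then predict" step can be illustrated with ordinary least squares on a straight line. This is deliberately the simplest possible regression, not ARIMA, and it ignores the daily and seasonal cycles a real demand forecast would need; the load values are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical earlier data: hour index -> electricity load (MW).
hours = [0, 1, 2, 3, 4]
load = [100.0, 102.0, 104.0, 106.0, 108.0]

slope, intercept = fit_line(hours, load)
forecast = slope * 5 + intercept  # predict the next hour from the model
print(slope, intercept, forecast)
```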
  • 40. Usecase 2: Predictive Maintenance • The idea is to fix the problem before it breaks, avoiding expensive downtime – Airplanes, turbines, windmills – Construction equipment – Cars, golf carts • How – Anomaly detection (deviation from normal operation) – Matching against known error patterns
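One simple way to flag "deviation from normal operation" is a threshold on how many standard deviations a reading lies from a baseline recorded during healthy operation. The sketch below assumes roughly normally distributed readings; the sensor values, the 3-sigma threshold and the function name are illustrative choices, not something prescribed by the slides.

```python
import statistics


def find_anomalies(baseline, readings, threshold=3.0):
    """Return (index, value) for readings far from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return [
        (i, r)
        for i, r in enumerate(readings)
        if abs(r - mean) > threshold * stdev
    ]


# Hypothetical bearing temperatures (degrees C) during normal operation.
baseline = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2]
readings = [70.0, 70.4, 75.0, 69.7]

anomalies = find_anomalies(baseline, readings)
print(anomalies)
```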
  • 41. Usecase 3: Targeted Marketing
  • 43. Big Data Projects are • Access to data is the main asset – Data owners set the terms • Involve many organizations – Data owners rarely have the expertise to make use of the data • Multi-domain – Retain and teach cross-domain people • Complicated and built on a lot of open source tools – Do not reinvent the wheel, and let go of NIH (not invented here) • Mathematical – Math, advanced algorithms, statistical methods, machine learning. Brush up your math!! • Distributed – Learn a bit of distributed systems