SlideShare a Scribd company logo
1 of 52
Outline• Forecasts and
Why Big data?
• What is Big
Data?
• Parts of the
Puzzle
• Quick Tutorial
• Conclusion
Photo by John Trainoron Flickr
http://www.flickr.com/photos/trainor/2902023575/, Licensed under CC
Asimov Foundation
• 20 million worlds
• A Scientist who
mathematically calculate
the fate of the Universe
• His effort to change that
Fate
• A Beautiful story
Consider a day in your life
• What is the best road to take?
• Would there be any bad
weather?
• What is the best way to invest
the money?
• Should I take that loan?
• Can I optimize my day?
• Is there a way to do this
faster?
• What have others done in
similar cases?
• Which product should I buy?
People wanted to (through ages)
• To know (what
happened?)
• To Explain (why it
happened)
• To Predict (what will
happen?)
Many Cultures had
Claim for Omniscience
• Oracle
• Astrology
• Book of Changes
• Tarot Cards
• Crystal balls
Claim for Omniscience
To know, explain and predict!!
• Grand challenge of our time
• We have been trying to do this in
many other means
• Now we trying to do this via
science
Any sufficiently advanced technology
is indistinguishable from magic.
---Arthur C. Clarke.
• We see a possibilities though “lot
of data”
You does not seem to be convinced!!
• Why, sometimes I've believed as many as six
impossible things before breakfast.
Prophecies of our time
• We can predict Weather
• We do understand language
translation pretty well
• We can predict how an air
plane behave good enough to
fly
• Our ability on forensics
• We understand remedies for
many diseases
• We can tell how astro bodies
will behave (e.g. comets)
This is being Done: Weather
• Remarkably accurate for few
days
– SL incident
• Data :- weather radars, satellite
data, weather balloons, planes
etc.,
• forecast models
– Simulation
– Numerical method
• Challenge is computing power
– algorithms that take more than
24 hours
– resolution is the key ……
Democratizing Analysis
• Forecasting was done, but in limited manner
and only by few (e.g. National Labs,
Intelligence community)
• That is changing!!
What is Big data?
• There is lot of data available
– E.g. Internet of things
• We have computing power
• We have technology
• Goal is same
– To know
– To Explain
– To predict
• Challenge is the full lifecycle
Data, the wealth of our time
"Data is a precious
thing because they last
longer than systems” -
Tim Barnes Lee
• Access to data is becoming ultimate
competitive advantage
– E.g. Google+ vs. Facebook
– Why many organizations try hard to give us free
things and keep us always logged in (e.g. Gmail,
facebook, search engine tool bars)
Drivers of Big Data
Data Avalanche/ Moore’s law of data
• We are now collecting and converting large amount
of data to digital forms
• 90% of the data in the world today was created
within the past two years.
• Amount of data we have doubles very fast
In real life, most data are Big
• Web does millions of activities per second, and so
much server logs are created.
• Social networks e.g. Facebook, 800 Million active
users, 40 billion photos from its user base.
• There are >4 billion phones and >25% are smart
phones. There are billions of RFID tags.
• Observational and Sensor data
– Weather Radars, Balloons
– Environmental Sensors
– Telescopes
– Complex physics simulations
Why Big Data is hard?
• How store? Assuming 1TB bytes it
takes 1000 computers to store a 1PB
• How to move? Assuming 10Gb
network, it takes 2 hours to copy 1TB,
or 83 days to copy a 1PB
• How to search? Assuming each record
is 1KB and one machine can process
1000 records per sec, it needs 277CPU
days to process a 1TB and 785 CPU
years to process a 1 PB
• How to process?
– How to convert algorithms to work in large
size
– How to create new algorithms
http://www.susanica.com/photo/9
Why it is hard (Contd.)?
• System build of many
computers
• That handles lots of data
• Running complex logic
• This pushes us to frontier of
Distributed Systems and
Databases
• More data does not mean
there is a simple model
• Some models can be complex
as the system
http://www.flickr.com/photos/mariachily/5250487136,
Licensed CC
Big Data Architecture
Sensors
• There are sensors everywhere
– RFID (e.g. Walmart), GPS
sensors, Mobile Phone …
• Internet
– Click streams, Emails, chat,
search, tweets ,Transactions …
• Real Word
– Video surveillance, Cash flows,
Traffic, Surveillance, Stock
exchange, Smart Grid,
Production line …
– Internet of Things
http://www.flickr.com/photos/imuttoo/4257813689/ by Ian Muttoo,
http://www.flickr.com/photos/eastcapital/4554220770/,
http://www.flickr.com/photos/patdavid/4619331472/ by Pat David copyright CC
Collecting Data
• Data collected at sensors and sent to big data
system via events or flat files
• Event Streams: we name the events by its
content/ originator
• Get data through
– Point to Point
– Event Bus
• E.g. Data bridge
– a thrift based transport we
did that do about 400k
events/ sec
Storing Data
• Historically we used databases
– Scale is a challenge: replication,
sharding
• Scalable options
– NoSQL (Cassandra, Hbase) [If
data is structured]
• Column families Gaining Ground
– Distributed file systems (e.g.
HDFS) [If data is unstructured]
• New SQL
– In Memory computing, VoltDB
• Specialized data structures
– Graph Databases, Data structure
servers http://www.flickr.com/photos/keso/3631339
67/
Making Sense of Data
• To know (what happened?)
– Basic analytics +
visualizations (min, max,
average, histogram,
distributions … )
– Interactive drill down
• To explain (why)
– Data mining, classifications,
building models, clustering
• To forecast
– Neural networks, decision
models
To know (what happened?)
• Mainly Analytics
– Min, Max, average,
correlation, histograms
– Might join group data in many
ways
• Implemented with
MapReduce or Queries
• Data is often presented with
some visualizations
• Examples
– forensics
– Assessments
– Historical data/ reports/
trends
http://www.flickr.com/photos/isriya/2967310
333/
Search
• Process and Index the data. The killer app of
our time.
http://pixabay.com, by Steven Iodice
• Web Search
• Graph Search
• Semantic Search
• Drug Discovery
• …
Patterns
• Correlation
– Scatter plot, statistical
correlation
• Data Mining (Detecting
Patterns)
– Clustering and classification
– Finding Similar items
– Finding Hubs and authorities
in a Graph
– Finding frequent item sets
– Making recommendation
• Apache Mahout
http://www.flickr.com/photos/eriwst/2987739376/ and http://www.flickr.com/photos/focx/5035444779/
Forecasts and Models
• Trying to build a model for the
data
• Theoretically or empirically
– Analytical models (e.g. Physics)
– Neural networks
– Reinforcement learning
– Unsupervised learning (clustering,
dimensionality reduction, kernel
methods)
• Examples
– Translation
– Weather Forecast models
– Building profiles of users
– Traffic models
– Economic models
• Remember: Correlation does not
mean causality
http://misterbijou.blogspot.com/201
0_09_01_archive.html
Information Visualization
• Presenting information
– To end user
– To decision takers
– To scientist
• Interactive exploration
• Sending alerts
http://www.flickr.com/photos/stevefae
embra/3604686097/
Napoleon's Russian Campaign
Show Han Rosling’s Data
• This use a tool called “Gapminder”
Usecase 1: Targeted Marketing
• Collect data and build a model on
– What user like
– What he has brought
– What is his buying power
• Making recommendations
• Giving personalized deals
Usecase 2: Travel
• Collect traffic data +
transportation data from
sensors
• Build a model (e.g. Car
Following Models)
• Predict the traffic ..
Possibilities on congestion
• E.g. divert traffic, adjust
troll, adjust traffic lights
Practical Tutorial
• Data collection
– Pub/sub, event architectures
• Processing
– Store and Process: MapReduce/Hadoop
– Processing Moving Data: CEP
• Visualization
– GNU Plot/ R
DEBS Challenge
• Event Processing
challenge
• Real football game,
sensors in player
shoes + ball
• Events in 15k Hz
• Event format
– Sensor ID, TS, x, y, z, v,
a
• Queries
– Running Stats
– Ball Possession
– Heat Map of Activity
– Shots at Goal
MapReduce/ Hadoop
• First introduced by Google,
and used as the processing
model for their architecture
• Implemented by opensource
projects like Apache Hadoop
and Spark
• Users writes two functions:
map and reduce
• The framework handles the
details like distributed
processing, fault tolerance,
load balancing etc.
• Widely used, and the one of
the catalyst of Big data
void map(ctx, k, v){
tokens = v.split();
for t in tokens
ctx.emit(t,1)
}
void reduce(ctx, k, values[]){
count = 0;
for v in values
count = count + v;
ctx.emit(k,count);
}
MapReduce (Contd.)
Histogram of Speeds
Histogram of Speeds(Contd.)
void map(ctx, k, v){
event = parse(v);
range = calcuateRange(event.v);
ctx.emit(range,1)
}
void reduce(ctx, k, values[]){
count = 0;
for v in values
count = count + v;
ctx.emit(k,count);
}
How long player Spend Near the ball?
How long player Spend Near the
ball?(Contd.)
void map(ctx, k, v){
event = parse(v);
cell = calcuateOverlappingCells(v.x,
v.y, v.z);
ctx.emit(cell,v.x, v.y, v.z);
}
Void reduce(ctx, k, values[]){
players = joinAndFindPlayersNearBall(values);
for p in players
ctx.emit(p,p);
}
Hadoop Landscape
• HIVE - Query data using SQL style queries, and Hive will
convert them to MapReduce jobs and run in Hadoop.
• Pig - We write programs using data flow style scripts, and
Pig convert them to MapReduce jobs and run in Hadoop.
• Mahout
– Collection of MapReduce jobs implementing many Data
mining and Artificial Intelligence algorithms using MapReduce
A = load 'hdi-data.csv' using PigStorage(',')
AS (id:int, country:chararray, hdi:float,
lifeex:int, mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
dump C;
hive> SELECT country, gni from HDI WHERE gni > 2000;
Is Hadoop Enough?
• Limitations
– Takes time for processing
– Lack of Incremental processing
– Weak with Graph usecases
– Not very easy to create a processing pipeline (addressed
with HIVE, Pig etc.)
– Too close to programmers
– Faster implementations are possible
• Alternatives
– Apache Drill http://incubator.apache.org/drill/
– Spark http://spark-project.org/
– Graph Processors like http://giraph.apache.org/
Data In the Move
• Idea is to process data as they
are received in streaming
fashion
• Used when we need
– Very fast output
– Lots of events (few 100k to
millions)
– Processing without storing (e.g.
too much data)
• Two main technologies
– Stream Processing (e.g. Strom,
http://storm-project.net/ )
– Complex Event Processing (CEP)
http://wso2.com/products/complex
-event-processor/
Complex Event Processing (CEP)
• Sees inputs as Event streams and queried with
SQL like language
• Supports Filters, Windows, Join, Patterns and
Sequences
from p=PINChangeEvents#win.time(3600) join
t=TransactionEvents[p.custid=custid][amount>1000
0] #win.time(3600)
return t.custid, t.amount;
Example: Detect ball Possession
• Possession is time a
player hit the ball
until someone else
hits it or it goes out
of the ground
from Ball#window.length(1) as b join
Players#window.length(1) as p
unidirectional
on debs: getDistance(b.x,b.y,b.z,
p.x, p.y, p.z) < 1000
and b.a > 55
select ...
insert into hitStream
from old = hitStream ,
b = hitStream [old. pid != pid ],
n= hitStream[b.pid == pid]*,
( e1 = hitStream[b.pid != pid ]
or e2= ballLeavingHitStream)
select ...
insert into BallPossessionStream
http://www.flickr.com/photos/glennharper/146164820/
GNU Plot
set terminal png
set output "buyfreq.png"
set title "Frequency Distribution of Items
brought by Buyer";
setylabel "Number of Items Brought";
setxlabel "Buyers Sorted by Items count";
set key left top
set log y
set log x
plot "1.data" using 2 title "Frequency"
with linespoints
• Open source
• Very powerful
Big data lifecycle
• Realizing the big data lifecycle is hard
• Need wide understanding about many fields
• Big data teams will include members from
many fields working together
• Data Scientists: A new role, evolving Job
description
• His Job is to make sense of data, by using Big data
processing, advance algorithms, Statistical
methods, and data mining etc.
• Likely to be a highest paid/ cool Jobs.
• Big organizations (Google, Facebook, Banks) hiring
Math PhD by a lot for this.
Data Scientist
Statisticians
System experts
Domain Experts
DB experts
AI experts
Dark Side
• Privacy
– Invasion of privacy
– Data might be used for
unexpected things
• Big Brother
– Data likely to used for
control (e.g. governments)
• If technology is out there,
may be it is OK. It is very
hard to hide any thing,
which work both ways
Challenges
• Speed (e.g. targeted advertising, reacting to data)
• Extracting semantics and handling multiple
representations and formats
• Security Data ownership, delegation, permissions,
and Privacy
• Making data accessible to all intended parties,
from anywhere, anytime, from any device,
through any format (subjected to permissions).
• Map-Reduce good enough? What about other
parallel problems?
• Handling Uncertainty
Conclusions
• Lot of data, realizing data -> insight -> predications
• We do lot of predications even now.
• There is lot between data and forecasts, OK to do
them.
– Analytics
– Visualizations
– Patterns
– Data mining
• If you looking to start, learn MapReduce, CEP, and
GNU plot.
• Visualization is the key!!
• Learn some distributed systems and AI
Questions?

More Related Content

What's hot (20)

Big_data_ppt
Big_data_ppt Big_data_ppt
Big_data_ppt
 
Big Data
Big DataBig Data
Big Data
 
Big data introduction
Big data introductionBig data introduction
Big data introduction
 
Big data Presentation
Big data PresentationBig data Presentation
Big data Presentation
 
Big data
Big dataBig data
Big data
 
Our big data
Our big dataOur big data
Our big data
 
Big Data
Big DataBig Data
Big Data
 
Big data.
Big data.Big data.
Big data.
 
Presentation on Big Data
Presentation on Big DataPresentation on Big Data
Presentation on Big Data
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Big data
Big dataBig data
Big data
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data
Big dataBig data
Big data
 
Overview of Big data(ppt)
Overview of Big data(ppt)Overview of Big data(ppt)
Overview of Big data(ppt)
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
 
Big data and its applications
Big data and its applicationsBig data and its applications
Big data and its applications
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
BIG DATA and USE CASES
BIG DATA and USE CASESBIG DATA and USE CASES
BIG DATA and USE CASES
 

Similar to Introduction to Big Data

Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Srinath Perera
 
Building your big data solution
Building your big data solution Building your big data solution
Building your big data solution WSO2
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapSrinath Perera
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introductionamiyadash
 
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...Srinath Perera
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesRukshan Batuwita
 
Introduction to Data Processing (by Srinath Perera)
Introduction to Data Processing (by Srinath Perera)Introduction to Data Processing (by Srinath Perera)
Introduction to Data Processing (by Srinath Perera)SLASSCOM Technology Forum
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014ALTER WAY
 
chương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfchương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfphongnguyen312110237
 
Internet of Things and Big Data
Internet of Things and Big DataInternet of Things and Big Data
Internet of Things and Big DataSrinath Perera
 
WebServices_Grid.ppt
WebServices_Grid.pptWebServices_Grid.ppt
WebServices_Grid.pptEqinNiftalyev
 
Data sciences and marketing analytics
Data sciences and marketing analyticsData sciences and marketing analytics
Data sciences and marketing analyticsMJ Xavier
 

Similar to Introduction to Big Data (20)

Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
Building your big data solution
Building your big data solution Building your big data solution
Building your big data solution
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introduction
 
00-01 DSnDA.pdf
00-01 DSnDA.pdf00-01 DSnDA.pdf
00-01 DSnDA.pdf
 
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our Lives
 
Lecture1
Lecture1Lecture1
Lecture1
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Introduction to Data Processing (by Srinath Perera)
Introduction to Data Processing (by Srinath Perera)Introduction to Data Processing (by Srinath Perera)
Introduction to Data Processing (by Srinath Perera)
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
chương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfchương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdf
 
datamining-lect1.pptx
datamining-lect1.pptxdatamining-lect1.pptx
datamining-lect1.pptx
 
Internet of Things and Big Data
Internet of Things and Big DataInternet of Things and Big Data
Internet of Things and Big Data
 
Data Mining Lecture_1.pptx
Data Mining Lecture_1.pptxData Mining Lecture_1.pptx
Data Mining Lecture_1.pptx
 
WebServices_Grid.ppt
WebServices_Grid.pptWebServices_Grid.ppt
WebServices_Grid.ppt
 
Data sciences and marketing analytics
Data sciences and marketing analyticsData sciences and marketing analytics
Data sciences and marketing analytics
 
SKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSISSKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSIS
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
 

More from Srinath Perera

Book: Software Architecture and Decision-Making
Book: Software Architecture and Decision-MakingBook: Software Architecture and Decision-Making
Book: Software Architecture and Decision-MakingSrinath Perera
 
Data science Applications in the Enterprise
Data science Applications in the EnterpriseData science Applications in the Enterprise
Data science Applications in the EnterpriseSrinath Perera
 
An Introduction to APIs
An Introduction to APIs An Introduction to APIs
An Introduction to APIs Srinath Perera
 
An Introduction to Blockchain for Finance Professionals
An Introduction to Blockchain for Finance ProfessionalsAn Introduction to Blockchain for Finance Professionals
An Introduction to Blockchain for Finance ProfessionalsSrinath Perera
 
AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?Srinath Perera
 
Healthcare + AI: Use cases & Challenges
Healthcare + AI: Use cases & ChallengesHealthcare + AI: Use cases & Challenges
Healthcare + AI: Use cases & ChallengesSrinath Perera
 
How would AI shape Future Integrations?
How would AI shape Future Integrations?How would AI shape Future Integrations?
How would AI shape Future Integrations?Srinath Perera
 
The Role of Blockchain in Future Integrations
The Role of Blockchain in Future IntegrationsThe Role of Blockchain in Future Integrations
The Role of Blockchain in Future IntegrationsSrinath Perera
 
Blockchain: Where are we? Where are we going?
Blockchain: Where are we? Where are we going? Blockchain: Where are we? Where are we going?
Blockchain: Where are we? Where are we going? Srinath Perera
 
Few thoughts about Future of Blockchain
Few thoughts about Future of BlockchainFew thoughts about Future of Blockchain
Few thoughts about Future of BlockchainSrinath Perera
 
A Visual Canvas for Judging New Technologies
A Visual Canvas for Judging New TechnologiesA Visual Canvas for Judging New Technologies
A Visual Canvas for Judging New TechnologiesSrinath Perera
 
Privacy in Bigdata Era
Privacy in Bigdata  EraPrivacy in Bigdata  Era
Privacy in Bigdata EraSrinath Perera
 
Blockchain, Impact, Challenges, and Risks
Blockchain, Impact, Challenges, and RisksBlockchain, Impact, Challenges, and Risks
Blockchain, Impact, Challenges, and RisksSrinath Perera
 
Today's Technology and Emerging Technology Landscape
Today's Technology and Emerging Technology LandscapeToday's Technology and Emerging Technology Landscape
Today's Technology and Emerging Technology LandscapeSrinath Perera
 
An Emerging Technologies Timeline
An Emerging Technologies TimelineAn Emerging Technologies Timeline
An Emerging Technologies TimelineSrinath Perera
 
The Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming ApplicationsThe Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming ApplicationsSrinath Perera
 
Analytics and AI: The Good, the Bad and the Ugly
Analytics and AI: The Good, the Bad and the UglyAnalytics and AI: The Good, the Bad and the Ugly
Analytics and AI: The Good, the Bad and the UglySrinath Perera
 
Transforming a Business Through Analytics
Transforming a Business Through AnalyticsTransforming a Business Through Analytics
Transforming a Business Through AnalyticsSrinath Perera
 
SoC Keynote:The State of the Art in Integration Technology
SoC Keynote:The State of the Art in Integration TechnologySoC Keynote:The State of the Art in Integration Technology
SoC Keynote:The State of the Art in Integration TechnologySrinath Perera
 

More from Srinath Perera (20)

Book: Software Architecture and Decision-Making
Book: Software Architecture and Decision-MakingBook: Software Architecture and Decision-Making
Book: Software Architecture and Decision-Making
 
Data science Applications in the Enterprise
Data science Applications in the EnterpriseData science Applications in the Enterprise
Data science Applications in the Enterprise
 
An Introduction to APIs
An Introduction to APIs An Introduction to APIs
An Introduction to APIs
 
An Introduction to Blockchain for Finance Professionals
An Introduction to Blockchain for Finance ProfessionalsAn Introduction to Blockchain for Finance Professionals
An Introduction to Blockchain for Finance Professionals
 
AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?AI in the Real World: Challenges, and Risks and how to handle them?
AI in the Real World: Challenges, and Risks and how to handle them?
 
Healthcare + AI: Use cases & Challenges
Healthcare + AI: Use cases & ChallengesHealthcare + AI: Use cases & Challenges
Healthcare + AI: Use cases & Challenges
 
How would AI shape Future Integrations?
How would AI shape Future Integrations?How would AI shape Future Integrations?
How would AI shape Future Integrations?
 
The Role of Blockchain in Future Integrations
The Role of Blockchain in Future IntegrationsThe Role of Blockchain in Future Integrations
The Role of Blockchain in Future Integrations
 
Future of Serverless
Future of ServerlessFuture of Serverless
Future of Serverless
 
Blockchain: Where are we? Where are we going?
Blockchain: Where are we? Where are we going? Blockchain: Where are we? Where are we going?
Blockchain: Where are we? Where are we going?
 
Few thoughts about Future of Blockchain
Few thoughts about Future of BlockchainFew thoughts about Future of Blockchain
Few thoughts about Future of Blockchain
 
A Visual Canvas for Judging New Technologies
A Visual Canvas for Judging New TechnologiesA Visual Canvas for Judging New Technologies
A Visual Canvas for Judging New Technologies
 
Privacy in Bigdata Era
Privacy in Bigdata  EraPrivacy in Bigdata  Era
Privacy in Bigdata Era
 
Blockchain, Impact, Challenges, and Risks
Blockchain, Impact, Challenges, and RisksBlockchain, Impact, Challenges, and Risks
Blockchain, Impact, Challenges, and Risks
 
Today's Technology and Emerging Technology Landscape
Today's Technology and Emerging Technology LandscapeToday's Technology and Emerging Technology Landscape
Today's Technology and Emerging Technology Landscape
 
An Emerging Technologies Timeline
An Emerging Technologies TimelineAn Emerging Technologies Timeline
An Emerging Technologies Timeline
 
The Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming ApplicationsThe Rise of Streaming SQL and Evolution of Streaming Applications
The Rise of Streaming SQL and Evolution of Streaming Applications
 
Analytics and AI: The Good, the Bad and the Ugly
Analytics and AI: The Good, the Bad and the UglyAnalytics and AI: The Good, the Bad and the Ugly
Analytics and AI: The Good, the Bad and the Ugly
 
Transforming a Business Through Analytics
Transforming a Business Through AnalyticsTransforming a Business Through Analytics
Transforming a Business Through Analytics
 
SoC Keynote:The State of the Art in Integration Technology
SoC Keynote:The State of the Art in Integration TechnologySoC Keynote:The State of the Art in Integration Technology
SoC Keynote:The State of the Art in Integration Technology
 

Recently uploaded

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 

Recently uploaded (20)

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 

Introduction to Big Data

  • 1.
  • 2. Outline• Forecasts and Why Big data? • What is Big Data? • Parts of the Puzzle • Quick Tutorial • Conclusion Photo by John Trainoron Flickr http://www.flickr.com/photos/trainor/2902023575/, Licensed under CC
  • 3. Asimov Foundation • 20 million worlds • A Scientist who mathematically calculate the fate of the Universe • His effort to change that Fate • A Beautiful story
  • 4. Consider a day in your life • What is the best road to take? • Would there be any bad weather? • What is the best way to invest the money? • Should I take that loan? • Can I optimize my day? • Is there a way to do this faster? • What have others done in similar cases? • Which product should I buy?
  • 5. People wanted to (through ages) • To know (what happened?) • To Explain (why it happened) • To Predict (what will happen?)
  • 6. Many Cultures had Claim for Omniscience • Oracle • Astrology • Book of Changes • Tarot Cards • Crystal balls Claim for Omniscience
  • 7. To know, explain and predict!! • Grand challenge of our time • We have been trying to do this in many other means • Now we trying to do this via science Any sufficiently advanced technology is indistinguishable from magic. ---Arthur C. Clarke. • We see a possibilities though “lot of data”
  • 8. You does not seem to be convinced!! • Why, sometimes I've believed as many as six impossible things before breakfast.
  • 9. Prophecies of our time • We can predict Weather • We do understand language translation pretty well • We can predict how an air plane behave good enough to fly • Our ability on forensics • We understand remedies for many diseases • We can tell how astro bodies will behave (e.g. comets)
  • 10. This is being Done: Weather • Remarkably accurate for few days – SL incident • Data :- weather radars, satellite data, weather balloons, planes etc., • forecast models – Simulation – Numerical method • Challenge is computing power – algorithms that take more than 24 hours – resolution is the key ……
  • 11. Democratizing Analysis • Forecasting was done, but in limited manner and only by few (e.g. National Labs, Intelligence community) • That is changing!!
  • 12. What is Big data? • There is lot of data available – E.g. Internet of things • We have computing power • We have technology • Goal is same – To know – To Explain – To predict • Challenge is the full lifecycle
  • 13. Data, the wealth of our time "Data is a precious thing because they last longer than systems” - Tim Barnes Lee • Access to data is becoming ultimate competitive advantage – E.g. Google+ vs. Facebook – Why many organizations try hard to give us free things and keep us always logged in (e.g. Gmail, facebook, search engine tool bars)
  • 15. Data Avalanche/ Moore’s law of data • We are now collecting and converting large amount of data to digital forms • 90% of the data in the world today was created within the past two years. • Amount of data we have doubles very fast
  • 16. In real life, most data are Big • Web does millions of activities per second, and so much server logs are created. • Social networks e.g. Facebook, 800 Million active users, 40 billion photos from its user base. • There are >4 billion phones and >25% are smart phones. There are billions of RFID tags. • Observational and Sensor data – Weather Radars, Balloons – Environmental Sensors – Telescopes – Complex physics simulations
  • 17. Why Big Data is hard? • How store? Assuming 1TB bytes it takes 1000 computers to store a 1PB • How to move? Assuming 10Gb network, it takes 2 hours to copy 1TB, or 83 days to copy a 1PB • How to search? Assuming each record is 1KB and one machine can process 1000 records per sec, it needs 277CPU days to process a 1TB and 785 CPU years to process a 1 PB • How to process? – How to convert algorithms to work in large size – How to create new algorithms http://www.susanica.com/photo/9
  • 18. Why it is hard (Contd.)? • System build of many computers • That handles lots of data • Running complex logic • This pushes us to frontier of Distributed Systems and Databases • More data does not mean there is a simple model • Some models can be complex as the system http://www.flickr.com/photos/mariachily/5250487136, Licensed CC
  • 20. Sensors • There are sensors everywhere – RFID (e.g. Walmart), GPS sensors, Mobile Phone … • Internet – Click streams, Emails, chat, search, tweets ,Transactions … • Real Word – Video surveillance, Cash flows, Traffic, Surveillance, Stock exchange, Smart Grid, Production line … – Internet of Things http://www.flickr.com/photos/imuttoo/4257813689/ by Ian Muttoo, http://www.flickr.com/photos/eastcapital/4554220770/, http://www.flickr.com/photos/patdavid/4619331472/ by Pat David copyright CC
  • 21. Collecting Data • Data collected at sensors and sent to big data system via events or flat files • Event Streams: we name the events by its content/ originator • Get data through – Point to Point – Event Bus • E.g. Data bridge – a thrift based transport we did that do about 400k events/ sec
  • 22. Storing Data • Historically we used databases – Scale is a challenge: replication, sharding • Scalable options – NoSQL (Cassandra, Hbase) [If data is structured] • Column families Gaining Ground – Distributed file systems (e.g. HDFS) [If data is unstructured] • New SQL – In Memory computing, VoltDB • Specialized data structures – Graph Databases, Data structure servers http://www.flickr.com/photos/keso/3631339 67/
  • 23. Making Sense of Data • To know (what happened?) – Basic analytics + visualizations (min, max, average, histogram, distributions … ) – Interactive drill down • To explain (why) – Data mining, classifications, building models, clustering • To forecast – Neural networks, decision models
  • 24. To know (what happened?) • Mainly Analytics – Min, Max, average, correlation, histograms – Might join group data in many ways • Implemented with MapReduce or Queries • Data is often presented with some visualizations • Examples – forensics – Assessments – Historical data/ reports/ trends http://www.flickr.com/photos/isriya/2967310 333/
  • 25. Search • Process and Index the data. The killer app of our time. http://pixabay.com, by Steven Iodice • Web Search • Graph Search • Semantic Search • Drug Discovery • …
  • 26. Patterns • Correlation – Scatter plot, statistical correlation • Data Mining (Detecting Patterns) – Clustering and classification – Finding Similar items – Finding Hubs and authorities in a Graph – Finding frequent item sets – Making recommendation • Apache Mahout http://www.flickr.com/photos/eriwst/2987739376/ and http://www.flickr.com/photos/focx/5035444779/
  • 27. Forecasts and Models • Trying to build a model for the data • Theoretically or empirically – Analytical models (e.g. Physics) – Neural networks – Reinforcement learning – Unsupervised learning (clustering, dimensionality reduction, kernel methods) • Examples – Translation – Weather Forecast models – Building profiles of users – Traffic models – Economic models • Remember: Correlation does not mean causality http://misterbijou.blogspot.com/201 0_09_01_archive.html
  • 28. Information Visualization • Presenting information – To end user – To decision takers – To scientist • Interactive exploration • Sending alerts http://www.flickr.com/photos/stevefae embra/3604686097/
  • 30. Show Han Rosling’s Data • This use a tool called “Gapminder”
  • 31. Usecase 1: Targeted Marketing • Collect data and build a model on – What user like – What he has brought – What is his buying power • Making recommendations • Giving personalized deals
  • 32. Usecase 2: Travel • Collect traffic data + transportation data from sensors • Build a model (e.g. Car Following Models) • Predict the traffic .. Possibilities on congestion • E.g. divert traffic, adjust troll, adjust traffic lights
  • 33. Practical Tutorial • Data collection – Pub/sub, event architectures • Processing – Store and Process: MapReduce/Hadoop – Processing Moving Data: CEP • Visualization – GNU Plot/ R
  • 34. DEBS Challenge • Event Processing challenge • Real football game, sensors in player shoes + ball • Events in 15k Hz • Event format – Sensor ID, TS, x, y, z, v, a • Queries – Running Stats – Ball Possession – Heat Map of Activity – Shots at Goal
  • 35. MapReduce/ Hadoop • First introduced by Google, and used as the processing model for their architecture • Implemented by opensource projects like Apache Hadoop and Spark • Users writes two functions: map and reduce • The framework handles the details like distributed processing, fault tolerance, load balancing etc. • Widely used, and the one of the catalyst of Big data void map(ctx, k, v){ tokens = v.split(); for t in tokens ctx.emit(t,1) } void reduce(ctx, k, values[]){ count = 0; for v in values count = count + v; ctx.emit(k,count); }
  • 38. Histogram of Speeds(Contd.) void map(ctx, k, v){ event = parse(v); range = calcuateRange(event.v); ctx.emit(range,1) } void reduce(ctx, k, values[]){ count = 0; for v in values count = count + v; ctx.emit(k,count); }
  • 39. How long player Spend Near the ball?
  • 40. How long player Spend Near the ball?(Contd.) void map(ctx, k, v){ event = parse(v); cell = calcuateOverlappingCells(v.x, v.y, v.z); ctx.emit(cell,v.x, v.y, v.z); } Void reduce(ctx, k, values[]){ players = joinAndFindPlayersNearBall(values); for p in players ctx.emit(p,p); }
  • 41. Hadoop Landscape • HIVE - Query data using SQL style queries, and Hive will convert them to MapReduce jobs and run in Hadoop. • Pig - We write programs using data flow style scripts, and Pig convert them to MapReduce jobs and run in Hadoop. • Mahout – Collection of MapReduce jobs implementing many Data mining and Artificial Intelligence algorithms using MapReduce A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int); B = FILTER A BY gni > 2000; C = ORDER B BY gni; dump C; hive> SELECT country, gni from HDI WHERE gni > 2000;
  • 42. Is Hadoop Enough? • Limitations – Takes time for processing – Lack of Incremental processing – Weak with Graph usecases – Not very easy to create a processing pipeline (addressed with HIVE, Pig etc.) – Too close to programmers – Faster implementations are possible • Alternatives – Apache Drill http://incubator.apache.org/drill/ – Spark http://spark-project.org/ – Graph Processors like http://giraph.apache.org/
  • 43. Data In the Move • Idea is to process data as they are received in streaming fashion • Used when we need – Very fast output – Lots of events (few 100k to millions) – Processing without storing (e.g. too much data) • Two main technologies – Stream Processing (e.g. Strom, http://storm-project.net/ ) – Complex Event Processing (CEP) http://wso2.com/products/complex -event-processor/
  • 44. Complex Event Processing (CEP) • Sees inputs as Event streams and queried with SQL like language • Supports Filters, Windows, Join, Patterns and Sequences from p=PINChangeEvents#win.time(3600) join t=TransactionEvents[p.custid=custid][amount>1000 0] #win.time(3600) return t.custid, t.amount;
  • 45. Example: Detect ball Possession • Possession is time a player hit the ball until someone else hits it or it goes out of the ground from Ball#window.length(1) as b join Players#window.length(1) as p unidirectional on debs: getDistance(b.x,b.y,b.z, p.x, p.y, p.z) < 1000 and b.a > 55 select ... insert into hitStream from old = hitStream , b = hitStream [old. pid != pid ], n= hitStream[b.pid == pid]*, ( e1 = hitStream[b.pid != pid ] or e2= ballLeavingHitStream) select ... insert into BallPossessionStream http://www.flickr.com/photos/glennharper/146164820/
  • 46. GNU Plot set terminal png set output "buyfreq.png" set title "Frequency Distribution of Items brought by Buyer"; setylabel "Number of Items Brought"; setxlabel "Buyers Sorted by Items count"; set key left top set log y set log x plot "1.data" using 2 title "Frequency" with linespoints • Open source • Very powerful
  • 47. Big data lifecycle • Realizing the big data lifecycle is hard • Need wide understanding about many fields • Big data teams will include members from many fields working together
  • 48. • Data Scientists: A new role, evolving Job description • His Job is to make sense of data, by using Big data processing, advance algorithms, Statistical methods, and data mining etc. • Likely to be a highest paid/ cool Jobs. • Big organizations (Google, Facebook, Banks) hiring Math PhD by a lot for this. Data Scientist Statisticians System experts Domain Experts DB experts AI experts
  • 49. Dark Side • Privacy – Invasion of privacy – Data might be used for unexpected things • Big Brother – Data likely to used for control (e.g. governments) • If technology is out there, may be it is OK. It is very hard to hide any thing, which work both ways
  • 50. Challenges • Speed (e.g. targeted advertising, reacting to data) • Extracting semantics and handling multiple representations and formats • Security Data ownership, delegation, permissions, and Privacy • Making data accessible to all intended parties, from anywhere, anytime, from any device, through any format (subjected to permissions). • Map-Reduce good enough? What about other parallel problems? • Handling Uncertainty
  • 51. Conclusions • Lot of data, realizing data -> insight -> predications • We do lot of predications even now. • There is lot between data and forecasts, OK to do them. – Analytics – Visualizations – Patterns – Data mining • If you looking to start, learn MapReduce, CEP, and GNU plot. • Visualization is the key!! • Learn some distributed systems and AI