Large-scale data processing analyzes and makes sense of large amounts of data. Spanning many fields, it brings together technologies such as distributed systems, machine learning, statistics, and the Internet of Things. It is a multi-billion-dollar industry with use cases like targeted advertising, fraud detection, product recommendations, and market surveys. With new technologies like the Internet of Things (IoT), these use cases are expanding to scenarios such as smart cities, smart health, and smart agriculture. Some use cases, like urban planning, can tolerate slow results and are handled in batch mode, while others, like stock markets, need results within milliseconds and are handled in streaming fashion. Predictive analytics lets us learn models from data, often giving us the ability to predict the outcome of our actions.
The WSO2 Data Analytics Platform is a fast and scalable platform used by more than 40 organizations including banks, financial institutions, smart cities, hospitals, media companies, telecom companies, state and federal governments, and high-tech companies. This talk starts with a discussion of large-scale data analysis. We then look at the WSO2 Data Analytics Platform and discuss in detail how to use it to build end-to-end big data applications that combine the power of batch processing, real-time analytics, and predictive technologies.
Introduction to Large Scale Data Analysis with WSO2 Analytics Platform
1. Introduction to
Large Scale Data
Analysis and
WSO2 Analytics
Platform
Srinath Perera
Director Research WSO2, Apache Member
(@srinath_perera)
srinath@wso2.com
At Indiana University Bloomington
2. Who Are We?
We are an open source middleware
company
- We build systems upon which others
build their systems
Venture funded – Intel Capital, Cisco,
Toba Capital
400+ people and offices in Silicon Valley, Sri Lanka, London, and
Bloomington
Customers including Banks, Aircraft Manufacturers, Governments
(State and Federal), Media Companies, Telco, Retail, Healthcare ..
4. A Day in Your Life
Think about a day in your life.
- What is the best road to take?
- Would there be any bad weather?
- How to invest my money?
- How is my health?
There are many decisions that you could make
better if only you could access the data and
process it.
http://www.flickr.com/photos/kcolwell/55124616
CC licence
5.
6. Internet of Things
Currently the physical world and
the software world are detached
The Internet of Things promises to bridge
this gap
- It is about sensors and actuators
everywhere
- In your fridge, in your blanket, in your
chair, in your carpet.. Yes, even in your
socks
- Umbrellas that light up when there is
rain, and smart medicine cups
7. What can We do with Big Data?
Optimize (World is inefficient)
- 30% food wasted farm to plate
- GE Save 1% initiative (http://goo.gl/eYC0QE )
- Trains => 2B/ year
- US healthcare => 20B/ year
Save lives
- Weather, Disease identification, Personalized treatment
Technology advancement
- Most high-tech research is done via simulations
10. (Batch) Analytics
Scientists have been doing this for 25 years with
MPI (1991) on special hardware
- OpenMPI is being developed at IU!
Took off with Google’s MapReduce
paper (2004); Apache Hadoop, Hive, and a
whole ecosystem followed.
It was successful, so we are here!!
But, processing takes time.
11. Usecase: Targeted Advertising
Analytics Implemented with MapReduce or Queries
- Min, Max, average, correlation, histograms, might join or group data in
many ways
- Heatmaps, temporal trends
Key Performance indicators (KPIs)
- E.g. Profit per square foot for retail
12. Usecase: Big Data for development
Done using CDR data
People density noon vs. midnight
(red => increased, blue =>
decreased)
Urban Planning
- People distribution
- Mobility
- Waste Management
- E.g. see http://goo.gl/jPujmM
From: http://lirneasia.net/2014/08/what-does-big-data-say-about-sri-lanka/
13. Value of Some Insights Degrades Fast!
For some usecases (e.g. stock markets, traffic, surveillance, patient
monitoring) the value of insights degrades very quickly with time.
- E.g. stock markets and speed of light
We need technology that can produce
outputs fast
- Static Queries, but need very fast output
(Alerts, Realtime control)
- Dynamic and Interactive Queries ( Data
exploration)
14.
15. Predictive Analytics
If we know how to solve a problem, that is, if we know
a finite set of rules, then we can program it.
For some problems (e.g. driving a car, character
recognition), we do not know a finite, fixed rule set.
Instead of programming, we give lots of examples and
ask the computer to learn (often called Machine
Learning)
Lots of tools
- R (Statistical language)
- scikit-learn (Python)
- Apache Spark’s MLBase and Apache Mahout (Java)
16. Usecase: Predictive Maintenance
Idea is to fix the problem before it
happens, avoiding expensive
downtimes
- Airplanes, turbines, windmills
- Construction Equipment
- Car, Golf carts
How
- Build a model for normal operation
and compare deviation
- Match against known error patterns
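The "build a model for normal operation and compare deviation" idea can be sketched in a few lines of Python. This is only an illustrative sketch, not how any WSO2 product implements it: the model here is simply a mean and standard deviation, and the names `fit_normal_model`, `is_anomalous`, the sample vibration readings, and the 3-sigma threshold are all assumptions made for the example.

```python
from statistics import mean, stdev

def fit_normal_model(readings):
    """Summarise normal operation as (mean, standard deviation)."""
    return mean(readings), stdev(readings)

def is_anomalous(value, model, threshold=3.0):
    """Flag a reading whose z-score against the normal model exceeds threshold."""
    mu, sigma = model
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical vibration readings collected while the machine was healthy.
normal_vibration = [0.98, 1.02, 1.00, 0.99, 1.01, 1.03, 0.97]
model = fit_normal_model(normal_vibration)

print(is_anomalous(1.01, model))  # close to normal operation
print(is_anomalous(1.60, model))  # far from normal operation
```

Real deployments usually need richer models (per-sensor, seasonal, multivariate), but the compare-against-normal structure stays the same.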
17. Problem We Are Trying to
Solve!
Build a platform using which others can
build their analytics systems
- Collect, Analyze, Communicate
- End to end, starts from humans and ends
with humans
Different Audiences
- Technical (Developers)
- Non-technical (CXOs, sales, analysts)
There are two things you need to
know about business: make
something users love and make
more than you spend.
--Paul Graham
( Lisp, Y-combinator)
18.
19. Running Example
Monitor Temperature and hot airflow across multiple buildings (e.g.
central AC)
- More people => hot
Analytics
- Historical behavior of temperature by the hour
- Alerts if the temperature falls too low or rises too high
- Modeling and predicting temperature to adjust proactively
define stream TemperatureStream(ts long, buildingNo long, t double);
define stream AirflowStream(ts long, buildingNo long,
aflow double, aT double);
20. Collect Data
One Sensor API to publish events
- REST, Thrift, Java, JMS, Kafka
- Java clients, JavaScript clients*
First you define streams (think of a stream
as an infinite table in a SQL DB)
Then send events via API
* Challenges ( performance,
guaranteed delivery, scale)
Can send to batch pipeline, Realtime pipeline or both via
configuration!
21. Collecting Data: Example
Java example: create and send events
Events are sent asynchronously
See client given in http://goo.gl/vIJzqc for more info
// Initialize stream
Agent agent = new Agent(agentConfiguration);
publisher = new AsyncDataPublisher("tcp://hostname:7612", .. );
// Define stream
StreamDefinition definition = new StreamDefinition(STREAM_NAME,VERSION);
definition.addPayloadData("sid", STRING);
...
publisher.addStreamDefinition(definition);
...
// Send events
Event event = new Event();
event.setPayloadData(eventData);
publisher.publish(STREAM_NAME, VERSION, event);
22. Batch Analytics: Spark
Two frameworks: Hadoop (http://hadoop.apache.org ) and
Spark (https://spark.apache.org )
- Hadoop is a MapReduce implementation
Spark is faster (30X) and much more flexible.
Spark set a record at Gray Sort (100TB): 3X faster with 10X fewer
machines, http://goo.gl/r5LGvD
For Hadoop and MapReduce resources, Google it.
file = spark.textFile("hdfs://...")
file.flatMap(tsToHourFunction)
.reduceByKey(lambda a, b: a+b)
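To make the flatMap / reduceByKey pattern above concrete without a cluster, here is a plain-Python sketch that mimics the two operators on a small list. It is not Spark code: `flat_map` and `reduce_by_key` are hypothetical stand-ins for the Spark operators, and `ts_to_hour` together with the `ts_millis,buildingNo,t` line format is an assumption about what the slide's `tsToHourFunction` might do.

```python
from datetime import datetime, timezone

def ts_to_hour(line):
    """Hypothetical parser: 'ts_millis,buildingNo,t' -> [(hour_of_day, 1)]."""
    ts_millis, _building, _t = line.split(",")
    hour = datetime.fromtimestamp(int(ts_millis) / 1000, tz=timezone.utc).hour
    return [(hour, 1)]

def flat_map(func, records):
    """Apply func to each record and flatten the resulting lists."""
    for record in records:
        yield from func(record)

def reduce_by_key(func, pairs):
    """Combine all values that share a key using func."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

# Three sample readings: two in hour 0 (UTC), one in hour 1.
lines = ["0,1,21.5", "1800000,1,22.0", "3600000,2,23.0"]
counts = reduce_by_key(lambda a, b: a + b, flat_map(ts_to_hour, lines))
print(counts)
```

In Spark the same two steps run in parallel across partitions, which is why the combining function must be associative.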
23. SQL like Queries: Hive
Apache Hive provides a SQL like data
processing language
Since many people understand SQL, Hive
made large scale data processing
accessible to many
Expressive, short, and sweet.
Defines core operations that cover 90%
of problems
Lets experts dig in when they like! (via
User Defined Functions)
24. Hourly Temperature Average
Hive compiles the SQL-like query to a set of MapReduce jobs running
in Hadoop or Spark (in WSO2 BAM from the 2015 Q2 release)
insert overwrite table TemperatureHistory
select getHour(ts) as hour, avg(t) as avgT, buildingNo
from TemperatureStream group by buildingNo, getHour(ts);
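The grouping that the query performs can be sketched in plain Python. This is only an illustration of the group-by-and-average logic, not what Hive generates: the function name `hourly_average`, the tuple layout `(ts_millis, building_no, t)`, and the use of UTC hours are assumptions for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone

def hourly_average(events):
    """events: iterable of (ts_millis, building_no, t).

    Returns {(building_no, hour_of_day): average temperature}.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for ts, building, t in events:
        hour = datetime.fromtimestamp(ts / 1000, tz=timezone.utc).hour
        entry = sums[(building, hour)]
        entry[0] += t
        entry[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

events = [
    (0, 1, 20.0),          # building 1, hour 0
    (1_800_000, 1, 22.0),  # building 1, hour 0
    (3_600_000, 1, 25.0),  # building 1, hour 1
    (0, 2, 18.0),          # building 2, hour 0
]
history = hourly_average(events)
print(history)
```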
26. Operators: Filters
Assume a temperature stream
Here weather:convertFtoC() is a
user defined function. They are
used to extend the language.
define stream TemperatureStream(ts long, roomNo int, temp double);
from TemperatureStream[weather:convertFtoC(temp) > 30.0
and roomNo != 2043]
select roomNo, temp
insert into HotRoomsStream ;
Usecases:
- Alerts , thresholds (e.g. Alarm on
high temperature)
- Preprocessing: filtering,
transformations (e.g. data cleanup)
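For readers who do not know the query language, the filter above behaves roughly like the following Python generator. It is a sketch of the semantics only: `convert_f_to_c` stands in for the slide's user defined function, and representing events as dictionaries is an assumption made for the example.

```python
def convert_f_to_c(temp_f):
    """Stand-in for the user defined function weather:convertFtoC()."""
    return (temp_f - 32.0) * 5.0 / 9.0

def hot_rooms(events, threshold_c=30.0, excluded_room=2043):
    """Yield (roomNo, temp) projections for events that pass the filter."""
    for event in events:
        if convert_f_to_c(event["temp"]) > threshold_c and event["roomNo"] != excluded_room:
            yield {"roomNo": event["roomNo"], "temp": event["temp"]}

events = [
    {"ts": 1, "roomNo": 101, "temp": 95.0},    # 35 C, passes
    {"ts": 2, "roomNo": 102, "temp": 68.0},    # 20 C, too cool
    {"ts": 3, "roomNo": 2043, "temp": 100.0},  # hot, but excluded room
]
alerts = list(hot_rooms(events))
print(alerts)
```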
27. Operators: Windows and Aggregation
Support many window types
- Batch Windows, Sliding windows, Custom windows
Usecases
- Simple counting (e.g. failure count)
- Counting with Windows ( e.g. failure count every hour)
from TemperatureStream#window.time(1 min)
select roomNo, avg(temp) as avgTemp
group by roomNo
insert into HotRoomsStream ;
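A sliding time window with an average can be sketched in Python as below. It approximates the idea of the query rather than the engine's exact behaviour: the class name `SlidingTimeAverage` is hypothetical, averages are kept per room, and events are expired using event timestamps (in milliseconds) rather than a wall clock, which is an assumption of this sketch.

```python
from collections import defaultdict, deque

class SlidingTimeAverage:
    """Per-room average over the most recent window_ms of events."""

    def __init__(self, window_ms=60_000):
        self.window_ms = window_ms
        self.events = defaultdict(deque)  # roomNo -> deque of (ts, temp)

    def add(self, ts, room_no, temp):
        """Add an event and return the current windowed average for its room."""
        window = self.events[room_no]
        window.append((ts, temp))
        # Drop events that are a full window length (or more) older than the newest one.
        while window and window[0][0] <= ts - self.window_ms:
            window.popleft()
        return sum(t for _, t in window) / len(window)

w = SlidingTimeAverage()
a1 = w.add(0, 1, 20.0)
a2 = w.add(30_000, 1, 30.0)
a3 = w.add(70_000, 1, 40.0)  # the event at ts=0 has expired by now
print(a1, a2, a3)
```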
28. Operators: Patterns
Models a "followed by" relation: e.g.
event A followed by event B
Very powerful tool for tracking
and detecting patterns
from every (a1 = TemperatureStream)
-> a2 = TemperatureStream [temp > a1.temp + 5 ]
within 1 day
select a2.ts as ts, a2.temp - a1.temp as diff
insert into HotDayAlertStream;
Usecases
- Detecting Event Sequence Patterns
- Tracking
- Detect trends
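The followed-by pattern above can be approximated in Python as follows. This is a simplified sketch of the semantics, not the CEP engine's matching algorithm: every event opens a new candidate `a1`, each candidate is closed by the first later event that is more than 5 degrees warmer, and candidates older than one day are dropped. The function name `detect_rises` and the `(ts, temp)` tuple layout are assumptions.

```python
DAY_MS = 24 * 60 * 60 * 1000

def detect_rises(events, min_rise=5.0, within_ms=DAY_MS):
    """events: iterable of (ts, temp) in timestamp order.

    Yields (a2_ts, diff) whenever a later event a2 is more than min_rise
    warmer than a pending earlier event a1 within the time limit.
    """
    pending = []  # a1 candidates still waiting for a matching a2
    for ts, temp in events:
        still_pending = []
        for a1_ts, a1_temp in pending:
            if ts - a1_ts > within_ms:
                continue  # candidate expired without a match
            if temp > a1_temp + min_rise:
                yield ts, temp - a1_temp
            else:
                still_pending.append((a1_ts, a1_temp))
        pending = still_pending
        pending.append((ts, temp))  # "every": each event starts a new candidate

matches = list(detect_rises([(0, 20.0), (1000, 22.0), (2000, 28.0)]))
print(matches)
```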
29. Operators: Joins
Join two data streams based on a condition and windows
Usecases
- Data Correlation, Detect missing events, detecting erroneous data
- Joining event streams
from TemperatureStream[temp > 30.0]#window.time(1 min) as T
join RegulatorStream[isOn == false]#window.length(1) as R
on T.roomNo == R.roomNo
select T.roomNo, R.deviceID, 'start' as action
insert into RegulatorActionStream;
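The join can be pictured as two small buffers that are matched whenever either stream receives an event. The Python below is a rough sketch under that assumption, not the engine's implementation: `RegulatorJoin` is a hypothetical class, the one-minute window is driven by event timestamps in milliseconds, and the length-1 window is modelled as "the most recent regulator event that was off".

```python
from collections import deque

class RegulatorJoin:
    """Join hot readings from the last minute with the latest 'off' regulator event."""

    def __init__(self, window_ms=60_000, hot=30.0):
        self.window_ms = window_ms
        self.hot = hot
        self.hot_temps = deque()  # (ts, roomNo) for readings above the threshold
        self.last_off = None      # (roomNo, deviceID) of the latest isOn == false event

    def _expire(self, now):
        while self.hot_temps and self.hot_temps[0][0] <= now - self.window_ms:
            self.hot_temps.popleft()

    def on_temperature(self, ts, room_no, temp):
        self._expire(ts)
        if temp <= self.hot:
            return []
        self.hot_temps.append((ts, room_no))
        if self.last_off and self.last_off[0] == room_no:
            return [(room_no, self.last_off[1], "start")]
        return []

    def on_regulator(self, ts, room_no, device_id, is_on):
        self._expire(ts)
        if is_on:
            return []
        self.last_off = (room_no, device_id)
        return [(r, device_id, "start") for _, r in self.hot_temps if r == room_no]

j = RegulatorJoin()
r1 = j.on_temperature(0, 7, 35.0)
r2 = j.on_regulator(1000, 7, "ac-7", False)
r3 = j.on_temperature(2000, 7, 36.0)
print(r1, r2, r3)
```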
30. Operators: Access Data from the Disk
Event tables allow users to map a database table to a window and join a
data stream with that window
Usecases
- Merge with data in a database, collect, update data conditionally
define table HistTempTable(day long, avgT double);
from TemperatureStream#window.length(1) join HistTempTable
on getDayOfYear(ts) == HistTempTable.day and temp > HistTempTable.avgT
select ts, temp
insert into PurchaseUserStream ;
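In Python terms, joining a stream against an event table resembles a dictionary lookup per event. The sketch below illustrates that idea only: the table contents, the helper `day_of_year`, and the function `above_history` are assumptions, and a real event table would be backed by a database rather than an in-memory dict.

```python
from datetime import datetime, timezone

# Hypothetical table contents: day of year -> historical average temperature.
hist_temp_table = {1: 18.0, 2: 19.5}

def day_of_year(ts_millis):
    """Day of the year (1-based, UTC) for a millisecond timestamp."""
    return datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc).timetuple().tm_yday

def above_history(events, table):
    """Yield (ts, temp) for events warmer than the stored average for that day."""
    for ts, temp in events:
        avg_t = table.get(day_of_year(ts))
        if avg_t is not None and temp > avg_t:  # inner join: skip days not in the table
            yield ts, temp

DAY_MS = 86_400_000
events = [(0, 20.0), (3_600_000, 17.0), (DAY_MS, 21.0), (2 * DAY_MS, 30.0)]
warm = list(above_history(events, hist_temp_table))
print(warm)
```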
31. Realtime Analytics Patterns
Simple counting (e.g. failure count)
Counting with Windows ( e.g. failure count every hour)
Preprocessing: filtering, transformations (e.g. data cleanup)
Alerts , thresholds (e.g. Alarm on high temperature)
Data Correlation, Detect missing events, detecting erroneous data
(e.g. detecting failed sensors)
Joining event streams (e.g. detect a hit on soccer ball)
Merge with data in a database, collect, update data conditionally
32. Realtime Analytics Patterns (contd.)
Detecting Event Sequence Patterns (e.g. small transaction followed
by large transaction)
Tracking - follow some related entity’s state in space, time etc. (e.g.
location of airline baggage, vehicle, tracking wild life)
Detect trends – Rise, turn, fall, Outliers, Complex trends like triple
bottom etc., (e.g. algorithmic trading, SLA, load balancing)
Learning a Model (e.g. Predictive maintenance)
Predicting next value and corrective actions (e.g. automated car)
33. Predictive Analytics
Build models and use them with
WSO2 CEP, BAM and ESB using the
upcoming WSO2 Machine Learner
product (2015 Q2)
Build models using R, export them as
PMML, and use them within WSO2 CEP
Call R Scripts from CEP queries
Regression and Anomaly Detection
Operators in CEP
34. Predictive Analytics
WSO2 Machine Learner provides
a wizard to explore data and build
models
E.g. Build a model to predict next 15
minutes temperature
- Trivial Option : (historical mean
+last 15m mean)/2
- Better model via ARIMA from time
series analysis
To know more, take an ML class
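The trivial option from this slide, (historical mean + last 15m mean)/2, is easy to write down. The Python below is a minimal sketch of that formula only; the function name `predict_next_15m` and the sample readings are made up for illustration, and the ARIMA alternative is not shown.

```python
from statistics import mean

def predict_next_15m(historical_for_slot, last_15m):
    """Average of the historical mean for this time slot and the mean of the last 15 minutes."""
    return (mean(historical_for_slot) + mean(last_15m)) / 2

# Hypothetical readings: past days at this time of day, and the most recent 15 minutes.
historical_for_slot = [22.0, 24.0]
last_15m = [25.0, 27.0]
prediction = predict_next_15m(historical_for_slot, last_15m)
print(prediction)
```

The baseline is useful mainly as a yardstick: a more sophisticated model should at least beat it on held-out data.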
35. Communicate:
Dashboards
Idea is to give the “Overall idea” at a glance
(e.g. car dashboard)
Support for personalization: you can build
your own dashboard.
Also the entry point for drill down
How to build?
- Dashboard via Google Gadget and content
via HTML5 + JavaScript
- Use WSO2 User Engagement Server to
build a dashboard. (or a JSP or PHP)
- Use charting libraries like Vega or D3
37. Communicate: Alerts
Detecting conditions can be done via
CEP Queries
Key is the “Last Mile”
- Email
- SMS
- Push notifications to a UI
- Pager
- Trigger physical Alarm
How?
- Select the Email sender “Output Adaptor” from CEP, or send from CEP to ESB, which has lots of
connectors
38. Communicate: APIs
With mobile apps, most data is
exposed and shared as APIs
(REST/JSON) to end users.
Following are some challenges
- Security and Permissions
- API Discovery
- Billing, throttling, quota
- SLA enforcement
How?
- Write data to a database from CEP event tables
- Build Services via WSO2 Data Service
- Expose them as APIs via API Manager
39. Smart Home
2015 yearly DEBS (Distributed Event Based Systems)
DEBS Grand Challenge (http://goo.gl/0htxlj)
Smart Home electricity data: 2000 sensors, 40 houses,
4 Billion events
We posted 400K events/sec, and close to one million events/sec in
distributed throughput with 4 nodes.
WSO2 CEP based solution is one of the four finalists
(with Dresden University of Technology, Fraunhofer
Institute, and Imperial College London)
Only generic solution to become a finalist
40. Case Study: Realtime Soccer Analysis
Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM
43. Conclusion
Goal: Build a platform using
which others can build their
analytics systems
- End to end, starts from humans
and ends with humans
The whole platform is open source
under the Apache License
What can you do with the
platform?
- Solve hard problems, build Great
Apps with the platform
- Add and contribute extensions to
the platform (e.g. GSoC
http://goo.gl/QNFP6Y )
- Fix problems ( Patches)
Find us at architecture@wso2.org list or Stackoverflow (tag
wso2)