1. Importance of ‘Centralized Event Collection’ and a BigData Platform for Analysis!
DevOpsDays India, Bangalore - 2013
~/Piyush
Manager, Website Operations at MakeMyTrip
2. What to expect:
MakeMyTrip data challenges!
Event Data a.k.a. Logs & Log Analysis
Why Centralized Logging …for systems and applications!
Capturing Events: why structured data emitted from apps for machines is a better approach!
Data Service Platform: DSP – why?
Inputs: data for the DSP
Top Architecture Considerations
Top-level key tasks
Tools Arsenal, API Management, and Service Cloud
DevOpsDays India 2013 : ~/Piyush
3. MakeMyTrip data challenges…!
• Multi-DC/colocation setup
• Different types of data sources: internal/external (structured, semi-structured, unstructured)
  – Online Transaction Data Store
  – ERP
  – CRM
  – Email Behavior / Survey results
  – Web Analytics
  – Logs
    • Web
    • Application
    • User Activity logs
  – Social Media
  – Inventory / Catalog
  – Data residing in Excel files
• Monitoring metric data:
  – Graphite (time-series Whisper)
  – Splunk, ElasticSearch (Logstash)
• Many other different sources
• Storing and analyzing huge event data!
4. Some challenges…!
• Aggregate web usage data and transactional data to generate one view
• Process multiple GBs to TBs of data every day
• Serve more than a million data-services API requests per day
• Ensure business continuity as reliance on MyDSP increases more and more
• Store terabytes of historical data
• Mesh transactional (online and offline) data with consumer behavior and derive analytics
• Build a flexible data-ingestion platform to manage many data feeds from multiple data sources
5. Flow of an Event
6. Event Data a.k.a. Logs
• Event Data -> a set of chronologically sequenced data records that capture information about an event!
• Virtually every form of system produces event data.
  – Capture it from all components, and from both client-side and server-side events!
• Logs can be thought of as the footprint generated by any activity in the system/app.
• Event data has different characteristics from data stored in traditional data warehouses:
  – Huge volume: event data accumulates rapidly and often must be stored for years; many organizations are managing hundreds of terabytes, and some are managing petabytes.
  – Format: because of the huge variety of sources, event data is unstructured and semi-structured.
  – Velocity: new event data is constantly coming in.
  – Collection: event data is difficult to collect because of broadly dispersed systems and networks.
  – Time-stamped: event data is always inserted once with a timestamp. It never changes.
7. Log Analysis
• Logs are one of the most useful things when it comes to analysis; in simple terms, log analysis is making sense out of system/app-generated log messages (or just LOGS). Through logs we get insight into what is happening in the system.
• Helps root-cause analysis after any incident.
• Personalize the user experience by analyzing web-usage data.
“Security Req”:
• Traditionally there are some compliance requirements too: Log Management / SEM + SIM => SIEM.
• For data security: have one centralized platform to collect ALL events (logs), correlate them, and have real-time, intelligent visibility.
• Monitor not just the network, OS, devices, etc., but ALL applications and business processes too.
8. Why Centralized Logging …for systems and applications!
• The need for centralized logging is quite important nowadays due to:
  – growth in the number of applications,
  – distributed architecture (Service-Oriented Architecture),
  – cloud-based apps,
  – the number of machines and the infrastructure size increasing day by day.
• This means that centralized logging and the ability to spot errors in distributed systems & applications has become even more “valuable” & “needed”.
And most importantly:
  – be able to understand the customers and how they interact with websites;
  – understanding change: whether using A/B or multivariate experiments, or tweaking / understanding new implementations.
9. Capturing Events: Why structured data emitted from apps for machines is a better approach!
• Need for standardization:
  – Developers assume that the first-level consumer of a log message is a human, and only they know what information is needed to debug an issue.
  Logs are not just for humans!
  The primary consumers of logs are shifting from humans to computers. This means log formats should have a well-defined structure that can be parsed easily and robustly.
  Logs change!
  If the logs never changed, writing a custom parser might not be too terrible. The engineer would write it once and be done. But in reality, logs change.
  Every time you add a feature, you start logging more data, and as you add more data, the printf-style format inevitably changes. This implies that the custom parser has to be updated constantly, consuming valuable development time.
• Suggested approach: “Logging in JSON format”
  – To keep it simple and generic for any application, the recommended approach is to log {key: value} pairs in JSON format (structured/semi-structured).
  – This approach allows easy parsing and consumption, irrespective of whatever technology/tools we choose to use!
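As a sketch of the “logging in JSON format” approach, the following Python snippet uses only the standard library; the logger name, the `fields` extra, and field names such as `short_message` are illustrative assumptions, not MakeMyTrip’s actual schema:

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single machine-parseable JSON line."""

    def format(self, record):
        event = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S",
                                       time.gmtime(record.created)),
            "severity": record.levelname,
            "facility": record.name,
            "short_message": record.getMessage(),
        }
        # Merge any structured {key: value} fields passed via `extra=`.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


logger = logging.getLogger("booking")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One JSON line per event; downstream consumers parse it robustly.
logger.info("search completed", extra={"fields": {"pagename": "funnel:page1", "RT": 2}})
```

Because every record is one JSON object, any collector (Logstash, Flume, etc.) can parse it without a custom printf-style parser.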
10. Key things to keep in mind / Rules
• Use timestamps for every event.
• Use unique identifiers (IDs) like Transaction ID / User ID / Session ID, or append a universally unique identifier (UUID) to track unique users.
• Log in text format; avoid logging binary information!
• Log anything that can add value when aggregated, charted, or further analyzed.
• Use categories, e.g. “severity”: INFO, WARN, ERROR, and DEBUG.
• The 80/20 rule: 80% of our goals can be achieved with 20% of the work, so don’t log too much.
• NTP-sync the same date/time and timezone on every producer and collector machine (# ntpdate ntp.example.com).
• Reliability: like video recordings, you don’t want to lose the most valuable shot, so you record every frame and, later during analysis, throw away the rest, picking your best shot/frame. Here too, logs as events should be recorded with proper reliability, so that you don’t lose any important and usable part.
• Define correlation rules across the various event streams to generate and minimize alerts/events.
• Write connectors for integrations.
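The rules above can be condensed into a minimal event builder. This Python sketch is illustrative only; the field names are assumptions chosen to match the JSON example later in the deck, not a prescribed schema:

```python
import json
import time
import uuid


def make_event(severity, short_message, **fields):
    """Build one event that follows the rules above: timestamped,
    UUID-tagged, categorized, text-format, JSON-serializable."""
    event = {
        # Assumes NTP-synced clocks on every producer machine.
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()),
        "uuid": str(uuid.uuid4()),      # unique identifier for correlation
        "severity": severity,           # one of INFO / WARN / ERROR / DEBUG
        "short_message": short_message,
    }
    event.update(fields)                # extra {key: value} pairs, no binary data
    return json.dumps(event)
```

Every event carries the same core keys, so aggregation and correlation downstream never need a custom parser.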
11. Data Service Platform: DSP
Why do we need a data services platform?
– Integration layer to bring data from more sources in less time
– Serves various components: applications, and also monitoring systems, etc.
12. Inputs: Data – what data to include
• Clickstream / Web Usage Data
– User Activity Logs
• Transactional Data Store
• Off-line
– CRM
– Email Behavior -> Logs/ Events
13. Top Architecture Considerations
• Non-blocking data ingestion
• UUID-tagged events/messages
• Load-balanced data processing across data centers
• Use of memory-based data storage for real-time data systems
• Easily scalable, highly available (HA), and easy-to-maintain large historical data sets
• Data caching to achieve low latency
• To ensure business continuity, parallel processing between two different data centers
• Use of a centralized service cloud for API management, security (authentication, authorization), metering, and integration
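To illustrate the “memory-based storage / data caching for low latency” considerations, here is a minimal in-memory cache with per-entry expiry. This is a toy sketch of the idea, not a substitute for the Redis or Couchbase stores listed in the tools arsenal:

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry time-to-live (TTL)."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def put(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazily evict the stale entry
            return default
        return value
```

Hot lookups (e.g. precomputed views behind a data-services API) are served from memory; stale entries fall through to the backing store.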
14. Top-level key tasks for User Activity Logging & Analysis
1. Data collection of both client-side and server-side user activity streams
   • Tag every website visitor with a UUID, similar to the system UUIDs
   • Collect the activity streams on the BigData platform for analysis, through Kafka queues & NoSQL data stores
2. Near-real-time data processing
   • Preprocessing / aggregations
   • Filtering, etc.
   • Pattern discovery, along with the already-available cooked data from point 4
     • Clustering / classification / association discovery / sequence mining
3. Rule engine / recommendation algorithms
   • Rule engine: building an effective business rule engine / correlating events
   • Content-based filtering / collaborative filtering
4. Batch processing / post-processing using the Hadoop ecosystem
   • Analyzing & storing cooked data in a NoSQL data store
5. Data services (web services)
   • RESTful APIs to make the data/insights consumable through various data services
6. Reporting/search interface & visualization for product development teams and other business owners
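As a sketch of step 2 (near-real-time processing), this hypothetical Python function filters client-side events and aggregates page views; the `facility` and `pagename` fields mirror the JSON example on the clickstream slide, and the function itself is an illustration, not the production Storm topology:

```python
from collections import Counter


def aggregate_pageviews(events, min_count=1):
    """Filter client-side events and count views per page name,
    dropping pages below a minimum-count threshold."""
    counts = Counter(
        e["pagename"] for e in events
        if e.get("facility") == "clientSide" and "pagename" in e
    )
    return {page: n for page, n in counts.items() if n >= min_count}
```

In the real pipeline the same filter/aggregate step would run continuously over the Kafka stream rather than over an in-memory list.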
15. Data System
Let’s store everything!
Query = function(data)
Layered architecture:
• Every event is data!
• Precomputed views
• Batch layer: Hadoop M/R
• Speed layer: Storm NRT computation
• Serving layer
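The “Query = function(data)” idea across the layers can be sketched as a merge of two precomputed views; this is a generic lambda-architecture illustration, not the actual Hadoop/Storm implementation:

```python
def query(batch_view, realtime_view):
    """Answer a query by combining the batch layer's precomputed view
    with the speed layer's incremental counts for recent events."""
    merged = dict(batch_view)          # start from the (large, stale) batch view
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count  # add NRT deltas on top
    return merged
```

The serving layer answers every query this way: the batch view covers all historical events, and the speed layer fills the gap since the last batch run.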
17. Clickstream / User Activities Capture: Data is -> “Events”
• Tag every website visitor with a UUID using an Apache module - Done
  – https://github.com/piykumar/modified_mod_cookietrack
  – Cookie: UUID like 24617072-3124-674f-4b72-675746562434.1381297617597249
• JSON messages like:
{
"timestamp": "2012-12-14T02:30:18",
"facility": "clientSide",
"clientip": "123.123.123.123",
"uuid": "24617072-3124-5544-2f61-695256432432.1379399183414528",
"domain": "www.example.com",
"server": "abc-123",
"request": "/page/request",
"pagename": "funnel:example com:page1",
"searchKey": "1234567890_",
"sessionID": "11111111111111",
"event1": "loading",
"event2": "interstitial display banner",
"severity": "WARN",
"short_message": "....meaning short message for aggregation...",
"full_message": "full LOG message",
"userAgent": "...blah...blah..blah...",
"RT": 2
}
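Messages like the one above can be grouped into per-visitor activity streams. This sketch assumes one JSON event per line and sortable ISO-8601 timestamps, as in the example; it is an illustration of the sessionization idea, not the deck’s actual processing code:

```python
import json
from collections import defaultdict


def sessionize(json_lines):
    """Group raw JSON event lines into per-visitor activity streams,
    keyed by the cookie UUID and ordered by timestamp."""
    sessions = defaultdict(list)
    for line in json_lines:
        event = json.loads(line)
        sessions[event["uuid"]].append(event)
    for stream in sessions.values():
        # ISO-8601 timestamps sort correctly as strings.
        stream.sort(key=lambda e: e["timestamp"])
    return sessions
```

Each resulting stream is one visitor’s chronological clickstream, ready for pattern discovery or funnel analysis.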
18. Tools Arsenal
• ETL: Talend
• BI: SpagoBI & QlikView
• Hadoop: Hortonworks
• NRT computation: Twitter Storm
• Document-oriented NoSQL DB: Couchbase
• Distributed search: ElasticSearch
• Log collection: Flume, Logstash, Syslog-NG
• Distributed messaging systems: Kafka, RabbitMQ
• NoSQL: Cassandra, Redis, Neo4j (graph)
• API management: WSO2 API Manager, 3scale / Nginx
• Programming languages: Java, Python, R
19. API Management and Data Services Cloud
• 3scale / Nginx, WSO2 API Manager, etc.
  – A centralized, distributed repository to serve APIs, providing throttling, metering, security features, etc.
• Make building a data-services layer part of the culture, and make sure whatever components you create can be chained into the pipeline or called independently.
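The throttling an API manager applies per consumer can be sketched as a token bucket. This is a generic illustration of the technique, not how 3scale or WSO2 actually implement it:

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: allow short bursts, then
    throttle requests to a sustained rate."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec        # sustained refill rate
        self.capacity = burst           # maximum burst size
        self.tokens = float(burst)      # bucket starts full
        self.last = time.monotonic()

    def allow(self):
        """Return True if this request is within the rate limit."""
        now = time.monotonic()
        # Refill tokens proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A gateway would keep one bucket per API key, rejecting (or queueing) calls when `allow()` returns False.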
20. Thanks!
Questions, if any!
~/Piyush
@piykumar
http://piyush.me