This document provides an overview of steps to build an agile analytics application, beginning with raw event data and ending with a web application to explore and visualize that data. The steps include:
1) Serializing raw event data (emails, logs, etc.) into a document format like Avro or JSON
2) Loading the serialized data into Pig for exploration and transformation
3) Publishing the data to a "database" like MongoDB
4) Building a web interface with tools like Sinatra, Bootstrap, and JavaScript to display and link individual records
The overall approach emphasizes rapid iteration, with the goal of creating an application that allows continuous discovery of insights from the source data.
2. About Me… Bearding.
• Bearding is my #1 natural talent.
• I'm going to beat this guy.
• Seriously.
• Salty Sea Beard
• Fortified with Pacific Ocean Minerals
3. Agile Data: The Book
(August, 2013)
A philosophy. Not the only way, but it's a really good way!
4. We Go Fast, But Don't Worry!
• Download the slides - click the links - read examples!
• If it's not on the blog, it's in the book!
• Order now: http://shop.oreilly.com/product/0636920025054.do
• Read Now @ Safari Rough Cuts
7. Scientific Computing / HPC
Tubes and Mercury (Old School) vs. Cores and Spindles (New School)
UNIVAC and Deep Blue both fill a warehouse. We're back!
"Smart Kid" Only: MPI, Globus, etc. Until Hadoop.
9. Data Center as Computer
Warehouse Scale Computers and Applications
"A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner." Click here for a paper on operating a "data center as computer."
10. Hadoop to the Rescue!
• Easy to use (Pig, Hive, Cascading)
• CHEAP: 1% the cost of SAN/NAS
• A department can afford its own Hadoop cluster!
• Dump all your data in one place: Hadoop DFS
• Silos come CRASHING DOWN!
• JOIN like crazy!
• ETL like whoa!
• An army of mappers and reducers at your command (a sketch below)
• OMGWTFBBQ IT'S SO GREAT! I FEEL AWESOME!
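To make "an army of mappers and reducers" concrete, here is a minimal sketch (mine, not from the book) of the model via Hadoop Streaming, which runs any executable as a map or reduce step. The file names and jar path are assumptions:

#!/usr/bin/env python
# mapper.py - emit a (word, 1) pair for every whitespace-delimited token on stdin
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t1' % word)

#!/usr/bin/env python
# reducer.py - Hadoop sorts map output by key, so all counts for a word arrive together
import sys
current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip('\n').split('\t')
    if word != current and current is not None:
        print('%s\t%d' % (current, count))
        count = 0
    current = word
    count += int(n)
if current is not None:
    print('%s\t%d' % (current, count))

Run it with something like:
hadoop jar hadoop-streaming.jar -input enron/emails -output counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py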
12. Analytics Apps: It takes a Team
• Broad skill-set
• Nobody has them all
• Inherently collaborative
13. Data Science Team
• 3-4 team members with broad, diverse skill-sets that overlap
• Transactional overhead dominates at 5+ people
• Expert researchers: lend 25-50% of their time to teams
• Creative workers. Like a studio, not an assembly line
• Total freedom... with goals and deliverables
• Work environment matters most
14. How To Get Insight Into Product
• The back-end has gotten THICKER
• Generating $$$ insight can take 10-100x the effort of app dev
• Timelines are disjoint: analytics vs. agile app-dev/design
• How do you ship insights efficiently?
• Can you collaborate on a research vs. a developer timeline?
15. The Wrong Way - Part One
"We made a great design. Your job is to predict the future for it."
16. The Wrong Way - Part Two
"What is taking you so long to reliably predict the future?"
17. The Wrong Way - Part Three
"The users don't understand what 86% true means."
18. The Wrong Way - Part Four
GHJIAEHGIEhjagigehganb!!!!!RJ(@J?!!
19. The Wrong Way - Conclusion
Inevitable Conclusion: (image: a plane flying into a mountain)
21. Chief Problem
You can't design insight in analytics applications.
You discover it.
You discover by exploring.
22. -> Strategy
So make an app for exploring your data.
Which becomes a palette for what you ship.
Iterate and publish intermediate results.
23. Data Design
• It's not the 1st query that = insight, it's the 15th, or the 150th
• Capturing "Ah ha!" moments
• Slow to do those in batch…
• Faster, with better context, in an interactive web application
• Pre-designed charts wind up terrible. So bad.
• Easy to invest man-years in the wrong statistical models
• Semantics of presenting predictions are complex
• Opportunity lies at the intersection of data & design
26. Set Up an Environment Where:
• Insights are repeatedly produced
• Iterative work is shared with the entire team
• Work is interactive from day zero
• The data model is consistent end-to-end
• There is minimal impedance between layers
• The scope and depth of insights grow
• Insights form the palette for what you ship
• ...until the application pays for itself, and more
28. Value Document > Relation
Most data is dirty. Most data is semi-structured or unstructured. Rejoice!
29. Value Document > Relation
Note: Hive/ArrayQL/NewSQL's support of document/array types blurs this distinction.
30. Relational Data = Legacy Format
• Why JOIN? Storage is fundamentally cheap!
• Duplicate that JOIN data in one big record type! (a sketch below)
• ETL once to a document format on import, NOT in every job
• Not zero JOINs, but far fewer JOINs
• Semi-structured documents preserve data's actual structure
• Column-compressed document formats beat JOINs!
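For instance, a hypothetical denormalized email document (mirroring the ETL in step 0.1 below) - the recipients ride along inside the message, so there's no JOIN against a recipients table at read time:

email = {
  'message_id': '<123@example.com>',
  'date': '2001-05-14T16:39:00',
  'from': {'address': 'kenneth.lay@enron.com', 'name': 'Kenneth Lay'},
  'subject': 'Re: Q2 numbers',
  'body': '...',
  'tos':  [{'address': 'jeff.skilling@enron.com', 'name': 'Jeff Skilling'}],
  'ccs':  [],
  'bccs': []
}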
31. Value Imperative > Declarative
• We don't know what we want to SELECT.
• Data is dirty - check each step, clean iteratively.
• 85% of a data scientist's time is spent munging. ETL.
• Imperative is optimized for our process.
• Process = iterative, snowballing insight
• Efficiency matters, so self-optimize.
33. Ex. Dataflow: ETL + Email Sent Count
(I can't read this either. Get a big version here.)
34. Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, JavaScript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.
Actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive.
See: HCatalog for Pig/Hive integration.
35. Localhost vs. Petabyte Scale: Same Tools
• Simplicity is essential to scalability: use the highest-level tools we can
• Prepare a good sample - tricky with joins, easy with documents
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on the cluster, publish to team/customer
• Consider skipping Object-Relational Mapping (ORM)
• We do not trust "databases," only HDFS @ n=3
• Everything we serve in our app is re-creatable via Hadoop.
38. 0.0) Document - Serialize Events
• Protobuf
• Thrift
• JSON
• Avro - I use Avro because the schema is onboard. (A sketch of writing Avro from Python follows.)
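A minimal sketch (not from the book) of writing one event with the Python avro package - the schema travels inside the file, so any reader gets it for free. The record fields and output path here are made up:

from avro import schema, datafile, io

SCHEMA = schema.parse("""
  {"type": "record", "name": "Email", "fields": [
    {"name": "message_id", "type": "string"},
    {"name": "from",       "type": "string"},
    {"name": "subject",    "type": ["string", "null"]},
    {"name": "body",       "type": ["string", "null"]}]}
""")

# DataFileWriter embeds the schema in /tmp/emails.avro alongside the records
writer = datafile.DataFileWriter(open('/tmp/emails.avro', 'wb'), io.DatumWriter(), SCHEMA)
writer.append({'message_id': '<123@example.com>', 'from': 'me@example.com',
               'subject': 'hello', 'body': 'the schema is onboard!'})
writer.close()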
39. 0.1) Documents via Relational ETL
enron_messages = load '/enron/enron_messages.tsv' as (
    message_id:chararray,
    sql_date:chararray,
    from_address:chararray,
    from_name:chararray,
    subject:chararray,
    body:chararray
);
enron_recipients = load '/enron/enron_recipients.tsv' as (
    message_id:chararray,
    reciptype:chararray,
    address:chararray,
    name:chararray
);
split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;
emails = foreach with_headers generate
    enron_messages::message_id as message_id,
    CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
    TOTUPLE(enron_messages::from_address, enron_messages::from_name)
        as from:tuple(address:chararray, name:chararray),
    enron_messages::subject as subject,
    enron_messages::body as body,
    headers::tos.(address, name) as tos,
    headers::ccs.(address, name) as ccs,
    headers::bccs.(address, name) as bccs;
store emails into '/enron/emails.avro' using AvroStorage();
Example here.
40. 0.2) Serialize Events from Streams
class GmailSlurper(object):
  ...
  def init_imap(self, username, password):
    self.username = username
    self.password = password
    try:
      self.imap.shutdown()
    except:
      pass
    self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    self.imap.login(username, password)
    self.imap.is_readonly = True
  ...
  def write(self, record):
    self.avro_writer.append(record)
  ...
  def slurp(self):
    if(self.imap and self.imap_folder):
      for email_id in self.id_list:
        (status, email_hash, charset) = self.fetch_email(email_id)
        if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
          print email_id, charset, email_hash['thread_id']
          self.write(email_hash)
Scrape your own gmail in Python and Ruby.
41. 0.3) ETL Logs
log_data = LOAD 'access_log'
  USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader()
  AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes);
42. 1) Plumb Atomic Events -> Browser
(Example stack that enables high productivity)
46. 1.4) Publish Events to a "Database"
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro
    -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig
Which does this:
/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
/* By default, let's have 5 reducers */
set default_parallel 5
avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();
Full instructions here.
51. What's the Point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.
• The entire team is grounded in reality!
• You'll see how ugly your data really is.
• You'll see how much work you have yet to do.
• Ship early and often!
• Feels agile, don't it? Keep it up!
52. 1.7) Wrap Events with Bootstrap
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
<div class="container" style="margin-top: 100px;">
<table class="table table-striped table-bordered table-condensed">
<thead>
{% for key in data['keys'] %}
<th>{{ key }}</th>
{% endfor %}
</thead>
<tbody>
<tr>
{% for value in data['values'] %}
<td>{{ value }}</td>
{% endfor %}
</tr>
</tbody>
</table>
</div>
</body>
Complete example here with code here.
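For orientation, a hypothetical Flask route that could feed this template - the linked example is the real thing; the collection, template name and route here are assumptions:

from flask import Flask, render_template
import pymongo

app = Flask(__name__)
db = pymongo.MongoClient().enron  # assumes a local MongoDB with our published data

@app.route('/email/<path:message_id>')
def email(message_id):
    record = db.emails.find_one({'message_id': message_id})
    record.pop('_id', None)  # Mongo's ObjectId won't render nicely
    return render_template('table.html',
                           data={'keys': list(record.keys()),
                                 'values': list(record.values())})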
55. 1.8) List Links to Sorted Events
Use your "database", if it can sort:
mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
  "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
  "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
  "from" : [
...
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w
emails_per_user = foreach (group emails by from.address) {
  sorted = order emails by date;
  last_1000 = limit sorted 1000;
  generate group as from_address, last_1000 as emails;
};
store emails_per_user into '$mongourl' using MongoStorage();
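Serving the precomputed bag is then a single key lookup, with no sort at request time - a sketch with pymongo, assuming the collection name matches the Pig relation above:

import pymongo

db = pymongo.MongoClient().enron
user = db.emails_per_user.find_one({'from_address': 'kenneth.lay@enron.com'})
for email in user['emails'][0:10]:  # already ordered by date in Pig
    print(email['subject'])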
57. 1.9) Make It Searchable
If you have a list, search is easy with ElasticSearch and Wonderdog...
/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
Test it with curl:
curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'
ElasticSearch has no security features. Take note. Isolate.
60. 2) Create Simple Charts
• Start with an HTML table on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregating by properties & displaying is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count.
• Publish top N, time series, etc.
• Fill a page with charts.
• Add a chart to your event page.
61. 2.1) Top N (of Anything) in Pig
pig -l /tmp -x local -v -w
top_things = foreach (group things by key) {
  sorted = order things by arbitrary_rank desc;
  top_10_things = limit sorted 10;
  generate group as key, top_10_things as top_10_things;
};
store top_things into '$mongourl' using MongoStorage();
Remember, this is the same structure the browser gets as JSON.
This would make a good Pig Macro.
62. 2.2) Time Series (of Anything) in Pig
pig -l /tmp -x local -v -w
/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
  generate flatten(group) as (key, month),
           COUNT_STAR(things) as total;
/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
  timeseries = order things_by_month by month;
  generate group as key, timeseries as timeseries;
};
store things_timeseries into '$mongourl' using MongoStorage();
Yet another good Pig Macro.
63. Data Processing in Our Stack
A new feature in our application might begin at any layer… GREAT!
Any team member can add new features, no problemo!
"I'm creative! I know Pig!"
"I'm creative too! I <3 JavaScript!"
"omghi2u! where r my legs? send halp"
64. Data Processing in Our Stack
... but we shift the data-processing towards batch, as we are able.
Ex: overall total emails, calculated in each layer (a sketch below).
See real example here.
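From the application's side, that might look like this sketch - the same total is available from each layer, cheapest where it was precomputed in batch (collection names are made up):

import pymongo

db = pymongo.MongoClient().enron
batch_total = db.totals.find_one({'name': 'total_emails'})['value']  # precomputed in Pig
mongo_total = db.emails.count()                                      # computed by the database
app_total   = sum(1 for _ in db.emails.find())                       # computed in the app: slowest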
67. 3.0) From Charts to Reports
• Extract entities from the properties we aggregated by in charts (Step 2)
• Each entity type gets its own kind of web page
• Each unique entity gets its own web page
• Link to entities as they appear in atomic event documents (Step 1)
• Link the most related entities together, within and between types.
• More visualizations!
• Parameterize results via forms. (A sketch of such an entity page follows.)
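A hypothetical entity page in Flask - one route per entity type, one URL per unique entity, parameterized via the query string (all names here are assumptions):

from flask import Flask, render_template, request
import pymongo

app = Flask(__name__)
db = pymongo.MongoClient().enron

@app.route('/person/<address>')
def person(address):
    limit = int(request.args.get('limit', 10))  # the form parameter
    user = db.emails_per_user.find_one({'from_address': address})
    return render_template('person.html', address=address,
                           emails=user['emails'][:limit])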
70. 3.3) Get People Clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.
• See what's interesting.
• Figure out what data needs cleaning, and clean it.
• Start thinking about predictions & recommendations.
"People" could be just your team, if the data is sensitive.
72. 4.0) Preparation
• We've already extracted entities, their properties and relationships
• Our charts show where our signal is rich
• We've cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data
• They got that understanding by exploring the data
• We are all on the same page!
73. 4.2) Think in Different Perspectives
• Networks
• Time Series / Distributions
• Natural Language Processing
• Conditional Probabilities / Bayesian Inference
• Check out Chapter 2 of the book
84. 4.5.3) NLP for All: Extract Topics!
• TF-IDF in Pig - 2 lines of code with Pig Macros:
  http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/
• LDA with Pig and the Lucene Tokenizer:
  http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html
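Those links are the real (Pig) implementations; as a reminder of the math they implement, a toy TF-IDF in Python:

import math
from collections import Counter

docs = [['hadoop', 'pig', 'pig'], ['hadoop', 'hive'], ['pig', 'mongodb']]
df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tf_idf(doc):
    tf = Counter(doc)  # term frequency within this document
    return dict((term, (count / float(len(doc))) * math.log(len(docs) / float(df[term])))
                for term, count in tf.items())

print(tf_idf(docs[0]))  # 'pig' scores highest: frequent here, not everywhere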
93. Why Doesn't Kate Reply to My Emails?
• What time is best to catch her? (a sketch follows this list)
• Are they too long?
• Are they meant to be replied to (original content)?
• Are they nice? (sentiment analysis)
• Do I reply to her emails (reciprocity)?
• Do I cc the wrong people (my mom)?
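The first question reduces to a small aggregation - a toy sketch, assuming each email document carries a parsed date and a precomputed got_reply flag (e.g. from a self-join on in_reply_to in Pig):

from collections import defaultdict
from datetime import datetime

emails = [  # assumed input: my emails to Kate, flagged with whether she replied
    {'date': datetime(2001, 5, 14, 9, 30), 'got_reply': True},
    {'date': datetime(2001, 5, 14, 16, 5), 'got_reply': False},
    {'date': datetime(2001, 5, 15, 9, 10), 'got_reply': True},
]

sent, replied = defaultdict(int), defaultdict(int)
for email in emails:
    sent[email['date'].hour] += 1
    replied[email['date'].hour] += int(email['got_reply'])

best = max(sent, key=lambda h: replied[h] / float(sent[h]))
print('Best hour to catch her: %02d:00' % best)  # 09:00 on this toy data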
94. Example: Packetpig and PacketLoop
snort_alerts = LOAD '$pcap'
  USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');
countries = FOREACH snort_alerts
  GENERATE
    com.packetloop.packetpig.udf.geoip.Country(src) as country,
    priority;
countries = GROUP countries BY country;
countries = FOREACH countries
  GENERATE
    group,
    AVG(countries.priority) as average_severity;
STORE countries into 'output/choropleth_countries' using PigStorage(',');
Code here.