This document provides an overview of an "Exploratorium", which is described as a guided tour of open source data analysis tools. It discusses exploring patent and health care data using tools like Graphviz and NetworkX to analyze networks, and Redis to store and reshape data. Examples are given of analyzing Reddit comment and user networks by counting words and mapping relationships between comments and users. The document encourages sharing favorite open source tools using the hashtag #exploratorium.
1. The Big Data Exploratorium
A guided tour of open source data analysis tools
Noah Pepper (@noahmp)
Devin Chalmers (@qwzybug)
#exploratorium @osb11
Thursday, June 23, 2011 1
2. Hi,
• We’re here because...
• We are...
• Data Exploration Is...
• Example 1: Patents
• (Chalmers et al. 2010; Buchanan et al. 2010; Pepper et al. 2008)
• Example 2: Health Care
• (Pepper et al. Visweek 2010)
4. Hi,
• Get the code & data samples:
• git clone git@github.com:peppern/exploratorium.git
5. We’re here because...
• There is a really amazing OSS community in the data space.
• This is fantastic news for academics, hobbyists, and professionals alike.
• We want to show what you can do with open source tools and show you the ones we like.
• We’d love to hear what YOUR favorites are; tag #exploratorium to tell us.
• Data exploration is fun...
6. We are...
Noah Pepper - @noahmp
Devin Chalmers - @qwzybug
• Academic Data Junkies
Our academic home. Research focuses on exploring the nature of evolutionary activity through data mining.
• We’re Sorta Lucky
Our startup, where we build data exploration platforms.
7. We Build Data Exploration Tools!
map.clearhealthcosts.com
8. What is data exploration and what is an exploratorium
• Narrow Definition
• Data exploration is having an iterative relationship with your data, analysis, and visualization stack where you build an intuitive cognitive model of the information visualized.
• Why do I say visualization instead of the more general ‘representation’?
exploratorium
noun [usu. in names]
a scientific museum or similar center at which visitors have the opportunity of performing prearranged experiments or demonstrations.
Yes! That means there’s code and data
9. Data Exploration Example
• study evolution of technology in patent records
– technology is a window on culture
– patents are a window on technology
15. PMI distributions
- see clusters
- different kinds of clusters
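The slides do not define PMI. Assuming it means pointwise mutual information, here is a minimal sketch of that quantity for word pairs that co-occur in the same document. The toy documents and the function name `pmi_table` are made up for illustration and are not taken from the talk's code.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs):
    """Return {(w1, w2): PMI} for word pairs co-occurring in a document.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), where each probability is
    the fraction of documents containing the word (or both words).
    """
    n_docs = len(docs)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in docs:
        # Count each word once per document; sort so pair keys are stable.
        words = sorted(set(doc.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    table = {}
    for (w1, w2), n_pair in pair_counts.items():
        p_pair = n_pair / n_docs
        p1 = word_counts[w1] / n_docs
        p2 = word_counts[w2] / n_docs
        table[(w1, w2)] = math.log2(p_pair / (p1 * p2))
    return table


# Hypothetical toy "patent abstracts", for illustration only.
docs = [
    "optical fiber sensor",
    "optical fiber cable",
    "plant cultivar",
    "the optical sensor",
]
scores = pmi_table(docs)
```

Under this definition, a pair of rare words that only ever appear together scores high, while a word that appears almost everywhere contributes little.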
16. PMI Comparison: Plotting a different way
[Plot: PMI integral and halfway rank, with the terms “the”, “optical”, and “cultivar” labeled]
- generality of content?
17. btw, these are older graphs, now we use ggplot2
18. Previous Work in Health Care...
[Figure: bill volume (0 to 500,000) by adjudication type (AMB, ASC, DME, ER, IPH, OPH, PRO). Legend, “Placement in distribution of billed”: Upper 5%, Bottom 5%]
.... with @homerstrong
at Qmedtrix Systems Inc.
19. Previous Work in Health Care...
[Figure: two panels over Amount ($), from 10^1 to 10^7. Top panel: bill volume (0 to 120,000). Bottom panel: dollar density (0.0e+00 to 1.4e+09) for Billed, First Audit, and Second Audit]
... @hadleywickham is a #ballR
http://had.co.nz
20. Health Care Data & Code Samples...
...Hahaha Just Kidding
21. But actually:
• Qmedtrix R&D team members made source contributions, see:
• Homer Strong https://github.com/strongh @homerstrong (Lucky Sort)
• Kevin Lynagh https://github.com/lynaghk (Keming Labs)
22. Exploratorium #1 Patent Networks
Citations amongst the top 10k most cited patents
23. Grab the graph data:
~/exploratorium/patents/toplinks.dot
Graphviz Art is Pretty!
24. GraphViz can graph really big graphs... but they get hard to use
[Image: “Psychedelic Patents”]
25. Graphviz - Play with Graphs
(http://www.graphviz.org)
• sudo port install graphviz or sudo apt-get install graphviz
• graphing commands: dot,neato,twopi,circo,fdp
• dot -Tpdf file.dot -o file.pdf
• More options here:
• http://www.graphviz.org/content/command-line-invocation
• Fun options are in the .dot file:
• http://www.graphviz.org/content/dot-language
26. Styling dots
• node [shape=point, width="0.15",color="#0000001c"];
• edge [arrowsize="0.50", color="#0000001c"];
• There are tons, read the docs and have fun
• You can also try more complex things
• Like constraints, time for example
• Sometimes too many constraints makes GraphViz unhappy...
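As a rough sketch of how those style lines fit into a complete .dot file, the snippet below builds DOT source from a list of citation pairs. The function name `to_dot`, the graph name, and the patent numbers are hypothetical; only the `node` and `edge` attribute lines come from the slide.

```python
def to_dot(edges):
    """Return DOT source for a directed graph using the slide's styling."""
    lines = [
        "digraph patents {",
        # Small, nearly transparent points so dense regions show up darker.
        '  node [shape=point, width="0.15", color="#0000001c"];',
        '  edge [arrowsize="0.50", color="#0000001c"];',
    ]
    for citing, cited in edges:
        lines.append(f'  "{citing}" -> "{cited}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical (citing, cited) patent-number pairs, for illustration only.
edges = [
    ("5000001", "4000001"),
    ("5000002", "4000001"),
    ("5000002", "5000001"),
]
dot_source = to_dot(edges)
print(dot_source)
```

Saving the output as example.dot and running `dot -Tpdf example.dot -o example.pdf` (or `neato`, `fdp`, and so on in place of `dot`) should render it.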
28. UbiGraph
• We loved UbiGraph, but don’t know an OSS alternative
• Renders many nodes (50k+) in 3D with a real-time force-directed layout
• On a Mac Pro with 16 GB of RAM
• Shout out to Apple: thank you for supporting our research!
• It’s ‘free’, but development has stalled, and since it’s closed source we can’t build on it!
• Alternatives?
29. Exploratorium #2
• Making graphs of language using Python, Redis, R and a bunch of awesome libraries
• Thanks
• @hadleywickham
• @homerstrong
• @antirez
• Bryan Lewis (http://illposed.net/)
39. Store the data
Postgres is not too shabby
40. Store the data
SELECT cite AS patent_num, count FROM (SELECT cite,
count(*) AS count FROM citations GROUP BY cite) AS t1
ORDER BY t1.count DESC LIMIT 10
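To try this query without a Postgres server, here is a small sketch that runs it against an in-memory SQLite database from Python. SQLite is swapped in only to keep the example self-contained. The two-column `citations` layout (citing `patent_num`, cited `cite`) is inferred from the queries on these slides, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citations (patent_num INTEGER, cite INTEGER)")
# Hypothetical rows: (citing patent, cited patent).
conn.executemany(
    "INSERT INTO citations VALUES (?, ?)",
    [(101, 1), (102, 1), (103, 1), (101, 2), (102, 2), (103, 3)],
)

# Same shape as the slide's query: count citations per cited patent,
# then keep the ten most cited.
top_cited = conn.execute(
    """
    SELECT cite AS patent_num, count
    FROM (SELECT cite, count(*) AS count FROM citations GROUP BY cite) AS t1
    ORDER BY t1.count DESC
    LIMIT 10
    """
).fetchall()
print(top_cited)
```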
41. Store the data
SELECT "cite", count(*), "year" FROM "citations"
INNER JOIN (SELECT date_part('year', "grantdate") AS
"year", "patent_num" AS "patent_num" FROM "patents")
AS "t1" USING ("patent_num") WHERE (cite IN (12345))
GROUP BY "year", "cite"
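The per-year breakdown can be sketched the same way in Python with an in-memory SQLite database. SQLite has no `date_part`, so this version substitutes `strftime('%Y', ...)`. The table layouts are inferred from the slide and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patents (patent_num INTEGER, grantdate TEXT);
    CREATE TABLE citations (patent_num INTEGER, cite INTEGER);
    """
)
# Hypothetical citing patents and their grant dates.
conn.executemany(
    "INSERT INTO patents VALUES (?, ?)",
    [(101, "1995-03-01"), (102, "1995-07-15"), (103, "1998-01-20")],
)
# Hypothetical rows: (citing patent, cited patent).
conn.executemany(
    "INSERT INTO citations VALUES (?, ?)",
    [(101, 12345), (102, 12345), (103, 12345), (103, 99999)],
)

# How many times was patent 12345 cited by patents granted in each year?
rows = conn.execute(
    """
    SELECT cite, count(*) AS n, year
    FROM citations
    INNER JOIN (SELECT CAST(strftime('%Y', grantdate) AS INTEGER) AS year,
                       patent_num
                FROM patents) AS t1 USING (patent_num)
    WHERE cite IN (?)
    GROUP BY year, cite
    ORDER BY year
    """,
    (12345,),
).fetchall()
print(rows)
```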
42. Store the data
SELECT term, count
FROM (SELECT term, count(*)
      FROM (SELECT patent_num, term FROM tfidfs
            WHERE (tfidf > 0.05)) AS "t1"
      INNER JOIN (SELECT *
                  FROM (SELECT patent_num FROM patent_lengths
                        WHERE (wordcount > 10)) AS "t1"
                  INNER JOIN (SELECT * FROM patents
                              WHERE (grantdate > '1990-01-01'
                                     AND grantdate < '2000-01-01')) AS "t2"
                  USING ("patent_num")) AS "t2"
      USING ("patent_num")
      GROUP BY "term") AS "t3"
ORDER BY count DESC LIMIT 50;
72. Reddit
• Count words by hour: ZSET subreddit:2011-06-21:12, members word [count]
• Comment network: SET thread_id:comments, members “parent_id:child_id”
• User network: SET thread_id:users, members “parent_id:child_id”
• SET subreddit:threads, members thread_id
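One way to read this key layout is sketched below in Python. To stay runnable without a Redis server it uses a tiny in-memory stand-in, `FakeRedis`, which mimics only the `zincrby(name, amount, value)` and `sadd(name, *values)` calls of the redis-py client. The `store_comment` helper, the comment fields, and the reading of the user network as "parent author:child author" are assumptions for illustration, not the talk's actual code.

```python
from collections import defaultdict


class FakeRedis:
    """In-memory stand-in for the two redis-py calls used below."""

    def __init__(self):
        self.zsets = defaultdict(lambda: defaultdict(float))
        self.sets = defaultdict(set)

    def zincrby(self, name, amount, value):
        # Sorted set: add `amount` to the score of member `value`.
        self.zsets[name][value] += amount
        return self.zsets[name][value]

    def sadd(self, name, *values):
        # Plain set: return how many members were newly added.
        before = len(self.sets[name])
        self.sets[name].update(values)
        return len(self.sets[name]) - before


def store_comment(r, subreddit, thread_id, comment):
    """Record one comment under the key layout listed on the slide."""
    hour_key = f"{subreddit}:{comment['hour']}"  # e.g. subreddit:2011-06-21:12
    for word in comment["body"].lower().split():
        r.zincrby(hour_key, 1, word)
    # Edges are stored as "parent:child" strings.
    r.sadd(f"{thread_id}:comments", f"{comment['parent_id']}:{comment['id']}")
    r.sadd(f"{thread_id}:users",
           f"{comment['parent_author']}:{comment['author']}")
    r.sadd(f"{subreddit}:threads", thread_id)


# Hypothetical comment, for illustration only.
r = FakeRedis()
store_comment(r, "programming", "t3_abc", {
    "id": "c2", "parent_id": "c1",
    "author": "bob", "parent_author": "alice",
    "hour": "2011-06-21:12", "body": "Redis is fun fun",
})
```

Because the stand-in uses the same method names and argument order as redis-py, passing a real client instead of `FakeRedis()` should work the same way, though that is untested here.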
77. Reddit
Go forth and graph!
#exploratorium #osb11
We will hire you.
For reals.
78. You Are Now Leaving the Big Data Exploratorium
Please ensure you have your valuables.
Noah Pepper @noahmp
Devin Chalmers @qwzybug
#exploratorium #osb11