The Big Data
       Exploratorium
       A guided tour of open source
       data analysis tools

       Noah Pepper (@noahmp)
       Devin Chalmers (@qwzybug)

       #exploratorium @osb11




Thursday, June 23, 2011               1
Hi,

       • We’re here because...


       • We are...


       • Data Exploration Is...


             • Example 1: Patents


                   • (Chalmers et al. 2010; Buchanan et al. 2010; Pepper et al. 2008)


             • Example 2: Health Care


                   • (Pepper et al. Visweek 2010)


Hi,

                   • Exploratorium #1


                          • Patent citation networks


                             • Graphviz


                             • NetworkX


                   • Exploratorium #2


                          • Reddit comment word usages




Hi,




     • Get the code & data samples:


     • git clone git@github.com:peppern/exploratorium.git




We’re here because...

       • There is a really amazing OSS community in the data space.


       • This is fantastic news for academics, hobbyists, and professionals alike.


       • We want to show what you can do with open source tools, and show you the
         ones we like.


       • We’d love to hear what YOUR favorites are; tweet #exploratorium to tell us.


       • Data exploration is fun...




We are...

       Noah Pepper - @noahmp
       Devin Chalmers - @qwzybug

       • Academic Data Junkies

             • Our academic home. Research focuses on exploring the nature of
               evolutionary activity through data mining

       • We’re Sorta Lucky

             • Our startup, where we build data exploration platforms

We Build Data Exploration Tools!

                                     map.clearhealthcosts.com




What is data exploration and what is an exploratorium?

       • Narrow Definition

       • Data exploration is having an iterative relationship with your data,
         analysis, and visualization stack where you build an intuitive cognitive
         model of the information visualized.

       • Why do I say visualization instead of the more general ‘representation’?

       exploratorium
       noun [usu. in names]
       a scientific museum or similar center at which visitors have the
       opportunity of performing prearranged experiments or demonstrations.

       Yes! That means there’s code and data

Data Exploration Example


             • study evolution of technology in patent records
                   – technology is a window on culture
                   – patents are a window on technology




Patent Networks




Citation Analysis of Patents




Time Series Text Analysis




Some explorations are more open ended




Pointwise Mutual Information (PMI)




          # patents that contain words x and y
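
The formula image on this slide did not survive extraction; only the label for the joint count did. As a rough sketch, assuming the standard definition PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ) with probabilities estimated as fractions of patents, the computation could look like the following. The function name `pmi_table` and the toy word lists are made up for illustration and are not from the talk's repository.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs):
    """Return {(x, y): PMI in bits} for word pairs that co-occur in some document.

    Each document is a list of words; probabilities are document frequencies.
    """
    n_docs = len(docs)
    word_counts = Counter()   # number of documents containing word x
    pair_counts = Counter()   # number of documents containing both x and y
    for doc in docs:
        words = sorted(set(doc))              # count each word once per document
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    table = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / n_docs
        p_x = word_counts[x] / n_docs
        p_y = word_counts[y] / n_docs
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table


# Toy stand-ins for patent texts (hypothetical data).
patents = [
    ["optical", "fiber", "the"],
    ["optical", "lens", "the"],
    ["cultivar", "rose", "the"],
    ["cultivar", "rose", "the"],
]
scores = pmi_table(patents)
print(scores[("cultivar", "rose")])  # words that always appear together score high
print(scores[("optical", "the")])    # a word that appears everywhere scores zero
```

Pairs are keyed in alphabetical order, so look up `("optical", "the")` rather than `("the", "optical")`.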




PMI distributions


         - see clusters
         - different kinds
           of clusters




PMI Comparison: Plotting a different way

       [Figure: PMI integral halfway rank for the words “the”, “optical”, and
       “cultivar”. Annotation: generality of content?]

btw, these are older graphs; now we use ggplot2




Previous Work in Health Care...

       [Figure: bill volume (0 to 500,000) by adjudication type (AMB, ASC, DME,
       ER, IPH, OPH, PRO). Legend: placement in distribution of billed, upper 5%
       and bottom 5%.]

       .... with @homerstrong
       at Qmedtrix Systems Inc.

Previous Work in Health Care...

       [Figure: two panels over a shared log-scale x-axis, Amount ($), from 10^1
       to 10^7. Top panel: bill volume (0 to 120,000). Bottom panel: dollar
       density (0.0e+00 to 1.4e+09). Series: Billed, First Audit, Second Audit.]

       ... @hadleywickham is a #ballR
       http://had.co.nz

Health Care Data & Code Samples...




                              ...Hahaha Just Kidding

But actually:

       • Qmedtrix R&D team members made source contributions, see:


             • Homer Strong https://github.com/strongh @homerstrong (Lucky Sort)


             • Kevin Lynagh https://github.com/lynaghk (Keming Labs)




Exploratorium #1 Patent Networks




       citations amongst top 10k most cited patents

Grab the graph data:
                          ~/exploratorium/patents/toplinks.dot




                                               Graphviz Art is Pretty!

GraphViz can graph really big graphs... but they get hard to use.

       [Figure: “Psychedelic Patents”]

Graphviz - Play with Graphs
       (http://www.graphviz.org)

       • sudo port install graphviz or sudo apt-get install graphviz


       • graphing commands: dot, neato, twopi, circo, fdp


       • dot -Tpdf file.dot -o file.pdf


       • More options here:


             • http://www.graphviz.org/content/command-line-invocation


       • Fun options are in the .dot file:


             • http://www.graphviz.org/content/dot-language


Styling dots

       • 	 node [shape=point, width="0.15",color="#0000001c"];


       • 	 edge [arrowsize="0.50", color="#0000001c"];


       • There are tons, read the docs and have fun


       • You can also try more complex things


             • Like constraints, time for example


             • Sometimes too many constraints make GraphViz unhappy...
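
To try these style attributes on your own data, one option is to generate the .dot file from an edge list. This is only a minimal sketch: the helper `edges_to_dot` and the graph name `citations` are hypothetical, and the three toy edges stand in for real citation pairs.

```python
def edges_to_dot(edges, name="citations"):
    """Build DOT source for a directed graph, using the slide's node and edge styling."""
    lines = [
        f"digraph {name} {{",
        '    node [shape=point, width="0.15", color="#0000001c"];',
        '    edge [arrowsize="0.50", color="#0000001c"];',
    ]
    for citer, citee in edges:
        # Quote node ids so patent numbers or other odd names stay valid DOT.
        lines.append(f'    "{citer}" -> "{citee}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


dot_text = edges_to_dot([("a", "b"), ("c", "b"), ("b", "d")])
print(dot_text)
```

Save the printed text as, say, toy.dot and render it with `dot -Tpdf toy.dot -o toy.pdf`, or swap `dot` for `neato` or `fdp` to compare layouts.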




UbiGraph

       • We loved UbiGraph, but don’t know an OSS alternative


       • Renders many nodes (50k+) in 3D with a real-time force-directed layout.


             • On a Mac Pro with 16 GB of RAM


                          • Shout out to Apple: thank you for supporting our research!


       • It’s ‘free’ but development has stalled and since it’s closed source we can’t
         build on it!


       • Alternatives?




Exploratorium #2

       • Making graphs of language using Python, Redis, R and a bunch of awesome
         libraries


       • Thanks


             • @hadleywickham


             • @homerstrong


             • @antirez


             • Bryan Lewis (http://illposed.net/)




...how?
       Mine — Munge — Visualize




...how?
       github.com/peppern/exploratorium

       [ brew | apt-get | port ] install redis

       www.r-project.org
       github.com/qwzybug/rredis
       redis TTR package




Best show on TV




Mine the data

       • gutenberg.org


       • google.com/ngrams


       • APIs — Twitter, etc.


       • http://code.google.com/apis/socialgraph/


       • Scrape




Store the data




                          Postgres is not too shabby




Store the data




            SELECT cite AS patent_num, count
            FROM (SELECT cite, count(*) AS count
                  FROM citations GROUP BY cite) AS t1
            ORDER BY t1.count DESC LIMIT 10;
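
To see what this "top cited" query returns without a Postgres server, here is a small sketch that runs the same shape of query against an in-memory SQLite database. SQLite is a stand-in chosen for convenience, the three rows are toy data, and the alias `n_citations` replaces `count` purely for readability; the column names `patent_num` and `cite` follow the slide.

```python
import sqlite3

# In-memory database standing in for the talk's Postgres instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citations (patent_num TEXT, cite TEXT)")
conn.executemany(
    "INSERT INTO citations VALUES (?, ?)",
    [("a", "b"), ("c", "b"), ("b", "d")],  # (citing patent, cited patent)
)

# Count how often each patent is cited, then keep the ten most cited.
top_cited = conn.execute(
    """
    SELECT cite AS patent_num, n_citations
    FROM (SELECT cite, count(*) AS n_citations
          FROM citations GROUP BY cite) AS t1
    ORDER BY t1.n_citations DESC
    LIMIT 10
    """
).fetchall()
print(top_cited)
conn.close()
```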




Store the data




            SELECT cite, count(*), year
            FROM citations
            INNER JOIN (SELECT date_part('year', grantdate) AS year,
                               patent_num
                        FROM patents) AS t1 USING (patent_num)
            WHERE cite IN (12345)
            GROUP BY year, cite;




Store the data



            SELECT term, count
            FROM (SELECT term, count(*)
                  FROM (SELECT patent_num, term FROM tfidfs
                        WHERE tfidf > 0.05) AS "t1"
                  INNER JOIN (SELECT *
                              FROM (SELECT patent_num FROM patent_lengths
                                    WHERE wordcount > 10) AS "t1"
                              INNER JOIN (SELECT * FROM patents
                                          WHERE grantdate > '1990-01-01'
                                            AND grantdate < '2000-01-01') AS "t2"
                              USING ("patent_num")) AS "t2"
                  USING ("patent_num")
                  GROUP BY "term") AS "t3"
            ORDER BY count DESC LIMIT 50;




Store the data




                          NoSQL is a good fit for web data




Reshape the data



                             citer   citee
                              a       b
                              c       b
                              b       d




      { a : [b], c : [b], b: [d] }        { b : [a, c], d : [b] }
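
The reshaping on this slide, from a citer/citee table to the two lookup maps, can be sketched in a few lines of Python. The function name `build_adjacency` is made up for illustration; the edge list is the one shown in the table.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn (citer, citee) pairs into forward and reverse adjacency maps."""
    cites = defaultdict(list)     # citer -> patents it cites
    cited_by = defaultdict(list)  # citee -> patents that cite it
    for citer, citee in edges:
        cites[citer].append(citee)
        cited_by[citee].append(citer)
    return dict(cites), dict(cited_by)


edges = [("a", "b"), ("c", "b"), ("b", "d")]
cites, cited_by = build_adjacency(edges)
print(cites)     # forward map: who each patent cites
print(cited_by)  # reverse map: who cites each patent
```

Keeping both directions costs twice the memory but makes "who cites X?" and "what does X cite?" equally cheap lookups.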

Redis




                          In-Memory Data Structure Server




Redis

       • HSET key name value


       • SADD key value


       • ZUNIONSTORE


       • HSETNX


       • BRPOPLPUSH


       •…




Redis

                          Global variable for all your programs

                              Memcached with structure

                                       Really fast

                                    Really really fast

                            Really, really, astonishingly fast

                                  No, faster than that

Reddit

       • Count words by hour   ZSET subreddit:2011-06-21:12
                                     word [count]
       • Comment network       SET thread_id:comments
                                     “parent_id:child_id”

       • User network          SET thread_id:users
                                     “parent_id:child_id”
                               SET subreddit:threads
                                     thread_id
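
The key layout above can be tried out without a Redis server. The sketch below mimics it with plain Python containers: a `Counter` per key plays the role of a ZSET whose scores are incremented (the job of the Redis ZINCRBY command), and Python sets play the role of Redis SETs. The comment fields (`id`, `author`, `created_utc`, `body`) and the function `record_comment` are assumptions for illustration; the talk's actual scraper may differ, and reading the "users" members as parent author and child author is a guess.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timezone

word_counts = defaultdict(Counter)  # ZSET subreddit:YYYY-MM-DD:HH -> word [count]
comment_edges = defaultdict(set)    # SET thread_id:comments -> "parent_id:child_id"
user_edges = defaultdict(set)       # SET thread_id:users -> "parent_author:child_author"
threads = defaultdict(set)          # SET subreddit:threads -> thread_id


def record_comment(subreddit, thread_id, comment, parent=None):
    """File one comment under the hourly word counts and the two networks."""
    stamp = datetime.fromtimestamp(comment["created_utc"], tz=timezone.utc)
    hour_key = subreddit + ":" + stamp.strftime("%Y-%m-%d:%H")
    for word in re.findall(r"[a-z']+", comment["body"].lower()):
        word_counts[hour_key][word] += 1
    threads[f"{subreddit}:threads"].add(thread_id)
    if parent is not None:
        comment_edges[f"{thread_id}:comments"].add(f"{parent['id']}:{comment['id']}")
        user_edges[f"{thread_id}:users"].add(f"{parent['author']}:{comment['author']}")


# Two hypothetical comments posted during the 12:00 UTC hour on 2011-06-21.
ts = datetime(2011, 6, 21, 12, 30, tzinfo=timezone.utc).timestamp()
top = {"id": "c1", "author": "alice", "created_utc": ts, "body": "Redis is fast"}
reply = {"id": "c2", "author": "bob", "created_utc": ts, "body": "Really, really fast"}
record_comment("programming", "t1", top)
record_comment("programming", "t1", reply, parent=top)

print(word_counts["programming:2011-06-21:12"].most_common(2))
print(comment_edges["t1:comments"], user_edges["t1:users"])
```

Packing the date and hour into the key means one hour of word counts is a single key lookup, and several hours can be combined later, which is the kind of job ZUNIONSTORE does in Redis.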




Reddit




                          (demo)




Reddit



                           Go forth and graph!

                          #exploratorium #osb11

                             We will hire you.

                                For reals.


You Are Now Leaving
       the Big Data
       Exploratorium
       Please ensure you have your
       valuables.

       Noah Pepper @noahmp
       Devin Chalmers @qwzybug

       #exploratorium #osb11




Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 

Recently uploaded (20)

Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

Open Source Data Analysis Tools Exploratorium

  • 1. The Big Data Exploratorium A guided tour of open source data analysis tools Noah Pepper (@noahmp) Devin Chalmers (@qwzybug) #exploratorium @osb11 Thursday, June 23, 2011 1
  • 2. Hi, • We’re here because... • We are... • Data Exploration Is... • Example 1: Patents • (Chalmers et al. 2010; Buchanan et al. 2010; Pepper et al. 2008) • Example 2: Health Care • (Pepper et al. Visweek 2010)
  • 3. Hi, • Exploratorium #1 • Patent citation networks • Graphviz • NetworkX • Exploratorium #2 • Reddit comment word usages
  • 4. Hi, • Get the code & data samples: • git clone git@github.com:peppern/exploratorium.git
  • 5. We’re here because... • There is a really amazing OSS community in the data space. • This is fantastic news for academics, hobbyists, and professionals alike. • We want to show what you can do with open source tools, and show you the ones we like. • We’d love to hear what YOUR favorites are; use #exploratorium to tell us. • Data exploration is fun...
  • 6. We are... Noah Pepper - @noahmp • Devin Chalmers - @qwzybug • Academic Data Junkies: our academic home. Research focuses on exploring the nature of evolutionary activity through data mining. • We’re Sorta Lucky: our startup, where we build data exploration platforms.
  • 7. We Build Data Exploration Tools! map.clearhealthcosts.com
  • 8. What is data exploration and what is an exploratorium • Narrow Definition • Data exploration is having an iterative relationship with your data, analysis, and visualization stack where you build an intuitive cognitive model of the information visualized. • Why do I say visualization instead of the more general ‘representation’? • exploratorium, noun [usu. in names]: a scientific museum or similar center at which visitors have the opportunity of performing prearranged experiments or demonstrations. • Yes! That means there’s code and data
  • 9. Data Exploration Example • study evolution of technology in patent records – technology is a window on culture – patents are a window on technology
  • 11. Citation Analysis of Patents
  • 12. Time Series Text Analysis
  • 13. Some explorations are more open ended
  • 14. Pointwise Mutual Information (PMI) • # patents that contain words x and y
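The formula on this slide did not survive in the transcript, so here is a minimal sketch using the standard definition, PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). It assumes each probability is estimated as the fraction of patents containing the word (or both words), in line with the slide's "# patents that contain words x and y". The toy documents, the `pmi_table` name, and the choice of log base 2 are illustrative assumptions, not the authors' code.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs):
    """Return {(x, y): PMI} for every word pair sharing at least one document."""
    n = len(docs)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in docs:
        words = sorted(set(doc))  # count each word once per document
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))  # pairs come out alphabetically ordered
    table = {}
    for (x, y), n_xy in pair_counts.items():
        # p(x, y) / (p(x) * p(y)), with probabilities estimated as document fractions
        ratio = (n_xy / n) / ((word_counts[x] / n) * (word_counts[y] / n))
        table[(x, y)] = math.log2(ratio)
    return table


# Hypothetical toy "patents", each reduced to a list of words.
toy_patents = [
    ["optical", "lens", "the"],
    ["optical", "lens"],
    ["cultivar", "the"],
    ["the", "lens"],
]
scores = pmi_table(toy_patents)
```

Pairs that co-occur more often than their individual frequencies predict get a positive score; pairs that co-occur less often get a negative one.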
  • 15. PMI distributions - see clusters - different kinds of clusters
  • 16. PMI Comparison: Plotting a different way • [Figure: PMI integral halfway rank, with “the”, “optical”, and “cultivar” labeled] - generality of content?
  • 17. btw, these are older graphs, now we use ggplot2
  • 18. Previous Work in Health Care... [Figure: bill volume (0 to 500,000) by adjudication type (AMB, ASC, DME, ER, IPH, OPH, PRO); legend: placement in distribution of billed (upper 5%, bottom 5%)] ... with @homerstrong at Qmedtrix Systems Inc.
  • 19. Previous Work in Health Care... [Figure: bill volume and dollar density against amount ($, log scale, 10^1 to 10^7); legend: Billed, First Audit, Second Audit] ... @hadleywickham is a #ballR http://had.co.nz
  • 20. Health Care Data & Code Samples... ...Hahaha Just Kidding
  • 21. But actually: • Qmedtrix R&D team members made source contributions, see: • Homer Strong https://github.com/strongh @homerstrong (Lucky Sort) • Kevin Lynagh https://github.com/lynaghk (Keming Labs)
  • 22. Exploratorium #1 Patent Networks • citations amongst top 10k most cited patents
  • 23. Grab the graph data: ~/exploratorium/patents/toplinks.dot • Graphviz Art is Pretty!
  • 24. Graphviz can graph really big graphs... but they get hard to use • Psychedelic Patents
  • 25. Graphviz - Play with Graphs (http://www.graphviz.org) • sudo port install graphviz or sudo apt-get install graphviz • graphing commands: dot, neato, twopi, circo, fdp • dot -Tpdf -o file.pdf file.dot • More options here: • http://www.graphviz.org/content/command-line-invocation • Fun options are in the .dot file: • http://www.graphviz.org/content/dot-language
  • 26. Styling dots • node [shape=point, width="0.15", color="#0000001c"]; • edge [arrowsize="0.50", color="#0000001c"]; • There are tons, read the docs and have fun • You can also try more complex things • Like constraints, time for example • Sometimes too many constraints make Graphviz unhappy...
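To try the styling attributes above without the full patent data, a small script can generate a .dot file from an edge list. This is a sketch only: the `to_dot` helper and the patent numbers are made up for illustration, and the talk's repository may build its graph files differently.

```python
def to_dot(edges, name="citations"):
    """Build DOT source for a directed graph from (citer, citee) pairs."""
    lines = [f"digraph {name} {{"]
    # Default styling from the slide: small, mostly transparent points and arrows.
    lines.append('  node [shape=point, width="0.15", color="#0000001c"];')
    lines.append('  edge [arrowsize="0.50", color="#0000001c"];')
    for citer, citee in edges:
        lines.append(f'  "{citer}" -> "{citee}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical patent numbers, for illustration only.
sample_edges = [
    ("5000001", "4000001"),
    ("5000002", "4000001"),
    ("5000003", "5000001"),
]
dot_source = to_dot(sample_edges)
```

Saving `dot_source` to a file such as sample.dot and running `dot -Tpdf -o sample.pdf sample.dot` (or swapping `dot` for `neato`, `fdp`, and so on) should render it.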
  • 28. UbiGraph • We loved UbiGraph, but don’t know an OSS alternative • Renders many nodes (50k+) in 3D with a real-time force-directed layout • on a Mac Pro with 16 GB of RAM • Shout out to Apple: thank you for supporting our research! • It’s ‘free’, but development has stalled, and since it’s closed source we can’t build on it! • Alternatives?
  • 29. Exploratorium #2 • Making graphs of language using Python, Redis, R and a bunch of awesome libraries • Thanks • @hadleywickham • @homerstrong • @antirez • Bryan Lewis (http://illposed.net/)
  • 30. ...how? Mine, Munge, Visualize
  • 31. ...how? • github.com/peppern/exploratorium • [ brew | apt-get | port ] install redis • www.r-project.org • github.com/qwzybug/rredis • redis TTR package
  • 32. Best show on TV
  • 37. Mine the data • gutenberg.org • google.com/ngrams • APIs: Twitter, etc. • http://code.google.com/apis/socialgraph/ • Scrape
  • 39. Store the data • Postgres is not too shabby
  • 40. Store the data • SELECT cite AS patent_num, count FROM (SELECT cite, count(*) AS count FROM citations GROUP BY cite) AS t1 ORDER BY t1.count DESC LIMIT 10
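The "most cited patents" query above can be tried without a Postgres install. The sketch below runs the same shape of query against an in-memory SQLite database from Python's standard library. The two-column `citations(patent_num, cite)` schema is an assumption inferred from the queries on these slides, the rows are made up, and the count column is renamed `n_citations` to sidestep any clash with the `count` function name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: patent_num is the citing patent, cite is the patent it cites.
conn.execute("CREATE TABLE citations (patent_num INTEGER, cite INTEGER)")
rows = [(10, 1), (11, 1), (12, 1), (11, 2), (12, 2), (12, 3)]  # made-up data
conn.executemany("INSERT INTO citations VALUES (?, ?)", rows)

# Count incoming citations per cited patent, then keep the ten most cited.
top_cited = conn.execute(
    """
    SELECT cite AS patent_num, n_citations
    FROM (SELECT cite, count(*) AS n_citations
          FROM citations
          GROUP BY cite) AS t1
    ORDER BY t1.n_citations DESC
    LIMIT 10
    """
).fetchall()
conn.close()
```

Each result row is a `(patent_num, n_citations)` pair, most cited first.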
  • 41. Store the data • SELECT "cite", count(*), "year" FROM "citations" INNER JOIN (SELECT date_part('year', "grantdate") AS "year", "patent_num" AS "patent_num" FROM "patents") AS "t1" USING ("patent_num") WHERE (cite IN (12345)) GROUP BY "year", "cite"
  • 42. Store the data • SELECT term, count FROM (SELECT term, count(*) FROM (SELECT patent_num, term FROM tfidfs WHERE (tfidf > 0.05)) AS "t1" INNER JOIN (SELECT * FROM (SELECT patent_num FROM patent_lengths WHERE (wordcount > 10)) AS "t1" INNER JOIN (SELECT * FROM patents WHERE (grantdate > '1990-01-01' AND grantdate < '2000-01-01')) AS "t2" USING ("patent_num")) AS "t2" USING ("patent_num") GROUP BY "term") AS "t3" ORDER BY count DESC LIMIT 50;
  • 44. Store the data • NoSQL is a good fit for web data
  • 48. Reshape the data • citer → citee: a → b, c → b, b → d • { a : [b], c : [b], b: [d] } • { b : [a, c], d : [b] }
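The reshaping on this slide, from a citer/citee edge list to one map keyed by citer and another keyed by citee, can be sketched in a few lines of Python. The function name `group_edges` is made up for illustration; the data is the slide's own a/b/c/d example.

```python
from collections import defaultdict


def group_edges(edges):
    """Turn (citer, citee) pairs into two adjacency maps."""
    by_citer = defaultdict(list)  # citer -> patents it cites
    by_citee = defaultdict(list)  # citee -> patents that cite it
    for citer, citee in edges:
        by_citer[citer].append(citee)
        by_citee[citee].append(citer)
    return dict(by_citer), dict(by_citee)


edges = [("a", "b"), ("c", "b"), ("b", "d")]
cites, cited_by = group_edges(edges)
# cites    == {"a": ["b"], "c": ["b"], "b": ["d"]}
# cited_by == {"b": ["a", "c"], "d": ["b"]}
```

Keeping both directions makes "what does this patent cite?" and "who cites this patent?" equally cheap lookups.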
  • 49. Redis • In-Memory Data Structure Server
  • 51. Redis • HSET key name value • SADD key value • ZUNIONSTORE • HSETNX • BRPOPLPUSH • …
  • 58. Redis • Global variable for all your programs • Memcached with structure • Really fast • Really really fast • Really, really, astonishingly fast • No, faster than that
  • 72. Reddit • Count words by hour: ZSET subreddit:2011-06-21:12 word [count] • Comment network: SET thread_id:comments “parent_id:child_id” • User network: SET thread_id:users “parent_id:child_id”; SET subreddit:threads thread_id
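The word-count key scheme above (one sorted set per subreddit per hour, with words as members and counts as scores) can be sketched without a running Redis server. This is a stand-in, not the talk's code: a dict of `Counter` objects plays the role of the sorted sets, the tokenizer and the comment tuples are made up, and against a real server each increment would be a `ZINCRBY key 1 word` call.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timezone


def hour_key(subreddit, created_utc):
    """Build a key like 'programming:2011-06-21:12' from a Unix timestamp."""
    stamp = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return f"{subreddit}:" + stamp.strftime("%Y-%m-%d:%H")


def count_words(comments):
    """comments: iterable of (subreddit, created_utc, body) tuples."""
    zsets = defaultdict(Counter)  # stand-in for Redis sorted sets
    for subreddit, created_utc, body in comments:
        key = hour_key(subreddit, created_utc)
        for word in re.findall(r"[a-z']+", body.lower()):
            zsets[key][word] += 1  # in Redis: ZINCRBY key 1 word
    return zsets


# Hypothetical comments posted within the same hour.
ts = int(datetime(2011, 6, 21, 12, 30, tzinfo=timezone.utc).timestamp())
comments = [
    ("programming", ts, "Redis is fast, really fast"),
    ("programming", ts + 60, "fast"),
]
word_counts = count_words(comments)
top_word = word_counts["programming:2011-06-21:12"].most_common(1)
```

`most_common` plays the part that a reverse range query (highest scores first) would play on a real sorted set.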
  • 73. Reddit • github.com/peppern/exploratorium • [ brew | apt-get | port ] install redis • www.r-project.org • github.com/qwzybug/rredis • redis TTR package
  • 74. Reddit (demo)
  • 77. Reddit • Go forth and graph! #exploratorium #osb11 • We will hire you. • For reals.
  • 78. You Are Now Leaving the Big Data Exploratorium • Please ensure you have your valuables. • Noah Pepper @noahmp • Devin Chalmers @qwzybug • #exploratorium #osb11