Tapping the Data Deluge with R

Finding and using supplemental data to add context to your analysis

by Jeffrey Breen
Principal, Think Big Academy
Code & Data on github: http://bit.ly/pawdata
email: jeffrey.breen@thinkbiganalytics.com
blog: http://jeffreybreen.wordpress.com
Twitter: @JeffreyBreen
Data data everywhere!

This may be what you picture when you hear “data deluge,” at least if you work for the Economist.

But those of us who wrangle data for a living know that it’s usually not so prosaic or buttoned-down, proper or quaint.
Real data hits us in the face...
Real data can hit you in the face.

Yet we keep coming back for more.
...and then there’s Big Data.


And I’m not even going to talk about Big Data tonight. (For a change!)
Finding the right data makes all the difference

Tonight we’re going to look at a few different places to find the data sets that can make a difference, and a few techniques to access them so you can incorporate them into your analysis.
The two types of data

        Data you have
        Data you don’t have... yet
Perhaps you’ve heard the joke: There are two kinds of people: People who think there are two kinds of people and people
who don’t.

I like to think that there are two kinds of data.
The two types of data
   • Data you have
      – CSV files, spreadsheets
      – files from other statistics packages (SPSS, SAS, Stata,...)
      – databases, data warehouses (SQL, NoSQL, HBase,...)
      – whatever your boss emailed you on his way to lunch
      – datasets within R and R packages

   • Data you don’t have... yet
      – file downloads & web scraping
      – data marketplaces and other APIs
Reading CSV files is easy
   $ head -5 data/mpg-3-13-2012.csv | cut -c 1-60
   "Model Yr","Mfr Name","Division","Carline","Verify Mfr Cd","
   2012,"aston martin","Aston Martin Lagonda Ltd","V12 Vantage"
   2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
   2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
   2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",


   # Read the fuel economy file into a data frame
   data = read.csv('data/mpg-3-13-2012.csv')

   # Browse the result in a spreadsheet-style viewer
   View(data)

see R/01-read.csv-mpg.R
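One detail worth knowing with files like this one: by default read.csv() rewrites headers such as "Model Yr" into syntactically valid R names. The following is a minimal sketch of that behaviour, using two made-up rows in the same shape as the file above (the rows are illustrative, not copied from the EPA data):

```r
# Two illustrative rows; the values are made up for this sketch
csv_text <- '"Model Yr","Mfr Name","Carline"
2012,"aston martin","V12 Vantage"
2012,"aston martin","V8 Vantage"'

# Default behaviour: spaces in headers become dots
mpg_default <- read.csv(text = csv_text)
names(mpg_default)

# check.names = FALSE keeps the headers exactly as written in the file
mpg_raw <- read.csv(text = csv_text, check.names = FALSE)
names(mpg_raw)
```

Keeping the raw names means quoting them with backticks later (for example mpg_raw$`Model Yr`), so the default is often the more convenient choice.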
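But so is reading Excel files directly
   library(XLConnect)

   # Open the existing workbook (create=FALSE: do not create it if it is missing)
   wb = loadWorkbook("data/mpg.xlsx", create=FALSE)

   # Read one worksheet into a data frame
   data = readWorksheet(wb, sheet='3-7-2012')

see R/02-XLConnect-mpg.R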
“foreign” file formats
   library(foreign)

   # SPSS (read.spss() returns a list unless to.data.frame=TRUE is given)
   sav.file = file.path(system.file(package='foreign'), 'tests', 'sample100.sav')
   spss.data = read.spss(sav.file)

   # SAS transport (XPORT) file
   xpt.file = file.path(system.file(package='foreign'), 'tests', 'test.xpt')
   sas.data = read.xport(xpt.file)

   # Stata
   dta.file = file.path(system.file(package='foreign'), 'tests', 'auto8.dta')
   stata.data = read.dta(dta.file)

   # dBase
   dbf.file = file.path(system.file(package='foreign'), 'files', 'sids.dbf')
   dbf.data = read.dbf(dbf.file)

see R/03-foreign.R
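Since read.spss() returns a plain list by default, it can help to ask for a data frame explicitly. A small sketch, assuming the foreign package is installed and using electric.sav, the sample file that the read.spss() help page itself uses (it is not one of the files shown on the slide):

```r
library(foreign)

# Sample file from the package's 'files' directory, as used in ?read.spss
sav_file <- system.file("files", "electric.sav", package = "foreign")

# Default return value is a list of columns
as_list <- read.spss(sav_file)

# to.data.frame = TRUE returns a data frame instead
as_df <- read.spss(sav_file, to.data.frame = TRUE)

class(as_list)
class(as_df)
```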
Relational databases
 library(RMySQL)

 con = dbConnect(MySQL(), user="root", dbname="test")

 data = dbGetQuery(con, "select * from airport")

 dbDisconnect(con)

 View(data)

     airport_code   airport_name                         location                      state_code   country_name   time_zone_code
 1   ATL            WILLIAM B. HARTSFIELD                ATLANTA,GEORGIA               GA           USA            EST
 2   BOS            LOGAN INTERNATIONAL                  BOSTON,MASSACHUSETTS          MA           USA            EST
 3   BWI            BALTIMORE/WASHINGTON INTERNATIONAL   BALTIMORE,MARYLAND            MD           USA            EST
 4   DEN            STAPLETON INTERNATIONAL              DENVER,COLORADO               CO           USA            MST
 5   DFW            DALLAS/FORT WORTH INTERNATIONAL      DALLAS/FT. WORTH,TEXAS        TX           USA            CST
 6   OAK            METROPOLITAN OAKLAND INTERNATIONAL   OAKLAND,CALIFORNIA            CA           USA            PST
 7   PHL            PHILADELPHIA INTERNATIONAL           PHILADELPHIA PA/WILM'TON,DE   PA           USA            EST
 8   PIT            GREATER PITTSBURGH                   PITTSBURGH,PENNSYLVANIA       PA           USA            EST
 9   SFO            SAN FRANCISCO INTERNATIONAL          SAN FRANCISCO,CALIFORNIA      CA           USA            PST




see R/04-RMySQL-airport.R
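The connect, query, disconnect pattern is the same across DBI backends. Here is a sketch that runs without a MySQL server by swapping in an in-memory SQLite database through the RSQLite package (listed under Resources); the two-row table is a hypothetical cut-down copy of the airport output above:

```r
library(DBI)

# In-memory SQLite database standing in for the MySQL server (assumption for this sketch)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical two-row airport table, values taken from the output above
airport <- data.frame(airport_code = c("BOS", "SFO"),
                      state_code   = c("MA", "CA"),
                      stringsAsFactors = FALSE)
dbWriteTable(con, "airport", airport)

# Same pattern as the slide: send SQL, get a data frame back
east <- dbGetQuery(con, "select * from airport where state_code = 'MA'")

dbDisconnect(con)
east
```

Pushing the filter into the SQL statement means only the matching rows travel back to R, which matters more as tables grow.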
Non-relational databases too
> library(rhbase)
> hb.init(serialize='raw')
> x = hb.get(tablename='tweets', rows='221325531868692480')
> str(x)
List of 1
 $ :List of 3
  ..$ : chr "221325531868692480"
  ..$ : chr [1:10] "created:" "favorited:" "id:" "replyToSID:" ...
  ..$ :List of 10
  .. ..$ : chr "2012-07-06 19:31:33"
  .. ..$ : chr "FALSE"
  .. ..$ : chr "221325531868692480"
  .. ..$ : chr "NA"
  .. ..$ : chr "NA"
  .. ..$ : chr "NA"
  .. ..$ : chr "arnicas"
  .. ..$ : chr "<a href="http://www.tweetdeck.com"
rel="nofollow">TweetDeck</a>"
  .. ..$ : chr "RT @bycoffe: From @DrewLinzer, an #Rstats function for querying
the HuffPost Pollster API. http://t.co/fXnG32JX cc @thewhyaxis"
  .. ..$ : chr "FALSE"




weird emails from the boss
   con = textConnection('
   # Hi:
   #
   # Please invite these paid volunteers   to the spontaneous rally at 3PM today:
   #
   Name       Department  "Hourly Rate"    email
   Alice      Operations    32             alice@wonderland.org
   Billy      Logistics      5             billy.pilgrim@slaugterhouse5.com
   Winston    Records       20             winston.smith@truth.gov.oc
   #
   #Thanks,
   #Your Boss
   #!   !   !     !    !
   ')

   data = read.table(con, header=TRUE, comment.char='#')
   close.connection(con)

   View(data)

       Name      Department   Hourly.Rate   email
   1   Alice     Operations   32            alice@wonderland.org
   2   Billy     Logistics    5             billy.pilgrim@slaugterhouse5.com
   3   Winston   Records      20            winston.smith@truth.gov.oc

see R/05-textConnection-email.R
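For text that is already in a character string, read.table() also accepts it directly through its text argument, so the explicit connection is optional. A shortened sketch with made-up rows in the same layout as the email above:

```r
# Shortened, made-up version of the email
email <- '
# Hi: please invite these people
Name   Department  "Hourly Rate"
Alice  Operations  32
Billy  Logistics   5
# Thanks
'

# Lines starting with "#" are dropped; the quoted header is read as one column name
staff <- read.table(text = email, header = TRUE, comment.char = "#")
names(staff)
mean(staff$Hourly.Rate)
```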
> data()
Data sets in package ‘datasets’:
AirPassengers            Monthly Airline Passenger Numbers 1949-1960
BJsales                  Sales Data with Leading Indicator
BJsales.lead (BJsales)
                         Sales Data with Leading Indicator
BOD                      Biochemical Oxygen Demand
CO2                      Carbon Dioxide Uptake in Grass Plants
ChickWeight              Weight versus age of chicks on different diets
DNase                    Elisa assay of DNase
EuStockMarkets           Daily Closing Prices of Major European Stock
                         Indices, 1991-1998
Formaldehyde             Determination of Formaldehyde
HairEyeColor             Hair and Eye Color of Statistics Students
Harman23.cor             Harman Example 2.3
Harman74.cor             Harman Example 7.4
Indometh                 Pharmacokinetics of Indomethacin
InsectSprays             Effectiveness of Insect Sprays
JohnsonJohnson           Quarterly Earnings per Johnson & Johnson Share
LakeHuron                Level of Lake Huron 1875-1972
LifeCycleSavings         Intercountry Life-Cycle Savings Data
Loblolly                 Growth of Loblolly pine trees
Nile                     Flow of the River Nile
Orange                   Growth of Orange Trees
OrchardSprays            Potency of Orchard Sprays
PlantGrowth              Results from an Experiment on Plant Growth
Puromycin                Reaction Velocity of an Enzymatic Reaction
Seatbelts                Road Casualties in Great Britain 1969-84
Theoph                   Pharmacokinetics of Theophylline
Titanic                  Survival of passengers on the Titanic
ToothGrowth              The Effect of Vitamin C on Tooth Growth in
                         Guinea Pigs
UCBAdmissions            Student Admissions at UC Berkeley
UKDriverDeaths           Road Casualties in Great Britain 1969-84
UKgas                    UK Quarterly Gas Consumption
USAccDeaths              Accidental Deaths in the US 1973-1978
USArrests                Violent Crime Rates by US State
USJudgeRatings           Lawyers' Ratings of State Judges in the US
                         Superior Court
USPersonalExpenditure    Personal Expenditure Data
VADeaths                 Death Rates in Virginia (1940)
WWWusage                 Internet Usage per Minute
WorldPhones              The World's Telephones
ability.cov              Ability and Intelligence Tests
airmiles                 Passenger Miles on Commercial US Airlines,
                         1937-1960
airquality               New York Air Quality Measurements
[...]
> library(zipcode)
> data(zipcode)
> str(zipcode)
'data.frame':   44336 obs. of 5 variables:
 $ zip      : chr "00210" "00211" "00212" "00213" ...
 $ city     : chr "Portsmouth" "Portsmouth" "Portsmouth" "Portsmouth" ...
 $ state    : chr "NH" "NH" "NH" "NH" ...
 $ latitude : num 43 43 43 43 43 ...
 $ longitude: num -71 -71 -71 -71 -71 ...
> subset(zipcode, city=='Boston' & state=='MA')
       zip  city state latitude longitude
664 02101 Boston    MA 42.37057 -71.02696
665 02102 Boston    MA 42.33895 -70.91963
666 02103 Boston    MA 42.33895 -70.91963
667 02104 Boston    MA 42.33895 -70.91963
668 02105 Boston    MA 42.33895 -70.91963
669 02106 Boston    MA 42.35432 -71.07345
670 02107 Boston    MA 42.33895 -70.91963
671 02108 Boston    MA 42.35790 -71.06408
672 02109 Boston    MA 42.36148 -71.05417
673 02110 Boston    MA 42.35653 -71.05365
674 02111 Boston    MA 42.34984 -71.06101
675 02112 Boston    MA 42.33895 -70.91963
676 02113 Boston    MA 42.36503 -71.05636
677 02114 Boston    MA 42.36179 -71.06774
678 02115 Boston    MA 42.34308 -71.09268
679 02116 Boston    MA 42.34962 -71.07372
680 02117 Boston    MA 42.33895 -70.91963
681 02118 Boston    MA 42.33872 -71.07276
682 02119 Boston    MA 42.32451 -71.08455
683 02120 Boston    MA 42.33210 -71.09651
684 02121 Boston    MA 42.30745 -71.08127
685 02122 Boston    MA 42.29630 -71.05454
686 02123 Boston    MA 42.33895 -70.91963
687 02124 Boston    MA 42.28713 -71.07156
688 02125 Boston    MA 42.31685 -71.05811
690 02127 Boston    MA 42.33499 -71.04562
691 02128 Boston    MA 42.37830 -71.02550
696 02133 Boston    MA 42.33895 -70.91963
726 02163 Boston    MA 42.36795 -71.12056
757 02196 Boston    MA 42.33895 -70.91963
[...]
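A lookup table like this earns its keep when it is joined to data you already have. The sketch below uses a hypothetical orders table (made-up data) and three rows copied from the output above; ZIP codes stay as character strings so the leading zeros survive:

```r
# Hypothetical orders keyed by ZIP code (made-up data)
orders <- data.frame(order_id = 1:3,
                     zip = c("02108", "02116", "99999"),
                     stringsAsFactors = FALSE)

# Three rows copied from the zipcode output above
zip_lookup <- data.frame(zip = c("02108", "02116", "02128"),
                         city = "Boston", state = "MA",
                         latitude = c(42.35790, 42.34962, 42.37830),
                         longitude = c(-71.06408, -71.07372, -71.02550),
                         stringsAsFactors = FALSE)

# all.x = TRUE keeps orders with unknown ZIPs; their location columns become NA
located <- merge(orders, zip_lookup, by = "zip", all.x = TRUE)
located
```

Checking for NA in the joined columns afterwards is a quick way to spot ZIP codes that the lookup table does not cover.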
image credit: http://njarb.com/2012/08/untangle-this-mess-of-wires/

Now let’s turn our attention to tapping into the internet for other data sources.
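Many base functions take URLs
   library(ggplot2)

   # Build the URL in pieces: a line break inside the quotes would become part of the string
   # (URL as given in the talk; the endpoint may no longer respond)
   url = paste0('http://ichart.finance.yahoo.com/table.csv?',
                's=YHOO&d=8&e=28&f=2012&g=d&a=3&b=12&c=1996&',
                'ignore=.csv')

   # read.csv() accepts a URL wherever it accepts a file name
   data = read.csv(url)

   # Daily closing price on a log scale
   ggplot(data) + geom_point(aes(x=as.Date(Date), y=Close), size = 1) +
     scale_y_log10() + theme_bw()

see R/06-read.csv-url-yahoo.R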
download.file() if URLs aren’t supported
   library(XLConnect)

   # Keep the URL on one line: a line break inside the quotes would become part of the string
   url = "http://www.fueleconomy.gov/feg/EPAGreenGuide/xls/all_alpha_12.xls"
   local.xls.file = 'data/all_alpha_12.xls'

   # mode='wb' keeps the binary .xls file intact on Windows
   download.file(url, local.xls.file, mode='wb')

   wb = loadWorkbook(local.xls.file, create=FALSE)
   data = readWorksheet(wb, sheet='all_alpha_12')

   View(data)

see R/07-download.file-XLConnect-green.R
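When a script is rerun often, it can be worth skipping the download if a local copy already exists. The sketch below shows one way to do that; fetch_once() is a hypothetical helper, not part of base R or of the talk's code, and the demonstration uses a local file:// URL with a Unix-style temporary path so that it runs without network access:

```r
# Hypothetical helper: download only when the local copy is missing
fetch_once <- function(url, dest) {
  if (file.exists(dest)) return(FALSE)                  # already cached, nothing to do
  dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
  download.file(url, dest, mode = "wb", quiet = TRUE)   # "wb" keeps binary files intact
  TRUE
}

# Demonstrate with a local file:// URL (assumes a Unix-style absolute path)
src <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2"), src)
dest <- file.path(tempdir(), "cache-demo", "copy.csv")
unlink(dest)

first  <- fetch_once(paste0("file://", src), dest)   # performs the download
second <- fetch_once(paste0("file://", src), dest)   # skipped, file already there
```

The trade-off is staleness: a cached file is never refreshed unless it is deleted, so this suits archival files better than data that changes daily.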
image credit: http://groovynoms.com/2011/07/25/beer-of-the-week-2/

Now, I don’t mean to oversell this next one, but if you’ve spent as much time as I have finding -- and trying to deal with --
interesting data sets on web pages, you might agree that this next function alone is worth the price of admission.
not even HTML tables are safe
     library(XML)
     url = 'http://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States'
     state.capitals.df = readHTMLTable(url, which=2)

          State         Abr.   Date of statehood   Capital       Capital since   Land area (mi²)   Most populous city?
      1   Alabama       AL     1819                Montgomery    1846            155.4             No
      2   Alaska        AK     1959                Juneau        1906            2716.7            No
      3   Arizona       AZ     1912                Phoenix       1889            474.9             Yes
      4   Arkansas      AR     1836                Little Rock   1821            116.2             Yes
      5   California    CA     1850                Sacramento    1854            97.2              No
      6   Colorado      CO     1876                Denver        1867            153.4             Yes
      7   Connecticut   CT     1788                Hartford      1875            17.3              No
      8   Delaware      DE     1787                Dover         1777            22.4              No
      9   Florida       FL     1845                Tallahassee   1824            95.7              No
     10   Georgia       GA     1788                Atlanta       1868            131.7             Yes

see R/08-readHTMLTable.R
As you’d expect from a package called “XML”, it parses well-formed XML files.

But I didn’t expect it would do such a good job with HTML.

And I certainly didn’t expect to find a function as handy as readHTMLTable()!
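readHTMLTable() also works on a document that has already been parsed, which makes it easy to try out on a small example. The sketch below parses a tiny hand-written page (illustrative content, not the Wikipedia page); the same approach may help when a page has to be fetched by other means first, for instance if it is only served over HTTPS:

```r
library(XML)

# Tiny hand-written page standing in for a real web page (illustrative content)
html <- '<html><body><table>
<tr><th>State</th><th>Capital</th></tr>
<tr><td>Alabama</td><td>Montgomery</td></tr>
<tr><td>Alaska</td><td>Juneau</td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)

# which = 1 returns the first table as a data frame rather than a list of tables
capitals <- readHTMLTable(doc, which = 1, header = TRUE, stringsAsFactors = FALSE)
capitals
```

Columns come back as character strings with stringsAsFactors = FALSE, so numeric columns such as land area still need an explicit as.numeric() before analysis.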
image credit: http://www.ebaypartnernetworkblog.com/en/files/2011/05/api1.gif
The DataMarket Is Open...
...and couldn’t be easier to access.

   library(rdatamarket)
   oil.prod = dmseries("http://data.is/nyFeP9")
   plot(oil.prod)

see R/09-rdatamarket.R
DataMarket includes its own URL shortener, like bit.ly but just for their data.

Long or short, just give dmseries() the URL, and it will download the data set for you.
Make a withdrawal from the World Bank
   > library(WDI)
   > WDIsearch('population, total')
             indicator                name
         "SP.POP.TOTL" "Population, total"

   > WDIsearch('fertility .*total')
                                    indicator                                       name
                             "SP.DYN.TFRT.IN" "Fertility rate, total (births per woman)"

   > WDIsearch('life expectancy .*birth.*total')
                                   indicator                                      name
                            "SP.DYN.LE00.IN" "Life expectancy at birth, total (years)"

   > WDIsearch('GDP per capita .*constant')
        indicator        name
   [1,] "NY.GDP.PCAP.KD" "GDP per capita (constant 2000 US$)"
   [2,] "NY.GDP.PCAP.KN" "GDP per capita (constant LCU)"

see R/10-WDI.R
Swedish Accent Not Included
   library(WDI)
   library(googleVis)

   # Fetch four indicators for seven countries
   data = WDI(country=c('BR', 'CN', 'GB', 'JP', 'IN', 'SE', 'US'),
              indicator=c('SP.DYN.TFRT.IN', 'SP.DYN.LE00.IN', 'SP.POP.TOTL',
                          'NY.GDP.PCAP.KD'),
              start=1900, end=2010)

   # Animated motion chart keyed by country and year
   g = gvisMotionChart(data, idvar='country', timevar='year')
   plot(g)

see R/10-WDI.R
quantmod: the king of symbols
• getSymbols() downloads time series data from the source specified by the “src” parameter:
  – yahoo = Yahoo! Finance
  – google = Google Finance
  – FRED = St. Louis Fed’s Federal Reserve Economic Data
  – oanda = OANDA Forex Trading & Exchange Rates
  – csv
  – MySQL
  – RData
Hello, FRED
55,000 economic time series from 45 sources:
• Automatic Data Processing, Inc.
• Banca d'Italia
• Banco de Mexico
• Bank of Japan
• Bankrate, Inc.
• Board of Governors of the Federal Reserve System
• BofA Merrill Lynch
• British Bankers' Association
• Central Bank of the Republic of Turkey
• Chicago Board Options Exchange
• CredAbility Nonprofit Credit Counseling & Education
• Deutsche Bundesbank
• Dow Jones & Company
• Eurostat
• Federal Financial Institutions Examination Council
• Federal Housing Finance Agency
• Federal Reserve Bank of Chicago
• Federal Reserve Bank of Kansas City
• Federal Reserve Bank of Philadelphia
• Federal Reserve Bank of St. Louis
• Freddie Mac
• Haver Analytics
• Institute for Supply Management
• International Monetary Fund
• London Bullion Market Association
• National Association of Realtors
• National Bureau of Economic Research
• Organisation for Economic Co-operation and Development
• Reserve Bank of Australia
• Standard and Poor's
• Swiss National Bank
• The White House: Council of Economic Advisors
• The White House: Office of Management and Budget
• Thomson Reuters/University of Michigan
• U.S. Congress: Congressional Budget Office
• U.S. Department of Commerce: Bureau of Economic Analysis
• U.S. Department of Commerce: Census Bureau
• U.S. Department of Energy: Energy Information Administration
• U.S. Department of Housing and Urban Development
• U.S. Department of Labor: Bureau of Labor Statistics
• U.S. Department of Labor: Employment and Training Administration
• U.S. Department of the Treasury: Financial Management Service
• U.S. Department of Transportation: Federal Highway Administration
• Wilshire Associates Incorporated
• World Bank
BLS Jobless data (FRED) + S&P (Yahoo!)
   library(quantmod)

   # auto.assign=FALSE returns the series instead of creating a variable named after the symbol
   initial.claims = getSymbols('ICSA', src='FRED', auto.assign=FALSE)

   sp500 = getSymbols('^GSPC', src='yahoo', auto.assign=FALSE)

   # Convert quotes to weekly and fetch Cl() closing price
   sp500.weekly = Cl(to.weekly(sp500))

see R/11-quantmod.R
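One plausible next step is to line the two series up by date. The sketch below illustrates the weekly conversion and the merge on synthetic data (all values are made up, and the second series is simply given the same weekly dates), so it runs without downloading anything:

```r
library(quantmod)

# Synthetic daily series standing in for the downloaded quotes (made-up values)
days  <- seq(as.Date("2012-01-02"), by = "day", length.out = 14)
daily <- xts(100 + 0:13, order.by = days)

# to.weekly() collapses to one open/high/low/close row per week; Cl() extracts the close
weekly_close <- Cl(to.weekly(daily))

# A second made-up series on the same weekly dates
claims <- xts(seq(400, by = -5, length.out = nrow(weekly_close)),
              order.by = index(weekly_close))

# merge() aligns the two series on their shared dates
both <- merge(weekly_close, claims)
both
```

With real downloads the weekly dates of the two sources may not coincide, in which case the merged result will contain NA values that need to be handled before plotting or modelling.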
Resources
• Expanded code snippets and all data for this talk
   – http://bit.ly/pawdata
• R Data Import/Export manual
   – http://cran.r-project.org/doc/manuals/R-data.html
• CRAN: Comprehensive R Archive Network
   – package lists: http://cran.r-project.org/web/packages/
   – Featured: XLConnect, foreign, RMySQL, XML, quantmod, rdatamarket, WDI
   – Database: RODBC, DBI, RJDBC, ROracle, RPostgreSQL, RSQLite, RMongo, RCassandra
   – Data sets: zipcode, agridat, GANPAdata
   – Data access: crn, rgbif, RISmed, govdat, myepisodes, msProstate, corpora
• rhbase from the RHadoop project
   – https://github.com/RevolutionAnalytics/RHadoop
When I first said that R is my “Swiss Army Knife” for data, you might have pictured this:
but now you know I was really thinking this:
Thank you!

More Related Content

What's hot

Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pigdaijy
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsAsad Masood Qazi
 
Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1Vemula Ravi
 
Hadoop online training
Hadoop online training Hadoop online training
Hadoop online training Keylabs
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answersKalyan Hadoop
 
Hadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsHadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsChien Chung Shen
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
field_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentahofield_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentahoMartin Ferguson
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data ScientistsDataWorks Summit
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easyVictor Sanchez Anguix
 
Hadoop Interview Question and Answers
Hadoop  Interview Question and AnswersHadoop  Interview Question and Answers
Hadoop Interview Question and Answerstechieguy85
 
DMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話します
DMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話しますDMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話します
DMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話しますWataru Shinohara
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapakapa rohit
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programPraveen Kumar Donta
 
First Step for Big Data with Apache Hadoop
First Step for Big Data with Apache HadoopFirst Step for Big Data with Apache Hadoop
First Step for Big Data with Apache HadoopBorn2Learn Co., Ltd
 

What's hot (20)

Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pig
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1
 
Hadoop online training
Hadoop online training Hadoop online training
Hadoop online training
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
Hadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsHadoop Essential for Oracle Professionals
Hadoop Essential for Oracle Professionals
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
field_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentahofield_guide_to_hadoop_pentaho
field_guide_to_hadoop_pentaho
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
 
Hadoop Interview Question and Answers
Hadoop  Interview Question and AnswersHadoop  Interview Question and Answers
Hadoop Interview Question and Answers
 
DMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話します
DMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話しますDMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話します
DMM.com ラボはなぜSparkを採用したのか? レコメンドエンジン開発の裏側をお話します
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
Introduction to Mongodb
Introduction to MongodbIntroduction to Mongodb
Introduction to Mongodb
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce program
 
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DMUpgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2
 
First Step for Big Data with Apache Hadoop
First Step for Big Data with Apache HadoopFirst Step for Big Data with Apache Hadoop
First Step for Big Data with Apache Hadoop
 

Similar to Tapping the Data Deluge with R

CONFidence 2014: Davi Ottenheimer Protecting big data at scale
CONFidence 2014: Davi Ottenheimer Protecting big data at scaleCONFidence 2014: Davi Ottenheimer Protecting big data at scale
CONFidence 2014: Davi Ottenheimer Protecting big data at scalePROIDEA
 
Dedupe, Merge and Purge: the art of normalization
Dedupe, Merge and Purge: the art of normalizationDedupe, Merge and Purge: the art of normalization
Dedupe, Merge and Purge: the art of normalizationTyler Bell
 
6 things to expect when you are visualizing
6 things to expect when you are visualizing6 things to expect when you are visualizing
6 things to expect when you are visualizingKrist Wongsuphasawat
 
What to expect when you are visualizing (v.2)
What to expect when you are visualizing (v.2)What to expect when you are visualizing (v.2)
What to expect when you are visualizing (v.2)Krist Wongsuphasawat
 
RDA: Are We There Yet? Carterette Webinar S
RDA: Are We There Yet? Carterette Webinar SRDA: Are We There Yet? Carterette Webinar S
RDA: Are We There Yet? Carterette Webinar SEmily Nimsakont
 
What I tell myself before visualizing
What I tell myself before visualizingWhat I tell myself before visualizing
What I tell myself before visualizingKrist Wongsuphasawat
 
The world is y0ur$: Geolocation-based wordlist generation with wordsmith
The world is y0ur$: Geolocation-based wordlist generation with wordsmithThe world is y0ur$: Geolocation-based wordlist generation with wordsmith
The world is y0ur$: Geolocation-based wordlist generation with wordsmithSanjiv Kawa
 
Tapping the Data Deluge with R

  • 1. Tapping the Data Deluge with R: Finding and using supplemental data to add context to your analysis. By Jeffrey Breen, Principal, Think Big Academy. Code & Data on github: http://bit.ly/pawdata email: jeffrey.breen@thinkbiganalytics.com blog: http://jeffreybreen.wordpress.com Twitter: @JeffreyBreen
  • 2. Data data everywhere! This may be what you picture the data deluge to look like if you work for the Economist. But those of us who wrangle data for a living know that it’s usually not so prosaic or buttoned-down, proper or quaint.
  • 3. Real data hits us in the face... Real data can hit you in the face. Yet we keep coming back for more.
  • 4. ...and then there’s Big Data. And I’m not even going to talk about Big Data tonight. (For a change!)
  • 5. Finding the right data makes all the difference. Tonight we’re going to look at a few different places to find those data sets which can make a difference, and a few techniques to access them so you can incorporate them into your analysis.
  • 6. The two types of data: data you have, and data you don’t have... yet. Perhaps you’ve heard the joke: There are two kinds of people: People who think there are two kinds of people and people who don’t. I like to think that there are two kinds of data.
  • 7. The two types of data
      • Data you have
        – CSV files, spreadsheets
        – files from other statistics packages (SPSS, SAS, Stata,...)
        – databases, data warehouses (SQL, NoSQL, HBase,...)
        – whatever your boss emailed you on his way to lunch
        – datasets within R and R packages
      • Data you don’t have... yet
        – file downloads & web scraping
        – data marketplaces and other APIs
  • 8. Reading CSV files is easy
      $ head -5 data/mpg-3-13-2012.csv | cut -c 1-60
      "Model Yr","Mfr Name","Division","Carline","Verify Mfr Cd","
      2012,"aston martin","Aston Martin Lagonda Ltd","V12 Vantage"
      2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
      2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
      2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",

      data = read.csv('data/mpg-3-13-2012.csv')
      View(data)
    see R/01-read.csv-mpg.R
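One detail that tends to surprise people with files like this: by default read.csv() rewrites headers such as "Model Yr" into syntactically valid names. The following is a minimal sketch using a two-row inline sample modeled on the first four columns shown above (the sample itself is an assumption for illustration, not the real file):

```r
# Inline sample shaped like the first columns of the mpg file (illustrative only)
csv.text = '"Model Yr","Mfr Name","Division","Carline"
2012,"aston martin","Aston Martin Lagonda Ltd","V12 Vantage"
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage"'

# Default: headers pass through make.names(), so spaces become dots
data = read.csv(text = csv.text, stringsAsFactors = FALSE)
names(data)

# check.names = FALSE keeps the headers exactly as written in the file
raw = read.csv(text = csv.text, stringsAsFactors = FALSE, check.names = FALSE)
names(raw)
raw[["Model Yr"]]
```

Keeping the original headers is handy for display, but the dotted names are easier to type after a `$`, so which to prefer depends on what you do next.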
  • 9. But so is reading Excel files directly
      library(XLConnect)
      wb = loadWorkbook("data/mpg.xlsx", create=FALSE)
      data = readWorksheet(wb, sheet='3-7-2012')
    see R/02-XLConnect-mpg.R
  • 10. “foreign” file formats
      library(foreign)

      sav.file = file.path(system.file(package='foreign'), 'tests', 'sample100.sav')
      spss.data = read.spss(sav.file)

      xpt.file = file.path(system.file(package='foreign'), 'tests', 'test.xpt')
      sas.data = read.xport(xpt.file)

      dta.file = file.path(system.file(package='foreign'), 'tests', 'auto8.dta')
      stata.data = read.dta(dta.file)

      dbf.file = file.path(system.file(package='foreign'), 'files', 'sids.dbf')
      dbf.data = read.dbf(dbf.file)
    see R/03-foreign.R
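If the sample files referenced above are not present in your installation, a round trip is one way to try these readers out. This sketch writes a built-in data set to a temporary Stata file with write.dta() and reads it back; the temporary file simply stands in for a .dta file someone might send you:

```r
library(foreign)

# Write a built-in data set out in Stata format, then read it back.
# The temporary file is a stand-in for a real .dta file.
dta.file = tempfile(fileext = '.dta')
write.dta(mtcars, dta.file)

stata.data = read.dta(dta.file)
str(stata.data[, 1:3])
```

Note that read.spss() returns a list unless you pass to.data.frame=TRUE, whereas read.dta() and read.dbf() return data frames directly.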
  • 11. Relational databases
      library(RMySQL)
      con = dbConnect(MySQL(), user="root", dbname="test")
      data = dbGetQuery(con, "select * from airport")
      dbDisconnect(con)
      View(data)

        airport_code airport_name                        location                    state_code country_name time_zone_code
      1 ATL          WILLIAM B. HARTSFIELD               ATLANTA,GEORGIA             GA         USA          EST
      2 BOS          LOGAN INTERNATIONAL                 BOSTON,MASSACHUSETTS        MA         USA          EST
      3 BWI          BALTIMORE/WASHINGTON INTERNATIONAL  BALTIMORE,MARYLAND          MD         USA          EST
      4 DEN          STAPLETON INTERNATIONAL             DENVER,COLORADO             CO         USA          MST
      5 DFW          DALLAS/FORT WORTH INTERNATIONAL     DALLAS/FT. WORTH,TEXAS      TX         USA          CST
      6 OAK          METROPOLITAN OAKLAND INTERNATIONAL  OAKLAND,CALIFORNIA          CA         USA          PST
      7 PHL          PHILADELPHIA INTERNATIONAL          PHILADELPHIA PA/WILM'TON,DE PA         USA          EST
      8 PIT          GREATER PITTSBURGH                  PITTSBURGH,PENNSYLVANIA     PA         USA          EST
      9 SFO          SAN FRANCISCO INTERNATIONAL         SAN FRANCISCO,CALIFORNIA    CA         USA          PST
    see R/04-RMySQL-airport.R
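The same DBI calls work against other back ends. As a sketch you can run without a MySQL server, this version swaps in an in-memory SQLite database through the RSQLite package (that substitution, and the three-row sample copied from the table above, are assumptions for illustration):

```r
library(DBI)

# In-memory SQLite database standing in for the MySQL server (an assumption
# made so the sketch runs without any server setup)
con = dbConnect(RSQLite::SQLite(), ':memory:')

# Three rows copied from the airport table above
airport = data.frame(
  airport_code = c('BOS', 'PHL', 'PIT'),
  airport_name = c('LOGAN INTERNATIONAL', 'PHILADELPHIA INTERNATIONAL',
                   'GREATER PITTSBURGH'),
  state_code   = c('MA', 'PA', 'PA'),
  stringsAsFactors = FALSE
)
dbWriteTable(con, 'airport', airport)

# Let the database do the filtering; the ? placeholder is bound from params
pa = dbGetQuery(con,
                'select airport_code, airport_name from airport where state_code = ?',
                params = list('PA'))
dbDisconnect(con)
pa
```

Pushing the filter into the query rather than pulling the whole table with "select *" matters little for nine airports, but it can matter a great deal for a real warehouse table.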
  • 12. Non-relational databases too
      > library(rhbase)
      > hb.init(serialize='raw')
      > x = hb.get(tablename='tweets', rows='221325531868692480')
      > str(x)
      List of 1
       $ :List of 3
        ..$ : chr "221325531868692480"
        ..$ : chr [1:10] "created:" "favorited:" "id:" "replyToSID:" ...
        ..$ :List of 10
        .. ..$ : chr "2012-07-06 19:31:33"
        .. ..$ : chr "FALSE"
        .. ..$ : chr "221325531868692480"
        .. ..$ : chr "NA"
        .. ..$ : chr "NA"
        .. ..$ : chr "NA"
        .. ..$ : chr "arnicas"
        .. ..$ : chr "<a href="http://www.tweetdeck.com" rel="nofollow">TweetDeck</a>"
        .. ..$ : chr "RT @bycoffe: From @DrewLinzer, an #Rstats function for querying the HuffPost Pollster API. http://t.co/fXnG32JX cc @thewhyaxis"
        .. ..$ : chr "FALSE"
  • 13. weird emails from the boss
      con = textConnection('
      # Hi:
      #
      # Please invite these paid volunteers to the spontaneous rally at 3PM today:
      #
      Name Department "Hourly Rate" email
      Alice Operations 32 alice@wonderland.org
      Billy Logistics 5 billy.pilgrim@slaugterhouse5.com
      Winston Records 20 winston.smith@truth.gov.oc
      #
      #Thanks,
      #Your Boss
      #
      ')
      data = read.table(con, header=TRUE, comment.char='#')
      close.connection(con)
      View(data)

           Name Department Hourly.Rate                            email
      1   Alice Operations          32             alice@wonderland.org
      2   Billy  Logistics           5 billy.pilgrim@slaugterhouse5.com
      3 Winston    Records          20       winston.smith@truth.gov.oc
    see R/05-textConnection-email.R
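In current versions of R, read.table() also accepts the text directly through its text= argument, which saves opening and closing the connection yourself. A minimal sketch with a shortened version of the same email (the shortened text is an assumption for illustration):

```r
# Shortened version of the boss's email; lines starting with # are ignored
email.text = '
# Please invite these paid volunteers:
#
Name Department "Hourly Rate" email
Alice Operations 32 alice@wonderland.org
Billy Logistics 5 billy.pilgrim@slaugterhouse5.com
#
#Thanks,
'

# text= replaces the explicit textConnection(); comment.char drops the prose
data = read.table(text = email.text, header = TRUE, comment.char = '#',
                  stringsAsFactors = FALSE)
data
mean(data$Hourly.Rate)
```

The quoted header "Hourly Rate" survives as a single column and is renamed Hourly.Rate, just as in the textConnection version.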
  • 14. > data()
      Data sets in package ‘datasets’:

      AirPassengers           Monthly Airline Passenger Numbers 1949-1960
      BJsales                 Sales Data with Leading Indicator
      BJsales.lead (BJsales)  Sales Data with Leading Indicator
      BOD                     Biochemical Oxygen Demand
      CO2                     Carbon Dioxide Uptake in Grass Plants
      ChickWeight             Weight versus age of chicks on different diets
      DNase                   Elisa assay of DNase
      EuStockMarkets          Daily Closing Prices of Major European Stock Indices, 1991-1998
      Formaldehyde            Determination of Formaldehyde
      HairEyeColor            Hair and Eye Color of Statistics Students
      Harman23.cor            Harman Example 2.3
      Harman74.cor            Harman Example 7.4
      Indometh                Pharmacokinetics of Indomethacin
      InsectSprays            Effectiveness of Insect Sprays
      JohnsonJohnson          Quarterly Earnings per Johnson & Johnson Share
      LakeHuron               Level of Lake Huron 1875-1972
      LifeCycleSavings        Intercountry Life-Cycle Savings Data
      Loblolly                Growth of Loblolly pine trees
      Nile                    Flow of the River Nile
      Orange                  Growth of Orange Trees
      OrchardSprays           Potency of Orchard Sprays
      PlantGrowth             Results from an Experiment on Plant Growth
      Puromycin               Reaction Velocity of an Enzymatic Reaction
      Seatbelts               Road Casualties in Great Britain 1969-84
      Theoph                  Pharmacokinetics of Theophylline
      Titanic                 Survival of passengers on the Titanic
      ToothGrowth             The Effect of Vitamin C on Tooth Growth in Guinea Pigs
      UCBAdmissions           Student Admissions at UC Berkeley
      UKDriverDeaths          Road Casualties in Great Britain 1969-84
      UKgas                   UK Quarterly Gas Consumption
      USAccDeaths             Accidental Deaths in the US 1973-1978
      USArrests               Violent Crime Rates by US State
      USJudgeRatings          Lawyers' Ratings of State Judges in the US Superior Court
      USPersonalExpenditure   Personal Expenditure Data
      VADeaths                Death Rates in Virginia (1940)
      WWWusage                Internet Usage per Minute
      WorldPhones             The World's Telephones
      ability.cov             Ability and Intelligence Tests
      airmiles                Passenger Miles on Commercial US Airlines, 1937-1960
      airquality              New York Air Quality Measurements
      [...]
  • 15. > library(zipcode)
      > data(zipcode)
      > str(zipcode)
      'data.frame': 44336 obs. of 5 variables:
       $ zip      : chr "00210" "00211" "00212" "00213" ...
       $ city     : chr "Portsmouth" "Portsmouth" "Portsmouth" "Portsmouth" ...
       $ state    : chr "NH" "NH" "NH" "NH" ...
       $ latitude : num 43 43 43 43 43 ...
       $ longitude: num -71 -71 -71 -71 -71 ...
      > subset(zipcode, city=='Boston' & state=='MA')
            zip   city state latitude longitude
      664 02101 Boston    MA 42.37057 -71.02696
      665 02102 Boston    MA 42.33895 -70.91963
      666 02103 Boston    MA 42.33895 -70.91963
      667 02104 Boston    MA 42.33895 -70.91963
      668 02105 Boston    MA 42.33895 -70.91963
      669 02106 Boston    MA 42.35432 -71.07345
      670 02107 Boston    MA 42.33895 -70.91963
      671 02108 Boston    MA 42.35790 -71.06408
      672 02109 Boston    MA 42.36148 -71.05417
      673 02110 Boston    MA 42.35653 -71.05365
      674 02111 Boston    MA 42.34984 -71.06101
      675 02112 Boston    MA 42.33895 -70.91963
      676 02113 Boston    MA 42.36503 -71.05636
      677 02114 Boston    MA 42.36179 -71.06774
      678 02115 Boston    MA 42.34308 -71.09268
      679 02116 Boston    MA 42.34962 -71.07372
      680 02117 Boston    MA 42.33895 -70.91963
      681 02118 Boston    MA 42.33872 -71.07276
      682 02119 Boston    MA 42.32451 -71.08455
      683 02120 Boston    MA 42.33210 -71.09651
      684 02121 Boston    MA 42.30745 -71.08127
      685 02122 Boston    MA 42.29630 -71.05454
      686 02123 Boston    MA 42.33895 -70.91963
      687 02124 Boston    MA 42.28713 -71.07156
      688 02125 Boston    MA 42.31685 -71.05811
      690 02127 Boston    MA 42.33499 -71.04562
      691 02128 Boston    MA 42.37830 -71.02550
      696 02133 Boston    MA 42.33895 -70.91963
      726 02163 Boston    MA 42.36795 -71.12056
      757 02196 Boston    MA 42.33895 -70.91963
      [...]
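A lookup table like this earns its keep when you join it to your own records. Here is a small sketch of that step using base R's merge(); the customer records are hypothetical, and the three lookup rows are copied from the output above to stand in for the full zipcode data frame:

```r
# Hypothetical customer records; zip codes are stored as character so the
# leading zero survives
customers = data.frame(
  customer = c('A', 'B', 'C'),
  zip      = c('02108', '02116', '02128'),
  stringsAsFactors = FALSE
)

# Three rows copied from the zipcode output above, standing in for the
# full data(zipcode) data frame
zip.lookup = data.frame(
  zip       = c('02108', '02116', '02128'),
  city      = 'Boston',
  state     = 'MA',
  latitude  = c(42.35790, 42.34962, 42.37830),
  longitude = c(-71.06408, -71.07372, -71.02550),
  stringsAsFactors = FALSE
)

# Left join: keep every customer, add coordinates where the zip matches
located = merge(customers, zip.lookup, by = 'zip', all.x = TRUE)
located
```

If your own zip codes were read in as numbers, "02108" becomes 2108 and the join silently finds nothing, so it is worth checking the column type before merging.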
  • 16. image credit: http://njarb.com/2012/08/untangle-this-mess-of-wires/ Now let’s turn our attention to tapping into the internet for other data sources.
  • 17. The two types of data
      • Data you have
        – CSV files, spreadsheets
        – files from other statistics packages (SPSS, SAS, Stata,...)
        – databases, data warehouses (SQL, NoSQL, HBase,...)
        – whatever your boss emailed you on his way to lunch
        – datasets within R and R packages
      • Data you don’t have... yet
        – file downloads & web scraping
        – data marketplaces and other APIs
  • 20. Many base functions take URLs
      library(ggplot2)
      url = 'http://ichart.finance.yahoo.com/table.csv?s=YHOO&d=8&e=28&f=2012&g=d&a=3&b=12&c=1996&ignore=.csv'
      data = read.csv(url)
      ggplot(data) + geom_point(aes(x=as.Date(Date), y=Close), size = 1) +
        scale_y_log10() + theme_bw()
    see R/06-read.csv-url-yahoo.R
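URLs go stale, so it can help to wrap the read so that a dead link gives you NULL rather than stopping your script. This is only a sketch: safe.read.csv is a hypothetical helper name, the numbers are made up, and a file:// URL (Unix-style path assumed) is used so it runs offline:

```r
# Wrap read.csv() so an unreachable URL yields NULL instead of an error.
# safe.read.csv is a hypothetical helper name, not part of base R.
safe.read.csv = function(url, ...) {
  tryCatch(
    suppressWarnings(read.csv(url, ...)),
    error = function(e) {
      message('could not read ', url, ': ', conditionMessage(e))
      NULL
    }
  )
}

# Made-up numbers written to a temporary file, then read back through a URL
tmp = tempfile(fileext = '.csv')
write.csv(data.frame(Date = c('2012-09-27', '2012-09-28'), Close = c(1, 2)),
          tmp, row.names = FALSE)

ok  = safe.read.csv(paste0('file://', tmp))
bad = safe.read.csv(paste0('file://', tmp, '.missing'))
```

Checking is.null() on the result lets the rest of the analysis decide whether to fall back to a cached copy or stop.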
  • 22. download.file() if URLs aren’t supported
      library(XLConnect)
      url = "http://www.fueleconomy.gov/feg/EPAGreenGuide/xls/all_alpha_12.xls"
      local.xls.file = 'data/all_alpha_12.xls'
      download.file(url, local.xls.file)
      wb = loadWorkbook(local.xls.file, create=FALSE)
      data = readWorksheet(wb, sheet='all_alpha_12')
      View(data)
    see R/07-download.file-XLConnect-green.R
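Two small refinements are worth considering here: skip the download when you already have the file, and pass mode='wb' so binary formats such as .xls are not corrupted on Windows. A sketch under those assumptions, where fetch.once is a hypothetical helper name and a file:// URL (Unix-style path assumed) stands in for the real one so the example runs offline:

```r
# Download a file only if there is no local copy yet.
# fetch.once is a hypothetical helper name used for this sketch.
fetch.once = function(url, dest) {
  if (!file.exists(dest)) {
    dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
    # mode = 'wb' keeps binary formats such as .xls intact on Windows
    download.file(url, dest, mode = 'wb', quiet = TRUE)
  }
  dest
}

# A ten-byte dummy file stands in for the spreadsheet on the remote server
src = tempfile(fileext = '.xls')
writeBin(as.raw(1:10), src)

dest = file.path(tempdir(), 'cache-demo', 'copy.xls')
local.file = fetch.once(paste0('file://', src), dest)
file.size(local.file)
```

Calling fetch.once() a second time returns immediately, which is kinder to the data provider and to you when rerunning a script.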
  • 23. image credit: http://groovynoms.com/2011/07/25/beer-of-the-week-2/ Now, I don’t mean to oversell this next one, but if you’ve spent as much time as I have finding -- and trying to deal with -- interesting data sets on web pages, you might agree that this next function alone is worth the price of admission.
  • 24. not even HTML tables are safe
      library(XML)
      url = 'http://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States'
      state.capitals.df = readHTMLTable(url, which=2)

               State Abr. Date of statehood     Capital Capital since Land area (mi²) Most populous city?
      1      Alabama   AL              1819  Montgomery          1846           155.4                  No
      2       Alaska   AK              1959      Juneau          1906          2716.7                  No
      3      Arizona   AZ              1912     Phoenix          1889           474.9                 Yes
      4     Arkansas   AR              1836 Little Rock          1821           116.2                 Yes
      5   California   CA              1850  Sacramento          1854            97.2                  No
      6     Colorado   CO              1876      Denver          1867           153.4                 Yes
      7  Connecticut   CT              1788    Hartford          1875            17.3                  No
      8     Delaware   DE              1787       Dover          1777            22.4                  No
      9      Florida   FL              1845 Tallahassee          1824            95.7                  No
      10     Georgia   GA              1788     Atlanta          1868           131.7                 Yes
    see R/08-readHTMLTable.R
    As you’d expect from a package called “XML”, it parses well-formed XML files. But I didn’t expect it would do such a good job with HTML. And I certainly didn’t expect to find a function as handy as readHTMLTable()!
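Scraped tables usually arrive as text (or factors), so a little cleanup is needed before you can compute with them. A minimal sketch of that step in base R, using three rows copied from the table above; the shortened column names Land.area and Largest are assumptions for illustration:

```r
# Columns as they might arrive from a scraped table: everything is text.
# Values are copied from the first rows of the table above.
scraped = data.frame(
  State     = c('Alabama', 'Alaska', 'Arizona'),
  Land.area = c('155.4', '2716.7', '474.9'),
  Largest   = c('No', 'No', 'Yes'),
  stringsAsFactors = FALSE
)

# Strip thousands separators (if any) before converting to numeric;
# gsub() returns character, so this also works when the column is a factor
scraped$Land.area = as.numeric(gsub(',', '', scraped$Land.area))

# Turn Yes/No into a logical
scraped$Largest = scraped$Largest == 'Yes'

str(scraped)
```

Calling as.numeric() directly on a factor returns its internal codes rather than the printed values, which is why going through character first is the safer habit.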
  • 26. The DataMarket Is Open...
  • 27. ...and couldn’t be easier to access.
      library(rdatamarket)
      oil.prod = dmseries("http://data.is/nyFeP9")
      plot(oil.prod)
    see R/09-rdatamarket.R
    DataMarket includes its own URL shortener, like bit.ly but just for their data. Long or short, just give dmseries() the URL, and it will download the data set for you.
  • 28. Make a withdrawal from the World Bank
      > library(WDI)
      > WDIsearch('population, total')
          indicator            name
      "SP.POP.TOTL" "Population, total"
      > WDIsearch('fertility .*total')
             indicator                                       name
      "SP.DYN.TFRT.IN" "Fertility rate, total (births per woman)"
      > WDIsearch('life expectancy .*birth.*total')
             indicator                                      name
      "SP.DYN.LE00.IN" "Life expectancy at birth, total (years)"
      > WDIsearch('GDP per capita .*constant')
           indicator        name
      [1,] "NY.GDP.PCAP.KD" "GDP per capita (constant 2000 US$)"
      [2,] "NY.GDP.PCAP.KN" "GDP per capita (constant LCU)"
    see R/10-WDI.R
  • 29. Swedish Accent Not Included
      data = WDI(country=c('BR', 'CN', 'GB', 'JP', 'IN', 'SE', 'US'),
                 indicator=c('SP.DYN.TFRT.IN', 'SP.DYN.LE00.IN', 'SP.POP.TOTL',
                             'NY.GDP.PCAP.KD'),
                 start=1900, end=2010)
      library(googleVis)
      g = gvisMotionChart(data, idvar='country', timevar='year')
      plot(g)
    see R/10-WDI.R
  • 30. quantmod: the king of symbols
      • getSymbols() downloads time series data from the source specified by the “src” parameter:
        – yahoo = Yahoo! Finance
        – google = Google Finance
        – FRED = St. Louis Fed’s Federal Reserve Economic Data
        – oanda = OANDA Forex Trading & Exchange Rates
        – csv
        – MySQL
        – RData
  • 31. Hello, FRED
      55,000 economic time series from 45 sources:
      • Automatic Data Processing, Inc.
      • Banca d'Italia
      • Banco de Mexico
      • Bank of Japan
      • Bankrate, Inc.
      • Board of Governors of the Federal Reserve System
      • BofA Merrill Lynch
      • British Bankers' Association
      • Central Bank of the Republic of Turkey
      • Chicago Board Options Exchange
      • CredAbility Nonprofit Credit Counseling & Education
      • Deutsche Bundesbank
      • Dow Jones & Company
      • Eurostat
      • Federal Financial Institutions Examination Council
      • Federal Housing Finance Agency
      • Federal Reserve Bank of Chicago
      • Federal Reserve Bank of Kansas City
      • Federal Reserve Bank of Philadelphia
      • Federal Reserve Bank of St. Louis
      • Freddie Mac
      • Haver Analytics
      • Institute for Supply Management
      • International Monetary Fund
      • London Bullion Market Association
      • National Association of Realtors
      • National Bureau of Economic Research
      • Organisation for Economic Co-operation and Development
      • Reserve Bank of Australia
      • Standard and Poor's
      • Swiss National Bank
      • The White House: Council of Economic Advisors
      • The White House: Office of Management and Budget
      • Thomson Reuters/University of Michigan
      • U.S. Congress: Congressional Budget Office
      • U.S. Department of Commerce: Bureau of Economic Analysis
      • U.S. Department of Commerce: Census Bureau
      • U.S. Department of Energy: Energy Information Administration
      • U.S. Department of Housing and Urban Development
      • U.S. Department of Labor: Bureau of Labor Statistics
      • U.S. Department of Labor: Employment and Training Administration
      • U.S. Department of the Treasury: Financial Management Service
      • U.S. Department of Transportation: Federal Highway Administration
      • Wilshire Associates Incorporated
      • World Bank
  • 32. BLS Jobless data (FRED) + S&P (Yahoo!)
      library(quantmod)
      initial.claims = getSymbols('ICSA', src='FRED', auto.assign=FALSE)
      sp500 = getSymbols('^GSPC', src='yahoo', auto.assign=FALSE)
      # Convert quotes to weekly and fetch Cl() closing price
      sp500.weekly = Cl(to.weekly(sp500))
    see R/11-quantmod.R
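To see what "convert to weekly and take the close" amounts to, here is a base R illustration on made-up daily prices. It is a conceptual sketch only, not how quantmod or xts implement to.weekly(), and the prices are synthetic:

```r
# Two weeks of made-up daily closing prices, starting on a Monday
dates = seq(as.Date('2012-01-02'), by = 'day', length.out = 14)
close = 101:114

# Assign each day to the week it falls in (weeks start on Monday),
# then keep the last observation of each week
week = cut(dates, 'week')
weekly.close = tapply(close, week, function(v) v[length(v)])
weekly.close
```

Putting the daily S&P 500 series on the same weekly footing as the weekly initial claims series is what makes the two comparable in a single plot or model.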
  • 33. Resources
      • Expanded code snippets and all data for this talk
        – http://bit.ly/pawdata
      • R Data Import/Export manual
        – http://cran.r-project.org/doc/manuals/R-data.html
      • CRAN: Comprehensive R Archive Network
        – package lists: http://cran.r-project.org/web/packages/
        – Featured: XLConnect, foreign, RMySQL, XML, quantmod, rdatamarket, WDI
        – Database: RODBC, DBI, RJDBC, ROracle, RPostgreSQL, RSQLite, RMongo, RCassandra
        – Data sets: zipcode, agridat, GANPAdata
        – Data access: crn, rgbif, RISmed, govdat, myepisodes, msProstate, corpora
      • rhbase from the RHadoop project
        – https://github.com/RevolutionAnalytics/RHadoop
  • 34. When I first said that R is my “Swiss Army Knife” for data, you might have pictured this:
  • 35. but now you know I was really thinking this:
  • 36. Thank you!