SlideShare a Scribd company logo
1 of 19
Download to read offline
Streaming Data And
 Concurrency In R



      Rory Winston

    rory@theresearchkitchen.com
About Me




   Independent Software Consultant
   M.Sc. Applied Computing, 2000
   M.Sc. Finance, 2008
   Apache Committer
   Interested in practical applications of functional languages and
   machine learning
   Really interested in seeing R usage grow in finance
1   A Short Rant


2   Why We Need Concurrency


3   Motivating Example


4   Conclusion


5   References and Further Reading
A Short Rant



   Parallelization vs. Concurrency in R


           Multithreading vs. parallelization
           i.e. fork() vs. pthread_create()
           R interpreter is single threaded
           Some historical context for this (e.g. non-threadsafe BLAS
           implementations)
           Multithreading can be complex and problematic
           Instead a focus on parallelization:
                Distributed computation: gridR, nws, snow
                Multicore/multi-cpu scaling: Rmpi, Romp, pnmath
                Interfaces to PBLAS/Hadoop/OpenMP/MPI/Globus/etc.
           Parallelization suits large CPU-bound processing applications
           So do we really need it at all then?
Why We Need Concurrency



   Multithreading Is A Valuable Tool


          I say, "yes"
          For general real-time (streaming to be more precise) data
          analysis
          (Growing interest in using R for streaming data, not just
          offline analyis)
          GUI toolkit integration
          Fine-grained control over independent task execution
          Fine-grained control over CPU-bound and I/O-bound task
          management
          "I believe that explicit concurrency management tools (i.e. a
          threads toolkit) are what we really need in R at this point." -
          Luke Tierney, 2001
Why We Need Concurrency



   Will There Be A Multithreaded R?



          Short answer is: Most likely not
          At least not in its current incarnation
          Internal workings of the interpreter not particularly amenable
          to concurrency:
                 Functions can manipulate caller state («- vs. <-)
                 Lazy evaluation machinery (promises)
                 Dynamic State, garbage collection, etc.
                 Scoping: global environments
                 Management of resources: streams, I/O, connections, sinks
          Implications for current code
          Possibly in the next language evolution (cf. Ihaka?)
Motivating Example



   Motivating Example




           Based on work I did last year and presented at UseR! 2008
           Wrote a real-time and historical market data service from
           Reuters/R
           The real-time interface used the Reuters C++ API
           R extension that spawned listening thread and handled market
           updates
           New version also does publishing as well as subscribing
Motivating Example



   Motivating Example




           The (real-world) example involves building a new
           high-frequency trading system
           Step 1 is handling market prices (in this case interbank
           currency prices)
           Need to ensure that the new system’s prices are:
                     Correct;
                     Fast
Motivating Example




                        R Analytics

                      C++ RMDS API




                     RMDS Message Bus
Motivating Example



   Issues With This Approach




           As R interpreter is single threaded, cannot spawn thread for
           callbacks
           Thus, interpreter thread is locked for the duration of
           subscription
           Not a great user experience
           Need to find alternative mechanism
Motivating Example



   Alternative Approach



           If we cannot run subscriber threads in-process, need to
           decouple
           Standard approach: add an extra layer and use some form of
           IPC
           For instance, we could:
                     Subscribe in a dedicated R process (A)
                     Push incoming data onto a socket
                     R process (B) reads from a listening socket
           Sockets could also be another IPC primitive, e.g. pipes, shared
           mem
           We will use the bigmemoRy package to leverage the latter
Motivating Example



   The bigmemoRy package




           From the description: "Use C++ to create, store,
           access, and manipulate massive matrices"
           Allows creation of large (≥ RAM) matrices
           These matrices can be mapped to files/shared memory
           It is the shared memory functionality that we will use

    big.matrix(nrow, ncol, type = "integer", ....)
    shared.big.matrix(nrow, ncol, type = "integer", ...)
    filebacked.big.matrix(nrow, ncol, type = "integer", ...)
    read.big.matrix(file, sep=, ...)
Motivating Example



   Sample Usage




    > library(bigmemory)
    > X <- shared.big.matrix(type="double", ncol=1000, nrow=1000)
    > X
    An object of class “big.matrix”
    Slot "address":
    <pointer: 0x7378a0>
Motivating Example



   Create Shared Memory Descriptor

    > desc <- describe(X)
    > desc
    $sharedType
    [1] "SharedMemory"

    $sharedName
    [1] "53f14925-dca1-42a8-a547-e1bccae999ce"

    $nrow
    [1] 1000

    $ncol
    [1] 1000

    $rowNames
    NULL

    $colNames
    NULL

    $type
    [1] "double"
Motivating Example



   Export the Descriptor




    In R session 1:

    > dput(desc, file="/tmp/matrix.desc")

    In R session 2:

    > library(bigmemory)
    > desc <- dget("/tmp/matrix.desc")
    > X <- attach.big.matrix(desc)

    Now R sessions A and B share the same big.matrix instance
Motivating Example



   Share Data Between Sessions




    R session 1:

    > X[1,1] <- 1.2345

    R session 2:

    > X[1,1]
    [1] 1.2345

    Thus, streaming data can be continuously fed into session A
    And concurrently processed in session B
Motivating Example




                     RMDS Message Bus




                      C++ RMDS API

                      R / bigmemoRy




                      R / bigmemoRy

                      C++ RMDS API




                     RMDS Message Bus
Conclusion



    Summary




             Lack of threads not necessarily a barrier to concurrent analysis
             Packages like bigmemoRy, nws, etc. facilitate decoupling via
             IPC
             Could potentially take this further (using e.g. nws)
References and Further Reading



    References




            bigmemoRy:
            http://cran.r-project.org/web/packages/bigmemory/
            Luke Tierney’s original threading paper:
            http://www.cs.uiowa.edu/~luke/R/thrgui/
            HPC and R Survey:
            http://epub.ub.uni-muenchen.de/8991/
            Inside The Python GIL:
            www.dabeaz.com/python/GIL.pdf

More Related Content

What's hot

Giraph++: From "Think Like a Vertex" to "Think Like a Graph"
Giraph++: From "Think Like a Vertex" to "Think Like a Graph"Giraph++: From "Think Like a Vertex" to "Think Like a Graph"
Giraph++: From "Think Like a Vertex" to "Think Like a Graph"Yuanyuan Tian
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsDatabricks
 
Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsAjay Ohri
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationRevolution Analytics
 
R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopRevolution Analytics
 
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Revolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopDataWorks Summit
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareCarol McDonald
 
Distributed R: The Next Generation Platform for Predictive Analytics
Distributed R: The Next Generation Platform for Predictive AnalyticsDistributed R: The Next Generation Platform for Predictive Analytics
Distributed R: The Next Generation Platform for Predictive AnalyticsJorge Martinez de Salinas
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
 
Introduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkIntroduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkTugdual Grall
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataCarol McDonald
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisYuanyuan Tian
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient DataCarol McDonald
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 

What's hot (20)

Giraph++: From "Think Like a Vertex" to "Think Like a Graph"
Giraph++: From "Think Like a Vertex" to "Think Like a Graph"Giraph++: From "Think Like a Vertex" to "Think Like a Graph"
Giraph++: From "Think Like a Vertex" to "Think Like a Graph"
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy Models
 
Managing large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and conceptsManaging large datasets in R – ff examples and concepts
Managing large datasets in R – ff examples and concepts
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with Hadoop
 
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health Care
 
Distributed R: The Next Generation Platform for Predictive Analytics
Distributed R: The Next Generation Platform for Predictive AnalyticsDistributed R: The Next Generation Platform for Predictive Analytics
Distributed R: The Next Generation Platform for Predictive Analytics
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Introduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkIntroduction to Streaming with Apache Flink
Introduction to Streaming with Apache Flink
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming Data
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
 
R and Data Science
R and Data ScienceR and Data Science
R and Data Science
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient Data
 
Big Data Paris
Big Data ParisBig Data Paris
Big Data Paris
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 

Similar to Streaming Data in R

Scaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, GoalsScaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, Goalskamaelian
 
Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...IndicThreads
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowKaxil Naik
 
Intro to hadoop ecosystem
Intro to hadoop ecosystemIntro to hadoop ecosystem
Intro to hadoop ecosystemGrzegorz Kolpuc
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
Building scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftBuilding scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftTalentica Software
 
Data analysis with R and Julia
Data analysis with R and JuliaData analysis with R and Julia
Data analysis with R and JuliaMark Tabladillo
 
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...EUDAT
 
Splunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageSplunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageDamien Dallimore
 
Effiziente Verarbeitung von grossen Datenmengen
Effiziente Verarbeitung von grossen DatenmengenEffiziente Verarbeitung von grossen Datenmengen
Effiziente Verarbeitung von grossen DatenmengenFlorian Stegmaier
 
RESTLess Design with Apache Thrift: Experiences from Apache Airavata
RESTLess Design with Apache Thrift: Experiences from Apache AiravataRESTLess Design with Apache Thrift: Experiences from Apache Airavata
RESTLess Design with Apache Thrift: Experiences from Apache Airavatasmarru
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenRevolution Analytics
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on HadoopPaco Nathan
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document usefulssuser3c3f88
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big AnalyticsAjay Ohri
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureGabriele Modena
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchAndrew Lowe
 

Similar to Streaming Data in R (20)

Scaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, GoalsScaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, Goals
 
Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
Intro to hadoop ecosystem
Intro to hadoop ecosystemIntro to hadoop ecosystem
Intro to hadoop ecosystem
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
Building scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftBuilding scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thrift
 
Data analysis with R and Julia
Data analysis with R and JuliaData analysis with R and Julia
Data analysis with R and Julia
 
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
 
Rust presentation convergeconf
Rust presentation convergeconfRust presentation convergeconf
Rust presentation convergeconf
 
Splunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageSplunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the message
 
Effiziente Verarbeitung von grossen Datenmengen
Effiziente Verarbeitung von grossen DatenmengenEffiziente Verarbeitung von grossen Datenmengen
Effiziente Verarbeitung von grossen Datenmengen
 
RESTLess Design with Apache Thrift: Experiences from Apache Airavata
RESTLess Design with Apache Thrift: Experiences from Apache AiravataRESTLess Design with Apache Thrift: Experiences from Apache Airavata
RESTLess Design with Apache Thrift: Experiences from Apache Airavata
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document useful
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 

More from Rory Winston

Building A Trading Desk On Analytics
Building A Trading Desk On AnalyticsBuilding A Trading Desk On Analytics
Building A Trading Desk On AnalyticsRory Winston
 
The Modern FX Desk
The Modern FX DeskThe Modern FX Desk
The Modern FX DeskRory Winston
 
Introduction to kdb+
Introduction to kdb+Introduction to kdb+
Introduction to kdb+Rory Winston
 
An Analytics Toolkit Tour
An Analytics Toolkit TourAn Analytics Toolkit Tour
An Analytics Toolkit TourRory Winston
 
Creating R Packages
Creating R PackagesCreating R Packages
Creating R PackagesRory Winston
 
Streaming Data and Concurrency in R
Streaming Data and Concurrency in RStreaming Data and Concurrency in R
Streaming Data and Concurrency in RRory Winston
 

More from Rory Winston (6)

Building A Trading Desk On Analytics
Building A Trading Desk On AnalyticsBuilding A Trading Desk On Analytics
Building A Trading Desk On Analytics
 
The Modern FX Desk
The Modern FX DeskThe Modern FX Desk
The Modern FX Desk
 
Introduction to kdb+
Introduction to kdb+Introduction to kdb+
Introduction to kdb+
 
An Analytics Toolkit Tour
An Analytics Toolkit TourAn Analytics Toolkit Tour
An Analytics Toolkit Tour
 
Creating R Packages
Creating R PackagesCreating R Packages
Creating R Packages
 
Streaming Data and Concurrency in R
Streaming Data and Concurrency in RStreaming Data and Concurrency in R
Streaming Data and Concurrency in R
 

Recently uploaded

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Recently uploaded (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Streaming Data in R

  • 1. Streaming Data And Concurrency In R Rory Winston rory@theresearchkitchen.com
  • 2. About Me Independent Software Consultant M.Sc. Applied Computing, 2000 M.Sc. Finance, 2008 Apache Committer Interested in practical applications of functional languages and machine learning Really interested in seeing R usage grow in finance
  • 3. 1 A Short Rant 2 Why We Need Concurrency 3 Motivating Example 4 Conclusion 5 References and Further Reading
  • 4. A Short Rant Parallelization vs. Concurrency in R Multithreading vs. parallelization i.e. fork() vs. pthread_create() R interpreter is single threaded Some historical context for this (e.g. non-threadsafe BLAS implementations) Multithreading can be complex and problematic Instead a focus on parallelization: Distributed computation: gridR, nws, snow Multicore/multi-cpu scaling: Rmpi, Romp, pnmath Interfaces to PBLAS/Hadoop/OpenMP/MPI/Globus/etc. Parallelization suits large CPU-bound processing applications So do we really need it at all then?
  • 5. Why We Need Concurrency Multithreading Is A Valuable Tool I say, "yes" For general real-time (streaming to be more precise) data analysis (Growing interest in using R for streaming data, not just offline analyis) GUI toolkit integration Fine-grained control over independent task execution Fine-grained control over CPU-bound and I/O-bound task management "I believe that explicit concurrency management tools (i.e. a threads toolkit) are what we really need in R at this point." - Luke Tierney, 2001
  • 6. Why We Need Concurrency Will There Be A Multithreaded R? Short answer is: Most likely not At least not in its current incarnation Internal workings of the interpreter not particularly amenable to concurrency: Functions can manipulate caller state («- vs. <-) Lazy evaluation machinery (promises) Dynamic State, garbage collection, etc. Scoping: global environments Management of resources: streams, I/O, connections, sinks Implications for current code Possibly in the next language evolution (cf. Ihaka?)
  • 7. Motivating Example Motivating Example Based on work I did last year and presented at UseR! 2008 Wrote a real-time and historical market data service from Reuters/R The real-time interface used the Reuters C++ API R extension that spawned listening thread and handled market updates New version also does publishing as well as subscribing
  • 8. Motivating Example Motivating Example The (real-world) example involves building a new high-frequency trading system Step 1 is handling market prices (in this case interbank currency prices) Need to ensure that the new system’s prices are: Correct; Fast
  • 9. Motivating Example R Analytics C++ RMDS API RMDS Message Bus
  • 10. Motivating Example Issues With This Approach As R interpreter is single threaded, cannot spawn thread for callbacks Thus, interpreter thread is locked for the duration of subscription Not a great user experience Need to find alternative mechanism
  • 11. Motivating Example Alternative Approach If we cannot run subscriber threads in-process, need to decouple Standard approach: add an extra layer and use some form of IPC For instance, we could: Subscribe in a dedicated R process (A) Push incoming data onto a socket R process (B) reads from a listening socket Sockets could also be another IPC primitive, e.g. pipes, shared mem We will use the bigmemoRy package to leverage the latter
  • 12. Motivating Example The bigmemoRy package From the description: "Use C++ to create, store, access, and manipulate massive matrices" Allows creation of large (≥ RAM) matrices These matrices can be mapped to files/shared memory It is the shared memory functionality that we will use big.matrix(nrow, ncol, type = "integer", ....) shared.big.matrix(nrow, ncol, type = "integer", ...) filebacked.big.matrix(nrow, ncol, type = "integer", ...) read.big.matrix(file, sep=, ...)
  • 13. Motivating Example Sample Usage > library(bigmemory) > X <- shared.big.matrix(type="double", ncol=1000, nrow=1000) > X An object of class “big.matrix” Slot "address": <pointer: 0x7378a0>
  • 14. Motivating Example Create Shared Memory Descriptor > desc <- describe(X) > desc $sharedType [1] "SharedMemory" $sharedName [1] "53f14925-dca1-42a8-a547-e1bccae999ce" $nrow [1] 1000 $ncol [1] 1000 $rowNames NULL $colNames NULL $type [1] "double"
  • 15. Motivating Example Export the Descriptor In R session 1: > dput(desc, file="/tmp/matrix.desc") In R session 2: > library(bigmemory) > desc <- dget("/tmp/matrix.desc") > X <- attach.big.matrix(desc) Now R sessions A and B share the same big.matrix instance
  • 16. Motivating Example Share Data Between Sessions R session 1: > X[1,1] <- 1.2345 R session 2: > X[1,1] [1] 1.2345 Thus, streaming data can be continuously fed into session A And concurrently processed in session B
  • 17. Motivating Example RMDS Message Bus C++ RMDS API R / bigmemoRy R / bigmemoRy C++ RMDS API RMDS Message Bus
  • 18. Conclusion Summary Lack of threads not necessarily a barrier to concurrent analysis Packages like bigmemoRy, nws, etc. facilitate decoupling via IPC Could potentially take this further (using e.g. nws)
  • 19. References and Further Reading References bigmemoRy: http://cran.r-project.org/web/packages/bigmemory/ Luke Tierney’s original threading paper: http://www.cs.uiowa.edu/~luke/R/thrgui/ HPC and R Survey: http://epub.ub.uni-muenchen.de/8991/ Inside The Python GIL: www.dabeaz.com/python/GIL.pdf