Ruby on Big Data
                 Brian O’Neill
Lead Architect, Health Market Science (HMS)



      The views expressed herein are those of my own and do not necessarily reflect the views of HMS or other
                                              organizations mentioned.
Agenda
Big Data Orientation
  Cassandra
  Hadoop
  SOLR
  Storm

DEMO
Java/Ruby Interoperability
Advanced Ideas
  Rails Integration
  Combining Real-time w/ Batch Processing (The Final Frontier)
“Big” Data
Size doesn’t always matter; it may be
what you’re doing with it
 e.g. Natural-Language Processing

Flexibility was our major motivator
 Data sources with disparate schema
Decomposing the Problem
Data
 Storage
 Indexing
 Querying
Processing
 Distributed
 Batch
 Real-time
Relational Storage
ACID
 Atomic: Everything in a transaction succeeds or the
 entire transaction is rolled back.
 Consistent: A transaction cannot leave the
 database in an inconsistent state.
 Isolated: Transactions cannot interfere with each
 other.
 Durable: Completed transactions persist, even when
 servers restart, etc.
Relational Storage
Benefits          Limitations
 Data Integrity   Static Schemas

 Ubiquity         Scalability
NoSQL Storage
BASE
 Basic Availability
 Soft-state
 Eventual consistency

Simple API
 REST + JSON
Indexing
Real-time Answers
Full-text queries
 Fuzzy Searching

Nickname analysis
Geospatial and Temporal Search
Storage Options
Indexing Options
Why?
Cassandra
 Consistency-level per operation
 Temporal dimension of an operation
 Idempotent mentality

SOLR
 Community
 Integration (Solandra)
   NOT scalability and flexibility (sharding stinks)
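The Cassandra point about choosing a consistency level per operation can be made concrete with a little arithmetic. The sketch below is illustrative only: it models just three levels, the function names are hypothetical, and it relies on the usual rule that a read overlaps the latest write when R + W > N (replicas read plus replicas written exceeds the replication factor).

```ruby
# Replicas that must answer for each (simplified) consistency level.
# Only three levels are modelled; the names here are illustrative.
def replicas_required(level, replication_factor)
  case level
  when :one    then 1
  when :quorum then replication_factor / 2 + 1
  when :all    then replication_factor
  else raise ArgumentError, "unknown level: #{level}"
  end
end

# A read is guaranteed to overlap the latest write when R + W > N.
def read_sees_latest_write?(read_level, write_level, replication_factor)
  r = replicas_required(read_level, replication_factor)
  w = replicas_required(write_level, replication_factor)
  r + w > replication_factor
end

puts read_sees_latest_write?(:quorum, :quorum, 3)  # true  (2 + 2 > 3)
puts read_sees_latest_write?(:one, :one, 3)        # false (1 + 1 <= 3)
```

Each operation can pick its own point on this trade-off: cheaper levels for writes that can tolerate staleness, stricter ones where a read must see the latest value.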
Cassandra’s Data Model
   Keyspaces

     Column Families
                 Rows
               (Sorted by KEY!)


                       Columns
                         (Name : Value)
Example
BeerGuys (Keyspace)
  Users (Column Family)
     bonedog (Row)
        firstName : Brian
        lastName : O’Neill
     lisa (Row)
        firstName : Lisa
        lastName : O’Neill
        maidenName : Kelley
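One way to picture the example above is as nested maps. This is a minimal sketch in which plain Ruby hashes stand in for Cassandra (no driver or cluster is involved) and `fetch_column` is a hypothetical helper, not part of any client library.

```ruby
# Keyspace -> column family -> row key -> { column name => value }.
# Plain hashes stand in for Cassandra here; nothing talks to a cluster.
BEER_GUYS = {
  "Users" => {
    "bonedog" => { "firstName" => "Brian", "lastName" => "O’Neill" },
    "lisa"    => { "firstName" => "Lisa", "lastName" => "O’Neill",
                   "maidenName" => "Kelley" }
  }
}

# Hypothetical helper: look up one column, or nil if any level is missing.
def fetch_column(keyspace, column_family, row_key, column_name)
  keyspace.dig(column_family, row_key, column_name)
end

# Rows need not share the same columns: only "lisa" has maidenName.
puts fetch_column(BEER_GUYS, "Users", "lisa", "maidenName")             # Kelley
puts fetch_column(BEER_GUYS, "Users", "bonedog", "maidenName").inspect  # nil
```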
Cassandra Architecture
 Ring Architecture
  Hash(key) -> Node
  Reliability
  Scalability
[Diagram: a Client connected to a ring of three nodes: A (N-Z), F (A-F), M (G-M)]
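The Hash(key) -> Node idea can be sketched in a few lines of Ruby. Everything concrete here is an assumption made for illustration: the token values are invented, MD5 is used as the hash, and `token_for` and `node_for` are hypothetical names. The slide labels ranges with letters; this sketch uses numeric tokens instead.

```ruby
require "digest/md5"

# Hypothetical three-node ring. Each node owns the token range that ends
# at its own token; the token values are made up for illustration.
RING_SIZE = 2**128 # MD5 digests are 128 bits
NODES = {
  "A" => RING_SIZE / 3,
  "F" => 2 * RING_SIZE / 3,
  "M" => RING_SIZE - 1
}.sort_by { |_, token| token }

# Hash(key) -> token on the ring.
def token_for(key)
  Digest::MD5.hexdigest(key).to_i(16)
end

# Walk clockwise to the first node whose token is >= the key's token,
# wrapping around to the first node if none qualifies.
def node_for(key)
  token = token_for(key)
  name, _token = NODES.find { |_, node_token| node_token >= token } || NODES.first
  name
end

%w[bonedog lisa].each { |key| puts "#{key} -> #{node_for(key)}" }
```

Because any client can compute the owning node from the key alone, no central lookup is needed, which is part of what makes the ring scale.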
Why NoSQL for us?
Flexibility
A new data processing paradigm
  Instead of:
                Data          Processing
  Do this:


 Processing            Data
Batch Processing
Distributable
Scalable
Data Locality
[Diagram: DATA and a JOB alongside an HDFS ring of nodes A (T-A), S (I-R), H (B-G)]
Map / Reduce
tuple = (key, value)
map(x) -> tuple[]
reduce(key, value[]) -> tuple[]
Word Count
The Code
# emit is assumed to be provided by the framework
def map(doc)
  doc.split.each do |word|
    emit(word, 1)
  end
end

def reduce(key, values)
  sum = values.inject(0) { |total, x| total + x }
  emit(key, sum)
end

The Run
doc1 = "boy meets girl"
doc2 = "girl likes boy"

map(doc1) -> (boy, 1), (meets, 1), (girl, 1)
map(doc2) -> (girl, 1), (likes, 1), (boy, 1)

reduce(boy, [1, 1]) -> (boy, 2)
reduce(girl, [1, 1]) -> (girl, 2)
reduce(likes, [1]) -> (likes, 1)
reduce(meets, [1]) -> (meets, 1)
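The word count can also be run end to end in plain Ruby. This is a minimal single-process sketch, not how Hadoop executes it: `map_doc`, `reduce_counts`, and `run_job` are hypothetical names, tuples are returned rather than emitted, and a local `group_by` stands in for the shuffle phase.

```ruby
# map: one document in, a list of (word, 1) tuples out.
def map_doc(doc)
  doc.split.map { |word| [word, 1] }
end

# reduce: one key plus all of its values in, one (key, total) tuple out.
def reduce_counts(key, values)
  [key, values.sum]
end

# Hypothetical in-memory driver: map every document, group the tuples by
# key (the "shuffle"), then reduce each group.
def run_job(docs)
  tuples  = docs.flat_map { |doc| map_doc(doc) }
  grouped = tuples.group_by { |word, _| word }
  grouped.map { |word, pairs| reduce_counts(word, pairs.map { |_, n| n }) }.to_h
end

counts = run_job(["boy meets girl", "girl likes boy"])
counts.each { |word, n| puts "#{word}: #{n}" }
# boy: 2
# meets: 1
# girl: 2
# likes: 1
```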
Queries / Flows
 Hive
 Pig
 Cascading
Real-time Processing
Deals with data streams
Storm
[Diagram: a topology in which Spouts emit tuples to Bolts, and Bolts emit tuples to further Bolts]
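The spout and bolt roles can be imitated in plain Ruby to show how tuples flow. This is a single-process sketch over a finite list, whereas a real Storm topology is distributed and its streams are unbounded. All names (`sentence_spout`, `split_bolt`, `count_bolt`, `run_topology`) are hypothetical and none of this is Storm's API.

```ruby
# A spout is a source of tuples; here, an Enumerator over a fixed list.
sentence_spout = ["boy meets girl", "girl likes boy"].each

# A bolt consumes one tuple and emits zero or more tuples.
split_bolt = ->(sentence) { sentence.split }

# A stateful bolt: keeps running counts and emits (word, count) per tuple.
counts = Hash.new(0)
count_bolt = lambda do |word|
  counts[word] += 1
  [[word, counts[word]]]
end

# Hypothetical wiring: push each spout tuple through the bolts in order.
def run_topology(spout, *bolts)
  spout.flat_map do |tuple|
    bolts.reduce([tuple]) { |tuples, bolt| tuples.flat_map { |t| bolt.call(t) } }
  end
end

emitted = run_topology(sentence_spout, split_bolt, count_bolt)
emitted.each { |word, count| puts "#{word} #{count}" }
```

Unlike the batch word count, a result is emitted for every incoming tuple, so running totals are available while the stream is still flowing.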
Putting it Together
[Diagram: Storm at the center of the ring of nodes A (T-A), S (I-R), H (B-G)]
But...
We love Ruby!
 and it’s all in Java. :(


That’s okay,
  because
We love REST!
REST Layer
 CRUD via HTTP
 Map/Reduce via HTTP
[Diagram: a Client calling the REST layer in front of the ring of nodes A, S, H, with Storm]
DEMO
Java Interoperability
Conventional Interoperability
 I/O Streams between processes



Hadoop Streaming
Storm Multilang
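For the Hadoop Streaming case, "I/O streams between processes" means the mapper and reducer are ordinary executables that read lines on standard input and write tab-separated key/value lines on standard output. Below is a minimal word-count sketch of that convention. The function names and the `map`/`reduce` command-line argument are assumptions made for illustration.

```ruby
# Mapper for a streaming-style job: read lines from `input`, write one
# "word<TAB>1" line per word to `output`.
def stream_map(input, output)
  input.each_line do |line|
    line.split.each { |word| output.puts "#{word}\t1" }
  end
end

# Reducer: input arrives sorted by key, so equal words are adjacent.
def stream_reduce(input, output)
  current = nil
  total = 0
  input.each_line do |line|
    word, count = line.chomp.split("\t")
    if word != current
      output.puts "#{current}\t#{total}" if current
      current = word
      total = 0
    end
    total += count.to_i
  end
  output.puts "#{current}\t#{total}" if current
end

# When run as a script, the (hypothetical) mode argument picks the role
# and the framework supplies STDIN and STDOUT.
if __FILE__ == $PROGRAM_NAME
  case ARGV.first
  when "map"    then stream_map($stdin, $stdout)
  when "reduce" then stream_reduce($stdin, $stdout)
  end
end
```

Passing the streams in as arguments keeps both functions testable without a cluster, for example with `StringIO` objects in place of STDIN and STDOUT.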
CRUD via HTTP
http://virgil/data/{keyspace}/{columnFamily}/{column}/{row}
 PUT : Replaces Content of Row/Column
 GET : Retrieves Value of a Row/Column
 DELETE : Removes Value of a Row/Column
[Diagram: curl issuing requests to the ring of nodes A, S, H]
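The same three verbs can be issued from Ruby's standard library instead of curl. The sketch below only builds the requests and does not send them. The host `virgil` and the order of path segments are copied from the URL template on the slide and should be checked against the actual server. The helper names are hypothetical.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: fill in the URL template from the slide. The host
# "virgil" and the segment order are taken from the slide as-is.
def cell_uri(keyspace, column_family, column, row)
  URI("http://virgil/data/#{keyspace}/#{column_family}/#{column}/#{row}")
end

# Build (but do not send) the request for each verb.
def put_request(uri, value)
  request = Net::HTTP::Put.new(uri)
  request.body = value
  request
end

def get_request(uri)
  Net::HTTP::Get.new(uri)
end

def delete_request(uri)
  Net::HTTP::Delete.new(uri)
end

uri = cell_uri("BeerGuys", "Users", "firstName", "bonedog")
[put_request(uri, "Brian"), get_request(uri), delete_request(uri)].each do |req|
  puts "#{req.method} #{req.path}"
end
# Sending would look like:
#   Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
```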
Map/Reduce over HTTP
wordcount.rb
def map(rowKey, columns)
    result = []
    columns.each do |column_name, value|
        words = value.split
        words.each do |word|
            result << [word, "1"]
        end
    end
    return result
end

def reduce(key, values)
    rows = {}
    total = 0
    columns = {}
    values.each do |value|
        total += value.to_i
    end
    columns["count"] = total.to_s
    rows[key] = columns
    return rows
end

[Diagram: curl submits wordcount.rb to the ring of nodes A, S, H; the job reads from an input column family (CF in) and writes to an output column family (CF out)]
Better?
 Use JRuby
  Single Process
  Parse Once / Eval Many

JSR 223
    ScriptEngine ENGINE = new ScriptEngineManager().getEngineByName("jruby");
    ScriptContext context = new SimpleScriptContext();
    Bindings bindings = context.getBindings(ScriptContext.ENGINE_SCOPE);
    bindings.put("variable", "value");
    ENGINE.eval(script, context);



Redbridge
    this.rubyContainer = new ScriptingContainer(LocalContextScope.CONCURRENT);
    this.rubyReceiver = rubyContainer.runScriptlet(script);
    rubyContainer.callMethod(rubyReceiver, "foo", "value");
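On the Ruby side, "Parse Once / Eval Many" suggests a script that is evaluated a single time and leaves behind an object whose methods are then called repeatedly. A minimal sketch of such a script follows. The class name `Receiver` and the body of `foo` are assumptions for illustration. The sketch relies on the embedding container returning the value of the script's last expression, which then serves as the receiver.

```ruby
# Script evaluated once by the embedding Java code. The value of the last
# expression becomes the receiver that Java later calls methods on.
class Receiver
  def initialize
    @calls = 0
  end

  # Called many times from Java, e.g. callMethod(receiver, "foo", "value").
  # State survives between calls because the script is not re-evaluated.
  def foo(value)
    @calls += 1
    "foo(#{value}) call number #{@calls}"
  end
end

Receiver.new
```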
Rails Integration
[Diagram: a Load Balancer in front of the ring of nodes A, S, H, with arrows labeled Data and Processing]
“REST is the new JDBC”
ActiveRecord backed by REST?
Anything more than a proxy?
Ratch Processing
  (Combining Real-time and Batch)


Data Flows as:
 Cascading Map/Reduce jobs
 Storm Topologies?

Can’t we have one framework to rule
them all?

The Future of Software Development - Devin AI Innovative Approach.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 

Ruby on Big Data (Cassandra + Hadoop)

  • 1. Ruby on Big Data Brian O’Neill Lead Architect, Health Market Science (HMS) The views expressed herein are those of my own and do not necessarily reflect the views of HMS or other organizations mentioned.
  • 2. Agenda Big Data Orientation Cassandra Hadoop SOLR Storm DEMO Java/Ruby Interoperability Advanced Ideas Rails Integration Combining Real-time w/ Batch Processing (The Final Frontier)
  • 3. “Big” Data Size doesn’t always matter, it may be what you’re doing with it e.g. Natural-Language Processing Flexibility was our major motivator Data sources with disparate schema
  • 4. Decomposing the Problem Data Processing Storage Distributed Indexing Batch Querying Real-time
  • 5. Relational Storage ACID Atomic: Everything in a transaction succeeds or the entire transaction is rolled back. Consistent: A transaction cannot leave the database in an inconsistent state. Isolated: Transactions cannot interfere with each other. Durable: Completed transactions persist, even when servers restart etc.
  • 6. Relational Storage Benefits Limitations Data Integrity Static Schemas Ubiquity Scalability
  • 7. NoSQL Storage BASE Basic Availability Soft-state Eventual consistency Simple API REST + JSON
  • 8. Indexing Real-time Answers Full-text queries Fuzzy Searching Nickname analysis Geospatial and Temporal Search
  • 11. Why? Cassandra Consistency-level per operation Temporal dimension of an operation Idempotent mentality SOLR Community Integration (Solandra) NOT scalability and flexibility (sharding stinks)
  • 12. Cassandra’s Data Model Keyspaces Column Families Rows (Sorted by KEY!) Columns (Name : Value)
  • 13. Example BeerGuys (Keyspace) Users (Column Families) bonedog (Row) firstName : Brian lastName : O’Neill lisa (Row) firstName : Lisa lastName : O’Neill maidenName : Kelley
  • 14. Cassandra Architecture Ring Architecture Hash(key) -> Node Reliability Scalability [Diagram: a client talking to a ring of nodes A (N-Z), F (A-F), M (G-M)]
  • 15. Why NoSQL for us? Flexibility A new data processing paradigm Instead of: Data Processing Do this: Processing Data
  • 16. Batch Processing Distributable Scalable Data Locality [Diagram: DATA and a JOB distributed over HDFS on ring nodes A (T-A), S (I-R), H (B-G)]
  • 17. Map / Reduce tuple = (key, value) map(x) -> tuple[] reduce(key, value[]) -> tuple[]
  • 18. Word Count
      The Code:
        def map(doc)
          doc.split.each do |word|
            emit(word, 1)
          end
        end
        def reduce(key, values)
          sum = values.inject {|sum,x| sum + x }
          emit(key, sum)
        end
      The Run:
        doc1 = "boy meets girl"
        doc2 = "girl likes boy"
        map (doc1) -> (boy, 1), (meets, 1), (girl, 1)
        map (doc2) -> (girl, 1), (likes, 1), (boy, 1)
        reduce (boy, [1, 1]) -> (boy, 2)
        reduce (girl, [1, 1]) -> (girl, 2)
        reduce (likes, [1]) -> (likes, 1)
        reduce (meets, [1]) -> (meets, 1)
  • 19. Queries / Flows Hive Pig Cascading
  • 20. Real-time Processing Deals with data streams Storm [Diagram: spouts emit tuples to bolts, and bolts pass tuples on to other bolts]
  • 21. Putting it Together [Diagram: Storm alongside the ring of nodes A (T-A), S (I-R), H (B-G)]
  • 22. But... We love Ruby! and it’s all in Java. :( That’s okay, because We love REST!
  • 23. REST Layer CRUD via HTTP Map/Reduce via HTTP [Diagram: a client calling a REST layer in front of ring nodes A, S, H and Storm]
  • 24. DEMO
  • 25. Java Interoperability Conventional Interoperability I/O Streams between processes Hadoop Streaming Storm Multilang
  • 26. CRUD via HTTP http://virgil/data/{keyspace}/{columnFamily}/{column}/{row} PUT : Replaces Content of Row/Column GET : Retrieves Value of a Row/Column DELETE : Removes Value of a Row/Column [Diagram: curl calling the REST layer in front of ring nodes A, S, H]
  • 27. Map/Reduce over HTTP [Diagram: curl submits wordcount.rb to the REST layer in front of ring nodes A, S, H; CF in, CF out]
      wordcount.rb:
        def map(rowKey, columns)
          result = []
          columns.each do |column_name, value|
            words = value.split
            words.each do |word|
              result << [word, "1"]
            end
          end
          return result
        end
        def reduce(key, values)
          rows = {}
          total = 0
          columns = {}
          values.each do |value|
            total += value.to_i
          end
          columns["count"] = total.to_s
          rows[key] = columns
          return rows
        end
  • 28. Better? Use JRuby Single Process Parse Once / Eval Many
      JSR 223:
        ScriptEngine ENGINE = new ScriptEngineManager().getEngineByName("jruby");
        ScriptContext context = new SimpleScriptContext();
        Bindings bindings = context.getBindings(ScriptContext.ENGINE_SCOPE);
        bindings.put("variable", "value");
        ENGINE.eval(script, context);
      Redbridge:
        this.rubyContainer = new ScriptingContainer(LocalContextScope.CONCURRENT);
        this.rubyReceiver = rubyContainer.runScriptlet(script);
        rubyContainer.callMethod(rubyReceiver, "foo", "value");
  • 29. Rails Integration [Diagram: a Load Balancer in front of ring nodes A, S, H, labeled Data and Processing] “REST is the new JDBC” ActiveRecord backed by REST? Anything more than a proxy?
  • 30. Ratch Processing (Combining Real-time and Batch) Data Flows as: Cascading Map/Reduce jobs Storm Topologies? Can’t we have one framework to rule them all?
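
Slides 17 and 18 describe word count as a map step and a reduce step over (key, value) tuples. The following is a minimal sketch of that flow in plain Ruby, run in a single process with no Hadoop involved. The names `map_doc`, `reduce_counts`, and `run_word_count` are hypothetical, and the grouping step is only a stand-in for the shuffle a real framework would perform between map and reduce.

```ruby
# Word count as an in-process map/reduce simulation (illustrative only).

# map: one document in, a list of (word, 1) tuples out.
def map_doc(doc)
  doc.split.map { |word| [word, 1] }
end

# reduce: one key plus all of its values in, a single (key, total) tuple out.
def reduce_counts(key, values)
  [key, values.sum]
end

# Hypothetical local driver standing in for the framework's shuffle phase:
# it groups the mapped tuples by key, then hands each group to reduce.
def run_word_count(docs)
  tuples  = docs.flat_map { |doc| map_doc(doc) }
  grouped = tuples.group_by(&:first)
  grouped.map { |key, pairs| reduce_counts(key, pairs.map(&:last)) }.to_h
end

counts = run_word_count(["boy meets girl", "girl likes boy"])
puts counts.inspect
```

Run against the two documents from slide 18, the totals should match the slide: boy and girl appear twice, meets and likes once.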

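
Slide 14 summarizes the ring architecture as Hash(key) -> Node. A rough sketch of that lookup is below, under loud assumptions: the `Ring` class, the node tokens, and the tiny 0..99 token space are all made up for readability, and each node is assumed to own the range that ends at its own token. It is not Cassandra's partitioner code.

```ruby
require "digest"

# Simplified token ring: each node owns the range ending at its own token.
class Ring
  # node_tokens is a hash such as { "A" => 33, "F" => 66 } (hypothetical values).
  def initialize(node_tokens)
    @entries = node_tokens.sort_by { |_node, token| token }
  end

  # Pick the first node whose token is >= the given token;
  # wrap around to the first node when the token is past the last one.
  def node_for_token(token)
    entry = @entries.find { |_node, node_token| token <= node_token }
    (entry || @entries.first).first
  end

  # Hash the row key to a token, then look up the owning node.
  def node_for_key(key)
    node_for_token(Ring.token(key))
  end

  # Assumption: MD5 reduced to a 0..99 token space to keep the example small.
  def self.token(key)
    Digest::MD5.hexdigest(key).to_i(16) % 100
  end
end

ring = Ring.new("A" => 33, "F" => 66, "M" => 99)
puts ring.node_for_key("bonedog")
```

Because the node is derived from the key alone, any client can compute where a row lives without asking a central coordinator, which is the property the slide ties to reliability and scalability.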