The Artful Business of Data Mining
Distributed Schema-less Document-Based Databases




Wednesday 27 March 13
David Coallier (@davidcoallier)
Data Scientist at Engine Yard (.com)

RDBMS

Structure
Restrictions
Safety

id    name     age     address
1     david    1       315
2     divad    3       51
3     foo      41      31
4     bar      42      98
5     john     3315    85
6     jack     4       11
7     jill     8       66
...   ...      ...     ...

What If?


id    name     age    address    phone
1     david    26     IE         353
2     divad    27     US         1
3     foo      42     IE         353
4     bar      31     CA         1
5     john     17     NZ         131
6     jack     128    DK         311
7     jill     21     IE         353
...   ...      ...    ...        ...

Before moving on

JSON

What is JSON?

{
    "firstName": "David",
    "lastName": "Coallier",
    "age": 26,
    "address": {
        "streetAddress": "Mansfield House",
        "city": "Crosshaven"
    },
    "phoneNumbers": [
        {
            "type": "mobile",
            "number": "0863299999"
        }
    ]
}

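As a rough illustration (not part of the original slides), the sketch below parses the same person document in JavaScript and reads two nested fields. It also shows that strict JSON does not accept a trailing comma after the last member of an object, which is an easy mistake to make when writing documents by hand.

```javascript
// The example document as a JSON string (strict JSON: no trailing commas).
const text = `{
  "firstName": "David",
  "lastName": "Coallier",
  "age": 26,
  "address": { "streetAddress": "Mansfield House", "city": "Crosshaven" },
  "phoneNumbers": [ { "type": "mobile", "number": "0863299999" } ]
}`;

const person = JSON.parse(text);

// Nested objects and arrays are reached with ordinary property access.
const city = person.address.city;
const firstPhoneType = person.phoneNumbers[0].type;

// A trailing comma makes the text invalid JSON, so parsing throws.
let trailingCommaRejected = false;
try {
  JSON.parse('{ "city": "Crosshaven", }');
} catch (err) {
  trailingCommaRejected = err instanceof SyntaxError;
}

console.log(city, firstPhoneType, trailingCommaRejected);
```

The same nested shape is what a document store keeps as a single record, where a relational table would need extra columns or extra tables.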
What is HTTP?

What is a Schema?

Alternative

Schema-less

Does NOT mean structure-less

Documents and K-V buckets

CouchDB
Cluster of unreliable commodity hardware

Replication Attachments
Generated “random” ids
Dictionary Revisions?
JSON Objects
HTTP CRUD

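To make "HTTP CRUD" a little more concrete, here is a minimal sketch of how the four operations map onto CouchDB's document endpoints. The server URL, the database name `people`, and the helper function names are assumptions made for this example. The helpers only build request descriptions and do not send anything, so no running server is needed. The id and revision values are placeholders.

```javascript
// Hypothetical server URL and database name, chosen only for illustration.
const BASE_URL = "http://localhost:5984";
const DB = "people";

// Build (but do not send) the HTTP request for each CRUD operation.
function createDoc(id, doc) {
  return { method: "PUT", url: `${BASE_URL}/${DB}/${id}`, body: JSON.stringify(doc) };
}

function readDoc(id) {
  return { method: "GET", url: `${BASE_URL}/${DB}/${id}` };
}

// Updates must carry the current revision, otherwise CouchDB reports a conflict.
function updateDoc(id, rev, doc) {
  const body = JSON.stringify(Object.assign({}, doc, { _rev: rev }));
  return { method: "PUT", url: `${BASE_URL}/${DB}/${id}`, body };
}

// Deletes also name the revision being removed.
function deleteDoc(id, rev) {
  return { method: "DELETE", url: `${BASE_URL}/${DB}/${id}?rev=${encodeURIComponent(rev)}` };
}

const requests = [
  createDoc("131dafsd1vasd", { firstName: "David", age: 26 }),
  readDoc("131dafsd1vasd"),
  updateDoc("131dafsd1vasd", "12-fva32asdf", { firstName: "David", age: 27 }),
  deleteDoc("131dafsd1vasd", "12-fva32asdf"),
];
requests.forEach((r) => console.log(r.method, r.url));
```

Each description could be handed to any HTTP client. The point of the sketch is only that a document is addressed by URL and versioned by its revision.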
Documents

{
    "_id": "131dafsd1vasd",
    "_rev": "12-fva32asdf",
    "firstName": "David",
    "lastName": "Coallier",
    "age": 26,
    "address": {
        "streetAddress": "Mansfield House",
        "city": "Crosshaven"
    },
    "phoneNumbers": [
        {
            "type": "mobile",
            "number": "0863299999"
        }
    ]
}

How do you find anything?

Map/Reduce

...

Riak

Dynamo Paper

CAP Theorem

Key-Value Buckets

Differences?

                  CouchDB                     Riak
Storage Model     append-only                 Bitcask
Access            HTTP                        HTTP, PB
Retrieval         Views (M/R)                 M/R, Indexes, Search
Versioning        Eventual Consistency        Vector Clocks
Concurrency       No Locking                  Client Resolution
Replication       master/master/slave replication, clustering
Scaling In/Out    BigCouch                    Built-in
Management        Futon/Fauxton               Riak Control
                  http://guide.couchdb.org    http://downloads.basho.com/papers/bitcask-intro.pdf

Map/Reduce

Mapper: executed on each document
Reducer: receives the output from the mappers

{ "_id": "...", "_rev": "...", "age": "26" }
{ "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
{ "_id": "...", "_rev": "...", "age": "42" }
{ "_id": "...", "_rev": "...", "age": "17" }

{
    "age": "32",
    "heads": "3"
}

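Map: find-ages

function find_ages(doc) {
  // typeof yields a string, so compare against the string 'undefined'.
  if (typeof doc.age !== 'undefined') {
    // Ages are stored as strings in the sample documents; emit them as numbers.
    emit(doc._id, Number(doc.age));
  }
}
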
Map: find-ages

26    32    42    17

Reduce: sum

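Reduce: sum

// Named sum_ages so that it does not shadow, and endlessly call, sum().
function sum_ages(keys, values, rereduce) {
  // sum() is a helper that CouchDB provides to JavaScript view functions.
  return sum(values);
}
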
Map: find-ages

26    32    42    17

Reduce: sum

117

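The map and reduce steps can be tried without a database. The sketch below is a minimal stand-in for a view engine, written for illustration only: `runView`, `findAges` and `sumValues` are hypothetical names, and `emit` is passed in as an argument here, whereas CouchDB supplies it as a global. The sample documents mirror the ages used on the slides.

```javascript
// Sample documents, mirroring the ages used on the slides.
const docs = [
  { _id: "a", age: "26" },
  { _id: "b", age: "32", heads: "3" },
  { _id: "c", age: "42" },
  { _id: "d", age: "17" },
];

// Minimal stand-in for a view engine: run map on every document,
// collect what it emits, then hand all values to reduce in one call.
function runView(documents, map, reduce) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });
  documents.forEach((doc) => map(doc, emit));
  return reduce(rows.map((r) => r.key), rows.map((r) => r.value));
}

// Map: emit the age of every document that has one.
function findAges(doc, emit) {
  if (typeof doc.age !== "undefined") {
    emit(doc._id, Number(doc.age));
  }
}

// Reduce: add up the emitted values.
function sumValues(keys, values) {
  return values.reduce((total, v) => total + v, 0);
}

const total = runView(docs, findAges, sumValues);
console.log(total); // 117
```

A real engine may split the rows across several reduce calls and combine the partial results afterwards. A plain sum survives that because addition can be applied in any grouping.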
So what?

The Machines They Lurn.

The Problem

Statistics Example

Mean and standard deviation of age

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$

Mapper: retrieve values, pre-process
Reducer: receive, process further

[
    [26, 676],
    [32, 1024],
    [42, 1764],
    [17, 289]
]

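Each pair above holds an age and its square, [x, x²]. A short derivation, taking the standard deviation σ as the square root of the population variance, suggests why the squares are worth pre-computing: the variance can be written using only n, the sum of the values and the sum of their squares.

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2
          = \frac{1}{n}\sum_{i=1}^{n}x_i^2
            - 2\mu\cdot\frac{1}{n}\sum_{i=1}^{n}x_i + \mu^2 \\
         &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \mu^2,
         \qquad\text{since } \frac{1}{n}\sum_{i=1}^{n}x_i = \mu .
\end{aligned}
```

For the four ages, n = 4, the sum of the values is 117 and the sum of the squares is 3753, so µ = 29.25, σ² = 938.25 − 855.5625 = 82.6875 and σ ≈ 9.09.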
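/**
 * Our mapper function.
 * emit() is supplied by the CouchDB view server; `view` is only a
 * holder name for this example.
 */
var view = {
  map: function (doc) {
    // Ages are stored as strings in the sample documents.
    var age = Number(doc.age);
    emit(null, [age, age * age]);
  },

  /**
   * Our reducer...
   * Note: rereduce is ignored, so the result is only correct when
   * every mapped value reaches the reducer in a single call.
   */
  reduce: function (keys, values, rereduce) {
    var N = 0;
    var summed = 0;
    var summedSquare = 0;

    for (var i = 0; i < values.length; i++) {
      N += 1;
      summed += values[i][0];
      summedSquare += values[i][1];
    }

    var mean = summed / N;
    var standard_deviation = Math.sqrt(
      (summedSquare / N) - (mean * mean)
    );

    return [mean, standard_deviation];
  }
};
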
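/**
 * Our mapper function.
 * emit() and sum() are supplied by the CouchDB view server; `view` is
 * only a holder name for this example.
 */
var view = {
  map: function (doc) {
    // Ages are stored as strings in the sample documents.
    var age = Number(doc.age);
    emit(null, [age, age * age]);
  },

  /**
   * Our reducer, written with the built-in sum() helper.
   * Note: rereduce is ignored, so this assumes a single reduce pass.
   */
  reduce: function (keys, values, rereduce) {
    var N = values.length;
    var summed = sum(values.map(function (v) { return v[0]; }));
    var summedSquares = sum(values.map(function (v) { return v[1]; }));

    var mean = summed / N;
    var standard_deviation = Math.sqrt(
      (summedSquares / N) - (mean * mean)
    );

    return [mean, standard_deviation];
  }
};
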
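A reducer that returns [mean, standard deviation] cannot safely be fed its own output, which matters whenever the engine reduces in several passes. One possible variation, sketched below in plain JavaScript and not taken from the slides, keeps the partial result as a [count, sum, sum of squares] triple and computes the statistics only at the very end. The function names `mapAge`, `combine` and `finalize` are made up for this example.

```javascript
// Map step: one [count, sum, sum of squares] triple per document.
function mapAge(doc) {
  const age = Number(doc.age);
  return [1, age, age * age];
}

// Reduce step: triples add component-wise, so the same function can
// combine raw map output or the output of earlier reduce calls.
function combine(triples) {
  return triples.reduce(
    (acc, t) => [acc[0] + t[0], acc[1] + t[1], acc[2] + t[2]],
    [0, 0, 0]
  );
}

// Final step, done once by the caller after all reducing is finished.
function finalize([n, total, totalSquares]) {
  const mean = total / n;
  const variance = totalSquares / n - mean * mean;
  return { mean, standardDeviation: Math.sqrt(variance) };
}

const ages = [{ age: "26" }, { age: "32" }, { age: "42" }, { age: "17" }];
const mapped = ages.map(mapAge);

// Reduce in two separate groups, then reduce the partial results again.
const partialA = combine(mapped.slice(0, 2));
const partialB = combine(mapped.slice(2));
const stats = finalize(combine([partialA, partialB]));

console.log(stats.mean, stats.standardDeviation);
```

Because the triples simply add up, reducing in one pass or in several gives the same totals. The trade-off is that the caller has to perform the final division and square root.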
Naive Bayes

Real Life Fraud

P(x_j = k | y = fraudulent)
P(x_j = k | y = normal)
P(y)

We need to:
Sum x_j = k, for each y,
to calculate P(x|y)

We need:
More than 1 mapper.

We need 4 mappers

Mapper #1:
∑1i P(x_j = k | y = fraudulent)

Mapper #2:
∑1i P(x_j = k | y = normal)

Mapper #3:
∑1i P(y = fraudulent)

Mapper #4:
∑1i P(y = normal)

Reducer: sums up the results for the parameters

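One way to read the four mappers is as four counters: each emits 1 when its condition holds for a record, and the reducer adds the ones up. The sketch below follows that reading on toy data invented for this example (a single categorical feature, `country`, and a `label` per transaction). It applies no smoothing and is meant as an illustration of the counting scheme, not as a complete classifier.

```javascript
// Toy training data, invented for illustration: one categorical
// feature x_j ("country") and a label y per transaction.
const transactions = [
  { country: "IE", label: "normal" },
  { country: "IE", label: "normal" },
  { country: "US", label: "fraudulent" },
  { country: "IE", label: "fraudulent" },
  { country: "US", label: "normal" },
  { country: "US", label: "fraudulent" },
];

// Each mapper emits 1 when its condition holds for a record, else 0.
const mappers = {
  featureAndFraud: (t, k) => (t.country === k && t.label === "fraudulent" ? 1 : 0),
  featureAndNormal: (t, k) => (t.country === k && t.label === "normal" ? 1 : 0),
  fraud: (t) => (t.label === "fraudulent" ? 1 : 0),
  normal: (t) => (t.label === "normal" ? 1 : 0),
};

// The reducer only has to add up what a mapper emitted.
const reduceSum = (values) => values.reduce((a, b) => a + b, 0);

// Run all four mappers for feature value k and turn counts into estimates.
function estimate(records, k) {
  const counts = {};
  for (const name of Object.keys(mappers)) {
    counts[name] = reduceSum(records.map((t) => mappers[name](t, k)));
  }
  return {
    counts,
    pFeatureGivenFraud: counts.featureAndFraud / counts.fraud,
    pFeatureGivenNormal: counts.featureAndNormal / counts.normal,
    pFraud: counts.fraud / records.length,
  };
}

const result = estimate(transactions, "US");
console.log(result);
```

Since every mapper output is a plain count, the partial sums from different machines can be added in any order before the probabilities are computed.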
Cluster Analysis

k-means

Mapper:
Divide vectors into subgroups, calculate d(p,q) between
vectors, find centroids, sum them up.

Reducer:
Sum up the sums, get new centroids.

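A single k-means iteration along these lines might look like the sketch below. It assumes Euclidean distance for d(p, q), and uses toy vectors and starting centroids invented for this example. The function names are hypothetical. Each mapper handles one subgroup and returns per-centroid sums and counts, and the reducer adds them up and divides to get the new centroids.

```javascript
// Euclidean distance d(p, q) between two vectors of equal length.
function distance(p, q) {
  return Math.sqrt(p.reduce((s, pi, i) => s + (pi - q[i]) ** 2, 0));
}

// Mapper: for one subgroup of vectors, assign each vector to its
// nearest centroid and keep a running sum and count per centroid.
function mapSubgroup(vectors, centroids) {
  const partial = centroids.map((c) => ({ sum: c.map(() => 0), count: 0 }));
  for (const v of vectors) {
    let best = 0;
    for (let c = 1; c < centroids.length; c++) {
      if (distance(v, centroids[c]) < distance(v, centroids[best])) best = c;
    }
    partial[best].sum = partial[best].sum.map((s, i) => s + v[i]);
    partial[best].count += 1;
  }
  return partial;
}

// Reducer: add the partial sums from all mappers, then divide by the
// counts to obtain the new centroids (empty clusters keep their old one).
function reduceCentroids(partials, centroids) {
  return centroids.map((old, c) => {
    const count = partials.reduce((n, p) => n + p[c].count, 0);
    if (count === 0) return old;
    return old.map((_, i) => partials.reduce((s, p) => s + p[c].sum[i], 0) / count);
  });
}

// Toy data, invented for illustration: two subgroups, two starting centroids.
const subgroups = [
  [[1, 1], [2, 1]],
  [[8, 9], [9, 8], [1, 2]],
];
const start = [[0, 0], [10, 10]];
const partials = subgroups.map((g) => mapSubgroup(g, start));
const next = reduceCentroids(partials, start);
console.log(next);
```

In practice the iteration would be repeated with `next` as the new starting centroids until they stop moving. Only the small per-centroid sums and counts travel from mappers to the reducer, not the vectors themselves.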
