Seminar @ Ochanomizu University, October 14, 2010 (Thu.)




                 Yasuo Tabei (JST Minato ERATO Project)
                          Joint work with Takeaki Uno (NII),
            Masashi Sugiyama (TITECH), Koji Tsuda (AIST)
  Motivation
 - Large-scale data
 - Needs for all pairs similarity search method
 - Single sorting method, and its drawbacks
 Method
 - Multiple sorting method
 - Locality sensitive hashing
 - SketchSort
 Experiments
 - Comparison with other state-of-the-art methods
 - Use large-scale image datasets
 Image
- 80 million tiny images (Torralba et al., 2008)
- size: 32×32 pixels
 Chemical Compounds
- NCBI PubChem
- 28 million chemical compounds
 Genome Sequences
- NCBI Sequence Read Archive
- Large-scale genome sequences
  from various organisms
Image
          SIFT → vector
                   x=(0.3, -0.3, 0.5, 1.2, …)

Chemical Compound
          fingerprint → vector
                   x=(1, 0, 1, 0, 0, 0, 1, …)

Text, Protein, DNA/RNA, etc.
 Mapping vector to binary string (sketch)
- Conserve the distance in the original space


    x=(0.3, 0.1, 0.5, 0.6, 0.7, 1.2, -0.2,…)
                              Mapping
    s=1010010001110001010…
 Advantages
- Can keep giga-scale data in main memory
- Accelerate various algorithms
 Finding all neighbor pairs from vector data
- Given a set of data points
- Find all pairs within a distance $\varepsilon$:
             $(x_i, x_j)$ s.t. $\Delta(x_i, x_j) \le \varepsilon$
 Can build a neighborhood graph
- Vertex: a data-point
- Edge : a neighbor pair
Applications: semi-supervised learning, spectral
  clustering, ROI detection in images, retrieval of
  protein sequences, etc
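
A brute-force baseline for this problem can be sketched as follows. It is only a reference point, assuming non-zero vectors and the cosine distance 1 − cos; the function names are illustrative and do not come from the talk. It needs O(n²) distance computations, which is the cost the sketch-based approach tries to avoid.

```python
import math
from itertools import combinations


def cosine_distance(x, y):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def neighbor_pairs_bruteforce(points, eps, dist=cosine_distance):
    """Return all (i, j), i < j, with dist(points[i], points[j]) <= eps."""
    return [
        (i, j)
        for (i, x), (j, y) in combinations(enumerate(points), 2)
        if dist(x, y) <= eps
    ]


# Toy data (not from the slides): two tight clusters of two points each.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
edges = neighbor_pairs_bruteforce(points, eps=0.05)
```

Each returned pair is one edge of the neighborhood graph; the vertices are the indices into `points`.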


   Finding neighbor pairs by sorting
   Map vector data to sketches
     (a)Input         (b) Sort          (c) Scan neighbors
      1:101111         7:000000          7:000000
      2:110101         4:010000          4:010000
      3:110010         8:010110          8:010110 w
      4:010000        10:100100         10:100100
      5:101000         5:101000          5:101000
      6:111100         1:101111          1:101111
      7:000000         3:110010          3:110010 w
      8:010110         2:110101          2:110101
      9:110110         9:110110          9:110110
     10:100100         6:111100          6:111100
   Needs a large number of distance calculations to
    achieve reasonable accuracy
   Cannot derive an analytical estimate of the
    fraction of missing neighbors
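
The single sorting method and its weakness can be illustrated with a short sketch. It reuses the ten 6-bit sketches from the table above; the window width `w`, the threshold `d`, and the function names are assumptions made for this illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def single_sort_pairs(sketches, w, d):
    """Sort sketches, then compare each one with the next w in sorted order."""
    order = sorted(sketches, key=lambda i: sketches[i])
    found = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + 1 + w]:
            if hamming(sketches[i], sketches[j]) <= d:
                found.add((min(i, j), max(i, j)))
    return found


# Sketches from the slide, keyed by their ids.
sketches = {
    1: "101111", 2: "110101", 3: "110010", 4: "010000", 5: "101000",
    6: "111100", 7: "000000", 8: "010110", 9: "110110", 10: "100100",
}
narrow = single_sort_pairs(sketches, w=1, d=1)
wide = single_sort_pairs(sketches, w=2, d=1)
```

With these inputs, sketches 8 and 9 differ only in their first bit, so they end up far apart in sorted order and neither window width finds them. That is the kind of missed neighbor the slide refers to.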
  Motivation
 - Large-scale data
 - Needs for all pairs similarity search method
 - Single sorting method, and its drawbacks
 Method
 - Multiple sorting method
 - Locality sensitive hashing
 - SketchSort
 Experiments
 - Comparison with other state-of-the-art methods
 - Use large-scale image datasets
   Input: set of fixed-length strings S={s1,…,sn }
   Output: all pairs of strings within a Hamming
    distance d
   By applying radix sort, enumerate all pairs in O(n+m)
 - n: number of strings, m: output pairs
 Introduce block-wise masking technique for acceleration
   Sort strings by radix sort and divide them into equivalence
    classes: O(n)
   Draw edges between all strings within an equivalence class: O(m)
   Computational Complexity: O(n+m)

            Input              Sorted
            EMILY              ALICE
            DAVID              ALICE
            CHRIS              BOBBY
            ALICE              CHRIS
            DAVID              DAVID
            BOBBY              DAVID
            DAVID              DAVID
            ALICE              EMILY
   Equivalence classes: runs of identical strings in the sorted column
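
The sort-then-group step can be sketched as below, using the names from the example above. Python's built-in `sorted` stands in for radix sort here, so this sketch runs in O(n log n) rather than the O(n) stated on the slide; the function name is illustrative.

```python
from itertools import combinations, groupby


def equal_pairs(strings):
    """Return index pairs (i, j), i < j, whose strings are identical."""
    # Sorting brings identical strings next to each other.
    order = sorted(range(len(strings)), key=lambda i: strings[i])
    pairs = []
    # Each run of identical strings is one equivalence class.
    for _, group in groupby(order, key=lambda i: strings[i]):
        pairs.extend(combinations(sorted(group), 2))
    return pairs


names = ["EMILY", "DAVID", "CHRIS", "ALICE", "DAVID", "BOBBY", "DAVID", "ALICE"]
pairs = equal_pairs(names)
```

The two ALICE entries give one pair and the three DAVID entries give three, so the output size m is the sum of the pair counts over all classes.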
   Mask d characters in all possible ways
   Perform radix sort $\binom{l}{d}$ times
   Linear time in the number of strings
   Time exponential in d, polynomial in the string
    length l
   Ex)d=2 7:0000 0001 0011 1110   7:0000 0001 0011 1110
            4:0100   0001   1101   1100    4:0100   0001   1101   1100
            8:0101   1001   0111   1000    8:0101   1001   0111   1000
           10:1001   0011   1001   0111    5:1010   0010   1110   1010
            5:1010   0010   1110   1010    3:1100   1000   1101   1100
            1:1011   1111   0011   1110    6:1111   0011   1001   0111
            2:1101   0111   0111   0001   10:1001   0011   1001   0111
            3:1100   1000   1101   1100    2:1101   0111   0111   0001
            9:1101   1000   1101   1110    9:1101   1000   1101   1110
            6:1111   0011   1001   0111    1:1011   1111   0011   1110
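
Character-wise masking can be sketched as follows. For every choice of d masked positions, strings are grouped by their remaining characters, and any two strings in the same group differ in at most d positions. The input reuses the 6-bit sketches from the single sorting slide (0-based indices here), a dictionary lookup stands in for radix sort, and the function name is an assumption of this sketch.

```python
from collections import defaultdict
from itertools import combinations


def pairs_by_char_masking(strings, d):
    """All index pairs (i, j), i < j, within Hamming distance d."""
    length = len(strings[0])
    found = set()
    # One pass per way of masking d of the positions.
    for masked in combinations(range(length), d):
        keep = [p for p in range(length) if p not in masked]
        buckets = defaultdict(list)
        for i, s in enumerate(strings):
            buckets["".join(s[p] for p in keep)].append(i)
        # Strings sharing a bucket agree on every unmasked position.
        for ids in buckets.values():
            found.update(combinations(ids, 2))
    return found


sketches = ["101111", "110101", "110010", "010000", "101000",
            "111100", "000000", "010110", "110110", "100100"]
exact = pairs_by_char_masking(sketches, d=1)
```

No distance check is needed because matching on the unmasked positions already guarantees the bound, but the number of passes grows quickly with d.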
   Mask d blocks in all possible ways
   Reduces the number of sorting operations
   Non-neighbors may be detected
   Filter them out by calculating the actual Hamming distance
   Ex)d=2
             7:0000   0001 0011 1110    7:0000   0001 0011 1110 7:0000 0000 0000 1110
             4:0100   0001 1101 1100    4:0100   0001 1101 1100 4:0100 0000 0000 1100
             8:0101   1001 0111 1000    8:0101   1001 0111 1000 8:0101 0000 0000 1000
            10:1001   0011 1001 0111   10:1001   0011 1001 0111 10:1001 0000 0000 0111
             5:1010   0010 1110 1010    5:1010   0000 1110 0000 5:1010 0000 0000 1010
             1:1011   1111 0011 1110    1:1011   0000 0011 0000 1:1011 0000 0000 1110
             3:1100   1000 0000 0000    3:1100   0000 1101 0000 3:1100 0000 0000 1100
             2:1101   0111 0000 0000    2:1101   0000 0111 0000 2:1101 0000 0000 0001
             9:1101   1000 0000 0000    9:1101   0000 1101 0000 9:1101 0000 0000 1110
             6:1111   0011 0000 0000    6:1111   0000 1001 0000 6:1111 0000 0000 0111
             7:0000   0001 0011 0000    4:0000   0001 0000 1100 1:0000 0000 0011 1110
             4:0000   0001 1101 0000    7:0000   0001 0000 1110 7:0000 0000 0011 1110
             5:0000   0010 1110 0000    5:0000   0010 0000 1010 2:0000 0000 0111 0001
             6:0000   0011 1001 0000    6:0000   0011 0000 0111 8:0000 0000 0111 1000
            10:0000   0011 1001 0000   10:0000   0011 0000 0111 4:0000 0000 0111 1000
             2:0000   0111 0111 0000    2:0000   0111 0000 0001 6:0000 0000 1001 0111
             3:0000   1000 1101 0000    3:0000   1000 0000 1100 10:0000 0000 1001 0111
             9:0000   1000 1101 0000    9:0000   1000 0000 1110 3:0000 0000 1101 1100
             8:0000   1001 0111 0000    8:0000   1001 0000 1000 9:0000 0000 1101 1110
             1:0000   1111 0011 0000    1:0000   1111 0000 1110 5:0000 0000 1110 1010
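
Block-wise masking can be sketched in a few lines. The strings are split into k blocks, every combination of k − d unmasked blocks is used as a grouping key, and candidates are filtered by their actual Hamming distance. If two strings differ in at most d positions, at most d blocks can differ, so the pair collides under at least one combination. The helper names and the use of a set for de-duplication are assumptions of this sketch; the set is the memory-hungry naive approach, not the duplicate elimination described later in the talk.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def split_blocks(s, k):
    """Split s into k contiguous blocks of nearly equal length."""
    size, extra = divmod(len(s), k)
    blocks, start = [], 0
    for b in range(k):
        end = start + size + (1 if b < extra else 0)
        blocks.append(s[start:end])
        start = end
    return blocks


def multiple_sorting(strings, d, k):
    """All index pairs (i, j), i < j, within Hamming distance d."""
    blocks = [split_blocks(s, k) for s in strings]
    found = set()
    for kept in combinations(range(k), k - d):  # the other d blocks are masked
        buckets = defaultdict(list)
        for i, b in enumerate(blocks):
            buckets[tuple(b[p] for p in kept)].append(i)
        for ids in buckets.values():
            for i, j in combinations(ids, 2):
                # Masked blocks may hide more than d differences: verify.
                if hamming(strings[i], strings[j]) <= d:
                    found.add((i, j))
    return found


sketches = ["101111", "110101", "110010", "010000", "101000",
            "111100", "000000", "010110", "110110", "100100"]
pairs = multiple_sorting(sketches, d=1, k=3)
```

With k = 3 and d = 1 only three grouping passes are needed, compared with six passes when single characters of a 6-bit string are masked.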
 Step1. Perform radix sort on a block, and detect
        equivalence classes
 Step2. For each equivalence class, perform radix sort
        on the next block

                                                 Recursive
                                                         7:0000 0001 0011 0000
 7:0000   0001 0011 0000    7:0000   0001 0011 0000      4:0000 0001 1101 0000
 4:0000   0001 1101 0000    4:0000   0001 1101 0000
 5:0000   0010 1110 0000    5:0000   0010 1110 0000
 6:0000   0011 1001 0000    6:0000   0011 1001 0000      6:0000 0011 1001 0000
10:0000   0011 1001 0000   10:0000   0011 1001 0000     10:0000 0011 1001 0000
 2:0000   0111 0111 0000    2:0000   0111 0111 0000
 3:0000   1000 1101 0000    3:0000   1000 1101 0000
 9:0000   1000 1101 0000    9:0000   1000 1101 0000
 8:0000   1001 0111 0000    8:0000   1001 0111 0000      3:0000 1000 1101 0000
 1:0000   1111 0011 0000    1:0000   1111 0011 0000      9:0000 1000 1101 0000
   All neighbor pairs can be enumerated
                                                     7:0000   0001 0011 1110
                                                     4:0100   0001 1101 1100      2:1101 0111 0000 0000
                                                     8:0101   1001 0111 1000      9:1101 1000 0000 0000
                                                    10:1001   0011 1001 0111
                                                     5:1010   0010 1110 1010      2:1101 0111 0011 0000
                                                     1:1011   1111 0011 1110      9:1101 1000 0000 0000
                                                     3:1100   1000 0000 0000
                                                     2:1101   0111 0000 0000      2:1101 0111 0000 0001
                                                     9:1101   1000 0000 0000      9:1101 1000 0000 1110
                                                     6:1111   0011 0000 0000

                            7:0000 0001 0011 0000
                            4:0000 0001 1101 0000
                            6:0000 0011 1001 0000
                           10:0000 0011 1001 0000
 7:0000   0001 0011 0000                                                          1:0000 0000 0011 1110
                                                        1:0000   0000 0011 1110   7:0000 0000 0011 1110
 4:0000   0001 1101 0000    3:0000 1000 1101 0000
                                                        7:0000   0000 0011 1110
 5:0000   0010 1110 0000    9:0000 1000 1101 0000
                                                        2:0000   0000 0111 0001   2:0000 0000 0111 0001
 6:0000   0011 1001 0000
                                                        8:0000   0000 0111 1000   8:0000 0000 0111 1000
10:0000   0011 1001 0000    4:0000 0001 0011 1000
                                                        4:0000   0000 0111 1000   4:0000 0000 0111 1000
 2:0000   0111 0111 0000    7:0000 0001 1101 1110
                                                        6:0000   0000 1001 0111
 3:0000   1000 1101 0000
                            6:0000 0011 1001 0111      10:0000   0000 1001 0111
 9:0000   1000 1101 0000                                                          3:0000 0000 1101 1100
                           10:0000 0011 1001 0111       3:0000   0000 1101 1100
 8:0000   1001 0111 0000                                                          9:0000 0000 1101 1110
                                                        9:0000   0000 1101 1110
 1:0000   1111 0011 0000    3:0000 1000 1101 1100       5:0000   0000 1110 1010
                            9:0000 1000 1101 1110
   The same pair may be detected in different block combinations
     Ex) (3,9), (6,10)
   Naïve method takes n^2 memory
                                                     7:0000   0001 0011 1110
                                                     4:0100   0001 1101 1100   2:1101 0111 0000 0000
                                                     8:0101   1001 0111 1000   9:1101 1000 0000 0000
                                                    10:1001   0011 1001 0111
                                                     5:1010   0010 1110 1010   2:1101 0111 0011 0000
                                                     1:1011   1111 0011 1110   9:1101 1000 0000 0000
                                                     3:1100   1000 0000 0000
                                                     2:1101   0111 0000 0000   2:1101 0111 0000 0001
                                                     9:1101   1000 0000 0000   9:1101 1000 0000 1110
                                                     6:1111   0011 0000 0000

                            7:0000 0001 0011 0000
                            4:0000 0001 1101 0000
                            6:0000 0011 1001 0000                              1:0000 0000 0011 1110
                           10:0000 0011 1001 0000    1:0000   0000 0011 1110   7:0000 0000 0011 1110
 7:0000   0001 0011 0000                             7:0000   0000 0011 1110
 4:0000   0001 1101 0000    3:0000 1000 1101 0000    2:0000   0000 0111 0001   2:0000 0000 0111 0001
 5:0000   0010 1110 0000    9:0000 1000 1101 0000    8:0000   0000 0111 1000   8:0000 0000 0111 1000
 6:0000   0011 1001 0000                             4:0000   0000 0111 1000   4:0000 0000 0111 1000
10:0000   0011 1001 0000    4:0000 0001 0011 1000    6:0000   0000 1001 0111
 2:0000   0111 0111 0000    7:0000 0001 1101 1110                               6:1111 0011 1001 0111
                                                    10:0000   0000 1001 0111
 3:0000   1000 1101 0000                                                       10:1001 0011 1001 0111
                            6:0000 0011 1001 0111    3:0000   0000 1101 1100
 9:0000   1000 1101 0000                             9:0000   0000 1101 1110
                           10:0000 0011 1001 0111                              3:0000 0000 1101 1100
 8:0000   1001 0111 0000                             5:0000   0000 1110 1010
 1:0000   1111 0011 0000                                                       9:0000 0000 1101 1110
                            3:0000 1000 1101 1100
                            9:0000 1000 1101 1110
Step1: Make a total order among blocks from left to right
                                        1    2    3    4
                                    6:1111 0011 1001 0111
                                   10:1001 0011 1001 0111
Step2: Make a total order among block combinations
       (1,2)<(1,3)<(1,4)<(2,3)<(2,4)<(3,4)
Step3: Take the minimum among
      matched block combinations
Combination 1                Combination 2
      1     2    3     4       1       2     3     4
  6:1111 0011 1001 0111  6:1111 0011 1001 0111
 10:1001 0011 1001 0111 10:1001 0011 1001 0111
                     (1,2) < (1,4)
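
The three steps can be sketched as a memory-free duplicate test. A pair is reported only under the smallest combination of masked blocks that still lets it match. Reading the combinations as sets of masked blocks is an interpretation that is consistent with the (1,2) < (1,4) example; the function names are illustrative, and block indices are 0-based in the code while the slide counts from 1.

```python
from itertools import combinations


def canonical_mask(blocks_i, blocks_j, d):
    """Smallest combination of d masked blocks under which the pair matches.

    Returns None if more than d blocks differ, i.e. the pair never matches.
    """
    k = len(blocks_i)
    mismatched = {p for p in range(k) if blocks_i[p] != blocks_j[p]}
    # A mask works only if it covers every block where the strings differ.
    candidates = [c for c in combinations(range(k), d) if mismatched <= set(c)]
    return min(candidates) if candidates else None


def report_once(blocks_i, blocks_j, current_mask, d):
    """True only for the single mask under which the pair should be output."""
    return canonical_mask(blocks_i, blocks_j, d) == tuple(current_mask)


# Sketches 6 and 10 from the slide: only the first block differs.
s6 = ["1111", "0011", "1001", "0111"]
s10 = ["1001", "0011", "1001", "0111"]
first = canonical_mask(s6, s10, d=2)
```

Because the test uses only the two strings and the current mask, no table of already reported pairs has to be kept in memory.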
If the number of blocks is k-d:
 eliminate duplicate pairs, and
 calculate the Hamming distance

Call the function on each equivalence class
  Enumerate all neighbor pairs within a distance $\varepsilon$
- $(x_i, x_j)$, $i < j$, $\Delta(x_i, x_j) \le \varepsilon$
 Basic idea
 - Map vector data to sketches by locality sensitive hashing (LSH)
 - Enumerate all neighbor pairs by the multiple sorting method (MSM)
 SketchSort with cosine LSH
- Enumerate all neighbor pairs within a cosine
   distance threshold $\varepsilon$
- $(x_i, x_j)$, $i < j$, $\Delta(x_i, x_j) = 1 - \frac{x_i^{T} x_j}{\|x_i\| \|x_j\|} \le \varepsilon$
 Applicable to Euclidean distance (Raginsky,10) and
   Jaccard coefficients
 Basic idea
 Generate a random hyperplane through the origin
 Map a data point to ‘1’ if it lies above the hyperplane,
 or else to ‘0’
 Repeat l times to obtain a sketch of length l
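
A minimal sketch of this hashing step is shown below. It assumes that the hyperplane normals are drawn from a standard Gaussian, which is a common choice for this kind of hashing; the function names, the seed handling, and the sketch length of 32 are illustrative.

```python
import random


def make_hyperplanes(dim, length, seed=0):
    """Draw `length` random hyperplane normals in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]


def sketch(x, hyperplanes):
    """One bit per hyperplane: '1' if x lies on its positive side, else '0'."""
    return "".join(
        "1" if sum(w * v for w, v in zip(h, x)) >= 0 else "0"
        for h in hyperplanes
    )


planes = make_hyperplanes(dim=4, length=32, seed=1)
s = sketch((0.3, -0.3, 0.5, 1.2), planes)
```

Two vectors separated by an angle θ disagree on any one bit with probability θ/π, so the Hamming distance between sketches reflects the angle between the original vectors.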
   Basic idea: map vector data to sketches and apply MSM
   Not good: create long sketches and apply MSM at once
   Divide long sketches into Q short sketches of length l
    (chunks)
   Apply MSM to each chunk to obtain neighbor pairs w.r.t.
    Hamming distance
                l                  l                 l
         1100101010010101   0101010101001010   101010101010011
         0101010101010101   0101000101010101   010101010010100
         1001010101011110   1010101010101010   101010101010100
         1000001010101010   1010101010101010   101010010101011
         1111001010111110   1010101011111010   010101010101010 ......
         1010101010010101   0100111101010100   001111010001011
         1000010000100001   0011111101000100   010010010001111
         1111000001110101   0101001010101001   001000111100101
         0001010010101001   0100101011100100   101001010001000
         1111000100100010   0011010100010010   010010100010001




            MSM               MSM               MSM
       Report neighbor pairs within the cosine distance
        threshold ε
   A neighbor pair of sketches can be detected in
    several chunks within Hamming distance d
    si: 1100101010010101 0101010101001010 101010101010011 10101111010011 111010101001001 …
    sj: 0101010101010101 0101000101010101 010101010010100 11110101010011 111111111010011 …

    HamDist(si1,sj1)>d  HamDist(si2,sj2)≦d  HamDist(si3,sj3)>d  HamDist(si4,sj4)>d  HamDist(si5,sj5)≦d

        The same pair is output several times
        (duplication)
Step1: Order chunks from left to right.
            1                2                3              4              5
     1100101010010101 0101010101001010 101010101010011 10101111010011 111010101001001 …
     0101010101010101 0101000101010101 010101010010100 11110101010011 111111111010011 …

Step2: Check whether any chunk to the left is within
 Hamming distance d
            1               2                3             4               5
     1100101010010101 0101010101001010 101010101010011 10101111010011 111010101001001 …
     0101010101010101 0101000101010101 010101010010100 11110101010011 111111111010011 …

HamDist(si1,sj1)>d?  HamDist(si2,sj2)>d?  HamDist(si3,sj3)>d?  HamDist(si4,sj4)>d?  HamDist(si5,sj5)≦d

-   If such a chunk is found, discard the pair;
    otherwise check the cosine distance
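
The chunk-level check can be sketched as follows. A pair found in chunk q is kept only if no earlier chunk is already within Hamming distance d, so the pair is passed on exactly once, from the leftmost chunk that detects it. The 4-bit chunks are toy data rather than the sketches on the slide, and the function names are illustrative.

```python
def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def keep_pair_in_chunk(chunks_i, chunks_j, q, d):
    """Keep a pair found in chunk q only if every chunk to its left is > d."""
    return all(
        hamming(a, b) > d for a, b in zip(chunks_i[:q], chunks_j[:q])
    )


# Toy pair: chunks 1 and 3 (0-based) are both within distance 1.
ci = ["1100", "0101", "1010", "0011"]
cj = ["0011", "0111", "0101", "0010"]
kept_in_1 = keep_pair_in_chunk(ci, cj, q=1, d=1)
kept_in_3 = keep_pair_in_chunk(ci, cj, q=3, d=1)
```

Here the pair is kept when it is found in chunk 1 and discarded when it is found again in chunk 3; only the kept occurrence goes on to the cosine distance check.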
Block-level duplication?   Yes → Trash
       No ↓
Check Hamming distance     >d → Trash
       ≦d ↓
Chunk-level duplication?   Yes → Trash
       No ↓
Check cosine distance      >ε → Trash
       ≦ε ↓
Output
   Call function for each chunk
   Check duplication four times
   Divide sketches into equivalence classes
   Call function recursively
   True edges E*, our results E
   Type-I error (false positive): a non-neighbor pair has
    a Hamming distance within d in at least one chunk
   Type-II error (false negative): a neighbor pair has a
    Hamming distance larger than d in all chunks
  Basically, type-II error is more crucial
 - type-I errors are filtered out by distance calculations
 Missing edge ratio (type-II error) is bounded as




where p is an upper bound of the non-collision
 probability of neighbors
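
The bound itself did not survive in this transcript. One plausible form, offered only as an assumption and not as the talk's formula, follows from the type-II definition above. If each bit of a neighbor pair differs independently with probability at most p, a chunk of length l is within distance d with probability at least P[Binomial(l, p) ≤ d], and a pair is missed only if all Q chunks fail. The function names below are illustrative.

```python
from math import comb


def chunk_hit_probability(p, length, d):
    """P[Binomial(length, p) <= d]: one chunk is within Hamming distance d."""
    return sum(
        comb(length, k) * p**k * (1 - p) ** (length - k) for k in range(d + 1)
    )


def missing_edge_bound(p, length, d, num_chunks):
    """Upper bound on the miss probability, assuming independent bits."""
    return (1.0 - chunk_hit_probability(p, length, d)) ** num_chunks


# Example values only: 32-bit chunks, d = 3, per-bit non-collision bound 0.15.
bounds = [missing_edge_bound(0.15, 32, 3, q) for q in (2, 10, 50)]
```

Under this assumption the bound shrinks geometrically with the number of chunks, which would explain why adding chunks lowers the missing edge ratio.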
  Motivation
 - Large-scale data
 - Needs for All Pairs Similarity Search Method
 - Single Sorting Method, and its drawbacks
 Method
 - Multiple Sorting Method
 - Cosine Locality Sensitive Hashing
 - SketchSort
 Experiments
 - Comparison with other state-of-the-art methods
 - Use large-scale image datasets
  Two image datasets
 - MNIST (60,000 data points, 784 dimensions)
 - Tiny Image (100,000 data points, 960 dimensions)
 Use the missing edge ratio as the evaluation measure
 Set the cosine distance threshold to 0.15π
 Set the length of each chunk to 32 bits
 Set the Hamming distance and number of blocks
   to (2,5) and (3,6)
 Vary the number of chunks over 2, 4, 6, …, 50
 Compare our method to the Lanczos bisection method
  (JMLR, 2009)
  K-nearest neighbor graph construction by
   SketchSort
 - Keep the k nearest neighbor pairs in a priority queue
 Compare SketchSort to
 - Cover Tree (Beygelzimer et al., ICML 2006)
 - AllKNN (Ram et al., NIPS 2009)
 - Lanczos-bisection (JMLR, 2009)
 Set parameters so as to keep the missing edge ratio
  no more than 1.0×10^-6
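
Keeping only the k nearest neighbors with a priority queue can be sketched as below. The input is a stream of (i, j, distance) triples such as an all-pairs search would produce; the function name and the toy distances are assumptions of this sketch.

```python
import heapq
from collections import defaultdict


def knn_from_pairs(pairs, k):
    """Map each node to its k nearest neighbors as sorted (distance, id)."""
    heaps = defaultdict(list)  # per node: heap of (-distance, neighbor)
    for i, j, dist in pairs:
        for a, b in ((i, j), (j, i)):
            item = (-dist, b)
            if len(heaps[a]) < k:
                heapq.heappush(heaps[a], item)
            elif item > heaps[a][0]:
                # The heap root is the farthest kept neighbor: replace it.
                heapq.heapreplace(heaps[a], item)
    return {a: sorted((-nd, b) for nd, b in h) for a, h in heaps.items()}


# Toy neighbor pairs (not from the experiments).
pairs = [(0, 1, 0.3), (0, 2, 0.1), (0, 3, 0.2), (1, 2, 0.4)]
knn = knn_from_pairs(pairs, k=2)
```

Each node holds at most k entries at any time, so memory stays at O(nk) however many neighbor pairs are streamed through.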
 Detects similar pairs nearly exactly
 Takes only 4.3 hours for 1.6 million images
 (Figure panels: 0.05π, 0.1π, 0.15π)
 Fast all pairs similarity search method
 Applicable to large-scale vector data
 Various applications
 Software
- http://code.google.com/p/sketchsort/

More Related Content

Viewers also liked

Mlab2012 tabei 20120806
Mlab2012 tabei 20120806Mlab2012 tabei 20120806
Mlab2012 tabei 20120806Yasuo Tabei
 
Kdd2015reading-tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabeiYasuo Tabei
 
Ibisml2011 06-20
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20Yasuo Tabei
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306Yasuo Tabei
 
SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009Yasuo Tabei
 
WABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTreeWABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTreeYasuo Tabei
 
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesYasuo Tabei
 
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesYasuo Tabei
 
Lgm pakdd2011 public
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 publicYasuo Tabei
 
異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法MapR Technologies Japan
 
ウェーブレット木の世界
ウェーブレット木の世界ウェーブレット木の世界
ウェーブレット木の世界Preferred Networks
 
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)Shirou Maruyama
 
Residents i digitals hospitalet
Residents i digitals hospitaletResidents i digitals hospitalet
Residents i digitals hospitaletArnau Cerdà
 

Viewers also liked (20)

Mlab2012 tabei 20120806
Mlab2012 tabei 20120806Mlab2012 tabei 20120806
Mlab2012 tabei 20120806
 
Kdd2015reading-tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabei
 
Ibisml2011 06-20
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20
 
Lp Boost
Lp BoostLp Boost
Lp Boost
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306
 
SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009
 
WABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTreeWABI2012-SuccinctMultibitTree
WABI2012-SuccinctMultibitTree
 
Gwt sdm public
Gwt sdm publicGwt sdm public
Gwt sdm public
 
Lgm saarbrucken
Lgm saarbruckenLgm saarbrucken
Lgm saarbrucken
 
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
 
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
 
Lgm pakdd2011 public
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 public
 
異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法
 
ウェーブレット木の世界
ウェーブレット木の世界ウェーブレット木の世界
ウェーブレット木の世界
 
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
 
bigdata2012nlp okanohara
bigdata2012nlp okanoharabigdata2012nlp okanohara
bigdata2012nlp okanohara
 
Resistencia bacteriana a antibioticos
Resistencia bacteriana a antibioticosResistencia bacteriana a antibioticos
Resistencia bacteriana a antibioticos
 
Residents i digitals hospitalet
Residents i digitals hospitaletResidents i digitals hospitalet
Residents i digitals hospitalet
 
Colegio Sudamericano
Colegio SudamericanoColegio Sudamericano
Colegio Sudamericano
 
Pokus214
Pokus214Pokus214
Pokus214
 

Similar to Sketch sort ochadai20101015-public

Ashish thusoo evolution of big data architectures
Ashish thusoo   evolution of big data architecturesAshish thusoo   evolution of big data architectures
Ashish thusoo evolution of big data architecturesdrewz lin
 
Binary & Hexadecimal
Binary & HexadecimalBinary & Hexadecimal
Binary & Hexadecimalneptonia
 
Hash Functions, the MD5 Algorithm and the Future (SHA-3)
Hash Functions, the MD5 Algorithm and the Future (SHA-3)Hash Functions, the MD5 Algorithm and the Future (SHA-3)
Hash Functions, the MD5 Algorithm and the Future (SHA-3)Dylan Field
 
Sauron: DIY home security with Ruby!
Sauron: DIY home security with Ruby!Sauron: DIY home security with Ruby!
Sauron: DIY home security with Ruby!1337807
 
Abductive learning of quantized stochastic processes
Abductive learning of quantized stochastic processesAbductive learning of quantized stochastic processes
Abductive learning of quantized stochastic processesIshanu Chattopadhyay
 
Speech Reognition Using FPGA Technology
Speech Reognition Using FPGA TechnologySpeech Reognition Using FPGA Technology
Speech Reognition Using FPGA TechnologyCarlos
 
Computer hardware michael karbo
Computer hardware   michael karboComputer hardware   michael karbo
Computer hardware michael karboSecretTed
 
Andrew Goldberg. An Efficient Point-to–Point Shortest Path Algorithm
Andrew Goldberg. An Efficient Point-to–Point  Shortest Path AlgorithmAndrew Goldberg. An Efficient Point-to–Point  Shortest Path Algorithm
Andrew Goldberg. An Efficient Point-to–Point Shortest Path AlgorithmComputer Science Club
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video searchzukun
 
Introduction to Computer Lesson 7.0
Introduction to Computer Lesson 7.0Introduction to Computer Lesson 7.0
Introduction to Computer Lesson 7.0Von Ryan Sugatan
 
Exact Real Arithmetic for Tcl
Exact Real Arithmetic for TclExact Real Arithmetic for Tcl
Exact Real Arithmetic for Tclke9tv
 
Computer Security Lecture 5: Simplified Advanced Encryption Standard
Computer Security Lecture 5: Simplified Advanced Encryption StandardComputer Security Lecture 5: Simplified Advanced Encryption Standard
Computer Security Lecture 5: Simplified Advanced Encryption StandardMohamed Loey
 
Estado del Arte de la IA
Estado del Arte de la IAEstado del Arte de la IA
Estado del Arte de la IAPlain Concepts
 
Binary Mathematics Classwork and Hw
Binary Mathematics Classwork and HwBinary Mathematics Classwork and Hw
Binary Mathematics Classwork and HwJoji Thompson
 
Digital Cameras
Digital CamerasDigital Cameras
Digital CamerasMister D
 

Similar to Sketch sort ochadai20101015-public (20)

Ashish thusoo evolution of big data architectures
Ashish thusoo   evolution of big data architecturesAshish thusoo   evolution of big data architectures
Ashish thusoo evolution of big data architectures
 
Binary & Hexadecimal
Binary & HexadecimalBinary & Hexadecimal
Binary & Hexadecimal
 
Hash Functions, the MD5 Algorithm and the Future (SHA-3)
Hash Functions, the MD5 Algorithm and the Future (SHA-3)Hash Functions, the MD5 Algorithm and the Future (SHA-3)
Hash Functions, the MD5 Algorithm and the Future (SHA-3)
 
Sauron: DIY home security with Ruby!
Sauron: DIY home security with Ruby!Sauron: DIY home security with Ruby!
Sauron: DIY home security with Ruby!
 
Abductive learning of quantized stochastic processes
Abductive learning of quantized stochastic processesAbductive learning of quantized stochastic processes
Abductive learning of quantized stochastic processes
 
Speech Reognition Using FPGA Technology
Speech Reognition Using FPGA TechnologySpeech Reognition Using FPGA Technology
Speech Reognition Using FPGA Technology
 
Computer hardware michael karbo
Computer hardware   michael karboComputer hardware   michael karbo
Computer hardware michael karbo
 
Project lfsr
Project lfsrProject lfsr
Project lfsr
 
Andrew Goldberg. An Efficient Point-to–Point Shortest Path Algorithm
Andrew Goldberg. An Efficient Point-to–Point  Shortest Path AlgorithmAndrew Goldberg. An Efficient Point-to–Point  Shortest Path Algorithm
Andrew Goldberg. An Efficient Point-to–Point Shortest Path Algorithm
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video search
 
the RAID
the RAIDthe RAID
the RAID
 
05 2 관계논리비트연산
05 2 관계논리비트연산05 2 관계논리비트연산
05 2 관계논리비트연산
 
Introduction to Computer Lesson 7.0
Introduction to Computer Lesson 7.0Introduction to Computer Lesson 7.0
Introduction to Computer Lesson 7.0
 
Week1m
Week1mWeek1m
Week1m
 
Exact Real Arithmetic for Tcl
Exact Real Arithmetic for TclExact Real Arithmetic for Tcl
Exact Real Arithmetic for Tcl
 
Computer Security Lecture 5: Simplified Advanced Encryption Standard
Computer Security Lecture 5: Simplified Advanced Encryption StandardComputer Security Lecture 5: Simplified Advanced Encryption Standard
Computer Security Lecture 5: Simplified Advanced Encryption Standard
 
Estado del Arte de la IA
Estado del Arte de la IAEstado del Arte de la IA
Estado del Arte de la IA
 
Binary Mathematics Classwork and Hw
Binary Mathematics Classwork and HwBinary Mathematics Classwork and Hw
Binary Mathematics Classwork and Hw
 
Digital Cameras
Digital CamerasDigital Cameras
Digital Cameras
 
Number systems tutorial
Number systems tutorialNumber systems tutorial
Number systems tutorial
 

Recently uploaded

Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWQuiz Club NITW
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfPrerana Jadhav
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxMichelleTuguinay1
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17Celine George
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
Sketch sort ochadai20101015-public

  • 1. Seminar @ Ochanomizu University, October 14, 2010 (Thu.). Yasuo Tabei (JST Minato ERATO Project). Joint work with Takeaki Uno (NII), Masashi Sugiyama (TITECH), Koji Tsuda (AIST).
  • 2. Motivation: large-scale data; need for an all-pairs similarity search method; the single sorting method and its drawbacks. Method: multiple sorting method; locality sensitive hashing; SketchSort. Experiments: comparison with other state-of-the-art methods; use of large-scale image datasets.
  • 3. Image: 80 million tiny images (Torralba et al., 2008), size 32×32 pixels. Chemical compounds: NCBI PubChem, 28 million chemical compounds. Genome sequences: NCBI Sequence Read Archive, large-scale genome sequences from various organisms.
  • 4. Image → SIFT → vector x = (0.3, -0.3, 0.5, 1.2, …). Chemical compound → fingerprint → x = (1, 0, 1, 0, 0, 0, 1, …). Text, protein, DNA/RNA, etc.
  • 5. Mapping a vector to a binary string (sketch) that conserves the distance in the original space: x = (0.3, 0.1, 0.5, 0.6, 0.7, 1.2, -0.2, …) → s = 1010010001110001010… Advantages: giga-scale data can be kept in main memory; various algorithms are accelerated.
  • 6. Finding all neighbor pairs from vector data: given a set of data points, find all pairs within a distance ε, i.e. all $(x_i, x_j)$ s.t. $\Delta(x_i, x_j) \le \varepsilon$.
  • 7. Can build a neighborhood graph (vertex: a data point; edge: a neighbor pair). Applications: semi-supervised learning, spectral clustering, ROI detection in images, retrieval of protein sequences, etc.
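As a point of reference for the problem stated on this slide, a brute-force baseline can be sketched as follows. It is only an illustration: the function name `all_pairs_brute_force` is hypothetical, and Euclidean distance (`math.dist`) is assumed as a default for Δ, which the slide leaves generic. It compares all n(n-1)/2 pairs, which is what the methods in this talk try to avoid.

```python
from itertools import combinations
from math import dist  # Euclidean distance, assumed here as one possible choice of Δ

def all_pairs_brute_force(points, eps, distance=dist):
    """Return all index pairs (i, j), i < j, with distance(points[i], points[j]) <= eps."""
    return [(i, j)
            for (i, x), (j, y) in combinations(enumerate(points), 2)
            if distance(x, y) <= eps]

# Toy data (made up for illustration): two tight clusters.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.1)]
pairs = all_pairs_brute_force(points, eps=0.2)
print(pairs)  # [(0, 1), (2, 3)]
```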
  • 8. Finding neighbor pairs by sorting: map vector data to sketches, sort the sketches, and scan neighbors within a window of width w in the sorted list.
      (a) Input      (b) Sort       (c) Scan neighbors
      1:101111       7:000000       7:000000
      2:110101       4:010000       4:010000
      3:110010       8:010110       8:010110
      4:010000       10:100100      10:100100
      5:101000       5:101000       5:101000
      6:111100       1:101111       1:101111
      7:000000       3:110010       3:110010
      8:010110       2:110101       2:110101
      9:110110       9:110110       9:110110
      10:100100      6:111100       6:111100
  • 9. Drawbacks: a large number of distance calculations is needed to achieve reasonable accuracy; an analytical estimate of the fraction of missing neighbors cannot be derived.
  • 10. Motivation: large-scale data; need for an all-pairs similarity search method; the single sorting method and its drawbacks. Method: multiple sorting method; locality sensitive hashing; SketchSort. Experiments: comparison with other state-of-the-art methods; use of large-scale image datasets.
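The sort-and-scan idea on this slide can be sketched roughly as below, using the ten 6-bit sketches from the slide. The function names, the window width `w=2` and the Hamming threshold `d=1` are assumptions chosen for illustration, not values from the talk.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def single_sort_pairs(sketches, w, d):
    """Sort sketches, then compare each one only with its next w neighbors in sorted order."""
    order = sorted(sketches, key=lambda i: sketches[i])
    found = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + 1 + w]:
            if hamming(sketches[i], sketches[j]) <= d:
                found.add((min(i, j), max(i, j)))
    return found

# Sketches from the slide (id: bits).
sketches = {1: "101111", 2: "110101", 3: "110010", 4: "010000", 5: "101000",
            6: "111100", 7: "000000", 8: "010110", 9: "110110", 10: "100100"}
pairs = single_sort_pairs(sketches, w=2, d=1)
print(pairs)  # contains (3, 9) and (4, 7)
```

Sketches 8 and 9 differ in a single bit but sit far apart in the sorted list, so this scan misses them. Neighbors that differ in a leading bit are exactly what a single sort cannot bring together.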
  • 11. Input: a set of fixed-length strings S = {s1, …, sn}. Output: all pairs of strings within a Hamming distance d. By applying radix sort, all pairs are enumerated in O(n+m), where n is the number of strings and m the number of output pairs. A block-wise masking technique is introduced for acceleration.
  • 12. Sort the strings by radix sort and divide them into equivalence classes: O(n). Draw edges between all strings in an equivalence class: O(m). Computational complexity: O(n+m). [Figure: the names ALICE, BOBBY, CHRIS, DAVID and EMILY, with repeats, are sorted and grouped into equivalence classes.]
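A minimal sketch of the sort-then-group step for identical strings is given below. Python's built-in `sorted` is used as a stand-in for radix sort (so the sort itself is not O(n) here), and the input list is a hypothetical ordering of the names in the figure, since the original order is not recoverable from the transcript.

```python
from itertools import combinations, groupby

def equal_pairs(strings):
    """Sort string indices, split them into equivalence classes, and pair up each class."""
    order = sorted(range(len(strings)), key=lambda i: strings[i])  # stand-in for radix sort
    pairs = []
    for _, group in groupby(order, key=lambda i: strings[i]):
        members = list(group)                   # one equivalence class
        pairs.extend(combinations(members, 2))  # edges within the class
    return pairs

# Hypothetical input order; only the set of names comes from the slide.
names = ["EMILY", "ALICE", "DAVID", "ALICE", "CHRIS", "BOBBY", "DAVID", "DAVID"]
pairs = equal_pairs(names)
print(pairs)  # [(1, 3), (2, 6), (2, 7), (6, 7)]
```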
  • 13. Mask d characters in all possible ways and perform radix sort $\binom{l}{d}$ times. Linear time in the number of strings; time exponential in d and polynomial in the string length l. Ex) d=2. Example sketches (id:bits): 7:0000 0001 0011 1110, 4:0100 0001 1101 1100, 8:0101 1001 0111 1000, 10:1001 0011 1001 0111, 5:1010 0010 1110 1010, 1:1011 1111 0011 1110, 3:1100 1000 1101 1100, 2:1101 0111 0111 0001, 9:1101 1000 1101 1110, 6:1111 0011 1001 0111. [Figure: the ten sketches shown in two different sort orders.]
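The character-masking idea can be illustrated with the short sketch below. It groups strings with a dictionary instead of a radix sort, which is a substitution made for brevity; the function name and the toy strings are hypothetical. Two strings are within Hamming distance d exactly when they agree on all positions outside some set of d masked positions, so the result is exact.

```python
from collections import defaultdict
from itertools import combinations

def pairs_within_d_by_masking(strings, d):
    """Enumerate index pairs (i, j), i < j, whose Hamming distance is at most d."""
    l = len(strings[0])
    found = set()
    for masked in combinations(range(l), d):  # C(l, d) masks, one grouping pass each
        kept = [p for p in range(l) if p not in masked]
        classes = defaultdict(list)
        for i, s in enumerate(strings):
            classes["".join(s[p] for p in kept)].append(i)
        for members in classes.values():
            found.update(combinations(members, 2))  # equal outside the mask
    return found

strings = ["0000", "0001", "0011", "1111"]  # toy input, not from the slides
print(pairs_within_d_by_masking(strings, d=1))  # contains (0, 1) and (1, 2)
```

The number of passes grows as C(l, d), which is the cost that block-wise masking is meant to reduce.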
  • 14. Mask d blocks in all possible ways. This reduces the number of sorting operations, but non-neighbors might be detected; they are filtered out by calculating the actual Hamming distance. Ex) d=2. [Figure: the ten example sketches, split into four 4-bit blocks, shown with two blocks masked (set to 0000) in each of the six possible ways.]
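A rough sketch of block-wise masking follows, using the ten 16-bit sketches from the previous slide with k=4 blocks and d=2. If two sketches differ in at most d bits, at most d blocks can differ, so they agree on at least k-d whole blocks; grouping on every choice of k-d unmasked blocks therefore finds every neighbor pair, and the Hamming check removes false candidates. The function names are hypothetical, a dictionary replaces radix sort, and the sketch length is assumed to be divisible by k.

```python
from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def msm_block_pairs(sketches, d, k):
    """Find all id pairs within Hamming distance d by masking d of k blocks."""
    length = len(next(iter(sketches.values())))
    size = length // k  # assumes length is divisible by k
    blocks = {i: [s[b * size:(b + 1) * size] for b in range(k)]
              for i, s in sketches.items()}
    found = set()
    for kept in combinations(range(k), k - d):  # unmasked blocks
        classes = defaultdict(list)
        for i, b in blocks.items():
            classes[tuple(b[p] for p in kept)].append(i)
        for members in classes.values():
            for i, j in combinations(sorted(members), 2):
                if hamming(sketches[i], sketches[j]) <= d:  # filter non-neighbors
                    found.add((i, j))
    return found

sketches = {7: "0000000100111110", 4: "0100000111011100", 8: "0101100101111000",
            10: "1001001110010111", 5: "1010001011101010", 1: "1011111100111110",
            3: "1100100011011100", 2: "1101011101110001", 9: "1101100011011110",
            6: "1111001110010111"}
pairs = msm_block_pairs(sketches, d=2, k=4)
print(sorted(pairs))
```

Collecting results in a set hides the fact that the same pair is found under several block combinations; storing all reported pairs in this way is the memory cost that the duplicate-elimination step of the talk avoids.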
  • 15. Step 1: perform radix sort on a block and detect equivalence classes. Step 2: for each equivalence class, perform radix sort on the next block. Apply recursively. [Figure: the example sketches with blocks 1 and 4 masked, sorted block by block.]
  • 16. All neighbor pairs can be enumerated. [Figure: equivalence classes found under each block combination for the example sketches.]
  • 17. The same pair may be detected in different block combinations. Ex) (3,9), (6,10). A naïve method takes $n^2$ memory. [Figure: equivalence classes under each block combination, with the duplicated pairs highlighted.]
  • 18. Step 1: make a total order among blocks from left to right (1, 2, 3, 4). Step 2: make a total order among block combinations: (1,2) < (1,3) < (1,4) < (2,3) < (2,4) < (3,4). Step 3: take the minimum among the matched block combinations. Ex) sketches 6:1111 0011 1001 0111 and 10:1001 0011 1001 0111 match under combination 1, (1,2), and combination 2, (1,4); since (1,2) < (1,4), the minimum is (1,2).
  • 19. If the number of blocks is k-d: eliminate duplicate pairs and calculate the Hamming distance. Call the function on the equivalence classes.
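One way to read the three steps above is sketched below: a pair is reported only under the smallest block combination for which it matches, so no table of already reported pairs is needed. This is an interpretation, not the authors' code. It assumes that the combinations on the slide index the masked blocks, it uses 0-based block indices, and the function names are hypothetical.

```python
from itertools import combinations

def canonical_mask(blocks_a, blocks_b, d):
    """Smallest d-combination of masked blocks that hides every mismatching block."""
    k = len(blocks_a)
    mismatched = {p for p in range(k) if blocks_a[p] != blocks_b[p]}
    candidates = [c for c in combinations(range(k), d) if mismatched <= set(c)]
    return min(candidates) if candidates else None  # None: more than d blocks differ

def should_report(blocks_a, blocks_b, current_mask, d):
    """Report the pair only when the current mask is the canonical (minimum) one."""
    return canonical_mask(blocks_a, blocks_b, d) == tuple(current_mask)

# Sketches 6 and 10 from the slide, split into four blocks.
s6 = ["1111", "0011", "1001", "0111"]
s10 = ["1001", "0011", "1001", "0111"]
print(canonical_mask(s6, s10, d=2))         # (0, 1), i.e. blocks (1,2) in the slide's numbering
print(should_report(s6, s10, (0, 3), d=2))  # False: a duplicate detection
```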
  • 20. Enumerates all neighbor pairs within a distance ε: $(x_i, x_j)$, $i < j$, $\Delta(x_i, x_j) \le \varepsilon$. Basic idea: map vector data to sketches by LSH, then enumerate all neighbor pairs by MSM (multiple sorting method). SketchSort with cosine LSH: enumerate all neighbor pairs within a cosine distance threshold ε, i.e. $(x_i, x_j)$, $i < j$, $\Delta(x_i, x_j) = 1 - \frac{x_i^T x_j}{\|x_i\| \|x_j\|} \le \varepsilon$. Applicable to Euclidean distance (Raginsky, 10) and Jaccard coefficients.
  • 21. Basic idea: generate a random hyperplane centered at 0; map a data point to '1' if it is above the hyperplane, or else to '0'; repeat l times.
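The random-hyperplane mapping described on this slide can be sketched as follows. The function names, the Gaussian sampling of hyperplane normals and the fixed seed are assumptions made for the example. For this family of hash functions, the probability that one bit differs between two vectors is their angle divided by π, which is why Hamming distance between sketches tracks angular similarity.

```python
import random

def random_hyperplanes(dim, l, seed=0):
    """Draw l random hyperplanes through the origin, each given by a Gaussian normal vector."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(l)]

def cosine_lsh_sketch(x, hyperplanes):
    """One bit per hyperplane: '1' if x lies on its non-negative side, else '0'."""
    return "".join(
        "1" if sum(w_i * x_i for w_i, x_i in zip(w, x)) >= 0 else "0"
        for w in hyperplanes
    )

planes = random_hyperplanes(dim=4, l=32)
x = [0.3, -0.3, 0.5, 1.2]
sketch_x = cosine_lsh_sketch(x, planes)
print(len(sketch_x))  # 32
```

Scaling a vector by a positive constant leaves its sketch unchanged, since only the side of each hyperplane matters.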
  • 22. Basic idea: map vector data to sketches and apply MSM. Not good: create long sketches and apply MSM at once. Instead, divide the long sketches into Q short sketches of length l (chunks), and apply MSM to each chunk to obtain neighbor pairs w.r.t. Hamming distance. Report neighbor pairs no more than a cosine distance threshold ε apart. [Figure: sketches divided into chunks of length l, with MSM applied to each chunk.]
  • 23. A neighbor pair of sketches can be detected in several chunks within Hamming distance d, so the same pair is output several times (duplication). Ex) for sketches si and sj: HamDist(si1,sj1) > d, HamDist(si2,sj2) ≤ d, HamDist(si3,sj3) > d, HamDist(si4,sj4) > d, HamDist(si5,sj5) ≤ d, so the pair is detected in chunks 2 and 5.
  • 24. Step 1: order chunks from left to right (1, 2, 3, 4, 5). Step 2: check whether any chunk to the left is within Hamming distance d. Ex) for a pair detected in chunk 5 (HamDist(si5,sj5) ≤ d), check HamDist(si1,sj1) > d?, HamDist(si2,sj2) > d?, HamDist(si3,sj3) > d?, HamDist(si4,sj4) > d?. If such a chunk is found, trash the pair; otherwise check the cosine distance.
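The chunk-level check on this slide amounts to reporting a pair only in the leftmost chunk where it is within Hamming distance d. A small sketch under that reading is shown below; the function names and the toy chunks are hypothetical, and chunk indices are 0-based.

```python
def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def first_matching_chunk(chunks_a, chunks_b, d):
    """Index of the leftmost chunk in which the pair is within Hamming distance d, or None."""
    for q, (a, b) in enumerate(zip(chunks_a, chunks_b)):
        if hamming(a, b) <= d:
            return q
    return None

def report_in_chunk(chunks_a, chunks_b, q, d):
    """Keep a pair detected in chunk q only if no chunk to its left also detects it."""
    return first_matching_chunk(chunks_a, chunks_b, d) == q

# Toy chunks (made up): the pair is within d=1 in chunks 1 and 2, but not in chunk 0.
si = ["1100", "0101", "1010"]
sj = ["0011", "0100", "1011"]
print(report_in_chunk(si, sj, q=1, d=1))  # True: leftmost detection, go on to the cosine check
print(report_in_chunk(si, sj, q=2, d=1))  # False: duplicate, trash
```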
  • 25. Flow for each candidate pair: block-level duplication? Yes → trash; No → check Hamming distance: > d → trash; ≤ d → chunk-level duplication? Yes → trash; No → check cosine distance: > ε → trash; ≤ ε → output.
  • 26. Call the function for each chunk; check duplication four times; divide sketches into equivalence classes; call the function recursively.
  • 27. True edges E*, our results E. Type-I error (false positive): a non-neighbor pair has a Hamming distance within d in at least one chunk. Type-II error (false negative): a neighbor pair has a Hamming distance larger than d in all chunks.
  • 28. Basically, type-II error is more crucial, since type-I errors are filtered out by distance calculations. The missing edge ratio (type-II error) is bounded as […], where p is an upper bound of the non-collision probability of neighbors.
  • 29. Motivation: large-scale data; need for an all-pairs similarity search method; the single sorting method and its drawbacks. Method: multiple sorting method; cosine locality sensitive hashing; SketchSort. Experiments: comparison with other state-of-the-art methods; use of large-scale image datasets.
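For measuring the type-II error empirically, one plausible definition of the missing edge ratio is the fraction of true edges E* that do not appear in the result E. The sketch below assumes that definition; it is not the analytical bound from the slide, and the function name and toy edge lists are hypothetical.

```python
def missing_edge_ratio(true_edges, found_edges):
    """Fraction of true neighbor pairs (E*) that are absent from the result (E)."""
    true_set = {tuple(sorted(e)) for e in true_edges}    # treat pairs as unordered
    found_set = {tuple(sorted(e)) for e in found_edges}
    if not true_set:
        return 0.0
    return len(true_set - found_set) / len(true_set)

# Toy example (made up): one of four true edges is missed.
true_edges = [(1, 2), (2, 3), (3, 9), (6, 10)]
found_edges = [(2, 1), (3, 9), (6, 10)]
ratio = missing_edge_ratio(true_edges, found_edges)
print(ratio)  # 0.25
```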
  • 30. Two image datasets: MNIST (60,000 data points, 784 dimensions) and Tiny Image (100,000 data points, 960 dimensions). The missing edge ratio is used as the evaluation measure. The cosine distance threshold is set to 0.15π and the length of each chunk to 32 bits. Hamming distance and number of blocks are set to (2,5) and (3,6). The number of chunks is varied over 2, 4, 6, …, 50. Our method is compared to the Lanczos bisection method (JMLR, 2009).
  • 31.
  • 32.
  • 33. K-nearest neighbor graph construction by SketchSort: keep k-nearest neighbor pairs by priority queue. Compare SketchSort to Cover Tree (Beygelzimer et al., ICML 2006), AllKNN (Ram et al., NIPS 2009), and Lanczos-bisection (JMLR, 2009).
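Keeping the k nearest neighbors of each point with a priority queue could look like the sketch below. It assumes the neighbor pairs arrive as (i, j, distance) triples, for example from an all-pairs search; the function name and the toy pairs are hypothetical. Each node holds a bounded max-heap, implemented with `heapq` on negated distances, so the current worst neighbor is always at the root.

```python
import heapq
from collections import defaultdict

def knn_from_pairs(pairs_with_dist, k):
    """Build a k-nearest-neighbor list per node from a stream of (i, j, distance) pairs."""
    heaps = defaultdict(list)  # node -> heap of (-distance, neighbor)
    for i, j, dist in pairs_with_dist:
        for u, v in ((i, j), (j, i)):
            if len(heaps[u]) < k:
                heapq.heappush(heaps[u], (-dist, v))
            elif dist < -heaps[u][0][0]:  # closer than the current worst neighbor
                heapq.heapreplace(heaps[u], (-dist, v))
    return {u: sorted((-nd, v) for nd, v in h) for u, h in heaps.items()}

# Toy pairs (made up for illustration).
pairs = [(0, 1, 0.1), (0, 2, 0.3), (0, 3, 0.2), (1, 2, 0.05)]
graph = knn_from_pairs(pairs, k=2)
print(graph[0])  # [(0.1, 1), (0.2, 3)]
```

A node can end up with fewer than k neighbors if fewer than k pairs involving it are supplied.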
  • 34.
  • 35.
  • 36. Parameters are set so as to keep the missing edge ratio no more than $1.0 \times 10^{-6}$. This enables similar pairs to be detected nearly exactly. It takes only 4.3 hours for 1.6 million images. [Figure: results for thresholds 0.05π, 0.1π and 0.15π.]
  • 37. Fast all-pairs similarity search method; applicable to large-scale vector data; various applications. Software: http://code.google.com/p/sketchsort/