Data platform ID generation

Jianhan on how we generate identifiers for the UK Parliament data platform.

  1. Unique Identifier Generation in Distributed Environment. Jianhan Zhu
  2. • In distributed systems, sequential IDs are not always an option
     • As short as possible for sharing
     • GUID of 36 characters could be too long: 00017071-8786-42a5-94d9-dc0f62f585fc
     • A balance between ID length and probability of collision
     • The shorter the ID, the higher the probability of collision
     [Figure: probability of collision (%), from 0 to 100, plotted against ID length, from 0 to 36]
  3. Birthday Paradox
     • For $n$ randomly chosen persons, the probability that at least two of them have the same birthday
     • $\text{Probability of collision} \approx 1 - e^{-n^2 / (2x)}$
     • $x$: all possible ID values
        • $x = 62^8$
        • 52 alphabetic and 10 numeric characters
        • ID of length 8
     • $n$: number of IDs we plan to have
        • Currently 40K, so a probability of collision: 0.0003%
        • If 1 million, the probability is 0.23%
        • Will be tens of millions or more in future
     • In triple store:
        • Generated ID: 1AFu55Hs
        • Prefix: https://id.parliament.uk
        • Resource URI: https://id.parliament.uk/1AFu55Hs
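The figures on the slide above can be reproduced with a few lines of Python. This is a minimal sketch of the birthday approximation as stated on the slide; the function name `collision_probability` and its parameters are illustrative and do not come from the deck.

```python
import math

ALPHABET_SIZE = 62  # 52 alphabetic + 10 numeric characters, as on the slide


def collision_probability(n: int, id_length: int, alphabet_size: int = ALPHABET_SIZE) -> float:
    """Birthday approximation: P ~ 1 - exp(-n^2 / (2x)), where x = alphabet_size ** id_length."""
    x = alphabet_size ** id_length
    # -expm1(-t) equals 1 - exp(-t) but keeps precision when t is very small.
    return -math.expm1(-(n * n) / (2 * x))


# Probability of at least one collision among n IDs of length 8.
for n in (40_000, 1_000_000, 10_000_000):
    print(f"n = {n:>10,}: {collision_probability(n, 8):.4%}")
```

For 1 million and 10 million IDs of length 8 this gives roughly 0.23% and 20.5%, in line with the slide figures.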
  4. • Results for different ID lengths:
     • Random data source: Crypto Random

     | ID length | Num of IDs generated before a collision (simulation) | Probability of collision (at number of IDs) |
     |---|---|---|
     | 5 | 36K | 51% (36K) |
     | 6 | 289K | 52% (289K) |
     | 7 | 2.3 million | 51% (2.3 million) |
     | 8 | Out of memory | 0.002% (100K); 0.06% (0.5 million); 0.23% (1 million); 5.56% (5 million); 20.5% (10 million) |
     | 9 | - | 0.37% (10 million); 8.82% (50 million); 30.88% (100 million) |
     | 10 | - | 0.59% (100 million); 2.35% (200 million); 5.22% (300 million); 13.84% (500 million); 44.88% (1000 million) |
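A simulation of the kind summarised in the table above could look like the following sketch. The deck does not show its code or name its platform, so Python's `secrets` module is used here as a stand-in for the "Crypto Random" source, and `generate_id` and `ids_until_collision` are hypothetical names.

```python
import secrets
import string

# 52 alphabetic + 10 numeric characters, as on the slides.
ALPHABET = string.ascii_letters + string.digits


def generate_id(length: int = 8) -> str:
    """Draw each character from the operating system's CSPRNG via the secrets module."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def ids_until_collision(length: int) -> int:
    """Generate IDs until one repeats; return how many were generated, including the repeat."""
    seen = set()
    while True:
        candidate = generate_id(length)
        if candidate in seen:
            return len(seen) + 1
        seen.add(candidate)


if __name__ == "__main__":
    # Length 4 keeps the run short; the result varies from run to run.
    print(ids_until_collision(4))
```

Every generated ID is held in memory until a collision occurs, which is consistent with the "Out of memory" entry for length 8: on average a collision only appears after millions of IDs.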
  5. • Data estimates on current triple store http://indexing.parliament.uk
        • 174 million triples
        • 9.2 million unique subjects (2.9 million blank nodes)

     | Subject prefix | Num of triples | Num of unique subjects | Average num of triples per subject |
     |---|---|---|---|
     | http://data.parliament.uk/pimsdata/ | 92,708,852 | 2,960,851 | 31.3 |
     | http://data.parliament.uk/edms/ | 24,196,297 | 1,939,024 | 12.5 |
     | http://hansard.intranet.data.parliament.uk/ | 18,115,694 | 552,505 | 32.8 |
     | http://tabledpq.indexing.parliament.uk/ | 6,967,006 | 191,173 | 36.4 |
     | http://data.parliament.uk/writtenparliamentaryquestion/ | 3,716,166 | 70,199 | 52.9 |
     | http://esid.parliament.uk/EUDocument/ | 3,247,707 | 149,035 | 21.8 |
     | http://data.parliament.uk/depositedpapers/ | 2,551,168 | 80,185 | 31.8 |
     | http://services.paperslaid.devci.dev.parliament.uk/ | 644,193 | 23,121 | 27.9 |
     | http://data.parliament.uk/terms/uncontrolled/ | 606,951 | 172,373 | 3.5 |
     | http://data.parliament.uk/resources/ | 490,192 | 31,509 | 15.6 |
     | http://data.parliament.uk/currentawareness/ | 487,227 | 22,044 | 22.1 |
     | http://paperslaidpoller.parliament.uk/ | 396,153 | 9,636 | 41.1 |
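To relate subject counts like these to ID length, the birthday approximation from the earlier slide can be inverted. The sketch below assumes a 1% collision tolerance purely for illustration; the deck states no such threshold, and `min_id_length` is a hypothetical helper.

```python
import math

ALPHABET_SIZE = 62  # 52 alphabetic + 10 numeric characters


def min_id_length(n: int, max_probability: float, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Smallest ID length whose approximate collision probability for n IDs is at most max_probability."""
    # Solve 1 - exp(-n^2 / (2x)) <= p for x:  x >= n^2 / (-2 * ln(1 - p))
    required_x = (n * n) / (-2 * math.log1p(-max_probability))
    length = 1
    while alphabet_size ** length < required_x:
        length += 1
    return length


# 9.2 million unique subjects, with an assumed (not source-stated) 1% tolerance.
print(min_id_length(9_200_000, 0.01))
```

Under that assumed tolerance, 9.2 million subjects would call for 9 characters, while 1 million still fits in 8.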
  6. • Conclusions:
        • 8-character IDs for the near future
        • Need to increase ID length to accommodate more IDs
           • At 1 million (0.23%)?
           • Data will be structured differently from previous two triple stores?
        • In future, add an ID collision check against the triple store if the performance impact is acceptable
     • Challenges:
        • If a collision occurred, how to spot it? (Log generated IDs?)
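The collision check and logging idea raised in the conclusions might be sketched as follows. This is an illustration under stated assumptions, not the platform's implementation: an in-memory set stands in for a lookup against the triple store, and `new_unique_id`, `resource_uri`, and `max_attempts` are hypothetical names.

```python
import logging
import secrets
import string

ALPHABET = string.ascii_letters + string.digits
PREFIX = "https://id.parliament.uk/"  # prefix from the slides, with a trailing slash added

logger = logging.getLogger("id_generation")


def new_unique_id(existing_ids: set, length: int = 8, max_attempts: int = 5) -> str:
    """Generate an ID, retrying when it is already present in existing_ids.

    existing_ids is a stand-in for a query against the triple store.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in existing_ids:
            existing_ids.add(candidate)
            return candidate
        # Logging each collision makes it possible to spot one after the fact.
        logger.warning("ID collision on %s (attempt %d)", candidate, attempt)
    raise RuntimeError(f"no unused ID found after {max_attempts} attempts")


def resource_uri(identifier: str) -> str:
    """Build the resource URI in the form shown on the slides."""
    return PREFIX + identifier


if __name__ == "__main__":
    known = set()
    print(resource_uri(new_unique_id(known)))
```

The cost of this approach is one existence lookup per generated ID, which is the performance trade-off the slide refers to.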
  7. Further Reading
     • https://en.wikipedia.org/wiki/Birthday_problem
     • https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator
     • https://en.wikipedia.org/wiki/Universally_unique_identifier
     • https://eager.io/blog/how-long-does-an-id-need-to-be/
     • https://github.com/twitter/snowflake
     • Parliament Data Platform: https://api.parliament.uk/openapi.json