2. • In distributed systems, sequential IDs are not always an option
• IDs should be as short as possible for sharing
• A 36-character GUID could be too long: 00017071-8786-42a5-94d9-dc0f62f585fc
• A balance between ID length and probability of collision
• The shorter the ID, the higher the probability of collision
[Figure: probability of collision (%), from 0 to 100, plotted against ID length, from 0 to 36]
3. Birthday Paradox
• For 𝑛 randomly chosen persons, the probability that at least two of them have the
same birthday
$\text{Probability of collision} \approx 1 - e^{-\frac{n^2}{2x}}$
• 𝑥: all possible ID values
• 𝑛: number of IDs we plan to have
$x = 62^8$
• 52 Alphabetic and 10 numeric characters
• ID of length 8
𝑛
• Currently 40K IDs, giving a probability of collision of 0.0003%
• At 1 million IDs, the probability is 0.23%
• Expected to reach tens of millions or more in future
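The figures above can be checked with a few lines of Python. This is a minimal sketch of the birthday approximation only; the function name `collision_probability` and its parameters are illustrative choices, not part of the original slides.

```python
import math

ALPHABET_SIZE = 62  # 52 alphabetic + 10 numeric characters


def collision_probability(n: int, id_length: int, alphabet_size: int = ALPHABET_SIZE) -> float:
    """Birthday approximation: 1 - exp(-n^2 / (2x)), where x = alphabet_size ** id_length."""
    x = alphabet_size ** id_length
    # -expm1(-t) equals 1 - exp(-t) but keeps precision when t is very small
    return -math.expm1(-(n * n) / (2 * x))


# Evaluate the approximation for 8-character IDs at several ID counts
for n in (40_000, 1_000_000, 10_000_000):
    print(f"n = {n:>10,}: {collision_probability(n, 8):.5%}")
```

Using `math.expm1` rather than `1 - math.exp(...)` avoids losing digits at small values of n, where the probability is close to zero.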
In triple store:
• Generated ID: 1AFu55Hs
• Prefix: https://id.parliament.uk
• Resource URI: https://id.parliament.uk/1AFu55Hs
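As an illustration of how such an ID and URI might be produced, the sketch below draws 8 characters from the 62-character alphabet using Python's `secrets` module. The module choice and the helper names `generate_id` and `resource_uri` are assumptions for this example; the slides do not show the platform's actual implementation.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 52 alphabetic + 10 numeric characters
PREFIX = "https://id.parliament.uk"  # prefix taken from the slide


def generate_id(length: int = 8) -> str:
    """Draw each character independently from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def resource_uri(local_id: str) -> str:
    """Join the prefix and the generated ID into a resource URI."""
    return f"{PREFIX}/{local_id}"


print(resource_uri(generate_id()))
```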
4. • Results for different ID lengths:
• Random data source: Crypto Random

| ID length | Num of IDs generated before a collision (simulation) | Probability of collision (at number of IDs) |
|---|---|---|
| 5 | 36K | 51% (36K) |
| 6 | 289K | 52% (289K) |
| 7 | 2.3 million | 51% (2.3 million) |
| 8 | Out of memory | 0.002% (100K) |
| | | 0.06% (0.5 million) |
| | | 0.23% (1 million) |
| | | 5.56% (5 million) |
| | | 20.5% (10 million) |
| 9 | - | 0.37% (10 million) |
| | | 8.82% (50 million) |
| | | 30.88% (100 million) |
| 10 | - | 0.59% (100 million) |
| | | 2.35% (200 million) |
| | | 5.22% (300 million) |
| | | 13.84% (500 million) |
| | | 44.88% (1000 million) |
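A simulation of this kind can be sketched as follows: generate random IDs, remember each one, and stop at the first repeat. This is an assumed reconstruction, not the code behind the results above; the function name `ids_until_first_collision` is hypothetical and Python's `secrets` module stands in for the Crypto Random source.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 characters


def ids_until_first_collision(id_length: int) -> int:
    """Generate random IDs until one repeats; return how many were generated, including the repeat."""
    seen = set()
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(id_length))
        if candidate in seen:
            return len(seen) + 1
        seen.add(candidate)


# Short lengths finish quickly; every generated ID is held in memory,
# so longer IDs need far more memory before a collision appears
for length in (3, 4, 5):
    print(length, ids_until_first_collision(length))
```

Each run gives a different count because the IDs are random, so a single run is only one sample of the distribution.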
5. • Data estimates on current triple store http://indexing.parliament.uk
• 174 million triples
• 9.2 million unique subjects (2.9 million blank nodes)

| Subject prefix | Num of triples | Num of unique subjects | Average num of triples per subject |
|---|---|---|---|
| http://data.parliament.uk/pimsdata/ | 92,708,852 | 2,960,851 | 31.3 |
| http://data.parliament.uk/edms/ | 24,196,297 | 1,939,024 | 12.5 |
| http://hansard.intranet.data.parliament.uk/ | 18,115,694 | 552,505 | 32.8 |
| http://tabledpq.indexing.parliament.uk/ | 6,967,006 | 191,173 | 36.4 |
| http://data.parliament.uk/writtenparliamentaryquestion/ | 3,716,166 | 70,199 | 52.9 |
| http://esid.parliament.uk/EUDocument/ | 3,247,707 | 149,035 | 21.8 |
| http://data.parliament.uk/depositedpapers/ | 2,551,168 | 80,185 | 31.8 |
| http://services.paperslaid.devci.dev.parliament.uk/ | 644,193 | 23,121 | 27.9 |
| http://data.parliament.uk/terms/uncontrolled/ | 606,951 | 172,373 | 3.5 |
| http://data.parliament.uk/resources/ | 490,192 | 31,509 | 15.6 |
| http://data.parliament.uk/currentawareness/ | 487,227 | 22,044 | 22.1 |
| http://paperslaidpoller.parliament.uk/ | 396,153 | 9,636 | 41.1 |
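To relate these counts to ID length, the birthday approximation can be evaluated at the 9.2 million unique subjects reported for the current store. The sketch below assumes, purely for illustration, that every subject would need a generated ID; the 1% threshold and the function names are hypothetical.

```python
import math


def collision_probability(n: int, id_length: int, alphabet_size: int = 62) -> float:
    """Birthday approximation: 1 - exp(-n^2 / (2x)), where x = alphabet_size ** id_length."""
    x = alphabet_size ** id_length
    return -math.expm1(-(n * n) / (2 * x))


def min_id_length(n: int, max_probability: float, alphabet_size: int = 62) -> int:
    """Smallest ID length whose estimated collision probability is at or below max_probability."""
    length = 1
    while collision_probability(n, length, alphabet_size) > max_probability:
        length += 1
    return length


SUBJECTS = 9_200_000  # unique subjects in the current triple store (from the slide)

for length in (8, 9, 10):
    print(length, f"{collision_probability(SUBJECTS, length):.2%}")

# Hypothetical acceptance threshold of 1%
print(min_id_length(SUBJECTS, 0.01))
```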
6. • Conclusions:
• 8-character IDs are sufficient for the near future
• The ID length will need to increase to accommodate more IDs
• At 1 million IDs (0.23% probability of collision)?
• Will data be structured differently from the previous two triple stores?
• In future, add an ID collision check against the triple store if the effect on performance is acceptable
• Challenges:
• If a collision occurs, how can it be spotted? (Log generated IDs?)
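One possible shape for a collision check with logging is sketched below. An in-memory `set` stands in for the lookup against the triple store, and the function name `generate_unique_id`, the retry limit, and the logger name are all assumptions made for this example.

```python
import logging
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 characters
logger = logging.getLogger("id_generator")


def generate_unique_id(existing_ids: set, length: int = 8, max_attempts: int = 5) -> str:
    """Generate an ID, retrying whenever the candidate already exists in existing_ids."""
    for attempt in range(1, max_attempts + 1):
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in existing_ids:
            existing_ids.add(candidate)
            logger.info("generated id %s", candidate)
            return candidate
        # Logging each collision makes it visible after the fact
        logger.warning("collision on %s (attempt %d)", candidate, attempt)
    raise RuntimeError(f"no unique ID found after {max_attempts} attempts")


store = set()  # stand-in for a lookup against the triple store
print(generate_unique_id(store))
```

In a real deployment the membership test would be a query against the store, which is where the performance cost mentioned above would arise.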
7. Further Reading
• https://en.wikipedia.org/wiki/Birthday_problem
• https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator
• https://en.wikipedia.org/wiki/Universally_unique_identifier
• https://eager.io/blog/how-long-does-an-id-need-to-be/
• https://github.com/twitter/snowflake
• Parliament Data Platform: https://api.parliament.uk/openapi.json