Unlocking the Indexing and Search Data Goldmine

Show and tell slides by Jianhan Zhu

  1. 1. Unlocking Indexing and Search Data Goldmine
  2. 2. Written question data
     • 1.5 million written questions
     • Many fields; we currently use only:
       • uri: unique identifier. A question is given a uri when it is tabled; later the tabled record is deleted and an answered question is created with a new uri
       • uin: not a unique identifier; can be reused in different sessions, and can be missing
       • title: can be missing
       • questionText
       • answerText
       • askingMember_ses: members may share the same ses id; disambiguate by their incumbency dates
       • answeringMember_ses: members may share the same ses id
       • answeringDept_ses
       • dateTabled
       • dateOfAnswer
       • dateForAnswer
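Slide 2 notes that members can share a ses id and are told apart by their incumbency dates. A minimal sketch of that lookup follows, assuming each incumbency is available as a record with a start date and an optional end date. The `Incumbency` class, the `resolve_members` helper, and the sample records are hypothetical and are not taken from the slides or the data platform.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Incumbency:
    """One period in which a member held a seat (hypothetical record shape)."""
    member_name: str
    ses_id: int
    start: date
    end: Optional[date]  # None means the incumbency is still open


def resolve_members(ses_id: int, on: date, incumbencies: List[Incumbency]) -> List[str]:
    """Return the members with this ses id whose incumbency covers the date `on`."""
    return [
        inc.member_name
        for inc in incumbencies
        if inc.ses_id == ses_id
        and inc.start <= on
        and (inc.end is None or on <= inc.end)
    ]


# Made-up records: two members sharing ses id 1001 at different times.
incumbencies = [
    Incumbency("Member A", 1001, date(1997, 5, 1), date(2005, 4, 11)),
    Incumbency("Member B", 1001, date(2010, 5, 6), None),
]

print(resolve_members(1001, date(2001, 3, 14), incumbencies))  # ['Member A']
```

An empty result (no incumbency covers the tabling date) or more than one name would mark the question for manual review rather than a silent guess.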
  3. 3. Schema Implementation
  4. 4. Answering department ses id
     • 191 unique answering department ses ids
     • Top 5:
       • Department of Health (10%)
       • Home Office (8%)
       • Ministry of Defence (6%)
       • Foreign and Commonwealth Office (6%)
       • Treasury (5%)
     • We only have 39 answering bodies in the triple store
     • Departments have evolved and changed names; we need to model these changes
     • 601,991 (40.1%) questions have answering bodies not in the triple store
     • Top 5 missing answering bodies:
       • Department of Health
       • Department of Trade and Industry
       • Department for Communities and Local Government
       • Department of the Environment
       • Department for Culture, Media and Sport
     • 108,128 (7.2%) have a null answering dept ses id
  5. 5. Asking member ses id
     • 2,836 unique asking member ses ids
     • Top 5:
       • John Bercow (0.8%)
       • Jim Cunningham (0.7%)
       • Norman Baker (0.6%)
       • Paul Flynn (0.6%)
       • Andrew Rosindell (0.6%)
     • Three missing in the triple store:
       • Rt Hon Lord Aberdare
       • Elaine Thomson
       • Jeff Cuthbert (National Assembly for Wales)
     • 6,942 (0.6%) have a null asking member ses id
  6. 6. Answering member ses id
     • 834 unique answering member ses ids
     • Top 5:
       • Dawn Primarolo (1%)
       • Adam Ingram (0.8%)
       • Rosie Winterton (0.8%)
       • Ben Bradshaw (0.8%)
       • Elliot Morley (0.7%)
     • One missing in the triple store:
       • Rt Hon Lord Aberdare
     • 6,744 (0.4%) have a null answering member ses id
  7. 7. Other
     • Days between Date Tabled and Date Of Answer
       • Average: 14 days
       • Outliers: -748 days, 1,317 days
     • Days between Date For Answer and Date Of Answer
       • Average: 3.8 days
       • Outliers: -7,930 days, 7,895 days
     • Null uin value
       • 347,671 (23%), mainly old data from before 2000
     • Null title value
       • 202,213 (13%), mainly old data from before 1993
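The day gaps on slide 7 are plain date differences, and the outliers there (negative gaps, or gaps of several years) are the kind of record a simple range check would surface. The sketch below is one way to compute and flag them. The function names and the 365-day threshold are assumptions chosen for illustration, not values from the slides.

```python
from datetime import date
from typing import Optional


def days_between(start: Optional[date], end: Optional[date]) -> Optional[int]:
    """Signed number of days from start to end, or None if either date is missing."""
    if start is None or end is None:
        return None
    return (end - start).days


def is_suspect(gap: Optional[int], max_days: int = 365) -> bool:
    """Flag gaps that are negative or longer than max_days (assumed threshold)."""
    return gap is not None and (gap < 0 or gap > max_days)


gap = days_between(date(2018, 5, 1), date(2018, 5, 15))
print(gap, is_suspect(gap))                # 14 False
print(is_suspect(-748), is_suspect(1317))  # True True
```

Missing dates return `None` rather than zero, so null values do not pull the average down.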
  8. 8. Recent data
     • 70,880 questions tabled since Jan 1, 2017
     • Answering department
       • 36 unique vs. 191 (all data)
       • 3 not in triple store vs. 152 (all data)
       • 9,644 (13.6%) questions with answering bodies not in triple store vs. 40.1% (all data)
     • Asking member
       • 1,025 unique vs. 2,836 (all data)
       • 1,970 (2.8%) missing vs. 0.6% (all data)
     • Answering member
       • 150 unique vs. 834 (all data)
       • 1,970 (2.8%) missing vs. 0.4% (all data)
     • Days between Date Tabled and Date Of Answer
       • Average 9 days vs. 14 days (all data)
     • Days between Date For Answer and Date Of Answer
       • Average 2.7 days vs. 3.8 days (all data)
  9. 9. Querying data
     • Fixed query (packaged SPARQL queries)
       • Questions asked by a member: https://api.parliament.uk/query/questions_askedby_member?member_id=4fn7q5Wl
       • Questions answered by a member: https://api.parliament.uk/query/questions_answeredby_member?member_id=SWXSOmi9
       • Questions searched by terms in heading: https://api.parliament.uk/query/questions_search_by_title?lowercase_string=health
     • OData (you can query in almost any way!)
       • Total number of questions: https://api.parliament.uk/OData/Question/$count
       • Total number of answers: https://api.parliament.uk/OData/Answer/$count
       • Questions by a member: https://api.parliament.uk/OData/Member('0FqjjgNp')/AskingPersonHasQuestion
       • Answers by a member: https://api.parliament.uk/OData/Member('0FqjjgNp')/AnsweringPersonHasAnswer
       • Questions asked on a date: https://api.parliament.uk/OData/Question?$filter=QuestionAskedAt%20eq%202018-05-23T00:00:00Z
       • Questions asked between two dates: https://api.parliament.uk/OData/Question?$filter=QuestionAskedAt%20gt%202018-04-23T00:00:00Z%20and%20QuestionAskedAt%20lt%202018-04-26T00:00:00Z
       • Correcting answers expanded with corrected answers: https://api.parliament.uk/OData/CorrectingAnswer?$expand=AnswerReplacesAnswer
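The date-range URL on slide 9 is an OData `$filter` expression with its spaces percent-encoded. As a sketch, it can be assembled from two timestamps as shown below. The helper names are hypothetical. The base URL, entity name, and `QuestionAskedAt` property are copied from the slide. The code only builds the URL and sends no request, so it makes no claim about whether the service is still available.

```python
from datetime import datetime, timezone
from urllib.parse import quote

BASE = "https://api.parliament.uk/OData"  # base URL as shown on the slide


def odata_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as a UTC timestamp, e.g. 2018-04-23T00:00:00Z."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def questions_between_url(start: datetime, end: datetime) -> str:
    """Build the Question query for QuestionAskedAt strictly between start and end."""
    expression = (
        f"QuestionAskedAt gt {odata_timestamp(start)} "
        f"and QuestionAskedAt lt {odata_timestamp(end)}"
    )
    # Encode spaces as %20 but keep ':' readable, matching the slide's URL.
    return f"{BASE}/Question?$filter={quote(expression, safe=':')}"


url = questions_between_url(
    datetime(2018, 4, 23, tzinfo=timezone.utc),
    datetime(2018, 4, 26, tzinfo=timezone.utc),
)
print(url)
```

With these two dates the printed URL is character-for-character the "questions asked between two dates" example from the slide.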
  10. 10. Distributions of data
      • Follow a power law distribution
      • [Chart: Distribution of number of questions for asking members]
      • [Chart: Distribution of number of questions for answering members]
  11. 11. [Chart: Distribution of number of questions for answering bodies]
  12. 12. [Chart: Distribution of number of questions for tabling date] [Chart: Distribution of number of questions for tabling date (January 2017 to Now)]
  13. 13. [Chart: Distribution of number of questions for answering date] [Chart: Distribution of number of questions for answering date (January 2017 to Now)]
  14. 14. [Chart: Distribution of number of questions vs. days between date for and of answer] [Chart: Distribution of number of questions vs. days between table date and date of answer]
  15. 15. [Chart: Distribution of term counts in question headings; x-axis: heading terms, from "dept" to "inspections"; y-axis: counts, 0 to 300,000]
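The term-count chart on slide 15 is a rank-frequency table over question headings. A minimal sketch of how such counts could be produced is shown below. The tokenisation rule (lowercase, runs of the letters a to z) and the sample headings are assumptions, since the slides do not say how terms were extracted.

```python
import re
from collections import Counter
from typing import Iterable


def term_counts(headings: Iterable[str]) -> Counter:
    """Count lowercase alphabetic terms across all headings (assumed tokenisation)."""
    counts: Counter = Counter()
    for heading in headings:
        counts.update(re.findall(r"[a-z]+", heading.lower()))
    return counts


# Made-up headings for illustration only.
headings = ["Armed Forces: Housing", "Treasury: Finance", "Armed Forces: Finance"]
counts = term_counts(headings)

# Ranked from most to least frequent, as in a rank-frequency chart.
for term, count in counts.most_common():
    print(term, count)
```

Plotting the ranked counts on log-log axes is a common way to check visually for the power-law shape mentioned on slide 10.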
  16. 16. Member question network
      • A way to get an overview of question data
      • Nodes: 2,893 members
      • Edges: 175,484 (member A’s question answered by member B)
      • Properties of the network (using Python NetworkX)
        • Average node degree: 121.3
        • Network diameter: 6
        • Network radius: 3
        • Average shortest path length: 2.6
        • Clustering coefficient: 0.3
        • Network density: 0.04
      • Network centre: Earl Attlee, Lord Hylton, Lord Wallace of Saltaire, Lord Stoddart of Swindon, Earl Howe, Lord Bates, Lord Patten, Lord Pearson of Rannoch, Lord Hoyle, Lord Howell of Guildford, Earl of Shrewsbury, Lord Davies of Oldham, Baroness Chalker of Wallasey, Lord Braine of Wheatley, Lord Waddington, Baroness Neville-Rolfe
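The properties on slide 16 were computed with NetworkX, which offers `nx.diameter`, `nx.radius`, `nx.center`, and `nx.density` for these measures. To show what they mean, the sketch below recomputes a few of them with the standard library on a five-node toy graph. Treating edges as undirected is an assumption; it does agree with the slide's own figures, since 2 × 175,484 / 2,893 ≈ 121.3 (average degree) and 2 × 175,484 / (2,893 × 2,892) ≈ 0.04 (density). The node labels and function names are illustrative.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected adjacency map from (asker, answerer) pairs."""
    graph: Graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def eccentricity(graph: Graph, source: str) -> int:
    """Longest shortest-path distance from source, found by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    if len(dist) != len(graph):
        raise ValueError("graph is not connected")
    return max(dist.values())


def summarise(graph: Graph) -> dict:
    """Diameter, radius, centre, average degree and density of a connected graph."""
    ecc = {node: eccentricity(graph, node) for node in graph}
    radius = min(ecc.values())
    node_count = len(graph)
    edge_count = sum(len(nbrs) for nbrs in graph.values()) // 2
    return {
        "diameter": max(ecc.values()),
        "radius": radius,
        "centre": sorted(node for node, e in ecc.items() if e == radius),
        "average_degree": 2 * edge_count / node_count,
        "density": 2 * edge_count / (node_count * (node_count - 1)),
    }


# Toy network: a path A-B-C-D with an extra member E attached to B.
toy = build_graph([("A", "B"), ("B", "C"), ("C", "D"), ("B", "E")])
print(summarise(toy))
# {'diameter': 3, 'radius': 2, 'centre': ['B', 'C'], 'average_degree': 1.6, 'density': 0.4}
```

The centre is the set of nodes whose eccentricity equals the radius, which is how a list of members such as the one on the slide arises.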
  17. 17. All data: 2,893 nodes, 175,484 edges
  18. 18. Abortion: 1,281 questions (House of Commons, House of Lords)
  19. 19. Brexit: 420 questions (House of Commons)
  20. 20. Education: 44,714 questions
  21. 21. • We are only scratching the surface of the goldmine
        • More question data to import
        • Other data fields to import
        • Subject indexing and related items data to import
        • Other types of data to import
        • Much more to learn from the data
      • Some ideas
        • Incorporate answering departments, and terms and topics, in answer networks
        • Improve network visualisation: navigation, link direction, weights, zooming in to view details of members, etc.
      • The public can access question data through the data platform, and do fantastic research and discovery!
  22. 22. Further reading
      • https://pds.blog.parliament.uk/2017/06/23/a-new-data-service-for-parliament/
      • https://pds.blog.parliament.uk/2018/01/24/accessing-semantic-data-with-odata-web-interface/
      • https://github.com/ukparliament/ontologies/tree/master/question-and-answer
      • https://medium.com/@langsamu/api-parliament-uk-7b87597019a4
      • http://odata.github.io/
      • http://www.iaeng.org/IJCS/issues_v43/issue_2/IJCS_43_2_03.pdf
