1. Faculty of Science
Paul Groth | @pgroth | pgroth.com
May 12, 2019
Institute for Information Business – WU Wien
Thinking About the Making of Data
Thanks to Kathleen Gregory (@gregory_km)
2. Faculty of Science
The making of data is important
“There is a major, largely unrealised potential to
merge and integrate the data from different
disciplines of science in order to reveal deep
patterns in the multi-facetted complexity that
underlies most of the domains of application that
are intrinsic to the major global challenges that
confront humanity.” – Grand Challenge for
Science
http://dataintegration.codata.org
Committee on Data of the
International Council for Science
(CODATA)
3. Faculty of Science
Software 2.0
https://link.medium.com/srrJhEl5bS
“In the 2.0 stack, the programming is done by
accumulating, massaging and cleaning datasets”
The making of data is hard
Figure 8 Data Science Surveys 2017 & 2018
10. Paul Groth, Michael Lauruhn, Antony Scerri: “Open Information Extraction on
Scientific Text: An Evaluation”, 2018; arXiv:1802.05574, http://arxiv.org/abs/1802.05574
12. Faculty of Science
NOT JUST DATA SCIENCE
Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2019).
Searching Data: A Review of Observational Data Retrieval
Practices. Journal of the Association for Information Science and
Technology. doi:10.1002/asi.24165
Some observations from @gregory_km survey & interviews:
• The needs and behaviors of specific user groups (e.g. early
career researchers, policy makers, students) are not well
documented.
• Participants require details about data collection and handling.
• Reconstructing data tables from journal articles, using
general search engines, and making direct data requests are
common.
Gregory, K.; Cousijn, H.; Groth, P.; Scharnhorst, A.; Wyatt, S. (forthcoming).
Understanding Data Search as a Socio-technical Practice. Journal of
Information Science. arXiv preprint: arXiv:1801.04971.
13. Faculty of Science
Spreadsheet Events
https://www.seh.ox.ac.uk/news/the-case-for-ceres-developing-a-postgraduate-mission-with-the-european-space-agency
14. Faculty of Science
BOTTLENECKS
1. Manual
2. Difficulty in creating flexible reusable workflows
3. Lack of transparency
Paul Groth, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems,
vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
Paul Groth, "Transparency and Reliability in the Data Supply Chain,"
IEEE Internet Computing, vol. 17, no. 2, pp. 69-71, March-April 2013.
doi: 10.1109/MIC.2013.41
15. Faculty of Science
• Focus on intelligent systems for supporting people working with data.
• 5 people by September 2019 + growing
• 3 Research areas:
• AI for Data Engineering Tasks
• Knowledge graph construction
• Data wrangling support + automation
• Transparency in data supply chains
• Lineage and provenance of data
• Understanding data professionals' work
• Empirical insights into how people go about working with data
New lab at the University of Amsterdam http://indelab.org
16. Faculty of Science
Data search – is it just a regular search engine?
Survey of Research Challenges:
Adriane Chapman, Elena Simperl, Laura Koesten,
George Konstantinidis, Luis-Daniel Ibáñez-Gonzalez,
Emilia Kacprzak, Paul Groth (Jan 2019) "Dataset
search: a survey" https://arxiv.org/abs/1901.00735
17. Faculty of Science
“An information need is the topic about which the user desires to know
more” – Manning
Information Needs
18. Faculty of Science
Data as an information need
Researchers across communities need a diversity of
observational data, requiring data of different types, from
different sources and disciplines, and often collected at
different scales.
Integrating diverse data is a challenge.
Gregory, K.; Cousijn, H.; Groth, P.; Scharnhorst, A.; Wyatt, S. (2019). Searching data: A review
of observational data retrieval practices in selected disciplines. Journal of the Association for
Information Science and Technology. https://doi.org/10.1002/asi.24165
19. Faculty of Science
Primary: Semi-structured interviews with data seekers across disciplines (n=22)
Next stage: Multidisciplinary survey (n=1677, still in analysis phase)
How do researchers search for data?
Work of Kathleen Gregory
with Sally Wyatt, Andrea Scharnhorst, Helena Cousijn
20. Faculty of Science
Data needed for research are not always research data
Numerous roles - data as hubs for collaboration and
creativity
A broader understanding of the data
needed by users
Users and data needs
22. Faculty of Science
How do you discover data using the academic literature? (Percentage)
• Following citations to data: 30.2
• Search with goal of finding data: 29.4
• While reading or searching for literature: 20.5
• Extract data directly from literature, tables, graphs: 19.3
• Other: 0.6
23. Faculty of Science
How frequently do you find data in the following ways?
[Stacked bar chart: percentage answering Never / Occasionally / Often for each of:]
• Actively searching online
• Serendipitously, while searching for something else
• While sharing/managing own data
• Serendipitously, when not actively searching
24. Faculty of Science
Key role of social interactions
Search and discovery strategies
Actually, most of the times that I have looked for external data, it has
been through (personal) connections (11).
The human network of contacts is still the best way to find the
information you want, especially if it is a small group...that is the
most powerful and accurate source of information that I use at this
point. (17)
25. Faculty of Science
Role of social interactions continues
Evaluation and sense-making
I think if there was a good search engine, then I could get the dataset
directly. I would still get in touch with the data author anyway, both
for social reasons - developing the network and eventual
collaboration - and also because most of the times the metadata are
not enough to really understand the biology behind the species (4).
26. Faculty of Science
Role of social interactions continues
Evaluation and sensemaking
I am used to working with experts from different areas of knowledge.
For me it is usual to have partners with different expertise: biology,
agronomy, economy…I know the language of LCA (life cycle
assessment), not of electronics or agricultural biology. My limit is
not the data that I cannot find, but people that can work with these
data (16).
27. Faculty of Science
What does this mean for system design?
Consider how data are made available
• Metadata standardization and enrichment
• Summarization to facilitate sensemaking
Consider entirety of data needs
• Point to best practices or resources for other data
• Do disciplinary categories still fit?
Consider diversity and overlaps
• Differentiated interfaces
• Integration with infrastructures supporting other data and research practices
Consider how to incorporate role of social interactions
• Contact data author, integration with author profiles, ORCID?
• Links to in-person trainings? Connecting with “data experts”?
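One way to read the metadata and social-interaction bullets above is as a dataset record that carries collection details together with a resolvable author identifier. The following is a minimal sketch, assuming the schema.org Dataset vocabulary as the target format. The dataset, the person and the ORCID iD are placeholders, and `build_dataset_record` is a hypothetical helper, not part of any system described in the talk.

```python
import json


def build_dataset_record(name, description, collection_method, author_name, author_orcid):
    """Return a schema.org-style Dataset description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # Collection and handling details, which interview participants asked for.
        "measurementTechnique": collection_method,
        "creator": {
            "@type": "Person",
            "name": author_name,
            # An ORCID iD gives data seekers a stable way to find the author.
            "sameAs": f"https://orcid.org/{author_orcid}",
        },
    }


# All values below are placeholders for illustration only.
record = build_dataset_record(
    name="Example species observations",
    description="Field observations collected during an example survey.",
    collection_method="Transect counts recorded in a field notebook",
    author_name="Example Author",
    author_orcid="0000-0000-0000-0000",
)
print(json.dumps(record, indent=2))
```

Keeping the author identifier inside the record, rather than on a separate profile page, is one possible way to let a search interface offer a "contact the data author" link next to each result.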
28. Faculty of Science
Integration of Data Into Workflows
Chichester, Christine, Daniela Digles, Ronald Siebes, Antonis Loizou, Paul Groth, and
Lee Harland. "Drug discovery FAQs: workflows for answering multidomain drug
discovery questions." Drug discovery today 20, no. 4 (2015): 399-405.
30. Faculty of Science
BUILD A KNOWLEDGE GRAPH
[Pipeline diagram; box labels as extracted, arrows not recoverable]
Steps: Content; Open Information Extraction; Triple Extraction; Entity Resolution; Concept Resolution; Matrix Construction; Matrix Factorization; Matrix Completion; Curation; Knowledge graph
Artifacts: Taxonomy; Surface form relations; Structured relations; Universal schema; Factorization model; Predicted relations
Scale annotations: 14M SD articles; 475M triples; 3.3 million relations; 49M relations; ~15k -> 1M entries
Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel
“Applying Universal Schemas for Domain Specific Ontology Expansion”
5th Workshop on Automated Knowledge Base Construction (AKBC) 2016
Michael Lauruhn, and Paul Groth. "Sources of Change for Modern
Knowledge Organization Systems." Knowledge Organization 43, no. 8
(2016).
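The universal-schema step named in the diagram above places entity pairs in rows and relations (surface-form and structured alike) in columns, then factorizes the matrix so that missing cells can be predicted. Below is a toy sketch of that idea under stated assumptions: it uses logistic matrix factorization trained with plain full-batch gradient descent, and every entity pair, relation name, dimension and hyperparameter is invented for illustration. It is not the model or training procedure of the cited paper.

```python
import numpy as np

# Toy universal-schema matrix: rows are entity pairs, columns are relations.
# Columns mix surface-form patterns and structured relations.
# All names and values are invented for illustration.
entity_pairs = [
    "(aspirin, headache)",
    "(ibuprofen, fever)",
    "(paracetamol, pain)",
    "(aspirin, NSAID)",
    "(ibuprofen, NSAID)",
]
relations = [
    "surface: is used to treat",
    "surface: relieves",
    "structured: treats",
    "surface: is a kind of",
    "structured: broader",
]
X = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

# Hide one structured cell; the model has to infer it from the surface patterns.
held_out = (2, 2)
mask = np.ones_like(X)
mask[held_out] = 0.0


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def masked_loss(U, V):
    """Mean binary cross-entropy over the observed (unmasked) cells."""
    P = np.clip(sigmoid(U @ V.T), 1e-9, 1 - 1e-9)
    cell_loss = -(X * np.log(P) + (1 - X) * np.log(1 - P))
    return float((cell_loss * mask).sum() / mask.sum())


def factorize(rank=2, steps=2000, lr=0.5, reg=1e-3, seed=0):
    """Logistic matrix factorization with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(X.shape[0], rank))  # entity-pair embeddings
    V = rng.normal(scale=0.1, size=(X.shape[1], rank))  # relation embeddings
    initial = masked_loss(U, V)
    for _ in range(steps):
        # Gradient of the masked loss with respect to the scores U @ V.T.
        err = (sigmoid(U @ V.T) - X) * mask / mask.sum()
        grad_U = err @ V + reg * U
        grad_V = err.T @ U + reg * V
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V, initial, masked_loss(U, V)


U, V, initial_loss, final_loss = factorize()
predicted = sigmoid(U @ V.T)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
print(f"P({relations[held_out[1]]} | {entity_pairs[held_out[0]]}) = {predicted[held_out]:.2f}")
```

The held-out cell is never seen during training, so its predicted score comes only from the row and column embeddings learned from the other cells. That is the sense in which factorization yields "predicted relations" that then go to curation.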
31. Faculty of Science
SOURCES OF CHANGE
[Diagram: a KOS (Concept1, Concept2, Concept3) with sources of change labelled Professional Curators, Literature, Software, Non-professional contributors, Data, and Society & Politics. The labels carry the groupings (4, 5, 6), (7, 8, 9), (3) and (1, 2), which refer to the numbered list below.]
1. dealing with changing cultural and societal
norms, specifically to address or correct bias
2. political influence
3. new concepts and terminology arising from
discoveries or change in perspective within a
technical/scientific community
4. gardening
5. incremental contributorship
6. progressive formalization
7. software and automation
8. integration of large numbers of data sources
9. variance in algorithm training data
Lauruhn, Michael, and Paul Groth. "Sources of Change for Modern Knowledge
Organization Systems." Knowledge Organization 43, no. 8 (2016).
33. Faculty of Science
4. GARDENING
Wikipedia Categories
25% increase in the number of categories over the 2012 - 2014 period vs
a 12% increase in the number of articles. Likewise, the number of
disambiguation pages has increased by 13%. (Bairi et al. 2015)
http://blog.schema.org/2015/11/schemaorg-whats-new.html
35. Faculty of Science
INTEGRATION OF LARGE NUMBERS OF DATA SOURCES
Groth, Paul, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems,
vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
• 10 different extractors
• E.g. mapping-based infobox extractor
• Infobox uses a hand-built ontology based on the 350 most commonly used
English-language infoboxes
• Integrates with Yago
• Yago relies on Wikipedia + Wordnet
• Upper ontology from Wordnet and then a mapping to Wikipedia
categories based on frequencies
• Wordnet is built by psycholinguists
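The chain of dependencies in the bullets above can be written down as a small graph and traversed to list everything an integrated resource ultimately rests on. This is a sketch only: the node names paraphrase the slide, "integrated knowledge graph" is a made-up label for the root, and the `UPSTREAM` dictionary and `upstream_sources` helper are hypothetical.

```python
from collections import deque

# Each key depends on the sources listed as its value (paraphrased from the slide).
# "integrated knowledge graph" is a hypothetical label for the combined resource.
UPSTREAM = {
    "integrated knowledge graph": ["mapping-based infobox extractor", "Yago"],
    "mapping-based infobox extractor": ["hand-built ontology"],
    "hand-built ontology": ["English-language infoboxes"],
    "Yago": ["Wikipedia", "Wordnet"],
    "Wordnet": ["psycholinguists"],
}


def upstream_sources(node):
    """Return every direct or indirect source of `node`, in breadth-first order."""
    seen, order = set(), []
    queue = deque(UPSTREAM.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(UPSTREAM.get(current, []))
    return order


for source in upstream_sources("integrated knowledge graph"):
    print(source)
```

Even in this tiny example the traversal ends at people (psycholinguists) rather than at a dataset, which is the point of the slide: integrating many sources also integrates the decisions of whoever made them.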
36. Faculty of Science
Data are complex objects
Data are diverse.
Data do not stand alone.
Data are not always stable and do not
travel easily.
Borgman, C.L. (2015). Big Data, Little Data, No Data: Scholarship in the Networked World. MIT Press.
Leonelli, S., Rappert, B., & Davies, G. (2017). Data shadows: Knowledge, openness, and absence. Science, Technology, & Human Values, 42(2), pp. 191-202.
38. Faculty of Science
A MORE TRANSPARENT DATA SUPPLY CHAIN
Groth, Paul, "Transparency and Reliability in the Data Supply Chain,"
IEEE Internet Computing, vol. 17, no. 2, pp. 69-71, March-April 2013.
doi: 10.1109/MIC.2013.41
39. Faculty of Science
TRANSPARENCY ACKNOWLEDGES
MESSINESS
M. C. Elish & danah boyd (2018) Situating methods in the magic of
Big Data and AI, Communication Monographs, 85:1, 57-80, DOI:
10.1080/03637751.2017.1375130
40. Faculty of Science
• Data reuse through integration/munging/remixing is pervasive
• We need to reflect on the making, especially as we can automate more
• How can we use the knowledge of making to help support our information needs?
Conclusion
Contact:
Paul Groth | @pgroth | pgroth.com
41. Faculty of Science
Can you skip all that?
Paul T. Groth, Antony Scerri, Ron Daniel Jr., Bradley P. Allen:
End-to-End Learning for Answering Structured Queries Directly over Text.
CoRR abs/1811.06303 (2018)
45. Faculty of Science
Now we only need to answer slot filling queries
WikiReading: A Novel Large-scale
Language Understanding Task over
Wikipedia, Hewlett, et al, ACL 2016
Constructing Datasets for Multi-hop Reading Comprehension
Across Documents, Johannes Welbl, Pontus
Stenetorp, Sebastian Riedel, Transactions of the Association
for Computational Linguistics 2018
46. Faculty of Science
Off the shelf QA architectures
Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making Neural QA as Simple as Possible but not
Simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning
(CoNLL 2017), pages 271–280, 2017.
Dirk Weissenborn, Pasquale Minervini, Tim Dettmers, Isabelle Augenstein, Johannes Welbl,
Tim Rocktäschel, Matko Bošnjak, Jeff Mitchell, Thomas Demeester, Pontus Stenetorp,
Sebastian Riedel. Jack the Reader – A Machine Reading
Framework. In Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (ACL) System Demonstrations, July 2018. URL
https://arxiv.org/abs/1806.08727
Question:
lexicalize(?city wdt:P131 wd:Q55) =>
Located in the administrative territorial entity of …. Netherlands
Input Text
“Amsterdam is the capital city and most populous municipality of
the Netherlands. ….”
Answer span
Amsterdam [0,9]
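The example above turns a structured query pattern into a question and expects a character span in the input text as the answer. Here is a sketch of that input/output contract under loud assumptions: `PROPERTY_LABELS`, `ENTITY_LABELS`, `lexicalize` and `answer_span` are all hypothetical, and naive string matching of an already known answer stands in for the neural QA reader, which is not implemented here. Offsets are [start, end) character positions, matching the slide's Amsterdam [0,9].

```python
# Hypothetical label tables; a real system would look these up in Wikidata.
PROPERTY_LABELS = {"wdt:P131": "located in the administrative territorial entity"}
ENTITY_LABELS = {"wd:Q55": "Netherlands"}


def lexicalize(prop, obj):
    """Render a (?x, property, entity) pattern as a natural-language question."""
    return f"{PROPERTY_LABELS[prop]} {ENTITY_LABELS[obj]}"


def answer_span(text, answer):
    """Return (answer, (start, end)) character offsets of `answer` in `text`, or None."""
    start = text.find(answer)
    if start < 0:
        return None
    return answer, (start, start + len(answer))


question = lexicalize("wdt:P131", "wd:Q55")
text = "Amsterdam is the capital city and most populous municipality of the Netherlands."
span = answer_span(text, "Amsterdam")

print(question)
print(span)
```

A trained reader would receive `question` and `text` and predict the span itself; here the answer is supplied by hand only to show the shape of the output.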
50. Faculty of Science
Primary: Semi-structured interviews with data seekers across disciplines (n=22)
Next stage: Multidisciplinary survey (n=1677, still in analysis phase)
Methodology
52. Faculty of Science
Relationship with academic literature search
Overlaps with other practices
Search and discovery strategies
Interviews = primary results that I will speak about today
Highest number of respondents from computer science and information sciences; also spoke with librarians who were able to give insight into behaviors and needs of patrons (whom they support) in numerous disciplines
Majority – researchers, although some active in other areas, have numerous roles
Diverse career stages - early career (0-5 years, n=5), mid-career (6-15 years, n=11), experienced (16+ years, n=6), and retired (n=1).
Countries - The most frequently represented countries are the United States (n=6) and the Netherlands (n=3). Some participants currently work outside of their home countries or have past experience working abroad, influences
I will also briefly discuss data from recent survey that built on findings of these interviews; still in analysis phase
A Broader Understanding of Data Needed by Users
Data needed for research are not always research data. Metadata, texts, server logs, device specifications, social media posts – all are used for foreground and background purposes in research but do not fall into what may traditionally be thought of as “research data.” – not created through research or for research
Data play many roles. Background uses (calibration, comparison) well-documented. But also others less well examined that were mentioned in interviews.
Data seekers use data to support research and teaching and to answer new research questions. Data also act as hubs for collaboration and creativity – but more small, local collaborations
Found that participants could seek data from a researcher in order to enter a collaboration, and that collaborations were seen as a “safe” way of sharing data. Data-related events/trainings at libraries led to new ideas for projects/collaborations
In survey – followed up on
Half of survey respondents say that they sometimes do (and sometimes don’t) discover data differently than they discover literature; 18% always do; 30% - no difference in how discover data and literature
Overlaps in search practices, quite likely influenced by other research practices
Dive a bit deeper – all respondents who indicated that use literature as source to find data asked this question; could select more than one response (To do: include number of responses on the slide)
Employ similar practices that know from academic literature search
-following citations – raises interesting questions, given unstandardized methods for data citation in many research disciplines
-While some purposely search the literature with the goal of finding data – 20 % also indicate that discover serendipitously – during course of other reading or searching practice
Also see overlaps in discovery strategies outside of using the academic literature; n=1677 – all respondents
17% and 71% - often or occasionally locate data serendipitously when searching actively for something else; lower numbers for those who discover serendipitously when not actively searching
See clear overlaps with other data practices – discovering data during sharing and management, when searching for something else – but at same time, also actively searching online -
Another common theme throughout the survey (be ready for a question here) and in all interviews except for one – the key role of social interactions and communication in discovering data became one of clearest themes.
Quotes from interviewees demonstrate this-
Seen as most efficient and accurate way of finding data, particularly within close community.
Quote from psychologist – looking for large datasets to reuse, and paleontologist – who knows quite a bit about issues of data reuse – makes heavy use of email list for his community to find his “data”
Also not all data available or searchable; infrastructures not available – in case of water scientist spoke with in Malaysia – found that most effective way was the “personal approach” – go to governmental agency in person and develop relationship with
The importance of social interactions is not limited to finding and accessing data.
Extends to how participants are evaluating and making sense of data before reusing it. Read this quote.
Here – see as chance to develop collaborations and build networks, but also imperative to understanding what is really happening with the data
For another participant, problems in data search are not finding data, but finding collaborators who can make sense of the data
Consider Supporting Social Interactions
Social interactions are used to locate, evaluate and develop trust in data. Data themselves can facilitate new collaborations.
Designing ways to contact data authors, perhaps to provide more info about authors - and ways to contact them – by linking to other profiles – in Mendeley? ORCIDs? Scopus IDs?
Ways to integrate offline and online interactions around data
including links to in-person training opportunities, last quote about experts being bottleneck – way to create community of data experts – willing to collaborate and answer questions?
Why – because you want to be precise
Problem – information extraction
1700 active contributors
Data are diverse – As we saw in the earlier example, a single individual can need a diversity of data, but data themselves are also diverse; the definition of what data even are lies in the eye of the beholder. Text (as in the case of a field notebook) or even journal articles themselves can be data to a person in a particular situation; for another person, or even in a different situation, the same objects are not data. As Christine Borgman puts it, “one person’s signal is another person’s noise”.
Data do not (or rarely) stand alone – In order to make sense of and reuse data, one needs associated information/metadata (e.g. protocols, collection conditions) and analysis tools. One also needs skills, technologies and resources to process the data and use them as evidence.
Taking Borgman’s definition of data here: “Data are representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship.”
Data do not travel easily – “Data often thought of as discrete units, stable in format and content, that can be moved across a range of contexts and reused” (Leonelli et al.). But these different conceptions of data, different contexts of (re)use and the context of creation make it difficult for data to be simply transported, unpacked and used.
Literature is also an important source/entry point to data discovery for interview participants
This led to the question of how exactly participants use the literature, and where the overlaps are with the use of and search for literature in the course of other research practices