Faculty of Science
Paul Groth | @pgroth | pgroth.com
May 12, 2019
Institute for Information Business – WU Wien
Thinking About the Making of Data
Thanks to Kathleen Gregory (@gregory_km )
The making of data is important
“There is a major, largely unrealised potential to
merge and integrate the data from different
disciplines of science in order to reveal deep
patterns in the multi-facetted complexity that
underlies most of the domains of application that
are intrinsic to the major global challenges that
confront humanity.” – Grand Challenge for
Science
http://dataintegration.codata.org
Committee on Data of the
International Council for Science
(CODATA)
Software 2.0
https://link.medium.com/srrJhEl5bS
“In the 2.0 stack, the programming is done by
accumulating, massaging and cleaning datasets”
Figure 8 Data Science Surveys 2017 & 2018
The making of data is hard
Paul Groth, Michael Lauruhn, Antony Scerri: “Open Information Extraction on
Scientific Text: An Evaluation”, 2018; arXiv:1802.05574, http://arxiv.org/abs/1802.05574
COMPLEX DISTRIBUTED WORKFLOWS
NOT JUST DATA SCIENCE
Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2019).
Searching Data: A Review of Observational Data Retrieval
Practices. Journal of the Association for Information Science and
Technology. doi:10.1002/asi.24165
Some observations from @gregory_km survey & interviews:
• The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented.
• Participants require details about data collection and handling.
• Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common.
Gregory, K.; Cousijn, H.; Groth, P.; Scharnhorst, A.; Wyatt, S. (forthcoming).
Understanding Data Search as a Socio-technical Practice. Journal of
Information Science. arXiv preprint: arXiv:1801.04971.
Spreadsheet Events
https://www.seh.ox.ac.uk/news/the-case-for-ceres-developing-a-postgraduate-mission-with-the-european-space-agency
BOTTLENECKS
1. Manual
2. Difficulty in creating flexible reusable workflows
3. Lack of transparency
Paul Groth, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
Paul Groth, "Transparency and Reliability in the Data Supply Chain," IEEE Internet Computing, vol. 17, no. 2, pp. 69-71, March-April 2013. doi: 10.1109/MIC.2013.41
• Focus on intelligent systems for supporting people working with data.
• 5 people by September 2019 + growing
• 3 research areas:
  • AI for data engineering tasks
    • Knowledge graph construction
    • Data wrangling support + automation
  • Transparency in data supply chains
    • Lineage and provenance of data
  • Understanding data professionals' work
    • Empirical insights into how people go about working with data
New lab at the University of Amsterdam http://indelab.org
Data search – is it just a regular search engine?
Survey of Research Challenges:
Adriane Chapman, Elena Simperl, Laura Koesten,
George Konstantinidis, Luis-Daniel Ibáñez-Gonzalez,
Emilia Kacprzak, Paul Groth (Jan 2019) "Dataset
search: a survey" https://arxiv.org/abs/1901.00735
“An information need is the topic about which the user desires to know
more” – Manning
Information Needs
Data as an information need
• Researchers across communities need a diversity of observational data, requiring data of different types, from different sources and disciplines, and often collected at different scales.
• Integrating diverse data is a challenge.
Gregory, K.; Cousijn, H.; Groth, P.; Scharnhorst, A.; Wyatt, S. (2019). Searching data: A review
of observational data retrieval practices in selected disciplines. Journal of the Association for
Information Science and Technology. https://doi.org/10.1002/asi.24165
Primary: Semi-structured interviews with data seekers across disciplines (n=22)
Next stage: Multidisciplinary survey (n=1677, still in analysis phase)
How do researchers search for data?
Work of Kathleen Gregory
with Sally Wyatt, Andrea Scharnhorst, Helena Cousijn
Data needed for research are not always research data
Numerous roles - data as hubs for collaboration and
creativity
A broader understanding of the data
needed by users
Users and data needs
Survey question: Do you discover data differently than how you discover academic literature?
[Chart: percentage of respondents answering No, Sometimes, or Yes. Values as extracted: 52.2, 29.8, 18.1.]
Survey question: How do you discover data using the academic literature?
[Chart: percentage of respondents per option. Options as extracted: following citations to data; search with goal of finding data; while reading or searching for literature; extract data directly from literature, tables, graphs; other. Values as extracted: 30.2, 29.4, 20.5, 19.3, 0.6.]
Survey question: How frequently do you find data in the following ways?
[Chart: percentage of respondents answering Never, Occasionally, or Often for each way: actively searching online; serendipitously, while searching for something else; while sharing/managing own data; serendipitously, when not actively searching.]
Key role of social interactions
Search and discovery strategies
Actually, most of the times that I have looked for external data, it has
been through (personal) connections (11).
The human network of contacts is still the best way to find the
information you want, especially if it is a small group...that is the
most powerful and accurate source of information that I use at this
point. (17)
Role of social interactions continues
Evaluation and sense-making
I think if there was a good search engine, then I could get the dataset
directly. I would still get in touch with the data author anyway, both
for social reasons - developing the network and eventual
collaboration - and also because most of the times the metadata are
not enough to really understand the biology behind the species (4).
Role of social interactions continues
Evaluation and sensemaking
I am used to working with experts from different areas of knowledge.
For me it is usual to have partners with different expertise: biology,
agronomy, economy…I know the language of LCA (life cycle
assessment), not of electronics or agricultural biology. My limit is
not the data that I cannot find, but people that can work with these
data (16).
What does this mean for system design?
Consider how data are made available
• Metadata standardization and enrichment
• Summarization to facilitate sensemaking
Consider entirety of data needs
• Point to best practices or resources for other data
• Do disciplinary categories still fit?
Consider diversity and overlaps
• Differentiated interfaces
• Integration with infrastructures supporting other data and research practices
Consider how to incorporate role of social interactions
• Contact data author, integration with author profiles, ORCID?
• Links to in-person trainings? Connecting with “data experts”?
Integration of Data Into Workflows
Chichester, Christine, Daniela Digles, Ronald Siebes, Antonis Loizou, Paul Groth, and
Lee Harland. "Drug discovery FAQs: workflows for answering multidomain drug
discovery questions." Drug discovery today 20, no. 4 (2015): 399-405.
Run structured queries
BUILD A KNOWLEDGE GRAPH
[Pipeline diagram. Labels as extracted: Content; Open Information Extraction; Triple Extraction; Entity Resolution; Concept Resolution; Taxonomy; Matrix Construction; Universal schema (surface form relations, structured relations); Matrix Factorization (factorization model); Matrix Completion (predicted relations); Curation; Knowledge graph.]
[Figures shown on the diagram: 14M SD articles; 475 M triples; 3.3 million relations; 49 M relations; ~15k -> 1M entries.]
Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel
“Applying Universal Schemas for Domain Specific Ontology Expansion”
5th Workshop on Automated Knowledge Base Construction (AKBC) 2016
Michael Lauruhn, and Paul Groth. "Sources of Change for Modern
Knowledge Organization Systems." Knowledge Organization 43, no. 8
(2016).
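The matrix completion idea on this slide can be illustrated with a toy example. The sketch below builds a sparse matrix whose rows are entity pairs and whose columns are relations (surface-form and structured), then proposes cells that are missing. As a deliberately simple stand-in for the factorization model, it scores a missing cell by how often its relation co-occurs with relations already observed for that pair. The entity pairs, relation names, and function names are all hypothetical and are not taken from the cited papers.

```python
from collections import defaultdict

# Hypothetical observations: (entity pair, relation) cells.
# "text:" marks a surface-form relation, "kb:" a structured one.
observed = [
    (("aspirin", "pain"), "text:treats"),
    (("aspirin", "pain"), "kb:therapy_for"),
    (("ibuprofen", "pain"), "text:treats"),
    (("ibuprofen", "pain"), "kb:therapy_for"),
    (("metformin", "diabetes"), "text:treats"),
    (("smoking", "cancer"), "text:causes"),
    (("smoking", "cancer"), "kb:risk_factor_for"),
]


def build_rows(cells):
    """Group observed relations by entity pair: one sparse matrix row per pair."""
    rows = defaultdict(set)
    for pair, rel in cells:
        rows[pair].add(rel)
    return dict(rows)


def cooccurrence_confidence(rows, source_rel, target_rel):
    """Fraction of pairs that have source_rel and also have target_rel."""
    with_source = [rels for rels in rows.values() if source_rel in rels]
    if not with_source:
        return 0.0
    return sum(target_rel in rels for rels in with_source) / len(with_source)


def predict_missing(cells):
    """Return (score, pair, relation) for unobserved cells with a nonzero score."""
    rows = build_rows(cells)
    all_rels = sorted({rel for _, rel in cells})
    predictions = []
    for pair, rels in sorted(rows.items()):
        for target in all_rels:
            if target in rels:
                continue  # cell already observed
            score = max(cooccurrence_confidence(rows, src, target) for src in rels)
            if score > 0:
                predictions.append((score, pair, target))
    return sorted(predictions, reverse=True)


if __name__ == "__main__":
    for score, pair, rel in predict_missing(observed):
        print(f"{pair} {rel} {score:.2f}")
```

On this toy data the only proposed cell is the structured relation for the pair that was seen only in text. A learned factorization would generalize across columns in a similar spirit, but with latent vectors rather than raw co-occurrence counts.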
SOURCES OF CHANGE
[Diagram: a KOS containing Concept1, Concept2, and Concept3, with sources of change labelled Professional Curators, Literature, Software, and Non-professional contributors.]
1. dealing with changing cultural and societal norms, specifically to address or correct bias
2. political influence
3. new concepts and terminology arising from discoveries or change in perspective within a technical/scientific community
4. gardening
5. incremental contributorship
6. progressive formalization
7. software and automation
8. integration of large numbers of data sources
9. variance in algorithm training data
[Diagram labels, continued: Data; Society & Politics. The numbered sources are grouped as (4, 5, 6), (7, 8, 9), (3), and (1, 2).]
Lauruhn, Michael, and Paul Groth. "Sources of Change for Modern Knowledge
Organization Systems." Knowledge Organization 43, no. 8 (2016).
WIKIDATA VOCABULARY
4. GARDENING
Wikipedia Categories
25% increase in the number of categories over the 2012 - 2014 period vs
a 12% increase in the number of articles. Likewise, the number of
disambiguation pages has increased by 13%. (Bairi et al. 2015)
http://blog.schema.org/2015/11/schemaorg-whats-new.html
INCREMENTAL CONTRIBUTORSHIP
Over 17,000 active users on Wikidata as of Feb 2017
INTEGRATION OF LARGE NUMBERS OF DATA SOURCES
Groth, Paul, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
• 10 different extractors
• E.g., the mapping-based infobox extractor
• Infobox extraction uses a hand-built ontology based on the 350 most commonly used English-language infoboxes
• Integrates with YAGO
• YAGO relies on Wikipedia + WordNet
• Upper ontology from WordNet, then a mapping to Wikipedia categories based on frequencies
• WordNet is built by psycholinguists
Data are complex objects
Data are diverse.
Data do not stand alone.
Data are not always stable and do not
travel easily.
Borgman, C.L. (2015). Big Data, Little Data, No Data: Scholarship in the Networked World. MIT Press.
Leonelli, S., Rappert, B., & Davies, G. (2017). Data shadows: Knowledge, openness, and absence. Science, Technology, & Human Values,
42(2), p.191-202.
http://www.publicbooks.org/justice-for-data-janitors/
A MORE TRANSPARENT DATA SUPPLY CHAIN
Groth, Paul, "Transparency and Reliability in the Data Supply Chain," IEEE Internet Computing, vol. 17, no. 2, pp. 69-71, March-April 2013. doi: 10.1109/MIC.2013.41
TRANSPARENCY ACKNOWLEDGES MESSINESS
M. C. Elish & danah boyd (2018) Situating methods in the magic of
Big Data and AI, Communication Monographs, 85:1, 57-80, DOI:
10.1080/03637751.2017.1375130
• Data reuse through integration/munging/remixing is pervasive
• We need to reflect on the making, especially as we can automate more
• How can we use the knowledge of making to help support our information needs?
Conclusion
Contact:
Paul Groth | @pgroth | pgroth.com
Can you skip all that?
Paul T. Groth, Antony Scerri, Ron Daniel Jr., Bradley P. Allen: End-to-End Learning for Answering Structured Queries Directly over Text. CoRR abs/1811.06303 (2018)
Machine Comprehension + Question Answering Tasks
https://nlp.stanford.edu/software/sempre/wikitable/
We have a parallel corpus
Triple Pattern Fragments
http://linkeddatafragments.org/concept/
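A triple pattern is a triple in which the subject, predicate, or object may each be left as a variable, and a fragment is the set of triples that match it, typically served page by page together with a count. The sketch below shows that idea over an in-memory list. It is a simplification under stated assumptions: `None` stands for a variable, the `ex:` identifiers and the `fragment` function are hypothetical, and a real server would also return hypermedia controls, which are omitted here.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

# Hypothetical toy dataset. wdt:P131 and wd:Q55 appear in the slides;
# the ex: identifiers are made up for illustration.
TRIPLES: List[Triple] = [
    ("ex:Amsterdam", "wdt:P131", "wd:Q55"),
    ("ex:Utrecht", "wdt:P131", "wd:Q55"),
    ("ex:Vienna", "wdt:P131", "ex:Austria"),
    ("ex:Amsterdam", "ex:capitalOf", "wd:Q55"),
]


def matches(pattern: Pattern, triple: Triple) -> bool:
    """A triple matches when every non-variable term of the pattern is equal."""
    return all(term is None or term == value for term, value in zip(pattern, triple))


def fragment(pattern: Pattern, triples: List[Triple],
             page: int = 1, page_size: int = 2) -> Dict[str, object]:
    """Return one page of matching triples plus the total match count."""
    hits = [t for t in triples if matches(pattern, t)]
    start = (page - 1) * page_size
    return {"total": len(hits), "page": page, "triples": hits[start:start + page_size]}


if __name__ == "__main__":
    # Slot-filling style query: which subjects are located in wd:Q55?
    print(fragment((None, "wdt:P131", "wd:Q55"), TRIPLES))
```

A pattern with exactly one variable, such as the one in the usage line, has the same shape as a slot-filling query: two terms are given and the third is the slot to fill.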
Now we only need to answer slot filling queries
WikiReading: A Novel Large-scale
Language Understanding Task over
Wikipedia, Hewlett, et al, ACL 2016
Constructing Datasets for Multi-hop Reading Comprehension
Across Documents, Johannes Welbl, Pontus
Stenetorp, Sebastian Riedel, Transactions of the Association
for Computational Linguistics 2018
Off the shelf QA architectures
Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making Neural QA as Simple as Possible but not Simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 271–280, 2017.
Dirk Weissenborn, Pasquale Minervini, Tim Dettmers, Isabelle Augenstein, Johannes Welbl, Tim Rocktäschel, Matko Bošnjak, Jeff Mitchell, Thomas Demeester, Pontus Stenetorp, Sebastian Riedel. Jack the Reader – A Machine Reading Framework. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) System Demonstrations, July 2018. URL https://arxiv.org/abs/1806.08727
Question:
lexicalize(?city wdt:P131 wd:Q55) =>
Located in the administrative territorial entity of …. Netherlands
Input Text
“Amsterdam is the capital city and most populous municipality of
the Netherlands. ….”
Answer span
Amsterdam [0,9]
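The example above can be traced in a few lines. The sketch below turns a one-variable triple pattern into a natural-language question by looking up labels, and locates a known answer string in the input text as a character span. It is an illustration only: the label tables and both function names are hypothetical, and the span is found here by string search rather than predicted by a trained QA model.

```python
from typing import Optional, Tuple

# Hypothetical label tables; in practice labels would come from the knowledge graph.
PROPERTY_LABELS = {"wdt:P131": "located in the administrative territorial entity"}
ENTITY_LABELS = {"wd:Q55": "Netherlands"}

TEXT = ("Amsterdam is the capital city and most populous municipality of "
        "the Netherlands.")


def lexicalize(predicate: str, obj: str) -> str:
    """Render the bound terms of a (?x, predicate, obj) pattern as question text."""
    return f"{PROPERTY_LABELS[predicate]} {ENTITY_LABELS[obj]}"


def answer_span(text: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first occurrence, end exclusive."""
    start = text.find(answer)
    if start < 0:
        return None
    return (start, start + len(answer))


if __name__ == "__main__":
    print(lexicalize("wdt:P131", "wd:Q55"))
    print(answer_span(TEXT, "Amsterdam"))  # (0, 9), matching the span on the slide
```

Pairs of this kind (question, text, span) are the usual training format for extractive QA models, which is one way a parallel corpus of text and structured data could be put to use.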
Results
Results
A Prototype
Relationship with academic literature search
Overlaps with other practices
Search and discovery strategies
Faculty of Science
52.2
29.8
18.1
Percentage
No Sometimes Yes
Do you discover data differently than how you discover
academic literature?
Faculty of Science
30.2
29.4
20.5
19.3
0.6
Percentage
Following citations to data
Search with goal of finding
data
While reading or searching
for literature
Extract data directly from
literature, tables, graphs
Other
How do you discover data using the academic literature?
Faculty of Science
Actively searching
online
Serendipitously,
while searching for
something else
While
sharing/managing
own data
Serendipitously,
when not actively
searching
How frequently do you find data in the following ways?
Never
Occasionally
Often
Percentage
Faculty of Science
Key role of social interactions
Search and discovery strategies
Actually, most of the times that I have looked for external data, it has
been through (personal) connections (11).
The human network of contacts is still the best way to find the
information you want, especially if it is a small group...that is the
most powerful and accurate source of information that I use at this
point. (17)
Faculty of Science
Role of social interactions continues
Evaluation and sense-making
I think if there was a good search engine, then I could get the dataset
directly. I would still get in touch with the data author anyway, both
for social reasons - developing the network and eventual
collaboration - and also because most of the times the metadata are
not enough to really understand the biology behind the species (4).
Faculty of Science
Role of social interactions continues
Evaluation and sensemaking
I am used to working with experts from different areas of knowledge.
For me it is usual to have partners with different expertise: biology,
agronomy, economy…I know the language of LCA (life cycle
assessment), not of electronics or agricultural biology. My limit is
not the data that I cannot find, but people that can work with these
data (16).
Faculty of Science

More Related Content

What's hot

Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Paul Groth
 
Content + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningContent + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningPaul Groth
 
More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?Paul Groth
 
From Data Search to Data Showcasing
From Data Search to Data ShowcasingFrom Data Search to Data Showcasing
From Data Search to Data ShowcasingPaul Groth
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chainPaul Groth
 
Machines are people too
Machines are people tooMachines are people too
Machines are people tooPaul Groth
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataPaul Groth
 
Knowledge Representation on the Web
Knowledge Representation on the WebKnowledge Representation on the Web
Knowledge Representation on the WebRinke Hoekstra
 
An Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataAn Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataRinke Hoekstra
 
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)Rinke Hoekstra
 
Prov-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationProv-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationRinke Hoekstra
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsPaul Groth
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 
Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Sören Auer
 
Towards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphTowards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphSören Auer
 
Self adaptive based natural language interface for disambiguation of
Self adaptive based natural language interface for disambiguation ofSelf adaptive based natural language interface for disambiguation of
Self adaptive based natural language interface for disambiguation ofNurfadhlina Mohd Sharef
 
Elsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphElsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphPaul Groth
 

What's hot (20)

Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
 
Content + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningContent + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learning
 
More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?
 
From Data Search to Data Showcasing
From Data Search to Data ShowcasingFrom Data Search to Data Showcasing
From Data Search to Data Showcasing
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chain
 
Machines are people too
Machines are people tooMachines are people too
Machines are people too
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
 
Knowledge Representation on the Web
Knowledge Representation on the WebKnowledge Representation on the Web
Knowledge Representation on the Web
 
An Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataAn Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities Data
 
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
 
Prov-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationProv-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance Visualization
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
CV
CVCV
CV
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Towards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphTowards an Open Research Knowledge Graph
Towards an Open Research Knowledge Graph
 
Self adaptive based natural language interface for disambiguation of
Self adaptive based natural language interface for disambiguation ofSelf adaptive based natural language interface for disambiguation of
Self adaptive based natural language interface for disambiguation of
 
Elsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphElsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge Graph
 

Similar to Thinking About the Making of Data

A metadata scheme of the software-data relationship: A proposal
A metadata scheme of the software-data relationship: A proposalA metadata scheme of the software-data relationship: A proposal
A metadata scheme of the software-data relationship: A proposalKai Li
 
NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa...
NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa...NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa...
NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa...NOVA DATASCIENCE
 
Making our mark: the important role of social scientists in the ‘era of big d...
Making our mark: the important role of social scientists in the ‘era of big d...Making our mark: the important role of social scientists in the ‘era of big d...
Making our mark: the important role of social scientists in the ‘era of big d...The Higher Education Academy
 
Social media as a tool for researchers
Social media as a tool for researchersSocial media as a tool for researchers
Social media as a tool for researchersJari Laru
 
OII Summer Doctoral Programme 2010: Global brain by Meyer & Schroeder
OII Summer Doctoral Programme 2010: Global brain by Meyer & SchroederOII Summer Doctoral Programme 2010: Global brain by Meyer & Schroeder
OII Summer Doctoral Programme 2010: Global brain by Meyer & SchroederEric Meyer
 
Minimal viable data reuse
Minimal viable data reuseMinimal viable data reuse
Minimal viable data reusevoginip
 
Data Services at a Liberal Arts College Library
Data Services at a Liberal Arts College LibraryData Services at a Liberal Arts College Library
Data Services at a Liberal Arts College LibraryJulie Judkins
 
e-infrastructures supporting open knowledge circulation - OpenAIRE France
e-infrastructures supporting open knowledge circulation - OpenAIRE Francee-infrastructures supporting open knowledge circulation - OpenAIRE France
e-infrastructures supporting open knowledge circulation - OpenAIRE FranceJean-François Lutz
 
Decomposing Social and Semantic Networks in Emerging “Big Data” Research
Decomposing Social and Semantic Networks in Emerging “Big Data” ResearchDecomposing Social and Semantic Networks in Emerging “Big Data” Research
Decomposing Social and Semantic Networks in Emerging “Big Data” ResearchHan Woo PARK
 
Data Science definition
Data Science definitionData Science definition
Data Science definitionCarloLauro1
 
Let's talk about Data Science
Let's talk about Data ScienceLet's talk about Data Science
Let's talk about Data ScienceCarlo Lauro
 
Big data divided (24 march2014)
Big data divided (24 march2014)Big data divided (24 march2014)
Big data divided (24 march2014)Han Woo PARK
 
Data Management and Broader Impacts: a holistic approach
Data Management and Broader Impacts: a holistic approachData Management and Broader Impacts: a holistic approach
Data Management and Broader Impacts: a holistic approachMegan O'Donnell
 
Data Science & Analytics (light overview)
Data Science & Analytics (light overview) Data Science & Analytics (light overview)
Data Science & Analytics (light overview) Shalin Hai-Jew
 
Analíticas del aprendizaje: una perspectiva crítica
Analíticas del aprendizaje: una perspectiva críticaAnalíticas del aprendizaje: una perspectiva crítica
Analíticas del aprendizaje: una perspectiva críticaCENT
 
Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?LEARN Project
 
Organizational Implications of Data Science Environments in Education, Resear...
Organizational Implications of Data Science Environments in Education, Resear...Organizational Implications of Data Science Environments in Education, Resear...
Organizational Implications of Data Science Environments in Education, Resear...Victoria Steeves
 

Similar to Thinking About the Making of Data (20)

A metadata scheme of the software-data relationship: A proposal
A metadata scheme of the software-data relationship: A proposalA metadata scheme of the software-data relationship: A proposal
A metadata scheme of the software-data relationship: A proposal
 
NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa...
NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa...NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa...
NOVA Data Science Meetup 8-10-2017 Presentation - State of Data Science Educa...
 
Making our mark: the important role of social scientists in the ‘era of big d...
Making our mark: the important role of social scientists in the ‘era of big d...Making our mark: the important role of social scientists in the ‘era of big d...
Making our mark: the important role of social scientists in the ‘era of big d...
 
Scienceofscience
ScienceofscienceScienceofscience
Scienceofscience
 
Social media as a tool for researchers
Social media as a tool for researchersSocial media as a tool for researchers
Social media as a tool for researchers
 
OII Summer Doctoral Programme 2010: Global brain by Meyer & Schroeder
OII Summer Doctoral Programme 2010: Global brain by Meyer & SchroederOII Summer Doctoral Programme 2010: Global brain by Meyer & Schroeder
OII Summer Doctoral Programme 2010: Global brain by Meyer & Schroeder
 
The Internet, Science, and Transformations of Knowledge (Ralph Schroeder)
The Internet, Science, and Transformations of Knowledge (Ralph Schroeder)The Internet, Science, and Transformations of Knowledge (Ralph Schroeder)
The Internet, Science, and Transformations of Knowledge (Ralph Schroeder)
 
Minimal viable data reuse
Minimal viable data reuseMinimal viable data reuse
Minimal viable data reuse
 
Data Services at a Liberal Arts College Library
Data Services at a Liberal Arts College LibraryData Services at a Liberal Arts College Library
Data Services at a Liberal Arts College Library
 
e-infrastructures supporting open knowledge circulation - OpenAIRE France
e-infrastructures supporting open knowledge circulation - OpenAIRE Francee-infrastructures supporting open knowledge circulation - OpenAIRE France
e-infrastructures supporting open knowledge circulation - OpenAIRE France
 
Decomposing Social and Semantic Networks in Emerging “Big Data” Research
Decomposing Social and Semantic Networks in Emerging “Big Data” ResearchDecomposing Social and Semantic Networks in Emerging “Big Data” Research
Decomposing Social and Semantic Networks in Emerging “Big Data” Research
 
Data Science definition
Data Science definitionData Science definition
Data Science definition
 
Let's talk about Data Science
Let's talk about Data ScienceLet's talk about Data Science
Let's talk about Data Science
 
Lecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptxLecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptx
 
Big data divided (24 march2014)
Big data divided (24 march2014)Big data divided (24 march2014)
Big data divided (24 march2014)
 
Data Management and Broader Impacts: a holistic approach
Data Management and Broader Impacts: a holistic approachData Management and Broader Impacts: a holistic approach
Data Management and Broader Impacts: a holistic approach
 
Data Science & Analytics (light overview)
Data Science & Analytics (light overview) Data Science & Analytics (light overview)
Data Science & Analytics (light overview)
 
Analíticas del aprendizaje: una perspectiva crítica
Analíticas del aprendizaje: una perspectiva críticaAnalíticas del aprendizaje: una perspectiva crítica
Analíticas del aprendizaje: una perspectiva crítica
 
Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?
 
Organizational Implications of Data Science Environments in Education, Resear...
Organizational Implications of Data Science Environments in Education, Resear...Organizational Implications of Data Science Environments in Education, Resear...
Organizational Implications of Data Science Environments in Education, Resear...
 


Thinking About the Making of Data

  • 1. Faculty of Science Paul Groth | @pgroth | pgroth.com May 12, 2019 Institute for Information Business – WU Wien Thinking About the Making of Data Thanks to Kathleen Gregory (@gregory_km )
  • 2. The making of data is important: “There is a major, largely unrealised potential to merge and integrate the data from different disciplines of science in order to reveal deep patterns in the multi-facetted complexity that underlies most of the domains of application that are intrinsic to the major global challenges that confront humanity.” – Grand Challenge for Science, Committee on Data of the International Council for Science (CODATA) http://dataintegration.codata.org
  • 3. The making of data is hard. Software 2.0: “In the 2.0 stack, the programming is done by accumulating, massaging and cleaning datasets” https://link.medium.com/srrJhEl5bS Figure 8 Data Science Surveys 2017 & 2018
  • 10. Paul Groth, Michael Lauruhn, Antony Scerri: “Open Information Extraction on Scientific Text: An Evaluation”, 2018; arXiv:1802.05574, http://arxiv.org/abs/1802.05574
  • 11. COMPLEX DISTRIBUTED WORKFLOWS
  • 12. NOT JUST DATA SCIENCE Some observations from @gregory_km survey & interviews: • The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented. • Participants require details about data collection and handling. • Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common. Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2019). Searching Data: A Review of Observational Data Retrieval Practices. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24165 Gregory, K.; Cousijn, H.; Groth, P.; Scharnhorst, A.; Wyatt, S. (forthcoming). Understanding Data Search as a Socio-technical Practice. Journal of Information Science. arXiv preprint: arXiv:1801.04971.
  • 13. Spreadsheet Events https://www.seh.ox.ac.uk/news/the-case-for-ceres-developing-a-postgraduate-mission-with-the-european-space-agency
  • 14. BOTTLENECKS 1. Manual 2. Difficulty in creating flexible reusable workflows 3. Lack of transparency Paul Groth, "The Knowledge-Remixing Bottleneck," Intelligent Systems, IEEE, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013, doi: 10.1109/MIS.2013.138 Paul Groth, "Transparency and Reliability in the Data Supply Chain," Internet Computing, IEEE, vol. 17, no. 2, pp. 69-71, March-April 2013, doi: 10.1109/MIC.2013.41
  • 15. New lab at the University of Amsterdam http://indelab.org • Focus on intelligent systems for supporting people working with data. • 5 people by September 2019 + growing • 3 research areas: • AI for data engineering tasks (knowledge graph construction; data wrangling support + automation) • Transparency in data supply chains (lineage of provenance of data) • Understanding data professionals' work (empirical insights into how people go about working with data)
  • 16. Data search – is it just a regular search engine? Survey of research challenges: Adriane Chapman, Elena Simperl, Laura Koesten, George Konstantinidis, Luis-Daniel Ibáñez-Gonzalez, Emilia Kacprzak, Paul Groth (Jan 2019) "Dataset search: a survey" https://arxiv.org/abs/1901.00735
  • 17. Information Needs “An information need is the topic about which the user desires to know more” – Manning
  • 18. Data as an information need • Researchers across communities need a diversity of observational data, requiring data of different types, from different sources and disciplines, and often collected at different scales. • Integrating diverse data is a challenge. Gregory, K.; Cousijn, H.; Groth, P.; Scharnhorst, A.; Wyatt, S. (2019). Searching data: A review of observational data retrieval practices in selected disciplines. Journal of the Association for Information Science and Technology. https://doi.org/10.1002/asi.24165
  • 19. How do researchers search for data? Primary: Semi-structured interviews with data seekers across disciplines (n=22) Next stage: Multidisciplinary survey (n=1677, still in analysis phase) Work of Kathleen Gregory with Sally Wyatt, Andrea Scharnhorst, Helena Cousijn
  • 20. Users and data needs: A broader understanding of the data needed by users • Data needed for research are not always research data • Numerous roles - data as hubs for collaboration and creativity
  • 21. Do you discover data differently than how you discover academic literature? [Chart, percentage of respondents: Sometimes 52.2; No 29.8; Yes 18.1]
  • 22. How do you discover data using the academic literature? [Chart, response options: Following citations to data; Search with goal of finding data; While reading or searching for literature; Extract data directly from literature, tables, graphs; Other. Percentages shown: 30.2, 29.4, 20.5, 19.3, 0.6]
  • 23. How frequently do you find data in the following ways? [Chart, percentage answering Never / Occasionally / Often for: Actively searching online; Serendipitously, while searching for something else; While sharing/managing own data; Serendipitously, when not actively searching]
  • 24. Search and discovery strategies: Key role of social interactions “Actually, most of the times that I have looked for external data, it has been through (personal) connections” (11). “The human network of contacts is still the best way to find the information you want, especially if it is a small group...that is the most powerful and accurate source of information that I use at this point.” (17)
  • 25. Evaluation and sense-making: Role of social interactions continues “I think if there was a good search engine, then I could get the dataset directly. I would still get in touch with the data author anyway, both for social reasons - developing the network and eventual collaboration - and also because most of the times the metadata are not enough to really understand the biology behind the species” (4).
  • 26. Evaluation and sense-making: Role of social interactions continues “I am used to working with experts from different areas of knowledge. For me it is usual to have partners with different expertise: biology, agronomy, economy…I know the language of LCA (life cycle assessment), not of electronics or agricultural biology. My limit is not the data that I cannot find, but people that can work with these data” (16).
  • 27. What does this mean for system design? Consider how data are made available: • Metadata standardization and enrichment • Summarization to facilitate sensemaking Consider entirety of data needs: • Point to best practices or resources for other data • Do disciplinary categories still fit? Consider diversity and overlaps: • Differentiated interfaces • Integration with infrastructures supporting other data and research practices Consider how to incorporate role of social interactions: • Contact data author, integration with author profiles, ORCID? • Links to in-person trainings? Connecting with “data experts”?
  • 28. Integration of Data Into Workflows Chichester, Christine, Daniela Digles, Ronald Siebes, Antonis Loizou, Paul Groth, and Lee Harland. "Drug discovery FAQs: workflows for answering multidomain drug discovery questions." Drug discovery today 20, no. 4 (2015): 399-405.
  • 29. Run structured queries
  • 30. BUILD A KNOWLEDGE GRAPH [Diagram labels: Content; Universal schema; Surface form relations; Structured relations; Factorization model; Matrix Construction; Open Information Extraction; Entity Resolution; Matrix Factorization; Knowledge graph; Curation; Predicted relations; Matrix Completion; Taxonomy; Triple Extraction; Concept Resolution. Figures shown: 14M SD articles; 475 M triples; 3.3 million relations; 49 M relations; ~15k -> 1M entries] Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel “Applying Universal Schemas for Domain Specific Ontology Expansion” 5th Workshop on Automated Knowledge Base Construction (AKBC) 2016 Michael Lauruhn, and Paul Groth. "Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
  • 31. SOURCES OF CHANGE 1. dealing with changing cultural and societal norms, specifically to address or correct bias; 2. political influence 3. new concepts and terminology arising from discoveries or change in perspective within a technical/scientific community 4. gardening 5. incremental contributorship 6. progressive formalization 7. software and automation 8. integration of large numbers of data sources 9. variance in algorithm training data [Diagram labels: KOS (Concept1, Concept2, Concept3); Professional Curators; Literature; Software; Non-professional contributors; Data; Society & Politics. Groupings shown: (1, 2); (3); (4, 5, 6); (7, 8, 9)] Lauruhn, Michael, and Paul Groth. "Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
  • 33. 4. GARDENING Wikipedia Categories: 25% increase in the number of categories over the 2012 - 2014 period vs a 12% increase in the number of articles. Likewise, the number of disambiguation pages has increased by 13%. (Bairi et al. 2015) http://blog.schema.org/2015/11/schemaorg-whats-new.html
  • 34. INCREMENTAL CONTRIBUTORSHIP Over 17,000 active users on Wikidata as of Feb 2017
  • 35. INTEGRATION OF LARGE NUMBERS OF DATA SOURCES • 10 different extractors • E.g. mapping-based infobox extractor • Infobox uses a hand-built ontology based on the 350 commonly used English language infoboxes • Integrates with Yago • Yago relies on Wikipedia + Wordnet • Upper ontology from Wordnet and then a mapping to Wikipedia categories based on frequencies • Wordnet is built by psycholinguists Groth, Paul, "The Knowledge-Remixing Bottleneck," Intelligent Systems, IEEE, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013, doi: 10.1109/MIS.2013.138
  • 36. Data are complex objects: Data are diverse. Data do not stand alone. Data are not always stable and do not travel easily. Borgman, C.L. (2015). Big Data, Little Data, No Data: Scholarship in the Networked World. MIT Press. Leonelli, S., Rappert, B., & Davies, G. (2017). Data shadows: Knowledge, openness, and absence. Science, Technology, & Human Values, 42(2), p.191-202.
  • 38. A MORE TRANSPARENT DATA SUPPLY CHAIN Groth, Paul, "Transparency and Reliability in the Data Supply Chain," Internet Computing, IEEE, vol. 17, no. 2, pp. 69-71, March-April 2013, doi: 10.1109/MIC.2013.41
  • 39. TRANSPARENCY ACKNOWLEDGES MESSINESS M. C. Elish & danah boyd (2018) Situating methods in the magic of Big Data and AI, Communication Monographs, 85:1, 57-80, DOI: 10.1080/03637751.2017.1375130
  • 40. Conclusion • Data reuse through integration/munging/remixing is pervasive • We need to reflect on the making, especially as we can automate more • How can we use the knowledge of making to help support our information need? Contact: Paul Groth | @pgroth | pgroth.com
  • 41. Can you skip all that? Paul T. Groth, Antony Scerri, Ron Daniel Jr., Bradley P. Allen: End-to-End Learning for Answering Structured Queries Directly over Text. CoRR abs/1811.06303 (2018)
  • 42. Machine Comprehension + Question Answering Tasks https://nlp.stanford.edu/software/sempre/wikitable/
  • 43. We have parallel corpora
  • 44. Triple Pattern Fragments http://linkeddatafragments.org/concept/
  • 45. Now we only need to answer slot filling queries. Hewlett et al., WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia, ACL 2016. Johannes Welbl, Pontus Stenetorp, Sebastian Riedel, Constructing Datasets for Multi-hop Reading Comprehension Across Documents, Transactions of the Association for Computational Linguistics, 2018.
  • 46. Off the shelf QA architectures Question: lexicalize(?city wdt:P131 wd:Q55) => Located in the administrative territorial entity of …. Netherlands Input text: “Amsterdam is the capital city and most populous municipality of the Netherlands. ….” Answer span: Amsterdam [0,9] Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making neural QA as simple as possible but not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 271–280, 2017. Dirk Weissenborn, Pasquale Minervini, Tim Dettmers, Isabelle Augenstein, Johannes Welbl, Tim Rocktäschel, Matko Bošnjak, Jeff Mitchell, Thomas Demeester, Pontus Stenetorp, Sebastian Riedel. Jack the Reader – A Machine Reading Framework. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) System Demonstrations, July 2018. URL https://arxiv.org/abs/1806.08727
  • 49. A Prototype
  • 50. Methodology Primary: Semi-structured interviews with data seekers across disciplines (n=22) Next stage: Multidisciplinary survey (n=1677, still in analysis phase)
  • 52. Search and discovery strategies: Relationship with academic literature search; overlaps with other practices
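
The slot-filling example on slide 46 can be sketched in a few lines of Python. This is an illustrative sketch only, not the system described in the cited papers: the `TriplePattern` type, the label tables, and both function bodies are assumptions made for this example, the lexicalization is simplified to "property label + entity label", and a plain substring search stands in for the neural QA model that would normally predict the answer span.

```python
from typing import NamedTuple, Optional, Tuple

# Hypothetical label tables for this sketch; a real system would look these
# up in a knowledge base such as Wikidata rather than hard-code them.
PROPERTY_LABELS = {"wdt:P131": "located in the administrative territorial entity"}
ENTITY_LABELS = {"wd:Q55": "Netherlands"}


class TriplePattern(NamedTuple):
    """A triple pattern whose subject is a variable such as '?city'."""
    subject: str
    predicate: str
    obj: str


def lexicalize(pattern: TriplePattern) -> str:
    """Turn a triple pattern with a variable subject into a text question."""
    if not pattern.subject.startswith("?"):
        raise ValueError("this sketch only handles a variable in subject position")
    return f"{PROPERTY_LABELS[pattern.predicate]} {ENTITY_LABELS[pattern.obj]}"


def find_answer_span(text: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the answer, end exclusive.

    Stand-in for a QA model: the answer string is given, not predicted.
    """
    start = text.find(answer)
    if start == -1:
        return None
    return (start, start + len(answer))


pattern = TriplePattern("?city", "wdt:P131", "wd:Q55")
question = lexicalize(pattern)
text = "Amsterdam is the capital city and most populous municipality of the Netherlands."
span = find_answer_span(text, "Amsterdam")
print(question)
print(span)
```

With the input text from the slide, the span for "Amsterdam" comes out as offsets 0 to 9, matching the `[0,9]` shown there. Treating the end offset as exclusive is a choice made for this sketch; the slide does not say which convention it uses.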

Editor's Notes

  1. Work with DANS. Reviewed 400 papers; deep dive into 114.
  2. Tons of challenges
  3. Interviews = primary results that I will speak about today. Highest number of respondents from computer science and information sciences; also spoke with librarians, who were able to give insight into the behaviors and needs of the patrons they support in numerous disciplines. Majority are researchers, although some are active in other areas or have numerous roles. Diverse career stages: early career (0-5 years, n=5), mid-career (6-15 years, n=11), experienced (16+ years, n=6), and retired (n=1). Countries: the most frequently represented are the United States (n=6) and the Netherlands (n=3). Some participants currently work outside of their home countries or have past experience working abroad, influences. I will also briefly discuss data from a recent survey that built on the findings of these interviews; still in analysis phase.
  4. A broader understanding of data needed by users. Data needed for research are not always research data: metadata, texts, server logs, device specifications, and social media posts are all used for foreground and background purposes in research but do not fall into what may traditionally be thought of as “research data”, since they were not created through research or for research. Data play many roles. Background uses (calibration, comparison) are well documented, but others mentioned in the interviews are less well examined. Data seekers use data to support research and teaching and to answer new research questions. Data also act as hubs for collaboration and creativity, though more in small, local collaborations. Found that people could seek data from a researcher in order to enter a collaboration, and that collaborations were seen as a “safe” way of sharing data. Data-related events/trainings at libraries lead to new ideas for projects/collaborations.
  5. In survey – followed up on Half of survey respondents say that they sometimes do (and sometimes don’t) discover data differently than they discover literature; 18% always do; 30% - no difference in how discover data and literature  Overlaps in search practices, quite likely influenced by other research practices
  6. To dive a bit deeper: all respondents who indicated that they use literature as a source to find data were asked this question, and they could select more than one response. (To do: include number of responses on the slide.) Respondents employ practices similar to those known from academic literature search, such as following citations, which raises interesting questions given the unstandardized methods for data citation in many research disciplines. While some purposely search the literature with the goal of finding data, 20% also indicate that they discover data serendipitously, during the course of other reading or searching.
  7. We also see overlaps in discovery strategies outside of the academic literature (n=1677, all respondents). 17% often and 71% occasionally locate data serendipitously when actively searching for something else; the numbers are lower for those who discover data serendipitously when not actively searching. There are clear overlaps with other data practices (discovering data during sharing and management, or when searching for something else), but at the same time respondents also actively search online. Graph/image creation: https://www.makeuseof.com/tag/convert-images-to-svg-format-with-inkscape/ https://kb.tableau.com/articles/howto/stacked-bar-chart-multiple-measures
  8. Another common theme throughout the survey (be ready for a question here) and in all interviews except one: the key role of social interactions and communication in discovering data became one of the clearest themes. Quotes from interviewees demonstrate this. Social interaction is seen as the most efficient and accurate way of finding data, particularly within a close community. The quotes come from a psychologist looking for large datasets to reuse, and from a paleontologist, who knows quite a bit about issues of data reuse and makes heavy use of his community’s email list to find his “data”. Also, not all data are available or searchable, and infrastructures are not always available. A water scientist I spoke with in Malaysia found that the most effective way was the “personal approach”: going to the governmental agency in person and developing a relationship there.
  9. The importance of social interactions is not limited to finding and accessing data. It extends to how participants evaluate and make sense of data before reusing it. Read this quote. Here, social interaction is seen as a chance to develop collaborations and build networks, but also as imperative to understanding what is really happening with the data.
  10. For another participant, the problem in data search is not finding data, but finding collaborators who can make sense of the data.
  11. Consider Supporting Social Interactions. Social interactions are used to locate, evaluate and develop trust in data. Data themselves can facilitate new collaborations. Design ways to contact data authors, perhaps by providing more information about the authors and linking to their other profiles (Mendeley? ORCIDs? Scopus IDs?). Find ways to integrate offline and online interactions around data, including links to in-person training opportunities. On the last quote about experts being a bottleneck: is there a way to create a community of data experts who are willing to collaborate and answer questions?
  12. Why: because you want to be precise. Problem: information extraction.
  13. 1700 active contributors
  14. Data are diverse: as we saw in the earlier example, a single individual can need a diversity of data, but data themselves are also diverse, and the definition of what data even are lies in the eye of the beholder. Text (as in the case of a field notebook) or even journal articles themselves can be data to a person in a particular situation; for another person, or even in a different situation, the same objects are not data. As Christine Borgman puts it, “one person’s signal is another person’s noise”. Data do not (or rarely) stand alone: in order to make sense of and reuse data, one needs associated information and metadata (e.g. protocols, collection conditions) and analysis tools, as well as the skills, technologies, and resources to process data and use them as evidence. Taking Borgman’s definition of data here: “Data are representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship.” Data do not travel easily: “Data often thought of as discrete units, stable in format and content, that can be moved across a range of contexts and reused” (Leonelli et al.). But these different conceptions of data, the different contexts of (re)use, and the context of creation make it difficult for data to be simply transported, unpacked and used.
  17. Literature was also an important source and entry point to data discovery for interview participants. This led to the question of how exactly participants use the literature, and where the overlaps are with the use of and search for literature in the course of other research practices.