Open Data in a Big Data World: easy to
say, but hard to do?
Sarah Callaghan
sarah.callaghan@stfc.ac.uk
@sorcha_ni
ORCID: 0000-0002-0517-1031
Geoffrey Boulton, Dominique Babini, Simon Hodson, Jianhui Li, Tshilidzi
Marwala, Maria Musoke, Paul Uhlir, Sally Wyatt
3rd LEARN workshop on Research Data Management,
“Make research data management policies work”
Helsinki, 28 June 2016
Principles, Policies & Practice
Responsibilities
1-2. Scientists
3. Research institutions & universities
4. Publishers
5. Funding agencies
6. Scholarly societies and academies
7. Libraries & repositories
8. Boundaries of openness
Enabling practices
9. Citation and provenance
10. Interoperability
11. Non-restrictive re-use
12. Linkability
http://www.icsu.org/science-international/accord
The Data Deluge
http://www.economist.com/node/21521549
http://www.leadformix.com/blog/2013/02/the-big-data-deluge/
It used to be “easy”…
Suber cells and mimosa
leaves. Robert Hooke, Micrographia,
1665
The Scientific Papers of William Parsons,
Third Earl of Rosse 1800-1867
…but datasets have gotten so big, it’s not useful
to publish them in hard copy anymore
Hard copy of the Human Genome at the
Wellcome Collection
Example Big Data: CMIP5
CMIP5: Fifth Coupled Model
Intercomparison Project
• Global community activity under the
World Meteorological Organisation (WMO)
via the World Climate Research
Programme (WCRP)
• Aim:
– to address outstanding scientific
questions that arose as part of the
4th Assessment Report process,
– improve understanding of climate,
and
– to provide estimates of future
climate change that will be useful to
those considering its possible
consequences.
Many distinct experiments, with very
different characteristics, which influence the
configuration of the models, (what they can
do, and how they should be interpreted).
Simulations:
~ 90,000 years
~ 60 experiments
~ 20 modelling centres (from around the world)
using
~ 30 major(*) model configurations
~ 2 million output “atomic” datasets
~ tens of petabytes of output
~ 2 petabytes of CMIP5 requested output
~ 1 petabyte of CMIP5 “replicated” output,
which is replicated at a number of sites
(including ours)
Major international collaboration!
Funded by EU FP7 projects (IS-ENES2,
Metafor) and US (ESG) and other national
sources (e.g. NERC for the UK)
CMIP5 numbers
Summary of the CMIP5 example
The Climate problem needs:
– Major physical e-infrastructure (networks, supercomputers)
– Comprehensive information architectures covering the whole information life
cycle, including annotation (particularly of quality)
… and hard work populating these information objects, particularly with
provenance detail.
– Sophisticated tools to produce and consume the data and information
objects
– State of the art access control techniques
Major distributed systems are social challenges as much as technical challenges.
CMIP5 is Big Data, with lots of different participants and lots of different
technologies.
It also has a community willing to work together to standardise and automate data
and metadata production and curation, and with the willingness to support the
effort needed for openness.
Big Data:
• Industrialised and standardised data
and metadata production
• Large groups of people involved
• Methods for making the data open,
attribution and credit for data creation
established
Long Tail Data:
• Bespoke data and metadata creation
methods
• Small groups/lone researchers
• No generally accepted methods for
attribution and credit for data creation.
Often data is closed due to lack of effort
to open it
https://flic.kr/p/g1EHPR
Most people have an idea of what a
publication is
Some examples of data (just from the
Earth Sciences)
1. Time series, some still being updated
e.g. meteorological measurements
2. Large 4D synthesised datasets, e.g.
Climate, Oceanographic, Hydrological
and Numerical Weather Prediction
model data generated on a
supercomputer
3. 2D scans e.g. satellite data, weather
radar data
4. 2D snapshots, e.g. cloud camera
5. Traces through a changing medium,
e.g. radiosonde launches, aircraft
flights, ocean salinity and temperature
6. Datasets consisting of data from
multiple instruments as part of the
same measurement campaign
7. Physical samples, e.g. fossils
Open Data is not a new idea
Henry Oldenburg
Data, Reproducibility and Science
Science should be reproducible –
other people doing the same
experiments in the same way should
get the same results.
Observational data is not
reproducible (unless you have a time
machine)
Therefore we need to have access to
the data to confirm the science is
valid!
Poor data analysis generates false
facts – and false facts &
inaccessible data undermine
science & its credibility
http://www.flickr.com/photos/31333486@N00/1893012324/sizes/o/in/photostream/
A crisis of reproducibility and
credibility?
The data providing the evidence for a published concept MUST be concurrently
published, together with the metadata. To do otherwise is scientific MALPRACTICE
Pre-clinical oncology – 89% not reproducible
Why?
• Misconduct/fraud
• Invalid reasoning
• Absent or inadequate data and/or metadata
We’re only going to get more data
More big data - linked data – machine learning
The internet of things
So, what must we do?
• Concurrently publish data and metadata that are the evidence for a published
scientific claim – to do otherwise is malpractice
• Data science skills for researchers
• Re-establish standards of reproducibility for a data-intensive age
• Patterns not hitherto seen
• Unsuspected relationships
• Integrated analysis of diverse data (e.g. natural & social science)
• Complex systems
e.g. complexity: dynamic evolution and system state
But not all research is or needs to be data-intensive
Scientific Opportunities of Big Data
https://www.clickz.com/clickz/column/2389218/create-better-content-via-humor
http://www.tylervigen.com/spurious-correlations
Caveat Emptor!
Data supporting a published claim
Other data for re-use & integration
Pillars of the Digital Revolution
Big Data
Volume
Velocity
Variety
Veracity
Linked
Data
Many
databases
Semantic
Relations
Deeper
meaning
Foundations : Openness
Machine analysis & learning
The Open Data Edifice
Open Data initiatives in areas of:
Life sciences
Earth Science,
Environmental Science
Food Science
Agricultural Science
Chemical Crystallography
Bioinformatics/Genomics
Linguistics
Social Sciences
Evolutionary biology
Biodiversity
Astronomy
Earth Observation (GEO)
Archaeology
Atmospheric sciences
EMBL-EBI services
Labs around the
world send us
their data and
we…
Archive it
Classify it
Share it with
other data
providers
Analyse, add
value and
integrate it
…provide
tools to help
researchers
use it
A collaborative
enterprise
Elixir programme
It is happening: bottom-up
Open Data initiatives
The Open Data Iceberg
The Technical Challenge
The Consent Challenge
The Institutional Challenge
The Funding Challenge
The Support Challenge
The Skills Challenge
The Incentives Challenge
The Mindset Challenge
Processes &
Organisation
People
Developed from: Deetjen, U., E. T. Meyer and R. Schroeder
(2015). OECD Digital Economy Papers, No. 246, OECD
A National Infrastructure
Technology
Scientists
i. Publicly funded scientists have a responsibility to contribute to the
public good through the creation and communication of new
knowledge, of which associated data are intrinsic parts. They should
make such data openly available to others as soon as possible after
their production in ways that permit them to be re-used and
re-purposed.
ii. The data that provide evidence for published scientific claims
should be made concurrently and publicly available in an
intelligently open form. This should permit the logic of the link
between data and claim to be rigorously scrutinised and the
validity of the data to be tested by replication of experiments or
observations. To the extent possible, data should be deposited in
well-managed and trusted repositories with low access barriers.
From the Accord: Responsibilities
Creating a dataset is hard work!
"Piled Higher and Deeper" by Jorge Cham
www.phdcomics.com
Documenting a dataset so that it is usable and understandable by
others is extra work!
“I’m all for the free sharing
of information, provided
it’s them sharing their
information with us.”
http://discworld.wikia.com/wiki/Mustrum_Ridcully
Mustrum Ridcully, D.Thau., D.M., D.S.,
D.Mn., D.G., D.D., D.C.L., D.M. Phil.,
D.M.S., D.C.M., D.W., B.El.L,
Archchancellor, Unseen University,
Ankh-Morpork, Discworld
- As quoted in “Unseen Academicals”, by
Terry Pratchett
Open is not enough!
“When required to make the data available by
my program manager, my collaborators, and
ultimately by law, I will grudgingly do so by
placing the raw data on an FTP site, named
with UUIDs like 4e283d36-61c4-11df-9a26-
edddf420622d. I will under no circumstances
make any attempt to provide analysis source
code, documentation for formats, or any
metadata with the raw data. When requested
(and ONLY when requested), I will provide an
Excel spreadsheet linking the names to data
sets with published results. This spreadsheet
will likely be wrong -- but since no one will be
able to analyze the data, that won't matter.”
- http://ivory.idyll.org/blog/data-management.html
https://flic.kr/p/awnCQu
Incentives for Open Data
• Need reward
structures and
incentives for
researchers to
encourage them to
make their data open
• Data citation and
publication
• (again, issues with
treating data as a
special case of
publications…)
The Understandability
Challenge: Article
What the data set looks
like on disk
What the raw data files look like.
I could make these files open
easily, but no one would have
a clue how to use them!
The
Understandability
Challenge: Data
It’s ok, I’ll just put it out there and if it’s
important other people will figure it out
These documents have been preserved for thousands of years!
But they’ve both been translated many times, with different meanings each time.
We need Metadata to preserve Information
We can’t rely on Data Archaeology
Phaistos Disk, 1700BC
http://theupturnedmicroscope.com/comic/negative-data/
It’s not just data!
• Experimental protocols
• Workflows
• Software code
• Metadata
• Things that went wrong!
• …
Usability, trust, metadata
http://trollcats.com/2009/11/im-your-friend-and-i-only-want-whats-best-for-you-trollcat/
When you read a journal paper, it’s easy to
read and get a quick understanding of the
quality of the paper.
You don’t want to be downloading many
GB of dataset to open it and see if it’s any
use to you.
Need to use proxies for quality:
• Do you know the data source/repository?
Can you trust it?
• Is there enough metadata so that you can
understand and/or use the data?
In the same way that not all journal
publishers are created equal, not all data
repositories are created equal
Example metadata from a published
dataset:
“rain.csv contains rainfall in mm for each
month at Marysville, Victoria from
January 1995 to February 2009”
Lindenmayer, David B.; Wood, Jeff; McBurney, Lachlan;
Michael, Damian; Crane, Mason; MacGregor, Christopher;
Montague-Drake, Rebecca; Gibbons, Philip; Banks, Sam C.;
(2011): rain; Dryad Digital Repository.
http://doi.org/10.5061/DRYAD.QP1F6H0S/3
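As a rough illustration of how much a one-sentence description like this already tells a re-user, the sketch below restates it as a small machine-readable record. The field names are illustrative assumptions, not a recognised metadata schema, and the expected row count assumes one row per month with no gaps, which the description does not guarantee.

```python
import json


def months_inclusive(start, end):
    """Count calendar months from start to end, each given as (year, month)."""
    (y0, m0), (y1, m1) = start, end
    return (y1 - y0) * 12 + (m1 - m0) + 1


# Field names are hypothetical; the values come from the dataset description.
rain_metadata = {
    "file": "rain.csv",
    "variable": "rainfall",
    "units": "mm",
    "frequency": "monthly",
    "location": "Marysville, Victoria",
    "start": "1995-01",
    "end": "2009-02",
    "doi": "10.5061/DRYAD.QP1F6H0S/3",
}

# Assumption: one row per month and no missing months.
rain_metadata["expected_rows"] = months_inclusive((1995, 1), (2009, 2))

metadata_json = json.dumps(rain_metadata, indent=2)
print(metadata_json)
```

A record like this lets someone check units, coverage and approximate size before downloading anything, which is the kind of quality proxy discussed earlier.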
Should ALL data be open?
Most data produced through
publicly funded research
should be open.
But!
• Confidentiality issues (e.g.
named persons’ health records)
• Conservation issues (e.g. maps
of locations of rare animals at
risk from poachers)
• Security issues (e.g. data and
methodologies for building
biological weapons)
There should be a very good
reason for publicly funded
data to not be open.
Getting scooped
http://www.phdcomics.com/comics/archive.php?comicid=795
It happened to me!
I shared my data with another research group. They published
the first results using that data.
I wasn’t a co-author. I didn’t get an acknowledgement.
Citeable does not equal Open!
Just like you can cite a paper that is
behind a paywall, you can cite a
dataset that isn’t open.
Making something citeable means
that:
• You know it exists
• You know who’s responsible for it
• You know where to find it
• You know a little bit about it (title,
abstract,…)
Even if you can’t download/read the
thing yourself.
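A minimal sketch of the point above: a citation can be built from the descriptive record alone, whether or not the data can be downloaded. The class, field names and the example record (including its identifier) are hypothetical placeholders, and the output layout loosely mirrors the creator/year/title/repository/identifier pattern of the Dryad example in these slides.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical descriptive record for a dataset."""
    creators: list
    year: int
    title: str
    publisher: str
    identifier: str
    open_access: bool  # whether the data itself can be downloaded


def cite(record):
    """Build a citation string; note that open_access is never consulted."""
    names = "; ".join(record.creators)
    return (f"{names} ({record.year}): {record.title}. "
            f"{record.publisher}. {record.identifier}")


# Placeholder record, not a real dataset or a real identifier.
closed = DatasetRecord(
    creators=["Example, A."],
    year=2016,
    title="Hypothetical closed dataset",
    publisher="Example Repository",
    identifier="https://doi.org/10.0000/example",
    open_access=False,
)

print(cite(closed))
```

The citation tells a reader that the dataset exists, who is responsible for it and where to find it, even though `open_access` is false.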
Citation gives benefits that
encourage data producers to
make their data open
Be careful of your citations!
Inputs Outputs
Open access
Administrative
data (held by
public
authorities e.g.
prescription
data)
Public Sector
Research data
(e.g. Met
Office weather
data)
Research
Data (e.g.
CERN,
generated in
universities)
Research
publications
(i.e. papers
in journals)
Open data
Open science
A direction of travel?
Collecting
the data
Doing
research
Doing science
openly
Researchers - Govt & Public sector - Businesses - Citizens - Citizen scientists
(communication/dialogue – joint production of knowledge)
Stakeholders
• Communication/dialogue must be audience-sensitive
• Is it – with all stakeholder groups?
Summary and maybe
conclusions?
• We need to open the products of research
• to encourage innovation and collaboration
• to give credit to the people who’ve created
them
• to be transparent and trustworthy
• Openness does come at a cost!
• It’s not enough for data to be open
• it needs to be usable and understandable
too
• Data citation and publication are ways of
encouraging researchers to make their data
open
• or at least tell the world that their data exists!
• We need a culture change – but it’s
already happening!
http://www.keepcalm-o-matic.co.uk/default.asp
Thanks!
Any questions?
sarah.callaghan@stfc.ac.uk
@sorcha_ni
http://citingbytes.blogspot.co.uk/
“Publishing research without data is simply
advertising, not science” - Graham Steel
http://blog.okfn.org/2013/09/03/publishing-research-without-data-is-simply-advertising-not-science/
http://heywhipple.com/dont-show-me-a-something-about-show-me-something/

More Related Content

What's hot

A Revolution in Open Science: Open Data and the Role of Libraries (Professor ...
A Revolution in Open Science: Open Data and the Role of Libraries (Professor ...A Revolution in Open Science: Open Data and the Role of Libraries (Professor ...
A Revolution in Open Science: Open Data and the Role of Libraries (Professor ...
LIBER Europe
 

What's hot (20)

The Challenges of Making Data Travel, by Sabina Leonelli
The Challenges of Making Data Travel, by Sabina LeonelliThe Challenges of Making Data Travel, by Sabina Leonelli
The Challenges of Making Data Travel, by Sabina Leonelli
 
Research Data in an Open Science World - Prof. Dr. Eva Mendez, uc3m
Research Data in an Open Science World - Prof. Dr. Eva Mendez, uc3mResearch Data in an Open Science World - Prof. Dr. Eva Mendez, uc3m
Research Data in an Open Science World - Prof. Dr. Eva Mendez, uc3m
 
LEARN Final Conference: Tutorial Group | Implementing the LEARN RDM Toolkit
LEARN Final Conference: Tutorial Group | Implementing the LEARN RDM ToolkitLEARN Final Conference: Tutorial Group | Implementing the LEARN RDM Toolkit
LEARN Final Conference: Tutorial Group | Implementing the LEARN RDM Toolkit
 
Opening Research Data in EU Universities: Policies, Motivators and Challenges
Opening Research Data in EU Universities: Policies, Motivators and ChallengesOpening Research Data in EU Universities: Policies, Motivators and Challenges
Opening Research Data in EU Universities: Policies, Motivators and Challenges
 
What does open science mean? A stakeholder perspective
What does open science mean? A stakeholder perspectiveWhat does open science mean? A stakeholder perspective
What does open science mean? A stakeholder perspective
 
A Revolution in Open Science: Open Data and the Role of Libraries (Professor ...
A Revolution in Open Science: Open Data and the Role of Libraries (Professor ...A Revolution in Open Science: Open Data and the Role of Libraries (Professor ...
A Revolution in Open Science: Open Data and the Role of Libraries (Professor ...
 
The Needs of Stakeholders in the RDM Process - the role of LEARN
The Needs of Stakeholders in the RDM Process - the role of LEARNThe Needs of Stakeholders in the RDM Process - the role of LEARN
The Needs of Stakeholders in the RDM Process - the role of LEARN
 
Introduction to open science
Introduction to open scienceIntroduction to open science
Introduction to open science
 
Developing a Framework for Research Data Management Protocols
Developing a Framework for Research Data Management ProtocolsDeveloping a Framework for Research Data Management Protocols
Developing a Framework for Research Data Management Protocols
 
Fostering Open Science to Research Using a Taxonomy and an eLearning Portal
Fostering Open Science to Research Using a Taxonomy and an eLearning PortalFostering Open Science to Research Using a Taxonomy and an eLearning Portal
Fostering Open Science to Research Using a Taxonomy and an eLearning Portal
 
Supporting Research Data Management in UK Universities: the Jisc Managing Res...
Supporting Research Data Management in UK Universities: the Jisc Managing Res...Supporting Research Data Management in UK Universities: the Jisc Managing Res...
Supporting Research Data Management in UK Universities: the Jisc Managing Res...
 
LEARN Final Conference: Tutorial Group | How To Engage Early Career Researchers
LEARN Final Conference: Tutorial Group | How To Engage Early Career ResearchersLEARN Final Conference: Tutorial Group | How To Engage Early Career Researchers
LEARN Final Conference: Tutorial Group | How To Engage Early Career Researchers
 
November 10, 2015 NISO/ICSTI Joint Webinar: A Pathway from Open Access and Da...
November 10, 2015 NISO/ICSTI Joint Webinar: A Pathway from Open Access and Da...November 10, 2015 NISO/ICSTI Joint Webinar: A Pathway from Open Access and Da...
November 10, 2015 NISO/ICSTI Joint Webinar: A Pathway from Open Access and Da...
 
Open Science
Open ScienceOpen Science
Open Science
 
Roadmaps, Roles and Re-engineering: Developing Data Informatics Capability in...
Roadmaps, Roles and Re-engineering: Developing Data Informatics Capability in...Roadmaps, Roles and Re-engineering: Developing Data Informatics Capability in...
Roadmaps, Roles and Re-engineering: Developing Data Informatics Capability in...
 
Enabling Data-Intensive Science Through Data Infrastructures
Enabling Data-Intensive Science Through Data InfrastructuresEnabling Data-Intensive Science Through Data Infrastructures
Enabling Data-Intensive Science Through Data Infrastructures
 
UK Research Data Management: overview to ADBU congress, 19 Sep 2013 by Laura ...
UK Research Data Management: overview to ADBU congress, 19 Sep 2013 by Laura ...UK Research Data Management: overview to ADBU congress, 19 Sep 2013 by Laura ...
UK Research Data Management: overview to ADBU congress, 19 Sep 2013 by Laura ...
 
Why science needs open data – Jisc and CNI conference 10 July 2014
Why science needs open data – Jisc and CNI conference 10 July 2014Why science needs open data – Jisc and CNI conference 10 July 2014
Why science needs open data – Jisc and CNI conference 10 July 2014
 
Data, Science, Society - Claudio Gutierrez, University of Chile
Data, Science, Society - Claudio Gutierrez, University of ChileData, Science, Society - Claudio Gutierrez, University of Chile
Data, Science, Society - Claudio Gutierrez, University of Chile
 
Open science, open data - FOSTER training, Potsdam
Open science, open data - FOSTER training, PotsdamOpen science, open data - FOSTER training, Potsdam
Open science, open data - FOSTER training, Potsdam
 

Similar to Open Data in a Big Data World: easy to say, but hard to do?

Similar to Open Data in a Big Data World: easy to say, but hard to do? (20)

Open Data and Open Science
Open Data and Open ScienceOpen Data and Open Science
Open Data and Open Science
 
Data as a research output and a research asset: the case for Open Science/Sim...
Data as a research output and a research asset: the case for Open Science/Sim...Data as a research output and a research asset: the case for Open Science/Sim...
Data as a research output and a research asset: the case for Open Science/Sim...
 
CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...
 
The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open Data
 
Open science curriculum for students, June 2019
Open science curriculum for students, June 2019Open science curriculum for students, June 2019
Open science curriculum for students, June 2019
 
Acting as Advocate? Seven steps for libraries in the data decade
Acting as Advocate? Seven steps for libraries in the data decadeActing as Advocate? Seven steps for libraries in the data decade
Acting as Advocate? Seven steps for libraries in the data decade
 
Stories of “Glocality"—Nations in a Global Infrastructure
Stories of “Glocality"—Nations in a Global InfrastructureStories of “Glocality"—Nations in a Global Infrastructure
Stories of “Glocality"—Nations in a Global Infrastructure
 
The State of Open Data Report by @figshare
The State of Open Data Report  by @figshareThe State of Open Data Report  by @figshare
The State of Open Data Report by @figshare
 
Rda nitrd 2015 berman - final
Rda nitrd 2015 berman  - finalRda nitrd 2015 berman  - final
Rda nitrd 2015 berman - final
 
Gobinda Chowdhury
Gobinda ChowdhuryGobinda Chowdhury
Gobinda Chowdhury
 
Open FAIR Data and Open Science: Developing Partnerships, Strategies, Policie...
Open FAIR Data and Open Science: Developing Partnerships, Strategies, Policie...Open FAIR Data and Open Science: Developing Partnerships, Strategies, Policie...
Open FAIR Data and Open Science: Developing Partnerships, Strategies, Policie...
 
Open Data in a Global Ecosystem
Open Data in a Global EcosystemOpen Data in a Global Ecosystem
Open Data in a Global Ecosystem
 
Minimal viable data reuse
Minimal viable data reuseMinimal viable data reuse
Minimal viable data reuse
 
The world of research data: when should data be closed, shared or open
The world of research data: when should data be closed, shared or openThe world of research data: when should data be closed, shared or open
The world of research data: when should data be closed, shared or open
 
Science as an Open Enterprise – Geoffrey Boulton
Science as an Open Enterprise – Geoffrey BoultonScience as an Open Enterprise – Geoffrey Boulton
Science as an Open Enterprise – Geoffrey Boulton
 
A coordinated framework for open data open science in Botswana/Simon Hodson
A coordinated framework for open data open science in Botswana/Simon HodsonA coordinated framework for open data open science in Botswana/Simon Hodson
A coordinated framework for open data open science in Botswana/Simon Hodson
 
Ebi
EbiEbi
Ebi
 
Open Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics InstituteOpen Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics Institute
 
Open Notebook Science
Open Notebook ScienceOpen Notebook Science
Open Notebook Science
 
Open Science Globally: Some Developments/Dr Simon Hodson
Open Science Globally: Some Developments/Dr Simon HodsonOpen Science Globally: Some Developments/Dr Simon Hodson
Open Science Globally: Some Developments/Dr Simon Hodson
 

More from LEARN Project

More from LEARN Project (20)

Research Data Management, Challenges and Tools - Per Öster
Research Data Management, Challenges and Tools - Per Öster Research Data Management, Challenges and Tools - Per Öster
Research Data Management, Challenges and Tools - Per Öster
 
LEARN Final Conference: Tutorial Group | Using the LEARN Model RDM Policy
LEARN Final Conference: Tutorial Group | Using the LEARN Model RDM PolicyLEARN Final Conference: Tutorial Group | Using the LEARN Model RDM Policy
LEARN Final Conference: Tutorial Group | Using the LEARN Model RDM Policy
 
LEARN Final Conference: Tutorial Group | Costing RDM
LEARN Final Conference: Tutorial Group | Costing RDMLEARN Final Conference: Tutorial Group | Costing RDM
LEARN Final Conference: Tutorial Group | Costing RDM
 
Paolo Budroni at COAR Annual Meeting
Paolo Budroni at COAR Annual MeetingPaolo Budroni at COAR Annual Meeting
Paolo Budroni at COAR Annual Meeting
 
LEARN Webinar
LEARN WebinarLEARN Webinar
LEARN Webinar
 
About Data From A Machine Learning Perspective
About Data From A Machine Learning PerspectiveAbout Data From A Machine Learning Perspective
About Data From A Machine Learning Perspective
 
LEARN Carribean Workshop Opening Remarks
LEARN Carribean Workshop Opening RemarksLEARN Carribean Workshop Opening Remarks
LEARN Carribean Workshop Opening Remarks
 
Managing Research Data in the Caribbean: Good practices and challenges
Managing Research Data in the Caribbean: Good practices and challengesManaging Research Data in the Caribbean: Good practices and challenges
Managing Research Data in the Caribbean: Good practices and challenges
 
LEARN Project: The Story So Far
LEARN Project: The Story So FarLEARN Project: The Story So Far
LEARN Project: The Story So Far
 
The Data Deluge: the Role of Research Organisations
The Data Deluge: the Role of Research OrganisationsThe Data Deluge: the Role of Research Organisations
The Data Deluge: the Role of Research Organisations
 
Data for Development in the Caribbean
Data for Development in the CaribbeanData for Development in the Caribbean
Data for Development in the Caribbean
 
Open Data in a Big World by Fernando Ariel López
Open Data in a Big World by Fernando Ariel López Open Data in a Big World by Fernando Ariel López
Open Data in a Big World by Fernando Ariel López
 
CENTRO DE DATOS
CENTRO DE DATOSCENTRO DE DATOS
CENTRO DE DATOS
 
Research Data Management in São Paulo by Fabio Kon FAPESP
Research Data Management in São Paulo by Fabio Kon FAPESPResearch Data Management in São Paulo by Fabio Kon FAPESP
Research Data Management in São Paulo by Fabio Kon FAPESP
 
Gestion de datos para la investigacion: el caso peruano by Edward Mezones, Su...
Gestion de datos para la investigacion: el caso peruano by Edward Mezones, Su...Gestion de datos para la investigacion: el caso peruano by Edward Mezones, Su...
Gestion de datos para la investigacion: el caso peruano by Edward Mezones, Su...
 
TALLER LEARN SOBRE DATOS DE INVESTIGACIÓN IMPLEMENTACIÓN DE POLÍTICAS Y ESTRA...
TALLER LEARN SOBRE DATOS DE INVESTIGACIÓN IMPLEMENTACIÓN DE POLÍTICAS Y ESTRA...TALLER LEARN SOBRE DATOS DE INVESTIGACIÓN IMPLEMENTACIÓN DE POLÍTICAS Y ESTRA...
TALLER LEARN SOBRE DATOS DE INVESTIGACIÓN IMPLEMENTACIÓN DE POLÍTICAS Y ESTRA...
 
Avances en torno a la Ley 26.899 e iniciativa regional de datos primarios de...
Avances en torno a la Ley 26.899 e iniciativa regional de datos primarios de...Avances en torno a la Ley 26.899 e iniciativa regional de datos primarios de...
Avances en torno a la Ley 26.899 e iniciativa regional de datos primarios de...
 
“Data for Development – the value of data for research and society” by Dr. Ma...
“Data for Development – the value of data for research and society” by Dr. Ma...“Data for Development – the value of data for research and society” by Dr. Ma...
“Data for Development – the value of data for research and society” by Dr. Ma...
 
Conicyt Y Mandato OECD by Patricia Muñoz, CONICYT (Chile)
Conicyt Y Mandato OECD by Patricia Muñoz, CONICYT (Chile)Conicyt Y Mandato OECD by Patricia Muñoz, CONICYT (Chile)
Conicyt Y Mandato OECD by Patricia Muñoz, CONICYT (Chile)
 
Datos Abiertos de Investigacion - Caso Mexico
Datos Abiertos de Investigacion - Caso MexicoDatos Abiertos de Investigacion - Caso Mexico
Datos Abiertos de Investigacion - Caso Mexico
 

Recently uploaded

Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
amilabibi1
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
Kayode Fayemi
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
Kayode Fayemi
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
raffaeleoman
 

Recently uploaded (18)

ICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdfICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdf
 
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar Training
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...

Open Data in a Big Data World: easy to say, but hard to do?

  • 1. Open Data in a Big Data World: easy to say, but hard to do? Sarah Callaghan sarah.callaghan@stfc.ac.uk @sorcha_ni ORCID: 0000-0002-0517-1031 Geoffrey Boulton, Dominique Babini, Simon Hodson, Jianhui Li, Tshilidzi Marwala, Maria Musoke, Paul Uhlir, Sally Wyatt 3rd LEARN workshop on Research Data Management, “Make research data management policies work” Helsinki, 28 June 2016
  • 2. Principles, Policies & Practice Responsibilities 1-2. Scientists 3. Research institutions & universities 4. Publishers 5. Funding agencies 6. Scholarly societies and academies 7. Libraries & repositories 8. Boundaries of openness Enabling practices 9. Citation and provenance 10. Interoperability 11. Non-restrictive re-use 12. Linkability http://www.icsu.org/science-international/accord
  • 4.
  • 5.
  • 6. It used to be “easy”… Suber cells and mimosa leaves. Robert Hooke, Micrographia, 1665 The Scientific Papers of William Parsons, Third Earl of Rosse 1800-1867 …but datasets have gotten so big, it’s not useful to publish them in hard copy anymore
  • 7. Hard copy of the Human Genome at the Wellcome Collection
  • 8. Example Big Data: CMIP5 CMIP5: Fifth Coupled Model Intercomparison Project • Global community activity under the World Meteorological Organisation (WMO) via the World Climate Research Programme (WCRP) •Aim: – to address outstanding scientific questions that arose as part of the 4th Assessment Report process, – improve understanding of climate, and – to provide estimates of future climate change that will be useful to those considering its possible consequences. Many distinct experiments, with very different characteristics, which influence the configuration of the models, (what they can do, and how they should be interpreted).
  • 9. Simulations: ~ 90,000 years ~ 60 experiments ~ 20 modelling centres (from around the world) using ~ 30 major(*) model configurations ~ 2 million output “atomic” datasets ~ 10's of petabytes of output ~ 2 petabytes of CMIP5 requested output ~ 1 petabyte of CMIP5 “replicated” output Which are replicated at a number of sites (including ours) Major international collaboration! Funded by EU FP7 projects (IS-ENES2, Metafor) and US (ESG) and other national sources (e.g. NERC for the UK) CMIP5 numbers
  • 10. Summary of the CMIP5 example The Climate problem needs: – Major physical e-infrastructure (networks, supercomputers) – Comprehensive information architectures covering the whole information life cycle, including annotation (particularly of quality) … and hard work populating these information objects, particularly with provenance detail. – Sophisticated tools to produce and consume the data and information objects – State of the art access control techniques Major distributed systems are social challenges as much as technical challenges. CMIP5 is Big Data, with lots of different participants and lots of different technologies. It also has a community willing to work together to standardise and automate data and metadata production and curation, and with the willingness to support the effort needed for openness.
  • 11. Big Data: •Industrialised and standardised data and metadata production •Large groups of people involved •Methods for making the data open, attribution and credit for data creation established Long Tail Data: •Bespoke data and metadata creation methods •Small groups/lone researchers •No generally accepted methods for attribution and credit for data creation. Often data is closed due to lack of effort to open it https://flic.kr/p/g1EHPR
  • 12. Most people have an idea of what a publication is
  • 13. Some examples of data (just from the Earth Sciences) 1. Time series, some still being updated e.g. meteorological measurements 2. Large 4D synthesised datasets, e.g. Climate, Oceanographic, Hydrological and Numerical Weather Prediction model data generated on a supercomputer 3. 2D scans e.g. satellite data, weather radar data 4. 2D snapshots, e.g. cloud camera 5. Traces through a changing medium, e.g. radiosonde launches, aircraft flights, ocean salinity and temperature 6. Datasets consisting of data from multiple instruments as part of the same measurement campaign 7. Physical samples, e.g. fossils
  • 14. Open Data is not a new idea Henry Oldenburg
  • 15. Data, Reproducibility and Science Science should be reproducible – other people doing the same experiments in the same way should get the same results. Observational data is not reproducible (unless you have a time machine) Therefore we need to have access to the data to confirm the science is valid! Poor data analysis generates false facts – and false facts & inaccessible data undermine science & its credibility http://www.flickr.com/photos/31333486@N00/1893012324/sizes/o/in/photostream/
  • 16. A crisis of reproducibility and credibility? The data providing the evidence for a published concept MUST be concurrently published, together with the metadata. To do otherwise is scientific MALPRACTICE Pre-clinical oncology – 89% not reproducible Why? •Misconduct/fraud •Invalid reasoning •Absent or inadequate data and/or metadata
  • 17. We’re only going to get more data More big data - linked data – machine learning The internet of things So, what must we do? •Concurrently publish data and metadata that are the evidence for a published scientific claim – to do otherwise is malpractice •Data science skills for researchers •Re-establish standards of reproducibility for a data-intensive age
  • 18. Scientific Opportunities of Big Data • Patterns not hitherto seen • Unsuspected relationships • Integrated analysis of diverse data (e.g. natural & social science) • Complex systems e.g. complexity: dynamic evolution and system state But not all research is or needs to be data-intensive https://www.clickz.com/clickz/column/2389218/create-better-content-via-humor
  • 20. Data supporting a published claim Other data for re-use & integration Pillars of the Digital Revolution Big Data Volume Velocity Variety Veracity Linked Data Many databases Semantic Relations Deeper meaning Foundations : Openness Machine analysis & learning The Open Data Edifice
  • 21. Open Data initiatives in areas of: Life sciences Earth Science, Environmental Science Food Science Agricultural Science Chemical Crystallography Bioinformatics/Genomics Linguistics Social Sciences Evolutionary biology Biodiversity Astronomy Earth Observation (GEO) Archaeology Atmospheric sciences EMBL-EBI services Labs around the world send us their data and we… Archive it Classify it Share it with other data providers Analyse, add value and integrate it …provide tools to help researchers use it A collaborative enterprise Elixir programme It is happening: bottom-up Open Data initiatives
  • 22. The Open Data Iceberg The Technical Challenge The Consent Challenge The Institutional Challenge The Funding Challenge The Support Challenge The Skills Challenge The Incentives Challenge The Mindset Challenge Processes & Organisation People Developed from: Deetjen, U., E. T. Meyer and R. Schroeder (2015). OECD Digital Economy Papers, No. 246, OECD A National Infrastructure Technology
  • 23. Scientists i. Publicly funded scientists have a responsibility to contribute to the public good through the creation and communication of new knowledge, of which associated data are intrinsic parts. They should make such data openly available to others as soon as possible after their production in ways that permit them to be re-used and re-purposed. ii. The data that provide evidence for published scientific claims should be made concurrently and publicly available in an intelligently open form. This should permit the logic of the link between data and claim to be rigorously scrutinised and the validity of the data to be tested by replication of experiments or observations. To the extent possible, data should be deposited in well-managed and trusted repositories with low access barriers. From the Accord: Responsibilities
  • 24. Creating a dataset is hard work! "Piled Higher and Deeper" by Jorge Cham www.phdcomics.com Documenting a dataset so that it is usable and understandable by others is extra work!
  • 25. “I’m all for the free sharing of information, provided it’s them sharing their information with us.” http://discworld.wikia.com/wiki/Mustrum_Ridcully Mustrum Ridcully, D.Thau., D.M., D.S., D.Mn., D.G., D.D., D.C.L., D.M. Phil., D.M.S., D.C.M., D.W., B.El.L, Archchancellor, Unseen University, Ankh-Morpork, Discworld - As quoted in “Unseen Academicals”, by Terry Pratchett
  • 26. Open is not enough! “When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.” - http://ivory.idyll.org/blog/data-management.html https://flic.kr/p/awnCQu
  • 27. Incentives for Open Data • Need reward structures and incentives for researchers to encourage them to make their data open • Data citation and publication • (again, issues with treating data as a special case of publications…)
  • 29. What the data set looks like on disk What the raw data files look like. I could make these files open easily, but no one would have a clue how to use them! The Understandability Challenge: Data
  • 30. It’s ok, I’ll just put it out there and if it’s important other people will figure it out These documents have been preserved for thousands of years! But they’ve both been translated many times, with different meanings each time. We need Metadata to preserve Information We can’t rely on Data Archaeology Phaistos Disk, 1700BC
  • 32. It’s not just data! • Experimental protocols • Workflows • Software code • Metadata • Things that went wrong! • …
  • 33. Usability, trust, metadata http://trollcats.com/2009/11/im-your-friend-and-i-only-want-whats-best-for-you-trollcat/ When you read a journal paper, it’s easy to read and get a quick understanding of the quality of the paper. You don’t want to be downloading many GB of dataset to open it and see if it’s any use to you. Need to use proxies for quality: •Do you know the data source/repository? Can you trust it? •Is there enough metadata so that you can understand and/or use the data? In the same way that not all journal publishers are created equal, not all data repositories are created equal Example metadata from a published dataset: “rain.csv contains rainfall in mm for each month at Marysville, Victoria from January 1995 to February 2009” Lindenmayer, David B.; Wood, Jeff; McBurney, Lachlan; Michael, Damian; Crane, Mason; MacGregor, Christopher; Montague-Drake, Rebecca; Gibbons, Philip; Banks, Sam C.; (2011): rain; Dryad Digital Repository. http://doi.org/10.5061/DRYAD.QP1F6H0S/3
  • 34. Should ALL data be open? Most data produced through publicly funded research should be open. But! • Confidentiality issues (e.g. named persons’ health records) • Conservation issues (e.g. maps of locations of rare animals at risk from poachers) • Security issues (e.g. data and methodologies for building biological weapons) There should be a very good reason for publicly funded data to not be open.
  • 35.
  • 36. Getting scooped http://www.phdcomics.com/comics/archive.php?comicid=795 It happened to me! I shared my data with another research group. They published the first results using that data. I wasn’t a co-author. I didn’t get an acknowledgement.
  • 37. Citeable does not equal Open! Just like you can cite a paper that is behind a paywall, you can cite a dataset that isn’t open. Making something citeable means that: • You know it exists • You know who’s responsible for it • You know where to find it • You know a little bit about it (title, abstract,…) Even if you can’t download/read the thing yourself. Citation gives benefits that encourage data producers to make their data open
  • 38. Be careful of your citations!
  • 39. Inputs Outputs Open access Administrative data (held by public authorities e.g. prescription data) Public Sector Research data (e.g. Met Office weather data) Research Data (e.g. CERN, generated in universities) Research publications (i.e. papers in journals) Open data Open science A direction of travel? Collecting the data Doing research Doing science openly Researchers - Govt & Public sector - Businesses - Citizens - Citizen scientists (communication/dialogue – joint production of knowledge) Stakeholders • Communication/dialogue must be audience-sensitive • Is it – with all stakeholder groups?
  • 40. Summary and maybe conclusions? • We need to open the products of research • to encourage innovation and collaboration • to give credit to the people who’ve created them • to be transparent and trustworthy • Openness does come at a cost! • It’s not enough for data to be open • it needs to be usable and understandable too • Data citation and publication are ways of encouraging researchers to make their data open • or at least tell the world that their data exists! • We need a culture change – but it’s already happening! http://www.keepcalm-o-matic.co.uk/default.asp
  • 41. Thanks! Any questions? sarah.callaghan@stfc.ac.uk @sorcha_ni http://citingbytes.blogspot.co.uk/ “Publishing research without data is simply advertising, not science” - Graham Steel http://blog.okfn.org/2013/09/03/publishing-research-without-data-is-simply-advertising-not-science/ http://heywhipple.com/dont-show-me-a-something-about-show-me-something/
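
Slides 33 and 37 suggest that a dataset becomes citeable once a reader knows it exists, who is responsible for it, where to find it, and a little about it. The following is a minimal Python sketch of that idea, using the Dryad example quoted on slide 33. The `DatasetRecord` class, its field names and its `open_access` flag are hypothetical, chosen only for illustration; they are not taken from the slides or from any repository's schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata record for a citeable dataset."""

    creators: list[str]       # who is responsible for it
    year: int
    title: str                # a little bit about it
    publisher: str            # the repository holding it
    doi: str                  # where to find it
    description: str = ""
    open_access: bool = False  # citeable does not have to mean open

    def missing_fields(self) -> list[str]:
        """Return the names of the required fields that are empty."""
        required = {
            "creators": self.creators,
            "title": self.title,
            "publisher": self.publisher,
            "doi": self.doi,
        }
        return [name for name, value in required.items() if not value]

    def is_citeable(self) -> bool:
        """A record is treated as citeable when no required field is empty."""
        return not self.missing_fields()

    def citation(self) -> str:
        """Format a citation in the style of the example on slide 33."""
        authors = "; ".join(self.creators)
        return (
            f"{authors} ({self.year}): {self.title}; "
            f"{self.publisher}. https://doi.org/{self.doi}"
        )


# The published dataset quoted on slide 33.
rain = DatasetRecord(
    creators=[
        "Lindenmayer, David B.", "Wood, Jeff", "McBurney, Lachlan",
        "Michael, Damian", "Crane, Mason", "MacGregor, Christopher",
        "Montague-Drake, Rebecca", "Gibbons, Philip", "Banks, Sam C.",
    ],
    year=2011,
    title="rain",
    publisher="Dryad Digital Repository",
    doi="10.5061/DRYAD.QP1F6H0S/3",
    description=(
        "rain.csv contains rainfall in mm for each month at Marysville, "
        "Victoria from January 1995 to February 2009"
    ),
    open_access=True,
)

print(rain.is_citeable())
print(rain.citation())
```

In this sketch `is_citeable()` never looks at `open_access`, which mirrors the point of slide 37: a record can be complete enough to cite even when the data themselves cannot be downloaded.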

Editor's Notes

  1. This is Henry Oldenburg, the first secretary of the newly formed Royal Society in the early 1660s. Henry was an inveterate correspondent, with those we would now call scientists both in Europe and beyond. Rather than keep this correspondence private, he thought it would be a good idea to publish it, and persuaded the new Society to do so by creating the Philosophical Transactions, which remains a top-flight journal to the present day. But he demanded two things of his correspondents: that they should submit in the vernacular and not Latin; and that evidence (data) that supported a concept must be published together with the concept. This permitted others to scrutinize the logic of the concept and the extent to which it was supported by the data, and permitted replication and re-use. Open publication of concept and evidence is the basis of “scientific self-correction”, which historians of science argue was the crucial building block on which the scientific revolution of the 18th and 19th centuries was built, and which remains fundamental to the progress of science. Openness to scrutiny by scientific peers is the most powerful form of peer review.
  2. The fundamental challenge is to scientific self-correction. Journals can no longer contain the data, and neither scientists nor journals have taken the obvious step of having data relevant to a publication concurrently available in an electronic database. (Example: last year’s Nature paper revealing that only 11% of results in 50 benchmark papers in pre-clinical oncology were replicable.) If lack of Oldenburg’s rigour in presenting evidence is widespread, a failure of replicability risks undermining science as a reliable way of acquiring knowledge and can therefore undermine its credibility.
  3. Lots of interchangeable and fluid terms but many shared principles. The word “science” is used to mean the systematic organisation of knowledge that can be rationally explained and reliably applied. It is not exclusively restricted to “natural science”.