Data Curation:
A BioCurator’s perspective.
Chris Hunter
21 April 2017
chris@gigasciencejournal.com
Session structure
• Introductions:
– A bit about me, a bit about you,
Housekeeping, What is GigaDB
• (Meta)Data Handling
– Curation, BioCuration, Sharing data
• BioCuration Life Cycle and tools
– Dictionaries, CVs, spreadsheets, standards
and checklists
• OpenRefine practical
Communicating in-class
• Chat channel:
http://backchannelchat.com/chat/dw131
• Feel free to ask questions, requests to
speed up/slow down
• The example files & slides available here:
ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/
Also feel free to email: chris@gigasciencejournal.com
This is me
• LinkedIn:
https://hk.linkedin.com/in/chr1shunter
• ORCID ID:
http://orcid.org/0000-0002-1335-0881
My background
• 95-99: Applied Biology Degree (Nottingham, UK)
• 99-03: Genetics/Genomics PhD (Cambridge, UK)
• 03-04: Postdoc – function of small DNA motifs
• 04-07: Postdoc – Cancer Genome Project
• 07-09: EBI – Curator for SRA
• 09-12: EBI – Bioinformatician/Curator on
Metagenomics portal
• 13-present: GigaScience Database – Lead BioCurator
Why tell you about me?
• An indication of what qualifies me to be
teaching you about curation!
• The sort of person that you might meet in
the role of BioCurator
• To show that you don’t need to know your
end goal to make a career, just make the
most of opportunities.
Who are you?
• I would like to take a few minutes to hear
from each of you (~30secs each)
• Name
• Background
• Scientific/academic interests
• Any idea what’s next for your career?
Questions?
WHAT IS GIGADB?
GigaScience journal
• GigaScience is an OPEN access publisher
of Life Science articles
• Highly reproducible articles
• Focus on Big data
• Peer reviewed for reliability
• Provide open access free to all
• Run as a not-for-profit to best benefit
researchers
What makes us different?
• GigaDB
What is GigaDB?
• Open access database
• Data organized into datasets
• Datasets associated to GigaScience
articles
• Manually curated
• Indexed and searchable metadata
enabling discoverability and reuse.
• Currently >300 datasets available
• Genomic datasets represent majority of
data (~55%)
• ~75% of all data from BGI (or
collaborators)
• ~20 different data types represented
• All manually curated
Data types
• Nucleotide:
– Genomic, Transcriptomic, Metagenomic,
• Mass spectrometry:
– Proteomics, Metabolomics, MS-Imaging.
• Software & Workflows
• Other
– Imaging, Neuroscience, Network analysis
http://www.GigaDB.org
Anatomy of a GigaDB entry
• All relevant information
is held together in
packets called Datasets
• Each dataset has a
stable DOI page
• If required there can be
a hierarchy of datasets
• Title
• Study type(s)
• Image
• Citation
• Description
• Funders
• Links to Google
scholar and EuroPMC
to see who has cited
this dataset
• Email submitter
• Link to manuscript
• Links to external
resources
Cont.
• Samples used in
the study
• Files listed as part
of the study
• History of dataset
changes
• Social media links
• Links to other
datasets of similar
nature
Downloading the data
FTP
• Conventional/easy to use
• Can pull individually from
web page
• 1 or multiple files using
command line unix
• Speed = up to 1 Mb/sec
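The command-line route can also be scripted. The following is a minimal sketch using Python's standard `ftplib`. It assumes the server accepts anonymous FTP logins, which the slides do not state. The helper names `split_ftp_url` and `download` are illustrative only, and the URL is the example file used later in this session.

```python
from ftplib import FTP
from urllib.parse import urlparse
import posixpath


def split_ftp_url(url):
    """Split an ftp:// URL into (host, directory, filename)."""
    parts = urlparse(url)
    directory, filename = posixpath.split(parts.path)
    return parts.hostname, directory, filename


def download(url, dest=None):
    """Fetch one file over anonymous FTP (assumed to be permitted)."""
    host, directory, filename = split_ftp_url(url)
    with FTP(host) as ftp:
        ftp.login()  # no arguments means an anonymous login
        ftp.cwd(directory)
        with open(dest or filename, "wb") as out:
            ftp.retrbinary("RETR " + filename, out.write)


EXAMPLE = ("ftp://climb.genomics.cn/pub/10.5524/presentations/"
           "MLIM.dir/sample_attribute_spreadsheet-example.csv")
print(split_ftp_url(EXAMPLE))
# download(EXAMPLE)  # uncomment to fetch the file (needs network access)
```

Calling `download` once per URL in a loop covers the "multiple files" case.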
Questions?
(META)DATA HANDLING
What is data?
• “Data may exist only in the eye of the
beholder: The recognition that an
observation, artifact, or record constitutes
data is itself a scholarly act.” (Borgman,
2012)
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci
Tec, 63: 1059–1078. doi:10.1002/asi.22634
What is data?
• We use the term “data” to be broadly
inclusive. It includes
– digital manifestations of literature
– laboratory data: including spectrographic, genomic
sequencing, and electron microscopy data
– observational data: remote sensing, geospatial, and
socioeconomic data
– other forms of data either generated or
compiled, by humans or machines: software,
scripts, intermediary data, tabular data used to generate
charts
How data is created
• Gathered or produced by researchers
– Observations, experiments, or models
– Survey results
– Records (census, economic, etc.)
– Digitized/born digital text and images
So what is metadata?
• Data ABOUT data
• a set of data that describes and gives
information about other data.
– http://dictionary.casrai.org/Metadata
• It’s not a new concept, think about old
catalog cards
WikiData: Tomwsulcer
Curate the data
• To classify and catalog data
• Metadata is the classification and
cataloging of data to aid discoverability
and reuse.
• Strongly reliant on controlled vocabularies
and ontological terms
Data curation is…
• “the active and ongoing management of
data throughout its entire lifecycle of
interest and usefulness to scholarship”
Cragin et al., 2007
http://hdl.handle.net/2142/3493
• I would also add:
“the cataloging of data to increase its
usefulness”
Data curation…
• Is a dynamic process
– Not a one time, or one step activity
• Happens in a lifecycle
– Creation, management, preservation
• Aims to maintain the utility of the data
What gets curated?
• Data
– At various stages
• Methods (sometimes)
– Algorithms, code
• Metadata
– Information about the data
• Links
– metadata can form networks of linked data to
help knowledge acquisition
Data curation or BioCuration?
• Distinct, but related
• Data curation is broader
• BioCuration is more specific to Biological
data curation
– “Biocuration involves the translation and
integration of information relevant to biology
into a database or resource that enables
integration of the scientific literature as well as
large data sets. ”
http://biocuration.org/dissemination/who-are-we/
BioCuration
• The process of curating biological data
• International Society for Biocuration (ISB)
– Yearly meetings
– Society website (http://biocuration.org/)
– Discussion forum
– Job adverts
BioCuration2018 - Shanghai
SHARING DATA
Why share data?
• Concepts related to the scientific method
• Reproducibility:
– Experiment can be replicated by the original
researcher or another researcher
• Reliability:
– Similar results can be achieved in other
experiments
• Re-use
– Others can make use of data in other ways
than originally intended
What’s important?
• An attractive, tabular lay-out in a
spreadsheet for presentational purposes?
• An accessible version that is suitable for
re-use with minimal editing?
• Both of the above?
– Consider releasing multiple formats of your
data
Manuscripts
• The traditional publication is a
“presentational” version of the data,
– often lurking in supplemental files as PDFs
Data Journals
• Publication option for datasets
– Often discipline-specific
– Can be peer-reviewed
• Sometimes provide a means of usable
data release, or sometimes just an
independently citable version of
supplemental files.
Data Repositories
• Where data is stored for the long term
• Computer accessible
• Some repositories are discipline-specific
– Genomic data: GenBank / ENA
• Some repositories are built for an
organization
– For a university / institute
– For a funder
– Not-for-profit (Dryad, Figshare, GigaDB, Zenodo)
FYI: GigaScience is…
• Combination of
– Peer reviewed Manuscript publication
linked to a
– Manually curated Data repository
BIO-CURATION LIFE CYCLE
(Primary) BioCuration activities
• Documentation
– Keeping track of how the data was:
• Generated; used; analyzed
• Annotation
– Addition of structured information to
accompany data/files
• Connection
– Linking of files/data to related items both
within dataset and to external items
(Ancillary) BioCuration activities
• Collection and aggregation
– Files in directories; databases
• Storage and archiving
– Saving data (on digital media)
– Providing consistent and permanent
identifiers (DOI)
• Migration
– Active preservation of data to keep it readable
• Repeat the process on an ongoing basis
The BioCurator’s tools
• Ontologies / CVs / Dictionaries
• key:value pairs, RDF/triplestores
Dictionary
• An alphabetical reference list of terms or names
important to a particular subject or activity along
with discussion of their meanings and applications
• Casrai
– Particularly IRIDIUM (Research Data Management):
• http://dictionary.casrai.org/Category:Research_Data_Domain
– Many other dictionaries maintained by Casrai
http://casrai.org/standards
Controlled Vocabularies
• A controlled vocabulary is an organized
arrangement of words and phrases used to
index content
• Can be a subset of a dictionary
Key:value pairs
• A key-value pair (KVP) is a set of two linked
data items: a key, which is a unique identifier for
some item of data, and the value, which is either
the data that is identified or a pointer to the
location of that data.
• Structured pairing of particular terms;
one or both can be from CVs
• Particularly used for computer-readable
metadata
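As a rough illustration of key:value pairs checked against a controlled vocabulary, here is a minimal sketch. The vocabulary, its terms and the function name `check_pair` are all invented for this example and are not taken from GigaDB.

```python
# Hypothetical controlled vocabulary: each key lists its permitted values.
# A value of None means free text is accepted for that key.
CONTROLLED_VOCABULARY = {
    "sex": {"male", "female", "hermaphrodite", "not determined"},
    "life stage": {"adult", "juvenile", "larva", "seed"},
    "collection date": None,
}


def check_pair(key, value):
    """Return a list of problems found for one key:value pair."""
    if key not in CONTROLLED_VOCABULARY:
        return [f"unknown key: {key!r}"]
    allowed = CONTROLLED_VOCABULARY[key]
    if allowed is not None and value not in allowed:
        return [f"value {value!r} not in vocabulary for {key!r}"]
    return []


sample = {"sex": "Male", "life stage": "adult", "tissue": "leaf"}
for key, value in sample.items():
    for problem in check_pair(key, value):
        print(problem)
```

The check is deliberately case-sensitive, so "Male" is reported. Whether to normalise case before checking is a curation decision.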
Ontologies
• a set of concepts and categories in a subject area or
domain that shows their properties and the relations
between them.
• More complex than CVs: they include relationship
information and inherited concepts
• Most ontologies in common use in
BioCuration are in fact hierarchical CVs
• Much work is being done to integrate, merge
and unify many of these into a true ontology
which will enable semantic web applications.
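To make "hierarchical CV" concrete, the sketch below stores each term with a single "is_a" parent and walks up the hierarchy to find inherited concepts. The terms and the single-parent structure are assumptions made for illustration, and real ontologies usually allow several parents and other relationship types.

```python
# Hypothetical hierarchical CV: each term maps to its "is_a" parent.
IS_A = {
    "marine sponge": "sponge",
    "sponge": "animal",
    "animal": "organism",
    "actinomycete": "bacterium",
    "bacterium": "organism",
    "organism": None,
}


def ancestors(term):
    """Walk the is_a links from a term up to the root."""
    chain = []
    parent = IS_A[term]
    while parent is not None:
        chain.append(parent)
        parent = IS_A[parent]
    return chain


def is_a(term, candidate):
    """True if candidate is the term itself or one of its ancestors."""
    return term == candidate or candidate in ancestors(term)


print(ancestors("marine sponge"))
print(is_a("actinomycete", "animal"))
```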
RDF (Resource Description
Framework)
• a model for encoding semantic relationships
between items of data so that these
relationships can be interpreted computationally.
• A complete extrapolation of all ontologies
to include all CV’s with dictionary
definitions and links to all related terms
• Entirely computer readable using URIs
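The triple idea behind RDF can be sketched in plain Python without any RDF library. Every URI below is a placeholder under example.org rather than a real identifier, and the `match` helper is a toy stand-in for a triplestore query, not a real query language.

```python
# All URIs below are placeholders under example.org, not real identifiers.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple.
triples = {
    (EX + "dataset/1", EX + "terms/hasSample", EX + "sample/42"),
    (EX + "sample/42", EX + "terms/host", EX + "taxon/sponge"),
    (EX + "sample/42", EX + "terms/collectionDate", "2006"),
    (EX + "taxon/sponge", EX + "terms/isA", EX + "taxon/animal"),
}


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Everything stated about sample 42:
for s, p, o in match(subject=EX + "sample/42"):
    print(p, "->", o)
```

Because subjects and objects share the same URIs, following one triple's object into another triple's subject is what links the data into a network.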
Questions?
Reminder for Chris: It’s probably about time for a break!
The BioCurator’s tools (2)
• Ontologies / CVs / Dictionaries
• key:value pairs, RDF/triplestores,
• tools for handling metadata (Excel, CSV,
OpenRefine)
What’s good about spreadsheets?
• Most people are familiar with them
• No programming skills required
• Can be used to make data look pretty
(highlighting, different fonts, etc)
• Are forgiving of non-data cells (e.g.
comments)
What’s bad about spreadsheets?
• They allow merging of cells & other odd
formatting to appeal to the eye.
• Dates (reformatted)
• Spreadsheet programs are not appropriate
for analysis/statistics.
• Incompatible (native) file formats with
command line software such as R
• Size limitations (requires a lot of RAM to open files
with millions of rows)
• Most people still use spreadsheets to
organize their own data
• Good practices with data collection can aid
downstream processes
Using spreadsheets wisely
• Useful reference http://kbroman.org/dataorg/
– Be consistent
– Write dates as YYYY-MM-DD
– Fill in all of the cells
– Put just one thing in a cell
– Create a data dictionary (like a CV)
– No calculations in the raw data files
– Don’t use font colour or highlighting as data
– Choose good names for things
– Make backups
– Save the data in plain text files
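Two of these recommendations (dates as YYYY-MM-DD, data saved as plain text) can be sketched with the standard library. The list of recognised input formats and the sample rows are made up for illustration. Ambiguous forms such as 02/07/2001 are left unconverted rather than guessed.

```python
import csv
import io
from datetime import datetime

# Input formats this sketch recognises (an illustrative list, not exhaustive).
KNOWN_FORMATS = ("%Y-%m-%d", "%d-%b-%Y", "%d %B %Y", "%Y/%m/%d")


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if the format is not recognised."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


rows = [("sample", "collection_date"),
        ("S1", "02-Jul-2001"),
        ("S2", "2 July 2001"),
        ("S3", "2001/07/02")]

# Write the cleaned table as plain-text CSV (to memory here, for brevity).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(rows[0])
for name, raw in rows[1:]:
    writer.writerow((name, to_iso_date(raw)))
print(buffer.getvalue())
```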
PRACTICAL EXAMPLE
Hands-on part 1 (Excel)
• First of three quick practical examples of
BioCuration
– Using Excel wisely
– Exploring the DataCite XML schema
– Rationalising data using OpenRefine
Excel
• Keep in mind: http://kbroman.org/dataorg/
• Using this file as a starting point:
• ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/sample_attribute_spreadsheet-example.csv
• It contains 10,000 rows of the GigaDB
sample attributes table
Questions
• Are the dates affected by being
manipulated via Excel?
• Do the ages all have units?
• What has happened with some of the text
in the first few rows?!
• Are all latitude and longitude values
consistent and appropriate?
Answers
• Some dates appear as serial dates (i.e. the
number of days after (or before) 1900-Jan-01),
e.g. 37074 = 2001-Jul-02
• Null dates have been converted to 0 or
1900-Jan-00
• Only 403 / 928 age values have units
• The hyphen has been converted to –
which is UTF8 code:
– http://www.i18nqa.com/debug/utf8-debug.html
• Only 2 Lat-long values in this subset and
they are in different formats!
– 29.097221 -83.067351
– 44.000306N, 16.01625E
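Two of these answers can be turned into small repair functions. This is a sketch under stated assumptions: the serial numbers are taken to come from Excel's 1900 date system, and coordinates are taken to be written latitude first. The function names are illustrative.

```python
import re
from datetime import date, timedelta

# Excel's 1900 date system counts a non-existent 1900-02-29, so using
# 1899-12-30 as day zero gives the right calendar date for any serial
# from 61 (1900-03-01) onwards.
EXCEL_DAY_ZERO = date(1899, 12, 30)


def serial_to_date(serial):
    """Convert an Excel serial day number (>= 61) to an ISO date string."""
    if serial < 61:
        raise ValueError("serials below 61 need special handling")
    return (EXCEL_DAY_ZERO + timedelta(days=serial)).isoformat()


# A number, optionally followed by a hemisphere letter.
PAIR = re.compile(r"([+-]?\d+(?:\.\d+)?)\s*([NSEW]?)", re.IGNORECASE)


def parse_lat_lon(text):
    """Return (latitude, longitude) as signed decimal degrees."""
    found = PAIR.findall(text)
    if len(found) != 2:
        raise ValueError(f"expected two numbers in {text!r}")
    values = []
    for number, hemisphere in found:
        value = float(number)
        if hemisphere.upper() in ("S", "W"):
            value = -value  # south and west are negative
        values.append(value)
    return tuple(values)


print(serial_to_date(37074))
print(parse_lat_lon("29.097221 -83.067351"))
print(parse_lat_lon("44.000306N, 16.01625E"))
```

Values such as 0 (shown by Excel as 1900-Jan-00) are rejected on purpose, since in this table they stand for a missing date rather than a real one.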
The BioCurator’s tools (3)
• Ontologies / CVs / Dictionaries
• key:value pairs, RDF/triplestores,
• tools for handling metadata (Excel, CSV,
OpenRefine)
• Database (SQL/MySQL etc.)
• Structured computational formats (XML,
JSON)
• Standards
DATA FORMATS/STANDARDS
Standards
• Examples:
– Dublin core
– GSC
• Resources:
– www.BioSharing.org
• Results of the use of standards:
– www.Repositive.io
Dublin Core
• “The Dublin Core metadata standard is a simple
yet effective element set for describing a wide
range of networked resources.”
http://dublincore.org/documents/usageguide/index.shtml
Contributor
Coverage
Creator
Date
Description
Format
Identifier
Language
Publisher
Relation
Rights
Source
Subject
Title
Type
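The element set above lends itself to a simple check of a metadata record. In this sketch the record values come from the title slide of this session, and the key "author" is a deliberate mistake to show what gets reported. Representing a record as a flat Python dict is an assumption for illustration only.

```python
# The fifteen element names listed above, lower-cased for use as keys.
DUBLIN_CORE_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})


def unknown_elements(record):
    """Return the keys of a record that are not Dublin Core elements."""
    return sorted(set(record) - DUBLIN_CORE_ELEMENTS)


record = {
    "title": "Data Curation: A BioCurator's perspective",
    "creator": "Chris Hunter",
    "date": "2017-04-21",
    "language": "en",
    "author": "Chris Hunter",  # not a Dublin Core element
}
print(unknown_elements(record))
```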
Genomic Standards Consortium
• Minimum Information about any (x) Sequence –
“MIxS” *
• Covers a variety of different
“environmental packages”
• Each recommends terms from a list of
~700 defined attributes
• Each has ~10-20 mandatory attributes
• MIxS is effectively a dictionary of attributes
* Yilmaz, P et al. Nature Biotechnology 29, 415-420 (2011) doi:10.1038/nbt.1823
Example of MIxS compliant sample
Standards in Genomic Sciences 2016 11:91
DOI: 10.1186/s40793-016-0213-3
Attributes
Description: Actinoalloteichus hymeniacidonis DSM 45092, an actinomycete isolated from the marine sponge Hymeniacidon perleve
BioProject: PRJNA273752
strain: HPA177(T) (=DSM 45092(T))
host: Hymeniacidon perleve
isolation source: intertidal marine sponge from the beach of Dalian
collection date: 2006
geographic location: China: beach of Dalian
sample type: pure culture
biomaterial provider: DSM 45092
culture collection: DSM:45092
environment biome: intertidal zone
host tissue sampled: washed sponge
latitude and longitude: 38.8667 N 121.6833 E
Publication
Effective standards and checklists
• Make extensive use of CVs, Ontologies
and KVPs
• Uptake of new standards is usually slow
and requires incentives for users
Application Programming Interface
• While webpages are human readable,
machines require structured data
• Application Programming Interface (API)
Schema design
• In order for machines to understand data
and its relationships they need to follow a
set structure (schema).
• GigaDB has a fairly complex structure as a
relational database
partially expressed in 785 lines of XSD schema for beta API
• GigaDB is complex
• DataCite is less complicated, it’s stored in
XML (the comprehensive XSD to describe it is ~500
lines)
DataCite
• The XSD is available here:
– http://schema.datacite.org/meta/kernel-4.0/metadata.xsd
• And described here:
– http://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf
• Examples are provided
– http://schema.datacite.org/meta/kernel-4.0/
A simple DataCite example
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4
http://schema.datacite.org/meta/kernel-4/metadata.xsd">
<identifier identifierType="DOI">10.5072/D3P26Q35R-Test</identifier>
<creators>
<creator>
<creatorName>Fosmire, Michael</creatorName>
</creator>
<creator>
<creatorName>Wertz, Ruth</creatorName>
</creator>
<creator>
<creatorName>Purzer, Senay</creatorName>
</creator>
</creators>
<titles>
<title>Critical Engineering Literacy Test (CELT)</title>
</titles>
<publisher>Purdue University Research Repository (PURR)</publisher>
<publicationYear>2013</publicationYear>
<subjects>
<subject>Assessment</subject>
<subject>Information Literacy</subject>
<subject>Engineering</subject>
<subject>Undergraduate Students</subject>
<subject>CELT</subject>
<subject>Purdue University</subject>
</subjects>
<language>eng</language>
<resourceType resourceTypeGeneral="Dataset">Dataset</resourceType>
<version>1</version>
<descriptions>
<description descriptionType="Abstract">
We developed an instrument, Critical Engineering Literacy Test (CELT), which is a multiple choice instrument designed to measure undergraduate students’ scientific and
information literacy skills. It requires students to first read a technical memo and, based on the memo’s arguments, answer eight multiple choice and six open-ended response
questions. We collected data from 143 first-year engineering students and conducted an item analysis. The KR-20 reliability of the instrument was .39. Item difficulties ranged
between .17 to .83. The results indicate low reliability index but acceptable levels of item difficulties and item discrimination indices. Students were most challenged when
answering items measuring scientific and mathematical literacy (i.e., identifying incorrect information).
</description>
</descriptions>
</resource>
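A record like this can be read with Python's standard `xml.etree.ElementTree`. The sketch below embeds a cut-down copy of the example (identifier, creators, title and year only) so that it runs on its own. The namespace prefix `d` is an arbitrary local choice.

```python
import xml.etree.ElementTree as ET

# Map a local prefix to the DataCite kernel-4 namespace used by the record.
NS = {"d": "http://datacite.org/schema/kernel-4"}

# A cut-down copy of the record above.
RECORD = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.5072/D3P26Q35R-Test</identifier>
  <creators>
    <creator><creatorName>Fosmire, Michael</creatorName></creator>
    <creator><creatorName>Wertz, Ruth</creatorName></creator>
    <creator><creatorName>Purzer, Senay</creatorName></creator>
  </creators>
  <titles>
    <title>Critical Engineering Literacy Test (CELT)</title>
  </titles>
  <publicationYear>2013</publicationYear>
</resource>"""

root = ET.fromstring(RECORD)
doi = root.findtext("d:identifier", namespaces=NS)
creators = [el.text for el in
            root.findall("d:creators/d:creator/d:creatorName", NS)]
title = root.findtext("d:titles/d:title", namespaces=NS)
year = int(root.findtext("d:publicationYear", namespaces=NS))

print(doi, year)
print(title)
print("; ".join(creators))
```

Every element lookup needs the namespace, because the record declares a default `xmlns`; a bare `root.find("identifier")` would return nothing.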
PRACTICAL EXAMPLE 2
Hands-on part 2 (DataCite)
• Looking at the DataCite schema
– Description:
• http://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf
• What relationships do these two DataCite
records show?:
• ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/example_datacite_100038.xml
• ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/example_datacite_101041.xml
Answers
• 100038.xml Is a New Version Of dataset
doi:10.5524/100015
• 100038.xml Is Compiled By dataset
doi:10.5524/100044
• 10.5524/101041 Continues dataset
doi:10.5524/101000
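Relationships like these sit in `relatedIdentifier` elements, each carrying a `relationType` attribute. The fragment in this sketch is hand-written to reflect the first two answers above; it is not a copy of example_datacite_100038.xml, and the DOI 10.5524/100038 for that record is assumed from its file name.

```python
import xml.etree.ElementTree as ET

NS = {"d": "http://datacite.org/schema/kernel-4"}

# Hand-written fragment reflecting the answers above (not the real file).
FRAGMENT = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.5524/100038</identifier>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI"
        relationType="IsNewVersionOf">10.5524/100015</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI"
        relationType="IsCompiledBy">10.5524/100044</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""


def relations(xml_text):
    """Return (this DOI, relation type, related DOI) for each relation."""
    root = ET.fromstring(xml_text)
    subject = root.findtext("d:identifier", namespaces=NS)
    path = "d:relatedIdentifiers/d:relatedIdentifier"
    return [(subject, el.get("relationType"), el.text)
            for el in root.findall(path, NS)]


for subject, relation, target in relations(FRAGMENT):
    print(subject, relation, target)
```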
BioCuration Life Cycle Summary
• As lead BioCurator for GigaDB, I am
involved in the schema design and data
capture of all types of life science data
behind GigaScience publications.
• We receive, appraise and ingest data into
GigaDB
• We preserve and store data
• We provide access for re-use of data
• All the while attempting to maintain
consistency
BioCuration Life Cycle Summary
Helping build knowledge from data
THE FINAL PART
OpenRefine
• According to http://openrefine.org/
“OpenRefine (formerly Google Refine) is a
powerful tool for working with messy data:
cleaning it; transforming it from one format
into another”
• Very useful for Curators to enable
exploration (and cleaning/curation) of vast
tables of metadata
PRACTICAL EXAMPLE 3
Rationalizing data using OpenRefine
OpenRefine
• Download:
– http://openrefine.org/download.html
• Install: (for Windows, just unzip it)
• Run: open file “openrefine.exe”
• Download example file:
– ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/sample_attribute_spreadsheet-example.csv
Some things to try
• Watch the 7 minute demo video:
– https://www.youtube.com/watch?v=B70J_H_zAWM
• Common transformations
– Cells to numbers
– Remove trailing white space
• Text Facet
– Look for attribute name = “analyte”
• Merge clusters
– Text facet on “attribute_name”
Quick test
• Can you find 5 problems in the
“attribute_name” column?
• Put some answers in the backchannel
http://backchannelchat.com/chat/dw131
There may be others!
• Alternative name = alternative names
• Height = Height or length = hight = high or
length
• Patient = patient ID
• Pool details = pooling details
• Specimen voucher = specimen_voucher
• Tissue = tissue type
• Life stage = life stageseed
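Some of these near-duplicates can be found by key collision: reduce each name to a normalised key and group the names that share one. The sketch below is loosely modelled on that idea and is not OpenRefine's own implementation. Treating underscores and other punctuation as spaces is a simplification chosen here, and the trailing-space variant "tissue " is invented for the example.

```python
import re
from collections import defaultdict


def fingerprint(text):
    """Simplified key: lower-case, punctuation to spaces, unique sorted tokens."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values sharing a fingerprint; return only groups of two or more."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


names = ["Specimen voucher", "specimen_voucher", "Tissue", "tissue ",
         "tissue type", "Pool details", "pooling details"]
print(cluster(names))
```

Pairs such as "Pool details" and "pooling details" do not collide on this key, so they still need a fuzzier comparison or a curator's eye.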
Looking at “value” field
• Problem is >10,000 unique terms
• Solution: first facet on attribute_name
• E.g. attribute_name = sex
– The number of different values
is 21! Can that be refined?
( I got down to 9)
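The facet-then-refine step can be mimicked outside OpenRefine with a counter. The rows and the mapping of spellings to preferred terms below are made up for illustration. They are not the 21 values found in the example file.

```python
from collections import Counter

# Made-up rows in the (attribute_name, value) shape of the example table.
rows = [
    ("sex", "male"), ("sex", "Male"), ("sex", "M"), ("sex", "female"),
    ("sex", "Female"), ("sex", "not determined"), ("age", "3 years"),
]

# Hypothetical mapping from observed spellings to one preferred term.
PREFERRED = {"male": "male", "m": "male", "female": "female", "f": "female"}


def facet(rows, attribute_name):
    """Count each distinct value recorded for one attribute."""
    return Counter(value for name, value in rows if name == attribute_name)


def refine(value):
    """Map a value to its preferred term, leaving unknown values unchanged."""
    return PREFERRED.get(value.strip().lower(), value)


before = facet(rows, "sex")
after = facet([(name, refine(value)) for name, value in rows], "sex")
print(len(before), "distinct values before,", len(after), "after")
```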
WRAP-UP
Summary
• I’m a BioCurator using a variety of experiences to
help others publish data effectively
• GigaScience is a unique publication combining the
traditional manuscript with open access to
underlying data via GigaDB
• Biocuration is a broad field from fine details to high
level metadata
• The goal of curation is to enable discovery of
knowledge
• A variety of tools are available
Further reading / useful links
• OpenRefine online tutorial
http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial
• Excel / spreadsheet do’s and don’ts
http://kbroman.org/dataorg/
• GSC MIxS
http://www.doi.org/10.1038/nbt.1823
• Casrai – dictionary and standards
http://casrai.org/standards
• List of biological standards, checklists and
databases
www.BioSharing.org
BioCuration2018 - Shanghai
Reflection: how fair is FAIR?
Read the FAIR principles
paper.
Do you think they are applicable and
feasible for HK? If it is feasible, what
is needed to implement them?
http://www.nature.com/articles/sdata201618
Reminder: Please comment in Moodle Forum. Scott will give feedback on Monday
Reminder: Final Project
• For the final project for this course you need to
choose from 3 assignment options (see
Moodle).
• The assignment is due on the 15th May and it
is worth 40% of your grade.
• Time will be set aside for presenting on this
during the final class on the 24th April:
covering why you chose the option, what
discipline/dataset/topic you are covering, and
what work you've done so far (5 mins per
student including any group feedback)
Scott needs your slides by Monday morning for 5 min presentation.
Looking ahead…
• Final project due 10th May
– Need to present preliminary version on 26th
April to get feedback before completion. Send
Scott slides by the 25th April so he can get
them ready for the class

More Related Content

What's hot

Going Full Circle: Research Data Management @ University of Pretoria
Going Full Circle: Research Data Management @ University of PretoriaGoing Full Circle: Research Data Management @ University of Pretoria
Going Full Circle: Research Data Management @ University of PretoriaJohann van Wyk
 
Repositories & Research Data Management
Repositories & Research Data ManagementRepositories & Research Data Management
Repositories & Research Data ManagementElena Yaroshenko
 
Best practices data collection
Best practices data collectionBest practices data collection
Best practices data collectionSherry Lake
 
Managing Your Research Data
Managing Your Research DataManaging Your Research Data
Managing Your Research DataKristin Briney
 
2017 05 03 Implementing Pure at UWA - ANDS Webinar Series
2017 05 03 Implementing Pure at UWA - ANDS Webinar Series2017 05 03 Implementing Pure at UWA - ANDS Webinar Series
2017 05 03 Implementing Pure at UWA - ANDS Webinar SeriesKatina Toufexis
 
Landing Pages - Joe Hourcle - RDAP12
Landing Pages - Joe Hourcle - RDAP12Landing Pages - Joe Hourcle - RDAP12
Landing Pages - Joe Hourcle - RDAP12ASIS&T
 
DataONE Education Module 07: Metadata
DataONE Education Module 07: MetadataDataONE Education Module 07: Metadata
DataONE Education Module 07: MetadataDataONE
 
Emerging domain agnostic functionalities on the handle-centered networks
Emerging domain agnostic functionalities on the handle-centered networksEmerging domain agnostic functionalities on the handle-centered networks
Emerging domain agnostic functionalities on the handle-centered networksNational Institute of Informatics
 
The Dryad Digital Repository: Published data as part of the greater data ecos...
The Dryad Digital Repository: Published data as part of the greater data ecos...The Dryad Digital Repository: Published data as part of the greater data ecos...
The Dryad Digital Repository: Published data as part of the greater data ecos...Hilmar Lapp
 
Fsci 2018 thursday2_august_am6
Fsci 2018 thursday2_august_am6Fsci 2018 thursday2_august_am6
Fsci 2018 thursday2_august_am6ARDC
 
Dk net webinar tutorial pen
Dk net webinar tutorial penDk net webinar tutorial pen
Dk net webinar tutorial penMaryann Martone
 
DataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management PlanningDataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management PlanningDataONE
 
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...dkNET
 
dkNET Poster ENDO 2016
dkNET Poster ENDO 2016 dkNET Poster ENDO 2016
dkNET Poster ENDO 2016 dkNET
 
Best Practice in Data Management and Sharing
Best Practice in Data Management and Sharing Best Practice in Data Management and Sharing
Best Practice in Data Management and Sharing Mojtaba Lotfaliany
 
Presentation to the UM Library Emergent Research Series
Presentation to the UM Library Emergent Research SeriesPresentation to the UM Library Emergent Research Series
Presentation to the UM Library Emergent Research SeriesSEAD
 

What's hot (20)

Preparing Data for Sharing: The FAIR Principles
Preparing Data for Sharing: The FAIR PrinciplesPreparing Data for Sharing: The FAIR Principles
Preparing Data for Sharing: The FAIR Principles
 
Going Full Circle: Research Data Management @ University of Pretoria
Going Full Circle: Research Data Management @ University of PretoriaGoing Full Circle: Research Data Management @ University of Pretoria
Going Full Circle: Research Data Management @ University of Pretoria
 
Repositories & Research Data Management
Repositories & Research Data ManagementRepositories & Research Data Management
Repositories & Research Data Management
 
Best practices data collection
Best practices data collectionBest practices data collection
Best practices data collection
 
Managing Your Research Data
Managing Your Research DataManaging Your Research Data
Managing Your Research Data
 
2017 05 03 Implementing Pure at UWA - ANDS Webinar Series
2017 05 03 Implementing Pure at UWA - ANDS Webinar Series2017 05 03 Implementing Pure at UWA - ANDS Webinar Series
2017 05 03 Implementing Pure at UWA - ANDS Webinar Series
 
Landing Pages - Joe Hourcle - RDAP12
Landing Pages - Joe Hourcle - RDAP12Landing Pages - Joe Hourcle - RDAP12
Landing Pages - Joe Hourcle - RDAP12
 
DataONE Education Module 07: Metadata
DataONE Education Module 07: MetadataDataONE Education Module 07: Metadata
DataONE Education Module 07: Metadata
 
Emerging domain agnostic functionalities on the handle-centered networks
Emerging domain agnostic functionalities on the handle-centered networksEmerging domain agnostic functionalities on the handle-centered networks
Emerging domain agnostic functionalities on the handle-centered networks
 
Activities of JaLC as a national service
Activities of JaLC as a national serviceActivities of JaLC as a national service
Activities of JaLC as a national service
 
The Dryad Digital Repository: Published data as part of the greater data ecos...
The Dryad Digital Repository: Published data as part of the greater data ecos...The Dryad Digital Repository: Published data as part of the greater data ecos...
The Dryad Digital Repository: Published data as part of the greater data ecos...
 
Fsci 2018 thursday2_august_am6
Fsci 2018 thursday2_august_am6Fsci 2018 thursday2_august_am6
Fsci 2018 thursday2_august_am6
 
Working with Global Infrastructure at a National Level
Working with Global Infrastructure at a National LevelWorking with Global Infrastructure at a National Level
Working with Global Infrastructure at a National Level
 
Dk net webinar tutorial pen
Dk net webinar tutorial penDk net webinar tutorial pen
Dk net webinar tutorial pen
 
DataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management PlanningDataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management Planning
 
Martone acs presentation
Martone acs presentationMartone acs presentation
Martone acs presentation
 
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...
 
dkNET Poster ENDO 2016
dkNET Poster ENDO 2016 dkNET Poster ENDO 2016
dkNET Poster ENDO 2016
 
Best Practice in Data Management and Sharing
Best Practice in Data Management and Sharing Best Practice in Data Management and Sharing
Best Practice in Data Management and Sharing
 
Presentation to the UM Library Emergent Research Series
Presentation to the UM Library Emergent Research SeriesPresentation to the UM Library Emergent Research Series
Presentation to the UM Library Emergent Research Series
 

Similar to HKU Data Curation MLIM7350 Class 9

PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...Sarah Anna Stewart
 
NIH iDASH meeting on data sharing - BioSharing, ISA and Scientific Data
NIH iDASH meeting on data sharing - BioSharing, ISA and Scientific DataNIH iDASH meeting on data sharing - BioSharing, ISA and Scientific Data
NIH iDASH meeting on data sharing - BioSharing, ISA and Scientific DataSusanna-Assunta Sansone
 
Datat and donuts: how to write a data management plan
Datat and donuts: how to write a data management planDatat and donuts: how to write a data management plan
Datat and donuts: how to write a data management planC. Tobin Magle
 






HKU Data Curation MLIM7350 Class 9

  • 1. Data Curation: A BioCurator’s perspective. Chris Hunter 21 April 2017 chris@gigasciencejournal.com
  • 2. Session structure • Introductions: – A bit about me, a bit about you, Housekeeping, What is GigaDB • (Meta)Data Handling – Curation, BioCuration, Sharing data • BioCuration Life Cycle and tools – Dictionaries, CVs, spreadsheets, standards and checklists • OpenRefine practical
  • 4. Communicating in-class • Chat channel: http://backchannelchat.com/chat/dw131 • Feel free to ask questions, requests to speed up/slow down • The example files & slides available here: ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/ Also feel free to email: chris@gigasciencejournal.com
  • 5. This is me • LinkedIn: https://hk.linkedin.com/in/chr1shunter • ORCID ID: http://orcid.org/0000-0002-1335-0881
  • 6. My background • 95-99: Applied Biology Degree (Nottingham, UK) • 99-03: Genetics/Genomics PhD (Cambridge, UK) • 03-04: Postdoc – function of small DNA motifs • 04-07: Postdoc – Cancer Genome Project • 07-09: EBI – Curator for SRA • 09-12: EBI – Bioinformatician/Curator on Metagenomics portal • 13-present: GigaScience Database – Lead BioCurator
  • 7. Why tell you about me? • An indication of what qualifies me to be teaching you about curation! • The sort of person that you might meet in the role of BioCurator • To show that you don’t need to know your end goal to make a career, just make the most of opportunities.
  • 8. Who are you? • I would like to take a few minutes to hear from each of you (~30 secs each) • Name • Background • Scientific/academic interests • Any idea what’s next for your career?
  • 11. GigaScience journal • GigaScience is an OPEN access publisher of Life Science articles • Highly reproducible articles • Focus on Big data • Peer reviewed for reliability • Provide open access free to all • Run as a not-for-profit to best benefit researchers
  • 12. What makes us different? • GigaDB
  • 13. What is GigaDB? • Open access database • Data organized into datasets • Datasets associated to GigaScience articles • Manually curated • Indexed and searchable metadata enabling discoverability and reuse.
  • 14. • Currently >300 datasets available • Genomic datasets represent the majority of data (~55%) • ~75% of all data from BGI (or collaborators) • ~20 different data types represented • All manually curated
  • 15. Data types • Nucleotide: – Genomic, Transcriptomic, Metagenomic • Mass spectrometry: – Proteomics, Metabolomics, MS-Imaging • Software & Workflows • Other – Imaging, Neuroscience, Network analysis
  • 17. Anatomy of a GigaDB entry • All relevant information is held together in packets called Datasets • Each dataset has a stable DOI page • If required there can be a hierarchy of datasets
  • 18. • Title • Study type(s) • Image • Citation • Description • Funders • Links to Google scholar and EuroPMC to see who has cited this dataset • Email submitter • Link to manuscript • Links to external resources Cont.
  • 19. • Samples used in the study • Files listed as part of the study • History of dataset changes • Social media links • Links to other datasets of similar nature
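The anatomy described in slides 17 to 19 can be pictured as a small record structure. The following is only a sketch: the class names, field names and the placeholder DOIs are assumptions made for illustration, not GigaDB's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataFile:
    """One file listed as part of a dataset (hypothetical fields)."""
    name: str
    data_type: str


@dataclass
class Dataset:
    """A 'packet' holding everything relevant to one study (hypothetical fields)."""
    doi: str                      # stable identifier for the dataset page
    title: str
    description: str
    study_types: List[str] = field(default_factory=list)
    samples: List[str] = field(default_factory=list)
    files: List[DataFile] = field(default_factory=list)
    parent_doi: Optional[str] = None   # set when datasets form a hierarchy


def children(datasets, parent):
    """Return the datasets that sit directly under `parent` in the hierarchy."""
    return [d for d in datasets if d.parent_doi == parent.doi]


# Placeholder DOIs: these are illustrative strings, not real records.
project = Dataset(doi="10.5524/EXAMPLE-0", title="Project", description="Umbrella dataset")
genome = Dataset(
    doi="10.5524/EXAMPLE-1",
    title="Genome assembly",
    description="Assembly and annotation files",
    study_types=["Genomic"],
    samples=["S001"],
    files=[DataFile(name="assembly.fa.gz", data_type="Genomic")],
    parent_doi=project.doi,
)
```

Calling `children([project, genome], project)` returns the list containing only `genome`, which is the hierarchy relationship mentioned on slide 17.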
  • 20. Downloading the data: FTP • Conventional/easy to use • Can pull individually from web page • 1 or multiple files using command line Unix • Speed = up to 1 Mb/sec
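As a rough illustration of pulling several files over FTP from a script rather than the web page, here is a minimal sketch using Python's standard `ftplib`. It assumes the server accepts anonymous logins and that the directory given on slide 4 is still online; the function names and the `downloads` folder are invented for this example.

```python
from ftplib import FTP, error_perm
from pathlib import Path
from urllib.parse import urlparse


def split_ftp_url(url):
    """Split an ftp:// URL into (host, remote_directory)."""
    parts = urlparse(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    return parts.hostname, parts.path


def download_directory(url, dest="downloads"):
    """Fetch every plain file in one remote directory (no recursion)."""
    host, remote_dir = split_ftp_url(url)
    Path(dest).mkdir(parents=True, exist_ok=True)
    saved = []
    with FTP(host) as ftp:
        ftp.login()                 # anonymous login (an assumption)
        ftp.cwd(remote_dir)
        for name in ftp.nlst():
            target = Path(dest) / Path(name).name
            try:
                with open(target, "wb") as handle:
                    ftp.retrbinary(f"RETR {name}", handle.write)
                saved.append(target)
            except error_perm:
                # Typically a sub-directory or an unreadable entry: skip it.
                target.unlink(missing_ok=True)
    return saved
```

For example, `download_directory("ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/")` would attempt to copy the class files into a local `downloads` folder.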
  • 22. Session structure • Introductions: – A bit about me, a bit about you, Housekeeping, What is GigaDB • (Meta)Data Handling – Curation, BioCuration, Sharing data • BioCuration Life Cycle and tools – Dictionaries, CVs, spreadsheets, standards and checklists • OpenRefine practical
  • 24. What is data? • “Data may exist only in the eye of the beholder: The recognition that an observation, artifact, or record constitutes data is itself a scholarly act.” (Borgman, 2012) Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
  • 25. What is data? • We use the term “data” to be broadly inclusive. It includes – digital manifestations of literature – laboratory data: including spectrographic, genomic sequencing, and electron microscopy data – observational data: remote sensing, geospatial, and socioeconomic data – other forms of data either generated or compiled, by humans or machines: software, scripts, intermediary data, tabular data used to generate charts
  • 26. How data is created • Gathered or produced by researchers – Observations, experiments, or models – Survey results – Records (census, economic, etc.) – Digitized/born digital text and images
  • 27. So what is metadata? • Data ABOUT data • A set of data that describes and gives information about other data. – http://dictionary.casrai.org/Metadata • It’s not a new concept, think about old catalog cards WikiData: Tomwsulcer
  • 28. Curate the data • To classify and catalog data • Metadata is the classification and cataloging of data to aid discoverability and reuse. • Strongly reliant on controlled vocabularies and ontological terms
  • 29. Data curation is… • “the active and ongoing management of data throughout its entire lifecycle of interest and usefulness to scholarship” Cragin et al., 2007 http://hdl.handle.net/2142/3493 • I would also add: “the cataloging of data to increase its usefulness”
  • 30. Data curation… • Is a dynamic process – Not a one time, or one step activity • Happens in a lifecycle – Creation, management, preservation • Aims to maintain the utility of the data
  • 31. What gets curated? • Data – At various stages • Methods (sometimes) – Algorithms, code • Metadata – Information about the data • Links – metadata can form networks of linked data to help knowledge acquisition
  • 32. Data curation or BioCuration? • Distinct, but related • Data curation is broader • BioCuration is more specific to Biological data curation – “Biocuration involves the translation and integration of information relevant to biology into a database or resource that enables integration of the scientific literature as well as large data sets.” http://biocuration.org/dissemination/who-are-we/
  • 33. BioCuration • The process of curating biological data • International Society for Biocuration (ISB) – Yearly meetings – Society website (http://biocuration.org/) – Discussion forum – Job adverts
  • 36. Why share data? • Concepts related to the scientific method • Reproducibility: – Experiment can be replicated by the original researcher or another researcher • Reliability: – Similar results can be achieved in other experiments • Re-use – Others can make use of data in other ways than originally intended
  • 37. What’s important? • An attractive, tabular lay-out in a spreadsheet for presentational purposes? • An accessible version that is suitable for re-use with minimal editing? • Both of the above? – Consider releasing multiple formats of your data
  • 38. Manuscripts • The traditional publication is a “presentational” version of the data, often lurking in supplemental files as PDFs
  • 39. Data Journals • Publication option for datasets – Often discipline-specific – Can be peer-reviewed • Sometimes provide a means of usable data release, or sometimes just an independently citable version of supplemental files.
  • 40. Data Repositories • Where data is stored for the long term • Computer accessible • Some repositories are discipline-specific – Genomic data: GenBank / ENA • Some repositories are built for an organization – For a university / institute – For a funder – Not-for-profit (Dryad, Figshare, GigaDB, Zenodo)
  • 41. FYI: GigaScience is… • Combination of – Peer reviewed Manuscript publication linked to a – Manually curated Data repository
  • 42. Session structure • Introductions: – A bit about me, a bit about you, Housekeeping, What is GigaDB • (Meta)Data Handling – Curation, BioCuration, Sharing data • BioCuration Life Cycle and tools – Dictionaries, CVs, spreadsheets, standards and checklists • OpenRefine practical
  • 44. (Primary) BioCuration activities • Documentation – Keeping track of how the data was: • Generated; used; analyzed • Annotation – Addition of structured information to accompany data/files • Connection – Linking of files/data to related items both within dataset and to external items
  • 45. (Ancillary) BioCuration activities • Collection and aggregation – Files in directories; databases • Storage and archiving – Saving data (on digital media) – Providing consistent and permanent identifiers (DOI) • Migration – Active preservation of data to keep it readable • Repeat the process on an ongoing basis
  • 46. The BioCurator’s tools • Ontologies / CVs / Dictionaries • key:value pairs, RDF/triplestores
  • 47. Dictionary • An alphabetical reference list of terms or names important to a particular subject or activity along with discussion of their meanings and applications • Casrai – Particularly IRIDIUM (Research Data Management): • http://dictionary.casrai.org/Category:Research_Data_Domain – Many other dictionaries maintained by Casrai: http://casrai.org/standards
  • 48. Controlled Vocabularies • A controlled vocabulary is an organized arrangement of words and phrases used to index content • Can be a subset of a dictionary
  • 49. Key:value pairs • A key-value pair (KVP) is a set of two linked data items: a key, which is a unique identifier for some item of data, and the value, which is either the data that is identified or a pointer to the location of that data. • Structured pairing of particular terms; one or both can be from CVs • Particularly used for computer-readable metadata
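A minimal Python sketch of the key:value idea, in which values for some keys are checked against a controlled vocabulary. The vocabularies and the names `CONTROLLED_VOCAB` and `validate_pairs` are hypothetical illustrations, not GigaDB's actual terms or code.

```python
# Hypothetical controlled vocabularies: key -> set of allowed values.
CONTROLLED_VOCAB = {
    "sex": {"male", "female", "hermaphrodite", "not determined"},
    "sample type": {"pure culture", "tissue", "environmental"},
}


def validate_pairs(pairs):
    """Return the (key, value) pairs whose value is not in the CV for that key.

    Keys without a controlled vocabulary are treated as free text and accepted.
    """
    problems = []
    for key, value in pairs.items():
        allowed = CONTROLLED_VOCAB.get(key)
        if allowed is not None and value not in allowed:
            problems.append((key, value))
    return problems


sample = {"sex": "Male", "sample type": "pure culture", "collection date": "2006"}
print(validate_pairs(sample))  # "Male" is flagged because the CV term is lowercase
```

The check is deliberately case-sensitive here, so it surfaces the kind of inconsistency a curator would then rationalise.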
  • 50. Ontologies • A set of concepts and categories in a subject area or domain that shows their properties and the relations between them. • More complex than CVs: includes relationship information and inherited concepts • Most ontologies in common use in BioCuration are in fact hierarchical CVs • Much work is being done to integrate, merge and unify many of these into a true ontology, which will enable semantic web applications.
  • 51. RDF (Resource Description Framework) • A model for encoding semantic relationships between items of data so that these relationships can be interpreted computationally. • Can be seen as extending ontologies so that CV terms, their dictionary definitions and links to all related terms are expressed in one model • Entirely computer readable using URIs
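As a rough illustration of the triple idea behind RDF, the sketch below stores (subject, predicate, object) statements as plain Python tuples. It assumes the example record 100038 corresponds to doi:10.5524/100038, and the `example.org` predicate URIs are hypothetical placeholders rather than terms from a real vocabulary; a real triplestore would use a dedicated library and published predicates.

```python
# Each triple is (subject, predicate, object); URIs identify the things described.
# The example.org predicates are hypothetical placeholders.
DATASET = "https://doi.org/10.5524/100038"
TRIPLES = [
    (DATASET, "http://example.org/vocab/isNewVersionOf", "https://doi.org/10.5524/100015"),
    (DATASET, "http://example.org/vocab/isCompiledBy", "https://doi.org/10.5524/100044"),
]


def objects_of(triples, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects_of(TRIPLES, DATASET, "http://example.org/vocab/isNewVersionOf"))
```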
  • 52. Questions? Reminder for Chris: It’s probably about time for a break!
  • 53. The BioCurator’s tools (2) • Ontologies / CVs / Dictionaries • key:value pairs, RDF/triplestores, • tools for handling metadata (Excel, CSV, OpenRefine)
  • 54. What’s good about spreadsheets? • Most people are familiar with them • No programming skills required • Can be used to make data look pretty (highlighting, different fonts, etc.) • Are forgiving of non-data cells (e.g. comments)
  • 55. What’s bad about spreadsheets? • They allow merging of cells & other odd formatting to appeal to the eye. • Dates get silently reformatted • Spreadsheet programs are not appropriate for analysis/statistics. • Native file formats are incompatible with command-line software such as R • Size limitations (requires a lot of RAM to open files with millions of rows)
  • 56. • Most people still use spreadsheets to organize their own data • Good practices with data collection can aid downstream processes
  • 57. Using spreadsheets wisely • Useful reference http://kbroman.org/dataorg/ – Be consistent – Write dates as YYYY-MM-DD – Fill in all of the cells – Put just one thing in a cell – Create a data dictionary (like a CV) – No calculations in the raw data files – Don’t use font colour or highlighting as data – Choose good names for things – Make backups – Save the data in plain text files
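Two of the recommendations above (fill in all of the cells, write dates as YYYY-MM-DD) are easy to check automatically once the data is saved as plain-text CSV. The following is a small sketch using only the Python standard library; the column names and the `check_rows` helper are made up for illustration.

```python
import csv
import io
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_rows(csv_text, date_column):
    """Return (row_number, message) tuples for empty cells and non-ISO dates."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, value in row.items():
            if column is None:
                # DictReader stores cells beyond the header under the key None.
                problems.append((row_number, "more cells than header columns"))
                continue
            if value is None or value.strip() == "":
                problems.append((row_number, f"empty cell in '{column}'"))
        date_value = (row.get(date_column) or "").strip()
        if date_value and not ISO_DATE.match(date_value):
            problems.append((row_number, f"date not YYYY-MM-DD: '{date_value}'"))
    return problems


# Hypothetical three-line file: the second data row has a bad date and a gap.
EXAMPLE = "sample_id,collection_date,sex\nS1,2006-05-01,female\nS2,01/05/2006,\n"
for problem in check_rows(EXAMPLE, "collection_date"):
    print(problem)
```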
  • 59. Hands-on part 1 (Excel) • First of three quick practical examples of BioCuration – Using Excel wisely – Exploring the DataCite XML schema – Rationalising data using OpenRefine
  • 60. Excel • Keep in mind: http://kbroman.org/dataorg/ • Using this file as a starting point: • ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/sample_attribute_spreadsheet-example.csv • It contains 10,000 rows of the GigaDB sample attributes table
  • 62. Questions • Are the dates affected by being manipulated via Excel? • Do the ages all have units? • What has happened with some of the text in the first few rows?! • Are all latitude and longitude values consistent and appropriate?
  • 63. Answers • Some dates appear as serial dates, i.e. the number of days after (or before) 1900-Jan-01, e.g. 37074 = 2001-Jul-02 • Null dates have been converted to 0 or 1900-Jan-00 • Only 403 / 928 age values have units • The hyphen has been converted to an en dash (–), which shows up as garbled characters when its UTF-8 bytes are read in a different encoding: – http://www.i18nqa.com/debug/utf8-debug.html • Only 2 lat-long values in this subset and they are in different formats: “29.097221 -83.067351” and “44.000306N, 16.01625E”
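The date and coordinate problems listed above can be repaired programmatically. The sketch below assumes the file was saved from Excel's default 1900 date system (Excel also has a 1904 system, which would need a different offset) and assumes coordinates come in only the two styles seen in this subset; the function names are illustrative.

```python
import re
from datetime import date, timedelta

# In Excel's 1900 date system, a serial of 61 or more (1900-03-01 onward)
# equals the number of days after 1899-12-30. Smaller serials are off by one
# because Excel treats 1900 as a leap year, so they are not handled here.
EXCEL_EPOCH = date(1899, 12, 30)


def serial_to_iso(serial):
    """Convert an Excel serial day number (>= 61) to YYYY-MM-DD; 0 means null."""
    if serial <= 0:
        return None
    return (EXCEL_EPOCH + timedelta(days=serial)).isoformat()


# A number with an optional sign, optionally followed by a hemisphere letter.
_COORD = re.compile(r"([+-]?\d+(?:\.\d+)?)\s*([NSEW]?)", re.IGNORECASE)


def parse_lat_lon(text):
    """Parse 'lat lon' or 'latN, lonE' text into signed decimal degrees."""
    matches = _COORD.findall(text)
    if len(matches) != 2:
        raise ValueError(f"expected two coordinates in {text!r}")
    values = []
    for number, hemisphere in matches:
        value = float(number)
        if hemisphere.upper() in ("S", "W"):
            value = -value  # south and west are negative in signed notation
        values.append(value)
    return tuple(values)


print(serial_to_iso(37074))
print(parse_lat_lon("29.097221 -83.067351"))
print(parse_lat_lon("44.000306N, 16.01625E"))
```

Values that do not fit either pattern raise an error rather than being guessed at, which leaves the decision with the curator.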
  • 64. The BioCurator’s tools (3) • Ontologies / CVs / Dictionaries • key:value pairs, RDF/triplestores, • tools for handling metadata (Excel, CSV, OpenRefine) • Database (SQL/MySQL etc.) • Structured computational formats (XML, JSON) • Standards
  • 66. Standards • Examples: – Dublin Core – GSC • Resources: – www.BioSharing.org • Results of the use of standards: – www.Repositive.io
  • 67. Dublin Core • “The Dublin Core metadata standard is a simple yet effective element set for describing a wide range of networked resources.” http://dublincore.org/documents/usageguide/index.shtml • Elements: Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, Type
  • 68. Genomic Standards Consortium • Minimum Information about any (x) Sequence – “MIxS” * • Covers a variety of different “environmental packages” • Each recommends terms from a list of ~700 defined attributes • Each has ~10-20 mandatory attributes • MIxS is effectively a dictionary of attributes * Yilmaz, P et al. Nature Biotechnology 29, 415-420 (2011) doi:10.1038/nbt.1823
  • 70. Example of MIxS compliant sample • Description: Actinoalloteichus hymeniacidonis DSM 45092, an actinomycete isolated from the marine sponge Hymeniacidon perleve • BioProject: PRJNA273752 • Attributes: – strain: HPA177(T) (=DSM 45092(T)) – host: Hymeniacidon perleve – isolation source: intertidal marine sponge from the beach of Dalian – collection date: 2006 – geographic location: China: beach of Dalian – sample type: pure culture – biomaterial provider: DSM 45092 – culture collection: DSM:45092 – environment biome: intertidal zone – host tissue sampled: washed sponge – latitude and longitude: 38.8667 N 121.6833 E • Publication: Standards in Genomic Sciences 2016 11:91, DOI: 10.1186/s40793-016-0213-3
  • 71. Effective standards and checklists • Make extensive use of CVs, Ontologies and KVPs • Uptake of new standards is usually slow and requires incentives for users
  • 72. Application Programming Interface • While webpages are human readable, machines require structured data • Application Programming Interface (API)
  • 73. Schema design • In order for machines to understand data and its relationships they need to follow a set structure (schema). • GigaDB has a fairly complex structure as a relational database
  • 74. The GigaDB structure is partially expressed in 785 lines of XSD schema for the beta API
  • 75. Schema design • In order for machines to understand data and its relationships they need to follow a set structure (schema). • GigaDB is complex • DataCite is less complicated, it’s stored in XML (the comprehensive XSD to describe it is ~500 lines)
  • 76. DataCite • The XSD is available here: – http://schema.datacite.org/meta/kernel-4.0/metadata.xsd • And described here: – http://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf • Examples are provided – http://schema.datacite.org/meta/kernel-4.0/
  • 77. A simple DataCite example <resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd"> <identifier identifierType="DOI">10.5072/D3P26Q35R-Test</identifier> <creators> <creator> <creatorName>Fosmire, Michael</creatorName> </creator> <creator> <creatorName>Wertz, Ruth</creatorName> </creator> <creator> <creatorName>Purzer, Senay</creatorName> </creator> </creators> <titles> <title>Critical Engineering Literacy Test (CELT)</title> </titles> <publisher>Purdue University Research Repository (PURR)</publisher> <publicationYear>2013</publicationYear> <subjects> <subject>Assessment</subject> <subject>Information Literacy</subject> <subject>Engineering</subject> <subject>Undergraduate Students</subject> <subject>CELT</subject> <subject>Purdue University</subject> </subjects> <language>eng</language> <resourceType resourceTypeGeneral="Dataset">Dataset</resourceType> <version>1</version> <descriptions> <description descriptionType="Abstract"> We developed an instrument, Critical Engineering Literacy Test (CELT), which is a multiple choice instrument designed to measure undergraduate students’ scientific and information literacy skills. It requires students to first read a technical memo and, based on the memo’s arguments, answer eight multiple choice and six open-ended response questions. We collected data from 143 first-year engineering students and conducted an item analysis. The KR-20 reliability of the instrument was .39. Item difficulties ranged between .17 to .83. The results indicate low reliability index but acceptable levels of item difficulties and item discrimination indices. Students were most challenged when answering items measuring scientific and mathematical literacy (i.e., identifying incorrect information). </description> </descriptions> </resource>
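Because a DataCite record follows a fixed schema, a machine can pull out specific fields reliably. The sketch below reads a trimmed copy of the example above with Python's standard `xml.etree.ElementTree`; the `dc4` prefix and the `summarise` function are names chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix chosen for this sketch; it maps to the DataCite kernel-4 namespace.
NS = {"dc4": "http://datacite.org/schema/kernel-4"}

# Trimmed copy of the example record (only a few elements kept).
RECORD = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.5072/D3P26Q35R-Test</identifier>
  <creators>
    <creator><creatorName>Fosmire, Michael</creatorName></creator>
    <creator><creatorName>Wertz, Ruth</creatorName></creator>
    <creator><creatorName>Purzer, Senay</creatorName></creator>
  </creators>
  <titles><title>Critical Engineering Literacy Test (CELT)</title></titles>
  <publicationYear>2013</publicationYear>
</resource>"""


def summarise(xml_text):
    """Extract DOI, title, year and creator names from a DataCite record."""
    root = ET.fromstring(xml_text)
    creator_path = "dc4:creators/dc4:creator/dc4:creatorName"
    return {
        "doi": root.findtext("dc4:identifier", namespaces=NS),
        "title": root.findtext("dc4:titles/dc4:title", namespaces=NS),
        "year": root.findtext("dc4:publicationYear", namespaces=NS),
        "creators": [el.text for el in root.findall(creator_path, NS)],
    }


print(summarise(RECORD))
```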
  • 79. Hands-on part 2 (DataCite) • Looking at the DataCite schema – Description: • http://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf • What relationships do these two DataCite records show?: • ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/example_datacite_100038.xml • ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/example_datacite_101041.xml
  • 80. Answers • 100038.xml Is a New Version Of dataset doi:10.5524/100015 • 100038.xml Is Compiled By dataset doi:10.5524/100044 • 10.5524/101041 Continues dataset doi:10.5524/101000
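Relationships like these are carried in DataCite's `relatedIdentifier` elements, so they can be listed with a few lines of code. The XML fragment below is a hypothetical reconstruction based on the answers above, not a copy of the real example file, and it assumes the record's own DOI is 10.5524/100038.

```python
import xml.etree.ElementTree as ET

NS = {"dc4": "http://datacite.org/schema/kernel-4"}

# Hypothetical fragment reconstructed from the answers; not the real file.
FRAGMENT = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.5524/100038</identifier>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI"
        relationType="IsNewVersionOf">10.5524/100015</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI"
        relationType="IsCompiledBy">10.5524/100044</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""


def relationships(xml_text):
    """Return (relationType, related identifier) pairs from a DataCite record."""
    root = ET.fromstring(xml_text)
    path = "dc4:relatedIdentifiers/dc4:relatedIdentifier"
    return [
        (el.get("relationType"), (el.text or "").strip())
        for el in root.findall(path, NS)
    ]


for relation, target in relationships(FRAGMENT):
    print(relation, target)
```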
  • 81. BioCuration Life Cycle Summary • As lead BioCurator for GigaDB, I am involved in the schema design and data capture of all types of life science data behind GigaScience publications. • We receive, appraise and ingest data into GigaDB • We preserve and store data • We provide access for re-use of data • All the while attempting to maintain consistency
  • 82. BioCuration Life Cycle Summary Helping build knowledge from data
  • 84. OpenRefine • According to http://openrefine.org/ “OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another” • Very useful for Curators to enable exploration (and cleaning/curation) of vast tables of metadata
  • 85. PRACTICAL EXAMPLE 3 Rationalizing data using OpenRefine
  • 86. OpenRefine • Download: – http://openrefine.org/download.html • Install: (on Windows, just unzip it) • Run: open file “openrefine.exe” • Download example file: – ftp://climb.genomics.cn/pub/10.5524/presentations/MLIM.dir/sample_attribute_spreadsheet-example.csv
  • 89. Some things to try • Watch the 7 minute demo video: – https://www.youtube.com/watch?v=B70J_H_zAWM • Common transformations – Cells to numbers – Remove trailing white space • Text Facet – Look for attribute name = “analyte” • Merge clusters – Text facet on “attribute_name”
  • 90. Quick test • Can you find 5 problems in the “attribute_name” column? • Put some answers in the backchannel http://backchannelchat.com/chat/dw131
  • 91. There may be others! • Alternative name = alternative names • Height = Height or length = hight = high or length • Patient = patient ID • Pool details = pooling details • Specimen voucher = specimen_voucher • Tissue = tissue type • Life stage = life stageseed
  • 92. Looking at “value” field • Problem is >10,000 unique terms • Solution: first facet on attribute_name • E.g. attribute_name = sex – The number of different values is 21! Can that be refined? (I got down to 9)
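Some of these duplicates differ only in case, underscores or stray spaces, which is the kind of variation key-based clustering catches. The sketch below is a simplified Python approximation of that idea (normalise each value to a key, then group values sharing a key); it is not OpenRefine's actual implementation, and the function names are illustrative.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Simplified key: lowercase, treat punctuation as spaces, sort unique tokens."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint; return groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


names = ["Specimen voucher", "specimen_voucher", "Tissue", "tissue type", "Pool details "]
print(cluster(names))
```

Pairs such as "Tissue" and "tissue type" produce different keys and are not merged, so they still need a curator's judgement.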
  • 94. Summary • I’m a BioCurator using a variety of experiences to help others publish data effectively • GigaScience is a unique publication combining the traditional manuscript with open access to underlying data via GigaDB • Biocuration is a broad field from fine details to high level metadata • The goal of curation is to enable discovery of knowledge • A variety of tools are available
  • 95. Further reading / useful links • OpenRefine online tutorial http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial • Excel / spreadsheet do’s and don’ts http://kbroman.org/dataorg/ • GSC MIxS http://www.doi.org/10.1038/nbt.1823 • Casrai – dictionary and standards http://casrai.org/standards • List of biological standards, checklists and databases www.BioSharing.org
  • 97. Reflection: how fair is FAIR? Read the FAIR principles paper. Do you think they are applicable and feasible for HK? If it is feasible, what is needed to implement them? http://www.nature.com/articles/sdata201618 Reminder: Please comment in Moodle Forum. Scott will give feedback on Monday
  • 98. Reminder: Final Project • For the final project for this course you need to choose from 3 assignment options (see Moodle). • The assignment is due on the 15th May and it is worth 40% of your grade. • Time will be set aside for presenting on this during the final class on the 24th April: covering why you chose the option, what discipline/dataset/topic you are covering, and what work you've done so far (5 mins per student including any group feedback) • Scott needs your slides by Monday morning for the 5 min presentation.
  • 99. Looking ahead… • Final project due 10th May – Need to present preliminary version on 26th April to get feedback before completion. Send Scott slides by the 25th April so he can get them ready for the class

Editor's Notes

  1. http://schema.datacite.org/meta/kernel-4.0/example/datacite-example-dataset-v4.0.xml https://search.datacite.org/ (new) https://search.datacite.org/ui/ (old)