Open Data HK: open science meets open data. A primer from Scott Edmunds

Scott Edmunds
@SCEdmunds
@GigaScience

Can this be considered open data?

http://biology.clc.uc.edu/fankhauser/labs/genetics/dna_isolation/thymus_dna.htm

Does this qualify as open source?

http://2011.igem.org/Team:UC_Davis

What is Open (Science) Data?

• Something very, very, very geeky
• Free & open access to data about the world around us
o Searchable, findable
o Machine-readable, app-makeable, Excel-usable
o Without restrictions/limitations

• This (examples)

About me:

• Scott Edmunds
• Molecular biology, sci editing & comms
• Scientific journal & (big) data publishing
• Reproducibility & open science

Journal, data-platform and database for
large-scale biological data
www.gigasciencejournal.com

About my employer:
• Formerly Beijing Genomics Institute
• Founded in 1999 (1% of HGP)
• China’s 1st citizen-managed not-for-profit research
institute funded by commercial sequencing-as-a-service
(BGI Tech)
• Now the largest genomics organization in the world
• HQ in Shenzhen, most data production in BGI HK (Tai Po)

Standing on the shoulders of giants

Open Data 1665?

“Scholarly articles are merely advertisement of scholarship. The
actual scholarly artefacts, i.e. the data and computational
methods, which support the scholarship, remain largely
inaccessible.”
Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995

OKFN: 8 types of open data

http://science.okfn.org/

Panton Principles

http://pantonprinciples.org/

Science Data Volumes
Astrophysics: Exabytes (Square Kilometer Array)
HE Physics: 100’s of Petabytes (Large Hadron Collider)
Biology: Petabytes (Sequencing, Mass Spec, Imaging)

The long tail of scientific data…
Esoteric formats, poorly structured,
Tabular, often spreadsheet based
Issues the open data community is well used to
(data cleaning, scraping, etc.)

Open Data in Physics
1961 CERN pre-prints shelf

1991-date arXiv

http://cerncourier.com/cws/article/cern/28654
http://arxiv.org/

Open Data in Biology
1934: newsletter era
1980: database era
1987: online era
2010’s: “bioinformatics bingo” era

BGI HK Chamber O’Illumina’s
The LHC of Biology?
20PB of storage

Genomics: open-data success story?

1st Gen v 2nd Gen

Sharing/reproducibility helped by stability of:
1. Platforms
2. Repositories
3. Standards

Genomics Data Sharing Policies…
Bermuda Accords 1996/1997/1998:
1. Automatic release of sequence assemblies within 24 hours.
2. Immediate publication of finished annotated sequences.
3. Aim to make the entire sequence freely available in the public domain for
both research and development in order to maximise benefits to society.

Fort Lauderdale Agreement, 2003:
1. Sequence traces from whole genome shotgun projects are to be
deposited in a trace archive within one week of production.
2. Whole genome assemblies are to be deposited in a public nucleotide
sequence database as soon as possible after the assembled sequence
has met a set of quality evaluation criteria.

Toronto International data release workshop, 2009:
The goal was to reaffirm and refine, where needed, the policies related to
the early release of genomic data, and to extend, if possible, similar data
release policies to other types of large biological datasets – whether from
proteomics, biobanking or metabolite research.

Sharing aids fields…
Rice v Wheat: consequences of publicly available
genome data.
[Figure: rice v wheat comparison; vertical axis 0 to 700]

Digitizing the world

Can we make everything open data?

The (non-) human centipede: first sequence

NO

[Diagram labels: PUBLISHER NARRATIVE; CURATION/INTEGRATION; SOURCE; DATA; USER (SOCIAL) MEDIA; EXTERNAL DATABASES (Morphbank, ARRAYEXPRESS)]

DATA PRODUCTION
• Genomics
• Barcoding
• Imaging
• microCT
• Video

What is open science? 5 flavours:

Benedikt Fecher and Sascha Friesike: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2272036

Biggest Challenge: Closed Access

WWW.RIGHTTORESEARCH.ORG

A handful of closed-access STM publishers control the market
Force libraries to buy “bundles”

Revenue >$9B
Average cost per article >$5,000 USD
Publishers retain copyright
Prevent data mining of content
Withhold information from the 99.9% who need it!

Publishing: better than a gold mine

See: http://alexholcombe.wordpress.com/2013/01/09/scholarly-publishers-and-their-high-profits/

Increasing strain on library budgets
MIT library purchases v inflation 1986-2006
[Chart: percentage change (-50% to 400%) by year, 1986 to 2004. Series: Consumer Price Index % +; Serial Expenditures % +; # Books Purchased % +; Book Expenditures % +; # Serials Purchased % +. Annotations: Journal expenditure; Inflation.]

The good news: the fightback has started…

http://thecostofknowledge.com/

The Solution: Open Access
Budapest Open Access Initiative:
“By “open access” to [peer-reviewed research literature], we mean its
free availability on the public internet, permitting any users to
read, download, copy, distribute, print, search, or link to the full texts
of these articles, crawl them for indexing, pass them as data to
software, or use them for any other lawful purpose, without
financial, legal, or technical barriers other than those inseparable
from gaining access to the internet itself. The only constraint on
reproduction and distribution, and the only role for copyright in this
domain, should be to give authors control over the integrity of their
work and the right to be properly acknowledged and cited.”

• Maximizes reuse and access
• Gives authors control over the integrity of their work and the right
to be properly acknowledged and cited.
• “Real” OA asks for no restrictions/limitations = CC-BY

Hong Kong: off the map
Push the button!

https://www.openaccessbutton.org/

Hong Kong: good with theses…

http://hub.hku.hk/

Hong Kong: still some work to go with OA

…Singapore beats us

Pragmatic/Infrastructure:
Crowdsourcing, wisdom of the masses

Wiki science:
GeneWiki
• 10,000 distinct gene pages.
• 1.42 million words and 78MB data.
• 50 million views & 15,000 edits per year.
http://en.wikipedia.org/wiki/Portal:Gene_Wiki

GitHub science:

A hypothetical Git workflow for a scientific collaboration involving 3 authors.
Karthik Ram: http://www.scfbm.org/content/8/1/7

Our crowdsourcing example:

To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released here into the public domain
under a CC0 license. Until the publication of research papers on the
assembly and whole-genome analysis of this isolate we would ask you to
cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J;
Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y;
Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X;
Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen.
doi:10.5524/100001
http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.

Downstream consequences:
1. Citations (~180)
2. Therapeutics (primers, antimicrobials)
3. Platform Comparisons
4. Example for faster & more open science

“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli
strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days
for the lawyers at his company — Pacific Biosciences — to parse the agreements governing how his team could
use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that
allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and
publish their work without wasting time on legal wrangling.”

1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully
illustrated by events following an outbreak of a severe gastrointestinal infection in Hamburg in Germany in May 2011. This
spread through several European countries and the
US, affecting about 4000 people and resulting in over 50
deaths. All tested positive for an unusual and little-known
Shiga-toxin–producing E. coli bacterium. The strain was initially
analysed by scientists at BGI-Shenzhen in China, working
together with those in Hamburg, and three days later a draft
genome was released under an open data licence. This
generated interest from bioinformaticians on four continents. 24
hours after the release of the genome it had been assembled.
Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These
analyses provided crucial information about the strain’s
virulence and resistance genes – how it spreads and which
antibiotics are effective against it. They produced results in
time to help contain the outbreak. By July 2011, scientists
published papers based on this work. By opening up their early
sequencing results to international collaboration, researchers in
Hamburg produced results that were quickly tested by a wide
range of experts, used to produce new knowledge and
ultimately to control a public health emergency.

Pragmatic/Infrastructure:
Open Innovation Challenges

http://www.scientificamerican.com/openinnovation/

http://www.gov.hk/en/theme/psi/contest/contest_events.htm

Indie Science

Biohacker spaces
CoResearch labs
Crowdfunding
DIYbio
Open hardware
http://www.perlsteinlab.com/

Biggest crowdfunding successes

Utilizing students: iGEM

iGEM:

http://2011.igem.org/Team:UC_Davis

The “People’s Parrot”
Puerto Rican Parrot Genome Project (Amazona vittata)
Rarest parrot, national bird of Puerto Rico

Community funded from artworks, fashion shows, beer brands, crowdfunding…
Genome annotated by students in community college as part of bioinformatics education
Paper and Data published in GigaScience and GigaDB

Taras K Oleksyk, et al., (2012) A Locally Funded Puerto Rican Parrot (Amazona vittata) Genome Sequencing Project Increases Avian Data and Advances Young
Researcher Education. GigaScience 2012, 1:14
Steven J. O’Brien. (2012): Genome empowerment for the Puerto Rican parrot – Amazona vittata. GigaScience 2012, 1:13
Oleksyk et al., (2012): Genomic data of the Puerto Rican Parrot (Amazona vittata) from a locally funded project. GigaScience.
http://dx.doi.org/10.5524/100039

Public: Citizen Science
Galaxy Zoo:

Zooniverse:

887,355 “Zooites” and counting
https://www.zooniverse.org/

Public: Citizen Science
1987-1997

http://sabap2.adu.org.za/

Easy to get started…

http://crowdcrafting.org/

Public: Games with a Purpose

http://fold.it/
http://www.sciencegamecenter.org/

https://apps.facebook.com/fraxinusgame/

OpenSciDev

http://openscidev.com/

OpenSciDev
Questions asked:
1. What value framework is a prerequisite for open science?
2. How can open science support visibility and communication of
science outside formal academic structures?
3. How can open science create education?
4. How can the economic and social value of open science be
measured?

Currently working on:
• Writing working paper on these questions
• Building networks across Africa, Asia, Latin America and the
Caribbean.
• Setting up call for funding for OpenSciDev projects ($2-3M)
http://openscidev.com/

To summarize:
• Open data is more than just government data
(although research data mostly is government funded too)
• Need for OA advocates & policies in Hong Kong (role for ODHK?)
• The science community can still learn much about open licensing
• The wider open data community can learn much about community
engagement from Citizen Science, GWAP, etc.

• Asia (inc. HK) is behind the US/EU on many of these activities, but can
we learn lessons from the success of iGEM and the “Jamboree” model?
[…King]
