Workshop - finding and accessing data - Cambridge August 22 2016

We are always looking for data
Finding and accessing human genomic data for research
Cambridge, 22nd August 2016
Slides will be made available online
Tweets welcome #CamFindData

Outline of the day
- Data sources and data access (Charlotte)
- Case study: University of Cambridge
- Coffee break
- Introduction to Repositive (Fiona)
- Hands-on session: searching for data
- Round up and closure

Online tools used during the workshop
To ask and answer questions during the presentation:
go to slido.com
enter event code: 1641
To leave feedback on the workshop:
http://tinyurl.com/feedback220816

We are on Twitter:
@glyn_dk
@repositiveio
@DNAdigest
@CamOpenData

1. What data are you looking for?
Join at slido.com with the event code #1641
This workshop will focus on finding and accessing human genomic data.
… why would you be looking for genomic data for your research?

How much data do you need to publish a paper?
2001: 1 human genome
2012: 1000 Genomes (1092 genomes, since increased to ~2500)
2015: UK10K & deCODE (>100k individuals)
Cancer Genome Atlas ~11,000 genomes
ExAC consortium 65,000 exomes
?

Case studies
Raquel, PhD Student, London, UK.
Researching genes associated with rare eye disorders.
Problems:
- Doesn’t know where to look for data.
- Doesn’t know if data even exists.
“I gave up on finding the data – it was very time consuming and not proving fruitful – so I started focusing more on generating my own data.”
Mahantesh, Academic Researcher, Taipei, Taiwan.
Studying pharmacogenomics in cardiovascular epidemiology.
Problems:
- Needs lots of data.
- Knows it exists but struggles with getting access to it.
“Often it’s very hard to get the required number of cases and controls to carry out research in public health and epidemiology.”
Jana, Company Biocurator, Zurich, Switzerland.
Biocurating microarray and RNA-Seq data.
Problems:
- Needs lots of data.
- Lots of data out there but hard to filter down to ‘useful / relevant’ data.
“Many repositories don’t list the metadata details I need to know if a dataset is useful to me, I can waste a lot of time searching.”

What can I do?
PRO TIPS:
• Involve a statistician early on in your study design!
• Include more reference data in your analysis
• Search for collaborators who have the data you need
• Tell your colleagues and peers what type of data you have in your lab
• Use external sources of data…

Large amounts of data, but not accessible
[Figure: timeline 2006 to 2015]
• 80+ PB sequenced every year, at an exponential growth rate
• ≈ 0.5 PB of sequence available (WGS data available in public repos)
Under-utilised data has huge potential for medical research
2. Data resources from around the world
Public repositories
• some require you to apply for access, especially if the data contains clinical info or whole genome PID
• some are open access: GEO, SRA, PGP, OpenSNP, GigaDB, …
• some are consented for general research use, some have specific consent
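
Searches of the open access repositories can often be scripted. The following is a minimal sketch, assuming NCBI's public E-utilities `esearch` endpoint and its `gds` (GEO DataSets) database. It only builds the query URL; `build_esearch_url` is a hypothetical helper name and the search term is an illustrative example, not one from the workshop.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (public; check NCBI's usage guidelines before heavy use).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, db="gds", retmax=20):
    """Return an esearch URL for `term` in NCBI database `db` (gds = GEO DataSets)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)

# Illustrative query only: human datasets mentioning retina.
url = build_esearch_url('retina AND "Homo sapiens"[Organism]')
```

Fetching that URL (for example with `urllib.request.urlopen`) returns JSON in which `esearchresult.idlist` holds the matching record IDs.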

How many data sources?
How many sources of human genomics data do you know about?

Hundreds of data sources
…but they aren’t easy to find!
First 30 data sources listed here: http://dx.doi.org/10.1371/journal.pbio.1002418
[Chart: number of data sources over time; x-axis Jan-15, Mar-15, Jun-15, Sep-15, Dec-15, Mar-16, Jun-16; data labels 10, 25, 33, 35, 102, 174, 239]

[Figure: data source counts by location, including bar charts by country (GB, FI, NL, FR, DE, CH, EE, BE, DK, ES, SI, IE, SE) and by US state (CA, MD, MA, WA, NY, TX, AZ, DC, NJ, NC, PA, UT, TN, CO, IN, FL, LA, VA, IL, ME, OH, MO, MI, SC, OR)]
Data sources across the globe
Geolocation of 278 data sources analysed, found by tracking the IP address of each source.
These include:
• Public Repositories
• Universities
• Companies
• BioBanks
• Research consortiums

Data source content
Assay Types
Dedicated to…

More information about data sources
… in our recent paper:
http://tinyurl.com/plos-biology-repositive

3. Getting access to Restricted data
Benefits:
• Strict governance
• Individuals are protected
• Review of consent
• Applicant signs for full responsibility for governance
Disadvantages:
• No control of data once access is given
• High barrier for access – too high?

Data accessibility
Access type of 225 sampled data sources:
• Can download the data straight away or after logging in.
• Need to apply for access to the data.
• Has both Open and Restricted access data within one repository.

Often a long process
Bottlenecks:
• Finding relevant and usable data
• Getting authorisation to access data
• Formatting data
• Storing and moving data
We studied the problem with qualitative interviews followed by a survey of researchers in human genetics.
T. A. van Schaik et al., The need to redefine genomic data sharing: a focus on data accessibility, Applied & Translational Genomics, 2014, doi:10.1016/j.atg.2014.09.013

Often a long process
Researchers spend months trying to find and access genomic data, and often choose not to access data at all

dbGaP application process
[Flowchart] Decision points: NIH / eRA Commons login? (Yes / No); Organisation registered with eRA? (Yes / No); Organisation has DUNS number? (Yes / No)
Steps: Write research proposal → Submit proposal → Access granted → Find/Download/Decrypt data → Science…
Waiting times shown on the chart: + 2-3 days, + 1-2 weeks, + 1 week, + 1-2 days, + 1-4 weeks, + 1-2 days
PRO Tip: If you use human genomic data, apply for the GRU datasets in dbGaP: one application gives access to all the GRU datasets.
Blog Post: http://blog.repositive.io/how-to-successfully-apply-for-access-to-dbgap/

EGA application process
[Flowchart] Decision point: Sanger eDAM Account? (Yes / No)
Steps: Write research proposal → Submit proposal → Access granted → Find/Download/Decrypt data → Science…
Waiting times shown on the chart: + 1 hour, + 1-2 days, + 2-7 days, + 1-2 days
Blog Post: http://blog.repositive.io/how-to-successfully-apply-for-access-to-ega/
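
To get a feel for the overall wait, the delays printed on the two application flowcharts can be added up. This is a rough sketch under one strong assumption: that every delay applies, one after another. In practice some branches will not apply to every applicant. The helper names `parse_delay` and `total_delay` are made up for illustration.

```python
import re

# Days per unit; an hour is counted as 1/24 of a day.
UNIT_DAYS = {"hour": 1 / 24, "day": 1, "week": 7}

def parse_delay(text):
    """Parse strings like '+ 1-2 weeks' or '+ 1 hour' into (min_days, max_days)."""
    m = re.fullmatch(r"\+\s*(\d+)(?:-(\d+))?\s*(hour|day|week)s?", text.strip())
    if m is None:
        raise ValueError(f"unrecognised delay: {text!r}")
    low = int(m.group(1))
    high = int(m.group(2)) if m.group(2) else low
    per_unit = UNIT_DAYS[m.group(3)]
    return low * per_unit, high * per_unit

def total_delay(delays):
    """Sum (min, max) ranges, assuming every delay applies one after another."""
    ranges = [parse_delay(d) for d in delays]
    return sum(r[0] for r in ranges), sum(r[1] for r in ranges)

# Waiting times as printed on the two flowcharts.
DBGAP = ["+ 2-3 days", "+ 1-2 weeks", "+ 1 week", "+ 1-2 days", "+ 1-4 weeks", "+ 1-2 days"]
EGA = ["+ 1 hour", "+ 1-2 days", "+ 2-7 days", "+ 1-2 days"]

dbgap_min, dbgap_max = total_delay(DBGAP)
ega_min, ega_max = total_delay(EGA)
```

Under that all-delays-apply assumption, dbGaP comes to roughly 25 to 56 days and EGA to roughly 4 to 11 days; applicants who already have the required accounts may skip some of those waits.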

Cambridge-specific Case Study
• Post doctoral researcher at University of Cambridge Medical School
• Working on genetic inheritance and Cancer
• Using NGS data and bioinformatics
• After searching for data online she decided to apply for:
  • 2 dbGaP datasets
  • 3 EGA datasets
Blog Post: Pending… will be on http://blog.repositive.io/

The Research Operations Office will help you with the contracts (DTAs) and signatures.
• Has a designated individual who processes all dbGaP applications, as they all abide by NIH legal restrictions and regulations about how to handle the data once granted access.
• For EGA applications, each DTA must be processed separately because there is no consensus for the ‘contracts’ between each dataset.

The nominated IT director will be specific to your department.
• They will need to confirm you can support the requirements of the DTA.
• If the head of your departmental IT is not happy to sign, the head of IT for the University will be able to sign it off.

Top Tips:
Be prepared…
• Think about your storage space!
• Think about what sort of analysis and processing you are going to do with the data once you do have it. After such a long process, the approval could be too quick!!
• Designate time!
• Understand what you need before you start the application process!
• You only have 1 year!
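
The storage tip can be turned into a quick back-of-the-envelope estimate before applying. The sketch below is illustrative only: the function name, the sample count, the per-sample size and the multiplier for working copies are all hypothetical placeholders, to be replaced with figures for the datasets you actually request.

```python
def storage_needed_tb(n_samples, gb_per_sample, working_copies=2.0):
    """Estimate storage in TB: raw data plus room for intermediate files.

    working_copies is a multiplier (hypothetical default of 2.0) covering
    decrypted copies, alignments and results kept alongside the download.
    """
    if n_samples < 0 or gb_per_sample < 0 or working_copies < 1:
        raise ValueError("sizes must be non-negative and working_copies >= 1")
    return n_samples * gb_per_sample * working_copies / 1000  # 1 TB = 1000 GB

# Hypothetical request: 200 samples at 50 GB each (placeholder figures).
estimate = storage_needed_tb(200, 50)
```

Sharing a figure like this with your departmental IT early makes it easier for them to confirm that the requirements of the DTA can be supported.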

4. Not all data is restricted
Applying for access to restricted data is a hard and time consuming process.
Think about using open access data!

Make the (research) world a better place by sharing in return 
Best practices: Share in return!

Best practices
• If you expect data to be available to you, you have to make your data available too!
• Encourage collaborations: power by numbers
1. Get credit – publish and make your data available
2. Give credit – cite data sources
3. Understand consent – for all uses of clinical data

Best practices: use the tools
• Use all available tools to make your life easier:
  • Data publications → visibility and citations for your data, e.g. GigaScience and Scientific Data
  • Figshare, Zenodo, Dryad for sharing open access data
  • PhenomeCentral, Matchmaker Exchange for rare disease research
  • Repositive for finding data across repositories and making your own data discoverable

What the future holds
• Digital consent: towards automatic processing of applications
• Dynamic consent and power to the patient, e.g. PatientsKnowBest
• Privacy-preserving access to datasets: preserving control and governance with data custodian, lower barrier for access

Workshop: Finding and accessing human
genomic data for research
Fiona Nielsen – August 22nd 2016

We are always looking for data
Genetics, Cancer, Rare disease research
We need access to the right data at the right time
DNA interpretation requires lots of data

Data is not easy to find and access
FRAGMENTED: Poor visibility of available genomic data
ADMIN BURDEN: Huge overhead to manage data access
BAD CULTURE: Lack of data sharing habits in research culture

We are enabling best practices
MAKE DATA DISCOVERABLE
SIMPLIFY WORKFLOWS
CONTRIBUTE TO COMMUNITY
DNAdigest and Repositive – Connecting the world of genomic data
http://www.tinyurl.com/plos-biology-repositive

Connecting the world of genomic data

Live demo
http://discover.repositive.io

Team 2 minute presentation
1. Introduction
• What data did you try to find and why?
• Have you tried to search for this data before?
2. Methods
• The 5 main steps you took on Repositive to try and find this data.
3. Results
• Did you find the data on Repositive?
• What challenges did you encounter?
4. Conclusion
• Sum up your experience in 1 sentence.

Tell us your thoughts:
@repositiveio
@glyn_dk
And read more on http://repositive.io
Bugs and feedback to: Charlotte at Repositive.io
