Finding & Accessing Human Genomic Datasets

We are always looking for data
CRUK, 7th November 2016
Tweets welcome
#CamFindData
@repositiveio

Outline of the day
- Data sources and data access
- Case study: University of Cambridge
- Coffee break
- Introduction to Repositive
- Hands-on session: searching for data
- Round up and closure

Online tools used during the workshop
To ask and answer questions during the presentation:
go to slido.com
enter event code: 7315

Genome Technology Evolution
• 2001: First Human Genome Sequence
• 2005: Personal Genome Project
• 2008: UK10K
• 2013: UK 100K Project
• 2015: 1M Precision Medicine US
• 2016: AstraZeneca – HLI 2M
• Many other national and international projects

OPPORTUNITY
• Consensus among researchers, clinicians, politicians & the public that genomics will transform biomedical research, healthcare and lifestyle choices (Stephan Beck, UCL)

Public Repositories
• Required by funders
• Cannot publish unless accession number given
• Specialised
  • ENA
  • EGA
  • dbGaP
  • dbSNP…
• Generalist
  • Dryad
  • figshare

GOVERNANCE Models
• Open Access
  • E.g. PGP, CC0
  • Bermuda Accord
• Managed (Restricted or Controlled Access)
  • Data Access Committee
• No effective agreement (policy vacuum)
• Global Alliance for Genomics & Health
  • enable compatible, readily accessible, and scalable approaches for sharing

Open vs Managed Access
Open Access
75,000,000 per month
Managed Access
150 per month
500,000 fold difference (Stephan Beck, UCL)
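The fold difference quoted above follows directly from the two monthly figures. A minimal sketch of the arithmetic, with variable names chosen here purely for illustration:

```python
# Monthly figures quoted on the slide (Stephan Beck, UCL).
open_access_per_month = 75_000_000
managed_access_per_month = 150

# Fold difference = open-access figure divided by managed-access figure.
fold_difference = open_access_per_month // managed_access_per_month
print(f"{fold_difference:,} fold difference")
```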

Large amounts of data, but not accessible
[Figure: chart covering 2006 to 2015. Labels: 80+ PB sequenced genome data available in public repos; ≈ 0.5 PB open access; exponential growth rate.]
Under-utilised data has huge potential for medical research
Access to Managed Data
Benefits:
• Strict governance
• Individuals are protected
• Review of consent
• Applicant signs for full responsibility for governance
Disadvantages:
• No control of data once access is given
• High barrier for access – too high?

Often a long process
Bottlenecks:
• Finding relevant and usable data
• Getting authorisation to access data
• Formatting data
• Storing and moving data
We studied the problem with qualitative interviews followed by a survey of researchers in human genetics:
T. A. van Schaik et al., "The need to redefine genomic data sharing: a focus on data accessibility", Applied & Translational Genomics, 2014
http://tinyurl.com/schaik-dnadigest

dbGaP application process
[Flowchart. Decision boxes (Yes/No): NIH / eRA Commons login; Organisation registered with eRA; Organisation has DUNS number. Steps: Write research proposal → Submit proposal → Access granted → Find/Download/Decrypt data → Science… Delays marked on the chart, in order: + 2-3 days, + 1-2 weeks, + 1 week, + 1-2 days, + 1-4 weeks, + 1-2 days.]
PRO Tip: If you use human genomic data, apply for the GRU (General Research Use) datasets in dbGaP: one application gives access to all the GRU datasets.
Blog Post:
http://blog.repositive.io/how-to-successfully-apply-for-access-to-dbgap/

EGA application process
[Flowchart. Decision box (Yes/No): Sanger eDAM Account. Steps: Write research proposal → Submit proposal → Access granted → Find/Download/Decrypt data → Science… Delays marked on the chart, in order: + 1 hour, + 1-2 days, + 2-7 days, + 1-2 days.]
Blog Post:
http://blog.repositive.io/how-to-successfully-apply-for-access-to-ega/
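To get a feel for how the delays on the dbGaP and EGA flowcharts add up, they can be summed into a rough end-to-end range. The sketch below is illustrative only. It assumes that every delay marked on a chart lies on a single worst-case path (for dbGaP, that means no eRA Commons login, no eRA registration and no DUNS number to begin with), and that a week means 7 calendar days. The names `Delay`, `hours`, `days`, `weeks` and `total_days` are hypothetical helpers, not part of any dbGaP or EGA tooling.

```python
from dataclasses import dataclass

HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7  # assumption: calendar weeks, not working weeks


@dataclass(frozen=True)
class Delay:
    """A waiting time expressed as a (shortest, longest) range in hours."""
    low_hours: float
    high_hours: float


def hours(low, high=None):
    return Delay(low, low if high is None else high)


def days(low, high=None):
    high = low if high is None else high
    return Delay(low * HOURS_PER_DAY, high * HOURS_PER_DAY)


def weeks(low, high=None):
    high = low if high is None else high
    return days(low * DAYS_PER_WEEK, high * DAYS_PER_WEEK)


def total_days(delays):
    """Sum a list of delays and return (shortest, longest) in days."""
    low = sum(d.low_hours for d in delays) / HOURS_PER_DAY
    high = sum(d.high_hours for d in delays) / HOURS_PER_DAY
    return low, high


# Delays as marked on the two flowcharts, in the order they appear.
DBGAP = [days(2, 3), weeks(1, 2), weeks(1), days(1, 2), weeks(1, 4), days(1, 2)]
EGA = [hours(1), days(1, 2), days(2, 7), days(1, 2)]

for name, delays in (("dbGaP", DBGAP), ("EGA", EGA)):
    low, high = total_days(delays)
    print(f"{name}: roughly {low:.1f} to {high:.1f} days")
```

Under these assumptions the dbGaP path comes to about 25 to 56 days and the EGA path to about 4 to 11 days. An applicant whose organisation is already registered would skip the early dbGaP delays.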

FACTS
• Finding specific relevant genomic data for research can take up to six months for an untrained researcher without dedicated tools
• Application & response time for data access committees can vary widely depending on
  • the type of dataset
  • consent regulations of the study
  • => there is no consensus on the ‘contracts’, which differ from dataset to dataset

Researchers often choose to not access data at all

Why datasets are useful
• Validate existing studies
• Avoid unnecessary duplication
• Compare to new studies
• Enhance new datasets

Case studies
Raquel, PhD Student, London, UK.
Researching genes associated with rare eye disorders.
Problems:
- Doesn’t know where to look for data.
- Doesn't know if data even exists.
“I gave up on finding the data - it was very time consuming and not
proving fruitful – so I started focusing more on generating my own
data.”

Case studies
Mahantesh, Academic Researcher, Taipei, Taiwan.
Studying pharmacogenomics in cardiovascular epidemiology.
Problems:
- Needs lots of data.
- Knows it exists but struggles with getting access to it.
“Often it’s very hard to get the required number of cases and controls
to carry out research in public health and epidemiology.”

Case studies
Jana, Company Biocurator, Zurich, Switzerland.
Biocurating microarray and RNA-Seq data.
Problems:
- Needs lots of data.
- Lots of data out there but hard to filter down to ‘useful / relevant’
data.
“Many repositories don’t list the metadata details I need to know if a
dataset is useful to me, I can waste a lot of time searching.”

How many data sources?
How many sources of human
genomics data do you know
about?

Data sources across the globe
[Figure: data source counts by country (GB, FI, NL, FR, DE, CH, EE, BE, DK, ES, SI, IE, SE) and by US state (CA, MD, MA, WA, NY, TX, AZ, DC, NJ, NC, PA, UT, TN, CO, IN, FL, LA, VA, IL, ME, OH, MO, MI, SC, OR).]
Geolocation of 278 data sources analysed.
Found by tracking the IP address of each source.
These include:
• Public Repositories
• Universities
• Companies
• BioBanks
• Research consortiums

Data source content
[Figure panels: Assay Types; Dedicated to…]

Hundreds of data sources
…but they aren’t easy to find!
First 30 data sources listed here: http://tinyurl.com/plos-biology-repositive
[Figure: chart of data sources over time. Data labels, in order: 10, 25, 33, 35, 102, 174, 239. X-axis: Jan-15, Mar-15, Jun-15, Sep-15, Dec-15, Mar-16, Jun-16. Y-axis: 0 to 300.]

Cambridge-specific Case Study
• Postdoctoral researcher at the University of Cambridge Medical School
• Working on genetic inheritance and cancer
• Using NGS data and bioinformatics
• After searching for data online she decided to apply for:
  • 2 dbGaP datasets
  • 3 EGA datasets
Blog Post:
Pending… will be on http://blog.repositive.io/

The Research Operations Office will help you with the contracts (Data Transfer Agreements, DTAs) and signatures.
• Has a designated individual who processes all dbGaP applications, as they all abide by NIH legal restrictions and regulations about how to handle the data once access is granted
• For EGA applications, each DTA must be processed separately because there is no consensus on the ‘contracts’, which differ from dataset to dataset.

The nominated IT director will be specific to your department.
• They will need to confirm that you can support the requirements of the DTA.
• If the head of your departmental IT is not happy to sign, the head of IT for the University will be able to sign it off.

Top Tips:
• Think about your storage space!
• Think about what sort of analysis and processing you are going to do with the data once you have it. Even after such a long process, the approval could arrive before you are ready.
• Understand what you need before you start the application process!
• You may have access for a limited period
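The storage tip can be checked before an application is even submitted. The sketch below is a minimal example, not a prescribed procedure: it uses Python's standard `shutil.disk_usage` to compare free disk space with an expected dataset size. The function name `has_room_for`, the `working_factor` multiplier and the 2 TB figure are all hypothetical and should be replaced with values from your own project.

```python
import shutil


def has_room_for(dataset_bytes, path=".", working_factor=2.0):
    """Return True if `path` has enough free space for the dataset.

    working_factor is a hypothetical multiplier covering decrypted copies
    and intermediate analysis files; adjust it to your own pipeline.
    """
    needed = dataset_bytes * working_factor
    free = shutil.disk_usage(path).free
    return free >= needed


# Hypothetical example: a 2 TB download checked against the current directory.
TB = 10**12
print("Enough space:", has_room_for(2 * TB))
```

Running the check against the volume where the data will actually live (rather than the current directory) gives the more useful answer.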

1-click access to human genomic data
to make finding data as easy as finding a book on Amazon or booking a hotel on Expedia!

Simpler workflow for data access
Our expertise is data search platforms
• Discover and access
• Search, see related results
• Find colleagues & their data interests
• Co-annotate data & community feedback

We are enabling best practices
• MAKE DATA DISCOVERABLE
• SIMPLIFY WORKFLOWS
• CONTRIBUTE TO COMMUNITY
DNAdigest and Repositive – Connecting the world of genomic data
http://www.tinyurl.com/plos-biology-repositive

Connecting the world of genomic data

Hands on
1. Form groups of 2-3 people
2. Select a leader & a spokesperson
3. Choose 1 data theme you are interested in
   • E.g. colon cancer, prostate cancer, breast cancer
4. Sign up at https://discover.repositive.io/
5. Search Repositive with your selected theme

Team presentation: 2 minutes
1. Introduction
• What data did you try to find and why?
• Have you tried to search for this data before?
2. Methods
• The 5 main steps you took on Repositive to try and find this data.
3. Results
• Did you find the data on Repositive?
• What challenges did you encounter?
4. Conclusion
• Sum up your experience in 1 sentence.

Feedback on the workshop
Bugs and feedback to: Charlotte at Repositive.io
Please leave your feedback on the workshop:
http://tinyurl.com/feedback280916

http://discover.repositive.io
@repositive
