Finding and accessing human genomic data for research
University of Cambridge, United Kingdom | Seminar Room G
Monday, 22 August 2016 from 10:00 to 12:00 (BST)
Charlotte, Nadia and Fiona presented an overview of data sources around the world where you can find genomics data for your research and gave examples of the data access application for dbGaP and EGA with specific details relevant for University of Cambridge researchers.
Workshop - finding and accessing data - Cambridge August 22 2016
1. We are always looking for data
Finding and accessing human
genomic data for research
Cambridge, 22nd August 2016
Slides will be made available online
Tweets welcome #CamFindData
2. Outline of the day
- Data sources and data access (Charlotte)
- Case study: University of Cambridge
- Coffee break
- Introduction to Repositive (Fiona)
- Hands-on session: searching for data
- Round up and closure
3. On-line tools used during the workshop
To ask questions during the presentation and answer questions:
go to slido.com
enter event code: 1641
To leave feedback on the workshop:
http://tinyurl.com/feedback220816
4. We are on twitter:
@glyn_dk
@repositiveio
@DNAdigest
@CamOpenData
5. 1. What data are you looking for?
Join at slido.com with the event
code #1641
This workshop will focus on finding
and accessing human genomic data.
… why would you be looking for
genomic data for your research?
6. How much data do you need to publish a paper?
2001: 1 human genome
2012: 1000 Genomes (1092 genomes, since increased to ~2500)
2015: UK10K & deCODE (>100k individuals)
Cancer Genome Atlas ~11,000 genomes
ExAC consortium 65,000 exomes
?
7. Case studies
Raquel, PhD Student, London,
UK.
Researching genes associated
with rare eye disorders.
Problems:
- Doesn’t know where to look
for data.
- Doesn't know if data even
exists.
“I gave up on finding the data -
it was very time consuming and
not proving fruitful – so I
started focusing more on
generating my own data.”
Mahantesh, Academic
Researcher, Taipei, Taiwan.
Studying pharmacogenomics in
cardiovascular epidemiology.
Problems:
- Needs lots of data.
- Knows it exists but struggles
with getting access to it.
“Often it’s very hard to get the
required number of cases and
controls to carry out research
in public health and
epidemiology.”
Jana, Company Biocurator,
Zurich, Switzerland.
Biocurating microarray and
RNA-Seq data.
Problems:
- Needs lots of data.
- Lots of data out there but
hard to filter down to ‘useful
/ relevant’ data.
“Many repositories don’t list the
metadata details I need to
know if a dataset is useful to
me, I can waste a lot of time
searching.”
8. What can I do?
PRO TIPS:
Involve a statistician early on in your study design!
Include more reference data in your analysis
Search for collaborators who have the data you need
Tell your colleagues and peers what type of data you
have in your lab
Use external sources of data….
9. Large amounts of data, but not accessible
[Chart: timeline 2006 to 2015]
≈ 0.5 PB sequence available
80+ PB sequenced every year
WGS data available in public repos
Exponential growth rate
Under-utilised data has huge potential for medical research
10. 2. Data resources from around the world
Public repositories
• some you apply for access,
especially if data contains
clinical info or whole genome
PID
• some are open access: GEO,
SRA, PGP, OpenSNP, GigaDB, …
• some are consented for
general research use, some
have specific consent
11. How many data sources?
How many sources of human
genomics data do you know
about?
12. Hundreds of data sources
…but they aren’t easy to find!
First 30 data sources listed here: http://dx.doi.org/10.1371/journal.pbio.1002418
[Chart: values 10, 25, 33, 35, 102, 174, 239 plotted against Jan-15, Mar-15, Jun-15, Sep-15, Dec-15, Mar-16, Jun-16; y-axis 0 to 300]
14. Data sources across the globe
[Figure: counts of data sources by location, with bar charts by European country (GB, FI, NL, FR, DE, CH, EE, BE, DK, ES, SI, IE, SE) and by US state (CA, MD, MA, WA, NY, TX, AZ, DC, NJ, NC, PA, UT, TN, CO, IN, FL, LA, VA, IL, ME, OH, MO, MI, SC, OR)]
Geolocation of 278
data sources analysed.
Found by tracking IP address
of the source.
These include:
Public Repositories
Universities
Companies
BioBanks
Research consortiums
17. More information about data sources
… in our recent paper:
http://tinyurl.com/plos-biology-repositive
18. 3. Getting access to Restricted data
Benefits:
• Strict governance
• Individuals are protected
• Review of consent
• Applicant signs for full
responsibility for governance
Disadvantages:
• No control of data once access
is given
• High barrier for access – too
high?
19. Data accessibility
Can download the
data straight away
or after logging in.
Need to apply for
access to the data.
Has both Open and Restricted
access data within one
repository.
Access type of 225
sampled data sources.
20. Often a long process
Bottlenecks:
• Finding relevant and usable
data
• Getting authorisation to
access data
• Formatting data
• Storing and moving data
We studied the problem with
qualitative interviews followed
by a survey of researchers in
human genetics
T. A. van Schaik et al
The need to redefine genomic data sharing: a focus on
data accessibility, Applied & Translational Genomics, 2014
10.1016/j.atg.2014.09.013
21. Often a long process
Researchers spend months trying to find and access genomic data, and often choose not to
access data at all
22. dbGaP application process
NIH / eRA Commons login
No
Yes
Organisation registered with eRA
Organisation has DUNS number
No
No
Write research proposal
Yes
+ 2-3 days
+ 1-2 weeks
+ 1 week
Yes
Submit proposal
+ 1-2 days
Access granted
Find/Download/Decrypt data
+ 1-4 weeks
Science…
+ 1-2 days
PRO Tip: If you use human
genomic data, apply for the
GRU datasets in dbGaP: one
application gives access to all the
GRU datasets.
Blog Post:
http://blog.repositive.io/how-to-successfully-apply-for-access-to-dbgap/
23. EGA application process
Sanger eDAM Account
No
Write research proposal
+ 1 hour
Yes
Submit proposal
+ 1-2 days
Access granted
Find/Download/Decrypt data
+ 2-7 days
Science…
+ 1-2 days
Blog Post:
http://blog.repositive.io/how-to-successfully-apply-for-access-to-ega/
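The step durations shown on the dbGaP and EGA slides can be added up to give a rough end-to-end range. The sketch below is a minimal illustration, assuming that every step on each slide applies and that the steps run one after another; applicants who already have a login or a registered organisation would skip the early steps. The names `DBGAP_STEPS`, `EGA_STEPS` and `total_range_days` are hypothetical and used only for this example.

```python
HOURS_PER_DAY = 24


def days(n):
    """Convert days to hours."""
    return n * HOURS_PER_DAY


def weeks(n):
    """Convert weeks to hours."""
    return n * 7 * HOURS_PER_DAY


# Each entry is (minimum_hours, maximum_hours), copied from the slide durations.
# Assumption: all steps apply and are strictly sequential.
DBGAP_STEPS = [
    (days(2), days(3)),
    (weeks(1), weeks(2)),
    (weeks(1), weeks(1)),
    (days(1), days(2)),
    (weeks(1), weeks(4)),
    (days(1), days(2)),
]

EGA_STEPS = [
    (1, 1),  # "+ 1 hour"
    (days(1), days(2)),
    (days(2), days(7)),
    (days(1), days(2)),
]


def total_range_days(steps):
    """Return (shortest, longest) total duration in days."""
    shortest = sum(low for low, _ in steps) / HOURS_PER_DAY
    longest = sum(high for _, high in steps) / HOURS_PER_DAY
    return shortest, longest


for name, steps in (("dbGaP", DBGAP_STEPS), ("EGA", EGA_STEPS)):
    low, high = total_range_days(steps)
    print(f"{name}: {low:.1f} to {high:.1f} days")
```

Under these assumptions the dbGaP route is measured in weeks to months, while the EGA route is measured in days to weeks, which is worth factoring into project planning.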
24. Cambridge specific Case Study
• Postdoctoral researcher at University of Cambridge
Medical School
• Working on genetic inheritance and cancer
• Using NGS data and bioinformatics
• After searching for data online she decided to apply for:
• 2 dbGaP datasets
• 3 EGA datasets
Blog Post:
Pending… will be on http://blog.repositive.io/
25. Cambridge specific Case Study
The Research Operations Office will help you with the
contracts (DTAs) and signatures.
• Has a designated individual who processes all dbGaP
applications, as they all abide by NIH legal restrictions and
regulations about how to handle the data once access is
granted.
• For EGA applications, each DTA must be processed
separately because there is no consensus for the ‘contracts’
between datasets.
Blog Post:
Pending… will be on http://blog.repositive.io/
26. Cambridge specific Case Study
The nominated IT director will be specific to your
department.
• They will need to confirm you can support the requirements of
the DTA.
• If the head of your departmental IT is not happy to sign, the
head of IT for the University will be able to sign it off.
Blog Post:
Pending… will be on http://blog.repositive.io/
27. Cambridge specific Case Study
Top Tips:
Be prepared…
• Think about your storage space!
• Think about what sort of analysis and processing you are
going to do with the data once you do have it. After such a
long process, the approval could be too quick!!
• Designate time!
• Understand what you need before you start the application
process!
• You only have 1 year!
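To make the "think about your storage space" tip concrete, a back-of-the-envelope estimate can be scripted before applying. This is only a sketch: the per-sample size and the number of working copies are placeholder assumptions, not figures from the workshop, and should be replaced with the file sizes reported by the repository for the datasets you request. The function name `estimate_storage_tb` is hypothetical.

```python
def estimate_storage_tb(n_samples, gb_per_sample, working_copies=2.0):
    """Estimate storage in terabytes (1 TB = 1000 GB).

    working_copies is a multiplier for the downloaded files plus any
    decrypted or intermediate copies kept during analysis (an assumption).
    """
    return n_samples * gb_per_sample * working_copies / 1000.0


# Placeholder inputs: 500 samples at 100 GB each, kept in two copies.
needed_tb = estimate_storage_tb(n_samples=500, gb_per_sample=100, working_copies=2.0)
print(f"Plan for roughly {needed_tb:.0f} TB")
```

Running the estimate early makes it easier to confirm with departmental IT that the storage requirements of the DTA can be supported.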
28. 4. Not all data is restricted
Applying for access to restricted
data is a hard and time
consuming process.
Think about using open access
data!
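As one illustration of searching an open access source, the sketch below builds a query URL for the NCBI E-utilities `esearch` endpoint against the SRA database. The search term is an arbitrary example, the helper name `build_esearch_url` is hypothetical, and no request is sent here; check the NCBI usage guidelines before running queries in bulk.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=20):
    """Build an esearch URL that returns matching record IDs as JSON."""
    params = {
        "db": database,          # e.g. "sra" (Sequence Read Archive) or "gds" (GEO DataSets)
        "term": term,            # Entrez query string
        "retmax": max_results,   # maximum number of IDs to return
        "retmode": "json",
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Example query (illustrative only): human exome records in SRA.
url = build_esearch_url("sra", "Homo sapiens[Organism] AND exome")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client to see how many records match before deciding whether a restricted access application is needed.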
29. Make the (research) world a better place by sharing in return
Best practices: Share in return!
30. Best practices
• If you expect data to be available to you
– you have to make your data available too!
• Encourage collaborations: power by numbers
1. Get credit – publish and make your data available
2. Give credit – cite data sources
3. Understand consent – for all uses of clinical data
31. Best practices: use the tools
• Use all available tools to make your life easier:
• Data publications: visibility and citations for your data, e.g.
GigaScience and Scientific Data
• Figshare, Zenodo, Dryad for sharing open access data
• PhenomeCentral, Matchmaker Exchange for rare disease research
• Repositive for finding data across repositories and making your own
data discoverable
32. What the future holds
• Digital consent: towards automatic processing of applications
• Dynamic consent and power to the patient, e.g.
PatientsKnowBest
• Privacy-preserving access to datasets: preserving control and
governance with data custodian, lower barrier for access
33. Workshop: Finding and accessing human
genomic data for research
Fiona Nielsen – August 22nd 2016
34. We are always looking for data
Genetics,
Cancer,
Rare disease
research
We need
access to the
right data at
the right time
DNA
interpretation
requires
lots of data
35. Data is not easy to find and access
FRAGMENTED
Poor visibility of available
genomic data
ADMIN BURDEN
Huge overhead to manage
data access
BAD CULTURE
Lack of data sharing habits in
research culture
36. We are enabling best practices
MAKE DATA
DISCOVERABLE
SIMPLIFY
WORKFLOWS
CONTRIBUTE TO
COMMUNITY
DNAdigest and Repositive – Connecting the world of genomic data
http://www.tinyurl.com/plos-biology-repositive
39. Team 2 minute presentation
1. Introduction
What data did you try to find and why?
Have you tried to search for this data before?
2. Methods
The 5 main steps you took on Repositive to try and find this data.
3. Results
Did you find the data on Repositive?
What challenges did you encounter?
4. Conclusion
Sum up your experience in 1 sentence.
40. Tell us your thoughts:
@repositiveio
@glyn_dk
And read more on http://repositive.io
Bugs and feedback to: Charlotte at Repositive.io
Because interpretation requires LOTS of data
And although data exists around the world, it is siloed, and even if available, it is not accessible
This is Jenn, a genetic researcher (our target customer), seeking to interpret data from genetic diseases and cancer
She needs data from other patients to compare and interpret Mabel's DNA
She also has data available in her own lab, but she cannot share it because of concerns about how to deal with secure access to sensitive data and data governance, e.g. vetting of users
Examples of researchers looking for genomics data. All have problems, even though in different parts of the world, in different industries and with different research questions.
It has been shown that the combination of summary single-variant statistics from multiple data sets, rather than the joint analysis of a combined data set, does not result in an appreciable loss of information [85], and that taking into account heterogeneity in effect size across studies can improve statistical power.
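One common way to combine per-study summary statistics for a single variant is fixed-effect inverse-variance weighting. The sketch below is illustrative only: the effect sizes and standard errors are made up, the function name `fixed_effect_meta` is hypothetical, and this is not necessarily the method used in the work quoted above. It also does not model heterogeneity in effect size across studies.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Pool per-study effect estimates with inverse-variance weights.

    Returns the pooled effect estimate and its standard error.
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    # Each study is weighted by the inverse of its variance (1 / SE^2).
    weights = [1.0 / (se ** 2) for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se


# Made-up summary statistics for one variant from two studies.
beta, se = fixed_effect_meta([0.10, 0.20], [0.05, 0.05])
print(f"pooled effect = {beta:.3f}, standard error = {se:.4f}")
```

Because only summary statistics are exchanged, this kind of analysis can be run without moving individual-level data between institutions.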
Population-scale genome sequencing projects have been launched all over the world
More than 80 PB of human genomic data is being sequenced every year
BUT
To date, only around 0.5 PB of data is available in public repositories
Further confounded by the data being highly fragmented.
Siloed in repositories and institutions around the world.
There are many public repositories, but it can be hugely confusing to know where to look for the right kind of data
Excellence at your Research Subject is … excellent, but is it ENOUGH ?
To be successful, a candidate will be judged on being complete.
MESSAGE: FOCUS only on IF could expose you to risk
ODP trained, EURO-BASIN manager, – a boring title, for a diverse job, in an exciting research domain.
DIP into EACH step of the research cycle, from proposal formulation to providing the best return-on-investment to the funders.
So I'd like to share with you some experiences from the last few years of OS advocacy in the Marine Science Community
Our mission is to speed up research and diagnostics for genetic diseases by enabling efficient and ethical access to genomic research data
Data is fragmented in unconnected silos – makes it very difficult to discover data
Tracking data and working with data access requests is a time-consuming and bureaucratic exercise
Difficult to build a user community without best practices and tools/platforms where users can share their data experience / findings