This document summarizes a workshop about finding and accessing human genomic datasets. The workshop covered various data sources such as public repositories, case studies on accessing data from the University of Cambridge, and a demonstration of the Repositive platform which aims to simplify accessing genomic data through a single search. Hands-on sessions allowed participants to search for genomic data themes in small groups using Repositive and report their results. Overall the workshop aimed to educate researchers on challenges of accessing genomic data and introduce Repositive as a tool to help address fragmentation and simplify the workflow for discovering and accessing genomic datasets.
1. We are always looking for data
Finding & Accessing
Human Genomic
Datasets
CRUK, 7th November 2016
Tweets welcome
#CamFindData
@repositiveio
2. Outline of the day
- Data sources and data access
- Case study: University of Cambridge
- Coffee break
- Introduction to Repositive
- Hands-on session: searching for data
- Round up and closure
3. Online tools used during the workshop
To ask and answer questions during the presentation:
go to slido.com
enter event code: 7315
5. • 2001: First Human Genome Sequence
• 2005: Personal Genome Project
• 2008: UK10K
• 2013: UK 100K Project
• 2015: 1M Precision Medicine US
• 2016: AstraZeneca – HLI 2M
• Many other national and international projects
Genome Technology Evolution
6. • Consensus among researchers, clinicians,
politicians & the public that genomics will
transform biomedical research, healthcare
and lifestyle choices (Stephan Beck, UCL)
OPPORTUNITY
8. • Required by funders
• Cannot publish unless accession
number given
• Specialised
• ENA
• EGA
• dbGaP
• dbSNP…
• Generalist
• Dryad
• figshare
Public Repositories
9. • Open Access
• E.g. PGP, CC0
• Bermuda Accord
• Managed (Restricted or Controlled Access)
• Data Access Committee
• No effective agreement (policy vacuum)
• Global Alliance for Genomics & Health
• enable compatible, readily accessible, and scalable approaches for
sharing
GOVERNANCE Models
10. Open vs Managed Access
Open Access
75,000,000 per month
Managed Access
150 per month
500,000-fold difference (Stephan Beck, UCL)
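The fold difference quoted on this slide follows directly from the two monthly figures. A minimal Python check, taking both numbers as given (the slide does not say exactly what is being counted per month, so the variable names are illustrative assumptions):

```python
# Monthly figures as quoted on the slide; the names are illustrative assumptions.
open_access_per_month = 75_000_000
managed_access_per_month = 150

# Ratio of open-access to managed-access activity per month.
fold_difference = open_access_per_month / managed_access_per_month
print(f"{fold_difference:,.0f}-fold difference")
```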
11. Large amounts of data, but not accessible
[Chart: growth of sequenced genome data, 2006 to 2015]
80+ PB sequenced
≈ 0.5 PB open access: genome data available in public repos
Exponential growth rate
Under-utilised data has huge potential for medical research
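To put the two volumes on this slide in proportion, here is a small Python sketch. It assumes the 80 PB and 0.5 PB figures are directly comparable, which the slide does not state, so the result is only a rough indication:

```python
# Figures as quoted on the slide, in petabytes.
# Assumption: both describe comparable volumes of human genome data.
sequenced_pb = 80.0
open_access_pb = 0.5

# Share of the sequenced data that is openly available.
open_share = open_access_pb / sequenced_pb
print(f"Open access share of sequenced data: {open_share:.1%}")
```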
12. Access to Managed Data
Benefits:
• Strict governance
• Individuals are protected
• Review of consent
• Applicant signs for full
responsibility for governance
Disadvantages:
• No control of data once access
is given
• High barrier for access – too
high?
13. Often a long process
Bottlenecks:
• Finding relevant and usable
data
• Getting authorisation to
access data
• Formatting data
• Storing and moving data
We studied the problem with
qualitative interviews followed
by a survey of researchers in
human genetics
T. A. van Schaik et al
The need to redefine genomic data sharing: a focus on
data accessibility, Applied & Translational Genomics, 2014
http://tinyurl.com/schaik-dnadigest
14. NIH / eRA Commons login
No
Yes
Organisation registered with eRA
Organisation has DUNS number
No
No
Write research proposal
Yes
+ 2-3 days
+ 1-2 weeks
+ 1 week
Yes
Submit proposal
+ 1-2 days
Access granted
Find/Download/Decrypt data
+ 1-4 weeks
Science…
+ 1-2 days
Pro tip: if you use human
genomic data, apply for the
GRU (General Research Use)
datasets in dbGaP: one
application gives access to
all the GRU datasets.
dbGaP application process
Blog Post:
http://blog.repositive.io/how-to-successfully-apply-for-access-to-dbgap/
15. Sanger eDAM Account
No
Write research proposal
+ 1 hour
Yes
Submit proposal
+ 1-2 days
Access granted
Find/Download/Decrypt data
+ 2-7 days
Science…
+ 1-2 days
EGA application process
Blog Post:
http://blog.repositive.io/how-to-successfully-apply-for-access-to-ega/
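The step durations in the dbGaP and EGA flowcharts can be added up to compare the two processes end to end. The sketch below is a rough estimate under stated assumptions: every step in each flowchart applies (an applicant with existing accounts would skip some), one week is 7 days, and the "+ 1 hour" EGA step counts as 1/24 of a day. The list and function names are made up for illustration.

```python
# Durations in days as (min, max), copied from the two flowcharts.
# Assumptions: every step applies, 1 week = 7 days, 1 hour = 1/24 day.
DBGAP_STEPS = [(2, 3), (7, 14), (7, 7), (1, 2), (7, 28), (1, 2)]
EGA_STEPS = [(1 / 24, 1 / 24), (1, 2), (2, 7), (1, 2)]


def total_range(steps):
    """Return (shortest, longest) total duration in days."""
    shortest = sum(low for low, _ in steps)
    longest = sum(high for _, high in steps)
    return shortest, longest


for name, steps in (("dbGaP", DBGAP_STEPS), ("EGA", EGA_STEPS)):
    low, high = total_range(steps)
    print(f"{name}: {low:.1f} to {high:.1f} days")
```

Under these assumptions the dbGaP route takes roughly 25 to 56 days and the EGA route roughly 4 to 11 days, before any time spent waiting on institutional signatures.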
16. • Finding specific relevant genomic data for research can
take up to six months for an untrained researcher
without dedicated tools
• Application & response time for data access
committees can vary widely depending on
• the type of dataset
• consent regulations of the study
• => there is no consensus on the ‘contracts’ across datasets
FACTS
19. • Validate existing studies
• Avoid unnecessary duplication
• Compare to new studies
• Enhance new datasets
Why datasets are useful
20. Case studies
Raquel, PhD Student, London, UK.
Researching genes associated with rare eye disorders.
Problems:
- Doesn’t know where to look for data.
- Doesn't know if data even exists.
“I gave up on finding the data - it was very time consuming and not
proving fruitful – so I started focusing more on generating my own
data.”
21. Case studies
Mahantesh, Academic Researcher, Taipei, Taiwan.
Studying pharmacogenomics in cardiovascular epidemiology.
Problems:
- Needs lots of data.
- Knows it exists but struggles with getting access to it.
“Often it’s very hard to get the required number of cases and controls
to carry out research in public health and epidemiology.”
22. Case studies
Jana, Company Biocurator, Zurich, Switzerland.
Biocurating microarray and RNA-Seq data.
Problems:
- Needs lots of data.
- Lots of data out there but hard to filter down to ‘useful / relevant’
data.
“Many repositories don’t list the metadata details I need to know if a
dataset is useful to me, I can waste a lot of time searching.”
23. How many data sources?
How many sources of human
genomics data do you know
about?
24. [Charts: data sources by European country (GB FI NL FR DE CH EE BE DK ES SI IE SE) and by US state (CA MD MA WA NY TX AZ DC NJ NC PA UT TN CO IN FL LA VA IL ME OH MO MI SC OR)]
Data sources across the globe
Geolocation of the 278
data sources analysed,
found by tracing the IP address
of each source.
These include:
Public Repositories
Universities
Companies
BioBanks
Research consortiums
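A per-country tally like the one on this slide could be produced along the following lines. This is only a sketch, not the method used for the workshop analysis: the function name, the example URLs, and the lookup are all hypothetical, and a real `country_of_host` would need DNS resolution plus an IP geolocation database.

```python
from collections import Counter
from urllib.parse import urlparse


def count_sources_by_country(source_urls, country_of_host):
    """Tally data sources per country code.

    country_of_host is a caller-supplied function (hypothetical here) that
    maps a hostname to a two-letter country code.
    """
    counts = Counter()
    for url in source_urls:
        host = urlparse(url).hostname
        if host is None:
            continue  # skip entries without a parseable hostname
        counts[country_of_host(host)] += 1
    return counts


# Toy stand-in for a real IP geolocation lookup; the mapping is invented.
fake_lookup = {
    "data.example.org": "GB",
    "genomes.example.com": "GB",
    "bank.example.net": "FI",
}.get

urls = [
    "https://data.example.org/a",
    "https://genomes.example.com/",
    "http://bank.example.net/x",
]
print(count_sources_by_country(urls, fake_lookup).most_common())
```

Passing the lookup in as a function keeps the tally logic testable without network access.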
29. • Postdoctoral researcher at University of Cambridge
Medical School
• Working on genetic inheritance and cancer
• Using NGS data and bioinformatics
• After searching for data online she decided to apply for:
• 2 dbGaP datasets
• 3 EGA datasets
Cambridge specific Case Study
Blog Post:
Pending… will be on http://blog.repositive.io/
30. The Research Operations Office - will help you with the
contracts (Data Transfer Agreements - DTAs) and signatures.
• Has a designated individual who processes all dbGaP
applications as they all abide by NIH legal restrictions and
regulations about how to handle the data once granted
access
• For EGA applications, each DTA must be processed
separately because there is no consensus on the ‘contracts’
across datasets.
Cambridge specific Case Study
Blog Post:
Pending… will be on http://blog.repositive.io/
31. The nominated IT director - will be specific to your
department.
• They will need to confirm you can support the requirements of
the DTA.
• If the head of your departmental IT is not happy to sign – the
head of IT for the University will be able to sign it off.
Cambridge specific Case Study
Blog Post:
Pending… will be on http://blog.repositive.io/
32. Top Tips:
• Think about your storage space!
• Think about what sort of analysis and processing
you are going to do with the data once you have
it. Even after such a long process, approval can
arrive before you are ready to use the data.
• Understand what you need before you start the
application process!
• You may have access for a limited period
Cambridge specific Case Study
35. 1-click access to human genomic data:
making it as easy to find data as it is to find a book
on Amazon or book a hotel on Expedia!
36. Simpler workflow
for data access
Our expertise is data search platforms
Discover and
access
Search, see
related results
Find colleagues &
their data interests
Co-annotate data &
community feedback
37. We are enabling best practices
MAKE DATA
DISCOVERABLE
SIMPLIFY
WORKFLOWS
CONTRIBUTE TO
COMMUNITY
DNAdigest and Repositive – Connecting the world of genomic data
http://www.tinyurl.com/plos-biology-repositive
40. 1. Form groups of 2-3 people
2. Select a leader & a spokesperson
3. Choose 1 data theme you are interested in
1. E.g. colon cancer, prostate cancer, breast cancer
4. Sign up at https://discover.repositive.io/
5. Search Repositive for your selected theme
Hands on
41. Team presentation: 2 minutes
1. Introduction
What data did you try to find and why?
Have you tried to search for this data before?
2. Methods
The 5 main steps you took on Repositive to try and find this data.
3. Results
Did you find the data on Repositive?
What challenges did you encounter?
4. Conclusion
Sum up your experience in 1 sentence.
1 2 3 4 5
42. Feedback on the workshop
Bugs and feedback to: Charlotte at Repositive.io
Please leave your feedback on the workshop:
http://tinyurl.com/feedback280916
Because interpretation requires LOTS of data
And although data exists around the world, it is siloed, and even if available, it is not accessible
This is Jenn, a genetic researcher (our target customer), seeking to interpret data from genetic diseases and cancer
She needs data from other patients to compare and interpret Mabel's DNA
She also has data available in her own lab, but she cannot share it because of concerns about secure access to sensitive data and about data governance, e.g. vetting of users
Population scale genome sequencing projects have been launched all over the world
More than 80 PB of human genomic data is being sequenced every year
BUT
To date, only around 0.5 PB of data is available in public repositories
Examples of researchers looking for genomics data. All have problems, even though in different parts of the world, in different industries and with different research questions.
Further confounded by the data being highly fragmented.
Siloed in repositories and institutions around the world.
Our mission is to speed up research and diagnostics for genetic diseases by enabling efficient and ethical access to genomic research data
Our vision is to make genomic data access as easy as finding a book on Amazon or booking a hotel on Expedia
KEY POINTS:
Repositive builds tools for genomics data search & access.
We’re really good at it. We have the expertise in-house. It’s what we do.
Aside from building a highly functional tool, we’ve taken the time to prioritise User Experience, streamlining of user workflows & presentation.
Within a month of our formal platform launch we have over 600 registered users.
The Repositive platform is an online community and marketplace connecting data consumers with data providers.
On Repositive, Jenn has:
Easy, interactive search
A faster data access workflow
Easy access to new data collaborators
Feedback on data from the community and colleagues, to help assess data quality and utility
The Repositive platform and technology will remove barriers to data sharing and will incentivise users to explore, contribute and collaborate in alignment with best practices