
Finding and Accessing Human Genomics Datasets



A workshop given at Cancer Research UK, University of Cambridge on November 11th, 2016.

Published in: Science


  1. 1. We are always looking for data Finding & Accessing Human Genomic Datasets CRUK, 7th November 2016 Tweets welcome #CamFindData @repositiveio
  2. 2. Outline of the day - Data sources and data access - Case study: University of Cambridge - Coffee break - Introduction to Repositive - Hands-on session: searching for data - Round up and closure
  3. 3. On-line tools used during the workshop To ask questions during the presentation and answer questions: go to enter event code: 7315
  4. 4. We are always looking for data Finding & Accessing Human Genomic Datasets CRUK, 7th November 2016 Tweets welcome #CamFindData @repositiveio
  5. 5. • 2001: First Human Genome Sequence • 2005: Personal Genome Project • 2008: UK10K • 2013: UK 100K Project • 2015: 1M Precision Medicine US • 2016: AstraZeneca – HLI 2M • Many other national and international projects Genome Technology Evolution
  6. 6. • Consensus among researchers, clinicians, politicians & the public that genomics will transform biomedical research, healthcare and lifestyle choices (Stephan Beck, UCL) OPPORTUNITY
  7. 7. Data should be made available
  8. 8. • Required by funders • Cannot publish unless accession number given • Specialised: ENA, EGA, dbGaP, dbSNP… • Generalist: Dryad, figshare Public Repositories
  9. 9. • Open Access: e.g. PGP, CC0; Bermuda Accord • Managed (Restricted or Controlled Access): Data Access Committee • No effective agreement (policy vacuum) • Global Alliance for Genomics & Health: enable compatible, readily accessible, and scalable approaches for sharing GOVERNANCE Models
  10. 10. Open vs Managed Access. Open Access: 75,000,000 per month. Managed Access: 150 per month. 500,000-fold difference (Stephan Beck, UCL)
  11. 11. Large amounts of data, but not accessible. ≈0.5 PB Open Access; 80+ PB sequenced genome data available in public repos. Exponential growth rate. Under-utilised data has huge potential for medical research. [Figure: growth chart, x-axis 2006 to 2015.]
  12. 12. Access to Managed Data Benefits: • Strict governance • Individuals are protected • Review of consent • Applicant signs for full responsibility for governance Disadvantages: • No control of data once access is given • High barrier for access – too high?
  13. 13. Often a long process. Bottlenecks: • Finding relevant and usable data • Getting authorisation to access data • Formatting data • Storing and moving data. We studied the problem with qualitative interviews followed by a survey of researchers in human genetics: T. A. van Schaik et al., “The need to redefine genomic data sharing: a focus on data accessibility”, Applied & Translational Genomics, 2014.
  14. 14. dbGaP application process. [Flowchart] Decision points (Yes/No): NIH / eRA Commons login; Organisation registered with eRA; Organisation has DUNS number. Steps: Write research proposal; Submit proposal; Access granted; Find/Download/Decrypt data; Science… Delays shown, in slide order: + 2-3 days, + 1-2 weeks, + 1 week, + 1-2 days, + 1-4 weeks, + 1-2 days. PRO Tip: If you use human genomic data, apply for the GRU datasets in dbGaP: one application gives access to all the GRU datasets. Blog Post:
  15. 15. EGA application process. [Flowchart] Decision point (Yes/No): Sanger eDAM Account. Steps: Write research proposal; Submit proposal; Access granted; Find/Download/Decrypt data; Science… Delays shown, in slide order: + 1 hour, + 1-2 days, + 2-7 days, + 1-2 days. Blog Post:
  16. 16. • Finding specific relevant genomic data for research can take up to six months for an untrained researcher without dedicated tools • Application & response time for data access committees can vary widely depending on • the type of dataset • consent regulations of the study • => there is no consensus for the ‘contracts’ between each dataset FACTS
  17. 17. Researchers often choose to not access data at all
  18. 18. WHY should we bother?
  19. 19. • Validate existing studies • Avoid unnecessary duplication • Compare to new studies • Enhance new datasets Why datasets are useful
  20. 20. Case studies Raquel, PhD Student, London, UK. Researching genes associated with rare eye disorders. Problems: - Doesn’t know where to look for data. - Doesn't know if data even exists. “I gave up on finding the data - it was very time consuming and not proving fruitful – so I started focusing more on generating my own data.”
  21. 21. Case studies Mahantesh, Academic Researcher, Taipei, Taiwan. Studying pharmacogenomics in cardiovascular epidemiology. Problems: - Needs lots of data. - Knows it exists but struggles with getting access to it. “Often it’s very hard to get the required number of cases and controls to carry out research in public health and epidemiology.”
  22. 22. Case studies Jana, Company Biocurator, Zurich, Switzerland. Biocurating microarray and RNA-Seq data. Problems: - Needs lots of data. - Lots of data out there but hard to filter down to ‘useful / relevant’ data. “Many repositories don’t list the metadata details I need to know if a dataset is useful to me, I can waste a lot of time searching.”
  23. 23. How many data sources? How many sources of human genomics data do you know about?
  24. 24. Data sources across the globe. Geolocation of 278 data sources analysed, found by tracking the IP address of each source. These include: • Public Repositories • Universities • Companies • BioBanks • Research consortiums [Figure: bar charts of data sources by country (GB, FI, NL, FR, DE, CH, EE, BE, DK, ES, SI, IE, SE) and by US state (CA, MD, MA, WA, NY, TX, AZ, DC, NJ, NC, PA, UT, TN, CO, IN, FL, LA, VA, IL, ME, OH, MO, MI, SC, OR).]
  25. 25. Data source content. [Figure panels: Assay Types; Dedicated to…]
  26. 26. DATA is fragmented
  27. 27. Hundreds of data sources …but they aren’t easy to find! 30 data sources listed here: [Figure: bar chart, y-axis 0 to 300; values 10 (Jan-15), 25 (Mar-15), 33 (Jun-15), 35 (Sep-15), 102 (Dec-15), 174 (Mar-16), 239 (Jun-16).]
  28. 28. Cambridge specific Case Study
  29. 29. • Postdoctoral researcher at University of Cambridge Medical School • Working on genetic inheritance and cancer • Using NGS data and bioinformatics • After searching for data online she decided to apply for: 2 dbGaP datasets and 3 EGA datasets Cambridge specific Case Study Blog Post: Pending… will be on
  30. 30. The Research Operations Office - will help you with the contracts (Data Transfer Agreements - DTAs) and signatures. • Has a designated individual who processes all dbGaP applications as they all abide by NIH legal restrictions and regulations about how to handle the data once granted access • For EGA applications, each DTA must be processed separately because there is no consensus for the ‘contracts’ between each dataset. Cambridge specific Case Study Blog Post: Pending… will be on
  31. 31. The nominated IT director - will be specific to your department. • They will need to confirm you can support the requirements of the DTA. • If the head of your departmental IT is not happy to sign – the head of IT for the University will be able to sign it off. Cambridge specific Case Study Blog Post: Pending… will be on
  32. 32. Top Tips: • Think about your storage space! • Think about what sort of analysis and processing you are going to do with the data once you have it. After such a long process, approval may still arrive before you are ready. • Understand what you need before you start the application process! • You may have access for a limited period Cambridge specific Case Study
  33. 33. COFFEE BREAK Back in 10’
  34. 34. @repositiveio
  35. 35. 1-click access to human genomic data: making finding data as easy as finding a book on Amazon or booking a hotel on Expedia!
  36. 36. Simpler workflow for data access. Our expertise is data search platforms. • Discover and access • Search, see related results • Find colleagues & their data interests • Co-annotate data & community feedback
  37. 37. We are enabling best practices: • MAKE DATA DISCOVERABLE • SIMPLIFY WORKFLOWS • CONTRIBUTE TO COMMUNITY. DNAdigest and Repositive – Connecting the world of genomic data
  38. 38. Connecting the world of genomic data
  39. 39. 1. Form groups of 2-3 people 2. Select a leader & a spokesperson 3. Choose 1 data theme you are interested in (e.g. colon cancer, prostate cancer, breast cancer) 4. Sign up at 5. Search Repositive with your selected theme Hands on
  40. 40. Team presentation: 2 minutes 1. Introduction • What data did you try to find and why? • Have you tried to search for this data before? 2. Methods • The 5 main steps you took on Repositive to try and find this data. 3. Results • Did you find the data on Repositive? • What challenges did you encounter? 4. Conclusion • Sum up your experience in 1 sentence.
  41. 41. Feedback on the workshop Bugs and feedback to: Charlotte at Please leave your feedback on the workshop:
  42. 42. @repositive
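
The figures quoted on slides 10, 14 and 15 can be sanity-checked with a few lines of Python. This is only a rough sketch: the function and constant names are made up for illustration, the numbers are copied from the slides, and the totals assume that every delay shown on a flowchart is incurred exactly once (the worst-case path, where no login, registration or account exists yet). A real application may skip some of these steps.

```python
# Back-of-envelope check of figures quoted on slides 10, 14 and 15.
# Function and constant names are illustrative, not from the workshop.

HOURS_PER_DAY = 24
HOURS_PER_WEEK = 7 * HOURS_PER_DAY


def days(n):
    """Convert a number of days to hours."""
    return n * HOURS_PER_DAY


def weeks(n):
    """Convert a number of weeks to hours."""
    return n * HOURS_PER_WEEK


# Slide 10: monthly figures quoted for open vs managed access.
OPEN_ACCESS_PER_MONTH = 75_000_000
MANAGED_ACCESS_PER_MONTH = 150

# Slides 14 and 15: (shortest, longest) delay in hours, in slide order.
# Assumption: every delay applies once; the flowcharts allow shortcuts.
DBGAP_DELAYS = [
    (days(2), days(3)),
    (weeks(1), weeks(2)),
    (weeks(1), weeks(1)),
    (days(1), days(2)),
    (weeks(1), weeks(4)),
    (days(1), days(2)),
]
EGA_DELAYS = [
    (1, 1),  # "+ 1 hour"
    (days(1), days(2)),
    (days(2), days(7)),
    (days(1), days(2)),
]


def fold_difference(larger, smaller):
    """Return how many times larger the first figure is than the second."""
    return larger / smaller


def total_delay_hours(delays):
    """Sum the shortest and longest bounds of a list of delay ranges."""
    shortest = sum(low for low, _ in delays)
    longest = sum(high for _, high in delays)
    return shortest, longest


if __name__ == "__main__":
    fold = fold_difference(OPEN_ACCESS_PER_MONTH, MANAGED_ACCESS_PER_MONTH)
    print(f"Open vs managed access: {fold:,.0f}-fold difference")
    for name, delays in (("dbGaP", DBGAP_DELAYS), ("EGA", EGA_DELAYS)):
        shortest, longest = total_delay_hours(delays)
        print(
            f"{name}: {shortest / HOURS_PER_DAY:.1f} to "
            f"{longest / HOURS_PER_DAY:.1f} days"
        )
```

Under that worst-case assumption the dbGaP delays add up to 25 to 56 days and the EGA delays to roughly 4 to 11 days, and the slide 10 ratio comes out at the quoted 500,000-fold.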