Slides for the talk I gave at PyCon Australia trying to simplify biology and genomics into something easily accessible for software developers and CompSci graduates.
I cover
1. What biological data looks like today
2. How the revolution in genomics sequencing technology is IN a hospital near you
3. How this is affecting patient treatment today
4. What are some of the major challenges in using this data in the clinic?
and ...
5. (1 slide about) how my research fits into the paradigm of understanding human genetic variation.
Big data biology for pythonistas: getting in on the genomics revolution
1. BIG DATA BIOLOGY FOR PYTHONISTAS:
GETTING IN ON THE GENOMICS REVOLUTION
DARYA VANICHKINA
2. STRUCTURE OF MY TALK
▸ Whoami, and why now?
▸ The meaning biology of life
▸ The data
▸ The reality (case studies)
▸ Other areas that need development talent
4. WHY SHOULD *YOU* CARE? - IF YOU’RE A HUMAN BEING IN THE XXI CENTURY
5. BIOLOGY 101: A VERY SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE/HUMAN
THE CENTRAL DOGMA
5’ - ATG TCT TAC AAG TGC GTG - 3’
3’ - TAC AGA ATG TTC ACG CAC - 5’
GENETIC CODE
NUCLEUS
DNA
DOUBLE HELIX.
6. BIOLOGY 101: A VERY SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE/HUMAN
THE CENTRAL DOGMA
5’ - ATG TCT TAC AAG TGC GTG - 3’
3’ - TAC AGA ATG TTC ACG CAC - 5’
5’ - AUG UCU UAC AAG UGC GUG - 3’
5’ - AUG UCU UAC AAG UGC GUG - 3’
H2N - MET SER TYR LYS CYS VAL - COOH
GENETIC CODE
NUCLEUS
CYTOPLASM
DNA
RNA
PROTEIN
TRANSCRIPTION
TRANSLATION
DOUBLE HELIX. ATGC.
~6 BILLION/HUMAN CELL.
[37.2 TRILLION CELLS/BODY]
PACKAGED IN 23 PAIRS OF
CHROMOSOMES
20K CODING
GENES
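The central dogma on this slide can be written as plain string operations. The following is a minimal sketch, assuming the standard genetic code and using the slide's example sequence; the function names are illustrative and do not come from any particular library.

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def complement(dna):
    """Base-pairing partner of each letter, in the same left-to-right order."""
    return dna.translate(str.maketrans("ATGC", "TACG"))


def transcribe(coding_strand):
    """DNA coding strand -> mRNA (same letters, with U in place of T)."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """mRNA -> one-letter protein string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":  # "*" marks a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGTCTTACAAGTGCGTG"   # top strand from the slide, spaces removed
print(complement(dna))        # TACAGAATGTTCACGCAC (the bottom strand)
mrna = transcribe(dna)
print(mrna)                   # AUGUCUUACAAGUGCGUG
print(translate(mrna))        # MSYKCV = Met Ser Tyr Lys Cys Val
```

The one-letter protein string matches the three-letter MET SER TYR LYS CYS VAL shown on the slide.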
7. BIOLOGY 201: A SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE
[A BIT] BEYOND THE CENTRAL DOGMA
5’ - ATG TCT TAmC AAG TGC GTG - 3’
3’ - TAC AGA ATG TTC ACG CAC - 5’
5’ - AUG UCU UAC AAG UGC GUG - 3’
5’ - AUG UCU UIC AAG UGC GUG - 3’
H2N - MET SER pTYR LYS CYS VAL - COOH
NUCLEUS
CYTOPLASM
DNA
RNA
PROTEIN
TRANSCRIPTION
TRANSLATION
5’ - AUGUCUUUCTTAUGCGUG - 3’
NCRNA
H2N - MET SER CYS LYS CYS VAL - COOH
8. WHAT THE DATA LOOKS LIKE
CODIFYING THE CENTRAL DOGMA
5’ - ATG TCT TAC AAG TGC GTG - 3’
3’ - TAC AGA ATG TTC ACG CAC - 5’
5’ - AUG UCU UAC AAG UGC GUG - 3’
5’ - AUG UCU UAC AAG UGC GUG - 3’
H2N - MET SER TYR LYS CYS VAL - COOH
GENETIC CODE
CYTOPLASM
DNA
[GENOME/
EXOME]
RNA
[TRANSCRIPTOME]
PROTEIN
TRANSCRIPTION
TRANSLATION
ATGC STRING!
AUGC STRING!
21 LETTER STRING!
9. WHAT DO YOU DO WITH THE DATA?
▸ Try to explain/understand diseases
(especially rare/Mendelian ones)
▸ Identify family relationships
▸ Identify ethnic origin
▸ Carrier status
▸ Targeted drug prescription, and rational
prediction of side effects
▸ Identify patients at risk of diseases, and
“catch” them earlier
THE THEORY
10. EUROPEAN EXAMPLE EXTRA INFO
▸ Taken from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735096/figure/F1/
▸ a, A statistical summary of genetic data from 1,387 Europeans based on
principal component axis one (PC1) and axis two (PC2). Small coloured
labels represent individuals and large coloured points represent median
PC1 and PC2 values for each country. The inset map provides a key to the
labels. The PC axes are rotated to emphasize the similarity to the
geographic map of Europe. AL, Albania; AT, Austria; BA, Bosnia-
Herzegovina; BE, Belgium; BG, Bulgaria; CH, Switzerland; CY, Cyprus; CZ,
Czech Republic; DE, Germany; DK, Denmark; ES, Spain; FI, Finland; FR,
France; GB, United Kingdom; GR, Greece; HR, Croatia; HU, Hungary; IE,
Ireland; IT, Italy; KS, Kosovo; LV, Latvia; MK, Macedonia; NO, Norway; NL,
Netherlands; PL, Poland; PT, Portugal; RO, Romania; RS, Serbia and
Montenegro; RU, Russia; Sct, Scotland; SE, Sweden; SI, Slovenia; SK,
Slovakia; TR, Turkey; UA, Ukraine; YG, Yugoslavia. b, A magnification of
the area around Switzerland from panel a, showing differentiation within
Switzerland by language. c, Genetic similarity versus geographic distance.
Median genetic correlation between pairs of individuals as a function of
geographic distance between their respective populations.
11. DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
@ERR030890.1 HWI-BRUNOP16X_0001:3:2:1148:1061#0/1
NNCAATGCTACTCTCAACAAGTTCACAGAGGAACTTAAGAAGTATGGAGTGACGNNTTTGGNTCGNGTTTGTGAT
+
##++**++++FFFFF5::88:=???FFFFFFFFFFFFFFFFF=F<8?############################
“Read”, 10 - 100+ million of these per dataset. Can be paired.
https://en.wikipedia.org/wiki/FASTQ_format
+OR +
DNA
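A FASTQ read is four lines: name, bases, a "+" separator, and one quality character per base. The following is a minimal sketch of parsing the read shown on this slide. It assumes Phred+33 quality encoding, which is a guess based on the "#" characters (ASCII 35) that could not occur under the older offset-64 encoding. The function name is illustrative.

```python
def parse_fastq_record(lines, phred_offset=33):
    """Turn the four lines of one FASTQ record into a dict.

    phred_offset=33 is an assumption about how this file encodes quality.
    """
    name_line, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not name_line.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    return {
        "name": name_line[1:],
        "seq": seq,
        # Each quality character encodes one integer Phred score.
        "quals": [ord(ch) - phred_offset for ch in qual],
    }


record = parse_fastq_record([
    "@ERR030890.1 HWI-BRUNOP16X_0001:3:2:1148:1061#0/1",
    "NNCAATGCTACTCTCAACAAGTTCACAGAGGAACTTAAGAAGTATGGAGTGACGNNTTTGGNTCGNGTTTGTGAT",
    "+",
    "##++**++++FFFFF5::88:=???FFFFFFFFFFFFFFFFF=F<8?############################",
])

uncalled = record["seq"].count("N")              # bases the sequencer could not call
low_quality = sum(q < 20 for q in record["quals"])
print(record["name"])
print(len(record["seq"]), "bases,", uncalled, "uncalled (N),",
      low_quality, "with Phred score below 20")
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so the run of "#" at the end of this read marks bases the sequencer had very little confidence in.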
12. DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
ERR030890.15421060272 chr1 564478 3 75M * 0 0
GTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTANN
1576:<F<FF=::??=5?DDFFFFF<FFF<?=?=;>>??=?=???66?=;FFFFFFFFFF=???6&)(*++**## AS:i:-2
XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:73T0T0 YT:Z:UU XS:A:- NH:i:2 CC:Z:chrM CP:i:3929
HI:i:0
Alignment programs (run independently) - bwa, bowtie2
Output: SAM file (sequence alignment/map)
# Example for 1 read:
https://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/Alignment
http://genome.sph.umich.edu/wiki/SAM
Official (obtuse) documentation https://samtools.github.io/hts-specs/SAMv1.pdf
Reference == genome
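Each SAM alignment line has eleven mandatory tab-separated columns followed by optional TAG:TYPE:VALUE fields. The following is a minimal sketch of reading one line. The record used here is a shortened, hypothetical one loosely modelled on the slide's example, not the slide's read itself, and only a handful of FLAG bits are decoded.

```python
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}
# A few of the FLAG bits defined in the SAM specification.
FLAG_BITS = {0x1: "paired", 0x4: "unmapped",
             0x10: "reverse strand", 0x100: "secondary alignment"}


def parse_sam_line(line):
    """Split one SAM alignment line into mandatory fields and optional tags."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    tags = {}
    for item in columns[11:]:
        tag, kind, value = item.split(":", 2)
        tags[tag] = int(value) if kind == "i" else value
    record["tags"] = tags
    record["flag_meaning"] = [name for bit, name in FLAG_BITS.items()
                              if record["flag"] & bit]
    return record


# Hypothetical 8-base read; the values are for illustration only.
line = "\t".join(["read1", "272", "chr1", "564478", "3", "8M", "*", "0", "0",
                  "GTCTCAGG", "1576:<F<", "NM:i:2", "NH:i:2", "CC:Z:chrM"])
alignment = parse_sam_line(line)
print(alignment["rname"], alignment["pos"], alignment["cigar"])
print(alignment["flag_meaning"])   # ['reverse strand', 'secondary alignment']
print(alignment["tags"]["NH"])     # 2
```

For real files it is usually easier to rely on an existing library rather than hand-rolled parsing; this sketch only shows what the columns mean.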
13. DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG   chromosome 1
 CTGATGTGCCGCCTCACTTCGGTGGT         read1
  TGATGTGCCGCCTCACTACGGTGGTG        read2
   GATGTGCCGCCTCACTTCGGTGGTGA       read3
GCTGATGTGCCGCCTCACTACGGTG           read4
GCTGATGTGCCGCCTCACTACGGTG           read5
For visualising SAM - use http://software.broadinstitute.org/software/igv/
CACCTCACCACCGAAGTGAGGCGGCACATCAGC   chromosome 1
  CCTCACCA------GTGAGGCGGCACATCA    read1
    TCACCA------GTGAGGCGGCACATCAGC  read2
CACCTCACCA------GTGAGGCGGCACA       read3
   CTCACCA------GTGAGGCGGCACAGC     read4
 ACCTCACCA------GTGAGGCGGCAC        read5
Mismatch Deletion [Insertion]
14. DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG   chromosome 1
 CTGATGTGCCGCCTCACTTCGGTGGT         read1
  TGATGTGCCGCCTCACTACGGTGGTG        read2
   GATGTGCCGCCTCACTTCGGTGGTGA       read3
GCTGATGTGCCGCCTCACTACGGTG           read4
GCTGATGTGCCGCCTCACTACGGTG           read5
CACCTCACCACCGAAGTGAGGCGGCACATCAGC   chromosome 1
  CCTCACCA------GTGAGGCGGCACATCA    read1
    TCACCA------GTGAGGCGGCACATCAGC  read2
CACCTCACCA------GTGAGGCGGCACA       read3
   CTCACCA------GTGAGGCGGCACAGC     read4
 ACCTCACCA------GTGAGGCGGCAC        read5
Mismatch (SNV) Deletion [Insertion]
Find difference to reference
https://usegalaxy.org/
3 - 5 million variants vs reference
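"Find difference to reference" can be illustrated with a deliberately naive pileup over the five mismatch reads from the slide. This is a sketch only: the read offsets are assumptions inferred by matching each read against the reference string, the thresholds are arbitrary, and real callers (GATK, FreeBayes and friends) model base quality, mapping quality and much more.

```python
from collections import Counter

reference = "GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG"
# (0-based start on the reference, read sequence); offsets are inferred.
reads = [
    (1, "CTGATGTGCCGCCTCACTTCGGTGGT"),
    (2, "TGATGTGCCGCCTCACTACGGTGGTG"),
    (3, "GATGTGCCGCCTCACTTCGGTGGTGA"),
    (0, "GCTGATGTGCCGCCTCACTACGGTG"),
    (0, "GCTGATGTGCCGCCTCACTACGGTG"),
]


def pileup(reads, length):
    """Count which bases the reads show at every reference position."""
    columns = [Counter() for _ in range(length)]
    for offset, seq in reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    return columns


def call_snvs(reference, reads, min_depth=3, min_fraction=0.2):
    """Report (1-based position, ref, alt, alt count, depth) for mismatches."""
    calls = []
    for pos, counts in enumerate(pileup(reads, len(reference))):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        for base, n in counts.items():
            if base != reference[pos] and n / depth >= min_fraction:
                calls.append((pos + 1, reference[pos], base, n, depth))
    return calls


print(call_snvs(reference, reads))   # [(20, 'T', 'A', 3, 5)]
```

Three of five reads carry A where the reference has T, which is the pattern you would expect from a heterozygous single nucleotide variant.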
15. BIOLOGY 101: A VERY SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE/HUMAN
CHROMOSOMAL MODE OF INHERITANCE
About 60 new mutations per generation: a 20-year-old father transmits ~25 mutations to his child, a 40-year-old father
around 65 (Kong et al., Nature 2012, DOI:10.1038/nature11396; Francioli et al., Nature Genetics 2015, DOI:10.1038/ng.3292)
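The two paternal figures quoted here imply roughly two extra mutations per year of the father's age. The following is back-of-the-envelope arithmetic only, assuming a straight line between the two quoted points; it is not a model taken from the cited papers.

```python
def paternal_mutations(age, age_a=20, count_a=25, age_b=40, count_b=65):
    """Linear interpolation between the two (age, mutation count) points."""
    per_year = (count_b - count_a) / (age_b - age_a)   # (65 - 25) / 20 = 2.0
    return count_a + per_year * (age - age_a)


print(paternal_mutations(30))   # 45.0
```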
16. DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG   chromosome 1
 CTGATGTGCCGCCTCACTTCGGTGGT         read1
  TGATGTGCCGCCTCACTACGGTGGTG        read2
   GATGTGCCGCCTCACTTCGGTGGTGA       read3
GCTGATGTGCCGCCTCACTACGGTG           read4
GCTGATGTGCCGCCTCACTACGGTG           read5
CACCTCACCACCGAAGTGAGGCGGCACATCAGC   chromosome 1
  CCTCACCA------GTGAGGCGGCACATCA    read1
    TCACCA------GTGAGGCGGCACATCAGC  read2
CACCTCACCA------GTGAGGCGGCACA       read3
   CTCACCA------GTGAGGCGGCACAGC     read4
 ACCTCACCA------GTGAGGCGGCAC        read5
Mismatch (SNV) Deletion [Insertion]
Homozygous/heterozygous
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
VCF file
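The homozygous/heterozygous distinction can be read straight out of the GT field of the VCF record above. The following is a minimal sketch; it splits on whitespace because that is how the record appears on the slide (real VCF files are tab-separated), and the function names are illustrative.

```python
def parse_vcf_record(header_line, record_line):
    """Return the fixed columns and a per-sample dict for one VCF record."""
    header = header_line.lstrip("#").split()
    fields = record_line.split()
    fixed = dict(zip(header[:9], fields[:9]))
    format_keys = fixed["FORMAT"].split(":")
    samples = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(header[9:], fields[9:])
    }
    return fixed, samples


def zygosity(genotype):
    """Classify a diploid GT string such as 0|0, 1|0 or 1/1."""
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


header = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003"
row = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ "
       "0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.")

fixed, samples = parse_vcf_record(header, row)
print(fixed["CHROM"], fixed["POS"], fixed["REF"], ">", fixed["ALT"])
for name, call in samples.items():
    print(name, call["GT"], zygosity(call["GT"]))
```

In GT, 0 means the reference allele and 1 the first alternate allele; "|" marks a phased genotype and "/" an unphased one.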
17. DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG   chromosome 1
 CTGATGTGCCGCCTCACTTCGGTGGT         read1
  TGATGTGCCGCCTCACTACGGTGGTG        read2
   GATGTGCCGCCTCACTTCGGTGGTGA       read3
GCTGATGTGCCGCCTCACTACGGTG           read4
GCTGATGTGCCGCCTCACTACGGTG           read5
CACCTCACCACCGAAGTGAGGCGGCACATCAGC   chromosome 1
  CCTCACCA------GTGAGGCGGCACATCA    read1
    TCACCA------GTGAGGCGGCACATCAGC  read2
CACCTCACCA------GTGAGGCGGCACA       read3
   CTCACCA------GTGAGGCGGCACAGC     read4
 ACCTCACCA------GTGAGGCGGCAC        read5
Mismatch (SNV) Deletion [Insertion]
Homozygous/heterozygous
Good tutorial on this (VLSCI)
https://docs.google.com/document/d/1lfDYNzHjfDA1pHTHd-0w3xHhg7L4TipT1gRfzgiV8es/pub
http://vlsci.github.io/lscc_docs/tutorials/variant_calling_galaxy_1/variant_calling_galaxy_1/
http://vlsci.github.io/lscc_docs/tutorials/var_detect_advanced/var_detect_advanced/
samtools pileup, GATK, FreeBayes -> Variant Call Format (VCF)
18. DATA ANALYSIS
PIPELINE FOR PROCESSING GENOMIC DATA
SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET
What do the differences actually mean?
What we currently do:
1. See if any of the observed variants match disease-associated mutations we’ve seen before
(databases like OMIM, dbSNP, ClinVar, SNPedia)
2. Predict whether mutation would “break” protein by introducing a “STOP” earlier in the
sequence, or shift the frame, or change a critical amino acid
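Point 2 can be sketched for the simplest case, a single-base substitution inside a coding sequence. This is a minimal illustration, assuming the standard genetic code and the example sequence from the earlier slides; real annotators also handle frameshifts, splicing and multiple transcripts, none of which are covered here.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snv(coding_dna, index, alt_base):
    """Classify a substitution at a 0-based index of an in-frame coding sequence."""
    codon_start = index - index % 3
    ref_codon = coding_dna[codon_start:codon_start + 3]
    offset = index - codon_start
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "synonymous"   # protein unchanged
    elif alt_aa == "*":
        effect = "nonsense"     # premature STOP
    elif ref_aa == "*":
        effect = "stop lost"
    else:
        effect = "missense"     # one amino acid changed
    return ref_codon, alt_codon, ref_aa, alt_aa, effect


coding = "ATGTCTTACAAGTGCGTG"         # Met Ser Tyr Lys Cys Val
print(classify_snv(coding, 8, "A"))   # ('TAC', 'TAA', 'Y', '*', 'nonsense')
print(classify_snv(coding, 8, "T"))   # ('TAC', 'TAT', 'Y', 'Y', 'synonymous')
print(classify_snv(coding, 7, "G"))   # ('TAC', 'TGC', 'Y', 'C', 'missense')
```

Whether a missense change hits a "critical" amino acid cannot be decided from the codon table alone; that part still relies on databases and prediction tools.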
19. BIOLOGY 201: BUT …
BUT THERE ARE MANY CHALLENGES THAT NEED TO BE ADDRESSED
22. CASE STUDIES
UK: GENOMICS ENGLAND. 100 000 GENOMES FOR THE NHS
JESSICA WRIGHT
▸ Epilepsy, movement disorders, developmental delay
▸ Standard testing: MRI, lumbar puncture, EEGs and other
testing (including invasive tests) did not pinpoint a cause
▸ Genomic sequencing identified a de novo mutation in
Glut1, which codes for a protein responsible for
transporting glucose from the blood into the brain
▸ => Ketogenic diet (low carbohydrate, high fat diet)
23. CASE STUDIES
23&ME DIRECT TO CONSUMER GENETICS
▸ 23andMe
▸ Illumina HumanOmniExpress-24 array
▸ opt-in research
▸ 36 FDA approved tests + ancestry vs original kit: 254 diseases/conditions
▸ Manuel Corpas - sample data of himself and his family (23andMe, exome
sequencing)
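Array-based raw data of this kind is typically a plain-text table that is easy to explore. The following is a minimal sketch, assuming a tab-separated layout of rsid, chromosome, position and genotype with "#" comment lines and "--" for no-calls. That layout is an assumption to check against your own download, and the rows below are invented placeholders, not real genotypes.

```python
import io
from collections import Counter


def parse_genotype_file(handle):
    """Yield (rsid, chromosome, position, genotype) from a raw data export."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue   # skip blank lines and comments
        rsid, chrom, pos, genotype = line.split("\t")
        yield rsid, chrom, int(pos), genotype


# Hypothetical rows, for illustration only.
example = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tAG\n"
    "rs0000003\tX\t3000\t--\n"
)

calls = list(parse_genotype_file(example))
per_chromosome = Counter(chrom for _, chrom, _, _ in calls)
no_calls = sum(genotype == "--" for *_, genotype in calls)
print(len(calls), "markers;", dict(per_chromosome), ";", no_calls, "no-call")
```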
24. CASE STUDIES
23&ME DIRECT TO CONSUMER GENETICS
▸ “Genetic information can reveal that someone you thought you were
related to is not your biological relative. This happens most frequently in
the case of paternity.”
▸ “Learning that your genotype is associated with an increased risk of a
particular condition can be difficult, especially if you have seen a friend or
family member struggle with a similar issue.”
▸ “Because genetic information is hereditary, knowing something about
your genetics also tells you something about those closely related to you.
Your family may or may not want to know this information as well, and
relationships with others can be affected by learning about your DNA.”
▸ Link & Siblings and half-siblings & Genome view
26. BIOLOGY 201: BUT …
BUT THERE ARE MANY (PRACTICAL) CHALLENGES THAT NEED TO BE ADDRESSED
▸ Speed (of mappers, cleaners, collapsers, annotators) is a *major* problem - in the real world,
outside of the Ivory Tower
▸ Tools are not designed to work together
▸ Technical reproducibility between centres
▸ Data sharing issues, and lack of consistent nomenclature and file format (and chr) horrors
▸ Getting it wrong can have devastating consequences (pathogenic variant later reclassified as
benign in prenatal diagnosis; athletes erroneously deemed to be at risk of cardiac failure)
▸ Differences in interpretation between pathologists/ doctors - and hence different patient
outcomes
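One small, concrete instance of the "chr horrors": the same chromosome may be called "chr1" in one file and "1" in another, and the mitochondrial genome "chrM" or "MT". The following is a minimal sketch of a normaliser for primary chromosomes only. The style names "ucsc" and "ensembl" are labels chosen for this example, and anything that is not a primary chromosome is passed through untouched.

```python
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y", "M", "MT"}


def normalise_chrom(name, style="ucsc"):
    """Convert between 'chr1'/'chrM' (style='ucsc') and '1'/'MT' (style='ensembl')."""
    core = name.strip()
    if core.lower().startswith("chr"):
        core = core[3:]
    core = core.upper()
    if core not in PRIMARY:
        return name   # unplaced contigs, decoys etc. are left alone
    if core in {"M", "MT"}:
        core = "M" if style == "ucsc" else "MT"
    return "chr" + core if style == "ucsc" else core


for raw in ["1", "chr1", "chrX", "MT", "chrM"]:
    print(raw, "->", normalise_chrom(raw), "/", normalise_chrom(raw, style="ensembl"))
```

Renaming only fixes labels. It does not make coordinates from different genome builds comparable, which is a separate problem.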
28. THE ONE SLIDE ABOUT WHAT I ACTUALLY DO…
▸ GENCODE 25
▸ hg38
29. ADDITIONAL RESOURCES
▸ Galaxy tutorials and walk-throughs (for when you’re starting out) https://wiki.galaxyproject.org/Learn/GalaxyNGS101
▸ Broad Institute (Harvard/MIT) Public Lectures
▸ Genomics England YouTube
▸ PyCon talk by Titus Brown, with example of how to run bcbio on Ashkenazi trio dataset
▸ Bcbio sample datasets and analyses, especially the exome and whole genome variant
analysis, tumour vs normal comparisons [Good for trying out variant analysis, not so
good for RNA at the moment]
30. IF YOU WANT TO TRY THIS AT HOME…
WHERE TO GET DATA, AND HOW TO PROCESS IT
▸ Look for a research study you’re interested in on PubMed, and find where it links to the raw data
(Methods section and supplementary tables, with “weird” identifiers, in fastq)
▸ Data from research studies is usually (and in principle must be) deposited in the European Nucleotide Archive
(ENA), where you can download it in fastq format.
▸ First, try to process it to reproduce the authors’ results. Galaxy provides a web interface that runs
many standard command-line tools and allows you to look at the output - good as “leading strings”
▸ Frameworks such as bcbio provide managed environments for analysis
▸ Most biological software runs on Linux, and can be chained together using bash. I would go from an
exploratory analysis in Galaxy to an analysis that chains together existing tools via bash or a
complex bioinformatics pipeline management system (Wikipedia)
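Chaining existing command-line tools can also be driven from Python instead of bash. The following is a sketch under stated assumptions: bwa, samtools and bcftools are installed and on the PATH, the file names are placeholders, and bcftools is swapped in for the variant-calling step as one possible choice rather than the one from the slides. By default it only prints the commands (dry run).

```python
import subprocess


def build_commands(reference, fastq, prefix):
    """Return (argv, stdout_path) pairs; stdout_path is None unless the
    tool's standard output has to be captured into a file."""
    sam, bam = prefix + ".sam", prefix + ".sorted.bam"
    bcf, vcf = prefix + ".pileup.bcf", prefix + ".vcf"
    return [
        (["bwa", "index", reference], None),              # index the reference
        (["bwa", "mem", reference, fastq], sam),          # map reads -> SAM
        (["samtools", "sort", "-o", bam, sam], None),     # sort by coordinate
        (["samtools", "index", bam], None),
        (["bcftools", "mpileup", "-f", reference,
          "-O", "b", "-o", bcf, bam], None),              # per-position pileup
        (["bcftools", "call", "-m", "-v",
          "-O", "v", "-o", vcf, bcf], None),              # call variants -> VCF
    ]


def run(commands, dry_run=True):
    """Print the commands, or execute them one after another."""
    for argv, stdout_path in commands:
        if dry_run:
            print(" ".join(argv) + (" > " + stdout_path if stdout_path else ""))
        elif stdout_path:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, stdout=out, check=True)
        else:
            subprocess.run(argv, check=True)


run(build_commands("reference.fa", "reads.fastq", "sample"))
```

Once a chain like this grows beyond a handful of steps, a pipeline management system or a framework such as bcbio handles restarts, parallelism and logging far better than a hand-written script.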
31. IF YOU WANT TO TRY THIS AT HOME…
DANGER, WILL ROBINSON! DANGER!
▸ BUT: Because of the latest technologies, you as a programming-literate
individual are in a better position to understand this data than most
▸ Understanding and playing with this data is addictive - and beautiful…
▸ This is coming to a hospital near you
32. OTHER “BIOLOGY” OF INTEREST…
▸ “Algorithms stuff” (Talk tomorrow!)
▸ Biological image analysis (fMRI, microscopy)
▸ Contribute to projects such as Galaxy and bcbio
▸ Machine learning of patient records
▸ Integrating IOT and wearables with medical data and patient records
▸ Cool stuff in cataloguing the genetic diversity of life, choosing which areas should be
made into national parks based on data, or understanding disease spread (ex. flu
across Asia)