Analysing Exome Data with KNIME
Pierre Lindenbaum PhD
UMR915 – Institut du thorax, Nantes, France
@yokofakun
http://plindenbaum.blogspot.com
2 exomes sequenced
1st case: for a given mutation we expect [m/m] in one exome and not([m/m]) in the other.
Files
One record, 42 columns:
$1 Position.hg19 : 142653
$2 chrom : chr10
$3 sample.ID : sample1
$4 rs.name : rsXXXX
$5 hapmap_ref_other :
$6 X1000Genome.obs :
$7 X1000Genome.desc :
$8 Freq.HTZ.ExomesV1 : 0
$9 Freq.Hom.ExomesV1 : 0
$10 A : 0
$11 C : 5
$12 G : 0
$13 T : 3
$14 modified_call : CT
$15 total : 9
$16 used : 8
$17 score : 18.30:12.00
$18 reference : C
$19 type : SNP_het1
$20 Gene.name : ROXAN
$21 Gene.start : 143652
$22 Gene.end : 293700
$23 strand : -
$24 nbre.exon : 11
$25 refseq : NR_0090
$26 typeannot : 3-UTR
$27 type.pos :
$28 index.cdna :
$29 index.prot :
$30 Taille.cdna : 1769
$31 Intron.start :
$32 Intron.end :
$33 codon.wild :
$34 aa.wild :
$35 codon.mut :
$36 aa.mut :
$37 cds.wild :
$38 cds.mut :
$39 prot.wild :
$40 prot.mut :
$41 mirna : no
$42 region.splice : !N/A

The following slides highlight groups of columns in this same record: Genomic Position, Sample Name, RS## number, Ref. & Alt. alleles, Gene, Prediction, Homo/Hetero zygote.
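A listing like the one above can be produced from a header line and one data row. The sketch below is only an illustration: it assumes the annotation file is tab-delimited with its column names on the first line, and both the helper name `show_record` and the three-column toy input are made up for this example.

```shell
#!/bin/sh
# show_record (hypothetical helper): read a tab-delimited stream whose first
# line is a header, and print "$N name : value" for the first data row.
show_record() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; n = NF; next }
        NR == 2 { for (i = 1; i <= n; i++) printf("$%d %s : %s\n", i, name[i], $i); exit }
    '
}

# Toy input: only three of the 42 columns, values taken from the slide.
printf 'Position.hg19\tchrom\tsample.ID\n142653\tchr10\tsample1\n' | show_record
```

Knowing the column numbers matters because the bash versions later in the deck address fields by number ($4, $19, $20, $26).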
http://www.knime.org
 
Our workflow:
 
Read the data
Rename both “Sample” columns
Remove the sequences (save memory/speed)
Expect “not (snp_diff.*)” for the wild sample
Expect “snp_diff.*” for the mutated sample
Merge data. Two columns: “SAMPLE_WILD” & “SAMPLE_MUTATED”
Highlight low quality
Remove low quality
Must be located in a gene
Remove if known rs#
Remove if synonymous mutation
Remove wild allele from Alt. (cleanup)
Group by Gene
 
Keep mutations carried by both samples
Group by Gene Name & Visualize
 
Retrieve the SNPs for each Gene.
 
bash version...

# NOTE: fields are assumed to be tab-delimited (the tabs were lost in the slide text).
TAB=$'\t'

# mutated exome:
# remove rs
# in gene
# remove the low qualities ("douteux" = doubtful)
# keep SNP_diff
# only the non-synonymous or stop
# remove DNA & prot sequences
# order by GENE
gunzip -c AllChrom.exome.snp.pool.new.annotation.gz |
awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
awk -F '\t' '{if($20!="") print;}' |
awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
awk -F '\t' '{if(index($19,"_diff")!=0) print;}' |
awk -F '\t' '{if(index($26,"nonsense")!=0 || index($26,"missense")!=0) print;}' |
cut -d "$TAB" -f 1-27 |
sort -t "$TAB" -k20,20 > _jeter1.txt

# extract wild exome
# remove rs
# remove SNP_diff
# in gene
# order by gene
gunzip -c AllChrom.exome.snp.u2437.new.annotation.gz |
awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
awk -F '\t' '{if(index($19,"_diff")==0) print;}' |
awk -F '\t' '{if($20!="") print;}' |
cut -d "$TAB" -f 1-27 |
sort -t "$TAB" -k20,20 > _jeter3.txt

# join wild & mutated data by gene
# check wild sample has no mutation in the pair of mutated snps
# remove wild data
join -t "$TAB" -1 20 -2 20 _jeter1.txt _jeter3.txt |
awk -F '\t' '{if($3==$29 && int($2) == int($28) ) print;}' |
cut -d "$TAB" -f 1 |
sort | uniq

rm _jeter*.txt
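The chain of `awk` filters applied to the mutated exome could also be collapsed into a single `awk` program. The following is a sketch, not the author's script: it assumes tab-delimited input with rs.name, type, Gene.name and typeannot in columns 4, 19, 20 and 26, as in the record shown earlier. The names `filter_mutated` and `make_row`, and the toy values (including the type `SNP_diff_douteux`), are invented for the demonstration.

```shell
#!/bin/sh
# filter_mutated (hypothetical): keep rows that have no rs name, lie in a gene,
# are not flagged "douteux", are typed *_diff*, and are missense or nonsense.
filter_mutated() {
    awk -F '\t' '
        substr($4, 1, 2) != "rs" &&
        $20 != "" &&
        index($19, "douteux") == 0 &&
        index($19, "_diff") != 0 &&
        (index($26, "nonsense") != 0 || index($26, "missense") != 0)
    '
}

# make_row RS TYPE GENE ANNOT (hypothetical): emit one 27-column toy row.
make_row() {
    awk -v rs="$1" -v kind="$2" -v gene="$3" -v annot="$4" 'BEGIN {
        for (i = 1; i <= 27; i++) f[i] = ""
        f[1] = "100"; f[2] = "chr1"; f[3] = "s1"
        f[4] = rs; f[19] = kind; f[20] = gene; f[26] = annot
        line = f[1]
        for (i = 2; i <= 27; i++) line = line "\t" f[i]
        print line
    }'
}

{
    make_row ""      "SNP_diff"         "ROXAN" "missense"   # kept
    make_row "rs123" "SNP_diff"         "ROXAN" "missense"   # dropped: known rs
    make_row ""      "SNP_diff_douteux" "ROXAN" "nonsense"   # dropped: low quality
    make_row ""      "SNP_het1"         "ROXAN" "missense"   # dropped: not *_diff*
    make_row ""      "SNP_diff"         ""      "missense"   # dropped: not in a gene
    make_row ""      "SNP_diff"         "ROXAN" "3-UTR"      # dropped: neither missense nor nonsense
} | filter_mutated | cut -f 19,20,26
```

A single program reads each line once instead of passing it through five processes; the separate filters in the slide are arguably easier to comment out one at a time while exploring the data.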
2nd case: compound heterozygous
In one gene: SNP1: [m/+], SNP2: [m/+]
The workflow:
Read the [m] (mutated sample) & [+] (wild sample) files
Remove cDNA & protein sequences
Remove the SNPs having a rs#
Keep the heterozygous mutations
Remove poor quality
Keep the non-synonymous mutations
Create a new column: chrom + "_" + position
Rename the columns 'sample-id' (will generate two distinct columns after joining)
Left join on the  column 'chrom_col'
Keep the mutations that were NOT part of the wild sample.
Cleanup, remove some columns.
Duplicate the table to create two lists of SNPs (5' & 3').
Join both tables on gene name.
Keep the SNPs having: pos(snp1) < pos(snp2)
Display the results
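The two steps "left join on the chrom/position column" and "keep the mutations that were NOT part of the wild sample" amount to an anti-join. One possible way to sketch it is a single `awk` pass over both files. This is an assumption-laden illustration rather than the workflow itself: the files are taken to be tab-delimited with the position in column 1 and the chromosome in column 2, and `anti_join` plus the toy files are hypothetical.

```shell
#!/bin/sh
# anti_join WILD MUTATED (hypothetical): print the rows of MUTATED whose
# chrom_position key does not occur in WILD.
# Caveat: the NR == FNR idiom assumes WILD is not empty.
anti_join() {
    awk -F '\t' '
        NR == FNR { seen[$2 "_" $1] = 1; next }          # first file: collect wild keys
        { key = $2 "_" $1; if (!(key in seen)) print }   # second file: unmatched rows only
    ' "$1" "$2"
}

# Toy data: position, chromosome, sample label.
tmp=$(mktemp -d)
printf '100\tchr1\twild\n200\tchr1\twild\n' > "$tmp/wild.tsv"
printf '100\tchr1\tmut\n150\tchr1\tmut\n200\tchr2\tmut\n' > "$tmp/mut.tsv"
anti_join "$tmp/wild.tsv" "$tmp/mut.tsv"
rm -r "$tmp"
```

In the toy run, only the mutated rows at chr1:150 and chr2:200 have no counterpart in the wild file, so those are the rows printed.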
bash version...

# NOTE: fields are assumed to be tab-delimited (tabs and the "\t", "\n" escapes were lost in the slide text).
TAB=$'\t'

# remove rs
# only keep the 'SNP_het'
# remove the low qualities ("douteux" = doubtful)
# only the non-synonymous or stop
# remove DNA & prot sequences
# add chrom_position flag
# sort
gunzip -c AllChrom.exome.snp.pool.new.annotation.gz |
awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
awk -F '\t' '{if(index($19,"_het")!=0) print;}' |
awk -F '\t' '{if(index($26,"nonsense")!=0 || index($26,"missense")!=0) print;}' |
cut -d "$TAB" -f 1-27 |
awk -F '\t' '{printf("%s_%s\t%s\n",$2,$1,$0);}' |
sort -t "$TAB" -k1,1 > _jeter1.txt

# get all distinct chrom_pos in file
cut -d "$TAB" -f 1 _jeter1.txt | sort -t "$TAB" -k1,1 | uniq > _jeter2.txt

# extract wild exome
# keep chrom,position
# add chrom_position flag
# sort
gunzip -c AllChrom.exome.snp.u2437.new.annotation.gz |
cut -d "$TAB" -f 1,2 |
awk -F '\t' '{printf("%s_%s\n",$2,$1);}' |
sort -t "$TAB" -k1,1 | uniq > _jeter3.txt

# get [m] chrom_pos not in [+] chrom_pos set
comm -2 -3 _jeter2.txt _jeter3.txt > _jeter4.txt

# join uniq [m] chrom_pos & mutated data
# remove chrom_pos
# order by gene
join -t "$TAB" --check-order -1 1 -2 1 _jeter1.txt _jeter4.txt |
cut -d "$TAB" -f 2- |
sort -t "$TAB" -k20,20 > _jeter5.txt

# join to self using key = "gene name"
# only keep if first mutation in same gene/chromosome and pos1 < pos2
# keep some columns
join -t "$TAB" -j 20 _jeter5.txt _jeter5.txt |
awk -F '\t' '{if($3==$29 && int($2) < int($28) ) print;}' |
cut -d "$TAB" -f 1,2,3,20,26,28,46,52 > _jeter6.txt

# extract gene names
cut -d "$TAB" -f 1 _jeter6.txt | sort | uniq

rm _jeter[12345].txt
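The self-join that pairs SNPs within a gene can be tried on a much smaller table. The sketch below assumes a three-column, tab-delimited file (gene, chromosome, position) instead of the 27-column layout used in the slides; `snp_pairs` and the toy rows are hypothetical.

```shell
#!/bin/sh
# snp_pairs FILE (hypothetical): FILE holds gene<TAB>chrom<TAB>position.
# Prints gene, chrom, pos1, pos2 for every pair of SNPs in the same gene
# and on the same chromosome with pos1 < pos2.
snp_pairs() {
    tab=$(printf '\t')
    sorted=$(mktemp)
    LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$sorted"    # join needs sorted input
    # After the join: $1 gene, $2 chrom1, $3 pos1, $4 chrom2, $5 pos2.
    LC_ALL=C join -t "$tab" -j 1 "$sorted" "$sorted" |
        awk -F '\t' '$2 == $4 && int($3) < int($5) { print $1 "\t" $2 "\t" $3 "\t" $5 }'
    rm -f "$sorted"
}

# Toy data: GENE1 carries two SNPs, GENE2 only one.
toy=$(mktemp)
printf 'GENE1\tchr1\t250\nGENE2\tchr2\t300\nGENE1\tchr1\t100\n' > "$toy"
snp_pairs "$toy"
rm -f "$toy"
```

Requiring pos1 < pos2 removes both the pairing of a SNP with itself and the mirrored duplicate of each pair, so a gene with a single SNP, like GENE2 here, produces no output.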
Last step... http://en.wikipedia.org/wiki/File:Nobel_Prize.png
Thanks. Remember: you should learn how to use the Unix command line...
