Analyzing Exome Data with KNIME

Pierre Lindenbaum PhD
UMR915 – Institut du thorax
Nantes, France
@yokofakun
http://plindenbaum.blogspot.com
[email_address]

1st case: for a given mutation we expect [m/m] in the mutated sample and not([m/m]) in the wild sample.

Each record has 42 columns:

$1 Position.hg19 : 142653
$2 chrom : chr10
$3 sample.ID : sample1
$4 rs.name : rsXXXX
$5 hapmap_ref_other :
$6 X1000Genome.obs :
$7 X1000Genome.desc :
$8 Freq.HTZ.ExomesV1 : 0
$9 Freq.Hom.ExomesV1 : 0
$10 A : 0
$11 C : 5
$12 G : 0
$13 T : 3
$14 modified_call : CT
$15 total : 9
$16 used : 8
$17 score : 18.30:12.00
$18 reference : C
$19 type : SNP_het1
$20 Gene.name : ROXAN
$21 Gene.start : 143652
$22 Gene.end : 293700
$23 strand : -
$24 nbre.exon : 11
$25 refseq : NR_0090
$26 typeannot : 3-UTR
$27 type.pos :
$28 index.cdna :
$29 index.prot :
$30 Taille.cdna : 1769
$31 Intron.start :
$32 Intron.end :
$33 codon.wild :
$34 aa.wild :
$35 codon.mut :
$36 aa.mut :
$37 cds.wild :
$38 cds.mut :
$39 prot.wild :
$40 prot.mut :
$41 mirna : no
$42 region.splice : !N/A

Columns of interest:
Genomic Position: $1 Position.hg19, $2 chrom
Sample Name: $3 sample.ID
RS## number: $4 rs.name
Ref. & Alt. alleles: $18 reference, $14 modified_call
Gene: $20 Gene.name
Prediction: $26 typeannot
Homo/Hetero zygote: $19 type
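A record laid out like the one above can be printed straight from a tab-delimited annotation file. The following is only a sketch: the function name show_record is hypothetical, and it assumes that the first line of the file is a header holding the column names, which the slides do not state.

```shell
# Print the first data record of a tab-delimited file, one column per line,
# in the form "$N name : value".
# Assumption (not stated in the slides): line 1 is a header with the column names.
show_record() {
  awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
    NR == 2 { for (i = 1; i <= NF; i++) printf("$%d %s : %s\n", i, name[i], $i); exit }
  ' "$@"
}
```

Reading the column numbers off this listing is what makes the $4, $19, $20 and $26 tests in the awk filters easy to write.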

Rename both “Sample” columns

Remove the sequences (save memory/speed)

Expect “not (snp_diff.*)” for the wild sample

Merge data. Two columns: “SAMPLE_WILD” & “SAMPLE_MUTATED”

Remove wild allele from Alt. (cleanup)
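On the command line, this cleanup could look like the sketch below. It rests on assumptions that the slides do not confirm: that column 14 (modified_call) lists the observed alleles, that column 18 (reference) holds the wild allele, and that the result is appended as a new last column. The function name strip_reference_allele is hypothetical.

```shell
# Append a column holding the called alleles minus the reference allele.
# Assumptions: tab-delimited input, column 14 = modified_call (e.g. "CT"),
# column 18 = reference base (e.g. "C").
strip_reference_allele() {
  awk -F '\t' -v OFS='\t' '{
    alt = $14
    if ($18 != "") gsub($18, "", alt)   # drop every occurrence of the reference base
    print $0, alt
  }' "$@"
}
```

For the example record (modified_call CT, reference C) the appended value is T.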

Keep mutations carried by both samples

Group by Gene Name & Visualize
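The grouping step has a simple command-line counterpart. This is a minimal sketch, assuming tab-delimited rows with the gene name in column 20 as in the record layout; the function name count_by_gene is hypothetical, and the visualization itself is left to KNIME.

```shell
# Count the retained mutations per gene, most mutated genes first.
# Assumption: tab-delimited input with the gene name in column 20.
count_by_gene() {
  awk -F '\t' '
    $20 != "" { n[$20]++ }
    END { for (g in n) printf("%d\t%s\n", n[g], g) }
  ' "$@" |
    sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

The output has two tab-separated columns, the count and the gene name, which is enough to feed a bar chart.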

Retrieve the SNPs for each Gene.

bash version...

#columns are tab-delimited
#remove rs
#in gene
#remove the low qualities ("douteux" = doubtful)
#keep SNP_diff
#only the non-synonymous or stop
#remove DNA & prot sequences
#order by GENE
gunzip -c AllChrom.exome.snp.pool.new.annotation.gz |
 awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
 awk -F '\t' '{if($20!="") print;}' |
 awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
 awk -F '\t' '{if(index($19,"_diff")!=0) print;}' |
 awk -F '\t' '{if(index($26,"nonsense")!=0 || index($26,"missense")!=0) print;}' |
 cut -f 1-27 |
 sort -t $'\t' -k20,20 > _jeter1.txt

#extract wild exome
#remove rs
#remove the low qualities
#remove SNP_diff
#in gene
#remove DNA & prot sequences
#order by gene
gunzip -c AllChrom.exome.snp.u2437.new.annotation.gz |
 awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
 awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
 awk -F '\t' '{if(index($19,"_diff")==0) print;}' |
 awk -F '\t' '{if($20!="") print;}' |
 cut -f 1-27 |
 sort -t $'\t' -k20,20 > _jeter3.txt

#join wild & mutated data by gene
#check wild sample has no mutation in the pair of mutated snps
#(after the join: $1 = gene, $2/$3 = position/chrom in the pool, $28/$29 = position/chrom in the wild sample)
#remove wild data
join -t $'\t' -1 20 -2 20 _jeter1.txt _jeter3.txt |
 awk -F '\t' '{if($3==$29 && int($2) == int($28) ) print;}' |
 cut -f 1 |
 sort | uniq

rm _jeter*.txt

2nd case: compound heterozygous. In one gene: SNP1: [m/+], SNP2: [m/+]

Read the [m] (mutated sample) & [+] (wild sample) files

Remove cDNA & protein sequences

Keep the heterozygous mutations

Keep the non-synonymous mutations

Create a new column: chrom + "_" + position

Rename the columns 'sample-id' (will generate two distinct columns after joining)

Left join on the column 'chrom_col'

Keep the mutations that were NOT part of the wild sample.
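The left join followed by this filter amounts to an anti-join, which awk can do in one pass with an in-memory set. This is a sketch under stated assumptions: both files are tab-delimited with the position in column 1 and the chromosome in column 2, and the set of wild positions fits in memory. The function name not_in_wild and the file names in the usage line are hypothetical.

```shell
# Print the rows of the mutated file whose chrom_position key is absent
# from the wild file.
# Assumptions: tab-delimited files, column 1 = position, column 2 = chromosome.
# usage: not_in_wild wild.tsv mutated.tsv
not_in_wild() {
  awk -F '\t' '
    { key = $2 "_" $1 }
    FILENAME == ARGV[1] { seen[key] = 1; next }   # first file: remember wild positions
    !(key in seen)                                # second file: keep unseen positions
  ' "$1" "$2"
}
```

Unlike comm and join, this variant does not require either file to be sorted.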

Duplicate the table to create two lists of SNPs (5' & 3').

Join both tables on gene name.

Keep the SNPs having: pos(snp1) < pos(snp2)
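The self-join and the pos(snp1) < pos(snp2) filter can also be sketched in a single awk pass. The condition keeps each unordered pair exactly once and discards a SNP paired with itself. Assumptions: tab-delimited input, position in column 1, gene name in column 20, all SNPs of a gene on the same chromosome. The function name snp_pairs is hypothetical.

```shell
# List every pair of SNPs located in the same gene, smaller position first.
# Output columns: gene, pos1, pos2 (with pos1 < pos2).
# Assumptions: tab-delimited input, column 1 = position, column 20 = gene name.
snp_pairs() {
  awk -F '\t' '
    $20 != "" {
      gene = $20
      # pair the current SNP with every SNP already seen in this gene
      for (i = 1; i <= count[gene]; i++) {
        other = pos[gene, i]
        if (other + 0 < $1 + 0)      printf("%s\t%s\t%s\n", gene, other, $1)
        else if ($1 + 0 < other + 0) printf("%s\t%s\t%s\n", gene, $1, other)
      }
      pos[gene, ++count[gene]] = $1
    }
  ' "$@"
}
```

A gene with a single retained SNP produces no output, so the genes that do appear are the candidates for the second case.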

bash version...

#columns are tab-delimited
#remove rs
#remove the low qualities ("douteux" = doubtful)
#only keep the 'SNP_het*'
#only the non-synonymous or stop
#remove DNA & prot sequences
#add chrom_position flag
#sort
gunzip -c AllChrom.exome.snp.pool.new.annotation.gz |
 awk -F '\t' '{if(substr($4,1,2)!="rs") print;}' |
 awk -F '\t' '{if(index($19,"douteux")==0) print;}' |
 awk -F '\t' '{if(index($19,"_het")!=0) print;}' |
 awk -F '\t' '{if(index($26,"nonsense")!=0 || index($26,"missense")!=0) print;}' |
 cut -f 1-27 |
 awk -F '\t' '{printf("%s_%s\t%s\n",$2,$1,$0);}' |
 sort -t $'\t' -k1,1 > _jeter1.txt

#get all distinct chrom_pos in file
cut -f 1 _jeter1.txt | sort -t $'\t' -k1,1 | uniq > _jeter2.txt

#extract wild exome
#keep chrom,position
#add chrom_position flag
#sort
gunzip -c AllChrom.exome.snp.u2437.new.annotation.gz |
 cut -f 1,2 |
 awk -F '\t' '{printf("%s_%s\n",$2,$1);}' |
 sort -t $'\t' -k1,1 | uniq > _jeter3.txt

#get [m] chrom_pos not in [+] chrom_pos set
comm -2 -3 _jeter2.txt _jeter3.txt > _jeter4.txt

#join uniq [m] chrom_pos & mutated data
#remove chrom_pos
#order by gene
join -t $'\t' --check-order -1 1 -2 1 _jeter1.txt _jeter4.txt |
 cut -f 2- |
 sort -t $'\t' -k20,20 > _jeter5.txt

#join to self using key= "gene name"
#only keep if both mutations are in the same gene/chromosome and pos1 < pos2
#(after the join: $1 = gene, $2/$3 = position/chrom of SNP1, $28/$29 = position/chrom of SNP2)
#keep some columns
join -t $'\t' -j 20 _jeter5.txt _jeter5.txt |
 awk -F '\t' '{if($3==$29 && int($2) < int($28) ) print;}' |
 cut -f 1,2,3,20,26,28,46,52 > _jeter6.txt

#extract gene names
cut -f 1 _jeter6.txt | sort | uniq

rm _jeter[12345].txt

Last step... http://en.wikipedia.org/wiki/File:Nobel_Prize.png

Thanks. Remember: you should learn how to use the Unix command line...
