SlideShare a Scribd company logo
1 of 3
Download to read offline
[1]
1st KEYSTONE Training Summer School
Hackathon Session
23rd - 24th July 2015 - University of Malta
Jean-Paul Ebejer,Joel Azzopardi, Joseph Bonello, Charlie Abela
Contact info: jean.p.ebejer@um.edu.mt
This data hackathon consists of processing "large" files consisting of biological data to
extract useful information.
Challenge Overview
You will be given a text file containing the complete human genome
(GRCh38.p3.genome.fa, 3.2GB), and an index file (gencode.v23.annotation.gff31
,
1.2GB) which contains annotations of features in that genome. These files have been
downloaded from the GenCode project2
. This .fa file (known as a FASTA file) contains
genomic sequence data (bases G, T, A, and C) for each chromosome. A FASTA file is
typically made up of a number of records, each consisting of a header followed by a
sequence. The chromosomes are therefore defined with a header line which starts with a
'>' character (e.g. >chr10 10). The genomic sequence of the chromosome starts in the
next line after the header and continues until either another header line is found or the end
of file is reached.
BRCA1 and BRCA2 are two genes that code for tumour suppression and DNA repair
proteins. When a mutation occurs in these two genes, the production of these proteins
may be affected negatively. This increases the likelihood of developing breast and ovarian
cancer in females. In this hackathon you will deal with data regarding these genes.
1
The GFF3 file format is described here: http://www.gencodegenes.org/data_format.html
2
http://www.gencodegenes.org/
[2]
Tasks
The programming tasks will focus around the gene BRCA2 (gene identifier:
ENSG00000139618.14) which is located on chromosome 13.
1. Using the index file as a guide, find the protein-coding sequences for the BRCA2
gene and extract these (displaying them to the user). The chromosome number
(column 1) in the annotation file (.gff3) should be "chr13" for chromosome 13. The
feature type (column 3) in the annotation file should be 'CDS' for coding sequence.
The ninth column in the annotation file contains key-value pairs with additional
information. Using this field extract the entries for BRCA2 (where gene_id is
"ENSG00000139618.14" and transcript_type is "protein_coding"). Use the start
(column 4) and end (column 5) values in the annotation file to extract the BRCA2
DNA sequence from the FASTA file. In order to do this, you need to first extract
the sequence of chromosome 13 from the FASTA file. The header line for this entry
is ">chr13 13". (10 points)
2. Translate the genomic sequence you have extracted above into the corresponding
protein sequence. This occurs by reading in triplets of DNA bases (known as
codons) and translating these into amino acids3
. The genomic phase (Column 8) in
the annotation file determines the starting offset to read the sequence in. You can
test the generated (translated) protein sequence by using it as a query in BLAST4
using protein blast (blastp) and "Homo Sapiens" as the target organism. The BRCA2
protein should be your top result hit. (10 points)
3. In previous tasks we have searched for the sequence given the gene identifier (or
name) and the chromosome where the gene is located. Your task is now to execute
a reverse search - given a genomic sequence, find the possible gene name(s) and
chromosome location. In this task we assume exact matches between the query and
the sequence. The result should also show the total number of matches of the
query to the genome. (10 points)
4. Generate a simple visualization which shows where these genomic queries are
foundin the human genome. (10 points)
This could look something like the following (where each dash character represents
10,000,000 genomic bases):
* * * **
chr1 -------------------------
* *
chr2 -------------------------
...
*
chr14 ----------
...
3
The DNA-to-Amino Acid translation table may be found here:
https://en.wikipedia.org/wiki/DNA_codon_table
4
http://blast.ncbi.nlm.nih.gov/
[3]
5. As an added task, allow for fuzzy searches. A five-letter genomic query 'GATAG'
with 1 mismatch allowed will also match 'AATAG' and 'GATAA' but not 'AATAA'.
The number of mismatches allowed should be parameterized and cannot be more
than 70% of the length of the query. (10 points)
6. You have been given a protein sequencein file protein.txt. Find the genomic
counterpart in the genome FASTA file. (10 points)
The boring bit: Judging Criteria, Rules and
Prizes!
The idea behind this competition is for students to apply the theoretical concepts they have
learnt during the summer school to a practical scenario. The following points apply to the
competition:
 Participation to this hackathon is reserved to students attending the 1st Keystone
Training Summer School at the University of Malta.
 Participation is limited to groups of at least two and at most three students.
Registrations should be made by emailing: jean.p.ebejer@um.edu.mt
 Prizes are of the order of €100 for each participant in the group that places first;
€70 for those placing second; and €40 for those placing third.
 At the end of the Hackathon each group will have to present their approaches
(design decisions etc.) in a five minute presentation.
 Also, a live demonstration of the working tasks has to be given to one of the judges.
 Points will be allocated not only for the implementation of the specified tasks, but
also based on execution speed of the solution and on the most innovative ideas
when tackling "big" data.
 For this hackathon you may use any OS and programming language of your choice.
 Each group will be given a Microsoft Azure account. You can start as many virtual
machines as you like, but the instance type is required to be A35
.
 Finally, the judging panel's decision is final.
Sponsors
COST is supported by the EU
Framework Programme
Horizon 2020
5
http://azure.microsoft.com/en-us/pricing/details/virtual-machines/

More Related Content

Viewers also liked

Aggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document RelevanceAggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document RelevanceJosé Ramón Ríos Viqueira
 
Exploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesExploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesLaura Po
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked dataLaura Po
 

Viewers also liked (6)

Curse of Dimensionality and Big Data
Curse of Dimensionality and Big DataCurse of Dimensionality and Big Data
Curse of Dimensionality and Big Data
 
Aggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document RelevanceAggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document Relevance
 
Exploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesExploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sources
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked data
 
School intro
School introSchool intro
School intro
 
Information Retrieval Evaluation
Information Retrieval EvaluationInformation Retrieval Evaluation
Information Retrieval Evaluation
 

Similar to 1st KeyStone Summer School - Hackathon Challenge

Hb 1486-001 1074970 qsg-gene_readdataanalysis_1112
Hb 1486-001 1074970 qsg-gene_readdataanalysis_1112Hb 1486-001 1074970 qsg-gene_readdataanalysis_1112
Hb 1486-001 1074970 qsg-gene_readdataanalysis_1112Elsa von Licy
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012Dan Gaston
 
RSEM and DE packages
RSEM and DE packagesRSEM and DE packages
RSEM and DE packagesRavi Gandham
 
Genetic algorithm guided key generation in wireless communication (gakg)
Genetic algorithm guided key generation in wireless communication (gakg)Genetic algorithm guided key generation in wireless communication (gakg)
Genetic algorithm guided key generation in wireless communication (gakg)IJCI JOURNAL
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
BC-Cancer ChimeraScan Presentation
BC-Cancer ChimeraScan PresentationBC-Cancer ChimeraScan Presentation
BC-Cancer ChimeraScan PresentationElijah Willie
 
Working with Dictionaries and ListsSets Modules you can use.pdf
Working with Dictionaries and ListsSets Modules you can use.pdfWorking with Dictionaries and ListsSets Modules you can use.pdf
Working with Dictionaries and ListsSets Modules you can use.pdfadvancesystem
 
Tools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisTools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisSANJANA PANDEY
 
Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchDavid Ruau
 
Quick start with pallaral meta on window10 virtual desktop-virtualbox linux u...
Quick start with pallaral meta on window10 virtual desktop-virtualbox linux u...Quick start with pallaral meta on window10 virtual desktop-virtualbox linux u...
Quick start with pallaral meta on window10 virtual desktop-virtualbox linux u...Thomas K. Y. Lam
 
Kulakova sbb2014
Kulakova sbb2014Kulakova sbb2014
Kulakova sbb2014Ek_Kul
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Li Shen
 
Arthropod es tpipeline_poster
Arthropod es tpipeline_posterArthropod es tpipeline_poster
Arthropod es tpipeline_posterTamizhmuhil
 
How to be a bioinformatician
How to be a bioinformaticianHow to be a bioinformatician
How to be a bioinformaticianChristian Frech
 
Polymer Brush Data Processor
Polymer Brush Data ProcessorPolymer Brush Data Processor
Polymer Brush Data ProcessorCory Bethrant
 

Similar to 1st KeyStone Summer School - Hackathon Challenge (20)

Hb 1486-001 1074970 qsg-gene_readdataanalysis_1112
Hb 1486-001 1074970 qsg-gene_readdataanalysis_1112Hb 1486-001 1074970 qsg-gene_readdataanalysis_1112
Hb 1486-001 1074970 qsg-gene_readdataanalysis_1112
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 
RSEM and DE packages
RSEM and DE packagesRSEM and DE packages
RSEM and DE packages
 
Genetic algorithm guided key generation in wireless communication (gakg)
Genetic algorithm guided key generation in wireless communication (gakg)Genetic algorithm guided key generation in wireless communication (gakg)
Genetic algorithm guided key generation in wireless communication (gakg)
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Genome comparision
Genome comparisionGenome comparision
Genome comparision
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
 
BC-Cancer ChimeraScan Presentation
BC-Cancer ChimeraScan PresentationBC-Cancer ChimeraScan Presentation
BC-Cancer ChimeraScan Presentation
 
Working with Dictionaries and ListsSets Modules you can use.pdf
Working with Dictionaries and ListsSets Modules you can use.pdfWorking with Dictionaries and ListsSets Modules you can use.pdf
Working with Dictionaries and ListsSets Modules you can use.pdf
 
Bioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmmBioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmm
 
Tools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisTools for Transcriptome Data Analysis
Tools for Transcriptome Data Analysis
 
Intro to Python programming and iPython
Intro to Python programming and iPython Intro to Python programming and iPython
Intro to Python programming and iPython
 
Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical Research
 
Quick start with pallaral meta on window10 virtual desktop-virtualbox linux u...
Quick start with pallaral meta on window10 virtual desktop-virtualbox linux u...Quick start with pallaral meta on window10 virtual desktop-virtualbox linux u...
Quick start with pallaral meta on window10 virtual desktop-virtualbox linux u...
 
Project report
Project reportProject report
Project report
 
Kulakova sbb2014
Kulakova sbb2014Kulakova sbb2014
Kulakova sbb2014
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 
Arthropod es tpipeline_poster
Arthropod es tpipeline_posterArthropod es tpipeline_poster
Arthropod es tpipeline_poster
 
How to be a bioinformatician
How to be a bioinformaticianHow to be a bioinformatician
How to be a bioinformatician
 
Polymer Brush Data Processor
Polymer Brush Data ProcessorPolymer Brush Data Processor
Polymer Brush Data Processor
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 

Recently uploaded (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

1st KeyStone Summer School - Hackathon Challenge

  • 1. [1] 1st KEYSTONE Training Summer School Hackathon Session 23rd - 24th July 2015 - University of Malta Jean-Paul Ebejer,Joel Azzopardi, Joseph Bonello, Charlie Abela Contact info: jean.p.ebejer@um.edu.mt This data hackathon consists of processing "large" files consisting of biological data to extract useful information. Challenge Overview You will be given a text file containing the complete human genome (GRCh38.p3.genome.fa, 3.2GB), and an index file (gencode.v23.annotation.gff31 , 1.2GB) which contains annotations of features in that genome. These files have been downloaded from the GenCode project2 . This .fa file (known as a FASTA file) contains genomic sequence data (bases G, T, A, and C) for each chromosome. A FASTA file is typically made up of a number of records, each consisting of a header followed by a sequence. The chromosomes are therefore defined with a header line which starts with a '>' character (e.g. >chr10 10). The genomic sequence of the chromosome starts in the next line after the header and continues until either another header line is found or the end of file is reached. BRCA1 and BRCA2 are two genes that code for tumour suppression and DNA repair proteins. When a mutation occurs in these two genes, the production of these proteins may be affected negatively. This increases the likelihood of developing breast and ovarian cancer in females. In this hackathon you will deal with data regarding these genes. 1 The GFF3 file format is described here: http://www.gencodegenes.org/data_format.html 2 http://www.gencodegenes.org/
  • 2. [2] Tasks The programming tasks will focus around the gene BRCA2 (gene identifier: ENSG00000139618.14) which is located on chromosome 13. 1. Using the index file as a guide, find the protein-coding sequences for the BRCA2 gene and extract these (displaying them to the user). The chromosome number (column 1) in the annotation file (.gff3) should be "chr13" for chromosome 13. The feature type (column 3) in the annotation file should be 'CDS' for coding sequence. The ninth column in the annotation file contains key-value pairs with additional information. Using this field extract the entries for BRCA2 (where gene_id is "ENSG00000139618.14" and transcript_type is "protein_coding"). Use the start (column 4) and end (column 5) values in the annotation file to extract the BRCA2 DNA sequence from the FASTA file. In order to do this, you need to first extract the sequence of chromosome 13 from the FASTA file. The header line for this entry is ">chr13 13". (10 points) 2. Translate the genomic sequence you have extracted above into the corresponding protein sequence. This occurs by reading in triplets of DNA bases (known as codons) and translating these into amino acids3 . The genomic phase (Column 8) in the annotation file determines the starting offset to read the sequence in. You can test the generated (translated) protein sequence by using it as a query in BLAST4 using protein blast (blastp) and "Homo Sapiens" as the target organism. The BRCA2 protein should be your top result hit. (10 points) 3. In previous tasks we have searched for the sequence given the gene identifier (or name) and the chromosome where the gene is located. Your task is now to execute a reverse search - given a genomic sequence, find the possible gene name(s) and chromosome location. In this task we assume exact matches between the query and the sequence. The result should also show the total number of matches of the query to the genome. (10 points) 4. Generate a simple visualization which shows where these genomic queries are foundin the human genome. (10 points) This could look something like the following (where each dash character represents 10,000,000 genomic bases): * * * ** chr1 ------------------------- * * chr2 ------------------------- ... * chr14 ---------- ... 3 The DNA-to-Amino Acid translation table may be found here: https://en.wikipedia.org/wiki/DNA_codon_table 4 http://blast.ncbi.nlm.nih.gov/
  • 3. [3] 5. As an added task, allow for fuzzy searches. A five-letter genomic query 'GATAG' with 1 mismatch allowed will also match 'AATAG' and 'GATAA' but not 'AATAA'. The number of mismatches allowed should be parameterized and cannot be more than 70% of the length of the query. (10 points) 6. You have been given a protein sequencein file protein.txt. Find the genomic counterpart in the genome FASTA file. (10 points) The boring bit: Judging Criteria, Rules and Prizes! The idea behind this competition is for students to apply the theoretical concepts they have learnt during the summer school to a practical scenario. The following points apply to the competition:  Participation to this hackathon is reserved to students attending the 1st Keystone Training Summer School at the University of Malta.  Participation is limited to groups of at least two and at most three students. Registrations should be made by emailing: jean.p.ebejer@um.edu.mt  Prizes are of the order of €100 for each participant in the group that places first; €70 for those placing second; and €40 for those placing third.  At the end of the Hackathon each group will have to present their approaches (design decisions etc.) in a five minute presentation.  Also, a live demonstration of the working tasks has to be given to one of the judges.  Points will be allocated not only for the implementation of the specified tasks, but also based on execution speed of the solution and on the most innovative ideas when tackling "big" data.  For this hackathon you may use any OS and programming language of your choice.  Each group will be given a Microsoft Azure account. You can start as many virtual machines as you like, but the instance type is required to be A35 .  Finally, the judging panel's decision is final. Sponsors COST is supported by the EU Framework Programme Horizon 2020 5 http://azure.microsoft.com/en-us/pricing/details/virtual-machines/