BIOINFORMATICS
SEQUENCE FILE
FORMATS
Presented By: Alphy Joseph
Date: 03 March 2016
Important file formats
•GenBank
•FASTA
•PIR
•ALN/ClustalW2
•GCG/MSF
Early Data Formats
•Early sequence databases stored
sequence data in a file. The file held the
sequence as plain ASCII text and had a
descriptive filename.
•This method became limiting when
researchers wanted to include
annotations and information about the
source of the sequence.
•Searching for sequences was also
difficult.
Flat File Storage Data
Formats
•By the time GenBank, EMBL and DDBJ
formed a collaboration (1986),
sequence databases had moved to a
defined flat file format with a shared
feature table format and annotation
standards.
•PIR also adopted a similar format
for protein sequences.
•The flat file formats from the
sequence databases are still used to
access and display sequences and their
annotation. They are also convenient
for storing local copies.
FASTA Format
•Bioinformaticists have developed a
standard format for nucleotide and
protein sequences that allows them to
be read by a wide range of programs.
This format is called FASTA format.
•In FASTA format, each nucleotide or
amino acid is represented by a
single letter.
•The first line of a FASTA record is the
comment (description) line, identified by
a leading greater-than symbol '>'. This
line identifies the sequence and usually
includes the accession number from NCBI,
GenBank or another repository.
•The remaining lines contain the
sequence, usually wrapped at a fixed
width such as 80 or 120 characters per
line.
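The FASTA layout described above can be read with a few lines of Python. This is a minimal sketch, not a reference implementation: the function names `parse_fasta` and `wrap_fasta` and the two example records are made up for illustration, and the reader assumes every record starts with a '>' line.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each description line (without the '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()  # start of a new record
            records[header] = ""
        elif header is not None:
            records[header] += line  # join the wrapped sequence lines
    return records


def wrap_fasta(header: str, sequence: str, width: int = 80) -> str:
    """Format one record, wrapping the sequence at `width` characters."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + chunks)


# Hypothetical records, not taken from any database.
fasta_example = """>seq1 hypothetical example record
ACGTACGTAC
GTACGT
>seq2 another made-up record
MKTAYIAK
"""
fasta_records = parse_fasta(fasta_example.splitlines())
```

Joining the wrapped lines back into one string is what lets downstream programs ignore the line width chosen by whoever wrote the file.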
PIR Format
•A sequence in PIR format consists of:
–One line starting with
•a ">" (greater-than) sign, followed
by
•a two-letter code describing the
sequence type (P1, F1, DL, DC, RL,
RC, or XX), followed by
•a semicolon, followed by
•the sequence identification
–One line containing a textual
description of the sequence.
–One or more lines containing the
sequence itself. The end of the
sequence is marked by a "*"
(asterisk) character.
–Optionally, this can be followed by
one or more lines describing the
sequence. Software that is
supposed to read only the sequence
should ignore these.
•A file in PIR format may comprise
more than one sequence.
•The PIR format is also often referred
to as the NBRF format.
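The PIR entry structure listed above (header line, description line, sequence ending in "*", optional trailing text) can be sketched as a small reader. This is an illustrative sketch only: `parse_pir` and both example entries are invented, and the code assumes every sequence is terminated by an asterisk as the format requires.

```python
from typing import Dict, List


def parse_pir(text: str) -> List[Dict[str, str]]:
    """Return one dict per entry with type code, ID, description and sequence."""
    entries: List[Dict[str, str]] = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        # Header: '>' + two-letter type code + ';' + sequence identification.
        if line.startswith(">") and len(line) > 3 and line[3] == ";":
            entry = {
                "type": line[1:3],
                "id": line[4:].strip(),
                "description": lines[i + 1].strip() if i + 1 < len(lines) else "",
                "sequence": "",
            }
            i += 2
            parts: List[str] = []
            while i < len(lines):
                chunk = lines[i].strip()
                i += 1
                if "*" in chunk:
                    # The asterisk marks the end of the sequence.
                    parts.append(chunk.split("*", 1)[0])
                    break
                parts.append(chunk)
            # Drop any spaces used to group residues.
            entry["sequence"] = "".join("".join(parts).split())
            entries.append(entry)
        else:
            i += 1  # optional text after '*' is ignored
    return entries


# Made-up entries for illustration.
pir_example = """>P1;EXAMPLE1
made-up protein entry
MKTAYIAKQR QISFVKSHFS
RQLEERLGLI*
free text that a sequence-only reader should skip
>DL;EXAMPLE2
made-up nucleotide entry
ACGTACGT*
"""
pir_entries = parse_pir(pir_example)
```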
ALN/ClustalW
• The first line in the file must start with
the words "CLUSTAL W" or "CLUSTALW".
Other information in the first line is ignored.
• One or more empty lines.
• One or more blocks of sequence data. Each
block consists of:
– One line for each sequence in the alignment.
Each line consists of:
•the sequence name
•white space
•up to 60 sequence symbols.
•optional - white space followed by a cumulative
count of residues for the sequences
– A line showing the degree of
conservation for the columns of the
alignment in this block.
– One or more empty lines
•Some rules about representing
sequences:
•Case doesn't matter.
•Sequence symbols should be from a
valid alphabet.
•Gaps are represented using hyphens
("-").
•The characters used to represent the
degree of conservation are:
* - all residues or nucleotides in that
column are identical
: - conserved substitutions have been
observed
. - semi-conserved substitutions have
been observed
(space) - no match.
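The block structure and the "*" conservation mark can be illustrated with a short reader. This is a sketch under stated assumptions: `parse_clustal`, `mark_identical` and the two-sequence alignment are invented for illustration, conservation lines are assumed to begin with white space, and only the "*" symbol is recomputed because ":" and "." depend on substitution groups that are not described here.

```python
from typing import Dict


def parse_clustal(text: str) -> Dict[str, str]:
    """Concatenate the per-block rows of each sequence into one aligned string."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("first line must start with CLUSTAL")
    alignment: Dict[str, str] = {}
    for line in lines[1:]:
        # Skip empty lines and conservation lines (which start with white space).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[2], if present, is the optional cumulative residue count.
        name, symbols = fields[0], fields[1]
        alignment[name] = alignment.get(name, "") + symbols
    return alignment


def mark_identical(alignment: Dict[str, str]) -> str:
    """Return '*' for columns where every sequence has the same symbol."""
    rows = [seq.upper() for seq in alignment.values()]  # case doesn't matter
    return "".join(
        "*" if len(set(col)) == 1 and col[0] != "-" else " "
        for col in zip(*rows)
    )


# Made-up alignment for illustration.
clustal_example = """CLUSTAL W multiple sequence alignment (made-up example)

seqA            ACGT-ACGTA 9
seqB            ACGTTACGAA 10
                **** *** *

seqA            GGC 12
seqB            GGC 13
                ***
"""
clustal_alignment = parse_clustal(clustal_example)
clustal_marks = mark_identical(clustal_alignment)
```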
GCG/MSF
•MSF-formatted multiple sequence files
are most often created when using
programs of the GCG suite.
•MSF files include the sequence name
and the sequence itself, which is
usually aligned with other sequences
in the file.
•You can specify a single sequence or
many sequences within an MSF file.
•Some of the hallmarks of an MSF file
are the same as those of a
single-sequence GCG format file:
•It begins with the line (all uppercase)
!!NA_MULTIPLE_ALIGNMENT 1.0
for nucleic acid sequences or
!!AA_MULTIPLE_ALIGNMENT 1.0
for amino acid sequences.
•Do not edit or delete the file type line if
it is present.
•A description line which contains
informative text describing what is in
the file. You can add this information
to the top of the MSF file using a text
editor.
•A dividing line which contains the
number of bases or residues in the
sequence, when the file was created,
and importantly, two dots (..) which
act as a divider between the
descriptive information and the
sequence information that follows.
•MSF files contain some other
information as well:
•Name/Weight: The name of each
sequence included in the alignment, as
well as its length and checksum (both
non-editable) and weight (editable).
•Separating line: Must include two
slashes (//) to divide the name/weight
information from the sequence
alignment.
•Multiple sequence alignment: Each
sequence named in the above
Name/Weight lines is included. The
alignment allows you to view the
relationship among sequences.
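The MSF parts described above (file type line, description, dividing line with "..", Name/Weight lines, "//" separator, alignment) can be sketched as a simple reader. Everything in the example file is an assumption made for illustration: the exact field layout of the dividing and Name lines, the file name, and the checksum values of 0, which are placeholders rather than real GCG checksums. `parse_msf` is a hypothetical helper, not part of the GCG suite.

```python
from typing import Dict, List, Tuple


def parse_msf(text: str) -> Tuple[List[Dict[str, str]], Dict[str, str]]:
    """Return the Name/Weight entries and the aligned sequence for each name."""
    entries: List[Dict[str, str]] = []
    alignment: Dict[str, str] = {}
    in_alignment = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "//":
            # The separating line: everything after it is alignment data.
            in_alignment = True
        elif not in_alignment and fields[0] == "Name:" and len(fields) > 1:
            # Assumed layout: Name: <name> Len: <n> Check: <n> Weight: <w>
            entry = {"Name": fields[1]}
            for key, value in zip(fields[2::2], fields[3::2]):
                entry[key.rstrip(":")] = value
            entries.append(entry)
            alignment[fields[1]] = ""
        elif in_alignment and fields[0] in alignment:
            # Rows may be split into space-separated groups; rejoin them.
            alignment[fields[0]] += "".join(fields[1:])
    return entries, alignment


# Made-up file; checksums are placeholders, not real values.
msf_example = """!!NA_MULTIPLE_ALIGNMENT 1.0
Made-up two-sequence alignment for illustration

 example.msf  MSF: 12  Type: N  Check: 0  ..

 Name: seqA  Len: 12  Check: 0  Weight: 1.00
 Name: seqB  Len: 12  Check: 0  Weight: 1.00

//

seqA  ACGTACGTAC GT
seqB  ACGTTCGTAC GA
"""
msf_entries, msf_alignment = parse_msf(msf_example)
```

Only rows whose first field matches a declared Name are read, so any other lines after the "//" separator are skipped.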
THANK YOU
