3. Early Data Formats
• These early databases stored sequence data in a file. The file held the sequence in ASCII (plain) text and had a descriptive filename.
• This method became limiting when
researchers wanted to include
annotations and information about the
source of the sequence.
• Difficulty in searching for sequences
was also an issue.
4. Flat File Storage Data Formats
• By the time GenBank, EMBL and DDBJ formed a collaboration (1986), sequence databases had moved to a defined flat file format with a shared feature table format and annotation standards.
• The PIR also adopted a similar format for protein sequences.
5. • The flat file formats from the sequence databases are still used to access and display sequences and their annotations. They are also convenient for storing local copies.
10. FASTA Format
• Bioinformaticians have developed a standard format for nucleotide and protein sequences that allows them to be read by a wide range of programs. This format is called FASTA format.
• In FASTA format, each nucleotide or amino acid is represented by a single letter.
11. • The first line of a FASTA record is the comment line, identified by a leading greater-than symbol ‘>’. This line identifies the sequence and includes the accession number from NCBI, GenBank or another repository.
• The remaining lines contain the sequence, usually wrapped at 80 or 120 characters per line.
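
The layout described above (a ‘>’ comment line followed by wrapped sequence lines) can be read with a few lines of Python. This is a minimal sketch, not a reference parser: the function name `parse_fasta` and the two sample records are made up for illustration, and the code assumes every record starts with a ‘>’ line.

```python
def parse_fasta(text):
    """Return a list of (comment_line, sequence) tuples from FASTA text."""
    records = []
    header = None   # comment line of the record being read
    chunks = []     # wrapped sequence lines of that record
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header = line[1:].strip()     # drop the '>' marker
            chunks = []
        elif header is not None:
            chunks.append(line)           # rejoin the wrapped sequence
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, invented for this example.
example = """>seq1 hypothetical example record
ATGCGTACGT
TTGACA
>seq2 another made-up record
MKVLAT
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

Because the sequence lines are simply concatenated, the line width used when the file was written (80, 120 or anything else) does not affect the result.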
13. PIR FORMAT
• A sequence in PIR format consists of:
– One line starting with
• a ">" (greater-than) sign, followed by
• a two-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX), followed by
• a semicolon, followed by
• the sequence identification.
14. – One line containing a textual description of the sequence.
– One or more lines containing the sequence itself. The end of the sequence is marked by a "*" (asterisk) character.
– Optionally, this can be followed by one or more lines describing the sequence. Software that is supposed to read only the sequence should ignore these.
15. • A file in PIR format may comprise more than one sequence.
• The PIR format is also often referred to as the NBRF format.
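
The PIR record structure listed above can be sketched as a small Python reader. The function name `parse_pir`, the dictionary keys and the sample records are assumptions made for this illustration. The sketch also assumes that trailing descriptive lines never begin with ">", so they can be skipped safely.

```python
def parse_pir(text):
    """Return a list of dicts, one per PIR (NBRF) record in the text."""
    lines = text.splitlines()
    records = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        # A record starts with '>', a two-letter type code, ';' and an ID.
        if not line.startswith(">") or ";" not in line:
            i += 1          # trailing descriptive lines are ignored here
            continue
        seq_type, identifier = line[1:].split(";", 1)
        description = lines[i + 1].strip() if i + 1 < len(lines) else ""
        i += 2
        chunks = []
        while i < len(lines):
            part = lines[i].strip()
            if part.startswith(">"):
                break       # terminator missing; the next record begins
            i += 1
            if "*" in part:
                chunks.append(part.split("*", 1)[0])   # '*' ends the sequence
                break
            chunks.append(part)
        records.append({
            "type": seq_type,
            "id": identifier.strip(),
            "description": description,
            # remove any spaces used to group residues
            "sequence": "".join("".join(chunks).split()),
        })
    return records


# Hypothetical records, invented for this example.
example = """>P1;EXAMPLE1
hypothetical protein - made-up organism
MKVLAT GHIKLM
NPQRST*
C;This trailing line describes the sequence and is ignored.
>DL;EXAMPLE2
hypothetical linear DNA fragment
ATGCGTACGT*
"""

for record in parse_pir(example):
    print(record["type"], record["id"], record["sequence"])
```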
17. ALN/ClustalW
• The first line in the file must start with the words "CLUSTAL W" or "CLUSTALW". Other information in the first line is ignored.
• One or more empty lines.
• One or more blocks of sequence data. Each block consists of:
– One line for each sequence in the alignment. Each line consists of:
• the sequence name
• white space
• up to 60 sequence symbols
• optionally, white space followed by a cumulative count of residues for the sequence
18. – A line showing the degree of conservation for the columns of the alignment in this block.
– One or more empty lines.
• Some rules about representing sequences:
• Case doesn't matter.
• Sequence symbols should be from a valid alphabet.
• Gaps are represented using hyphens ("-").
19. • The characters used to represent the degree of conservation are:
* - all residues or nucleotides in that column are identical
: - conserved substitutions have been observed
. - semi-conserved substitutions have been observed
(space) - no match
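
The block structure above can be turned into a short reader that joins each sequence across blocks. This is a sketch under stated assumptions: the name `parse_clustal` and the sample alignment are invented for illustration, the header check accepts any first line beginning with "CLUSTAL", and conservation lines are recognised by their leading white space (they carry no sequence name).

```python
def parse_clustal(text):
    """Return {sequence_name: aligned_sequence} from ALN/ClustalW text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("not a CLUSTAL alignment: missing header line")
    sequences = {}
    for line in lines[1:]:
        # Empty lines separate blocks; conservation lines start with white
        # space because the name column is blank (assumption of this sketch).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[2], when present, is the optional cumulative residue count.
        name, symbols = fields[0], fields[1]
        sequences[name] = sequences.get(name, "") + symbols
    return sequences


# Hypothetical alignment, invented for this example.
example = """CLUSTALW (hypothetical example)

seq1      ATGC-TAG 7
seq2      ATGCCTAG 8
          **** ***

seq1      GGA 10
seq2      GCA 11
          * *
"""

for name, aligned in parse_clustal(example).items():
    print(name, aligned)
```

Every aligned sequence returned by the function should have the same length; a mismatch usually indicates a damaged block.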
21. GCG/MSF
• MSF-formatted multiple sequence files are most often created when using programs of the GCG suite.
• MSF files include the sequence name and the sequence itself, which is usually aligned with other sequences in the file.
• You can specify a single sequence or many sequences within an MSF file.
22. • Some of the hallmarks of an MSF-formatted file are the same as those of a single-sequence GCG format file:
• It begins with the line (all uppercase) !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences.
• Do not edit or delete the file type if it is present.
23. • A description line which contains informative text describing what is in the file. You can add this information to the top of the MSF file using a text editor.
• A dividing line which contains the number of bases or residues in the sequence, when the file was created, and importantly, two dots (..) which act as a divider between the descriptive information and the sequence information that follows.
24. • MSF files contain some other information as well:
• Name/Weight: The name of each sequence included in the alignment, as well as its length and checksum (both non-editable) and weight (editable).
• Separating Line: Must include two slashes (//) to divide the name/weight information from the sequence alignment.
25. • Multiple Sequence Alignment: Each sequence named in the above Name/Weight lines is included. The alignment allows you to view the relationship among sequences.
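
The MSF layout described above (Name/Weight lines, a "//" separating line, then the aligned sequences) can be read with the following Python sketch. The name `parse_msf` and the sample file are hypothetical; the checksum values in the sample are placeholders rather than real GCG checksums, and the "." in the second sequence stands in for a gap, which is an assumption about the gap symbol.

```python
def parse_msf(text):
    """Return {sequence_name: aligned_sequence} from GCG/MSF text."""
    sequences = {}
    in_alignment = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_alignment:
            if stripped.startswith("//"):
                in_alignment = True            # separating line reached
            elif stripped.startswith("Name:"):
                name = stripped.split()[1]     # token after "Name:"
                sequences[name] = ""
            continue
        fields = stripped.split()
        # Alignment rows begin with a name declared in a Name/Weight line;
        # anything else (blank lines, position rulers) is skipped.
        if fields and fields[0] in sequences:
            sequences[fields[0]] += "".join(fields[1:])
    return sequences


# Hypothetical file, invented for this example; Check values are placeholders.
example = """!!AA_MULTIPLE_ALIGNMENT 1.0
Hypothetical example alignment

 example.msf  MSF: 12  Type: P  Check: 0  ..

 Name: seq1  Len: 12  Check: 0  Weight: 1.00
 Name: seq2  Len: 12  Check: 0  Weight: 1.00

//

seq1  MKVLA TGHIK
seq2  MKV.A TGHIR

seq1  LM
seq2  LM
"""

for name, aligned in parse_msf(example).items():
    print(name, aligned)
```

The sketch reads only the names and the alignment; it does not verify the length or checksum fields, which a stricter reader would compare against the parsed sequences.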