The document discusses bioinformatics tools used for analyzing biological data. It begins with an introduction to bioinformatics and then describes several categories of tools: biological databases for storing genomic and protein data; homology tools for sequence alignment and comparison; protein function analysis tools; structural analysis tools; and sequence manipulation and analysis tools. Common tools discussed include BLAST, FASTA, ClustalW, and databases like GenBank. The document concludes by covering applications of bioinformatics in areas like molecular modeling, medicine, and computation.
1. BIOINFORMATIC TOOLS
By
KAUSHAL KUMAR SAHU
Assistant Professor (Ad Hoc)
Department of Biotechnology
Govt. Digvijay Autonomous P. G. College
Raj-Nandgaon ( C. G. )
2. INTRODUCTION
DEFINITION OF BIOINFORMATICS
HISTORY
OBJECTIVE OF BIOINFORMATICS
TOOLS OF BIOINFORMATICS
PROCEDURE AND TOOLS OF BIOINFORMATICS
o BIOLOGICAL DATABASES
o HOMOLOGY AND SIMILARITY TOOLS (SEQUENCE
ALIGNMENT)
o PROTEIN FUNCTION ANALYSIS TOOLS
o STRUCTURAL ANALYSIS TOOLS
o SEQUENCE MANIPULATION TOOLS
o SEQUENCE ANALYSIS TOOLS
APPLICATION
CONCLUSION
REFERENCES
3. Bioinformatics is a recently emerged scientific discipline for the
computational analysis and storage of biological data. The
word bioinformatics is derived from two words:
Bio, meaning biology
Informatics (from the French informatique), meaning ‘data processing’.
Bioinformatics simplifies the work of biologists in handling and
analysing vast amounts of data.
Several computational methods are used for this purpose, and their
applications include agriculture, medicine and pharmaceuticals, computer
databases, and algorithms for research in life science.
4. Bioinformatics can be defined as the storage, analysis, and
searching of data (e.g. nucleic acid sequences of genes and
RNAs, amino acid sequences and structural information of
proteins).
The Institut Pasteur, Paris (France) defined bioinformatics
more precisely as the mathematical, statistical and computing
methods that aim to solve biological problems using DNA and
amino acid sequences, and related information,
i.e. converting “data” to “information”.
5. 1977 – Φ-X174 Phage Genome sequenced
1990 – Paper published in the Journal of Molecular Biology
describes sequence alignment search algorithm
1990s – Software used to find fragment overlap for the Human
Genome Project
1992 – NCBI takes over GenBank DNA sequence database in
response to the growing number of gene patents
1994 – “Entrez” Global Query Cross-Database Search System
allows users to search GenBank database
1996 – NCBI-BLAST created to provide powerful searches against
the GenBank database
6. To introduce the bioinformatics discipline
To introduce the major tools used for sequence and structure
analysis and explain in general how they work
7. Homology and Comparative Modeling
Protein or gene homology refers to nucleotide or amino acid
sequences or domains shared between different genes or proteins, regardless of
whether they come from the same or different organisms
Gene or Protein Identification
Searching databases for nucleotide or amino acid sequences that
match sequences in unknown samples
8. These are software programs designed to extract
meaningful information from the mass of molecular
biology/biological databases and to carry out sequence and
structural analysis.
Once the databases had been established, tools became available to
search sequence databases.
Bioinformatics tools can be grouped into the following
categories:
a) Biological databases
b) Homology and similarity tools (Sequence alignment tool)
c) Protein function analysis tools
d) Structural analysis tools
e) Sequence manipulation tools
f) Sequence analysis tools
9. Biological databases usually contain genomic, proteomic
and metabolic data. The data include nucleotide sequences of
genes or amino acid sequences of proteins.
Some of the major types of biological database are:
a) Major Nucleotide Sequence Databases.
b) Major Mutation Databases.
c) Major Gene Expression Databases.
d) Major Microbial Genomic Databases.
e) Major Organism-Specific Genome Databases.
f) Major Protein Databases.
EMBL (European Molecular Biology Laboratory nucleotide sequence database at EBI,
Hinxton, UK)
NDB (Nucleic Acid structure Database at Rutgers University, USA)
Entrez/Genome (NCBI, USA)
10. Homologous sequences are sequences that are related by divergence
from a common ancestor. The degree of similarity between two
sequences can be measured and used as evidence of homology.
This set of tools can be used to identify similarities between novel
query sequences of unknown structure and function and database
sequences whose structure and functions have been elucidated.
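One common way to put a number on the similarity between two sequences is a global alignment score. The sketch below uses the Needleman-Wunsch dynamic programming approach; the function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not parameters taken from any particular tool.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the Needleman-Wunsch global alignment score of a and b.

    The scoring scheme is an assumption chosen for illustration.
    """
    # prev[j]: best score for aligning the letters of `a` seen so far
    # with the first j letters of `b` (row 0 is all gaps).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

A higher score means the two sequences can be lined up with more matches and fewer gaps; real tools use substitution matrices and more refined gap penalties.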
11. o BLAST (Basic Local Alignment Search Tool) is a program for sequence
similarity searching developed at the NCBI.
o It helps identify genes and genetic features.
o A BLAST search enables a researcher to compare a query
sequence with a database of sequences and identify database
sequences that resemble the query sequence.
12. Nucleotide-nucleotide BLAST (BLASTN):
Basic nucleotide sequence searches
The BLAST that we used for our sequences
Protein-protein BLAST (BLASTP):
Similar technology used to search amino acid sequences
Position-Specific Iterated BLAST (PSI-BLAST):
A more advanced protein BLAST useful for analyzing relationships
between divergently evolved proteins
13. BLASTX and TBLASTN variants:
Use translation of the nucleotide query and of the nucleotide database,
respectively, so that the comparison is made at the protein level
MegaBLAST:
Used to BLAST several sequences at once to cut down on
processing load and server reporting time
blastp compares an amino acid query sequence against a protein
sequence database
blastn compares a nucleotide query sequence against a nucleotide
sequence database
blastx compares a nucleotide query sequence translated in all
reading frames against a protein sequence database
tblastn compares a protein query sequence against a nucleotide
sequence database dynamically translated in all reading frames
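The phrase "translated in all reading frames" can be made concrete with a short sketch. It translates a DNA sequence in the three forward frames and the three frames of the reverse complement, using the standard genetic code. The function names and the frame labels ("+1" to "-3") are illustrative assumptions; this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Map a frame label such as '+1' or '-3' to its protein translation."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six translations can then be compared against protein sequences, which is the idea behind searching a protein database with a nucleotide query.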
14. Query Coverage
The percentage of the query sequence matched by the database entry
Max Ident
The maximum percent identity, i.e. the percentage of aligned positions at
which the query and the database sequence are identical (deletions or
insertions reduce this value)
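As a rough illustration of how these two numbers relate to an alignment, the sketch below computes them for a single gapped pairwise alignment. The function name and the input convention (two equal-length strings with "-" for gaps) are assumptions; BLAST itself combines several aligned segments when reporting coverage.

```python
def alignment_stats(aligned_query, aligned_subject, query_length):
    """Return (percent_identity, query_coverage) for one gapped alignment.

    Both aligned strings must have the same length and use '-' for gaps.
    query_length is the length of the full, unaligned query.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query or query_length <= 0:
        raise ValueError("alignment and query must be non-empty")

    # Identical columns; a gap never counts as a match.
    identical = sum(
        q == s and q != "-" for q, s in zip(aligned_query, aligned_subject)
    )
    # Query residues that take part in the alignment.
    aligned_residues = sum(q != "-" for q in aligned_query)

    percent_identity = 100.0 * identical / len(aligned_query)
    query_coverage = 100.0 * aligned_residues / query_length
    return percent_identity, query_coverage
```

Because gap columns count toward the alignment length, every insertion or deletion lowers the percent identity.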
17. FASTA is a DNA and protein sequence alignment software
package.
It is used for a fast protein or fast nucleotide comparison.
This program achieves a high level of sensitivity for similarity
searching at high speed.
18. EMBOSS:
EMBOSS (European Molecular Biology Open Software Suite) is a
software-analysis package. It can work with data in a range of formats
and also retrieve sequence data transparently from the Web. Extensive
libraries are also provided with this package, allowing other scientists
to release their software as open source. It provides a set of sequence-
analysis programs, and also supports all UNIX platforms.
ClustalW:
It is a fully automated multiple sequence alignment tool for DNA and protein
sequences. It returns the best alignment over the total length of the input
sequences, be they protein or nucleic acid.
RasMol:
It is a powerful research tool to display the structure of DNA,
proteins, and smaller molecules. Protein Explorer, a derivative of
RasMol, is an easier to use program.
PROSPECT:
PROSPECT (PROtein Structure Prediction and Evaluation Computer
ToolKit) is a protein-structure prediction system that employs a
computational technique called protein threading to construct a
protein's 3-D model
19. DNA Sequencing
Sequence Formats
Sequence Homology Software Tools
Aligning Tools
Annotated Information
Protein Folding
20. Sanger Method
New nucleotide chains of DNA being synthesized by
DNA polymerase are terminated when dideoxy
nucleotides (added to the reaction mixture in a ~1/100
ratio) are incorporated into the chain
21. Fluorescent dyes are bound to the ddNTPs,
allowing each terminated molecule to be detected when it is
excited by a laser
Terminated DNA chains are run on a gel, and
fragments are resolved by size
By combining the fluorescence readings from the
chains of each size, the DNA sequence is
computed
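The final step, reading the sequence from size-sorted fragments, can be sketched as follows. This is a deliberately idealized model: it assumes exactly one clean reading per fragment length, and the function name and the (length, base) input format are illustrative assumptions.

```python
def read_sanger_sequence(fragments):
    """Reconstruct a sequence from (length, terminal_base) readings.

    Each reading says that a chain of the given length ended in the
    given dye-labelled base. Sorting by length orders the bases as
    they appear along the newly synthesized strand.
    """
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    return "".join(base for _, base in ordered)
```

For example, readings of length 1 ending in A, length 2 ending in T and length 3 ending in G spell out ATG. Real base calling also has to deal with noisy peaks and missing or ambiguous signals.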
23. First Things First – Sequence File Formats:
Most common for nucleotides: FASTA / Multi-FASTA
“>” followed by any unicode text, entire line read as
sequence title
>E. coli Globin-coupled chemotaxis sensory transducer (TM domain)
ATGGACCTGATCACAAATGCGATTTAGAGACCTG
ATCACAAATGCGATGACCTGATCACAAATGCGAT
GACCTGATCACAAATGCGATGTAAACCTGATCAC
AAATGCGATGACCTGATCACAAATGCGATCTAAA
CCTGATCACAAATGCGATGACCTGATCACAAATG
CGATTAA
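A FASTA or Multi-FASTA file like the one above can be read with a few lines of code. The sketch below is a minimal parser written for illustration (the function name is an assumption); it treats each ">" line as a title and joins the following lines into one sequence.

```python
def parse_fasta(text):
    """Parse FASTA/Multi-FASTA text into a list of (title, sequence)."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            chunks.append(line)  # sequence may span several lines
    if title is not None:
        records.append((title, "".join(chunks)))
    return records
```

Note that the title must fit on a single line: any line that does not begin with ">" is read as sequence data.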
24. Clustal (free)
ClustalX – Software
ClustalW – Web
Functionality is similar, but difference is in interface, tools,
and speed of algorithms
http://www.ebi.ac.uk/clustalw/
25. Lowest energy state folding
Distributed computing is used for mid-sized proteins
Folding@Home
Human Proteome Folding Project
Rosetta@Home
Predictor@Home
26. This group of programs allows protein sequences to be compared
with the secondary protein databases that contain information on
motifs, signatures and protein domains.
InterProScan
Searches protein sequences against InterPro signatures.
PPSearch
Searches for protein motifs.
Radar
Detects protein repeats.
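Motif searching of this kind can be illustrated with a regular expression. The sketch below scans a protein sequence for the N-glycosylation pattern N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline). The function name is an assumption, and the tools listed above use their own pattern databases and scoring rather than this code.

```python
import re

# N-{P}-[ST]-{P} written as a regular expression. The lookahead makes
# overlapping occurrences visible to finditer.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return (1-based position, matched residues) for every hit."""
    return [
        (match.start() + 1, match.group(1))
        for match in pattern.finditer(protein.upper())
    ]
```

A hit only shows that the pattern occurs in the sequence; whether the site is functional has to be established separately.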
27. 3-dimensional structures of proteins, nucleic acids, molecular
complexes etc.
3-D data is available thanks to techniques such as NMR and
X-ray crystallography
COPIA (Consensus Pattern Identification and Analysis)
It is a protein structure analysis tool for discovering
motifs in a family of protein sequences. Such motifs can then
be used to determine whether new protein sequences belong
to the family, and to predict the secondary and tertiary structures
and functions of proteins.
28. These are software programs for analyzing and formatting
DNA and protein sequences.
RepeatMasker
It is a program that screens the DNA for interspersed
repeats.
Webcut
It is an online tool for restriction analysis, silent mutation
analysis, and SNP analysis.
Translate
It is a tool which allows the translation of a nucleotide
sequence to a protein sequence.
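As a small example of sequence manipulation, the restriction analysis mentioned for Webcut comes down to locating recognition sites in a DNA sequence. The sketch below finds every occurrence of a site, using the EcoRI recognition sequence GAATTC as the default. The function name is an assumption, and this is not Webcut's own implementation.

```python
def find_sites(dna, site="GAATTC"):
    """Return the 1-based start positions of every occurrence of site."""
    dna, site = dna.upper(), site.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start + 1)
        # Resume one base later so overlapping sites are also found.
        start = dna.find(site, start + 1)
    return positions
```

GAATTC is palindromic, so scanning one strand is enough; for a non-palindromic site the reverse complement would need to be searched as well.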
29. This set of tools allows further, more detailed
analysis of a query sequence, including evolutionary analysis and
identification of mutations.
Align
This tool is used to compare two sequences.
DNA Scanner
It is a tool that scans DNA for a number of different
properties, such as biophysical properties and potential for protein
interaction.
30. Data such as experimental microarray images-
gene expression data
Proteomic data- protein expression data
Metabolic pathways, protein-protein interaction
data, regulatory networks
31. Each database contains specific information
Like biological systems themselves, these databases
are interrelated
33. Some of the applications related to biological
information analysis are:
Bioinformatics is used in primer design.
Bioinformatics is used to attempt to predict the function of
actual gene products.
Molecular modeling/structural biology is a growing field
which can be considered part of bioinformatics.
There are other fields- for example, medical imaging/ image
analysis, that might be considered part of bioinformatics.
There is also a whole other discipline of biologically inspired
computation: genetic algorithms, etc.
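As a small illustration of the primer design application listed above, the sketch below estimates two properties commonly checked for a candidate primer: GC content and an approximate melting temperature from the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius. The function names are assumptions, and the rule is only a rough guide for short oligonucleotides; primer design software uses more detailed thermodynamic models.

```python
def gc_content(primer):
    """Return the percentage of G and C bases in the primer."""
    primer = primer.upper()
    if not primer:
        raise ValueError("primer must not be empty")
    gc = primer.count("G") + primer.count("C")
    return 100.0 * gc / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature (deg C) by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc
```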
34. Bioinformatics is building on the recognition of the importance
of information transmission, accumulation and processing in
biological systems.
Software tools for bioinformatics range from simple
command-line tools, to more complex graphical programs and
standalone web-services available from various bioinformatics
companies or public institutions.
35. S. C. Rastogi – Bioinformatics: Concepts, Skills and
Applications (2003)
C. S. V. Murthy – Bioinformatics, First Edition (2003)
David W. Mount – Bioinformatics: Sequence and Genome Analysis,
Second Edition
Bioinformatics - Tools, softwares & Programmes (web page)
Bioinformatics - Wikipedia, the free encyclopedia (web page)