Unit-IV; Professional Sales Representative (PSR).pptx
Biological databases
1. INTRODUCTION TO
BIOLOGICAL DATABASES
SARFARAZ HUSSAIN
Department of Bioinformatics
& Biotechnology GCU-
Faisalabad
NOTE: Most slides derived from NCBI’s field guide
/ sarfraz1412@gmail.com
2. WHAT YOU NEED TO LEARN:
What is a database and what are the features of
an ideal db?
What are the relationships/differences between
primary and derived sequence databases?
What are the benefits of RefSeq?
Why is data integration useful?
3. THE NATIONAL CENTER FOR
BIOTECHNOLOGY INFORMATION
Created in 1988 as a part of the
National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Bethesda,MD
5. WHAT ARE DATABASES?
Structured collection of information.
Consists of basic units called records or entries.
Each record consists of fields, which hold pre-
defined data related to the record.
For example, a protein database would have
protein entries as records and protein properties
as fields (e.g., name of protein, length, amino-
acid sequence)
6. THE ‘PERFECT’ DATABASE
Comprehensive, but easy to search.
Annotated, but not “too annotated”.
A simple, easy to understand structure.
Cross-referenced.
Minimum redundancy.
Easy retrieval of data.
7. THE CENTRAL DOGMA & BIOLOGICAL
DATA
Protein structures
-Experiments
-Models (homologues)
Literature information
Original DNA Sequences
(Genomes)
Protein Sequences
-Inferred
-Direct sequencing
Expressed DNA sequences
( = mRNA Sequences
= cDNA sequences)
Expressed Sequence Tags
(ESTs)
8. NCBI DATABASES AND SERVICES
GenBank primary sequence database
Free public access to biomedical literature
PubMed free Medline (3 million searches per day)
PubMed Central full text online access
Entrez integrated molecular and literature databases
9. TYPES OF MOLECULAR DATABASES
Primary Databases
Original submissions by experimentalists
Content controlled by the submitter
Examples: GenBank, Trace, SRA, SNP, GEO
Derivative Databases
Derived from primary data
Content controlled by third party (NCBI)
Examples: NCBI Protein, Refseq, TPA, RefSNP, GEO datasets,
UniGene, Homologene, Structure, Conserved Domain
10. PubMed is a free search engine accessing
primarily the MEDLINE database of references
and abstracts on life sciences and biomedical
topics. The United States National Library of
Medicine (NLM) at the
National Institutes of Health maintains the
database as part of the Entrez system of
information retrieval.
11. Pubmed: click on the drop down menu select the
pubmed option. Type any topic which you want to
find in the search box.
12. After typing the topic of our interest lots of
research papers will appear on window from
where we select the specific papers for our study.
13. PRIMARY VS. DERIVATIVE SEQUENCE
DATABASES
GenBankGenBank
SequencingSequencing
CentersCenters
GA
GAGA
ATT
ATT
C
CGAGA
ATT
ATT
C
C
AT
GAGA
ATT
C
C GAGA
ATT
C
C
TTGACA
ATTGACTA
ACGTGC
TTGACA
CGTGA
ATTGACTA
TATAGCCG
ACGTGC
ACGTGC
ACGTGC
TTGACA
TTGACA
CGTGA
CGTGA
CGTGA
ATTGACTA
ATTGACTA
ATTGACTA
ATTGACTA
TATAGCCG
TATAGCCGTATAGCCG
TATAGCCGTATAGCCG TATAGCCGTATAGCCG TATAGCCG
CATT
GAGA
ATT
C
C GAGA
ATT
C
C LabsLabs
AlgorithmsAlgorithms
UniGene
CuratorsCurators
RefSeq
Genome
Assembly
TATAGCCG
AGCTCCGATA
CCGATGACAA
Updated
continually
by NCBI
Updated ONLY
by submitters
14. SEQUENCE DATABASES AT NCBI
Primary
GenBank: NCBI’s primary sequence database
Trace Archive: reads from capillary sequencers
Sequence Read Archive: next generation data
Derivative
GenPept (GenBank translations)
Outside Protein (UniProt—Swiss-Prot, PDB)
NCBI Reference Sequences (RefSeq)
15. GENBANK - PRIMARY SEQUENCE DB
Nucleotide only sequence database
Archival in nature
Historical
Reflective of submitter point of view (subjective)
Redundant
Data
Direct submissions (traditional records)
Batch submissions
FTP accounts (genome data)
16. GENBANK - PRIMARY SEQUENCE DB (2)
Three collaborating databases
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database
17. TRADITIONAL GENBANK RECORD
ACCESSION U07418
VERSION U07418.1 GI:466461
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession
•Stable
•Reportable
•Universal
Accession
•Stable
•Reportable
•Universal
Version
Tracks changes in sequence
Version
Tracks changes in sequence
GI number
NCBI internal use
GI number
NCBI internal use
well annotatedwell annotated
the sequence is the datathe sequence is the data
19. FEATURES Location/Qualifiers
source 1..2484
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/chromosome="3"
/map="3p22-p23"
gene 1..2484
/gene="MLH1"
CDS 22..2292
/gene="MLH1"
/note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
Number P14242), S. cerevisiae MLH1 (GenBank Accession
Number U07187), E. coli MUTL (Swiss-Prot Accession Number
P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
Number P14161) and Streptococcus pneumoniae (Swiss-Prot
Accession Number P14160)"
/codon_start=1
/product="DNA mismatch repair protein homolog"
/protein_id="AAC50285.1"
/db_xref="GI:463989"
/translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
GENPEPT: GENBANK CDS
TRANSLATIONS
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
20. REFSEQ: DERIVATIVE SEQUENCE DATABASE
Curated transcripts and proteins
Model transcripts and proteins
Assembled Genomic Regions
Chromosome records
Human genome
microbial
organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
23. REFSEQS: ANNOTATION REAGENTS
Genomic DNAGenomic DNA
((NCNC,, NT, NWNT, NW))
Model mRNAModel mRNA (XM)(XM)
(XR)(XR)
Curated mRNACurated mRNA (NM)(NM)
(NR)(NR)
Model proteinModel protein (XP)(XP)
Curated ProteinCurated Protein (NP)(NP)
Scanning....
= ?
GenBank
Sequences
RefSeq
24. REFSEQ BENEFITS
Non-redundancy
Updates to reflect current sequence data and biology
Data validation
Format consistency
Distinct accession series
Stewardship by NCBI staff and collaborators
27. ENTREZ: A DISCOVERY SYSTEM
Gene
Taxonomy
PubMed
abstracts
Nucleotide
sequences
Protein
sequences
3-D
Structure
3 -D
Structure
Word weight
VAST
BLASTBLAST
Phylogeny
Hard Link
Neighbors
Related Sequences
Neighbors
Related Sequences
BLink
Domains
Neighbors
Related Structures
Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered
relationships.
•Used less than expected.
Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered
relationships.
•Used less than expected.
28. GLOBAL QUERY: ALL NCBI DATABASES
The Entrez system: 38 (and counting) integrated databasesThe Entrez system: 38 (and counting) integrated databases
29. TRADITIONAL METHOD: THE LINKS MENU
DNA Sequence
Nucleotide – Protein Link
Related Proteins
Protein – Structure Link
3-D Structure
30. THE PROBLEM
Rapidly growing databases with complex and changing
relationships
Rapidly changing interfaces to match the above
Result
Many people don’t know:
Where to begin
Where to click on a Web page
Why it might be useful to click there
38. ‘TAKE HOME MESSAGE’ ADVANTAGES
OF DATA INTEGRATION
More relevant inter-related information in one
place
Makes it easier to find additional relevant
information related to your initial query
Potentially find information indirectly linked, but
relevant to your subject of interest
uncover non-obvious genetic features that explain
phenotype or disease
Easier to build a ‘story’ based on multiple pieces
of biological evidence
Editor's Notes
NCBI homepage. Logo will take you back to home page.
About NCBI provides introduction to the NCBI and contains basic information on genetics and bioinformatics.
Primary databases serve as a repository of experimentalist sequences (GenBank).
Derivative databases are sources of edited/curated sequences (RefSeq…reference sequences, UniGene...genes compared to genetic loci on genomes)
~11,000 sequences are submitted per day.
Goal= nonredundant set of genes/proteins for each organism represented
Model= comes from analysis of genomic content from organism assembly
Reannotation of microbial genomes, for example.
Two letter prefix, underscore, numeric portion.
NT=contig assemblies produced by NCBI
NW=supercontig assembles from WGS