SlideShare a Scribd company logo
1 of 38
INTRODUCTION TO
BIOLOGICAL DATABASES
SARFARAZ HUSSAIN
Department of Bioinformatics
& Biotechnology GCU-
Faisalabad
NOTE: Most slides derived from NCBI’s field guide
/ sarfraz1412@gmail.com
WHAT YOU NEED TO LEARN:
 What is a database and what are the features of
an ideal db?
 What are the relationships/differences between
primary and derived sequence databases?
 What are the benefits of RefSeq?
 Why is data integration useful?
THE NATIONAL CENTER FOR
BIOTECHNOLOGY INFORMATION
Created in 1988 as a part of the
National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Bethesda,MD
WEB ACCESS: WWW.NCBI.NLM.NIH.GOV
New Homepage
Common footerCommon footer
New pages!New pages!
WHAT ARE DATABASES?
 Structured collection of information.
 Consists of basic units called records or entries.
 Each record consists of fields, which hold pre-
defined data related to the record.
 For example, a protein database would have
protein entries as records and protein properties
as fields (e.g., name of protein, length, amino-
acid sequence)
THE ‘PERFECT’ DATABASE
 Comprehensive, but easy to search.
 Annotated, but not “too annotated”.
 A simple, easy to understand structure.
 Cross-referenced.
 Minimum redundancy.
 Easy retrieval of data.
THE CENTRAL DOGMA & BIOLOGICAL
DATA
Protein structures
-Experiments
-Models (homologues)
Literature information
Original DNA Sequences
(Genomes)
Protein Sequences
-Inferred
-Direct sequencing
Expressed DNA sequences
( = mRNA Sequences
= cDNA sequences)
Expressed Sequence Tags
(ESTs)
NCBI DATABASES AND SERVICES
 GenBank primary sequence database
 Free public access to biomedical literature
 PubMed free Medline (3 million searches per day)
 PubMed Central full text online access
 Entrez integrated molecular and literature databases
TYPES OF MOLECULAR DATABASES
 Primary Databases
 Original submissions by experimentalists
 Content controlled by the submitter
 Examples: GenBank, Trace, SRA, SNP, GEO
 Derivative Databases
 Derived from primary data
 Content controlled by third party (NCBI)
 Examples: NCBI Protein, Refseq, TPA, RefSNP, GEO datasets,
UniGene, Homologene, Structure, Conserved Domain
 PubMed is a free search engine accessing
primarily the MEDLINE database of references
and abstracts on life sciences and biomedical
topics. The United States National Library of
Medicine (NLM) at the
National Institutes of Health maintains the
database as part of the Entrez system of
information retrieval.
 Pubmed: click on the drop down menu select the
pubmed option. Type any topic which you want to
find in the search box.
 After typing the topic of our interest lots of
research papers will appear on window from
where we select the specific papers for our study.
PRIMARY VS. DERIVATIVE SEQUENCE
DATABASES
GenBankGenBank
SequencingSequencing
CentersCenters
GA
GAGA
ATT
ATT
C
CGAGA
ATT
ATT
C
C
AT
GAGA
ATT
C
C GAGA
ATT
C
C
TTGACA
ATTGACTA
ACGTGC
TTGACA
CGTGA
ATTGACTA
TATAGCCG
ACGTGC
ACGTGC
ACGTGC
TTGACA
TTGACA
CGTGA
CGTGA
CGTGA
ATTGACTA
ATTGACTA
ATTGACTA
ATTGACTA
TATAGCCG
TATAGCCGTATAGCCG
TATAGCCGTATAGCCG TATAGCCGTATAGCCG TATAGCCG
CATT
GAGA
ATT
C
C GAGA
ATT
C
C LabsLabs
AlgorithmsAlgorithms
UniGene
CuratorsCurators
RefSeq
Genome
Assembly
TATAGCCG
AGCTCCGATA
CCGATGACAA
Updated
continually
by NCBI
Updated ONLY
by submitters
SEQUENCE DATABASES AT NCBI
 Primary
 GenBank: NCBI’s primary sequence database
 Trace Archive: reads from capillary sequencers
 Sequence Read Archive: next generation data
 Derivative
 GenPept (GenBank translations)
 Outside Protein (UniProt—Swiss-Prot, PDB)
 NCBI Reference Sequences (RefSeq)
GENBANK - PRIMARY SEQUENCE DB
 Nucleotide only sequence database
 Archival in nature
 Historical
 Reflective of submitter point of view (subjective)
 Redundant
 Data
 Direct submissions (traditional records)
 Batch submissions
 FTP accounts (genome data)
GENBANK - PRIMARY SEQUENCE DB (2)
 Three collaborating databases
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database
TRADITIONAL GENBANK RECORD
ACCESSION U07418
VERSION U07418.1 GI:466461
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession
•Stable
•Reportable
•Universal
Accession
•Stable
•Reportable
•Universal
Version
Tracks changes in sequence
Version
Tracks changes in sequence
GI number
NCBI internal use
GI number
NCBI internal use
well annotatedwell annotated
the sequence is the datathe sequence is the data
DERIVATIVE SEQUENCE
DATABASES
FEATURES Location/Qualifiers
source 1..2484
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/chromosome="3"
/map="3p22-p23"
gene 1..2484
/gene="MLH1"
CDS 22..2292
/gene="MLH1"
/note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
Number P14242), S. cerevisiae MLH1 (GenBank Accession
Number U07187), E. coli MUTL (Swiss-Prot Accession Number
P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
Number P14161) and Streptococcus pneumoniae (Swiss-Prot
Accession Number P14160)"
/codon_start=1
/product="DNA mismatch repair protein homolog"
/protein_id="AAC50285.1"
/db_xref="GI:463989"
/translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
GENPEPT: GENBANK CDS
TRANSLATIONS
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
REFSEQ: DERIVATIVE SEQUENCE DATABASE
 Curated transcripts and proteins
 Model transcripts and proteins
 Assembled Genomic Regions
 Chromosome records
 Human genome
 microbial
 organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
SELECTED REFSEQ ACCESSION
NUMBERS
Curated mRNA
Curated Protein
Curated non-coding RNA
Predicted mRNA
Predicted Protein
Predicted non-coding RNA
Reference Genomic Sequence
Microbial replicons, organelle genomes,
Alternate assemblies
Contig
WGS Supercontig
GENBANK TO REFSEQ
REFSEQS: ANNOTATION REAGENTS
Genomic DNAGenomic DNA
((NCNC,, NT, NWNT, NW))
Model mRNAModel mRNA (XM)(XM)
(XR)(XR)
Curated mRNACurated mRNA (NM)(NM)
(NR)(NR)
Model proteinModel protein (XP)(XP)
Curated ProteinCurated Protein (NP)(NP)
Scanning....
= ?
GenBank
Sequences
RefSeq
REFSEQ BENEFITS
 Non-redundancy  
 Updates to reflect current sequence data and biology
 Data validation
 Format consistency
 Distinct accession series
 Stewardship by NCBI staff and collaborators
OTHER DERIVATIVE DATABASES
 Expressed Sequences
 dbSNP
 Structure
 Gene
 and more…
ENTREZ
FINDING RELEVANT
INFORMATION IN NCBI
DATABASES
ENTREZ: A DISCOVERY SYSTEM
Gene
Taxonomy
PubMed
abstracts
Nucleotide
sequences
Protein
sequences
3-D
Structure
3 -D
Structure
Word weight
VAST
BLASTBLAST
Phylogeny
Hard Link
Neighbors
Related Sequences
Neighbors
Related Sequences
BLink
Domains
Neighbors
Related Structures
Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered
relationships.
•Used less than expected.
Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered
relationships.
•Used less than expected.
GLOBAL QUERY: ALL NCBI DATABASES
The Entrez system: 38 (and counting) integrated databasesThe Entrez system: 38 (and counting) integrated databases
TRADITIONAL METHOD: THE LINKS MENU
DNA Sequence
Nucleotide – Protein Link
Related Proteins
Protein – Structure Link
3-D Structure
THE PROBLEM
 Rapidly growing databases with complex and changing
relationships
 Rapidly changing interfaces to match the above
Result
 Many people don’t know:
 Where to begin
 Where to click on a Web page
 Why it might be useful to click there
GLOBAL NCBI (ENTREZ) SEARCH
colon cancercolon cancer
GLOBAL ENTREZ SEARCH
RESULTS
ENTREZ TIP: START SEARCHES IN
GENE
Other Entrez DBs
HomoloGene
Entrez
Protein
Gene
UniGene
BLink
Homologene:
Gene Neighbors
PRECISE RESULTS
MLH1[Gene Name] AND Human[Organism]MLH1[Gene Name] AND Human[Organism]
MLH1 GENE RECORD
MLH1:LINKS TO SEQUENCE
GENEVIEW: HUMAN MLH1 VARIATIONS
ATPase domain
‘TAKE HOME MESSAGE’ ADVANTAGES
OF DATA INTEGRATION
 More relevant inter-related information in one
place
 Makes it easier to find additional relevant
information related to your initial query
 Potentially find information indirectly linked, but
relevant to your subject of interest
 uncover non-obvious genetic features that explain
phenotype or disease
 Easier to build a ‘story’ based on multiple pieces
of biological evidence

More Related Content

What's hot

What's hot (20)

Genomic databases
Genomic databasesGenomic databases
Genomic databases
 
Protein data bank
Protein data bankProtein data bank
Protein data bank
 
EMBL- European Molecular Biology Laboratory
EMBL- European Molecular Biology LaboratoryEMBL- European Molecular Biology Laboratory
EMBL- European Molecular Biology Laboratory
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Cath
CathCath
Cath
 
EMBL-EBI
EMBL-EBIEMBL-EBI
EMBL-EBI
 
NCBI
NCBINCBI
NCBI
 
Swiss prot database
Swiss prot databaseSwiss prot database
Swiss prot database
 
Prosite
PrositeProsite
Prosite
 
Gen bank databases
Gen bank databasesGen bank databases
Gen bank databases
 
Protein database
Protein databaseProtein database
Protein database
 
NCBI National Center for Biotechnology Information
NCBI National Center for Biotechnology InformationNCBI National Center for Biotechnology Information
NCBI National Center for Biotechnology Information
 
Protein data bank
Protein data bankProtein data bank
Protein data bank
 
Proteins databases
Proteins databasesProteins databases
Proteins databases
 
Database in bioinformatics
Database in bioinformaticsDatabase in bioinformatics
Database in bioinformatics
 
PIR- Protein Information Resource
PIR- Protein Information ResourcePIR- Protein Information Resource
PIR- Protein Information Resource
 
Protein Structure, Databases and Structural Alignment
Protein Structure, Databases and Structural AlignmentProtein Structure, Databases and Structural Alignment
Protein Structure, Databases and Structural Alignment
 
Ddbj
DdbjDdbj
Ddbj
 
Gen bank
Gen bankGen bank
Gen bank
 
Protein information resource (PIR)
Protein information resource (PIR)Protein information resource (PIR)
Protein information resource (PIR)
 

Viewers also liked

BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES
nadeem akhter
 
molecular file formats in bioinformatics
molecular file formats in bioinformaticsmolecular file formats in bioinformatics
molecular file formats in bioinformatics
nadeem akhter
 
Chemical File Formats for storing chemical data
Chemical File Formats for storing chemical dataChemical File Formats for storing chemical data
Chemical File Formats for storing chemical data
Abhik Seal
 
Computational biology bls 303
Computational biology bls 303Computational biology bls 303
Computational biology bls 303
Bruno Mmassy
 
databases in bioinformatics
databases in bioinformaticsdatabases in bioinformatics
databases in bioinformatics
nadeem akhter
 

Viewers also liked (15)

Biological databases
Biological databasesBiological databases
Biological databases
 
Protein Structure Prediction
Protein Structure PredictionProtein Structure Prediction
Protein Structure Prediction
 
Global and local alignment (bioinformatics)
Global and local alignment (bioinformatics)Global and local alignment (bioinformatics)
Global and local alignment (bioinformatics)
 
Sequence file formats
Sequence file formatsSequence file formats
Sequence file formats
 
BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES
 
Intro to Open Babel
Intro to Open BabelIntro to Open Babel
Intro to Open Babel
 
molecular file formats in bioinformatics
molecular file formats in bioinformaticsmolecular file formats in bioinformatics
molecular file formats in bioinformatics
 
Design your own test automation tool
Design your own test automation toolDesign your own test automation tool
Design your own test automation tool
 
Chemical File Formats for storing chemical data
Chemical File Formats for storing chemical dataChemical File Formats for storing chemical data
Chemical File Formats for storing chemical data
 
Computational biology bls 303
Computational biology bls 303Computational biology bls 303
Computational biology bls 303
 
sequence of file formats in bioinformatics
sequence of file formats in bioinformaticssequence of file formats in bioinformatics
sequence of file formats in bioinformatics
 
Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins
 
databases in bioinformatics
databases in bioinformaticsdatabases in bioinformatics
databases in bioinformatics
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Biological Databases
Biological DatabasesBiological Databases
Biological Databases
 

Similar to Biological databases

Biological databases.pptx
Biological databases.pptxBiological databases.pptx
Biological databases.pptx
PagudalaSangeetha
 

Similar to Biological databases (20)

Databases_L2.pptx
Databases_L2.pptxDatabases_L2.pptx
Databases_L2.pptx
 
Introduction to databases.pptx
Introduction to databases.pptxIntroduction to databases.pptx
Introduction to databases.pptx
 
Nucleic Acid Sequence databases
Nucleic Acid Sequence databasesNucleic Acid Sequence databases
Nucleic Acid Sequence databases
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Introduction to Biological databases
Introduction to Biological databasesIntroduction to Biological databases
Introduction to Biological databases
 
Data Retrieval Systems
Data Retrieval SystemsData Retrieval Systems
Data Retrieval Systems
 
Intro to databases
Intro to databasesIntro to databases
Intro to databases
 
Presentation on Biological database By Elufer Akram @ University Of Science ...
Presentation on Biological database  By Elufer Akram @ University Of Science ...Presentation on Biological database  By Elufer Akram @ University Of Science ...
Presentation on Biological database By Elufer Akram @ University Of Science ...
 
Storing and Accessing Information. Databases and Queries (UEB-UAT Bioinformat...
Storing and Accessing Information. Databases and Queries (UEB-UAT Bioinformat...Storing and Accessing Information. Databases and Queries (UEB-UAT Bioinformat...
Storing and Accessing Information. Databases and Queries (UEB-UAT Bioinformat...
 
Introduction OF BIOLOGICAL DATABASE
Introduction OF BIOLOGICAL DATABASEIntroduction OF BIOLOGICAL DATABASE
Introduction OF BIOLOGICAL DATABASE
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Biological databases.pptx
Biological databases.pptxBiological databases.pptx
Biological databases.pptx
 
Data retriveal ,srg and dbget
Data retriveal ,srg and dbgetData retriveal ,srg and dbget
Data retriveal ,srg and dbget
 
02. Biological sequence databases.pptx
02. Biological sequence databases.pptx02. Biological sequence databases.pptx
02. Biological sequence databases.pptx
 
biological databases.pptx
biological databases.pptxbiological databases.pptx
biological databases.pptx
 
Databases.ppt
Databases.pptDatabases.ppt
Databases.ppt
 
BIOINFORMATICS AND DATABASES IN BIOINFORMATICS.pdf
BIOINFORMATICS  AND  DATABASES IN BIOINFORMATICS.pdfBIOINFORMATICS  AND  DATABASES IN BIOINFORMATICS.pdf
BIOINFORMATICS AND DATABASES IN BIOINFORMATICS.pdf
 
COMPUNATIONAL BIOLOGY AND DATABASES IN BIOINFORMATICS.pptx
COMPUNATIONAL BIOLOGY AND DATABASES IN BIOINFORMATICS.pptxCOMPUNATIONAL BIOLOGY AND DATABASES IN BIOINFORMATICS.pptx
COMPUNATIONAL BIOLOGY AND DATABASES IN BIOINFORMATICS.pptx
 
Biological database
Biological databaseBiological database
Biological database
 
NCBI
NCBINCBI
NCBI
 

Recently uploaded

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 

Recently uploaded (20)

Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesEnergy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 

Biological databases

  • 1. INTRODUCTION TO BIOLOGICAL DATABASES SARFARAZ HUSSAIN Department of Bioinformatics & Biotechnology GCU- Faisalabad NOTE: Most slides derived from NCBI’s field guide / sarfraz1412@gmail.com
  • 2. WHAT YOU NEED TO LEARN:  What is a database and what are the features of an ideal db?  What are the relationships/differences between primary and derived sequence databases?  What are the benefits of RefSeq?  Why is data integration useful?
  • 3. THE NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION Created in 1988 as a part of the National Library of Medicine at NIH – Establish public databases – Research in computational biology – Develop software tools for sequence analysis – Disseminate biomedical information Bethesda,MD
  • 4. WEB ACCESS: WWW.NCBI.NLM.NIH.GOV New Homepage Common footerCommon footer New pages!New pages!
  • 5. WHAT ARE DATABASES?  Structured collection of information.  Consists of basic units called records or entries.  Each record consists of fields, which hold pre- defined data related to the record.  For example, a protein database would have protein entries as records and protein properties as fields (e.g., name of protein, length, amino- acid sequence)
  • 6. THE ‘PERFECT’ DATABASE  Comprehensive, but easy to search.  Annotated, but not “too annotated”.  A simple, easy to understand structure.  Cross-referenced.  Minimum redundancy.  Easy retrieval of data.
  • 7. THE CENTRAL DOGMA & BIOLOGICAL DATA Protein structures -Experiments -Models (homologues) Literature information Original DNA Sequences (Genomes) Protein Sequences -Inferred -Direct sequencing Expressed DNA sequences ( = mRNA Sequences = cDNA sequences) Expressed Sequence Tags (ESTs)
  • 8. NCBI DATABASES AND SERVICES  GenBank primary sequence database  Free public access to biomedical literature  PubMed free Medline (3 million searches per day)  PubMed Central full text online access  Entrez integrated molecular and literature databases
  • 9. TYPES OF MOLECULAR DATABASES  Primary Databases  Original submissions by experimentalists  Content controlled by the submitter  Examples: GenBank, Trace, SRA, SNP, GEO  Derivative Databases  Derived from primary data  Content controlled by third party (NCBI)  Examples: NCBI Protein, Refseq, TPA, RefSNP, GEO datasets, UniGene, Homologene, Structure, Conserved Domain
  • 10.  PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The United States National Library of Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez system of information retrieval.
  • 11.  Pubmed: click on the drop down menu select the pubmed option. Type any topic which you want to find in the search box.
  • 12.  After typing the topic of our interest lots of research papers will appear on window from where we select the specific papers for our study.
  • 13. PRIMARY VS. DERIVATIVE SEQUENCE DATABASES GenBankGenBank SequencingSequencing CentersCenters GA GAGA ATT ATT C CGAGA ATT ATT C C AT GAGA ATT C C GAGA ATT C C TTGACA ATTGACTA ACGTGC TTGACA CGTGA ATTGACTA TATAGCCG ACGTGC ACGTGC ACGTGC TTGACA TTGACA CGTGA CGTGA CGTGA ATTGACTA ATTGACTA ATTGACTA ATTGACTA TATAGCCG TATAGCCGTATAGCCG TATAGCCGTATAGCCG TATAGCCGTATAGCCG TATAGCCG CATT GAGA ATT C C GAGA ATT C C LabsLabs AlgorithmsAlgorithms UniGene CuratorsCurators RefSeq Genome Assembly TATAGCCG AGCTCCGATA CCGATGACAA Updated continually by NCBI Updated ONLY by submitters
  • 14. SEQUENCE DATABASES AT NCBI  Primary  GenBank: NCBI’s primary sequence database  Trace Archive: reads from capillary sequencers  Sequence Read Archive: next generation data  Derivative  GenPept (GenBank translations)  Outside Protein (UniProt—Swiss-Prot, PDB)  NCBI Reference Sequences (RefSeq)
  • 15. GENBANK - PRIMARY SEQUENCE DB  Nucleotide only sequence database  Archival in nature  Historical  Reflective of submitter point of view (subjective)  Redundant  Data  Direct submissions (traditional records)  Batch submissions  FTP accounts (genome data)
  • 16. GENBANK - PRIMARY SEQUENCE DB (2)  Three collaborating databases 1. GenBank 2. DNA Database of Japan (DDBJ) 3. European Molecular Biology Laboratory (EMBL) Database
  • 17. TRADITIONAL GENBANK RECORD ACCESSION U07418 VERSION U07418.1 GI:466461 ACCESSION U07418 VERSION U07418.1 GI:466461 Accession •Stable •Reportable •Universal Accession •Stable •Reportable •Universal Version Tracks changes in sequence Version Tracks changes in sequence GI number NCBI internal use GI number NCBI internal use well annotatedwell annotated the sequence is the datathe sequence is the data
  • 19. FEATURES Location/Qualifiers source 1..2484 /organism="Homo sapiens" /mol_type="mRNA" /db_xref="taxon:9606" /chromosome="3" /map="3p22-p23" gene 1..2484 /gene="MLH1" CDS 22..2292 /gene="MLH1" /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession Number P14242), S. cerevisiae MLH1 (GenBank Accession Number U07187), E. coli MUTL (Swiss-Prot Accession Number P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession Number P14161) and Streptococcus pneumoniae (Swiss-Prot Accession Number P14160)" /codon_start=1 /product="DNA mismatch repair protein homolog" /protein_id="AAC50285.1" /db_xref="GI:463989" /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS GENPEPT: GENBANK CDS TRANSLATIONS >gi|463989|gb|AAC50285.1| DNA mismatch repair prote... MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV... EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD... >gi|463989|gb|AAC50285.1| DNA mismatch repair prote... MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV... EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
  • 20. REFSEQ: DERIVATIVE SEQUENCE DATABASE  Curated transcripts and proteins  Model transcripts and proteins  Assembled Genomic Regions  Chromosome records  Human genome  microbial  organelle ftp://ftp.ncbi.nih.gov/refseq/release/
  • 21. SELECTED REFSEQ ACCESSION NUMBERS Curated mRNA Curated Protein Curated non-coding RNA Predicted mRNA Predicted Protein Predicted non-coding RNA Reference Genomic Sequence Microbial replicons, organelle genomes, Alternate assemblies Contig WGS Supercontig
  • 23. REFSEQS: ANNOTATION REAGENTS Genomic DNAGenomic DNA ((NCNC,, NT, NWNT, NW)) Model mRNAModel mRNA (XM)(XM) (XR)(XR) Curated mRNACurated mRNA (NM)(NM) (NR)(NR) Model proteinModel protein (XP)(XP) Curated ProteinCurated Protein (NP)(NP) Scanning.... = ? GenBank Sequences RefSeq
  • 24. REFSEQ BENEFITS  Non-redundancy    Updates to reflect current sequence data and biology  Data validation  Format consistency  Distinct accession series  Stewardship by NCBI staff and collaborators
  • 25. OTHER DERIVATIVE DATABASES  Expressed Sequences  dbSNP  Structure  Gene  and more…
  • 27. ENTREZ: A DISCOVERY SYSTEM Gene Taxonomy PubMed abstracts Nucleotide sequences Protein sequences 3-D Structure 3 -D Structure Word weight VAST BLASTBLAST Phylogeny Hard Link Neighbors Related Sequences Neighbors Related Sequences BLink Domains Neighbors Related Structures Pre-computed and pre-compiled data. •A potential “gold mine” of undiscovered relationships. •Used less than expected. Pre-computed and pre-compiled data. •A potential “gold mine” of undiscovered relationships. •Used less than expected.
  • 28. GLOBAL QUERY: ALL NCBI DATABASES The Entrez system: 38 (and counting) integrated databasesThe Entrez system: 38 (and counting) integrated databases
  • 29. TRADITIONAL METHOD: THE LINKS MENU DNA Sequence Nucleotide – Protein Link Related Proteins Protein – Structure Link 3-D Structure
  • 30. THE PROBLEM  Rapidly growing databases with complex and changing relationships  Rapidly changing interfaces to match the above Result  Many people don’t know:  Where to begin  Where to click on a Web page  Why it might be useful to click there
  • 31. GLOBAL NCBI (ENTREZ) SEARCH colon cancercolon cancer
  • 33. ENTREZ TIP: START SEARCHES IN GENE Other Entrez DBs HomoloGene Entrez Protein Gene UniGene BLink Homologene: Gene Neighbors
  • 34. PRECISE RESULTS MLH1[Gene Name] AND Human[Organism]MLH1[Gene Name] AND Human[Organism]
  • 37. GENEVIEW: HUMAN MLH1 VARIATIONS ATPase domain
  • 38. ‘TAKE HOME MESSAGE’ ADVANTAGES OF DATA INTEGRATION  More relevant inter-related information in one place  Makes it easier to find additional relevant information related to your initial query  Potentially find information indirectly linked, but relevant to your subject of interest  uncover non-obvious genetic features that explain phenotype or disease  Easier to build a ‘story’ based on multiple pieces of biological evidence

Editor's Notes

  1. NCBI homepage. Logo will take you back to home page. About NCBI provides introduction to the NCBI and contains basic information on genetics and bioinformatics.
  2. Primary databases serve as a repository of experimentalist sequences (GenBank). Derivative databases are sources of edited/curated sequences (RefSeq…reference sequences, UniGene...genes compared to genetic loci on genomes)
  3. ~11,000 sequences are submitted per day.
  4. Goal= nonredundant set of genes/proteins for each organism represented Model= comes from analysis of genomic content from organism assembly Reannotation of microbial genomes, for example.
  5. Two letter prefix, underscore, numeric portion. NT=contig assemblies produced by NCBI NW=supercontig assembles from WGS