1 Basics: 1 n 2
- Name at least three searchable types (categories) of information that are contained in
Title, author, journal issue, text words.
- What are MeSH terms and what is their purpose?
MeSH (Medical Subject Headings) is the controlled vocabulary thesaurus used for
indexing articles in MEDLINE/PubMed. MeSH terms are organized in a hierarchical
structure that allows searching at various levels of specificity. They follow a hierarchy format and NOT a DAG, so they are NOT an ontology.
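Searching "at various levels of specificity" can be pictured as a prefix match on tree numbers. The following is only a minimal sketch: the heading names are used for illustration and the tree numbers are invented placeholders, not real MeSH codes.

```python
# Hypothetical tree numbers, invented for illustration (not real MeSH codes).
TREE = {
    "Neoplasms": "X01",
    "Neoplasms by Site": "X01.100",
    "Breast Neoplasms": "X01.100.200",
    "Lung Neoplasms": "X01.100.300",
}

def explode(term, tree=TREE):
    """Return the term plus every narrower term below it in the hierarchy."""
    prefix = tree[term]
    return sorted(
        name for name, number in tree.items()
        if number == prefix or number.startswith(prefix + ".")
    )

# A broad term retrieves itself and all of its narrower terms.
print(explode("Neoplasms by Site"))
# A leaf term retrieves only itself.
print(explode("Lung Neoplasms"))
```

Searching with a term high in the hierarchy therefore covers articles indexed with any of its narrower terms.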
Q34: How can a search result be “expanded”? (I refer to the PubMed help, where “expansion of search results” is a separate point.)
Answer: If the question means expanding a search that retrieved too few citations, take the following steps:
· Click the “See all” link under Related citations for a relevant citation to display a pre-calculated set of PubMed citations closely related to that article.
· Remove extraneous or overly specific terms from the search box.
· Try alternative terms to describe the concepts you are searching for.
- Explain “information retrieval” with an example involving Medline and MeSH terms
- When do we speak of synonyms and when do we speak of homonyms?
Synonyms are different words with identical meaning.
Homonyms are identical words with different meaning.
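The two definitions above can be sketched as two views of one term-to-concept table. The table entries are illustrative assumptions, not taken from any thesaurus.

```python
from collections import defaultdict

# Toy term-to-concept table; the entries are illustrative assumptions.
TERM_CONCEPTS = [
    ("tumour", "neoplasm"),
    ("tumor", "neoplasm"),
    ("neoplasm", "neoplasm"),
    ("cold", "common cold (disease)"),
    ("cold", "low temperature"),
]

def build_indexes(pairs):
    """Index the table in both directions."""
    terms_of = defaultdict(set)     # concept -> terms
    concepts_of = defaultdict(set)  # term -> concepts
    for term, concept in pairs:
        terms_of[concept].add(term)
        concepts_of[term].add(concept)
    return terms_of, concepts_of

terms_of, concepts_of = build_indexes(TERM_CONCEPTS)
# Synonyms: several different terms point to one concept.
synonyms = {c: t for c, t in terms_of.items() if len(t) > 1}
# Homonyms: one term points to several different concepts.
homonyms = {t: c for t, c in concepts_of.items() if len(c) > 1}
print(synonyms)
print(homonyms)
```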
- Sketch the major concepts and the conceptual schema of MedLine
Journal, Author Name, Article, Publication date, References
- Explain the differences between OMIM and MedLine
Both OMIM and Medline are literature-based databases. However, the OMIM database
is a catalog of human genes and genetic disorders: an entry in OMIM is a review
focusing on a disease, its phenotypic appearance and the genes involved in the
molecular etiology of the disease. Medline, in contrast, is a bibliographic database
covering a broad scope of biosciences.
OMIM: a bibliographic database which contains information on all known Mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype.
Medline: a bibliographic database that cites abstracts from biomedical journals.
OMIM: a database which provides access to curated data gathered from public scientific literature as well as other sources.
Medline: a database of indexed abstracts from scientific biomedical literature.
OMIM: each entry is obtained and compiled from several reference sources.
Medline: each entry corresponds to a single journal article.
OMIM: does not employ MeSH terms.
Medline: uses a controlled vocabulary called MeSH (Medical Subject Headings).
OMIM: a heavily curated database.
Medline: also a curated database.
OMIM: focused on human disease; gathers any kind of information which helps to understand the cause of disease.
Medline: contains details of mutagenesis experiments whose relevance might be yet to be established.
- What are the three root concepts of GO?
Molecular function: the elemental activities of a gene product at the molecular level, such as binding or catalysis.
Biological process: operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.
Cellular component: the parts of a cell or its extracellular environment.
- What means “annotation”?
A combination of comments, notations, references, and citations, either in free format
or utilising a controlled vocabulary, that together describe all the experimental and
inferred information about a gene or protein. Annotations can also be applied to the
description of other biological systems. Batch, automated annotation of bulk biological
sequence is one of the key uses of Bioinformatics tools.
- Which controlled vocabularies do you know besides GO?
MeSH (Medical Subject Headings), HGNC, Sequence Ontology, BRENDA enzyme source ontology, EMAP,
Swiss-Prot keywords, MAGE-OM
· EC (Enzyme nomenclature)
- Name three of the most important / most informative entity-types that can be found in
EMBL or EntrezGene
Entity-types: organism, molecule, sequence.
- Which objects in biology correspond to these entity-types?
Organism: animals, plants, fungi, bacteria, protozoa.
Molecule: DNA, RNA.
Sequence: nucleic acid sequence, i.e. adenine, thymine, guanine, cytosine in case of
DNA; adenine, uracil, guanine, cytosine in case of RNA.
- Define a gene and name three attributes
A locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions.
A gene is a unit of DNA which performs one function. Usually, this is equated with the production of one RNA or one protein. A gene contains coding regions, introns, untranslated regions and control regions.
Intron & exon positions
Protein binding sites
- Who is the owner of an entry in EMBL?
The person who submits (submitter) the gene sequence (or any entry) to the EMBL database is the owner. The authority to change that entry is also with the submitter.
- Name the most important data types / entity types that you have to provide with an
entry in EMBL
1 Submitter Information
2 Release Date Information
3 Sequence Data, Description and Source Information
4 Reference Citation Information
5 Feature Information (e.g. coding regions, regulatory signals)
- What means “vector clipping”?
“Vector clipping” means separating the DNA segment of interest from the vector
DNA when the DNA of interest is contaminated with vector DNA. EBI provides
a vector screening service that uses the BLAST algorithm for the “vector clipping” procedure.
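As a rough illustration of the idea only: the sketch below trims vector bases from both ends of a read by exact string matching. This is a deliberate simplification and not how the BLAST-based screening works; the function name, the vector arm sequences and the `min_overlap` parameter are assumptions made for this example.

```python
def clip_vector(read, vector_end, vector_start, min_overlap=6):
    """Trim vector-derived bases from both ends of a read (exact matches only)."""
    # 5' side: the read may begin with the last k bases of the upstream vector arm.
    for k in range(min(len(vector_end), len(read)), min_overlap - 1, -1):
        if read.startswith(vector_end[-k:]):
            read = read[k:]
            break
    # 3' side: the read may end with the first k bases of the downstream vector arm.
    for k in range(min(len(vector_start), len(read)), min_overlap - 1, -1):
        if read.endswith(vector_start[:k]):
            read = read[:-k]
            break
    return read

# Hypothetical vector arms flanking the cloning site.
UPSTREAM_ARM = "GGATCCGAATTC"
DOWNSTREAM_ARM = "AAGCTTCTCGAG"

raw_read = "GAATTC" + "ATGCCCTTTAAA" + "AAGCTTCT"
print(clip_vector(raw_read, UPSTREAM_ARM, DOWNSTREAM_ARM))
```

Because read ends are the most error-prone part of a read, exact matching like this would often miss real vector sequence, which is one reason an alignment-based search is used in practice.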
- How come that so many genes have more than one name?
For a long time there was no well-established, curated gene naming system.
This led to situations where the same gene was discovered by
different people and was therefore assigned different names. A gene may also get
more than one name when partial gene fragments are discovered first
and are each taken for a whole gene. Once it turns out that the partial
sequences belong to one gene, that gene has several different names.
- Explain the difference between a data repository and a curated database
A data repository is an uncurated place to store data. The responsibility for the accuracy of
the data in a repository lies with the submitter, e.g. EMBL. In a curated database a team of
specialists checks the incoming entries to avoid ambiguities and redundancy, e.g.
- What is a knowledge base?
- What are mRNA, hnRNA, cDNA, rRNA and tRNA and how are they represented in
· mRNA is messenger RNA, transcribed from DNA. The role of mRNA is to move the information contained in DNA to the
translation machinery (ribosomes).
· hnRNA is a precursor RNA, i.e. an RNA transcript before it is processed into
mRNA, rRNA, tRNA, or other cellular RNA species; any RNA species that is
not yet the mature RNA product.
· cDNA (complementary DNA) is a piece of DNA copied from a mature mRNA.
· rRNA is ribosomal RNA. It is a component of the ribosomes, the protein
synthesis factories of the cell.
· tRNA is transfer RNA. It carries an amino acid to the ribosome so that the
amino acid can be added to the growing polypeptide chain.
· Small nuclear ribonucleic acid (snRNA) is a class of small RNA molecules that are found within the nucleus of eukaryotic cells.
In EMBL, under the SRS interface, there is a field "molecule" which comprises these values.
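The relation between a mature mRNA and its cDNA can be sketched as a reverse complement in the DNA alphabet. This is a minimal sketch that ignores the poly(A) tail and priming; the function name is an assumption of this example.

```python
# Map each RNA base to its complementary DNA base.
COMPLEMENT = str.maketrans("ACGU", "TGCA")

def first_strand_cdna(mrna):
    """Return the first-strand cDNA (5'->3') complementary to an mRNA (5'->3')."""
    return mrna.upper().translate(COMPLEMENT)[::-1]

print(first_strand_cdna("AUGGCC"))
```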
- What does “coding sequence” mean and what is a “non-coding sequence”?
“Coding sequence” is the portion of a gene or an mRNA which actually codes for a
protein. Introns are not coding sequences; nor are the 5' or 3' untranslated regions. The
coding sequence in a cDNA or mature mRNA includes everything from the ATG (or
AUG) initiation codon through to the stop codon, inclusive.
“Non-coding sequence” is a sequence that is not translated into protein, e.g. introns,
promoters, transcription factor binding sites and all other sites that do not code for protein.
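The definition "from the initiation codon through to the stop codon, inclusive" can be sketched as follows. The sketch assumes that the first ATG in the sequence is the true initiation codon, which is a simplification; the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def coding_sequence(cdna):
    """Return the CDS from the first ATG through the first in-frame stop codon, inclusive."""
    seq = cdna.upper().replace("U", "T")
    start = seq.find("ATG")  # assumption: the first ATG is the initiation codon
    if start == -1:
        return None
    # Walk codon by codon in the reading frame set by the ATG.
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return seq[start:i + 3]
    return None  # no in-frame stop codon found

# 5' UTR "GGCACC", CDS "ATGGCTTAA", 3' UTR "ACGT"
print(coding_sequence("GGCACCATGGCTTAAACGT"))
```

Everything before the returned string corresponds to the 5' untranslated region and everything after it to the 3' untranslated region.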
- How is information on the exon-intron structure of a gene represented in an EMBL entry?
Information about exon-intron structure in EMBL is stored in the “Features” field:
“Key” defines intron/exon, “Location” defines the location of intron/exon, e.g. Intron
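A minimal sketch of reading key and location from feature lines. The example lines are invented, and the parser handles only simple `start..end` locations; compound locations such as `join(...)` or `complement(...)` are skipped rather than interpreted.

```python
import re

SIMPLE_RANGE = re.compile(r"(\d+)\.\.(\d+)")

def parse_features(lines):
    """Collect (key, start, end) from feature lines with a simple start..end location."""
    features = []
    for line in lines:
        parts = line.split()
        # A feature line has three fields: the FT line code, the key and the location.
        if len(parts) != 3 or parts[0] != "FT":
            continue
        match = SIMPLE_RANGE.fullmatch(parts[2])
        if match:
            features.append((parts[1], int(match.group(1)), int(match.group(2))))
    return features

# Hypothetical feature lines, invented for this example.
EXAMPLE = [
    "FT   exon            1..120",
    "FT                   /number=1",
    "FT   intron          121..300",
    "FT   exon            301..450",
]
for key, start, end in parse_features(EXAMPLE):
    print(key, start, end)
```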
- Give examples for “values” of the entity type “molecule” in EMBL
genomic DNA, genomic RNA, mRNA, other DNA, other RNA, pre-RNA, rRNA,
snoRNA, snRNA, tRNA, unassigned DNA, unassigned RNA, viral cRNA
- Who issues an accession number?
The curator, if the database is curated; if not, it is issued automatically. (For EMBL,
- What is the difference between an accession number and a database identifier?
An accession number (AC) is assigned to each sequence upon inclusion into a specific database such as UniProtKB. ACs are stable from release to release. If several entries of one type are merged into one to minimize redundancy, the ACs of all merged entries are kept. Each entry has one primary AC and optional secondary ACs.
The entry name (ID) is a unique identifier, often containing biologically relevant information. It is sometimes necessary to change IDs for reasons of consistency, e.g. (1) to ensure that related entries have similar names, or (2) when an entry is promoted from the TrEMBL section of UniProtKB (computationally annotated records) to the Swiss-Prot section (fully curated records). The AC, however, is always conserved.
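The bookkeeping described above can be sketched with a small record type. The class, the function names, the entry names and the accession numbers are all hypothetical; the sketch only shows that a merged entry stays reachable under every old accession number while the entry name may change.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    entry_name: str                 # ID: may change between releases
    primary_ac: str                 # accession number: stable
    secondary_acs: list = field(default_factory=list)

def merge(kept, absorbed):
    """Merge `absorbed` into `kept`; all accession numbers remain resolvable."""
    secondary = kept.secondary_acs + [absorbed.primary_ac] + absorbed.secondary_acs
    return Entry(kept.entry_name, kept.primary_ac, secondary)

def resolves(entry, ac):
    """True if the accession number leads to this entry."""
    return ac == entry.primary_ac or ac in entry.secondary_acs

# Hypothetical entries describing the same protein.
a = Entry("PROT_A", "X00001")
b = Entry("PROT_B", "X00002", ["X00003"])
merged = merge(a, b)
print(merged)
```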
- Why is EMBL synchronized with DDBJ and NCBI/GenBank?
To make the effective sharing of scientific information possible. To give access to
all the submitted data through the gateways of all three databases. To create a common
system with minimum ambiguity.
- Why is EMBL split in EMBL, EMBL updates and EMBL (whole genome shotgun) at
the SRS interface?
EMBL is updated very frequently. To keep the query engine fast, the index
must be rebuilt whenever new entries are submitted. However, EMBL
comprises so many entries that reindexing all of them takes a lot of time. The splitting is
therefore done to decrease the time needed for reindexing.
- What is a contig?
A contig is a longer stretch of DNA built from short fragments, e.g. a long contiguous DNA sequence assembled from shotgun sequencing reads.
A contig (from contiguous) is a set of overlapping DNA segments that together represent a consensus region of DNA. In bottom-up sequencing projects, a contig refers to overlapping sequence data (reads); in top-down sequencing projects, contig refers to the overlapping clones that form a physical map of the genome that is used to guide sequencing and assembly. Contigs can thus refer both to overlapping DNA sequence and to overlapping physical segments (fragments) contained in clones, depending on the context.
- What means “whole genome assembly”?
Whole genome assembly refers to the process of taking a large number of short
DNA sequences, all of which were generated by a shotgun sequencing project, and
putting them back together to create a representation of the original DNA sequence.
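"Putting them back together" can be illustrated with a toy greedy overlap merge. This is a sketch under strong assumptions (error-free reads, one strand, no repeats); real assemblers are far more elaborate. The function names and the `min_len` parameter are choices made for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # nothing overlaps any more: several contigs remain
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)] + [merged]
    return contigs

# Three error-free reads sampled from the sequence ATGCGTACGTTAGC.
READS = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(READS))
```

When reads do not overlap, the loop stops early and returns more than one contig, which matches the definition of a contig given above.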
- What differentiates EMBL from ENSEMBL?
EMBL is a sequence (molecule) focused database and ENSEMBL is a genome (organism)
focused database. The Ensembl project provides automated genome annotation and
subsequent visualisation of the annotated genomes.
- Sketch the process of sequencing and identify possible sources for errors
DNA extraction -> fragmentation -> cloning into vectors -> transformation into bacteria and growth -> sequencing of the library -> assembly of contiguous fragments.
Possible sources for errors:
· Errors due to experimental process:
Typical problems are: no bands or weak bands because the tube contains no DNA or far less than necessary; a missing primer or inefficient interaction between primer and template; contamination in the template, which later causes poor resolution; and a sequence that looks good in some places but not in others. The last problem can be caused by salt in the DNA, too much DNA in the reaction, an unknown impurity "poisoning" the Taq processivity, an unknown contaminant increasing the binding of dyes in the enzyme's active site, secondary structure in the template, unincorporated dyes remaining in the sample, or a contaminant in the sample that binds unincorporated dyes.
Errors in a determined DNA sequence can be caused by flaws in the translation operations of the electrophoresis signal or quirks that arose during the experiment itself. This becomes visible in the wide diversity of data that is obtained even when using a single chemistry type, let alone different ones: under- and over- oscillations of the signals, unseparated curves (compression artefacts), and signal peaks or dropouts are frequent. Incorrect signal analysis raises errors in the base calling process of the signals and constitutes a limiting factor in the automation of assembly processes. Basically, there are three types of errors introduced into the data by electrophoresis and subsequent base-calling: insertions, deletions and mismatches.
· Errors due to biology:
While errors due to the data acquisition process itself are problematic enough, the processes that precede it involve multiple steps of biological handling and add an additional level of complexity to the task. One of the larger inconveniences is due to the method used to amplify small DNA clones which consist of adding an amplification vector and inserting the resulting construct into host cells. This vector/payload construct leads to an unpleasant consequence: any DNA sequence determined is likely to contain some part of the sequencing vector itself at the start - and sometimes the end - of the determined sequence. These stretches must of course be electronically removed as they do not belong to the target DNA that is to be sequenced. Unfortunately, the vector sequences are at the very front and rear of the sequence, which are the most error prone parts. Due to these errors, simple pattern matching algorithms often fail to recognise the sequencing vector completely.
The self-replication of the host cells induces two further kinds of error: 1) errors in base replication itself, which most of the time lead to small point mutations (SNPs, Single Nucleotide Polymorphisms), or 2) errors on a larger scale, where the vector can "lose" its sequence payload, recombine with other plasmids or even recombine with parts of the host cell's sequence.
- Name at least three attributes listed under “Features” in a standard EMBL entry
LIBEST_026595 Nitella hyalina EST library
Culture harvested from various time points during the day and across the life cycle.
- Sketch the major parts of a typical EMBL entry: what categories does a “normal”
EMBL entry have?
1. General information (primary accession, accession, sequence length, entry creation date, modification date, ID, ...)
2. Description (description, keywords, organism, organism classification)
4. Sequence (characteristics, sequence)
- What are “EST sequences” ? and which database comprises information on EST
ESTs (expressed sequence tags) are short pieces of cDNA sequence. Tags can be
mapped to specific positions (tag markers). ESTs consist of 100-400 base pairs.
ESTs are produced by single-pass sequencing of cDNA clones. Many overlapping ESTs from one long
sequence allow that sequence to be reconstructed. UniLib, UniGene comprise EST
- What is a 3´UTR ?
The untranslated region at the 3'-end of an mRNA, i.e. following the coding region. It
contains the polyadenylation signal, as well as binding sites for proteins that affect the
mRNA's stability or location in the cell.
· A polyadenylation signal sequence, which marks the termination of the transcript about 30 base pairs downstream of the signal, followed by a few hundred adenine residues.
· Binding sites for proteins that influence the stability or the transport of the mRNA.
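Locating a polyadenylation signal in a 3' UTR can be sketched as a plain motif search. The sketch assumes the canonical hexamer AAUAAA (AATAAA in the DNA alphabet) and ignores variant signals; the function name and the example sequence are invented.

```python
def find_polya_signals(utr, signal="AATAAA"):
    """Return the 0-based start positions of the signal in a 3' UTR sequence."""
    seq = utr.upper().replace("U", "T")
    positions = []
    start = seq.find(signal)
    while start != -1:
        positions.append(start)
        start = seq.find(signal, start + 1)
    return positions

# Invented 3' UTR fragment containing one signal.
print(find_polya_signals("GCTTAATAAAGGCTTTCCAGTC"))
```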
- What is a transcription factor site and why would you collect information on these
sites in a database?
Transcription factors (TFs) are proteins that interact with DNA and initiate or inhibit the
process of transcription upon binding to DNA. A TF site is a region on the DNA to
which a TF can bind. It is important to know these sites to understand how (and
which) TFs can influence the regulation of specific genes.
- Explain the fundamentals of gene regulation (activation of transcription; features of
DNA that mediate and control transcription)
For transcription to start, RNA polymerase must bind to the promoter (in prokaryotes).
In eukaryotes, TFs must bind to the TF sites, and only then is the polymerase able to
recognize the promoter region.
- Explain how in silico prediction of transcription factor binding sites can be validated
through molecular biology experiments
You can use pulldown assays to verify predicted binding sites. Attach a magnetic bead
to the nucleotide sequence in question, allow it to bind to proteins, pull out the
complexes using magnetic force, wash off unbound proteins and check whether the
sequence bound to the TFs as predicted (using gel electrophoresis, MS, western
blotting, ...). Other methods: yeast one-hybrid (the DNA-protein counterpart of yeast two-hybrid), site-directed mutagenesis.
- What is a 5´UTR?
The untranslated region at the 5'-end of an mRNA. It contains several functional
elements, like binding sites for proteins that alter the RNA's stability or location in the
cell, as well as sequences that promote the initiation of translation.
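- How does the transcriptional machinery know about the beginning and the end of a
“gene” (a transcript)? The promoter marks the beginning, and a terminator or polyadenylation signal marks the end. Start and stop codons delimit translation, not transcription.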
- Which properties (features) define a class of transcription factors in TFFACTOR?
The CLASS (CL) feature defines that.
- Name at least three different classes (types) of transcription factors
- How are evidences for the presence of a certain domain or motif represented in
TFFACTOR? (I refer to the FT line in TFFACTOR entries)
The FT line of a TFFACTOR entry represents a certain domain or motif. FT stands for “feature table”; each line lists the first and the last position of the feature.
75. What are microarrays?
A microarray (= gene chip, gene array) is a device for the large-scale, simultaneous
measurement of gene expression in a sample of mRNA. It consists of a small solid
support (similar to a computer chip, hence the alternative names) onto which a
collection of polynucleotide probes has been fixed, chosen in such a way that they selectively
hybridise with the cDNA of interest. Spots specific to a gene are distributed in an ordered
manner over the chip in some sort of array. Several thousand of these spots might be
present on just one microarray.
The area containing exactly one defined species of biomolecule is called an “element” (or feature). Thousands of elements can be immobilized at very high density, which makes it possible to monitor the hybridization or binding events of a very large number of biomolecules simultaneously.
76. What are alternative gene expression determination technologies?
Low-to-mid plex technologies (older techniques):
· Western blot
· Northern blot
· Fluorescent in-situ hybridization
· Real time PCR
· Next Generation Sequencing Technology
· Expressed Sequence Tag (EST) analysis
· Serial Analysis Gene Expression (SAGE)
77. Explain the microarray workflow in the laboratory and map the major MAGE-OM
classes to the workflow
· Extraction of the sample, cell, (BioMaterial) from studied tissues of an organism
(BioSource) and preparation (BioSample) via a protocol (Treatment).
· Extraction of total RNA and purification to mRNA
· Labeling of extract (LabeledExtract) using dyes or other markers (Compound).
· Clean-up of the extract (BioAssayTreatment)
· Hybridisation (also a BioAssayCreation (subclass Hybridization)).
· Scanning (Image), spot finding, quantification (FeatureExtraction producing
BioAssay and BioAssayData) gives us Features (subclass of DesignElement).
· Further analysis and display (MeasuredBioAssayData).
78. What is the difference between one-colour (one channel) and two-colour (two
channel) microarray assays?
Two-colour (two-channel) microarrays are typically hybridized with cDNA prepared from two samples to be compared, labeled with two different fluorescent dyes (Cy3, green, and Cy5, red). The two Cy-labeled cDNA samples are mixed and hybridized to a single microarray, which is then scanned in a microarray scanner to visualize the fluorescence of the two fluorophores after excitation with a laser beam of a defined wavelength. The relative intensities of each fluorophore may then be used in ratio-based analysis to identify up-regulated and down-regulated genes; they are not used to determine absolute levels of gene expression.
In one-colour (one-channel) microarrays, the arrays provide intensity data for each probe or probe set indicating a relative level of hybridization with the labeled target. However, they do not truly indicate abundance levels of a gene but rather relative abundance when compared to other samples or conditions processed in the same experiment. Each RNA molecule encounters protocol and batch-specific bias during the amplification, labeling, and hybridization phases of the experiment, making comparisons between genes on the same microarray uninformative.
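The ratio-based analysis of a two-colour array can be sketched as a per-spot log2 ratio. The intensities are invented, and the two-fold threshold (|log2 ratio| >= 1) is an assumed convention for this example, not something the notes prescribe.

```python
import math

def log2_ratios(cy5, cy3):
    """Per-spot log2(Cy5 / Cy3); positive means a stronger signal in the Cy5 sample."""
    return [math.log2(red / green) for red, green in zip(cy5, cy3)]

def classify(log_ratio, threshold=1.0):
    """Label a spot by its log2 ratio; the threshold is an assumed cut-off."""
    if log_ratio >= threshold:
        return "up"
    if log_ratio <= -threshold:
        return "down"
    return "unchanged"

# Invented intensities for three spots.
CY5 = [800, 100, 250]
CY3 = [200, 400, 250]
ratios = log2_ratios(CY5, CY3)
print([classify(r) for r in ratios])
```

The ratio says how the two samples differ at one spot; it says nothing about the absolute amount of transcript.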
79. What are the consequences of one-channel versus two-channel hybridization for
normalization and comparison between chip experiments?
One of the advantages of the One-colour system lies in the fact that an aberrant sample cannot affect the raw data derived from other samples, because each array chip is exposed to only one sample (as opposed to a Two-color system in which a single low-quality sample may drastically impinge on overall data precision even if the other sample was of high quality).
Another benefit is that data are more easily compared to arrays from different experiments so long as batch effects have been accounted for.
One drawback to the one-color system, however, is that, when compared to the two-color system, twice as many microarrays are needed to compare samples within an experiment.
Each RNA molecule encounters protocol and batch-specific bias during the amplification, labeling, and hybridization phases of the experiment, making comparisons between genes on the same microarray uninformative.
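One simple way to make one-colour arrays comparable is to centre each array on its own median. The sketch below shows median centring of log2 intensities as an assumed, deliberately simple normalization; it does not remove batch effects and is not claimed to be the method used by any particular platform.

```python
import math
from statistics import median

def median_center(intensities):
    """Shift one array's log2 intensities so that their median becomes zero."""
    logs = [math.log2(value) for value in intensities]
    centre = median(logs)
    return [value - centre for value in logs]

# Invented data: array B was scanned twice as bright as array A overall.
ARRAY_A = [100, 200, 400]
ARRAY_B = [200, 400, 800]
print(median_center(ARRAY_A))
print(median_center(ARRAY_B))
```

After centring, a global brightness difference between the two arrays no longer shows up in the values.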
80. Give an example for a one-colour microarray platform
Examples: Affymetrix "Gene Chip", Illumina "Bead Chip", Agilent single-channel arrays, the Applied Microarrays "CodeLink" arrays
81. Give an example for a two-colour microarray platform
82. Explain the following MAGE-OM classes:
· Array: physical substrate, its annotations and features
· BioMaterial: superclass of all biologically important substances (e.g. cell, DNA)
· BioSource: (subclass of BioMaterial) the original source material before treatment
· Hybridization: (subclass of BioAssayCreation) the event of hybridizing a BioSample to the microarray
· Feature: (subclass of DesignElement) intended position on the array
· FeatureExtraction: (subclass of BioEvent) extracting the numerical data from hybridized microarray images
· Compound: (class) may consist of various simple and complex compounds
· OntologyEntry: a single entry from an ontology or controlled vocabulary
83. What are the key concepts used in the conceptual model of ArrayExpress?
ArrayExpress is based on MAGE-ML:
Superclass: BioMaterial; subclasses: BioSample, BioSource, Labeled Extract;
Superclass: BioEvent; subclasses: BioAssay Creation, BioAssay Treatment, Feature
Class: Design Element; subclasses: Feature, Reporter, CompositeSequences;
84. Why is it essential to capture all data that describe the origin of the biosample?
Microarray experiments provide information about gene expression, which is a
dynamic process. Gene expression differs between cell types and may also change
over time. It depends on the biosource and on biosample preparation (e.g.
which drugs were used, how long the sample was prepared, etc.). Thus, to obtain
precise information that can be compared with information from other
experiments, a description of the biosample is necessary.
85. Outline the major differences between the conceptual design of GEO and
Both GEO and ArrayExpress are MIAME compliant and both use similar schemas
based on MAGE-OM. However, until recently the basic difference between these
databases was that ArrayExpress offered no possibility to search for the data of a
gene of interest, whereas GEO has GEO Profiles for that purpose. On the other hand,
ArrayExpress now has a prototype of such a program, so the difference between
GEO and ArrayExpress is decreasing.
Another difference is that ArrayExpress contains only data from microarray
experiments, whereas GEO in addition comprises data from non-array techniques
such as serial analysis of gene expression (SAGE) and mass spectrometry proteomic
86. What are “abundantly expressed” genes?
Housekeeping genes. A housekeeping gene is a gene that is (theoretically) expressed in all cells because it provides basic functions needed for the sustenance of all cell types. Genes involved in metabolism are also abundantly expressed in cells.
87. What is the typical distribution of all mRNA species expressed in a cell?
Quantitative distribution: mRNA from abundantly expressed genes makes up ~90% of
the total quantity of mRNA in the cell, whereas only ~10% of the mRNA belongs to regulatory genes.
Qualitative distribution: from a qualitative point of view, the mRNA of regulatory genes shows
a greater variety in the cell than abundantly expressed mRNA.
88. Is this distribution cell-type specific?
There are two ways to describe distribution of mRNA in the cell: Quantitative - not
cell type specific (always the same shape of the curve, a few genes are expressed a lot,
the rest a little); Qualitative - cell type specific (apart from housekeeping-genes, the
genes that are expressed are specific to each cell type and internal and external conditions).
- Name at least two foreign keys that could link a microarray database to a nucleotide
sequence database or UniProt KB
Gene Expression Atlas
- Is there an accession number for microarray data?
Experiments and array designs in ArrayExpress are given unique accession numbers in the format of
- E-XXXX-n for experiments
- A-XXXX-n for array designs
NCBI Gene Expression Omnibus (GEO)
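The two ArrayExpress accession formats above can be checked with a regular expression. The sketch reads "XXXX" as exactly four capital letters, which is an assumption of this example; the accession strings in the demo only illustrate the shape and are not verified records.

```python
import re

# E or A, a four-letter code, then a number (assumed reading of E-XXXX-n / A-XXXX-n).
ACCESSION = re.compile(r"([EA])-([A-Z]{4})-(\d+)")

def parse_accession(accession):
    """Return (kind, code, number) for an ArrayExpress-style accession, else None."""
    match = ACCESSION.fullmatch(accession)
    if match is None:
        return None
    kind = "experiment" if match.group(1) == "E" else "array design"
    return kind, match.group(2), int(match.group(3))

# Illustrative strings only.
for text in ("E-MEXP-31", "A-AFFY-33", "GSE1234"):
    print(text, parse_accession(text))
```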
- Are images taken from microarray scanners part of the database schema of
No. The raw data generated by the microarray scanner are stored without images, only as .txt or .gpr files, listed under “Data files and data matrices - raw data”. Images are difficult to use in queries and do not easily facilitate meta-analysis of combined datasets. One of the most important uses of microarray images is in quality control.
- How are experimental series represented in GEO?
GEO is conceptually divided into three components: Platform (for the physical microarray),
Sample (for one hybridization) and Series (for the experiment). There is a 1-to-n
relationship from Platform to Sample and another one from Series to Sample, which
makes it easy to represent a series of experiments with many hybridizations on the
same type of microarray (or on several types).
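The two 1-to-n relationships can be sketched with three small record types. This is a minimal model following the description above, not GEO's actual schema; the class layout and the accession numbers (with GEO-style GPL/GSM/GSE prefixes) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Platform:
    accession: str
    description: str

@dataclass
class Sample:
    accession: str
    platform: Platform  # each sample refers to exactly one platform

@dataclass
class Series:
    accession: str
    samples: list = field(default_factory=list)  # one series groups many samples

    def platforms(self):
        """Accessions of all platforms used by the samples of this series."""
        return sorted({sample.platform.accession for sample in self.samples})

# Hypothetical records.
array_type = Platform("GPL0001", "hypothetical array type")
series = Series("GSE0001", [Sample("GSM0001", array_type), Sample("GSM0002", array_type)])
print(series.platforms())
```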
- If you ever visited ArrayExpress, you should have read about “Gene Expression
Atlas”. What does the “Gene Expression Atlas” comprise?
The Gene Expression Atlas is a database serving queries for condition-specific gene expression patterns (e.g. genes over-expressed in a particular tissue or disease state) as well as broader exploratory searches for biologically interesting genes/samples. The Atlas replaces the ArrayExpress Data Warehouse.
When you search the atlas, you provide some general query parameters:
- which genes you are interested in
- the direction of differential expression: up, down or both
- which organism the gene belongs to
- what conditions (assay and sample attributes that are experimental factors)
- In ArrayExpress, you will find data sets with the designator “tiling array” or “genome
tiling experiment”. What is the difference between a “classical” microarray
experiment and a genome tiling experiment? Explain!
- What Boolean operators are allowed for querying ArrayExpress (advanced search)?
Enter two or more keywords in the search box with the operators AND, OR or NOT.
AND is the default search term; a search for 'prostate breast' will return hits with a match to 'prostate' AND 'breast'.
Search terms of more than one word must be entered inside quotes otherwise only the first word will be searched for. E.g. transcription AND Rattus norvegicus will effectively be a search for transcription AND Rattus.
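Why quotes matter can be illustrated with a small tokenizer that keeps quoted phrases together and makes the default AND explicit. This is only a sketch of the general idea; it is not ArrayExpress's query parser and does not reproduce how ArrayExpress treats unquoted multi-word terms.

```python
import shlex

OPERATORS = {"AND", "OR", "NOT"}

def tokenize(query):
    """Split a query into terms and operators, inserting the implicit AND."""
    tokens = []
    for token in shlex.split(query):  # shlex keeps "quoted phrases" as one token
        if tokens and token not in OPERATORS and tokens[-1] not in OPERATORS:
            tokens.append("AND")      # two adjacent terms: AND is the default
        tokens.append(token)
    return tokens

print(tokenize("prostate breast"))
print(tokenize('transcription AND "Rattus norvegicus"'))
```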
- What fields can be searched in GEO and what fields can be browsed?
- What are GEO profiles?
The GEO Profiles database stores gene expression profiles derived from curated GEO DataSets. Each Profile is presented as a chart that displays the expression level of one gene across all Samples within a DataSet. Experimental context is provided in the bars along the bottom of the charts, making it possible to see at a glance whether a gene is differentially expressed across different experimental conditions. Profiles have various types of links, including internal links that connect genes that exhibit similar behaviour, and external links to relevant records in other NCBI databases.
GEO Profiles can be searched using many different attributes including keywords, gene symbols, gene names, GenBank accession numbers, or Profiles flagged as being differentially expressed.
The GEO DataSets database stores original submitter-supplied records (Series, Samples and Platforms) as well as curated DataSets.
Curated DataSets form the basis of GEO's advanced data display and analysis features, including tools to identify differences in gene expression levels and cluster heatmaps. GEO Profiles are derived from GEO DataSets. Not all original submitter-supplied records have been assembled into curated DataSets yet.
The GEO DataSets database can be searched using many different attributes including keywords, organism, DataSet type and authors.
- What is the major source of knowledge on proteins? UniProt
99. Which parts of UniProt can be distinguished and what is the purpose of the
partitioning of UniProt?
Major parts:
1. UniProt Knowledge Base (UniProtKB, built in turn from Swiss-Prot, TrEMBL and PIR)
is the curated DB of all knowledge about proteins (names, sequence,
taxonomic and bibliographic data + annotations: e.g. functional info,
posttranslational modifications, diseases, structural info, etc.)
2. UniRef (100/90/50): gives clustered sets of protein sequences (with different clustering
thresholds) to speed up searches and find similarities.
3. UniParc: a comprehensive, non-redundant repository about the history of
4. UNIPROT-METAGENOMICS AND ENVIRONMENTAL
Purpose of partitioning: To clearly distinguish between the curated knowledge in the
KB and the archive... Each part has its own purpose.
100. What are the original root databases of UniProt KB?
Swiss-Prot (the supercurators), TrEMBL (computationally translated ORFs from
EMBL), PIR (Protein Information Resource)
101. Why is UniProt KB called a “curated” database?
Because there are curators to take care of it. Each entry is checked by one or more
experts, double-checked against other entries (minimally redundant!) and annotated
carefully with state-of-the-art knowledge (e.g. functional info, posttranslational
modifications, diseases, structural info, cross-references). N.B.: a part of UniProt
KB is actually not yet curated, but only annotated computationally (-> TrEMBL).
102. Why is it called a “knowledge base”?
UniProtKB is the central hub for the collection of functional information
on proteins, with consistent and rich annotation.
In addition to capturing the core data mandatory for each UniProtKB entry (principally the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and clear indications of the quality of annotation in the form of evidence attribution of experimental and computational data.
103. What is meant by the field “synonyms” in UniProt KB?
Other names by which the protein is known. There can be MANY, since many
different naming conventions exist (some naming is done on the basis of species,
other with respect to families and still others with respect to function or diseases, etc.).
104. Describe the workflow underlying TREMBL
The protein sequences in TrEMBL are obtained by translation of nucleotide sequences (from various nucleotide databases).
Translation is done by trying all 6 reading frames and guessing the correct frame from:
· whether the open reading frame is long enough to be a protein
· whether it looks like a protein sequence with Met (start codon) at the beginning
· whether it has a termination site at a reasonably long distance
After the probable sequence is obtained, the evidence for the existence of the protein is recorded. Experimental evidence at protein level can come, for example, from mass spectrometry such as MALDI-TOF, where the measured protein composition is matched against the predicted sequence.
[Before entry into TrEMBL the protein sequence is stored in UniParc]
Amino acid sequences of the proteins are generated computationally by translating
open reading frames from the EMBL nucleotide sequence database. Since more than one
possible translation exists, TrEMBL data is not as reliable as Swiss-Prot data.
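The six-frame translation step described above can be sketched as follows. This is only an illustration under assumptions, not the actual TrEMBL pipeline: the function names and the `min_length` threshold are made up for this example, and only the standard genetic code is used.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate all three frames of both strands (+1..+3, -1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_orf(dna, min_length=30):
    """Pick the longest Met-to-stop stretch over all six frames.

    min_length (in amino acids) is an arbitrary, hypothetical cut-off.
    Returns (frame name, protein) or ("", "") if nothing qualifies.
    """
    best = ("", "")
    for name, protein in six_frames(dna).items():
        # Drop the last piece: it is not terminated by a stop codon.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start == -1:
                continue
            candidate = segment[start:]
            if len(candidate) >= min_length and len(candidate) > len(best[1]):
                best = (name, candidate)
    return best
```

For example, `longest_orf("CCATGGCTAAATGACC", min_length=3)` selects frame `+3`, which reads ATG GCT AAA TGA.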
105. What is the difference between the Swissprot keywords and GO-terms?
GO terms are always a part of one of three sub-ontologies (BP, CC, MF), hence their
scope is, in principle, more limited (that should not really narrow the applicability in this
case, though). They are organised hierarchically.
SwissProt keywords, on the other hand, are merely indices (along several dimensions, like functional and structural categories). They are comments picked from a controlled vocabulary (CV).
SwissProt keywords:
· This section lists selected keyword(s), derived from a thesaurus of controlled vocabulary with a hierarchical structure. Keywords summarise the content of a UniProtKB entry and facilitate the search for proteins of interest.
· Keywords can be used to retrieve subsets of protein entries based on functional, structural, or other categories.
· Classified along several indices: Biological process, Cellular component, Coding sequence diversity, Developmental stage, Disease, Domain, Ligand, Molecular function, PTM
GO terms:
· This subsection of the ‘Ontologies’ section lists selected terms derived from the Gene Ontology (GO) project.
· Their scope is limited along only three categories
· Classified along 3 indices:
Ø Molecular function
Ø Biological process
Ø Cellular component
106. Describe the workflow of the generation of a new Swissprot/UniProt KB entry
A new entry is taken from TrEMBL and, generally,
the first step is to get a copy of the article(s) given in the reference
lines. Then the sequence is aligned, using FASTA or BLAST, against all
existing Swiss-Prot and TrEMBL entries. This allows the curators, quickly and easily,
to assess if and how the sequence relates to existing families in Swiss-Prot.
The next step is to read the article(s), assess the information
given and add relevant comments and features to the entry.
When a gene has been identified by probing with the gene from another
organism, and that gene encodes a characterized protein, the description line is copied over from the corresponding protein sequence entry. When present in the existing entry and not species-specific, the function and other comment lines are added as well.
The submission of a new protein sequence to UniProtKB can be done by SPIN. SPIN is
the web-based tool for submitting directly sequenced protein sequences and their
biological annotations to the UniProt Knowledgebase.
107. Why is SwissProt / UniProt KB called a “semantic hub” for molecular biology data?
108. Which sort of references do you know exist in the UniProt KB schema?
References to the three types of sequence-related DBs (NA sequences, protein
sequences, protein tertiary structures) as well as to specialised data collections.
Examples are TF-FACTOR/-SITE, OMIM, EMBL, GenBank, PDB, BLOCKS, Pfam
and many more.
109. How is the problem of synonymous names for proteins dealt with in SwissProt /UniProtKB?
110. Where would you find the biologically most relevant information in a UniProt KB entry?
In the General Annotation section of the UniProt entry. Arguably, the most relevant information, like function and description, similarities,
etc., can be found in the annotations of each entry (which are unfortunately, mostly in
111. What sort of features described in SwissProt do you know?
Protein name and synonyms, description and function, sequence, taxonomic
information, literature references, posttranslational modifications, cross-references,
protein families, domains and sites, secondary, tertiary and quaternary structure,
comments, related diseases.
112. Sketch a simple schema comprising the major SwissProt entity types (object classes represented in SwissProt / UniProtKB)
Major entity types mentioned in Swiss-Prot are:
1) Names and origin
113. Why are genes and proteins subject of patents?
a) Some proteins can work as drugs, and the special methods for their purification are a real technical invention.
b) Proteins are important to researchers because they are the links between genes and pharmaceutical development.
114. What is the purpose of patents?
To allow the patent holder to draw some financial benefit from his findings, while
protecting his righteous claim for the intellectual property for the invention (at least for
a given time frame). This motivates individuals and companies to make their research
115. Do patents contain sequence information? yes
Patents of genes contain nucleic acid sequence information; patents of proteins contain
amino acid sequence information
116. What is a “claim” in a patent?
A “Claim” in a patent is a set of phrases following the description of an invention and
describing the composition of invention, thus defining the extension of protection
provided by the patent, e.g. “Protein alcohol dehydrogenase, comprising two
117. Which types of information from patent literature (patent protein database) is not contained in UniProt KB?
UniProt KB and a patent of a protein both contain information about the protein’s amino
acid sequence, its name and description, the name of the discoverer (patent owner)
and the publication date. On the other hand, UniProt KB does not contain information
which is specific to patents, such as the patent number, the rights that the patent grants the
patent owner etc.
118. What is the purpose of the IPI (International Protein Index) database and what makes it distinct from the UniProt approach?
IPI provided a top-level guide to the main databases that describe the proteomes of higher eukaryotic organisms. IPI:
- effectively maintained a database of cross references between the primary data sources
- provided minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript)
What are the major source databases underlying IPI?: UniProt (SwissProt and TrEMBL), RefSeq, Ensembl.
- What is InterPro and in how far is the InterPro scope different from the IPI approach?
InterPro is the integrated resource for protein domains and functional sites. As the
name suggests, its aim is to integrate information from several databases for
the functional description of proteins and their classification into groups based on
structural properties. Its member databases are ProSite, PRINTS, Pfam, ProDom,
SMART, TIGRFAMs, PIR, SUPERFAMILY.
IPI attempts neither such a categorization nor functional inference,
but only tries to create an overview of the complete proteome of a limited number of
- What is the basis of protein-families? When do we speak of protein families?
- What types of sequence alignment do you know? … and what does PROSITE have to do with this?
Sequence alignment can be performed on a global or a local scale. Global alignment
tries to completely align two sequences, while the latter just looks for high-scoring
ProSite contains patterns and matrices describing motifs with known functional and
structural properties. By comparing a given sequence against them, we can find out which
family a protein belongs to and maybe draw conclusions about its structure and function.
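The difference between global and local alignment can be illustrated with a small dynamic-programming sketch. Global scoring here follows the Needleman-Wunsch idea and local scoring the Smith-Waterman idea; the scoring values (match 1, mismatch -1, linear gap -2) and the function name are assumptions chosen for this example only.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the optimal global or local alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment: a bad prefix can always be discarded.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    # Global: score of aligning both sequences end to end.
    # Local: best-scoring pair of subsequences anywhere in the matrix.
    return best if local else score[-1][-1]
```

With `a = "TTTTACGT"` and `b = "ACGTCCCC"`, the local score picks up the shared `ACGT` segment, while the global score is dragged down by the unrelated flanking parts.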
- What is the difference between PROSITE, BLOCK, PFAM and PRINTS? Which ones of the above contribute to InterPro?
All but BLOCKS contribute to InterPro. However, BLOCKS uses ProSite as its data source.
o PROSITE is a database of protein families and domains. It contains information about the conserved regions in the proteins, i.e. motifs.
o PRINTS comprises information about the fingerprints of proteins, i.e. it contains groups of motifs that characterise a protein more precisely than a single motif. Searches through SwissProt and TrEMBL.
o BLOCKS is similar to PRINTS. It also contains information about blocks of conserved regions in proteins (ungapped segments of the most highly conserved regions). Searches through SwissProt.
o PFAM is a database that contains multiple sequence alignments and hidden Markov models covering many common protein domains and families. It has a cartoon-like representation of the protein domains’ architecture.
o SMART was the first to use cartoon-like representations. Two modes: Normal – info from SwissProt, TrEMBL, Ensembl; Genomic – proteomes of organisms with fully sequenced genomes.
124. Define a motif?
A conserved element of a protein sequence that usually correlates with a particular
function. It is a pattern with a biological meaning.
Define a pattern? An element of a protein sequence which has a higher than random probability of
occurrence in proteins.
125. What role does the annotation of InterPro domains with GO terms play for the
prediction of new protein sequences?
Annotation of InterPro domains with GO terms makes it possible to categorize a newly
discovered protein sequence in terms of molecular function, biological process or
cellular localization. It might also make it possible to draw the reverse conclusion, namely which
components of protein families are responsible for a certain function, and hence lead to a better understanding of the process itself.
126. What is a regular expression?: A formal notation for describing string patterns. In biology, it can be used to define and recognize patterns in AA or NA sequences.
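As an illustration of regular expressions on AA sequences, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The helper names are made up for this example, and the conversion only covers the basic syntax elements (`x`, `[...]`, `{...}`, `(n)`, `(n,m)`, `<`, `>`), so it should be read as a simplification rather than a full PROSITE parser.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regex string."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                         # element separator
    regex = regex.replace("x", ".")                        # any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")     # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")      # repetition counts
    regex = regex.replace("<", "^").replace(">", "$")      # sequence termini
    return regex


def find_sites(sequence, pattern):
    """Return (0-based start, matched text) for non-overlapping matches."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in compiled.finditer(sequence)]
```

For instance, the N-glycosylation site pattern `N-{P}-[ST]-{P}` becomes `N[^P][ST][^P]`, and `find_sites("MNGSANPTA", "N-{P}-[ST]-{P}")` reports the `NGSA` site but skips `NPTA`, where the forbidden proline follows the asparagine.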
127. Which experimental procedures produce the data for entries in PDB? : X-ray crystallography, NMR (nuclear magnetic resonance). Also electron microscopy and atomic force microscopy; however, very few structures determined using these methods are in the PDB.
128. Why is it necessary to crystallize proteins in order to obtain structural data?
X-ray scatter from a single molecule is very weak. In a crystal, many molecules are
oriented in the same direction, thus making the X-ray scattering stronger (the waves
can add up in phase and increase the signal). Therefore, a crystal acts as an amplifier
129. What mathematical approach is taken to reconstruct the 3D structure of a protein from X-ray experiments?
Inverse Fourier transform of the diffraction pattern gives electron density.
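A one-dimensional toy version of this relationship is sketched below: complex structure factors are computed from a sampled density, and the inverse discrete Fourier transform recovers that density. The function names and the 1D simplification are assumptions for illustration. In a real experiment only the intensities (squared amplitudes) are measured, so the phases have to be obtained separately before this inversion is possible.

```python
import cmath


def structure_factors(density):
    """Forward transform: F(h) = sum_x rho(x) * exp(2*pi*i*h*x/n)."""
    n = len(density)
    return [sum(rho * cmath.exp(2j * cmath.pi * h * x / n)
                for x, rho in enumerate(density))
            for h in range(n)]


def electron_density(factors):
    """Inverse transform: rho(x) = (1/n) * sum_h F(h) * exp(-2*pi*i*h*x/n)."""
    n = len(factors)
    return [(sum(f * cmath.exp(-2j * cmath.pi * h * x / n)
                 for h, f in enumerate(factors)) / n).real
            for x in range(n)]


def intensities(factors):
    """What the detector records: |F(h)|^2, with the phase lost."""
    return [abs(f) ** 2 for f in factors]
```

Applying `electron_density(structure_factors(rho))` to a list of sampled density values returns the same values up to rounding error.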
130. Which type of screening procedure is based on protein structure information?
Docking of the proteins. Knowing the protein structure allows one to predict the properties of
131. Is PDB a curated database?
Yes. The submitted data must be validated. The validation report is created
automatically, later it is checked by the curator. The submitter and curator discuss the
issues of the validation report.
132. What types of structure can be deposited to the PDB? (see PDB documentation for
133. What is the Chemical Component Dictionary in PDB? What is its purpose?
The Chemical Component Dictionary is an external reference file describing all residue and small molecule components found in PDB entries. This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules.
Interaction databases: 8,9, 10
136. Draft the three major experimental techniques used to investigate protein-protein
· Yeast two hybrid: The protein of interest (A) is coupled to a
transcription factor, which requires a co-factor to bind for activation. Proteins
whose interactions with A we are interested in are coupled to this co-factor.
Consequently, for proteins that interact, the TF will be activated, expressing
gene products that make the yeast resistant to some sort of poison. We can then
remove all yeast organisms that are not resistant and subsequently analyse the
binding proteins, e.g. by gel electrophoresis + MS.
· Immunoprecipitation / Pulldown: Protein A is coupled to a magnetic bead or
detected via antibodies. It is then possible to extract this specific protein from
the cell together with its naturally interacting proteins and to subsequently
· Tandem Affinity Purification: The protein of interest is marked with a TAP tag.
First it binds to the immunoglobulin IgG, the protein complex is extracted
and the TAP tag is cleaved. The second binding is to the calmodulin beads. After that the purified protein complex is eluted.
137. Sketch the conceptual model underlying the REACTOME database
REACTOME uses a frame-based data model (that is very similar to an object-oriented
class hierarchy). Top-level classes are (ReferenceEntity,) PhysicalEntity (subclasses:
EntityWithAccessionedSequence, GenomeEncodedEntity, SimpleEntity, Complex,
EntitySet), CatalystActivity and Event (subclasses: ReactionlikeEvent (Reaction,
BlackBoxEvent, Polymerisation, Depolymerisation) and Pathway). It's important to
note that the same molecule in different cellular compartments, or in differently
post-translationally modified variants, will be represented by several instances.
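A strongly simplified sketch of such a class hierarchy is given below. The class names follow the list above, but the attributes (`compartment`, `accession`, `modifications`, `catalyst` and so on) are assumptions made for this example and do not reproduce the real REACTOME schema; in particular, CatalystActivity is reduced to a plain attribute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhysicalEntity:
    name: str
    compartment: str  # same molecule, other compartment = other instance


@dataclass(frozen=True)
class SimpleEntity(PhysicalEntity):
    """Small molecule such as a metabolite."""


@dataclass(frozen=True)
class EntityWithAccessionedSequence(PhysicalEntity):
    accession: str = ""           # hypothetical link to a reference entity
    modifications: tuple = ()     # post-translational modifications


@dataclass(frozen=True)
class Complex(PhysicalEntity):
    components: tuple = ()


@dataclass(frozen=True)
class Event:
    name: str


@dataclass(frozen=True)
class Reaction(Event):
    inputs: tuple = ()
    outputs: tuple = ()
    catalyst: Optional[PhysicalEntity] = None


@dataclass(frozen=True)
class Pathway(Event):
    events: tuple = ()
```

Because the compartment and the modifications are part of an instance, `SimpleEntity("glucose", "cytosol")` and `SimpleEntity("glucose", "extracellular region")` are two different objects, which mirrors the point made in the answer.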
138. Name at least three major classes of interaction types and define two subclasses (could also be instances) for each one of these interaction types.
Polymerisation Lattice formation
Depolymerisation Disintegration of the matrix layer
139. Develop a strategy for the comparison of networks of interacting proteins: how would
you compare networks?
140. Does the domain structure of a protein allow predicting its interaction partner?
Theoretically it is possible to calculate whether a protein with known domain
structure can interact with another protein whose domain structure is also known.
However, such predictions must be treated with caution, because even if the
interaction is possible theoretically, it may never occur in vivo. If these predictions are
based on computationally predicted structure, the prediction of interaction is even less
reliable, since even slight mistakes in structure prediction might have crucial effects
141. Give two examples for predicates and rules that can be established for a given
interaction of your choice (e.g. protease and substrate; kinase and kinase-substrate).
Protein A and protein B form a complex C.
Protein A binds protein B, influences its activity and dissociates later. (enzyme)
142. Give two examples how protein-protein-interactions are described in scientific text
Proteins bind to each other through a combination of hydrophobic bonding, van der Waals forces, and salt bridges at specific binding domains on each protein.
· If a protein interacts with a ligand : protein-ligand interaction
· If a protein interacts with another protein: protein-protein interaction
· If a protein binds to DNA: protein-DNA interactions
· If a protein binds to RNA: protein-RNA interactions
143. Write down all possible terms in scientific text that indicate protein-protein-
interactions or other types of molecular interactions
§ “...X binds Y...”, “...X interacts with Y...”, “... X phosphorylates Y ...”, “... ligands X and Y...”, “... X and Y form Z...”, “... X inhibits the reaction Y of Z...”, etc
144. What is a SPOKE expansion and how does it differ from a MATRIX expansion
network? (visit the INTACT documentation for the answer)
134. Define a simple, conceptual design for a database representing information on protein-protein interactions
135. Extend this conceptual design towards any given possible biochemical interaction (e.g. interaction of small molecules (metabolites) and proteins (enzymes))
The same schema as in protein-protein interaction can be used, only some additional
entity types (e.g. metabolites, enzymes etc) and relationships between them (e.g. binds
to, interacts with etc) have to be defined.
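One possible way to write down such a schema is sketched below with an in-memory SQLite database. All table names, column names and example rows are hypothetical and chosen only for illustration: interaction partners are stored in one `participant` table with a `kind` column, so that metabolites can be added next to proteins without changing the interaction tables.

```python
import sqlite3

SCHEMA = """
CREATE TABLE participant (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    kind      TEXT NOT NULL CHECK (kind IN ('protein', 'metabolite')),
    accession TEXT
);
CREATE TABLE interaction (
    id               INTEGER PRIMARY KEY,
    interaction_type TEXT NOT NULL,   -- e.g. 'binds to', 'interacts with'
    detection_method TEXT,            -- e.g. 'two hybrid'
    reference        TEXT             -- literature reference
);
CREATE TABLE interaction_participant (
    interaction_id INTEGER NOT NULL REFERENCES interaction(id),
    participant_id INTEGER NOT NULL REFERENCES participant(id),
    role           TEXT,              -- e.g. 'bait', 'prey', 'enzyme'
    PRIMARY KEY (interaction_id, participant_id)
);
"""


def build_example():
    """Create the schema and fill it with made-up example rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO participant VALUES (?, ?, ?, ?)", [
        (1, "protein A", "protein", None),
        (2, "protein B", "protein", None),
        (3, "metabolite M", "metabolite", None),
    ])
    conn.executemany("INSERT INTO interaction VALUES (?, ?, ?, ?)", [
        (1, "interacts with", "two hybrid", None),
        (2, "binds to", None, None),
    ])
    conn.executemany("INSERT INTO interaction_participant VALUES (?, ?, ?)", [
        (1, 1, "bait"), (1, 2, "prey"),
        (2, 1, "enzyme"), (2, 3, "substrate"),
    ])
    return conn


def partners(conn, name):
    """List (partner name, interaction type) for one participant."""
    rows = conn.execute("""
        SELECT other.name, i.interaction_type
        FROM participant AS me
        JOIN interaction_participant AS mine ON mine.participant_id = me.id
        JOIN interaction AS i ON i.id = mine.interaction_id
        JOIN interaction_participant AS theirs ON theirs.interaction_id = i.id
        JOIN participant AS other ON other.id = theirs.participant_id
        WHERE me.name = ? AND other.id != me.id
        ORDER BY other.name
    """, (name,))
    return rows.fetchall()
```

The link table `interaction_participant` allows interactions with more than two partners, such as complexes, at the price of one extra join per query.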
Enzyme and metabolic pathway databases :10
- Give a short explanation of the principles of the Enzyme Classification (EC)
In the EC system for enzyme nomenclature, the first number stands for the main class of the enzyme. The various classes are:
1-oxidoreductases, 2-transferases, 3-hydrolases, 4-lyases, 5-isomerases, 6-ligases.
The second and third numbers narrow this down to the subclass and sub-subclass (for example, the substrate or group acted on and the acceptor), and the fourth is the serial number of the enzyme in its sub-subclass.
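The four-level structure of an EC number can be made explicit with a small parser. This is a minimal sketch: the function name and the returned dictionary layout are assumptions for this example, only the six main classes named above are included, and partially specified numbers such as `3.4.-.-` are supported while preliminary serial numbers (e.g. with a letter prefix) are not.

```python
EC_CLASSES = {
    1: "oxidoreductases", 2: "transferases", 3: "hydrolases",
    4: "lyases", 5: "isomerases", 6: "ligases",
}
LEVELS = ("class", "subclass", "sub-subclass", "serial number")


def parse_ec(ec_number):
    """Split an EC number such as 'EC 1.1.1.1' into its four levels."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four fields in {ec_number!r}")
    # '-' marks a level that is not (yet) specified.
    levels = {level: (None if part == "-" else int(part))
              for level, part in zip(LEVELS, parts)}
    if levels["class"] not in EC_CLASSES:
        raise ValueError(f"unknown main class in {ec_number!r}")
    levels["class name"] = EC_CLASSES[levels["class"]]
    return levels
```

For example, `parse_ec("EC 1.1.1.1")` (alcohol dehydrogenase) yields the main class `oxidoreductases`, and `parse_ec("3.4.-.-")` describes a group of hydrolases without a sub-subclass or serial number.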
- Describe the difference between KEGG and the ENZYME database
ENZYME is just a repository for information regarding enzyme nomenclature. It only
contains entries for each EC-number assigned enzyme, along with recommended and
alternative names and some information about the catalytic activity, co-factors and
links to SwissProt.
KEGG, on the other hand, is a large project comprising a number of different
databases for gene and genome related information, enzymatic pathways and bioactive
chemicals. As a part of one of its sub-branches (KEGG LIGAND), KEGG also
comprises an EC-number-based nomenclature database, but apart from only giving
names and minimal information about enzymes, the KEGG Enzyme database is fully
integrated into the other DBs of the project and cross-references to orthologues, genes,
structures and other databases.
- Describe the commonalities between both databases
They both have entries based on unique enzymes with an EC-number, provide
recommended and alternative names, basic information on catalytic activity, co-factors
and some DB cross-references (e.g. SwissProt, IUBMB EC).
- What attributes of an enzyme would you need for models of metabolite flux reactions?
Metabolic flux = The rate of turnover of molecules through a metabolic pathway or enzyme.
a) Allosteric regulation or other regulatory mechanisms for activation and inhibition of enzyme
b) Its specific activity and Km value for each of the substrates, i.e. the Lineweaver-Burk plot for the enzyme.
c) Information about inhibitors- competitive, non competitive and uncompetitive.
However, for modelling the metabolite flux, certain steady state assumptions need to be made and the enzyme kinetics can be used in part to construct models of the metabolite flux.
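How attributes such as Vmax, Km and an inhibition constant enter a flux model can be sketched with the Michaelis-Menten rate law. The function and parameter names are assumptions for this example, and only competitive inhibition is covered, where the inhibitor raises the apparent Km to Km * (1 + [I]/Ki).

```python
def michaelis_menten_rate(substrate, vmax, km, inhibitor=0.0, ki=None):
    """Reaction rate v = Vmax * [S] / (Km_app + [S]).

    Without an inhibitor Km_app equals Km. With a competitive
    inhibitor (ki given) Km_app = Km * (1 + [I] / Ki).
    """
    km_apparent = km if ki is None else km * (1.0 + inhibitor / ki)
    return vmax * substrate / (km_apparent + substrate)


def lineweaver_burk_point(substrate, vmax, km):
    """Return (1/[S], 1/v), one point of the double-reciprocal plot."""
    rate = michaelis_menten_rate(substrate, vmax, km)
    return 1.0 / substrate, 1.0 / rate
```

At a substrate concentration equal to Km the rate is half of Vmax, which is a quick sanity check for any parameter set taken from a database.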
- Do KEGG or ENZYME comprise the relevant information?
KEGG is a much more comprehensive database which houses multiple databases under it. Since it contains such a set of cross-linked databases, it covers a wider range of data about the enzyme. For example, while KEGG ENZYME gives information about the EC nomenclature, basic reaction, substrates and products, it also contains links to the databases KEGG REACTION, RPAIR, reaction class and COMPOUND. This is in addition to the links to external databases like BRENDA and ExplorEnz. Thus it gives much more information than ENZYME. However, it does not contain information on enzyme kinetics, catalysis and inhibition.
- What is a “rate limiting step”?
It's the slowest step in a reaction, the bottleneck. The whole reaction cannot proceed faster than its slowest sub-process.
- What is a “salvage pathway”?
Salvage pathways are used to recover bases and nucleosides that are formed during degradation of RNA and DNA. This is important in some organs because some tissues cannot undergo de novo synthesis.
- Which principal approaches towards metabolite network simulation do you know?
Petri nets can also be used to simulate a metabolic network.
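A minimal Petri net sketch for a metabolic network is shown below. Places hold token counts for metabolites and enzymes, and each transition is a reaction that consumes and produces tokens. The data layout (dictionaries with `consume` and `produce` keys), the function names and the rule of always firing the first enabled transition are assumptions made for this example.

```python
def enabled(marking, transition):
    """A transition can fire if every input place holds enough tokens."""
    return all(marking.get(place, 0) >= n
               for place, n in transition["consume"].items())


def fire(marking, transition):
    """Return the new marking after firing one transition."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, n in transition["consume"].items():
        new_marking[place] = new_marking.get(place, 0) - n
    for place, n in transition["produce"].items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking


def simulate(marking, transitions, max_steps=100):
    """Fire the first enabled transition until none is enabled."""
    for _ in range(max_steps):
        ready = [t for t in transitions if enabled(marking, t)]
        if not ready:
            break
        marking = fire(marking, ready[0])
    return marking


# Hypothetical two-step pathway A -> B -> C. Each enzyme is consumed
# and produced again, so it is required but remains unchanged.
PATHWAY = [
    {"name": "E1", "consume": {"A": 1, "E1": 1}, "produce": {"B": 1, "E1": 1}},
    {"name": "E2", "consume": {"B": 1, "E2": 1}, "produce": {"C": 1, "E2": 1}},
]
```

Starting from `{"A": 3, "E1": 1, "E2": 1}`, `simulate` converts all three tokens of A into C while the enzyme tokens stay in place.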
- How would you model the role of a cofactor in an enzymatic reaction simulation?
Co-factors are required for the proper functioning of enzymes. Depending on their type
they can either be integral parts of the enzymes (prosthetic groups) or only loosely
bound to it (coenzymes). Either way, one way to model both of them would be as
substrates that have to be present for the catalytic reaction to happen, yet remain
unchanged, i.e. the list of products will again include these cofactors.
- How would you represent complexes of more than one cofactor, one substrate and one enzyme in a database?
You would have to introduce a container entity type that allows for bundling various
entities together. Such a Compound entity would have a 1-to-n relationship to the
cofactors, substrates and enzymes. Probably it would make more sense to have one
container type for each of them, i.e. CompoundEnzyme, CompoundCofactor and
- Which multi-enzyme complexes do you know? In which biosynthesis pathways are they involved?
· Pyruvate dehydrogenase complex transforming pyruvate to acetyl-CoA, which
is required for cellular respiration. It links glycolysis to the citric acid cycle.
· Phosphotransferase system in bacteria for sugar uptake, using
phosphoenolpyruvate as an energy source.
· Tryptophan synthesis multi-enzyme complex for tryptophan synthesis.
- Sketch an ER diagram of a database that represents the citrate-cycle.
- What distinguishes a cartoon – like representation of biochemical reactions and pathways in REACTOME from the representation in KEGG?
The representation in REACTOME is much more interactive than in KEGG. The entities in the pathway maps are coloured, the maps can be zoomed and scrolled. The enzyme names are also displayed on selection which makes it easier to understand. Substrates, products, intermediates and enzymes can be selected and relevant information is displayed in a pane on the left.
In KEGG, the enzyme classification (EC) numbers are given, not the enzyme names. The pathway window is static and cannot be zoomed. If any entity is clicked on, a new window linked to another relevant database under KEGG opens and the information about the entity under consideration can be found there.
- If you would have to design a “virtual physiological human” as a form of in silico
representation of molecular physiology: which entity-classes would this model contain?
Numerous. Depending on the level of detail, it could start from sub-atomic entities like
neutrons and electrons, going up over molecules to macromolecules such as proteins
and enzymes. It would also have to contain entity types for sub-cellular compartments,
cell types, systems and organs.
- Sketch a strategy for the development of an automated process (a software program) for the extraction of IC50 values from tables in scientific publications. How would
you design the problem-solving approach (no need to go into details of software
engineering: just describe how an automated approach should work that autonomously
extracts IC50 values from tables in biochemistry publications).
First we would define a MeSH term for IC50 values (half maximal inhibitory concentration),
Then we would link out to PubMed to find the IC50 values by searching the database
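Once a candidate publication has been found, the table-parsing step itself could look roughly like the sketch below. It is not part of the answer above and rests on several assumptions: the table has already been extracted as a list of rows with a header row, the first column names the compound, the IC50 column is recognised by its header, and units are limited to nM, µM/uM and mM. All names are made up for this example.

```python
import re

HEADER_RE = re.compile(r"IC\s*50", re.IGNORECASE)
UNIT_RE = re.compile(r"(nM|µM|uM|mM)")
VALUE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(nM|µM|uM|mM)?")
TO_NM = {"nM": 1.0, "uM": 1e3, "µM": 1e3, "mM": 1e6}


def extract_ic50(table):
    """Return (compound, IC50 in nM) pairs from one table.

    table: list of rows, each a list of cell strings; row 0 is the header.
    Cells without a number (e.g. 'n.d.') or without any unit are skipped.
    """
    header, *rows = table
    results = []
    for col, title in enumerate(header):
        if not HEADER_RE.search(title):
            continue  # not an IC50 column
        header_unit = UNIT_RE.search(title)
        for row in rows:
            if col >= len(row):
                continue  # ragged row
            value = VALUE_RE.search(row[col])
            if not value:
                continue
            # A unit in the cell overrides the unit given in the header.
            unit = value.group(2) or (header_unit.group(1) if header_unit else None)
            if unit is None:
                continue
            results.append((row[0], float(value.group(1)) * TO_NM[unit]))
    return results
```

Normalising every value to one unit (here nM) makes results from different tables comparable, and skipping cells without a unit avoids guessing.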