"Dhruva Deshpande Diaries :P"

1 Basics: 1 n 2

Name at least three searchable types (categories) of information that are contained in

MEDLINE abstracts

Title, author, journal issue, text words.

What are MeSH terms and what is their purpose?

MeSH (Medical Subject Headings) is a controlled vocabulary thesaurus used for indexing articles in MEDLINE/PubMed. MeSH terms are organized in a hierarchical structure that allows searching at various levels of specificity. They follow a hierarchy (tree) format and NOT a DAG, so they are NOT an ontology.

Q34: How can a search result be “expanded” (I refer to the PubMed help, where “expansion of search results” is a separate point)
Answer: If this question means to expand the search result if I have retrieved too few citations, then here are the following steps you need to do.

· Click the Related citations See all link for a relevant citation to display a pre-calculated set of PubMed citations closely related to the article.

· Remove extraneous or specific terms from the search box.

· Try using alternative terms to describe the concepts you are searching.

Explain “information retrieval” with an example involving Medline and MeSH terms

MEDLINE uses Medical Subject Headings (MeSH) for information retrieval. Engines designed to search MEDLINE (such as Entrez and PubMed) generally use a Boolean expression combining MeSH terms, words in abstract and title of the article, author names, date of publication, etc. Entrez and PubMed can also find articles similar to a given one based on a mathematical scoring system that takes into account the similarity of word content of the abstracts and titles of two articles.
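A minimal sketch of such a Boolean retrieval, assuming a toy collection of three citations. The PMIDs, MeSH assignments and title words are invented for illustration, and real engines such as PubMed work on far larger indexes with more fields.

```python
from collections import defaultdict

# Invented mini-collection: PMID -> indexed MeSH terms and title/abstract words.
citations = {
    1: {"mesh": {"Breast Neoplasms", "Tamoxifen"}, "words": {"tamoxifen", "therapy", "breast"}},
    2: {"mesh": {"Prostatic Neoplasms"}, "words": {"prostate", "cancer", "screening"}},
    3: {"mesh": {"Breast Neoplasms"}, "words": {"breast", "cancer", "screening"}},
}

def build_index(records, field):
    """Inverted index: term -> set of PMIDs whose `field` contains the term."""
    index = defaultdict(set)
    for pmid, record in records.items():
        for term in record[field]:
            index[term].add(pmid)
    return index

mesh_index = build_index(citations, "mesh")
word_index = build_index(citations, "words")

# Boolean expression: "Breast Neoplasms" (MeSH) AND screening (word) NOT tamoxifen (word)
hits = (mesh_index["Breast Neoplasms"] & word_index["screening"]) - word_index["tamoxifen"]
print(sorted(hits))
```

AND maps to set intersection, OR to union and NOT to set difference, which is why an inverted index answers Boolean queries quickly.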

When do we speak of synonyms and when do we speak of homonyms?

Synonyms are different words with identical meaning.

Homonyms are identical words with different meaning.

Sketch the major concepts and the conceptual schema of MedLine

Journal, Author Name, Article, Publication date, References

Explain the differences between OMIM and MedLine

Both OMIM and Medline are literature-based databases. However the OMIM database

is a catalog of human genes and genetic disorders. An entry in OMIM is a review

focusing on a disease, its phenotypic appearance and the genes involved in the

molecular etiology of the disease. Whereas Medline is a bibliographic database

covering a broad scope of biosciences.

OMIM	MedLine
OMIM is a bibliographic database which contains information on all known mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype.	Bibliographic database that cites abstracts from Biomedical journals
A database which provides access to curated data gathered from public scientific literature as well as other sources	Medline, a database of indexed abstracts from scientific biomedical literature
Each entry is obtained and compiled from several reference sources.	Each entry corresponds to a single journal article.
OMIM does not employ MeSH terms	Uses a controlled vocabulary called MeSH (Medical Subject Headings)
OMIM is a heavily curated database	MEDLINE is also a curated database.
OMIM is focused on human disease and gathers any kind of information which helps to understand the cause of disease.	Medline contains details of mutagenesis experiments whose relevance might be yet to be established.

What are the three root concepts of GO?

Molecular function : the elemental activities of a gene product at the molecular level, such as binding or catalysis
Biological process: operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.
Cellular Component: the parts of a cell or its extracellular environment.

What does “annotation” mean?

A combination of comments, notations, references, and citations, either in free format

or utilising a controlled vocabulary, that together describe all the experimental and

inferred information about a gene or protein. Annotations can also be applied to the

description of other biological systems. Batch, automated annotation of bulk biological

sequence is one of the key uses of Bioinformatics tools.

Which controlled vocabularies do you know besides GO?

MeSH, HGNC, sequence Ontology, Brenda enzyme source ontology, EMAP,

SwissProt keywords, MAGE-OM

· GO

· MeSH (Medical subject Headings)

· IUPAC

· EC (Enzyme nomenclature)

Name three of the most important / most informative entity-types that can be found in

EMBL or EntrezGene

Entity-types: organism, molecule, sequence.

Which objects in biology correspond to these entity-types?

Organism: animals, plants, fungi, bacteria, protozoa.

Molecule: DNA, RNA.

Sequence: nucleic acid sequence, i.e. adenine, thymine, guanine, cytosine in case of

DNA; adenine, uracil, guanine, cytosine in case of RNA.

Define a gene and name three attributes

A locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions.

A gene is a unit of DNA which performs one function. Usually, this is equated with the production of one RNA or one protein. A gene contains coding regions, introns, untranslated regions and control regions.

Sequence

Intron & exon positions

Protein binding sites

Locus

Who is the owner of an entry in EMBL?

The person who submits (submitter) the gene sequence (or any entry) to the EMBL database is the owner. The authority to change that entry is also with the submitter.

Name the most important data types / entity types that you have to provide with an

entry in EMBL

1 Submitter Information

2 Release Date Information

3 Sequence Data, Description and Source Information

4 Reference Citation Information

5 Feature Information (e.g. coding regions, regulatory signals)

What does “vector clipping” mean?

“Vector clipping” means separating the DNA segment of interest from the vector DNA when the DNA of interest is contaminated with vector DNA. EBI provides a vector screening service that uses the BLAST algorithm for the “vector clipping” procedure.

How come that so many genes have more than one name?

For a long time there was no well-established, curated gene naming system. This led to situations where the same gene was discovered by different people and was therefore given different names. A gene may also get more than one name when partial gene fragments are discovered first and each is taken for a whole gene. Once it turns out that the partial sequences belong to one gene, that gene carries several different names.

m-RNA splicing

Explain the difference between a data repository and a curated database

A data repository is an uncurated place to store data. The responsibility for the accuracy of the data in a repository lies with the submitter, e.g. EMBL. In a curated database a team of specialists checks the incoming entries to avoid ambiguities and redundancy, e.g. SwissProt.

What is a knowledge base?

A knowledge base is a special kind of database for knowledge management. A knowledge base provides a means for information to be collected, organised, shared, searched and utilised. It is accurate and non-redundant.

What are mRNA, hnRNA, cDNA, rRNA and tRNA and how are they represented in

EMBL?

· mRNA (messenger RNA) is transcribed from DNA. The role of mRNA is to move the information contained in DNA to the translation machinery (ribosomes).

· hnRNA is a precursor RNA, i.e. an RNA transcript before it is processed into

mRNA, rRNA, tRNA, or other cellular RNA species, any RNA species that is

not yet the mature RNA product.

· cDNA (complementary DNA) is a piece of DNA copied from a mature mRNA.

· rRNA is ribosomal RNA. It is a component of the ribosomes, the protein

synthetic factories in the cell.

· tRNA is a transfer RNA. It transfers an amino acid to the ribosome, so that the

amino acid would be added to a polypeptide chain.

· Small nuclear ribonucleic acid (snRNA) is a class of small RNA molecules that are found within the nucleus of eukaryotic cells.

In EMBL under SRS interface there is a field molecule, which comprises these values.

What does “coding sequence” mean and what is a “non-coding sequence”?

“Coding sequence” is the portion of a gene or an mRNA which actually codes for a protein. Introns are not coding sequences; nor are the 5' or 3' untranslated regions. The coding sequence in a cDNA or mature mRNA includes everything from the ATG (or AUG) initiation codon through to the stop codon, inclusive.

“Non-coding sequence” is a sequence that is not translated into protein, e.g. introns, promoters, transcription factor binding sites, and all other sites that do not code for protein.
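The ATG-to-stop definition can be sketched in a few lines. This is only an illustration: it takes the first ATG as the initiation codon, which real annotation does not simply assume, and the example sequence is invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def coding_sequence(cdna):
    """Return the stretch from the first ATG through the first in-frame stop codon
    (stop codon included), or None if no such stretch exists."""
    seq = cdna.upper().replace("U", "T")   # accept mRNA (AUG) as well as cDNA (ATG)
    start = seq.find("ATG")
    if start == -1:
        return None
    for pos in range(start, len(seq) - 2, 3):   # walk codon by codon in frame
        if seq[pos:pos + 3] in STOP_CODONS:
            return seq[start:pos + 3]
    return None

# Invented transcript: 5' UTR "GGCACC", coding sequence, 3' UTR "CCTAAA".
mrna = "GGCACCATGGCTTTTGAATAGCCTAAA"
cds = coding_sequence(mrna)
print(cds)
```

Everything before the returned stretch corresponds to the 5' UTR and everything after it to the 3' UTR, i.e. non-coding sequence within the mature transcript.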

How is information on the exon-intron structure of a gene represented in an EMBL entry?

Information about exon-intron structure in EMBL is stored in the “Features” field:

“Key” defines intron/exon, “Location” defines the location of intron/exon, e.g. Intron

10..50.
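A small sketch of reading such Key/Location pairs from the feature table (FT lines) of a flat-file entry. The excerpt and its coordinates are invented, and the parser only handles plain start..end locations, not operators such as join() or complement().

```python
import re

# Hypothetical excerpt of a feature table; coordinates are made up.
ENTRY = """\
FT   source          1..300
FT                   /organism="Homo sapiens"
FT   exon            1..120
FT                   /number=1
FT   intron          121..200
FT   exon            201..300
FT                   /number=2
"""

# "FT", three blanks, the feature key, then a simple start..end location.
FEATURE_LINE = re.compile(r"^FT\s{3}(\S+)\s+(\d+)\.\.(\d+)\s*$")

def exon_intron_structure(flatfile_text):
    """Return (key, start, end) tuples for all exon and intron features."""
    features = []
    for line in flatfile_text.splitlines():
        match = FEATURE_LINE.match(line)
        if match and match.group(1) in {"exon", "intron"}:
            features.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return features

structure = exon_intron_structure(ENTRY)
print(structure)
```

Qualifier lines (those starting with "/") are skipped because they do not begin with a feature key directly after the "FT" prefix.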

Give examples for “values” of the entity type “molecule” in EMBL

genomic DNA, genomic RNA, mRNA, other DNA, other RNA, pre-RNA, rRNA,

snoRNA, snRNA, tRNA, unassigned DNA, unassigned RNA, viral cRNA

Who issues an accession number?

The curator, if the database is curated; otherwise it is issued automatically. (For EMBL: the curator.)

What is the difference between an accession number and a database identifier?

An accession number (AC) is assigned to each sequence upon inclusion into a database such as UniProtKB. ACs are stable from release to release. If several entries are merged into one to minimize redundancy, the ACs of all merged entries are kept. Each entry has one primary AC and optional secondary ACs.

The entry name (ID) is a unique identifier, often containing biologically relevant information. It is sometimes necessary to change IDs for reasons of consistency, e.g. (1) to ensure that related entries have similar names, or (2) when an entry is promoted from the TrEMBL section of UniProtKB (computationally annotated records) to the Swiss-Prot section (fully curated records). The AC, however, is always conserved.

We only have one ID per entry but we may have several ACs per entry.
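The one-ID, many-ACs rule can be pictured with a small data structure. This is a sketch only: the class, the merge function and all identifiers are invented for illustration and do not reproduce any database's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    entry_name: str                 # ID: exactly one per entry, may change over time
    primary_ac: str                 # stable accession number
    secondary_acs: list = field(default_factory=list)

def merge(kept, absorbed):
    """Merge `absorbed` into `kept`; all accession numbers of both entries survive."""
    secondary = kept.secondary_acs + [absorbed.primary_ac] + absorbed.secondary_acs
    return Entry(kept.entry_name, kept.primary_ac, secondary)

# Made-up identifiers in the general style of UniProtKB.
a = Entry("ABC1_HUMAN", "P00001")
b = Entry("ABC1B_HUMAN", "Q00002", ["Q00003"])
merged = merge(a, b)
print(merged.entry_name, merged.primary_ac, merged.secondary_acs)
```

After the merge an old citation of any of the three accession numbers still resolves to the surviving entry, while only one entry name remains.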

Why is EMBL synchronized with DDBJ and NCBI/GenBank?

To make effective sharing of scientific information possible, to give access to all submitted data through the gateways of all three databases, and to create a common system with minimum ambiguity.

Why is EMBL split in EMBL, EMBL updates and EMBL (whole genome shotgun) at

the SRS interface?

EMBL is updated very frequently. To keep the query engine fast, the database must be reindexed whenever new entries are submitted. However, EMBL comprises so many entries that reindexing all of them takes a lot of time. The database is therefore split in order to decrease the time needed for reindexing.

What is a contig?

A contig is a longer stretch of DNA built from short fragments, e.g. a long contiguous DNA sequence assembled from shotgun sequencing reads.

A contig (from contiguous) is a set of overlapping DNA segments that together represent a consensus region of DNA. In bottom-up sequencing projects, a contig refers to overlapping sequence data (reads); in top-down sequencing projects, contig refers to the overlapping clones that form a physical map of the genome that is used to guide sequencing and assembly. Contigs can thus refer both to overlapping DNA sequence and to overlapping physical segments (fragments) contained in clones, depending on the context.

What does “whole genome assembly” mean?

Whole genome assembly refers to the process of taking a large number of short DNA sequences, all of which were generated by a shotgun sequencing project, and putting them back together to create a representation of the original DNA sequence.
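A toy illustration of "putting them back together": a greedy merge of the pair of reads with the longest suffix/prefix overlap. It is a sketch under strong assumptions (error-free reads, one strand only, no repeats) and not how production assemblers work; the reads are invented.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs

reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
assembly = greedy_assemble(reads)
print(assembly)
```

If some reads share no overlap of at least `min_len` bases, the function returns several contigs, which mirrors the gaps left in a real draft assembly.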

What differentiates EMBL from ENSEMBL?

EMBL is a sequence (molecule) focused database and ENSEMBL is a genome (organism) focused database. The Ensembl project provides automated genome annotation and subsequent visualisation of the annotated genomes.

Sketch the process of sequencing and identify possible sources for errors

DNA extraction → fragmentation → cloning into vectors → transformation into bacteria and growth → sequencing of the library → assembly of contiguous fragments.

Possible sources for errors:

· Errors due to experimental process:

No bands or weak bands, caused by no DNA or far less DNA than necessary in the tube, a missing primer, or inefficient interaction between primer and template. Contamination in the template, which later causes poor resolution. Sequence that looks good in some places but not in others, which can be caused by salt in the DNA, too much DNA in the reaction, an unknown impurity "poisoning" the Taq processivity, an unknown contaminant increasing the binding of dyes in the enzyme's active site, secondary structure in the template, unincorporated dyes remaining in the sample, or a contaminant in the sample that binds unincorporated dyes.

· Errors of the data acquisition process:

Errors in a determined DNA sequence can be caused by flaws in the translation operations of the electrophoresis signal or quirks that arose during the experiment itself. This becomes visible in the wide diversity of data that is obtained even when using a single chemistry type, let alone different ones: under- and over- oscillations of the signals, unseparated curves (compression artefacts), and signal peaks or dropouts are frequent. Incorrect signal analysis raises errors in the base calling process of the signals and constitutes a limiting factor in the automation of assembly processes. Basically, there are three types of errors introduced into the data by electrophoresis and subsequent base-calling : insertions, deletions and mismatches.

· Errors due to biology:

While errors due to the data acquisition process itself are problematic enough, the processes that precede it involve multiple steps of biological handling and add an additional level of complexity to the task. One of the larger inconveniences is due to the method used to amplify small DNA clones which consist of adding an amplification vector and inserting the resulting construct into host cells. This vector/payload construct leads to an unpleasant consequence: any DNA sequence determined is likely to contain some part of the sequencing vector itself at the start - and sometimes the end - of the determined sequence. These stretches must of course be electronically removed as they do not belong to the target DNA that is to be sequenced. Unfortunately, the vector sequences are at the very front and rear of the sequence, which are the most error prone parts. Due to these errors, simple pattern matching algorithms often fail to recognise the sequencing vector completely.

The self-replication of the host cells itself induces two further kinds of errors: 1) errors in the base replication itself, which most of the time lead to small point mutations (SNPs, Single Nucleotide Polymorphisms), or 2) errors on a larger scale where the vector can “lose” its sequence payload, recombine with other plasmids or even recombine with some sequence parts of the host cell.

Name at least three attributes listed under “Features” in a standard EMBL entry


organism	Nitella hyalina
strain	KGK0190
mol_type	mRNA
dev_stage	all stages
clone_lib	LIBEST_026595 Nitella hyalina EST library
note	Culture harvested from various time points during the day and across the life cycle.
db_xref

Sketch the major parts of a typical EMBL entry: what categories does a “normal”

EMBL entry have?

1. General information (primary Acc, Acc, sequence length, entry creation date, modification date, ID, ...)

2. Description (description, keywords, organism, organism classification)

3. Features (source)

4. Sequence (characteristics, sequence)

5. References

What are “EST sequences” ? and which database comprises information on EST

sequences?

ESTs (expressed sequence tags) are short pieces of cDNA sequence. Tags can be allocated to a certain position (tag markers). ESTs consist of 100-400 base pairs. ESTs are produced via single-pass ("one-shot") sequencing of cDNA clones. Making many ESTs of one long DNA sequence allows this sequence to be reconstructed. UniLib and UniGene comprise EST sequences.

What is a 3´UTR ?

The untranslated region at the 3'-end of an mRNA, i.e. following the coding region. It

contains the polyadenylation signal, as well as binding sites for proteins that affect the

mRNA's stability or location in the cell.

· A polyadenylation signal sequence, which marks the termination of the transcript about 30 base pairs downstream of the signal, followed by a few hundred adenine residues.

· Binding sites for proteins that influence the stability or the transport of the mRNA.

· Binding sites for miRNAs.

What is a transcription factor site and why would you collect information on these

sites in a database?

Transcription factors are proteins that interact with DNA and initiate or inhibit the

process of transcription upon binding to DNA. A TF site is a region on the DNA to

which a TF can bind. It is important to know these sites to understand how (and

which) TFs can influence the regulation of specific genes.

Explain the fundamentals of gene regulation (activation of transcription; features of
DNA that mediate and control transcription)

For transcription to start, RNA polymerase must bind to the promoter (in prokaryotes). In eukaryotes, TFs must first bind to the TF sites; only then is the polymerase able to recognize the promoter region.

Explain how in silico prediction of transcription factor binding sites can be validated

through molecular biology experiments

You can use pulldown assays to verify predicted binding sites. Attach a magnetic bead to the nucleotide sequence in question, allow it to bind to proteins, pull out the complexes using magnetic force, wash off unbound proteins and check whether the sequence bound to the TFs as predicted (using gel electrophoresis, MS, western blotting, ...). Other methods: yeast two-hybrid, site-directed mutagenesis.

What is a 5´UTR?

The untranslated region at the 5'-end of an mRNA. It contains several functional

elements, like binding sites for proteins that alter the RNA's stability or location in the

cell, as well as sequences that promote the initiation of translation.

How does the transcriptional machinery know about the beginning and the end of a

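“gene” (a transcript)?

The promoter marks the start of transcription and a terminator (in eukaryotes, the polyadenylation signal) marks its end. Start and stop codons delimit translation, not transcription.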

Which properties (features) define a class of transcription factors in TFFACTOR?

The CLASS (CL) feature defines that.

Name at least three different classes (types) of transcription factors

Leucine zipper factors, helix-loop-helix factors, helix-loop-helix / leucine zipper factors, NF-1, heat shock factors.

How are evidences for the presence of a certain domain or motif represented in

TFFACTOR? (I refer to the FT line in TFFACTOR entries)

The FT line code of a TFFACTOR entry represents a certain domain or motif. FT means “feature table”; each FT line lists the first and the last position of the feature.

75. What are microarrays?

A microarray (= gene chip, gene array) is a device for the large-scale, simultaneous measurement of gene expression in a sample of mRNA. It consists of a small solid support (similar to a computer chip, hence the alternative names) onto which a collection of polynucleotides has been fixed, chosen in such a way that they selectively hybridise with the cDNA of interest. Spots specific to a gene are distributed in an ordered manner over the chip in some sort of array. Several thousand of these spots might be present on just one MA.

The area containing exactly one defined species of biomolecule is called an “element” (or feature). Thousands of “elements” (features) can be immobilized at very high density, allowing hybridization or binding events of a very high number of biomolecules to be monitored simultaneously.

76.What are alternative gene expression determination technologies?

Low-to-mid plex technologies (older techniques):

· Western blot

· Northern blot

· Fluorescent in-situ hybridization

· Real time PCR

Higher-plex technologies:

· Next Generation Sequencing Technology

· Expressed Sequence Tag (EST) analysis

· Serial Analysis Gene Expression (SAGE)

77.Explain the microarray workflow in the laboratory and map the major MAGE-OM

classes to the workflow

· Extraction of the sample, cell, (BioMaterial) from studied tissues of an organism

(BioSource) and preparation (BioSample) via a protocol (Treatment).

· Extraction of total RNA and purification to mRNA

· Labeling of extract (LabeledExtract) using dyes or other markers (Compound).

· Clean-up of the extract (BioAssayTreatment)

· Hybridisation (also a BioAssayCreation (subclass Hybridization)).

· Scanning (Image), spot finding, quantification (FeatureExtraction producing BioAssay and BioAssayData) gives us Features (subclass of DesignElement).

· Further analysis and display (MeasuredBioAssayData).

78.What is the difference between one-colour (one channel) and two-colour (two

channel) microarray assays?

Two-colour (two channel) microarrays are typically hybridized with cDNA prepared from two samples to be compared, which are labeled with two different fluorescent dyes (Cy3, green, and Cy5, red). The two Cy-labeled cDNA samples are mixed and hybridized to a single microarray that is then scanned in a microarray scanner to visualize fluorescence of the two fluorophores after excitation with a laser beam of a defined wavelength. Relative intensities of each fluorophore may then be used in ratio-based analysis to identify up-regulated and down-regulated genes; they are not used to determine absolute levels of gene expression.

In One-colour (one channel) microarrays, the arrays provide intensity data for each probe or probe set indicating a relative level of hybridization with the labeled target. However, they do not truly indicate abundance levels of a gene but rather relative abundance when compared to other samples or conditions when processed in the same experiment. Each RNA molecule encounters protocol and batch-specific bias during amplification, labeling, and hybridization phases of the experiment making comparisons between genes for the same microarray uninformative.
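The ratio-based analysis of a two-colour array can be sketched as a log2 ratio per spot. The intensities are invented, and both the assignment of Cy5 to the test sample and the twofold threshold are assumptions made for this illustration.

```python
import math

def log2_ratio(cy5, cy3):
    """log2 of the red (Cy5) over the green (Cy3) intensity of one spot."""
    return math.log2(cy5 / cy3)

def classify(m, threshold=1.0):
    """Call a spot up- or down-regulated if it changes at least twofold (|log2| >= 1)."""
    if m >= threshold:
        return "up"
    if m <= -threshold:
        return "down"
    return "unchanged"

# Invented background-corrected intensities: (gene, Cy5, Cy3)
spots = [("geneA", 4000.0, 1000.0), ("geneB", 500.0, 2000.0), ("geneC", 1200.0, 1200.0)]

calls = {gene: classify(log2_ratio(cy5, cy3)) for gene, cy5, cy3 in spots}
print(calls)
```

The log scale makes up- and down-regulation symmetric: a fourfold increase gives +2 and a fourfold decrease gives -2. The result is a relative statement about the two samples, not an absolute expression level.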

79.What are the consequences of one-channel versus two-channel hybridization for

normalization and comparison between chip experiments?

One of the advantages of the One-colour system lies in the fact that an aberrant sample cannot affect the raw data derived from other samples, because each array chip is exposed to only one sample (as opposed to a Two-color system in which a single low-quality sample may drastically impinge on overall data precision even if the other sample was of high quality).

Another benefit is that data are more easily compared to arrays from different experiments so long as batch effects have been accounted for.

One drawback to the one-color system, however, is that, when compared to the two-color system, twice as many microarrays are needed to compare samples within an experiment.

Each RNA molecule encounters protocol and batch-specific bias during amplification, labeling, and hybridization phases of the experiment making comparisons between genes for the same microarray uninformative

80.Give an example for a one-colour microarray platform

Examples- Affymetrix "Gene Chip", Illumina "Bead Chip", Agilent single-channel arrays, the Applied Microarrays "CodeLink" arrays

81.Give an example for a two-colour microarray platform

Agilent, Eppendorf

82.Explain the following MAGE-OM classes:

· Array: physical substrate, its annotations and features

· BioMaterial: superclass of all biologically important substances (e.g. cell, DNA)

· BioSource: (subclass of BioMaterial) the original source material before treatment

· Hybridization: (subclass of BioAssayCreation) the event of hybridization of a BioSample with the microarray

· Feature: (subclass of DesignElement) intended position on the array

· FeatureExtraction: (subclass of BioEvent) extracting the numerical data from hybridized MA images

· Compound: (class) may consist of various simple and complex compounds

· OntologyEntry: a single entry from an ontology or CV

83.What are the key concepts used in the conceptual model of ArrayExpress?

ArrayExpress is based on MAGE-ML:

Superclass: BioMaterial; subclasses: BioSample, BioSource, Labeled Extract;

Superclass: BioEvent; subclasses: BioAssay Creation, BioAssay Treatment, Feature

Extraction, Treatment;

Class: Compound;

Class: Design Element; subclasses: Feature, Reporter, CompositeSequences;

84.Why is it essential to capture all data that describe the origin of the biosample?

Microarray experiments provide information about gene expression, which is a dynamic process. Gene expression differs between types of cells and may also change over time. Gene expression depends on the biosource and on biosample preparation (e.g. which drugs were used, how long the sample was prepared, etc.). Thus a description of the biosample is necessary to obtain precise information that can be compared with information obtained from other experiments.

85.Outline the major differences between the conceptual design of GEO and

ArrayExpress

Both GEO and ArrayExpress are MIAME compliant and both use similar schemas based on MAGE-OM. However, until recently the basic difference between these databases was that in ArrayExpress there was no possibility to search for the data of a gene of interest. GEO has GEO Profiles for that purpose. On the other hand, ArrayExpress now has a prototype of such a program, so the difference between GEO and ArrayExpress is decreasing.

Another difference is that ArrayExpress contains only data from microarray experiments, whereas GEO in addition comprises data from non-array techniques such as serial analysis of gene expression (SAGE) and mass spectrometry proteomic data.

86.What are “abundantly expressed” genes?

House-keeping genes. Housekeeping gene – A gene that is (theoretically) expressed in all cells because it provides basic functions needed for sustenance of all cell types. Also, genes involved in metabolism are abundantly expressed in cells.

87.What is the typical distribution of all mRNA species expressed in a cell?

Quantitative distribution: Abundantly-expressed genes’ mRNA comprises ~90% of all

the quantity of mRNA in the cell, whereas only ~10% of mRNA belongs to regulatory

genes.

Qualitative distribution: From a qualitative point of view, the mRNA of regulatory genes shows greater variety in the cell than abundantly expressed mRNA.

88.Is this distribution cell-type specific?

There are two ways to describe distribution of mRNA in the cell: Quantitative - not

cell type specific (always the same shape of the curve, a few genes are expressed a lot,

the rest a little); Qualitative - cell type specific (apart from housekeeping-genes, the

genes that are expressed are specific to each cell type and internal and external conditions).

Name at least two foreign keys that could link a microarray database to a nucleotide

sequence database or UniProt KB

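The accession numbers of the probed sequences: EMBL/GenBank/DDBJ accession numbers link to a nucleotide sequence database, UniProtKB accession numbers link to UniProt KB (cf. Gene Expression Atlas).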

Is there an accession number for microarray data?

Experiments and array designs in ArrayExpress are given unique accession numbers in the format of

E-XXXX-n for experiments
A-XXXX-n for array designs

The code XXXX denotes the data source, e.g. GEOD for NCBI Gene Expression Omnibus (GEO).
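A short sketch for checking and splitting accession numbers of the E-XXXX-n / A-XXXX-n form. It assumes that XXXX is always four upper-case letters and n a plain integer; the sample accession numbers are made up.

```python
import re

# Assumed pattern: E or A, a four-letter source code, a running number.
ACCESSION = re.compile(r"^(?P<kind>[EA])-(?P<source>[A-Z]{4})-(?P<number>\d+)$")

def parse_accession(accession):
    """Return (kind, source code, number), or None if the pattern does not match."""
    match = ACCESSION.match(accession)
    if match is None:
        return None
    kind = "experiment" if match.group("kind") == "E" else "array design"
    return kind, match.group("source"), int(match.group("number"))

print(parse_accession("E-GEOD-1234"))
print(parse_accession("A-AFFY-33"))
print(parse_accession("GSE1234"))
```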

Are images taken from microarray scanners part of the database schema of

ArrayExpress?

No, the raw data collected at the source generated by the scanner machine for microarrays does not include images, only .txt or .gpr files. They are under the file name of “Data files and data matrices - raw data”. Images are difficult to use in queries, nor can they easily facilitate meta-analysis of combined datasets. One of the most important uses of microarray images are in quality control.

How are experimental series represented in GEO?

GEO is conceptually divided into three components: Platform (for the physical MA), Sample (for one hybridization) and Series (for the experiment). There is a 1-to-n relationship from Platform to Sample and another one from Series to Sample, which makes it easy to represent a series of experiments with many hybridizations on the same type of MA (or several types).
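The two 1-to-n relationships can be sketched with three small classes. This is an illustration of the concept only, not GEO's actual schema; the accession numbers are invented, although the GPL/GSM/GSE prefixes follow GEO's naming.

```python
from dataclasses import dataclass, field

@dataclass
class Platform:
    accession: str
    samples: list = field(default_factory=list, repr=False)   # 1 Platform : n Samples

@dataclass
class Series:
    accession: str
    samples: list = field(default_factory=list)               # 1 Series : n Samples

@dataclass
class Sample:
    accession: str
    platform: Platform   # each hybridization was done on exactly one platform

def add_sample(series, sample):
    """Register a sample with its series and with the platform it was hybridized on."""
    series.samples.append(sample)
    sample.platform.samples.append(sample)

chip = Platform("GPL0001")
experiment = Series("GSE0001")
for n in range(1, 4):
    add_sample(experiment, Sample(f"GSM000{n}", chip))

platforms_used = {s.platform.accession for s in experiment.samples}
print(len(experiment.samples), platforms_used)
```

A series using several array types would simply contain samples that point to different Platform objects.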

If you ever visited ArrayExpress, you should have read about “Gene Expression

Atlas”. What does the “Gene Expression Atlas” comprise?

The Gene Expression Atlas is a database serving queries for condition-specific gene expression patterns (e.g. genes over-expressed in a particular tissue or disease state) as well as broader exploratory searches for biologically interesting genes/samples. The Atlas replaces the ArrayExpress Data Warehouse.

When you search the atlas, you provide some general query parameters:

which genes you are interested in
the direction of differential expression: up, down or both
which organism the gene belongs to
what conditions (assay and sample attributes that are experimental factors)

In ArrayExpress, you will find data sets with the designator “tiling array” or “genome

tiling experiment”. What is the difference between a “classical” microarray

experiment and a genome tiling experiment? Explain!

Tiling arrays are a subtype of microarray chips. Like traditional microarrays, they function by hybridizing labeled DNA or RNA target molecules to probes fixed onto a solid surface. Tiling arrays differ from traditional microarrays in the nature of the probes. Instead of probing for sequences of known or predicted genes which may be dispersed throughout the genome, tiling arrays probe intensively for sequences which are known to exist in a contiguous region of genome. This is useful for characterizing regions of genome which are sequenced but with local functions that are largely unknown. Tiling arrays aid in transcriptome mapping as well as in discovering sites of DNA-protein interaction (ChIP-chip), of DNA methylation (MeDIP-chip), and of sensitivity to DNase (DNase Chip), in addition to other uses (e.g. array CGH). In addition to the advantage of being able to detect previously unidentified genes and regulatory sequences, improved quantification of transcription products is possible. Specific probes are present in millions of copies (as opposed to only several as in traditional arrays) within an array unit called a feature, with anywhere from 10,000 to more than 6,000,000 different features per array. Variable levels of mapping resolution are obtainable by adjusting the amount of sequence overlap between probes, or the amount of known base pairs between probe sequences, as well as the length of the probes themselves.
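How overlap or spacing between probes sets the resolution can be shown with a small sketch that lays probes across a contiguous region. The region length, probe length and step sizes are arbitrary example values.

```python
def tile_probes(region_length, probe_length, step):
    """Start/end positions (1-based, inclusive) of probes tiled across a region.

    step < probe_length gives overlapping probes,
    step > probe_length leaves a gap between neighbouring probes.
    """
    if step <= 0 or probe_length <= 0:
        raise ValueError("probe_length and step must be positive")
    last_start = region_length - probe_length + 1
    return [(s, s + probe_length - 1) for s in range(1, last_start + 1, step)]

overlapping = tile_probes(100, 25, 10)   # neighbours share 15 bases
spaced = tile_probes(100, 25, 35)        # 10 uncovered bases between neighbours
print(len(overlapping), overlapping[:2])
print(spaced)
```

A smaller step means more probes per region and therefore finer mapping resolution, at the cost of more features on the array.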

What Boolean operators are allowed for querying ArrayExpress (advanced search)?

Enter two or more keywords in the search box with the operators AND, OR or NOT.

AND is the default search term; a search for 'prostate breast' will return hits with a match to 'prostate' AND 'breast'.

Search terms of more than one word must be entered inside quotes otherwise only the first word will be searched for. E.g. transcription AND Rattus norvegicus will effectively be a search for transcription AND Rattus.
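The quoting rule is easy to get wrong when queries are assembled by a script. Below is a minimal helper, assuming only the rules stated above (operators AND, OR, NOT; multi-word terms in quotes); the function name is hypothetical and it builds a query string only, without contacting ArrayExpress.

```python
def build_query(terms, operator="AND"):
    """Join search terms with one Boolean operator, quoting multi-word terms."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return f" {operator} ".join(quoted)

query = build_query(["transcription", "Rattus norvegicus"])
print(query)
```

Without the quotes the second term would be read as Rattus only, as described above.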

What fields can be searched in GEO and what fields can be browsed?

GEO navigation

Search: DataSets, Gene profiles, GEO accession, GEO BLAST

Browse: DataSets, GEO accessions, Platforms, Samples, Series

What are GEO profiles?

The GEO Profiles database stores gene expression profiles derived from curated GEO DataSets. Each Profile is presented as a chart that displays the expression level of one gene across all Samples within a DataSet. Experimental context is provided in the bars along the bottom of the charts making it possible to see at a glance whether a gene is differentially expressed across different experimental conditions. Profiles have various types of links including internal links that connect genes that exhibit similar behaviour, and external links to relevant records in other NCBI databases.

GEO Profiles can be searched using many different attributes including keywords, gene symbols, gene names, GenBank accession numbers, or Profiles flagged as being differentially expressed.

The GEO DataSets database stores original submitter-supplied records (Series, Samples and Platforms) as well as curated DataSets. See the Overview for information about these different record types and how they are related to each other.

Curated DataSets form the basis of GEO's advanced data display and analysis features, including tools to identify differences in gene expression levels and cluster heatmaps. GEO Profiles are derived from GEO DataSets. Not all original submitter-supplied records have been assembled into curated DataSets yet.

The GEO DataSets database can be searched using many different attributes including keywords, organism, DataSet type and authors. Examples and full details about how to search for GEO DataSets of interest are provided in the Querying GEO DataSets and GEO Profiles page.

What is the major source of knowledge on proteins? UNIPROT

99. Which parts of UniProt can be distinguished and what is the purpose of the

partitioning of UniProt?

Three major parts, plus a fourth for metagenomic data:

1. UniProt Knowledge Base (consisting in turn of SwissProt, TrEMBL and PIR) is the curated DB of all knowledge about proteins (names, sequence, taxonomic and bibliographic data + annotations: e.g. functional info, posttranslational modifications, diseases, structural info, etc.)

2. UniRef (100/90/50): Gives clustered sets of protein sequences (with different clustering thresholds) to speed up searches and find similarities.

3. UniParc: A comprehensive, non-redundant repository about the history of protein sequences.

4. UniProt Metagenomic and Environmental Sequences (UniMES)

Purpose of partitioning: To clearly distinguish between the curated knowledge in the

KB and the archive... Each part has its own purpose.

100. What are the original root databases of UniProt KB?

SwissProt (the supercurators), TrEMBL (computationally translated ORFs from

EMBL), PIR (protein info. Resource)

101. Why is UniProt KB called a “curated” database?

Because there are curators to take care of it. Each entry is checked by one or more

experts, double-checked with other entries (minimally redundant!) and annotated

carefully with state-of-the-art knowledge (e.g. functional info, posttranslational

modifications, diseases, structural info, cross-references). N.B: a part of the UniProt

KB is actually not yet curated, but only annotated computationally (-> TrEMBL)

102. Why is it called a “knowledge base”?

UniProtKB is the central hub for the collection of functional information

on proteins, with:-

· accurate,

· consistent, and

· rich annotation.

In addition to capturing the core data mandatory for each UniProtKB entry (principally, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies,

classifications and cross-references, and clear indications of the quality of annotation in the form of evidence attribution of experimental and computational data.

103. What is meant by the field “synonyms” in UniProt KB?

Other names by which the protein is known. There can be MANY, since many

different naming conventions exist (some naming is done on the basis of species,

other w.r.t. families and still others with respect to function or diseases, etc.).

104. Describe the workflow underlying TREMBL

The protein sequences in TrEMBL are obtained by translation of nucleotide sequences (from various nucleotide databases).

Translation is done by trying all 6 reading frames and guessing the correct frame from the following criteria:

· the translated sequence is long enough to be a protein

· it looks like a protein sequence, starting with Met (the start codon)

· it has a termination site at a reasonably long distance

After the probable sequence is obtained, evidence for the existence of the protein can be checked experimentally, e.g. by MALDI-TOF: the protein composition is determined and then matched against the predicted sequence.

[Before entry into TrEMBL the protein sequence is stored in UniParc]

Amino acid sequences of the proteins are generated computationally by translating open reading frames from the EMBL nucleotide sequence database. Since more than one possible translation exists, TrEMBL data is not as reliable as Swiss-Prot data.
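The frame-guessing heuristic described above (try all six frames, keep stretches that start with Met, end at a stop codon and are long enough) can be sketched as follows. This is a toy illustration under stated assumptions, not the actual TrEMBL pipeline: the function names and the `min_len` threshold are hypothetical, and only the standard genetic code is used.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, ordered to match product(BASES, repeat=3).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def candidate_orfs(dna, min_len=50):
    """Return (strand, frame, peptide) for each Met-to-stop stretch of
    at least min_len residues in any of the six reading frames."""
    dna = dna.upper()
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            # Drop the last piece: it is not terminated by a stop codon.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_len:
                    hits.append((strand, frame, segment[start:]))
    return hits


print(candidate_orfs("ATGGCTAAAGGTTGA", min_len=4))
```

Several frames can yield plausible candidates for the same nucleotide sequence, which is one reason such computed translations are less reliable than manually reviewed entries.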

105. What is the difference between the Swissprot keywords and GO-terms?

GO terms are always part of one of three sub-ontologies (BP, CC, MF), hence their scope is, in principle, more limited (that should really narrow the applicability in this case, though). They are organised hierarchically.

SwissProt keywords, on the other hand, are merely indices (along several dimensions, like functional and structural categories). They are picked from a controlled vocabulary (CV).

KEYWORDS	GO
This section lists selected keyword(s), derived from a thesaurus of controlled vocabulary with a hierarchical structure. Keywords summarise the content of a UniProtKB entry and facilitate the search for proteins of interest.	This subsection of the ‘Ontologies’ section lists selected terms derived from the Gene Ontology (GO) project.
Keywords can be used to retrieve subsets of protein entries based on functional, structural, or other categories.	Their scope is limited to only three categories.
Classified along several indices: Biological process, Cellular component, Coding sequence diversity, Developmental stage, Disease, Domain, Ligand, Molecular function, PTM	Classified along 3 indices: Molecular function, Biological process, Cellular component

106. Describe the workflow of the generation of a new Swissprot/UniProt KB entry

A new entry is taken from TrEMBL and, generally,

the first step is to get a copy of the article(s) given in the reference

lines. Then the sequence is aligned, using FastA or Blast, against all

existing Swiss-Prot and TrEMBL entries. This allows us, quickly and easily,

to assess if and how the sequence relates to existing families in SWISS-

PROT. The next step is to read the article(s), assess the information

given and add relevant comments and features to the entry.

When a gene has been identified from probing with the gene from another

organism and that gene encodes a characterized protein the description line will be copied over from the corresponding protein sequence entry. When present in the existing entry and it is not species specific, the function and other comment lines are added.

The submission of a new protein sequence to UniProtKB can be done by SPIN. SPIN is

the web-based tool for submitting directly sequenced protein sequences and their

biological annotations to the UniProt Knowledgebase.

107. Why is SwissProt / UniProt KB called a “semantic hub” for molecular biology data?

Integration with other databases: UniProt KB is cross-referenced with nucleic acid sequence databases, protein sequence databases and protein tertiary structure databases. At present about 50 databases are linked with UniProt KB, and this extensive network makes it a focal point for interconnecting biomolecular information.

108. Which sort of references do you know exist in the UniProt KB schema?

References to the three types of sequence-related DBs (NA sequences, protein

sequences, protein tertiary structures) as well as to specialised data collections.

Examples are TF-FACTOR/-SITE, OMIM, EMBL, GenBank, PDB, BLOCKS, Pfam

and many more.

109. How is the problem of synonymous names for proteins dealt with in SwissProt /UniProtKB?

All the proteins encoded by the same gene are merged into a single UniProt/Swiss-Prot entry. Differences found in various sequencing reports are analysed and fully described in the feature table (FT lines), which covers alternative splicing events, polymorphisms, sequence conflicts, etc.

110. Where would you find the biologically most relevant information in a UniProt KB entry?

In General Annotation of the uniprot entry. Arguably, the most relevant information, like function and description, similarities,

etc., can be found in the annotations of each entry (which are unfortunately, mostly in

a rather free-text-like form and hence computationally complicated to analyse).

111. What sort of features described in SwissProt do you know?

Protein name and synonyms, description and function, sequence, taxonomic

information, literature references, posttranslational modifications, cross-references,

protein families, domains and sites, secondary, tertiary and quaternary structure,

comments, related diseases.

112. Sketch a simple schema comprising the major SwissProt entity types (object classes represented in SwissProt / UniProtKB)

Major entity types mentioned in Swiss prot are:-

1)Names and origin

2)Protein attributes

3)General annotation

4)Ontologies

5)Sequence annotations

6)Sequence

7)Reference

8)Cross reference

113. Why are genes and proteins subject of patents?

a) Some proteins can work as drugs, and the special methods for their purification are a real technical invention.
b) Proteins are important to researchers because they are the links between genes and pharmaceutical development.

114. What is the purpose of patents?

To allow the patent holder to draw some financial benefit from his findings, while

protecting his rightful claim for the intellectual property for the invention (at least for

a given time frame). This motivates individuals and companies to make their research

public.

115. Do patents contain sequence information? yes

Patents of genes contain nucleic acid sequence information; patents of proteins contain

amino acid sequence information

116. What is a “claim” in a patent?

A “Claim” in a patent is a set of phrases following the description of an invention and

describing the composition of the invention, thus defining the extent of protection

provided by the patent, e.g. “Protein alcohol dehydrogenase, comprising two

chains…”

117. Which types of information from patent literature (patent protein database) is not contained in UniProt KB?

UniProt KB and a patent of a protein both contain information about protein’s amino

acid sequence, protein’s name, description, name of the founder (patent owner),

publication date. On the other hand, UniProt KB does not contain the information

that is specific to patents, such as the patent number, the rights that the patent grants the

patent owner, etc.

UniProt KB also excludes synthetic sequences and most patent application sequences.

118. What is the purpose of the IPI (International Protein Index) database and what makes it distinct from the UniProt approach?

IPI provided a top level guide to the main databases that described the proteomes of higher eukaryotic organisms. IPI:

· effectively maintains a database of cross references between the primary data sources

· provides minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript)

What are the major source databases underlying IPI?: UniProt (SwissProt and TrEMBL), RefSeq, Ensembl.

What is InterPro and in how far is the InterPro scope different from the IPI approach?

InterPro is the integrated resource for protein domains and functional sites. As the

name suggests, its aim is to integrate information from several databases for

the functional description of proteins and their classification into groups based on

structural properties. Its member databases are ProSite, PRINTS, Pfam, ProDom,

Smart, TigrFam, PIR, SuperFamily.

IPI does not attempt to create such a categorization or to support functional inference,

but only tries to create an overview of the complete proteome of a limited number of

organisms.

What is the basis of protein-families? When do we speak of protein families?

Proteins in a family descend from a common ancestor and typically have similar three-dimensional structures, functions, and significant sequence similarity. Proteins that do not share a common ancestor are very unlikely to show statistically significant sequence similarity, making sequence alignment a powerful tool for identifying the members of protein families.

What types of sequence alignment do you know? … and what does PROSITE have to do with this?

Sequence alignment can be performed on a global or a local scale. Global alignment tries to align two sequences completely, while local alignment just looks for high-scoring subsequences.

ProSite contains patterns and matrices describing motifs with known functional and

structural properties. By comparing a given sequence against them, we can find out which

family a protein belongs to and maybe draw conclusions to its structure and function.
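The difference between the two alignment scales can be made concrete with a small dynamic-programming sketch. It assumes the textbook Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a simple match/mismatch/gap scoring scheme; these algorithm names and the scoring values are assumptions for illustration, not something the notes prescribe.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b.

    local=False: global alignment (every residue of both sequences is aligned).
    local=True:  local alignment (best-scoring pair of subsequences).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


a, b = "GGGACGTCCC", "TTACGTAA"
print(align_score(a, b, local=True))   # rewards the shared core ACGT
print(align_score(a, b, local=False))  # also pays for the unrelated flanks
```

The shared core dominates the local score, while the global score is pulled down by the flanking regions that do not match.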

What is the difference between PROSITE, BLOCK, PFAM and PRINTS? Which ones of the above contribute to InterPro?

All but BLOCKS contribute to InterPro. However, BLOCKS uses ProSite as its datasource.

o PROSITE is a database of protein families and domains. It contains information about the conserved regions in the proteins, i.e. motifs.

o PRINTS comprises information about fingerprints of the proteins, i.e. it contains groups of motifs that characterize a protein more precisely than a single motif. Searches through SwissProt and TrEMBL.

o BLOCKS is similar to PRINTS. It also contains information about the blocks of conserved regions in the proteins (ungapped segments of the most highly conserved regions). Searches through SwissProt.

o PFAM is a database that contains multiple sequence alignments and hidden Markov models covering many common protein domains and families. It has a cartoon-like representation of the protein domains’ architecture.

o SMART was the first to use cartoon-like representations. Two modes: Normal – info from SwissProt, TrEMBL, Ensembl; Genomic – proteomes of organisms with fully sequenced genomes.

124. Define a motif?

A conserved element of a protein sequence that usually correlates with a particular

function. It is a pattern with a biological meaning.

Define a pattern? An element of a protein sequence that occurs in proteins with a higher than random probability.

125. What role does the annotation of InterPro domains with GO terms play for the

prediction of new protein sequences?

Annotation of InterPro domains with GO terms makes it possible to categorize a newly discovered protein sequence in terms of molecular function, biological process or cellular localization. It might also allow us, conversely, to draw conclusions about which components of protein families are responsible for a certain function, leading to a better understanding of the process itself.

126. What is a regular expression?: A formal notation for describing string patterns. In biology, it can be used to define and recognize patterns in AA or NA sequences.
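As an illustration of regular expressions on amino acid sequences, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It is a simplified converter that handles only the common syntax elements (`x`, `[...]`, `{...}`, `(n)` / `(n,m)`, `<`, `>`); the function names and the example sequence are made up for the demonstration. The pattern `N-{P}-[ST]-{P}` is the well-known N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate common PROSITE pattern syntax into a Python regex string."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")    # sequence anchors
    return regex


def find_sites(pattern, sequence):
    """Return (start, matched_text) for every match, including overlapping ones."""
    regex = prosite_to_regex(pattern)
    # The lookahead lets matches overlap, which is common for short motifs.
    return [(m.start(), m.group(1)) for m in re.finditer(f"(?=({regex}))", sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_sites("N-{P}-[ST]-{P}.", "MKNGSAANPTQ"))
```

In the example sequence the stretch starting NGS matches, while the one starting NPT does not, because proline is excluded at the second position.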

127. Which experimental procedures produce the data for entries in PDB? : X-ray crystallography and NMR (nuclear magnetic resonance). Electron microscopy and atomic force microscopy are also used; however, very few structures determined with these methods are in the PDB.

128. Why is it necessary to crystallize proteins in order to obtain structural data?

X-ray scatter from a single molecule is very weak. In a crystal, many molecules are

oriented in the same direction, thus making the X-ray scattering stronger (the waves

can add up in phase and increase the signal). Therefore, a crystal acts as an amplifier

129. What mathematical approach is taken to reconstruct the 3D structure of a protein from X-ray experiments?

Inverse Fourier transform of the diffraction pattern gives electron density.
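Written out in the usual textbook form, the relation looks as follows. The symbols are not defined in these notes and are introduced here as assumptions: V is the unit cell volume, |F_hkl| is the structure factor amplitude obtained from the measured intensity of reflection (h, k, l), alpha_hkl is its phase, and x, y, z are fractional coordinates in the unit cell.

```latex
\rho(x, y, z) = \frac{1}{V} \sum_{h} \sum_{k} \sum_{l}
  |F_{hkl}| \, e^{i \alpha_{hkl}} \, e^{-2 \pi i (h x + k y + l z)}
```

Only the amplitudes can be measured directly; the phases have to be recovered by other means before the sum can be evaluated, which is known as the phase problem.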

130. Which type of screening procedure is based on protein structure information?

Docking of the proteins. Knowing the protein structure allows one to predict the properties of protein binding.

131. Is PDB a curated database?

Yes. The submitted data must be validated. The validation report is created

automatically, later it is checked by the curator. The submitter and curator discuss the

issues of the validation report.

132. What types of structure can be deposited to the PDB? (see PDB documentation for

details)

X-ray crystallography structure depositions
NMR structure depositions
EM structure depositions

133. What is the Chemical Component Dictionary in PDB? What is its purpose?

The Chemical Component Dictionary is an external reference file describing all residue and small molecule components found in PDB entries. This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules.

Interaction databases: 8,9, 10

136. Draft the three major experimental techniques used to investigate protein-protein

interactions

· Yeast two hybrid: The protein we are interested in (A) is coupled to a transcription factor, which requires a co-factor to bind for activation. Proteins whose interactions with A we are interested in are coupled to this co-factor. Consequently, for proteins that interact, the TF will be activated, expressing gene products that make the yeast resistant to some sort of poison. We can then remove all yeast organisms that are not resistant and subsequently analyse the binding proteins, e.g. by gel electrophoresis + MS.

· Immunoprecipitation / Pulldown: Protein A is coupled to a magnetic bead or detected via antibodies. It is then possible to extract this specific protein from the cell together with its naturally interacting proteins and to subsequently analyse those.

· Tandem Affinity Purification: The protein of interest is marked with a TAP-tag. First it binds to the immunoglobulin IgG, the protein complex is extracted and the TAP-tag is cleaved. The second binding is to the calmodulin beads. After that the purified protein complex is eluted.

137. Sketch the conceptual model underlying the REACTOME database

REACTOME uses a frame-based data model (that is very similar to an object-oriented class hierarchy). Top-level classes are (ReferenceEntity,) PhysicalEntity (subclasses: EntityWithAccessionedSequence, GenomeEncodedEntity, SimpleEntity, Complex, EntitySet), CatalystActivity and Event (subclasses: ReactionlikeEvent (Reaction, BlackBoxEvent, Polymerisation, Depolymerisation) and Pathway). It's important to notice that the same molecule in different cellular compartments, or in differently posttranslationally modified variants, will be represented by several instances.

138. Name at least three major classes of interaction types and define two subclasses (could also be instances) for each one of these interaction types.

Class: Subclasses

Reaction: Acylation, Cleavage

BlackBoxEvent: Activation, Binding

Polymerisation: Lattice formation

Depolymerisation: Disintegration of the matrix layer

139. Develop a strategy for the comparison of networks of interacting proteins: how would

you compare networks?

140. Does the domain structure of a protein allow its interaction partner to be predicted?

Theoretically it is possible to calculate whether a protein with known domain structure can interact with another protein whose domain structure is also known. However, such predictions must be treated with caution, because even if the interaction is possible theoretically, it may never occur in vivo. If these predictions are based on computationally predicted structures, the prediction of interaction is even less reliable, since even slight mistakes in structure prediction might have crucial effects on interactions.

141. Give two examples for predicates and rules that can be established for a given

interaction of your choice (e.g. protease and substrate; kinase and kinase-substrate).

Protein A and protein B form a complex C.

Protein A binds protein B, influences its activity and dissociates later. (enzyme)

142. Give two examples how protein-protein-interactions are described in scientific text

Proteins bind to each other through a combination of hydrophobic bonding, van der Waals forces, and salt bridges at specific binding domains on each protein.

· If a protein interacts with a ligand : protein-ligand interaction

· If a protein interacts with another protein: protein-protein interaction

· If a protein binds to DNA: protein-DNA interactions

· If a protein binds to RNA: protein-RNA interactions

143. Write down all possible terms in scientific text that indicate protein-protein-

interactions or other types of molecular interactions

§ “...X binds Y...”, “...X interacts with Y...”, “... X phosphorylates Y ...”, “... ligands X and Y...”, “... X and Y form Z...”, “... X inhibits the reaction Y of Z...”, etc.

144. What is a SPOKE expansion and how does it differ from a MATRIX expansion

network? (visit the INTACT documentation for the answer)

Spoke expansion: Links the bait molecule to all prey molecules. If N is the number of molecules in the complex, it generates N-1 binary interactions.
Matrix expansion: Links every molecule to every other molecule present in the complex. If N is the number of molecules in the complex, it generates (N*(N-1))/2 binary interactions.
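The two expansions and their interaction counts can be checked with a few lines of code. This is a minimal sketch; the function names and the molecule labels are hypothetical.

```python
from itertools import combinations


def spoke_expansion(bait, preys):
    """Pair the bait with every prey: N-1 pairs for a complex of N molecules."""
    return [(bait, prey) for prey in preys]


def matrix_expansion(members):
    """Pair every member with every other member: N*(N-1)/2 pairs."""
    return list(combinations(members, 2))


complex_members = ["A", "B", "C", "D"]  # A is taken to be the bait
print(spoke_expansion("A", complex_members[1:]))
print(matrix_expansion(complex_members))
```

For the four-member complex this gives 3 spoke interactions and 6 matrix interactions, in line with the formulas above.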

134. Define a simple, conceptual design for a database representing information on protein-protein interactions

135. Extend this conceptual design towards any given possible biochemical interaction (e.g. interaction of small molecules (metabolites) and proteins (enzymes))

The same schema as in protein-protein interaction can be used, only some additional

entity types (e.g. metabolites, enzymes etc) and relationships between them (e.g. binds

to, interacts with etc) have to be defined.

Enzyme and metabolic pathway databases :10

Give a short explanation of the principles of the Enzyme Classification (EC)

In the EC system for enzyme nomenclature, the first number stands for the main class of the enzyme. The classes are:

1-oxidoreductases, 2-transferases, 3-hydrolases, 4-lyases, 5-isomerases, 6-ligases.

The second number (subclass) describes the substrate, the third (sub-subclass) the acceptor, and the fourth is the arbitrary serial number of the enzyme in its sub-subclass.
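The four-part structure of an EC number can be illustrated with a small parser. This is a sketch only: the function name `parse_ec` and the returned field names are hypothetical, and only the six main classes listed above are included. EC 1.1.1.1 (alcohol dehydrogenase) serves as the example.

```python
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


def parse_ec(ec_number):
    """Split an EC number such as 'EC 1.1.1.1' into its four levels."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers: {ec_number!r}")
    main, subclass, subsubclass, serial = (int(part) for part in parts)
    if main not in EC_CLASSES:
        raise ValueError(f"unknown main class: {main}")
    return {
        "main_class": EC_CLASSES[main],
        "subclass": subclass,
        "subsubclass": subsubclass,
        "serial": serial,
    }


print(parse_ec("EC 1.1.1.1"))
```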

Describe the difference between KEGG and the ENZYME database

ENZYME is just a repository for information regarding enzyme nomenclature. It only

contains entries for each EC-number assigned enzyme, along with recommended and

alternative names and some information about the catalytic activity, co-factors and

links to SwissProt.

KEGG, on the other hand, is a large project comprising a number of different

databases for gene and genome related information, enzymatic pathways and bioactive

chemicals. As a part of one of its subbranches (KEGG Ligands), KEGG also

comprises an EC-number-based nomenclature database, but apart from only giving

names and minimal information about enzymes, the KEGG Enzyme database is fully

integrated into the other DB's of the project and cross-references to orthologues, genes,

structures and other databases.

Describe the commonalities between both databases

They both have entries based on unique enzymes with an EC-number, provide

recommended and alternative names, basic information on catalytic activity, co-factors

and some DB cross-references (e.g. SwissProt, IUBMB EC).

What attributes of an enzyme would you need for models of metabolite flux reactions?

Metabolic flux = The rate of turnover of molecules through a metabolic pathway or enzyme.

a) Allosteric regulation or other regulatory mechanisms for activation and inhibition of enzyme

b) Its specific activity and Km value for each of the substrates, i.e. the Lineweaver-Burk plot for the enzyme.

c) Information about inhibitors- competitive, non competitive and uncompetitive.

However, for modelling the metabolite flux, certain steady state assumptions need to be made and the enzyme kinetics can be used in part to construct models of the metabolite flux.
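How these attributes enter a flux model can be sketched with a single rate law. The example assumes Michaelis-Menten kinetics with optional competitive inhibition, which the notes do not name explicitly; the function name, parameter names and numeric values are all hypothetical.

```python
def reaction_rate(substrate, vmax, km, inhibitor=0.0, ki=None):
    """Michaelis-Menten rate v = Vmax*[S] / (Km + [S]).

    With a competitive inhibitor the apparent Km is raised by the
    factor (1 + [I]/Ki), while Vmax stays unchanged.
    """
    km_apparent = km if ki is None else km * (1 + inhibitor / ki)
    return vmax * substrate / (km_apparent + substrate)


# At [S] = Km the enzyme runs at half of Vmax.
print(reaction_rate(substrate=2.0, vmax=10.0, km=2.0))
# A competitive inhibitor at [I] = Ki doubles the apparent Km.
print(reaction_rate(substrate=2.0, vmax=10.0, km=2.0, inhibitor=1.0, ki=1.0))
```

Under a steady-state assumption, one such rate per enzyme is balanced against the rates that produce and consume each metabolite.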

Do KEGG or ENZYME comprise the relevant information?

KEGG is a much more comprehensive database which houses multiple databases under it. Since it contains such a set of cross-linked databases, it covers a wider range of data about the enzyme. For example, while KEGG ENZYME gives information about the EC nomenclature, basic reaction, substrates and products, it also contains links to the KEGG databases REACTION, RPAIR, reaction class and COMPOUND. This is in addition to the links to external databases like BRENDA and ExplorEnz. Thus it gives much more information than ENZYME. However, it does not contain information on enzyme kinetics, catalysis and inhibition.

ENZYME is quite limited, more so than KEGG as it contains only the enzyme nomenclature and external links and none of the above mentioned attributes are discussed. Therefore to understand these, multiple databases need to be searched through the external link provided in ENZYME.

What is a “rate limiting step”?

It's the slowest step in a reaction, the bottleneck. The whole reaction cannot proceed faster than its slowest sub-process.

What is a “salvage pathway”?

Salvage pathways are used to recover bases and nucleosides that are formed during degradation of RNA and DNA. This is important in some organs because some tissues cannot undergo de novo synthesis.

Which principal approaches towards metabolite network simulation do you know?

Petri nets can also be used to simulate a metabolic network.
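A minimal sketch of the Petri net idea: metabolites are places holding tokens, and a reaction is a transition that consumes and produces tokens. The data layout (`consume` / `produce` dictionaries) and function names are assumptions chosen for brevity; the hexokinase reaction (glucose + ATP -> glucose-6-phosphate + ADP) serves as the example.

```python
def is_enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    return all(
        marking.get(place, 0) >= count
        for place, count in transition["consume"].items()
    )


def fire(marking, transition):
    """Return the new marking after firing the transition once."""
    if not is_enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, count in transition["consume"].items():
        new_marking[place] = new_marking.get(place, 0) - count
    for place, count in transition["produce"].items():
        new_marking[place] = new_marking.get(place, 0) + count
    return new_marking


hexokinase = {
    "consume": {"glucose": 1, "ATP": 1},
    "produce": {"glucose-6-phosphate": 1, "ADP": 1},
}
marking = {"glucose": 2, "ATP": 1}
marking = fire(marking, hexokinase)
print(marking)
print(is_enabled(marking, hexokinase))  # ATP is used up, so no second firing
```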

How would you model the role of a cofactor in an enzymatic reaction simulation?

Co-factors are required for the proper functioning of enzymes. Depending on their type

they can either be integral parts of the enzymes (prosthetic groups) or only loosely

bound to it (coenzymes). Either way, one way to model both of them would be as

substrates that have to be present for the catalytic reaction to happen, yet remain

unchanged, i.e. the list of products will again include these cofactors.

How would you represent complexes of more than one cofactor, one substrate and one enzyme in a database?

You would have to introduce a container entity type that allows for bundling various

entities together. Such a Compound entity would have a 1-to-n relationship to the

cofactors, substrates and enzymes. Probably it would make more sense to have one

container type for each of them, i.e. CompoundEnzyme, CompoundCofactor and

CompoundSubstrate.
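The container idea can be sketched with two small classes. This is one possible reading of the design above, using a single container with a role tag instead of three separate container types; all class names, role names and example members are hypothetical.

```python
from dataclasses import dataclass, field

ROLES = {"enzyme", "cofactor", "substrate"}


@dataclass(frozen=True)
class Component:
    name: str
    role: str  # one of ROLES


@dataclass
class Complex:
    """Container entity with a 1-to-n relationship to its components."""
    name: str
    components: list = field(default_factory=list)

    def add(self, name, role):
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.components.append(Component(name, role))

    def by_role(self, role):
        return [c.name for c in self.components if c.role == role]


example = Complex("example complex")
example.add("enzyme E", "enzyme")
example.add("cofactor A", "cofactor")
example.add("cofactor B", "cofactor")
example.add("substrate S", "substrate")
print(example.by_role("cofactor"))
```

In a relational schema the same structure would correspond to a complex table, a component table and a link table carrying the role.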

Which multi-enzyme complexes do you know? In which biosynthesis pathway are they involved?

· Pyruvate dehydrogenase complex transforming pyruvate to acetyl-CoA, which is required for cellular respiration. It links the citric acid cycle to the glycolysis metabolic pathway.

· Phosphotransferase system in bacteria for sugar uptake, using phosphoenolpyruvate as the energy source.

· Tryptophan synthase multi-enzyme complex for tryptophan synthesis.

Sketch an ER diagram of a database that represents the citrate-cycle.

An entity-relationship (ER) diagram is a specialized graphic that illustrates the relationships between entities in a database. ER diagrams often use symbols to represent three different types of information. Boxes are commonly used to represent entities. Diamonds are normally used to represent relationships and ovals are used to represent attributes.

What distinguishes a cartoon – like representation of biochemical reactions and pathways in REACTOME from the representation in KEGG?

The representation in REACTOME is much more interactive than in KEGG. The entities in the pathway maps are coloured, the maps can be zoomed and scrolled. The enzyme names are also displayed on selection which makes it easier to understand. Substrates, products, intermediates and enzymes can be selected and relevant information is displayed in a pane on the left.

In KEGG, the enzyme classification (EC) numbers are given, not the enzyme names. The pathway window is static and cannot be zoomed. If any entity is clicked, a new window linked to another relevant database under KEGG opens, and the information about the entity under consideration can be found there.

If you would have to design a “virtual physiological human” as a form of in silico

representation of molecular physiology: which entity-classes would this model contain?

Numerous. Depending on the level of detail, it could start from sub-atomic entities like

neutrons and electrons, going up over molecules to macromolecules such as proteins

and enzymes. It would also have to contain entity types for sub-cellular compartments,

cell types, systems and organs.

Sketch a strategy for the development of an automated process (a software program) for the extraction of IC50 values from tables in scientific publications. How would

you design the problem-solving approach (no need to go into details of software

engineering: just describe how an automated approach should work that autonomously

extracts IC50 values from tables in biochemistry publications).

First we would define a MeSH term for IC50 values (half maximal inhibitory concentration).

Then we would link out to PubMed and search the database to find the IC50 values.
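Once candidate publications have been found, the table-reading step itself could look roughly like the sketch below. It assumes the tables have already been converted into rows of text cells, that IC50 columns can be recognised by their header, and that the unit is given in parentheses in the header; the function name, the record fields and the sample table are all made up for illustration.

```python
import re

HEADER_RE = re.compile(r"IC\s*50", re.IGNORECASE)
VALUE_RE = re.compile(r"([<>]?)\s*(\d+(?:\.\d+)?)")


def extract_ic50(table):
    """Collect IC50 values from a table given as a list of rows of strings.

    The first row is treated as the header and the first column as the
    compound name.
    """
    header, *rows = table
    records = []
    for col, title in enumerate(header):
        if not HEADER_RE.search(title):
            continue
        unit_match = re.search(r"\(([^)]+)\)", title)
        unit = unit_match.group(1).strip() if unit_match else None
        for row in rows:
            if col >= len(row):
                continue
            value_match = VALUE_RE.search(row[col])
            if value_match:  # cells such as "n.d." are skipped
                records.append({
                    "compound": row[0],
                    "qualifier": value_match.group(1),
                    "value": float(value_match.group(2)),
                    "unit": unit,
                })
    return records


sample = [
    ["Compound", "IC50 (nM)", "Yield (%)"],
    ["1a", "12.5", "80"],
    ["1b", ">1000", "75"],
    ["1c", "n.d.", "60"],
]
for record in extract_ic50(sample):
    print(record)
```

A real system would also have to locate the tables in the publication, handle merged cells and footnotes, and normalise the units.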

Tuesday, March 6, 2012
