Wednesday, February 22, 2012

  1. Name at least three searchable types (categories) of information that are contained in
MEDLINE abstracts

Title, author, journal issue, text words.

  1. What are MeSH terms and what is their purpose?

MeSH (Medical Subject Headings) is a controlled vocabulary thesaurus used for
indexing articles in MEDLINE/PubMed. MeSH terms are organized in a hierarchical
structure that allows searching at various levels of specificity. They follow a hierarchy (tree) format and not a DAG, so they are not considered an ontology.

Q34: How can a search result be “expanded”? (I refer to the PubMed help, where “expansion of search results” is a separate point.)
Answer: If this question means expanding the search when too few citations were retrieved, then these are the steps to take:
·         Click the Related citations “See all” link for a relevant citation to display a pre-calculated set of PubMed citations closely related to the article.
·         Remove extraneous or overly specific terms from the search box.
·         Try using alternative terms to describe the concepts you are searching for.

  1. Explain “information retrieval” with an example involving Medline and MeSH terms
MEDLINE uses Medical Subject Headings (MeSH) for information retrieval. Engines designed to search MEDLINE (such as Entrez and PubMed) generally use a Boolean expression combining MeSH terms, words in abstract and title of the article, author names, date of publication, etc. Entrez and PubMed can also find articles similar to a given one based on a mathematical scoring system that takes into account the similarity of word content of the abstracts and titles of two articles.
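
A Boolean expression of the kind described above can be sketched in a few lines of Python. This is only an illustration: the helper name `build_pubmed_query` and the example search terms are assumptions made for this sketch, while the bracketed field tags follow the PubMed search syntax.

```python
def build_pubmed_query(mesh_terms, text_words=(), author=None, year=None):
    """Combine MeSH terms, text words, an author and a year with AND.

    Hypothetical helper for illustration; it only builds the query string.
    """
    parts = [f'"{term}"[MeSH Terms]' for term in mesh_terms]
    parts += [f"{word}[Title/Abstract]" for word in text_words]
    if author:
        parts.append(f"{author}[Author]")
    if year:
        parts.append(f"{year}[dp]")  # dp = date of publication
    return " AND ".join(parts)


# Illustrative terms only, not taken from the course material.
query = build_pubmed_query(
    ["Breast Neoplasms", "Tamoxifen"], text_words=["resistance"], year=2011
)
print(query)
```

The resulting string can be pasted into the PubMed search box; replacing AND with OR would broaden the search instead of narrowing it.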

  1. When do we speak of synonyms and when do we speak of homonyms?

Synonyms are different words with identical meaning.
Homonyms are identical words with different meaning.

  1. Sketch the major concepts and the conceptual schema of MedLine
Journal, Author Name, Article, Publication Date, References

  1. Explain the differences between OMIM and MedLine
Both OMIM and MEDLINE are literature-based databases. However, the OMIM database
is a catalog of human genes and genetic disorders. An entry in OMIM is a review
focusing on a disease, its phenotypic appearance and the genes involved in the
molecular etiology of the disease, whereas MEDLINE is a bibliographic database
covering a broad scope of biosciences.

OMIM and MEDLINE compared point by point:
·         Scope: OMIM is a literature-based database which contains information on all known Mendelian disorders and over 12,000 genes, and focuses on the relationship between phenotype and genotype. MEDLINE is a bibliographic database that cites abstracts from biomedical journals.
·         Content: OMIM provides access to curated data gathered from the public scientific literature as well as other sources. MEDLINE is a database of indexed abstracts from the scientific biomedical literature.
·         Entries: each OMIM entry is obtained and compiled from several reference sources. Each MEDLINE entry corresponds to a single journal article.
·         Vocabulary: OMIM does not employ MeSH terms. MEDLINE uses the controlled vocabulary MeSH (Medical Subject Headings).
·         Curation: OMIM is a heavily curated database. MEDLINE is also a curated database.
·         Focus: OMIM is focused on human disease and gathers any kind of information which helps to understand the cause of disease. MEDLINE contains details of mutagenesis experiments whose relevance may yet have to be established.

  1. What are the three root concepts of GO?

Molecular function : the elemental activities of a gene product at the molecular level, such as  binding or catalysis
Biological process: operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.
Cellular Component: the parts of a cell or its extracellular environment.
  1. What does “annotation” mean?

A combination of comments, notations, references, and citations, either in free format
or utilising a controlled vocabulary, that together describe all the experimental and
inferred information about a gene or protein. Annotations can also be applied to the
description of other biological systems. Batch, automated annotation of bulk biological
sequence is one of the key uses of Bioinformatics tools.

  1. Which controlled vocabularies do you know besides GO?

MeSH, HGNC, Sequence Ontology, BRENDA enzyme source ontology, EMAP,
Swiss-Prot keywords, MAGE-OM
·         GO
·         MeSH (Medical subject Headings)
·         IUPAC
·         EC (Enzyme nomenclature)

  1. Name three of the most important / most informative entity-types that can be found in
EMBL or EntrezGene

Entity-types: organism, molecule, sequence.

  1. Which objects in biology correspond to these entity-types?
Organism: animals, plants, fungi, bacteria, protozoa.
Molecule: DNA, RNA.
Sequence: nucleic acid sequence, i.e. adenine, thymine, guanine, cytosine in case of
DNA; adenine, uracil, guanine, cytosine in case of RNA.

  1. Define a gene and name three attributes
A gene is a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions.
Put differently, a gene is a unit of DNA which performs one function. Usually, this is equated with the production of one RNA or one protein. A gene contains coding regions, introns, untranslated regions and control regions.
Attributes include:
·         Intron and exon positions
·         Protein binding sites

  1. Who is the owner of an entry in EMBL?
The person who submits (submitter) the gene sequence (or any entry) to the EMBL database is the owner. The authority to change that entry is also with the submitter.

Tuesday, February 21, 2012

Bio-databases...Part 2

15.The ENTREZ documentation mentions “E-utilities”. A link on the ENTREZ site
leads to the documentation of E-utilities …. Please explain, what E-utilities are and
what they can be used for.

 The E-utilities translate a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data. The E-utilities are therefore the structured interface to the Entrez system, which currently includes 38 databases covering a variety of biomedical data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature.
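
As a rough sketch of what “a standard set of input parameters” means in practice, the snippet below builds an ESearch request URL with the `db`, `term` and `retmax` parameters. The helper name `esearch_url` and the example search term are assumptions made for this illustration; the snippet only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL (hypothetical helper; no network access)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query: PubMed records with this phrase in the title.
url = esearch_url("pubmed", "linkage disequilibrium[Title]", retmax=5)
print(url)
```

Fetching this URL (for example with `urllib.request.urlopen`) would return a list of matching record identifiers, which can then be passed on to other E-utilities for retrieval.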

16.What categories of biodatabases are integrated under SRS and which ones are not?

Plant databases, organelle databases, immunological databases, microarray data and other gene expression databases are NOT included.

17.How do you link query results in SRS and how do you perform
“faceted searches” (multiple searches) using SRS? What is the usage of search results from one query as the starting group for the next query?

After selecting the search result you are interested in, click the ‘Link’ option on the left side of the screen. When you are directed back to the database page, select multiple databases.
MIAME: Minimum Information About a Microarray Experiment
Two-color microarrays or two-channel microarrays are typically hybridized with cDNA prepared from two samples to be compared (e.g. diseased tissue versus healthy tissue) that are labeled with two different fluorophores.
In single-channel microarrays or one-color microarrays, the arrays provide intensity data for each probe or probe set indicating a relative level of hybridization with the labeled target.
In standard microarrays, the probes are synthesized and then attached via surface engineering to a solid surface by a covalent bond to a chemical matrix. Other microarray platforms, such as Illumina, use microscopic beads instead of the large solid support.

  1. In the description file for the TAXONOMY database, the usage of taxonomy entries in
other databases is mentioned. Which types of other databases refer to TAXONOMY?

The taxonomy database of the International Sequence Database Collaboration contains the names of all organisms that are represented in the sequence databases (such as EMBL, the ENA Project, RefSeq Genome, etc.) with at least one nucleotide or protein sequence.

  1. What is a catalogue?

The database catalog of a database instance consists of metadata in which definitions of database objects such as base tables, views (virtual tables), synonyms, value ranges, indexes, users, and user groups are stored (Wikipedia)
In computing, a catalog is a directory of information about data sets, files, or a database. A catalog usually describes where a data set, file or database entity is located and may also include other information, such as the type of device on which each data set or file is stored.

  1. What is an index? How does SRS “index” over several databases?

A) An index is a feature of an entity that allows identifying and searching for elements of the entity. Database indexes are auxiliary data structures that allow for quicker retrieval of data at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

B) SRS indexing process
SRS is updated daily: an update mechanism checks external and local FTP sites for new data files every day, so the system always provides the most up-to-date data available. The system can index plain text, HTML and XML formatted data files. These data files are broken down by a parser into entries and subsequently into fields. The resulting field indices can then be used for data retrieval or for generating searchable links between different database entries. SRS indexes database records using a word-by-word approach. Queries can be broadened or refined by using any of the logical operators “and”, “or” and “but not”.
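
The word-by-word indexing and the three logical operators can be illustrated with a small inverted index. This is a toy sketch under stated assumptions, not how SRS is implemented: the entry identifiers and texts are invented, and Python set operations stand in for “and”, “or” and “but not”.

```python
from collections import defaultdict


def build_index(entries):
    """Map every word to the set of entry IDs whose text contains it."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for word in text.lower().split():
            index[word].add(entry_id)
    return index


# Invented example entries (IDs and descriptions are hypothetical).
entries = {
    "E1": "human insulin mRNA",
    "E2": "mouse insulin gene",
    "E3": "human hemoglobin gene",
}
index = build_index(entries)

human_and_insulin = index["human"] & index["insulin"]      # "and"
human_or_insulin = index["human"] | index["insulin"]       # "or"
insulin_but_not_human = index["insulin"] - index["human"]  # "but not"
print(human_and_insulin, human_or_insulin, insulin_but_not_human)
```

The lookup cost no longer depends on scanning every entry, which is the trade-off named in part A: faster retrieval in exchange for extra storage and work at update time.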

  1. What is a hierarchy? Which relationship-type is used in hierarchies?
A hierarchy is an organization of entities, where each element (except the top one) has
one parent. Every child element has the features of the parent element.
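
The “one parent per element” rule can be sketched as a simple child-to-parent mapping. The example below is a deliberately abbreviated toy hierarchy (intermediate ranks are skipped) and the function name `path_to_root` is an assumption made for this illustration.

```python
# Each element has exactly one parent; the top element has none.
parent = {
    "Mammalia": "Animalia",
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
}


def path_to_root(node):
    """Follow the single parent link upwards until the top element."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


print(path_to_root("Homo sapiens"))
```

Because every element has at most one parent, there is exactly one path from any element to the top, which is what distinguishes a hierarchy from a DAG.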

  1. What is a taxonomy? Give a brief definition of a taxonomy!

A taxonomy is a collection of controlled vocabulary terms organized into a
hierarchical structure. Each term in a taxonomy is in one or more parent-child
relationships to other terms in the taxonomy. A taxonomy has only relations of type “is_a”.

  1. What is an ontology? What are the essential features of an ontology that distinguishes
it from a taxonomy?

An ontology is a controlled vocabulary expressed in an ontology representation language,
which has a grammar for using vocabulary terms to express something meaningful
within a specified domain of interest. An ontology is organized as a DAG.

  1. Which of the above mentioned controlled vocabularies has a tree structure?


  1. What is a directed acyclic graph (DAG) and which type of knowledge representation
is based on such a DAG?

A DAG is a type of graph that has no cycles and in which all edges are oriented in one direction.
Ontologies are based on DAGs. For example, in GO:

·         mitochondrion has two parents: it is an organelle and it is part of the cytoplasm;
·         organelle has two children: mitochondrion is an organelle, and organelle membrane is part of organelle
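
The mitochondrion example can be written down as a small DAG in Python. This is a minimal sketch: only the four terms and three relations mentioned above are included, and the function names `ancestors` and `is_acyclic` are assumptions made for this illustration.

```python
# Each term maps to a list of (relation, parent) edges; a term may have
# several parents, which is what a tree would not allow.
parents = {
    "mitochondrion": [("is_a", "organelle"), ("part_of", "cytoplasm")],
    "organelle membrane": [("part_of", "organelle")],
    "organelle": [],
    "cytoplasm": [],
}


def ancestors(term):
    """Collect every term reachable by following parent edges upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_acyclic():
    """A graph is acyclic if no term is its own ancestor."""
    return all(term not in ancestors(term) for term in parents)


print(ancestors("mitochondrion"), is_acyclic())
```

Adding an edge from organelle back to mitochondrion would make `is_acyclic()` return False, which is exactly the situation a DAG forbids.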

  1. Please explain / characterize the content of PubMed: what does a typical minimum
data set look like in PubMed? I refer to the “anatomy of search results page”
mentioned in the PubMed documentation.

PubMed search results are displayed in a summary format. Each summary result has the following anatomy:
1.      Title
2.      Abbreviated names of authors
3.      Abbreviated journal title
4.      Publication date
5.      Volume, issue and page numbers of the article

  1. What is the difference between PubMed and MEDLINE? Explain in brief!

MEDLINE is a bibliographic database containing citations and abstracts of bioscience
articles. PubMed is a service under NCBI Entrez search and retrieval system. PubMed
provides access to bibliographic information that includes MEDLINE and some other
resources (PubMedCentral and articles from journals before MEDLINE-inclusion and
out-of-scope articles). It also provides links to free full-text articles (if available).

  1. What is PubMedCentral? What does it contain and how does it differ from PubMed?

PubMed Central is a free digital archive of full-text biomedical and life sciences journal literature at the U.S. National Institutes of Health (NIH), developed and managed by NIH's National Center for Biotechnology Information (NCBI) in the National Library of Medicine (NLM). Unlike PubMed, which holds citations and abstracts, PubMed Central stores the full text of the articles.

Tuesday, February 14, 2012

Bio-databases...being meaning for long..

1.     Object orientation: What is an object in Biology and what “methods” can biological objects execute? Provide at least three examples for “methods” executed by bio-objects.
An object in biology is a class of biomolecules that has specific attributes and the ability to execute certain methods.
DNA, RNA, Proteins, Lipids and Sugars (Carbohydrates) have different methods as “Bio-Object”s.
-DNA has methods such as
1)      strand separation
2)      replication
3)      packaging in chromatin
4)     Re-association / hybridization
-RNA has methods such as
1)      interaction with ribosome
2)      recognition of other RNAs
3)      Re-association / hybridization
-Protein has methods such as
1)      interaction with other biomolecules
2)      enzymatic functions
3)      structural functions
4)      transport functions
-Lipids have methods such as
1)      interaction with other biomolecules
·         e.g. association to form biomembranes
·         e.g. binding (covalent and non-covalent) to proteins
-Carbohydrates have methods such as
1)      interaction with other biomolecules
·         polymerization
·         conjugation (chemical reaction)
2.     What are typical attributes of Nucleic Acids? Name at least three of them!
·         Sequence
·         Secondary structure
·         Patterns and motifs, binding sites for proteins
·         Chemical stability (RNA)
·         Packing in chromatin (DNA)
       3. Which categories of biomolecules do you know?
Categories of biomolecules:
·         Proteins
·         Nucleic acids
·         Lipids
·         Carbohydrates
·         Small molecules (Metabolites, …)

4. Which categories of bioDATABASES correspond to these categories of biomolecules?
Biodatabases are libraries of life science information collected from experiments and publications in various biological areas.

5. Please give at least one example for each category of biodatabase as we see them categorized at the EBI SRS interface
1)    Literature, Bibliography and Reference Databases – MEDLINE, OMIM, Karyn’s Genomes etc.
2)    Gene Dictionaries and Ontologies – UNILIB, GO, HGNC, UniGene, ENTREZGENE, SO(Sequence ontology) etc.
3)    Nucleotide Sequence Databases – EMBL, RefSeq Genome, Patent DNA, Genome Reviews etc.
4)    Nucleotide Related Databases – TRANSFAC, TRANSCELL, TRANSSITE, TRANSGENE.
5)    UniProt Universal Protein Resource – UniProtKB, UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, UniRef100, UniRef90, UniRef50, UniParc
6)    Other Protein Sequence Databases – Patent Proteins, EPO Proteins, JPO Proteins, RefSeq Proteome etc.
7)     Protein Function, Structure and Interaction Databases
·         Function – PEP (ORFs), InterPro, PROSITE etc.
·         Structure – PDB etc.
·         Interaction – Experiment, Interactor, Interaction
8)    Enzymes, Reactions and Metabolic Pathway Databases – LENZYME, ENZYME, UPATHWAY, UREACTION etc.
9)    Mutation and SNP Databases – HGVBASE

       6. Which major portals to biodata do you know?
·         SRS (Sequence Retrieval System) for EBI (European Bioinformatics Institute)
·         ENTREZ for NCBI (National Center for Biotechnology Information)
·         ExPASy (Expert Protein Analysis System) for SIB

7. What sort of discrepancy exists between biodatabases that represent information on genes & genomes as opposed to biodatabases that store information on gene expression?
Gene and genome databases cover sequence information that should be identical in every cell and under every condition of the analysed organism. In contrast, information stored in gene expression databases may vary with the analysed tissue or the experimental conditions, and these databases therefore also cover information about the actual experiment the data came from. They can also give insight into the function and regulation of genes through information on gene expression levels under certain conditions.
8.     What is a “flat file database”?
A “flat file database” describes any of various means to encode a database model (most commonly a table) as a single file. It is a relatively simple database system in which each database is contained in a single table, which usually contains one record per line and in which the individual fields can be separated by delimiters.
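
A minimal sketch of reading such a file, assuming one record per line and “|” as the delimiter. The column names, accession numbers and lengths below are invented for this illustration and do not follow any real database format.

```python
import csv
import io

# Invented flat file content: a header line, then one record per line.
flat_file = (
    "accession|organism|length\n"
    "X00001|Homo sapiens|1200\n"
    "X00002|Mus musculus|950\n"
)

# csv.DictReader uses the first line as field names.
reader = csv.DictReader(io.StringIO(flat_file), delimiter="|")
records = list(reader)

# Without an index, a query is a scan over all records.
human_accessions = [
    record["accession"]
    for record in records
    if record["organism"] == "Homo sapiens"
]
print(human_accessions)
```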
9.      What features would you assign to “Bioinformatics” and how does it differ from “Systems Biology”?
Systems biology is a biology-based interdisciplinary field of study that focuses on complex interactions in biological systems and how these interactions give rise to the function and behavior of the system, while bioinformatics is the application of computer technology to the management of biological information.
Hence, bioinformatics takes an exhaustive approach in which all information obtained from a biological sample (such as biomolecule sequences, their structures and interactions) is stored electronically and most of it is analyzed further. Systems biology, however, deals only with studying how different pieces of biological information come together to perform as a system.
10. Annotate a sketch of biomolecules with entity-types from biodatabases you know.
The biomolecules are described by the following entity-types in biodatabases:
DNA and RNA and proteins: Sequence, sequence length, organism name, taxon etc.
Carbohydrates and lipids: Molecular structure, Mass, Nomenclature etc.
11.How are BioDatabases integrated in SRS? What does the SRS documentation say about the mechanisms used for linking between biodatabases?
A databank entry may contain references to other databanks, and vice versa. In SRS these relationships are known as links and can be used to extend a query across multiple databanks. Thus you can obtain all the entries in one databank that are linked to an entry (or entries) in another databank.
From a user perspective there are two types of link: hypertext links and index links (query links).
Hypertext links are links between entries which are displayed as hypertext. These are hardcoded into SRS and you can use them whenever you wish. They are useful for examining entries that are referenced directly from entries.
Index links are built into the SRS indices at the same time as databanks are added. They allow you to construct queries using relationships between databanks. They require SRS to search through entries or indices in other databanks, looking for matches.
12.What does “computer readable knowledge” mean?
Computer readable knowledge is knowledge represented in a form that a computer can process: a set of data, often in the form of rules, that describes the knowledge in a logically consistent manner so that it can be understood by both humans and computers.
For example, in the context of biodatabases, computer readable knowledge means that a proper set of ontologies and controlled vocabularies is in place, which can be understood by one database and referred to in various others.
13.The EBI call itself “the portal to knowledge”. How is biomedical knowledge
represented in biodatabases?

Biomedical knowledge is represented in different biodatabases, such as OMIM, MEDLINE, MeSH, and PubMed. An OMIM entry is a review focusing on a disease, its phenotypic appearance, and the genes related to its etiology. MEDLINE is the largest component of PubMed and is a freely accessible online database of biomedical journal citations and abstracts created by NLM. PubMed contains MEDLINE plus citations from other sources of biomedical literature. Finally, the MeSH thesaurus is used by NLM for indexing articles from biomedical journals for the MEDLINE®/PubMed® database.

14.In one of the links to “relevant background information”, a primer on molecular
biology mentions “linkage disequilibrium”. What does this term mean?

 Linkage disequilibrium describes a situation in which some combinations of genes or genetic markers occur more or less frequently in a population than would be expected from their distances apart.
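
One common way to quantify this is the coefficient D, the difference between the observed frequency of a two-allele combination and the frequency expected if the alleles were independent. The sketch below assumes this measure; the function name `ld_coefficient` and the frequencies are made up for illustration.

```python
def ld_coefficient(p_ab, p_a, p_b):
    """Return D = p_AB - p_A * p_B.

    p_ab: observed frequency of the haplotype carrying alleles A and B
    p_a, p_b: frequencies of allele A and allele B in the population
    D is zero when the alleles are combined independently.
    """
    return p_ab - p_a * p_b


# Made-up frequencies: A and B occur together more often than expected.
d = ld_coefficient(p_ab=0.4, p_a=0.5, p_b=0.5)
print(round(d, 3))
```

A positive D means the combination occurs more frequently than expected, a negative D means it occurs less frequently.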