opacmo: Introducing the Open Access Mortar

Posted on July 25, 2011



Today a new web resource, www.opacmo.org, has been born that will help biologists, physicists and life sciences researchers in general to find publications linked to genes, species, diseases, cellular components, biological processes and molecular functions. This blog post gives a brief introduction to opacmo’s features, briefly explores the background behind its development, highlights the data resources that opacmo provides, and explains how the latter can be easily reproduced. The blog post closes with an outlook on opacmo’s future.

Introduction

opacmo's logo

The Open Access Mortar, a.k.a. opacmo, is a mash-up of several bioinformatics assets. opacmo links open access publications to biomedical resources and provides a search interface for easy information retrieval. opacmo lets you search for officially named terms and presents the publications that are linked to these terms. opacmo also accounts for unofficial synonyms or common names that might be used instead of the officially accepted terms. For example, a search for BRCA2 will automatically include publications that refer to the gene’s synonyms FAD, BROVCA2, etc. For each link, opacmo also shows a visually augmented confidence score, so that you can quickly judge whether a search result might be a false positive. For example, BRCA2’s synonym FAD is also an abbreviation of “Failure Assessment Diagram”, which can be mistaken for a gene mention.
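The synonym handling described above can be sketched in a few lines. This is only an illustration and not opacmo’s actual code: the names SYNONYMS, LINKS and search, the publication identifiers and the scores are all made up for the example; only the BRCA2/FAD/BROVCA2 relationship comes from the text.

```python
# Hypothetical synonym table: official symbol -> synonyms (BRCA2 example from the post).
SYNONYMS = {"BRCA2": ["FAD", "BROVCA2"]}

# Hypothetical links: (publication id, name found in the text, confidence score in [0, 1]).
LINKS = [
    ("pub-1", "BRCA2", 0.9),
    ("pub-2", "FAD", 0.2),      # could be "Failure Assessment Diagram", so a low score
    ("pub-3", "BROVCA2", 0.8),
    ("pub-4", "TP53", 0.9),
]


def search(official_symbol, links=LINKS, synonyms=SYNONYMS):
    """Return all links for the official symbol or any of its synonyms, best score first."""
    names = {official_symbol, *synonyms.get(official_symbol, [])}
    hits = [link for link in links if link[1] in names]
    return sorted(hits, key=lambda link: link[2], reverse=True)


results = search("BRCA2")
for publication, name, score in results:
    print(publication, name, score)
```

The low-scoring FAD hit is still returned, but it sinks to the bottom of the list, which mirrors the idea of letting the reader judge possible false positives by their confidence score.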

opacmo’s gene links are supported by a species normalisation algorithm that helps to pick the best fitting gene depending on the species mentions in a document. For example, the gene BRCA2 can refer to 350 Entrez genes, but taking the species normalisation into account it is possible to subsequently pick the best fitting gene. This particular normalisation technique has been investigated in an earlier blog post, where I evaluated the gene normalisation quality of bioknack’s named entity recognition tool. bioknack’s ubiquitous role in opacmo is addressed in more detail in the Background section of this blog post.
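To give a rough idea of what species normalisation means, here is a minimal sketch that picks, among several candidate genes sharing one name, the candidate whose species is mentioned most often in a document. It is a simplification under my own assumptions: the function pick_gene and the placeholder gene identifiers are hypothetical, and bioknack’s real scoring may well differ.

```python
from collections import Counter


def pick_gene(candidates, species_mentions):
    """Pick the candidate gene whose species is mentioned most often in the document.

    candidates: list of (gene id, species name) pairs that share the same gene name.
    species_mentions: list of species names recognised in the document.
    Returns None if none of the candidates' species is mentioned at all.
    """
    counts = Counter(species_mentions)
    best = max(candidates, key=lambda candidate: counts[candidate[1]])
    return best if counts[best[1]] > 0 else None


# Placeholder identifiers, not real Entrez gene IDs.
candidates = [("gene-human", "Homo sapiens"), ("gene-mouse", "Mus musculus")]
mentions = ["Mus musculus", "Homo sapiens", "Mus musculus"]
print(pick_gene(candidates, mentions))
```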

opacmo also links publications to ontology terms of the Gene Ontology and the Disease Ontology, both of which are part of the Open Biological and Biomedical Ontologies (OBO Foundry). bioknack’s named entity recognition tool was extended for this very purpose, but a detailed description of the term recognition technique would go beyond the scope of this blog post and has to follow later. In a nutshell, opacmo provides links to ontology terms that are typically not found by ordinary full-text search engines. For example, a document containing the phrase “heart and lung disease” would match a full-text search for “lung disease”, but a full-text search for “heart disease” would miss it, because that exact wording does not appear in the document. opacmo accounts for complex sentence structures and will link a document containing the phrase “heart and lung disease” to both Disease Ontology terms “heart disease” and “lung disease”.
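The “heart and lung disease” example can be illustrated with a small matcher that accepts a term when its words occur in order, but not necessarily next to each other. This is a deliberately naive sketch and not bioknack’s term recognition technique; the function names and the max_gap parameter are my own assumptions for the example.

```python
import re


def tokens(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(term, text, max_gap=3):
    """True if the term's words occur in order in the text, with at most
    max_gap other tokens between two consecutive words of the term."""
    words = tokens(term)
    toks = tokens(text)

    def search_from(word_index, start, previous):
        if word_index == len(words):
            return True
        for i in range(start, len(toks)):
            if previous is not None and i - previous - 1 > max_gap:
                break  # too far away from the previously matched word
            if toks[i] == words[word_index] and search_from(word_index + 1, i + 1, i):
                return True
        return False

    return search_from(0, 0, None)


phrase = "Patients with heart and lung disease were excluded."
for term in ("heart disease", "lung disease", "kidney disease"):
    print(term, matches(term, phrase))
```

With max_gap=0 the matcher behaves like an exact phrase search and no longer finds “heart disease” in the sentence. A gap-based matcher like this one will also produce false positives on unrelated words that merely appear close together, which is one reason why confidence scores are useful.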

opacmo also has the ability to combine searches as conjunctive queries, so that it is possible to find publications that are linked to multiple terms. For example, in opacmo it is straightforward to restrict a search to include only those publications that talk about the gene BRCA2 as well as the Human Disease Ontology term “leukemia” in a matter of a few mouse clicks. Further terms can be added to make the search more specific and previously added search terms can be removed again to retrieve a more general search result. opacmo’s minimalistic web-interface has been especially designed to make this otherwise complicated search query formulation as user friendly as possible.
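Conceptually, a conjunctive query is the intersection of the publication sets linked to each search term. The sketch below shows this with plain Python sets; the index contents, the term labels and the function name conjunctive_search are invented for illustration and do not reflect how opacmo stores its data.

```python
# Hypothetical index: search term -> set of linked publication ids.
INDEX = {
    "gene:BRCA2": {"pub-1", "pub-2", "pub-3"},
    "disease:leukemia": {"pub-2", "pub-3", "pub-5"},
    "species:Homo sapiens": {"pub-3", "pub-5"},
}


def conjunctive_search(terms, index=INDEX):
    """Return the publications that are linked to every one of the given terms."""
    if not terms:
        return set()
    publication_sets = [index.get(term, set()) for term in terms]
    return set.intersection(*publication_sets)


query = ["gene:BRCA2", "disease:leukemia"]
print(sorted(conjunctive_search(query)))

# Adding a term makes the result more specific ...
query.append("species:Homo sapiens")
print(sorted(conjunctive_search(query)))

# ... and removing it again returns the more general result.
query.pop()
print(sorted(conjunctive_search(query)))
```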

Last, but not least, opacmo’s search results can be downloaded as Excel files as well as TSV files. This allows you to integrate opacmo’s resources into your work without requiring advanced computer skills. The screenshot below shows what opacmo’s web interface looks like.

opacmo's user interface
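For those who prefer scripting over spreadsheets, a downloaded TSV file can be processed with a few lines of Python. Note that the column names used here (pmcid, entity, score) are assumptions made for this sketch; check the header of the file that you actually downloaded.

```python
import csv
import io

# Stand-in for a downloaded file; the column names are hypothetical.
tsv_text = "pmcid\tentity\tscore\npub-1\tBRCA2\t0.9\npub-2\tFAD\t0.2\n"


def load_results(handle):
    """Read tab-separated search results and convert the score column to a float."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [dict(row, score=float(row["score"])) for row in reader]


rows = load_results(io.StringIO(tsv_text))

# Keep only results above a confidence threshold of your choice.
confident = [row for row in rows if row["score"] >= 0.5]
print(confident)
```

For a real file, replace io.StringIO(tsv_text) with open("results.tsv", newline=""), where results.tsv is whatever name you saved the download under.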


Background

opacmo is based on experience that I gained during my early career as a bioinformatician. The following paragraphs capture very loosely the events that eventually led to the development and release of opacmo.

I have previously developed and set up the BioMart 0.7 based web site www.pubmed2ensembl.org that links publications to genes. pubmed2ensembl consists of several data sources, the largest of which is provided by a run of the gene normalisation tool GNAT/LINNAEUS over the publication abstracts of MEDLINE’s 2010 baseline. pubmed2ensembl focuses only on linking publications to genes, but I find that there is very little difference between recognising and normalising gene mentions and doing the same for other entities, as long as those entities are somewhat distinguishable from their surrounding textual context. To test this hypothesis, I developed tools for the generic recognition of arbitrary entities in another project of mine, and these tools have found their way into opacmo: the bioknack tool suite.

bioknack’s bk_ner.rb is a dictionary-based named-entity recognition tool that can efficiently locate word compounds (one or more words, not necessarily consecutive) in documents and provide identifier based links between them. bk_ner.rb supports synonyms by linking word compound entities to more than one identifier (parameter -m relational) and it allows for the recognition of multi-word compounds that appear non-consecutively in documents (parameter -y). Named entity recognition pipelines can be set up with bk_ner.rb’s wrapper script bk_ner_gn.sh. I have evaluated a predecessor of bk_ner_gn.sh over the gold standard of BioCreative 3 and presented the results in a previous blog post, where I showed that it outperforms GNAT/LINNAEUS as a document retrieval system.

bk_ner_gn.sh and bk_ner.rb require orders of magnitude less memory and processing power than GNAT/LINNAEUS, so that it is possible for me to carry out text mining without the need for a computer cluster. The efficient execution of bioknack’s generic normalisation capabilities permits a short processing turnaround, which means that there is a lower risk of a bk_ner_gn.sh based resource becoming “stale”, as can quite regularly be seen with computationally demanding bioinformatics resources. opacmo builds on top of bioknack’s bk_ner_gn.sh to avoid a similar fate and to keep its resources up to date.

opacmo’s Resources

opacmo is currently in its startup phase and includes only linked information from the open access BioMed Central journals. The table below shows the number of linked publications, genes, species and ontology terms (diseases, cellular components, biological processes and molecular functions) available through opacmo’s web site:

opacmo’s linked BioMed Central journals
30,770 Open access publications
107,969 Entrez genes
4,614 Species
5,594 Ontology terms
Ontology terms broken down by source:
3,913 Gene Ontology terms
1,681 Disease Ontology terms

 

The plan is to include the complete open access subset of PubMed Central in opacmo, which would then provide links for more than 200,000 documents. It will only take about 2 months to process the remaining documents of PubMed Central as a background process on a plain vanilla Mac. Even though I cannot reveal more information right now, there is the possibility that I could collaborate with someone who has access to a high performance Linux cluster, which would speed up the release of the full dataset considerably.

Technical Background

The web-site is only the tip of the iceberg that is opacmo. opacmo’s repository includes the scripts that form the complete pipeline for building the database that serves the web site. The pipeline scripts make use of bioknack — the aforementioned bioinformatics tool suite — that in itself features a self-contained pipeline for carrying out the text mining tasks needed to generate opacmo’s links between publications and genes, species, etc.

opacmo’s pipeline centers around two scripts: make_opacmo.sh and load_opacmo.sh. The former script downloads the necessary resources for creating opacmo, carries out the text mining and performs some additional filtering over bk_ner_gn.sh’s results so that they can be stored in a denormalised database schema for quick information retrieval. The latter script, load_opacmo.sh, then populates the database with the denormalised data; currently only the two databases PostgreSQL and MongoDB are supported.
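To illustrate what “denormalised” means here, the following sketch stores every link as one self-contained row, so that a lookup needs no joins. It uses Python’s built-in sqlite3 module purely as a stand-in for PostgreSQL or MongoDB, and the table name, column names and rows are hypothetical; opacmo’s real schema is defined by its own scripts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database

# One wide table: every row repeats the entity type and name instead of
# referencing separate gene, species or ontology tables.
conn.execute(
    "CREATE TABLE links (pmcid TEXT, entity_type TEXT, entity_name TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO links VALUES (?, ?, ?, ?)",
    [
        ("pub-1", "gene", "BRCA2", 0.9),
        ("pub-1", "disease", "leukemia", 0.7),
        ("pub-2", "gene", "BRCA2", 0.4),
        ("pub-2", "species", "Homo sapiens", 0.8),
    ],
)
# An index on the searched columns keeps lookups quick.
conn.execute("CREATE INDEX links_by_entity ON links (entity_type, entity_name)")

hits = conn.execute(
    "SELECT pmcid, score FROM links "
    "WHERE entity_type = ? AND entity_name = ? ORDER BY score DESC",
    ("gene", "BRCA2"),
).fetchall()
print(hits)
```

The trade-off is the usual one: the table is larger because names are repeated, but a search touches a single table, which is what makes retrieval quick.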

If you are interested in recreating opacmo’s backend, then you can do so by executing the following commands on either Mac OS X (10.6 or newer) or Linux (Debian 6.0.2.1 or newer):

Linux: apt-get -y install git
Linux: apt-get -y install ruby
Linux: apt-get -y install gawk
Linux / OS X: mkdir opacmo_build ; cd opacmo_build
Linux / OS X: git clone https://github.com/joejimbo/bioknack
Linux / OS X: git clone https://github.com/joejimbo/opacmo
Linux / OS X: export PATH=$PATH:`pwd`/bioknack:`pwd`/opacmo
Linux / OS X: make_opacmo.sh all

It is fastest to run bioknack’s bk_ner_gn.sh with JRuby (uncomment the adjacent variables ner_maxmem and ruby_interpreter in bk_ner_gn.sh), but a lot of memory is required due to JRuby’s implementation. The best compromise between computational speed and memory requirements is achieved using Ruby 1.9. bioknack needs at least Ruby 1.8.7 to function — Ruby 1.8.6 definitely does not work with the scripts.

Further Work

opacmo’s backend currently only links 30,077 publications to genes, species, diseases, cellular components, biological processes and molecular functions. Over the coming months opacmo will be extended to provide links for documents of the complete open access subset of PubMed Central. I will also keep working on opacmo’s user interface to make it usable not only by bioinformaticians, but also by the biomedical researchers who can probably harness opacmo’s resources best.

Acknowledgements

Many thanks go to Miyuki Fukuma who provided web-design consultation on early versions of opacmo’s web interface. I also thank Maximilian Haeussler for pointing out several issues that plagued previous versions of the scripts under Linux.

Posted in: Bioinformatics