Evaluating bioknack’s Gene Normalisation on BioCreative 3

Posted on June 29, 2011



In a previous postdoc placement I developed and set up the customised BioMart pubmed2ensembl, which links publications to genes. Keeping pubmed2ensembl’s data sources up to date proved to be complicated, especially as the gene name recognition runs of GNAT/LINNAEUS, the tool that we used, took weeks to complete on the local computing cluster. With the recent presentation of the gene normalisation results of BioCreative 3, I re-evaluated alternative tools for creating a data set similar to pubmed2ensembl in a more straightforward and less time-consuming manner. I also implemented a script for gene normalisation as part of my bioknack bioinformatics tool suite in order to understand gene normalisation tools better. This blog post gives a tool-centric presentation of the BioCreative 3 results as reported in the BioCreative 3 Proceedings, covering TAP-k scores as well as precision, recall and F1-score, and including the retrospectively calculated performance of bioknack’s gene normalisation script. I (selfishly?) conclude that bioknack’s gene normalisation is a “good enough” solution, in terms of normalisation quality, for creating a data source of linked publications and genes; it is certainly the leanest gene normalisation tool (606 lines of code, 434 lines without comments and blank lines).

Motivation

Gene normalisation addresses the problem of linking specific genes (via unique gene identifiers) to text documents (for example publications) in which they have been mentioned. Establishing links between publications and genes is of interest to the genetics community, which wishes to integrate publication link-outs into its genome browsers and gene/protein databases. In a previous appointment I worked on a project, pubmed2ensembl, that provided such links. The creation of a data source such as pubmed2ensembl requires a seemingly simple workflow:

  1. retrieve latest gene records
  2. retrieve latest species taxonomy
  3. retrieve latest publications
  4. build gene/species dictionaries
  5. build corpus for text-mining
  6. run the gene recognition/normalisation
  7. make the output publicly available

In the real world, this workflow proves to be rather labour-intensive. Many of the normalisation tools that I looked at require detailed knowledge to generate their gene/species dictionaries, and the remaining tools require too much computing time to work on large-scale text resources such as MEDLINE. bioknack’s lean gene normalisation aims to overcome these problems whilst still achieving normalisation results of acceptable quality.

BioCreative 3’s Gene Normalisation Task (GN Task) is an excellent resource for evaluating the quality of gene normalisation tools. Of course, since I carried out the evaluation of bioknack’s gene normalisation in retrospect, I had to make sure that I did not reverse engineer good evaluation scores. I have therefore kept my gene normalisation algorithm as lean and straightforward as possible, thus opening it to the scrutiny of critics. Unlike more sophisticated tools, bioknack’s gene normalisation does not make use of part-of-speech analysis, contextual information about mentions or an elaborate scoring algorithm.

bioknack’s Gene Normalisation

The core of bioknack’s gene normalisation is the named entity recognition script bk_ner.rb, which I presented in an earlier blog post when comparing its performance to the dictionary-based keyword finder LINNAEUS for MeSH-term recognition over MEDLINE. bk_ner.rb is an efficient dictionary-based named entity recogniser that can run in various operational modes and produce multiple output formats. The true speed of bk_ner.rb is not revealed on the small BioCreative 3 corpus though, because the tool spends a disproportionately large amount of time loading the dictionaries (especially in the “relational” dictionary mode used here, see the “-m” parameter of bk_ner.rb) compared to the actual entity recognition. However, the script’s versatility is demonstrated by the fact that I use it both for finding gene mentions in text and for determining the species discussed in the text.
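To make the idea of dictionary-based recognition concrete, here is a minimal sketch in Python. It is not bk_ner.rb's algorithm: the function names, the tokenisation rule and the longest-match strategy are all assumptions chosen for illustration, and the example identifiers are only sample data.

```python
import re
from collections import defaultdict


def build_dictionary(entries):
    """Map each lower-cased name to the set of identifiers that carry it.

    `entries` is an iterable of (identifier, name) pairs; this layout is a
    hypothetical stand-in for a real gene or species dictionary.
    """
    dictionary = defaultdict(set)
    for identifier, name in entries:
        dictionary[name.lower()].add(identifier)
    return dictionary


def recognise(text, dictionary, max_tokens=3):
    """Return (start, end, surface form, identifiers) for every dictionary hit.

    Candidate spans of up to `max_tokens` alphanumeric tokens are looked up
    case-insensitively; at each start token the longest matching span wins.
    """
    tokens = [(m.start(), m.end()) for m in re.finditer(r"[A-Za-z0-9]+", text)]
    hits = []
    for i in range(len(tokens)):
        # Try the longest span first, then shrink it token by token.
        for j in range(min(len(tokens), i + max_tokens), i, -1):
            start, end = tokens[i][0], tokens[j - 1][1]
            surface = text[start:end]
            identifiers = dictionary.get(surface.lower())
            if identifiers:
                hits.append((start, end, surface, sorted(identifiers)))
                break
    return hits
```

Because one name can map to several identifiers, every hit carries a list of candidates; resolving that ambiguity is the actual normalisation step.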

The actual gene normalisation with bioknack is carried out by the bash script bk_ner_gn.sh in general and by run_bc3_ner.sh for BioCreative 3. The difference between the two scripts is simple: run_bc3_ner.sh is not developed further, whereas bk_ner_gn.sh is currently being tuned to simplify and speed up the gene normalisation process even further. Both scripts cover the full workflow of a text-mining run as explained for pubmed2ensembl above. Given file databases of genes/species as well as text/XML documents (download URLs for Entrez Gene, RefSeq, Taxonomy and the BioCreative 3 GN Task corpus are provided with the tool), the script automatically creates the dictionaries for use with bk_ner.rb and extracts a text-mining corpus. It then carries out the actual text-mining for genes and species separately; the two result sets are subsequently joined and scored to produce the final result set.
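The post does not spell out how the gene and species results are joined and scored, so the following Python sketch is purely hypothetical. It assumes that a gene candidate is kept only if its species was also detected in the same document, and that the score is simply the number of mentions; the function name and all data layouts are invented for illustration.

```python
from collections import Counter


def join_and_score(gene_hits, species_hits, gene_species):
    """Join gene and species mentions per document and count supporting mentions.

    gene_hits:    iterable of (document_id, gene_id), one entry per mention
    species_hits: iterable of (document_id, taxon_id), one entry per mention
    gene_species: dict mapping gene_id to the taxon_id the gene belongs to
    Returns (document_id, gene_id, score) rows, best score first per document.
    """
    species_by_document = {}
    for document_id, taxon_id in species_hits:
        species_by_document.setdefault(document_id, set()).add(taxon_id)

    scores = Counter()
    for document_id, gene_id in gene_hits:
        # Assumption: discard genes whose species is not mentioned in the document.
        detected = species_by_document.get(document_id, set())
        if gene_species.get(gene_id) in detected:
            scores[(document_id, gene_id)] += 1

    rows = [(doc, gene, score) for (doc, gene), score in scores.items()]
    return sorted(rows, key=lambda row: (row[0], -row[2], row[1]))
```

A species filter of this kind matters because gene names are frequently shared between orthologues in different organisms.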

Running bioknack on BioCreative 3’s GN Task

The bash script run_bc3_ner.sh can be used to recreate the results that I achieved on the BioCreative 3 GN Task corpus. It was written on OS X 10.6 (Snow Leopard) and works with both awk (installed by default on Macs) and gawk (sudo port install gawk). The script should also work on Linux, because it uses — besides the bioknack tool suite — only the common Unix tools bash, cut, awk/gawk, grep, sed, sort, tr and uniq as well as Ruby.

Including the installation of bioknack, the download of the gene-/species-dictionaries and the BioCreative 3 GN Task corpus, the data for the evaluation below can be reproduced as follows:

mkdir biocreative
cd biocreative
mkdir dictionaries tmp
git clone git://github.com/joejimbo/bioknack.git
wget ftp://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz
wget ftp://ftp.ncbi.nih.gov/refseq/release/release-catalog/release*.accession2geneid.gz
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
wget http://www.biocreative.org/media/store/files/2010/BC3GNTest.zip
gunzip gene_info.gz
gunzip release*.accession2geneid.gz
mv gene_info release*.accession2geneid dictionaries
tar -xzf taxdump.tar.gz -C dictionaries names.dmp
unzip BC3GNTest.zip
rm -f BC3GNTest.zip taxdump.tar.gz
./bioknack/run_bc3_ner.sh
OS X: run ‘sudo port install gawk’ or set ‘awk_interpreter=awk’ in ‘run_bc3_ner.sh’.

The output is written to the file ‘bc3gn_bioknack’:

head -n 5 bc3gn_bioknack
2858555 41273 10
2858555 178845 6
2858555 44279 5
2858555 43838 4
2858555 39558 4
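For further processing, a result set like this can be read with a few lines of Python. The sketch below assumes that the three whitespace-separated columns are the document identifier, the Entrez Gene identifier and the confidence score, which matches the task description but is not documented in the listing itself; the function name is made up.

```python
from collections import defaultdict


def read_result_set(lines):
    """Group (gene_id, score) pairs by document, best score first.

    Assumed column layout: document_id, gene_id, score.
    """
    by_document = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        document_id, gene_id, score = fields
        by_document[document_id].append((gene_id, float(score)))
    for hits in by_document.values():
        hits.sort(key=lambda hit: hit[1], reverse=True)
    return dict(by_document)


if __name__ == "__main__":
    with open("bc3gn_bioknack") as handle:
        results = read_result_set(handle)
    print(len(results), "documents with at least one gene")
```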

BioCreative 3 GN Task Evaluation

BioCreative 3’s GN Task was set to determine the quality of gene normalisation tools over a chosen set of open access PubMed Central documents. The documents were provided as fully annotated XML files (one XML file per document) that included structural information about the documents and some semantic annotations. The GN Task challenge was to link documents and Entrez Gene identifiers according to their mention in the documents and to assign a confidence score for each established link. This task can be imagined as a Google-like search, where the results are sorted by score (best to worst) whilst showing which documents contain which genes.

The gene normalisation results were officially evaluated by the BioCreative organisers using the Threshold Average Precision metric, which I will address next. Some BioCreative participants also included measurements of precision, recall and F1-score (F-score) in the BioCreative proceedings. For a more complete picture of these measures over all tools, I asked all BioCreative GN Task participants to send me the result sets that they submitted to BioCreative’s GN Task, so that I could calculate precision, recall and F-score myself and visualise them in an overview diagram in this blog post. I address this below for the tools of those researchers who got back to me with either their result sets or their calculated precision, recall and F-score numbers.

Note: I only carried out the evaluation over the Gold Standard provided by the BioCreative organisers. The Gold Standard consists of 50 documents that have been manually curated for Entrez Gene identifiers. There are also two Silver Standards provided by the organisers, which I will not use for the evaluation presented here. Roughly speaking, the Silver Standards are produced by taking the common denominator among all result set submissions as a source for automatically annotating the 50 documents of the Gold Standard and a second set of 507 documents (including the documents from the Gold Standard). I have the impression that the Silver Standards no longer measure the quality of the returned Entrez Gene identifiers, but that they are instead a metric for a tool’s degree of conformity.

Threshold Average Precision

Threshold Average Precision (TAP) is a fairly new measure that aims to replace the Receiver Operating Characteristic. TAP is a metric for use with Google-like search results, i.e. it is a metric to evaluate the performance of search results that are ordered by a score. For BioCreative 3, the more specific TAP-k score was used, which traverses the search results in order of their scores (best to worst) and returns its value once k false positives have been found during the traversal.
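The following Python sketch illustrates the idea for a single ranked result list: it walks down the list, stops at the k-th false positive, and averages the precision at each relevant hit together with the precision at the cut-off. It is a simplification of TAP-k, not the official computation. The function name, the boolean input format and the normalisation by the number of relevant records plus one are assumptions made for illustration, and the official TAP software additionally derives a single score threshold across all queries, which is not reproduced here.

```python
def truncated_average_precision(ranked_relevance, total_relevant, k):
    """Average precision of a ranked list, cut off at the k-th false positive.

    ranked_relevance: list of booleans, best-scored result first
                      (True = correct gene, False = false positive)
    total_relevant:   number of correct genes in the gold standard
    k:                number of false positives tolerated before stopping
    """
    hits = 0
    errors = 0
    seen = 0
    precision_sum = 0.0
    for is_relevant in ranked_relevance:
        seen += 1
        if is_relevant:
            hits += 1
            precision_sum += hits / seen  # precision at this relevant result
        else:
            errors += 1
            if errors == k:
                break  # cut-off reached
    # Precision at the cut-off (or at the end of a list with fewer than k errors).
    terminal_precision = hits / seen if seen else 0.0
    return (precision_sum + terminal_precision) / (total_relevant + 1)
```

With a small k, only the top of the ranking contributes, which is why a tool whose first false positives appear late is rewarded more by TAP-5 than by TAP-20.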

The diagrams below show the tools and their achieved TAP-5, TAP-10 and TAP-20 scores as they were presented in the BioCreative 3 proceedings plus the results for my retrospective bioknack evaluation. Participants in the challenge were allowed to submit multiple result sets (up to 3), but for clarity I am only showing the results for the respective latest submissions. This decision is in my eyes justified since the score goes up for most tools (10 out of 15) with later submissions.

TAP-5

Depicted: Either gene normalisation tool name, if a tool name was explicitly stated by a team; or the name(s) of the tool(s) used for the gene normalisation. For example, KNoGM is a gene normalisation tool name whereas BANNER denotes the named entity recognition tool used by team 89.
* The name UZH was suggested to me by private communication. It does not appear in the proceedings.

TAP-10

TAP-20


bioknack’s gene normalisation (pale blue bar) is placed in the midfield of TAP-5 scores, between NERsuite and BANNER. The stricter TAP-5 metric denotes higher quality results. For the less strict TAP-10 and TAP-20 metrics, bioknack drops behind BANNER as more false positives are permitted in the result set. This means that BANNER returns more false positives among its best scored genes, but good genes are still trickling in, whereas the quality of bioknack’s results drops more sharply once the first false positives are returned.

The light grey bar denotes the TAP-k scores for bioknack when NCBI’s RefSeq database is excluded from bioknack’s gene dictionaries. I am addressing this point separately, because I think that many tools did not include this database and might consequently have underperformed in the evaluation. Without the RefSeq database, bioknack’s gene normalisation performs less well, but still appears in the same score range as before.

Precision, Recall and F-Score

Precision, recall and F1-score (F-score) are established metrics for evaluating the performance of text-mining algorithms. In the context of gene normalisation discussed here, these metrics evaluate the quality of an algorithm’s returned gene identifiers without taking a score into account. These metrics are important whenever the results are not presented in an ordered fashion. For example, pubmed2ensembl links publications to genes and displays the results as a set of PubMed/PubMed Central identifiers, which removes GNAT/LINNAEUS’ confidence scoring and makes the linked identifiers appear to be of equal quality. Precision tells us whether the results of a tool are generally so good that such a presentation is justified, recall expresses what share of the results that should have been found were actually found (the remainder being the false negatives that we are missing out on), and the F-score is the harmonic mean of precision and recall, which summarises the results’ quality as a single number.
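Expressed in code, the three metrics come down to a few set operations. The Python sketch below treats a result set and the gold standard as sets of (document, gene) pairs; the function name and this pair representation are assumptions for illustration and are not taken from bk_stats_biocreative_3.rb.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for unordered (document, gene) links.

    predicted: iterable of (document_id, gene_id) pairs returned by a tool
    gold:      iterable of (document_id, gene_id) pairs from the gold standard
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of returned links that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold-standard links that were found.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because the harmonic mean is dominated by the smaller of its two inputs, a tool cannot reach a high F-score by maximising recall at the expense of precision, or vice versa.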

The diagram below shows precision, recall and F-score for those tools where I either received the submitted result sets or where the researchers sent me their calculated results for these metrics. The axes denote precision (y-axis) and recall (x-axis), whereas the numbers adjoining the dots denote the F-score.

Precision, Recall and F-Score

bioknack’s gene normalisation (light blue dot) performs “good enough” when considering the overall F-score. With an F-score of 0.19 (0.1857 unrounded) it slightly outperforms BANNER and is placed fourth among the F-scores of all tools. Its precision (0.1553) places bioknack fifth, where it performs better than KNoGM but worse than BANNER. bioknack’s recall (0.2306) almost coincides with BIOADI/LINNAEUS’s recall (0.2364). When RefSeq is removed from bioknack’s gene dictionaries (violet dot), the tool’s F-score approaches that of BANNER, whereas precision and recall drop only slightly.
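As a quick consistency check, the F-score can be recomputed from the rounded precision and recall quoted for bioknack; the result agrees with the reported 0.1857 to within the rounding of the two inputs:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.1553 \times 0.2306}{0.1553 + 0.2306}
    = \frac{0.07162}{0.3859}
    \approx 0.1856
```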

Note: I was informed by private communication that the poor precision of GNAT/LINNAEUS in BioCreative 3 is due to modifications to the tool in order to achieve higher TAP-k scores.
The low precision of GNAT/LINNAEUS in BioCreative 3 does not imply that pubmed2ensembl’s established links between publications and genes are of bad quality. For pubmed2ensembl, we used an older version of GNAT/LINNAEUS that achieves high precision and recall, as published previously.

Recap and Outlook

This blog post addressed gene normalisation and hinted at the difficulties of setting up a data source of linked publications and genes such as pubmed2ensembl. I presented a lightweight gene normalisation tool as part of the bioknack tool suite that aims to simplify the labour- and computation-intensive workflow of creating pubmed2ensembl-esque data sources. I then showed that the quality of bioknack’s gene normalisation solution is “good enough” by evaluating its TAP-k score, precision, recall and F-score for the BioCreative 3 GN Task.

An evaluation of the runtime performance of bioknack’s gene normalisation has to wait until another blog post. In the long term I hope that the toolkit progresses in such a way that it will be possible to carry out gene normalisation on all MEDLINE abstracts on a home computer. It will then be an interesting question whether it is better to set up a service similar to pubmed2ensembl or whether it might be easier to provide the gene normalisation data set to established genomic database providers such as Ensembl or UCSC for data integration.

Acknowledgements

My greatest thanks go to the researchers who provided their submitted BioCreative 3 GN Task data: Shashank Agarwal, Joerg Hakenberg, Minlie Huang and his student Jingchen Liu, Naoaki Okazaki, Fabio Rinaldi and his colleague Simon Clematide, and Karin Verspoor. I am also grateful that Chun-Nan Hsu and Hung-Yu Kao sent me references where I could look up their calculated metrics (precision, recall and F-score).

I would also like to thank Martin Gerner and Cheng-Ju Kuo for answering some questions of mine and John Spouge for fixing a bug in TAP1.7 (number of 0-relevant records were not permitted) and releasing TAP1.8 the next day.

Addendum

Repo and Versioning

bioknack is publicly available under https://github.com/joejimbo/bioknack.

This blog post refers to bioknack’s commit ee18d1fd926010688f2f24831b7a246b2baedfc9.

TSV-Files and Statistics Scripts

wget http://www.biocreative.org/media/store/files/2010/GNTestEval.zip
unzip GNTestEval.zip
wget ftp://ftp.ncbi.nih.gov/pub/spouge/web/software/TAP_1.8/TAP_1.8.zip
unzip TAP_1.8.zip
rm *.url readme.txt *.zip
cd TAP_1.8
../bioknack/bk_fmt_ner_to_tap.rb ../bc3gn_bioknack ../GNTestEval/test50.gold.txt
perl tap.pl -f summary.lst -k 5
perl tap.pl -f summary.lst -k 10
perl tap.pl -f summary.lst -k 20
../bioknack/bk_stats_biocreative_3.rb ../GNTestEval/test50.gold.txt ../bc3gn_bioknack

The scores are transcribed from the tool’s output to the TSV files without rounding.

TAP-k scores: gold_latest_submission.tsv
Precision, recall, F-scores: gold_precision_recall_fscore.tsv

Rename -tsv.doc files to .tsv.

Line Counts

cat bioknack/run_bc3_ner.sh bioknack/bk_ner_accumulate_score.rb \
    bioknack/bk_ner.rb bioknack/bk_ner_fmt_entrezgene.rb | wc -l
 606
cat bioknack/run_bc3_ner.sh bioknack/bk_ner_accumulate_score.rb \
    bioknack/bk_ner.rb bioknack/bk_ner_fmt_entrezgene.rb | grep -v -E '^$|\s*#' | wc -l
 434

R-Script for Generating Figures

# Working with the gold standard here.
set <- "gold"

# Generate TAP-k diagrams:
set_values <- read.delim(
        paste(set, "_latest_submission.tsv", sep=""), header=FALSE
    )
for (column in 3:length(set_values)) {
    name <- paste(set, c(5, 10, 20)[column-2], ".png", sep="")
    png(filename = name,
        width = 650, height = 400, units = "px",
        pointsize = 12, bg = "white")
    par(mfrow=c(1,1), mar=c(2.5,18,1,1))

    view <- set_values[t(order(set_values[,c(column)])), c(1,column)]

    colors <- rep("#aaaaaa", length(view[,2]))
    for (index in 1:length(view[,1])) {
        if (view[,1][index] == "bioknack") {
            colors[index] = "#666699"
        }
        if (view[,1][index] == "bioknack w/o RefSeq") {
            colors[index] = "#eeeeee"
        }
    }

    par(las=1)
    barplot(view[,2], names.arg=view[,1], col=colors, horiz=TRUE,
        xlim=c(0,0.35), space=rep(2, length(view[,1])))
    dev.off()
}

# Generate precision/recall/F-score diagram:
set_values <- read.delim(
        paste(set, "_precision_recall_fscore.tsv", sep=""), header=FALSE
    )
name <- paste(set, "_precision_recall_fscore.png", sep="")
png(filename = name,
    width = 650, height = 400, units = "px",
    pointsize = 14, bg = "white")
par(mfrow=c(1,1), mar=c(4.5,4.5,1,1))

tool <- set_values[,1]
precision <- set_values[,3]
recall <- set_values[,4]
fscore <- set_values[,5]

colors <- c("#bb0000", "#00bb00", "#0000bb",
            "#bb00bb", "#bbbb00", "#00bbbb",
            "#aaaaaa", "#ee7733", "#3377ee",
            "#7733ee"
        )

plot(recall, precision, cex=.8, pch=19,
        xlab="Recall", ylab="Precision", xlim=c(0,1), ylim=c(0,1),
        col=colors, axes=FALSE
    )
axis(1, seq(0, 1, 0.1))
axis(2, seq(0, 1, 0.1))
# Fix offset to prevent overwriting of labels:
precision[10] <- precision[10]-0.04
text(recall+0.03, precision+0.02,
        labels=gettextf("%0.2f", fscore), cex=0.8, col="#333333"
    )
legend(0.65, 1, tool, cex=0.8, fill=colors)
dev.off()
Posted in: Bioinformatics