Examples of bioknack’s Named-Entity Recognition Core

Posted on July 29, 2011


I am currently using bioknack’s named-entity recognition (NER) tools to recognise genes, species, diseases, cellular components, biological processes and molecular functions in the open-access subset of PubMed Central, in order to generate the database that will serve opacmo. I evaluated the speed and accuracy of bioknack’s NER in two earlier blog posts. The script bk_ner_gn.sh is the NER wrapper, which provides higher-level functions such as downloading and preprocessing dictionaries and corpora, as well as postprocessing and scoring. The actual recognition of entities is carried out by the highly optimised Ruby script bk_ner.rb. Below I give a short run-through of the bk_ner.rb options that are relevant to text miners.

Since this post is part of my Mini Series, I keep it short and only provide examples for the text-mining options of bk_ner.rb. The examples show the output of bk_ner.rb for the following corpus database and entity dictionary files: corpus.txt and dictionary.txt. Both are TSV files with two columns. The corpus database holds document identifiers in its first column and the documents’ text in its second column. The entity dictionary holds the entities in its first column and an optional entity identifier in its second column.
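
To make the two-column layout concrete, here is a minimal Ruby sketch that reads both kinds of TSV data. It is not taken from bk_ner.rb; the sample rows and the variable names (corpus_tsv, dictionary_tsv) are invented for illustration and do not reproduce the actual corpus.txt or dictionary.txt.

```ruby
# Hypothetical stand-ins for corpus.txt and dictionary.txt
# (invented rows; the real files are not reproduced here).
corpus_tsv     = "doc1\tThe cat sat on the mat.\ndoc2\tNew York is big.\n"
dictionary_tsv = "cat\tanimal\nnew york\tcity\nmat\n"

# Corpus: document identifier in column 1, document text in column 2.
corpus = corpus_tsv.each_line.map do |line|
  id, text = line.chomp.split("\t", 2)
  [id, text]
end

# Dictionary: entity in column 1, optional entity identifier in column 2.
dictionary = dictionary_tsv.each_line.map do |line|
  entity, identifier = line.chomp.split("\t", 2)
  [entity, identifier] # identifier is nil when the second column is absent
end

corpus.each { |id, text| puts "#{id}: #{text}" }
dictionary.each { |entity, identifier| puts "#{entity} -> #{identifier.inspect}" }
```

The split limit of 2 keeps any further tab characters inside the second column instead of silently dropping them.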

Parameters for Text Mining

Usage: bk_ner.rb [options] database dictionary

-b | --brief               : do not repeat matched dictionary entries (per document)
-c | --concise             : do not output character positions
-d CHAR | --delimiter CHAR : match only full-length words/compounds between the delimiter
                             (lines in the corpus need to end on the delimiter, or the
                             last entry will not be matched)
-l | --lines               : write multiple dictionary matches on separate lines
-m MODE | --mode MODE      : dictionary key/value mapping (default: functional)
                             values for MODE:
                               functional - 1:1 mapping between keys and values
                               relational - 1:n mapping between keys and values
-s CHAR | --separator CHAR : character to use to join multiple values in the output
                             when MODE is :relational and -l is not used
                             (default: \t (tabulator))
-x | --casesensitive       : case sensitive dictionary matching
-y | --regexper CHAR       : replaces the given character (or string) in the
                             dictionary entries with \W([^.!?]+\W)? and then
                             matches documents against the resulting regular expression
                             (has to be used with -c, because character positions are
                             no longer determined to achieve good performance)

Functional Mode

In functional mode the dictionary is treated as a hash table in which each dictionary entity can be mapped to only one entity identifier. The entity “the” appears twice in the dictionary, first with the identifier “article” and later with “adjective”. Only the latter mapping survives, because the former is overwritten when the dictionary is read into memory.
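
The overwriting behaviour is what a plain Ruby Hash does when the same key is assigned twice. The sketch below illustrates this with rows reconstructed from the outputs in this post; the row order and the variable names are assumptions, and the code is not taken from bk_ner.rb.

```ruby
# Hypothetical dictionary rows in file order; "the" appears twice.
rows = [
  ["the", "article"],
  ["the", "adjective"],
  ["new york", "city"]
]

functional = {}
rows.each do |entity, identifier|
  functional[entity] = identifier # a later row replaces an earlier one
end

puts functional["the"]  # only the last identifier read for "the" remains
puts functional.size    # two keys: "the" and "new york"
```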

Classic text-mining output

bash$ bk_ner.rb corpus.txt dictionary.txt
weller	the	7	9	adjective
weller	the	44	46	adjective
weller	the	55	57	adjective
etaoin_shrdlu	the	55	57	adjective
etaoin_shrdlu	new york	90	97	city
etaoin_shrdlu	the	108	110	adjective
etaoin_shrdlu	the best	108	115	excellence
etaoin_shrdlu	the	184	186	adjective

Concise output

bash$ bk_ner.rb -c corpus.txt dictionary.txt
weller	the	adjective
weller	the	adjective
weller	the	adjective
etaoin_shrdlu	the	adjective
etaoin_shrdlu	new york	city
etaoin_shrdlu	the	adjective
etaoin_shrdlu	the best	excellence
etaoin_shrdlu	the	adjective

Concise and brief output

bash$ bk_ner.rb -c -b corpus.txt dictionary.txt
weller	the	adjective
etaoin_shrdlu	the	adjective
etaoin_shrdlu	new york	city
etaoin_shrdlu	the best	excellence
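
The effect of -b on the concise output above can be mimicked by dropping repeated rows while keeping the first occurrence. This is only a sketch of the visible behaviour, assuming the concise rows are already available as arrays; it says nothing about how bk_ner.rb implements the option internally.

```ruby
# The rows of the concise (-c) output shown earlier, as [document, entity, identifier].
matches = [
  ["weller", "the", "adjective"],
  ["weller", "the", "adjective"],
  ["weller", "the", "adjective"],
  ["etaoin_shrdlu", "the", "adjective"],
  ["etaoin_shrdlu", "new york", "city"],
  ["etaoin_shrdlu", "the", "adjective"],
  ["etaoin_shrdlu", "the best", "excellence"],
  ["etaoin_shrdlu", "the", "adjective"]
]

# Array#uniq keeps the first occurrence of each row and preserves order.
# Because the document identifier is part of the row, the same entity is
# still reported once for every document it occurs in.
brief = matches.uniq

brief.each { |row| puts row.join("\t") }
```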

Concise, brief and matching over sentence span

bash$ bk_ner.rb -c -b -y '\ ' corpus.txt dictionary.txt
weller	the	adjective
weller	good party	fun
etaoin_shrdlu	the	adjective
etaoin_shrdlu	new york	city
etaoin_shrdlu	the best	excellence
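
The extra “good party” match comes from the regular expression described in the help text for -y. The sketch below builds such an expression for a single dictionary entry. The helper name sentence_span_regexp, the use of Regexp.escape (which turns a space into “\ ” and would explain the '\ ' argument) and the case-insensitive flag are my assumptions for illustration; the two test sentences are invented and are not from corpus.txt.

```ruby
# The replacement pattern quoted in the -y help text.
GAP = '\W([^.!?]+\W)?'

# Hypothetical helper: escape the entry, then swap the escaped space for GAP.
def sentence_span_regexp(entry, char = '\ ')
  escaped = Regexp.escape(entry)           # "good party" becomes "good\ party"
  pattern = escaped.gsub(char) { GAP }     # block form avoids backslash rewriting
  Regexp.new(pattern, Regexp::IGNORECASE)  # mirrors matching without -x (assumption)
end

re = sentence_span_regexp("good party")
puts re.source

# Words between "good" and "party" are allowed within one sentence.
puts re.match?("It was a good and rather loud party.")

# Here the full stop after "today" falls inside the gap, so nothing matches.
puts re.match?("The food was good today. The party ended.")
```

Entries without the given character, such as “the”, are left unchanged by the substitution.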

Relational Mode

In relational mode the dictionary is treated as a hash table that maps each dictionary entity to a set of entity identifiers. This means that the entity “the” now maps to both “article” and “adjective”.
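
A hash of sets is one straightforward way to picture this 1:n mapping. The following sketch uses rows reconstructed from the outputs in this post; the row order and variable names are assumptions, the identifiers are sorted only to make the printed order deterministic, and the code is not taken from bk_ner.rb.

```ruby
require "set"

# Hypothetical dictionary rows in file order; "the" appears twice.
rows = [
  ["the", "article"],
  ["the", "adjective"],
  ["new york", "city"]
]

# Every entity gets its own set, so repeated keys accumulate identifiers.
relational = Hash.new { |hash, key| hash[key] = Set.new }
rows.each { |entity, identifier| relational[entity] << identifier }

identifiers = relational["the"].sort

# One line per identifier, in the spirit of -l.
identifiers.each { |identifier| puts ["the", identifier].join("\t") }

# All identifiers on one line, joined by a separator as with -s (tab by default).
separator = "\t"
puts "the\t" + identifiers.join(separator)
```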

Output matches on separate lines

bash$ bk_ner.rb -m relational -l corpus.txt dictionary.txt
weller	the	7	9	adjective
weller	the	7	9	article
weller	the	44	46	adjective
weller	the	44	46	article
weller	the	55	57	adjective
weller	the	55	57	article
etaoin_shrdlu	the	55	57	adjective
etaoin_shrdlu	the	55	57	article
etaoin_shrdlu	new york	90	97	city
etaoin_shrdlu	the	108	110	adjective
etaoin_shrdlu	the	108	110	article
etaoin_shrdlu	the best	108	115	excellence
etaoin_shrdlu	the	184	186	adjective
etaoin_shrdlu	the	184	186	article

Output matches on separate lines, concise and brief

bash$ bk_ner.rb -m relational -l -c -b corpus.txt dictionary.txt
weller	the	adjective
weller	the	article
etaoin_shrdlu	the	adjective
etaoin_shrdlu	the	article
etaoin_shrdlu	new york	city
etaoin_shrdlu	the best	excellence

Output matches on separate lines, concise, brief and match over sentence span

bash$ bk_ner.rb -m relational -l -c -b -y '\ ' corpus.txt dictionary.txt
weller	the	adjective
weller	the	article
weller	good party	fun
etaoin_shrdlu	the	adjective
etaoin_shrdlu	the	article
etaoin_shrdlu	new york	city
etaoin_shrdlu	the best	excellence