I am currently using bioknack’s named-entity recognition (NER) tools to recognise genes, species, diseases, cellular components, biological processes and molecular functions in the open-access subset of PubMed Central, in order to generate the database that will serve opacmo. I evaluated bioknack’s NER performance for speed and for accuracy in two earlier blog posts. The script bk_ner_gn.sh is the NER wrapper, which provides higher-level functions such as downloading and preprocessing dictionaries and corpora, as well as postprocessing and scoring. The actual recognition of entities is carried out by the highly optimised Ruby script bk_ner.rb. In the following, I give a short run-through of the bk_ner.rb options that are relevant to text miners.
Since this post is part of my Mini Series, I keep it short and only provide examples for the text-mining options of bk_ner.rb. The examples show the output of bk_ner.rb for the following corpus database and entity dictionary files: corpus.txt and dictionary.txt. Both are TSV files with two columns. The corpus database contains document identifiers in its first column and the documents’ text in its second column. The entity dictionary contains the entities in its first column and an optional entity identifier in its second column.
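To make the two-column layout concrete, here is a minimal sketch of how such TSV files could be read in Ruby. The method names and the sample lines are hypothetical and only illustrate the format described above; they are not taken from bk_ner.rb or from the actual corpus.txt and dictionary.txt.

```ruby
# Parse corpus lines: document identifier in column 1, text in column 2.
# (Hypothetical helper, not part of bioknack.)
def parse_corpus(tsv)
  tsv.each_line.map do |line|
    id, text = line.chomp.split("\t", 2)
    [id, text]
  end
end

# Parse dictionary lines: entity in column 1, optional identifier in column 2.
# A line without a second column yields nil as its identifier.
def parse_dictionary(tsv)
  tsv.each_line.map do |line|
    entity, identifier = line.chomp.split("\t", 2)
    [entity, identifier]
  end
end

# Made-up sample data, for illustration only.
corpus = parse_corpus("doc1\tThe best bagels are in New York.\n")
dictionary = parse_dictionary("new york\tcity\nbagels\n")

corpus.each { |id, text| puts "#{id}: #{text}" }
dictionary.each { |entity, identifier| puts "#{entity} -> #{identifier.inspect}" }
```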
Parameters for Text Mining
Usage: bk_ner.rb [options] database dictionary
  -b | --brief : do not repeat matched dictionary entries (per document)
  -c | --concise : do not output character positions
  -d CHAR | --delimiter CHAR : match only full-length words/compounds between the delimiter
      (lines in the corpus need to end on the delimiter, or the
      last entry will not be matched)
  -l | --lines : write multiple dictionary matches on separate lines
  -m MODE | --mode MODE : dictionary key/value mapping (default: functional)
      values for MODE:
        functional - 1:1 mapping between keys and values
        relational - 1:n mapping between keys and values
  -s CHAR | --separator CHAR : character to use to join multiple values in the output
      when MODE is :relational and -l is not used
      (default: \t (tabulator))
  -x | --casesensitive : case sensitive dictionary matching
  -y | --regexper CHAR : replaces the given character (or string) in the
      dictionary entries with \W([^.!?]+\W)? and then
      matches documents against the resulting regular expression
      (has to be used with -c, because character positions are
      no longer determined to achieve good performance)
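The -y option is easier to picture with a small sketch. The following Ruby snippet shows one way a dictionary entry could be turned into the regular expression described in the usage text. The method name is hypothetical, the case-insensitive flag is an assumption based on -x being opt-in, and the code is not bk_ner.rb’s actual implementation.

```ruby
# Build a regular expression from a dictionary entry by replacing `char`
# with the pattern from the usage text: \W([^.!?]+\W)?
# (Hypothetical helper; case-insensitive matching is an assumption.)
def entry_to_regexp(entry, char = ' ')
  gap = '\W([^.!?]+\W)?'
  parts = entry.split(char).map { |part| Regexp.escape(part) }
  Regexp.new(parts.join(gap), Regexp::IGNORECASE)
end

regexp = entry_to_regexp('good party')

# Other words may sit between "good" and "party" ...
puts regexp.match?('It was a good and loud party.')
# ... but the optional filler [^.!?]+ cannot contain sentence punctuation.
puts regexp.match?('It was good fun. Later the party ended.')
```

The character class [^.!?] is what keeps the words between the two parts of an entry from containing sentence-ending punctuation, which is why the headings below describe this as matching over a sentence span.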
Functional Mode
In functional mode, the dictionary is treated as a hash table in which each dictionary entity can be mapped to only one entity identifier. The entity “the” appears twice in the dictionary, so it is mapped only to the latter entity identifier, “adjective”; the former mapping to “article” is overwritten when the dictionary is read into memory.
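The overwriting behaviour follows directly from how a plain hash table works. As a minimal sketch (the method name is hypothetical, and the pairs are inferred from the description above rather than copied from dictionary.txt):

```ruby
# Load entity/identifier pairs into a plain Hash.
# A later pair with the same entity replaces the earlier one.
def load_functional(pairs)
  dictionary = {}
  pairs.each { |entity, identifier| dictionary[entity] = identifier }
  dictionary
end

pairs = [['the', 'article'], ['the', 'adjective'], ['new york', 'city']]
dictionary = load_functional(pairs)

puts dictionary['the']  # the second mapping has replaced the first
puts dictionary.size    # two distinct entities remain
```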
Classic text-mining output
bash$ bk_ner.rb corpus.txt dictionary.txt
weller the 7 9 adjective
weller the 44 46 adjective
weller the 55 57 adjective
etaoin_shrdlu the 55 57 adjective
etaoin_shrdlu new york 90 97 city
etaoin_shrdlu the 108 110 adjective
etaoin_shrdlu the best 108 115 excellence
etaoin_shrdlu the 184 186 adjective
Concise output
bash$ bk_ner.rb -c corpus.txt dictionary.txt
weller the adjective
weller the adjective
weller the adjective
etaoin_shrdlu the adjective
etaoin_shrdlu new york city
etaoin_shrdlu the adjective
etaoin_shrdlu the best excellence
etaoin_shrdlu the adjective
Concise and brief output
bash$ bk_ner.rb -c -b corpus.txt dictionary.txt
weller the adjective
etaoin_shrdlu the adjective
etaoin_shrdlu new york city
etaoin_shrdlu the best excellence
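The effect of -b can be reproduced from the concise output above by dropping repeated rows. This is only a sketch of the observable behaviour, with a hypothetical method name; bk_ner.rb may well suppress repeats while matching rather than afterwards.

```ruby
# Keep the first occurrence of each (document, entity, identifier) row.
def brief(rows)
  rows.uniq
end

# Rows taken from the concise output shown earlier.
concise = [
  %w[weller the adjective],
  %w[weller the adjective],
  %w[weller the adjective],
  %w[etaoin_shrdlu the adjective],
  ['etaoin_shrdlu', 'new york', 'city'],
  %w[etaoin_shrdlu the adjective],
  ['etaoin_shrdlu', 'the best', 'excellence'],
  %w[etaoin_shrdlu the adjective]
]

brief(concise).each { |row| puts row.join("\t") }
```

Because the document identifier is part of each row, “the” is still reported once for weller and once for etaoin_shrdlu.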
Concise, brief and matching over sentence span
bash$ bk_ner.rb -c -b -y '\ ' corpus.txt dictionary.txt
weller the adjective
weller good party fun
etaoin_shrdlu the adjective
etaoin_shrdlu new york city
etaoin_shrdlu the best excellence
Relational Mode
In relational mode, the dictionary is treated as a hash table that maps dictionary entities to sets of entity identifiers. The entity “the” is therefore now mapped to both “article” and “adjective”.
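In Ruby terms, this corresponds to a hash whose values are sets. A minimal sketch, with a hypothetical method name and pairs inferred from the description above:

```ruby
require 'set'

# Load entity/identifier pairs into a Hash of Sets, so that an entity
# keeps every identifier it was listed with.
def load_relational(pairs)
  dictionary = Hash.new { |hash, key| hash[key] = Set.new }
  pairs.each { |entity, identifier| dictionary[entity] << identifier }
  dictionary
end

pairs = [['the', 'article'], ['the', 'adjective'], ['new york', 'city']]
dictionary = load_relational(pairs)

puts dictionary['the'].to_a.sort.join(', ')
puts dictionary['new york'].to_a.join(', ')
```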
Output matches on separate lines
bash$ bk_ner.rb -m relational -l corpus.txt dictionary.txt
weller the 7 9 adjective
weller the 7 9 article
weller the 44 46 adjective
weller the 44 46 article
weller the 55 57 adjective
weller the 55 57 article
etaoin_shrdlu the 55 57 adjective
etaoin_shrdlu the 55 57 article
etaoin_shrdlu new york 90 97 city
etaoin_shrdlu the 108 110 adjective
etaoin_shrdlu the 108 110 article
etaoin_shrdlu the best 108 115 excellence
etaoin_shrdlu the 184 186 adjective
etaoin_shrdlu the 184 186 article
Output matches on separate lines, concise and brief
bash$ bk_ner.rb -m relational -l -c -b corpus.txt dictionary.txt
weller the adjective
weller the article
etaoin_shrdlu the adjective
etaoin_shrdlu the article
etaoin_shrdlu new york city
etaoin_shrdlu the best excellence
Output matches on separate lines, concise, brief and matching over sentence span
bash$ bk_ner.rb -m relational -l -c -b -y '\ ' corpus.txt dictionary.txt
weller the adjective
weller the article
weller good party fun
etaoin_shrdlu the adjective
etaoin_shrdlu the article
etaoin_shrdlu new york city
etaoin_shrdlu the best excellence




Posted on July 29, 2011