Harmston, Filsell and Stumpf. Review. Human Genomics, Vol 5, No 1, 17-29, October. © Henry Stewart Publications 1479-7364.

[...] dictionary-based, rule-based/pattern-based, machine-learning and hybrid systems (and combinations of these approaches). Most research in this area has concentrated on recognising gene and protein mentions; however, there has also been some work on identifying cell lines, chemicals and species. Competitions such as JNLPBA [19] and BioCreative are held to evaluate NER methods for gene mention recognition.

Dictionary-based methods [21] work by matching text against a fixed dictionary of entity names. Their performance depends heavily on both the coverage of the dictionary and the quality of the matching technique used. A simple text-matching algorithm produces a large number of false positives, because dictionary entries overlap with common English words, as well as some false negatives, because misspellings are not present in the dictionary. Gene names that lead to false positives are typically filtered out of dictionaries. Most systems based on this method either use approximate string matching [22] or expand the dictionary by generating spelling variants [23,24]. These methods tend to increase recall at the cost of precision. In some cases, dictionary-based NER methods can perform normalisation at the same time [25].

Rule-based methods [26] use orthographic and morpho-syntactic features of NEs (capital letters, numbers, symbols and affixes) and their surrounding words to generate patterns and rules. Biochemical suffixes such as -ase and -in are very useful in indicating possible protein names, so a simple rule would be to tag words with these suffixes as proteins.
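To make the trade-offs above concrete, here is a minimal Python sketch of the two simplest strategies described: exact dictionary lookup and a suffix rule for -ase and -in. The four-entry dictionary, the crude tokenizer, the regular expression and all function names are illustrative assumptions, not taken from any of the cited systems.

```python
import re

# Toy dictionary of gene symbols (assumption: a real system would load a
# curated resource and filter out symbols that clash with common English).
GENE_DICTIONARY = {"BRCA1", "TP53", "CAT", "WAS"}

# Hypothetical rule: at least three letters followed by -ase or -in.
PROTEIN_SUFFIX = re.compile(r"^[A-Za-z]{3,}(ase|in)$")


def tokenize(text):
    """Split text into alphanumeric tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def dictionary_tag(tokens, dictionary=GENE_DICTIONARY):
    """Return tokens that match a dictionary entry, ignoring case."""
    upper = {name.upper() for name in dictionary}
    return [tok for tok in tokens if tok.upper() in upper]


def suffix_rule_tag(tokens):
    """Return tokens that satisfy the -ase / -in suffix rule."""
    return [tok for tok in tokens if PROTEIN_SUFFIX.match(tok)]


sentence = ("The cat was treated with a kinase inhibitor that "
            "stabilises TP53 and actin within the cell.")
tokens = tokenize(sentence)

# "cat" and "was" collide with the gene symbols CAT and WAS: false positives.
print(dictionary_tag(tokens))   # ['cat', 'was', 'TP53']

# "within" satisfies the suffix rule although it is not a protein name.
print(suffix_rule_tag(tokens))  # ['kinase', 'actin', 'within']
```

Even on one sentence, both strategies show the false positives that arise from overlap with ordinary English, which is why practical systems add filtering, context or approximate matching on top of rules like these.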
These systems incorporate expert knowledge easily, and the rules generated are human-readable and easily extendable. Rule-based techniques can reach high levels of precision, but at the expense of recall, as they are not robust against unseen names. This is mainly because there are so many potential surface grammatical variations (for example, active versus passive voice) that it is not feasible to develop robust patterns for all of them.

Machine learning (ML) methods tend to achieve the highest performance for NER. All of the top ten performing methods in the BioCreative II gene mention task (BCII GM) used a machine-learning component. ML methods use training data in the form of a manually annotated gold standard corpus and learn features that are useful in identifying NEs in text. The performance of the methods used in NER can be very sensitive to feature selection, although this is not always the case [27].

NER can be viewed as either a classification or a sequence-labelling problem. Classification approaches normally treat NER as assigning a class to a bag of features. These features include surface clues and morpho-syntactic features of NEs and their adjacent words. These methods tend not to take the order of features into account and support only binary classifications. Sequence-labelling approaches deduce the most probable sequence of tags for a given sequence of words. Each token is assigned a tag by calculating the most likely label for the current token, given both the features of that token and the previous history of tag assignments. The performance of any ML tagger will be biased by the size, inter-annotator agreement and topic structure of the corpus (see Table 3). Determining the correct cl.
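The sequence-labelling view described above can be illustrated with a small Viterbi decoder over B/I/O tags (begin, inside, outside a mention). This is a sketch under stated assumptions: the transition and emission scores are hand-set for illustration, whereas a real tagger would learn them from an annotated corpus, and the tag set and function names are hypothetical rather than drawn from any system in the review.

```python
import math

TAGS = ["O", "B", "I"]  # outside, begin, inside a gene mention

# Hand-set log-scores (assumption); -inf forbids "I" without a preceding B/I.
TRANSITION = {
    ("START", "O"): 0.0, ("START", "B"): -1.0, ("START", "I"): -math.inf,
    ("O", "O"): 0.0, ("O", "B"): -1.0, ("O", "I"): -math.inf,
    ("B", "O"): -0.5, ("B", "B"): -2.0, ("B", "I"): -0.5,
    ("I", "O"): -0.5, ("I", "B"): -2.0, ("I", "I"): -0.5,
}


def emission(token, tag):
    """Toy feature score: tokens with digits or all capitals look gene-like."""
    gene_like = any(c.isdigit() for c in token) or (
        token.isupper() and len(token) > 1
    )
    if tag == "O":
        return -3.0 if gene_like else 0.0
    return 0.0 if gene_like else -3.0


def viterbi(tokens):
    """Return the highest-scoring tag sequence under the toy scores."""
    if not tokens:
        return []
    # Each column maps tag -> (best score of a path ending in tag, previous tag).
    best = [{
        t: (TRANSITION[("START", t)] + emission(tokens[0], t), None)
        for t in TAGS
    }]
    for tok in tokens[1:]:
        column = {}
        for t in TAGS:
            prev, score = max(
                ((p, best[-1][p][0] + TRANSITION[(p, t)]) for p in TAGS),
                key=lambda item: item[1],
            )
            column[t] = (score + emission(tok, t), prev)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi(["expression", "of", "IL", "2", "increased"]))
# ['O', 'O', 'B', 'I', 'O']
```

Because each tag's score depends on the previous tag, "2" is labelled as the continuation of the mention started by "IL" rather than as a new mention, which a bag-of-features classifier that ignores order could not express.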

Author: email exporter