Named Entity Recognition (NER) is the task of identifying groups of words that share common semantic characteristics. Consider organization names, for instance: they may look superficially different, but they all convey the same type of information and usually appear in comparable contexts. The same goes for person names, locations and expressions of time.
Finding this type of information helps improve search results and document classification. It is also the cornerstone of other Natural Language Processing (NLP) techniques such as Named Entity Disambiguation, the task of linking different occurrences to the same entity. For instance, “Mr Page”, “L. Page” or “Page, Larry” would all point to the same canonical entity, “Larry Page”.
NER is sometimes considered a solved problem. At the very least, well-trained systems reach extremely high scores, almost comparable to human performance. Techniques have evolved from rule-based systems, to statistical models such as Conditional Random Fields (CRFs) and Maximum Entropy, to neural networks, steadily closing the gap with human performance. Why bother writing a blog entry about it, then?
NER has become more of a data problem than an algorithm problem
Training an NER engine requires appropriate and annotated data.
Appropriate refers to the domain of the documents. Law firms produce different documents than medical firms and these are also different from newspaper articles. The training corpus needs to match the type of document you want to analyse.
Annotated refers to the fact that named entities must be identified in the training set, and these annotations must be reliable to produce a well-performing engine.
Producing training sets is the costly part of creating an NER engine: it involves a considerable amount of human labour. We should always keep in mind that many machine learning techniques ultimately rely on data produced by humans, be it for image recognition, machine translation or, of course, NER.
The academic world produces such training sets for evaluation campaigns, but these are not always relevant to your domain or language. These datasets are often reused for several years while the algorithms get more efficient.
Therefore, we are facing a data problem rather than an algorithm problem. Algorithms will evolve and improve over time, but they will always rely on high-quality training material that is difficult and costly to produce.
Finding the needles in the haystack
This is where the metaphor of the haystack with an unknown number of needles kicks in, and it is worth saying a word about how an NER engine (or any other classification engine) is evaluated.
You need to consider two factors: precision and recall. Precision tells you how many of the items the engine flagged as needles really are needles. Recall tells you how many of the needles in the haystack the engine actually found.
Recall is also the most difficult part of the challenge: unless you examine every single document in detail, you cannot know how many needles they contain.
Let’s say there are 100 needles in your haystack. If the engine finds 1 needle, its precision will be 100%, as it did not make any mistake, but its recall will be only 1%, since it overlooked the other 99 needles.
If the engine picks up the whole haystack, its recall will be 100% – it did find all the needles – but its precision will be close to zero, since it also picked up everything else. You need to find a balance between the two.
To produce a single meaningful score, precision and recall are combined into the F-measure, their harmonic mean. The higher the F-measure, the better the engine.
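To make this concrete, here is a short Python sketch that scores the two extreme engines from the needle example. The haystack size of 10,000 items is an assumption made purely for the illustration, not a figure from our data.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

total_needles = 100

# Engine A finds a single needle and nothing else: perfect precision, tiny recall.
precision_a, recall_a = 1 / 1, 1 / total_needles
print(f"A: P={precision_a:.2f} R={recall_a:.2f} F={f_measure(precision_a, recall_a):.3f}")

# Engine B picks up the whole haystack of 10,000 items: full recall, tiny precision.
precision_b, recall_b = total_needles / 10_000, total_needles / total_needles
print(f"B: P={precision_b:.2f} R={recall_b:.2f} F={f_measure(precision_b, recall_b):.3f}")
```

Both extremes end up with an F-measure of roughly 0.02, which is exactly why the measure is useful: only an engine that balances precision and recall gets a high score.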
ProcessMaker IDP’s solution
ProcessMaker IDP is an intelligent content management solution. We offer innovative solutions for unstructured data management, and NER is part of the larger chain of Machine Learning technologies that we use.
We need to produce relevant and reliable annotations for each customer efficiently in terms of time and cost. Manually annotating thousands of documents for each customer is not an option.
To do so, we rely on a semi-supervised approach.
We developed pattern-matching rules to match elements in the documents. Such rules are very good at identifying entities precisely, but they skip the many more ambiguous ones: their precision is extremely high while their recall is low. Dictionaries are also used to further enhance precision. The entities discovered within each document are then reused to annotate the whole corpus automatically.
This is the supervised part of the training, and at this point our precision is extremely high.
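As an illustration of this kind of rule-based pass, here is a minimal Python sketch. The dictionary entries and the date pattern are invented for the example; they are not ProcessMaker IDP’s actual rules.

```python
import re

# Hypothetical gazetteer and date pattern, invented for this illustration.
ORG_DICTIONARY = {"Acme Corp", "Globex BV"}
DATE_PATTERN = re.compile(r"\b\d{1,2}-\d{1,2}-\d{4}\b")  # e.g. 31-12-2024

def rule_annotate(text: str):
    """Return (start, end, label) spans found by dictionary lookup and
    pattern matching: precise, but blind to anything it has no rule for."""
    spans = []
    for org in ORG_DICTIONARY:
        for m in re.finditer(re.escape(org), text):
            spans.append((m.start(), m.end(), "ORG"))
    for m in DATE_PATTERN.finditer(text):
        spans.append((m.start(), m.end(), "DATE"))
    return sorted(spans)

print(rule_annotate("Acme Corp signed the contract on 31-12-2024."))
# [(0, 9, 'ORG'), (33, 43, 'DATE')]
```

A rule like this almost never mislabels what it matches, but it silently misses every entity it has no rule or dictionary entry for – the high-precision, low-recall trade-off described above.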
To improve recall, an NER engine is trained on the data annotated by the rules. We then run this engine on the same data again. The extra entities it discovers are used to re-annotate the data and train the engine once more, and the cycle repeats until the results stabilize.
This is the unsupervised part of the training.
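The loop below is a minimal, self-contained Python sketch of that bootstrapping idea. The seed rule and the ToyNer class are deliberately simplistic stand-ins, not our production engine; in a real setup the train step would fit a statistical or neural model.

```python
import re

def seed_annotate(docs):
    """High-precision seed rule: a capitalised word directly after a title."""
    pattern = re.compile(r"\b(?:Mr|Ms|Dr)\.? ([A-Z][a-z]+)")
    return {i: {(m.start(1), m.end(1), "PERSON") for m in pattern.finditer(d)}
            for i, d in enumerate(docs)}

class ToyNer:
    """Stand-in for a trainable NER model: it memorises every annotated
    surface form and tags other occurrences of those forms in the corpus."""
    def __init__(self):
        self.known = set()

    def train(self, docs, annotations):
        self.known = {docs[i][s:e] for i, spans in annotations.items()
                      for (s, e, _label) in spans}

    def predict(self, docs):
        return {i: {(m.start(), m.end(), "PERSON")
                    for form in self.known
                    for m in re.finditer(r"\b%s\b" % re.escape(form), d)}
                for i, d in enumerate(docs)}

def bootstrap(docs, model, max_rounds=10):
    """Retrain on the merged annotations until no new entities turn up."""
    annotations = seed_annotate(docs)                      # rule-based seed
    for _ in range(max_rounds):
        model.train(docs, annotations)
        predicted = model.predict(docs)
        merged = {i: annotations[i] | predicted[i] for i in annotations}
        if merged == annotations:                          # stabilised
            break
        annotations = merged
    return annotations

docs = ["Mr Page founded the company.", "Page later stepped down."]
print(bootstrap(docs, ToyNer()))
```

In this toy run the seed rule only tags “Page” in the first document; after one round of retraining, the model also recovers the title-less “Page” in the second one – precisely the recall gain the iteration is after.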
Unfortunately, we cannot completely escape manual annotation. But instead of annotating a whole corpus, it is sufficient to annotate only a small part that becomes our evaluation standard. The trained engine is applied to the evaluation set and its results are compared to the manual annotations. We consistently reach scores above 90% on different Dutch corpora.
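Once such a gold set exists, scoring is straightforward. The sketch below compares predicted spans to the manual annotations using exact-span matching – one common convention, not necessarily the exact scheme we use.

```python
def evaluate(predicted: set, gold: set):
    """Precision, recall and F1 of predicted (start, end, label) spans
    against a manually annotated gold set, using exact-span matching."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return precision, recall, f1

gold = {(3, 7, "PERSON"), (20, 29, "ORG")}
predicted = {(3, 7, "PERSON"), (40, 50, "DATE")}
print(evaluate(predicted, gold))  # (0.5, 0.5, 0.5)
```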
This approach enables us to bootstrap NER engines efficiently in a matter of hours instead of weeks.
The next step
Once deployed, an engine’s life-cycle is not over. It will still make mistakes or overlook entities with characteristics it has not seen before.
In these cases human expertise is invaluable, and engines can benefit from such feedback. We want to enable better cooperation between the machine and the human expert.
To do so, we introduce ProcessMaker IDP Interactive Learning, an integrated interface that lets human experts correct the outputs of our engines, with instant feedback loops for NER.