Short Story on Scaling an NLP Problem without Using a Ton of Hardware


The cornerstone of the work done to get the info behind these charts with IEPY was being able to catch mentioned companies.

The basic idea of relation extraction is to detect things mentioned in text (so-called Mentions, or Entity-Occurrences), and later decide whether or not the text expresses the target relation between each pair of those things. In our case, we needed to find where companies were mentioned, and later determine whether a given sentence said that Company-A was funding Company-B.
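To make the combinatorics concrete, here is a minimal sketch (not IEPY’s actual API; the data layout is assumed for illustration) of how candidate evidences arise, one per ordered pair of company mentions in a sentence:

    from itertools import permutations

    def candidate_evidences(sentences):
        # `sentences` is an iterable of (sentence_text, company_mentions)
        # pairs; this structure is assumed, not IEPY's real one.
        for sentence, mentions in sentences:
            # Ordered pairs matter: "A funds B" and "B funds A" are
            # different claims, so both directions are candidates.
            for a, b in permutations(mentions, 2):
                yield sentence, a, b

    sents = [("Company-A announced it will fund Company-B.",
              ["Company-A", "Company-B"])]
    for evidence in candidate_evidences(sents):
        print(evidence)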

In order to detect those fundings we needed to be sure of capturing every mention of a company. And although the NER we used caught most of them, there are always some folks who name their company #waywire or 8th Story, words that are not easily caught by a NER.

A good solution is to build a Gazetteer containing all the company names we can get. The idea of working with Gazettes is that each time one of the Gazette entries is seen in a text, it’s automatically considered a mention of a given object, ie, an Entity-Occurrence.
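As an illustration, a naive gazette matcher over tokenized text could look like this (a sketch only; IEPY’s gazettes support is more elaborate, and the token-window size is an assumption):

    def find_gazette_mentions(tokens, gazette, max_len=6):
        """Return (start, end) token spans that match a gazette entry."""
        spans = []
        for i in range(len(tokens)):
            # Try every window of up to `max_len` tokens starting at i.
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
                if " ".join(tokens[i:j]) in gazette:
                    spans.append((i, j))
        return spans

    gazette = {"Yiftee Inc.", "8th Story", "#waywire"}
    print(find_gazette_mentions("They raised money from 8th Story".split(),
                                gazette))
    # [(4, 6)]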

From an encyclopedic source, we got more than 300K entries. Great!

The next challenge was that, in the text to process, a company could be mentioned in a different way than the official name stated in the encyclopedic source. For instance, it would be more natural to find mentions of “Yiftee” than of “Yiftee Inc.”

So, after incorporating a basic schema for alternative names (ie, substrings of the original long name), the number of entries grew to 600K.
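One simple way to derive such alternative names (an assumed schema, shown only for illustration) is to take every contiguous token sub-sequence of the official name:

    def alternative_names(official_name):
        """Return every contiguous token sub-sequence of the name."""
        tokens = official_name.split()
        names = set()
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens) + 1):
                names.add(" ".join(tokens[i:j]))
        names.discard(official_name)  # keep only the *alternative* forms
        return names

    print(alternative_names("Yiftee Inc."))
    # {'Yiftee', 'Inc.'} (set order may vary)

Note how this already hints at the problem to come: it generates useful entries like “Yiftee”, but also dangerous ones like “Inc.”.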

After that, when we felt confident about our gazette and wanted to start processing text, we faced several issues:

  • we weren’t able to handle a gazette of that size at a reasonable speed
  • we were getting tons of poor-quality Entity-Occurrences (ie, most of the time a human reader would say they were wrongly labeled as Entity-Occurrences)
  • tons of poor-quality Entity-Occurrences implied a combinatorial explosion of potential funding evidences to check (roughly one per pair of Entity-Occurrences in the same sentence, as in the pairing sketch above)

So, knowing that we were trading away recall[1], we decided to add several levels of filters. Let the pruning start!

The first step was to add a second encyclopedic source, not to augment the list, but to add confidence: we kept only the intersection of the two sources.
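A sketch of that intersection step, assuming a light normalization (lowercasing and whitespace collapsing, an illustrative scheme) so that names from the two sources are comparable:

    def normalize(name):
        """Lowercase and collapse whitespace so source quirks don't matter."""
        return " ".join(name.lower().split())

    def intersect_sources(source_a, source_b):
        """Keep only the names of source_a also present in source_b."""
        normalized_b = {normalize(name) for name in source_b}
        return {name for name in source_a if normalize(name) in normalized_b}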

Next, with a precomputed table of word frequencies, we filtered out all the company names that were too likely to occur as normal text (we used a threshold and tuned it a bit before fixing it).

With that very same idea of word frequencies, we pruned the company sub-names (the substrings of the original long company names) with a higher bar for keeping them; so for a company listed as “Hope Street Media” we didn’t end up with a dangerous entry for “Hope”, while for “Yiftee Inc.” we did keep “Yiftee” in the final list.
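The two pruning steps could look roughly like this. The cut-off values are hypothetical (the post only says the thresholds were tuned empirically), and the “higher bar” is read here as a stricter frequency cut-off for sub-names:

    FULL_NAME_MAX_FREQ = 1e-4  # hypothetical cut-off for full official names
    SUB_NAME_MAX_FREQ = 1e-5   # stricter: sub-names must be made of rarer words

    def too_common(name, word_freq, max_freq):
        """True if every word in the name is common in ordinary text."""
        return all(word_freq.get(w.lower(), 0.0) > max_freq
                   for w in name.split())

    def prune(entries, word_freq):
        # `entries` holds (name, is_subname) pairs; `word_freq` maps a word
        # to its relative frequency in a reference corpus (precomputed).
        kept = set()
        for name, is_subname in entries:
            max_freq = SUB_NAME_MAX_FREQ if is_subname else FULL_NAME_MAX_FREQ
            if not too_common(name, word_freq, max_freq):
                kept.add(name)
        return kept

Under these cut-offs a derived sub-name like “Hope” gets dropped (it is a frequent English word), while “Yiftee” survives (it barely occurs in ordinary text).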

With all that done, we reduced the list to about 100K entries, which still captured a really good portion of the names we needed while greatly reducing the issues mentioned above.

The last step was to pick a sample of documents, preprocess them, and simply hand-check the most frequently found Gazette items, building a blacklist for the cases where it was obvious that the occurrences were, most of the time, not mentions of the company but just natural usage of those words in the language.
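That hand-checking step boils down to counting which gazette items fire most often over the sample and reviewing the top of the list. A minimal sketch (the blacklist entry shown is illustrative only):

    from collections import Counter

    def most_fired_items(matched_names, top_n=200):
        """Rank gazette entries by how often they matched in the sample.

        `matched_names` is an iterable of the gazette entries found in the
        preprocessed documents (e.g. by a matcher like the one sketched
        earlier). The top of this ranking is what gets hand-reviewed.
        """
        return Counter(matched_names).most_common(top_n)

    # After review, drop the obvious false positives from the gazette:
    blacklist = {"Hope"}  # illustrative entry only
    gazette = {name for name in gazette if name not in blacklist}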

We finished very satisfied with the results and also with the lessons learnt. Hopefully some of the tips above can help you.

[1] Recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. http://en.wikipedia.org/wiki/Precision_and_recall

Want to read more related content? Follow us on Twitter @machinalis

Original blog post: http://www.machinalis.com/blog/gazette-for-relation-extraction/
