The Natural Language Processing Algorithms Behind Our Company Signal Data

At the core of DataFox is a focus on creating a pristine dataset of company data. We want this data to be as clean, accurate, and all-encompassing as possible.

In practical terms, this means we want to gather as much relevant information as we can on every company in our database. In one workflow, we contribute toward this goal through data partnerships; we combine, prioritize, and resolve a huge variety of data sources.

In another workflow, we focus on the company data that partnerships cannot provide. We make a meticulous effort to find relevant and unique company data from unstructured text through natural language processing algorithms. It's an enormous undertaking, but the creation of a proprietary, competitively advantageous set of company data is well worth it.

News Sources

Although we also classify unstructured text coming from public websites, company blogs, and tweets, this post will focus solely on news generated from news organizations.

Every 30 minutes of every day, we collect articles from nearly 7,000 of these sources (through RSS feeds) and pass them through our classifier. These sources range from highly specific news sources (e.g. Healthcare IT News) to more widely digested websites (e.g. TechCrunch).

What they all have in common is a high degree of content curation (thank you, editors) and a focus on being very up-to-date (thank you, capitalism). These features help prevent garbage from entering our data pipeline and help us keep our classification of the news highly targeted and highly relevant.
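As a simplified sketch of this ingestion step, the snippet below parses an RSS 2.0 document and deduplicates articles by link across polling cycles. It uses only the Python standard library; the sample feed, URLs, and helper names are illustrative, not our production code.

```python
import xml.etree.ElementTree as ET

# A tiny illustrative RSS 2.0 document (real feeds are fetched over HTTP).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Tech News</title>
  <item><title>Acme raises $50M</title><link>https://example.com/a</link></item>
  <item><title>Acme raises $50M</title><link>https://example.com/a</link></item>
  <item><title>Beta Corp opens new office</title><link>https://example.com/b</link></item>
</channel></rss>"""

def parse_feed(xml_text):
    """Extract (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

def dedupe_articles(articles, seen=None):
    """Drop articles whose link was already seen in a previous polling cycle."""
    seen = set() if seen is None else seen
    fresh = []
    for title, link in articles:
        if link not in seen:
            seen.add(link)
            fresh.append((title, link))
    return fresh

articles = dedupe_articles(parse_feed(SAMPLE_RSS))
```

In a 30-minute polling loop, the `seen` set would persist between cycles so that each article enters the classifier at most once.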

News Processing: Problem Statement

Given the constant pipeline of articles being created by these sources, we want to make sense of them and extract important company data from them. For as many companies as possible, we want to automatically tag articles with their appropriate company signals using natural language processing algorithms.

In our domain, we currently classify 68 such signal types that broadly fall into one of the following buckets:

  • Growth
  • Financial
  • People
  • Recognition
  • Events & Marketing
  • Corporate Update
  • Negative News

Categorization of these signal types can be difficult for many reasons:

  • Customers expect high precision
    • Consumer trust is built on providing accurate signals

  • Customers expect high recall
The more signals we can classify, the greater the edge our customers have, and customers expect us not to miss any events

  • The difference between signal types can be subtle
    • For example, we distinguish between a company relocating offices and expanding offices (which shows growth).

  • Attribution of a signal to a company can be tricky
    • For example, Google and Facebook are mentioned in a very large number of articles, but only a small fraction are actually focused on these companies.

Classification Approach

We could have approached this as a multi-label classification problem at the article level. In this situation, an article could contain information about a company that both received private funding and expanded to open new offices.

We would then tag the article as containing two signal types, though the specific context of which sentences relate to which signals would be tough to disambiguate.

Instead, we decided to focus on sentences or groups of sentences within the article. We refer to these smaller units as "snippets." By narrowing the lens of the classifier, we focus the problem into a multi-label classification problem per snippet.

Though in many cases there will be only one signal per snippet (leading to more of a multiclass classification approach), there are many interesting snippets that can contain several signals.

Here is one such motivating example from PR Newswire that contains 5 signals:

  • Snippet:
    • Today DoorDash announced that it raised $535 million in Series D funding led by the SoftBank Group ("SoftBank") with participation from existing investors Sequoia Capital, GIC and Wellcome Trust.

  • Signals:
    • DoorDash: funding
    • SoftBank Group: investment
    • Sequoia Capital: investment
    • GIC: investment
    • Wellcome Trust: investment

We decided to tackle this multi-label problem through a natural language processing (NLP) and supervised machine learning (ML) approach. Our customers expect high precision out of these signals, so, over the span of many years, we created a golden set of tagged snippet data.

Gathering Data

The precursor to all of our work on classification comes from an essential dataset that is continuously being expanded by our highly trained auditors. DataFox's team of auditing specialists have collectively spent tens of thousands of hours combing through our news pipeline to tag articles with relevant signals.

This huge amount of work has created a useful, clean, and structured set of company data with hundreds of thousands of relevant signal tag examples. Using our set of internal tools, an auditor will read through an article, highlight a relevant snippet, tag it with any relevant signals, and associate each signal with the relevant company.

[Image: DataFox's natural language processing algorithm for company data]

This human classification step creates structure for our training dataset and creates a significant competitive advantage for our machine learning approach.

Preparing Data

Returning to the crux of the multi-label problem, we knew that in order to make sense of the tagged training data, we still needed to extract relevant features. Specifically, we needed a way to turn the tagged snippet text into features that could be incorporated into a model.

In order to do this, we turned to the python library spaCy to perform a significant portion of data cleaning and pre-processing. Using spaCy helped us with the following tasks:

  • Apply Named Entity Recognition (NER) to perform substitutions that allow our models to learn generalized sentence structures
    • 2017 or 2018 becomes DATE
$200M or €200M becomes CURRENCY
    • DataFox becomes ORGANIZATION

  • Remove stopwords, using multiple approaches:
    • A list of known stopwords
    • Words that occur too frequently to be informative
      • maximum document frequency
    • Words that occur too infrequently to be informative
      • minimum document frequency

  • Once the snippets had been prepared in this way, we tested a variety of other pre-processing steps:
    • N-grams where N = {1, 2, 3, 4}
    • Downsampling the untagged snippets (to avoid excessive bias in the model)
    • Balancing the training data using class weights

  • And we tried a few different methods to vectorize the text into features:
    • TF-IDF vectorization
    • Bag of Words - Count vectorization
    • Singular Value Decomposition (SVD) on bag of words

Ultimately, we tested these different methods of preparing the data on different classes of models to gauge the relative effects on the classifier's precision and recall per signal.
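One way these steps could fit together is sketched below. The entity-substitution helper is a stand-in: in production the spans would come from spaCy's `doc.ents`, but here they are passed in directly so the example stays self-contained. The corpus and parameter values are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def substitute_entities(text, spans):
    """Replace entity spans (start, end, label) with their label,
    e.g. '$535 million' -> 'CURRENCY'. Spans are applied right-to-left
    so earlier character offsets stay valid."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text

sentence = "DoorDash raised $535 million in 2018."
spans = [(0, 8, "ORGANIZATION"), (16, 28, "CURRENCY"), (32, 36, "DATE")]
normalized = substitute_entities(sentence, spans)

corpus = [
    normalized,
    "ORGANIZATION raised CURRENCY in a Series D round.",
    "ORGANIZATION opened a new office in DATE.",
    "ORGANIZATION announced a partnership in DATE.",
]

# Unigrams + bigrams; known stopwords dropped via stop_words, and terms
# that appear in (almost) every document dropped via max_df.
vectorizer = TfidfVectorizer(ngram_range=(1, 2),
                             stop_words="english",
                             max_df=0.95, min_df=1)
X = vectorizer.fit_transform(corpus)
```

Because the ORGANIZATION token appears in every document, `max_df` removes it from the vocabulary, which is exactly the "too frequent to be informative" behavior described above.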

Model Selection

After these data preparation steps, it was time to test out different models and gauge how well they classified the snippets in question. In a typical test/train setup, we set aside 20% of the tagged data as a test set.

We then evaluated individual models using cross-validation on the remaining 80% of the data, with discrete outputs of the mean and variance for precision and recall across all cross-validated runs.

Here are a few of the model classes we tested, using scikit-learn's implementations (including its stochastic gradient descent classifier):

  • Random Forests
  • Support Vector Machines
  • Logistic Regression

In order to compare the results from these models, we used precision, recall, and f1-scores per signal as guiding metrics. Additionally, to choose amongst similarly performing models, we considered the interpretability of the model.
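A minimal sketch of this comparison loop follows, using a synthetic dataset in place of the vectorized snippet features. The model choices mirror the list above (a linear SVM trained via `SGDClassifier`); hyperparameters and data are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic stand-in for the vectorized snippet features and labels.
X, y = make_classification(n_samples=400, n_features=30, random_state=0)

# Hold out 20% as a final test set, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "svm_sgd": SGDClassifier(loss="hinge", random_state=0),  # linear SVM via SGD
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Cross-validate each model on the training split, keeping the mean and
# standard deviation of precision and recall across folds.
results = {}
for name, model in models.items():
    scores = cross_validate(model, X_train, y_train, cv=5,
                            scoring=["precision", "recall"])
    results[name] = {
        "precision": (scores["test_precision"].mean(),
                      scores["test_precision"].std()),
        "recall": (scores["test_recall"].mean(),
                   scores["test_recall"].std()),
    }
```

The winning model would then be refit on the full training split and scored once against the held-out 20%.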

In the end, a tuned logistic regression performed best, lending itself well to future re-training and giving us the ability to inspect the most important features contributing to each signal tag. Here are some high-level results from the model's performance on the test set, where the overall precision (78.1%) and the overall recall (73.2%) combined for a healthy f1-score of 74.6%.

Signal Type          Precision   Recall   F1-Score
valuation              99.5%      98.3%     98.8%
dividends              95.2%      99.1%     97.1%
stock-performance      94.1%      95.5%     95.0%
...
All Signals            78.1%      73.2%     74.6%

Once deployed, this structured classification of the news surfaces fascinating examples, from the international marketing activity of an interestingly named French company:

  • Snippet:
    • Google Maps on Android and iOS now has a new transportation option. If you're living in a country where French startup BlaBlaCar operates, you can now open the BlaBlaCar app and book a ride straight from Google Maps.

  • Signals:
    • BlaBlaCar: marketing-activity
    • Google: marketing-activity

To the can't-miss details of massive funding rounds in Silicon Valley:

  • Snippet:
    • Instacart said it raised $200 million in a new funding round this morning led by Coatue Management, as well as Glade Brook Capital Partners

  • Signals:
    • Instacart: private-funding
    • Coatue Management: investment
    • Glade Brook Capital Partners: investment

Future Work

As always, at the end of such a project, there are exciting opportunities to improve the current methodology. Here are a few items that are top of mind going forward:

  • At the tail end of our signal taxonomy are a few signal tags that haven't received enough tags in our training set to perform well in a classification setting. We have room to improve those signal classifications either by artificially boosting our training set, or manually tagging more examples to get a better breadth of examples.

  • We can consider using neural networks like Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs). In this situation we would use word embeddings as features rather than sentence vector space models. An advantage of using a CNN could be to help make the classifier more context aware than N-grams preprocessing can provide. And an advantage of using an RNN could be making the classifier more temporally aware, processing one word at a time, as opposed to all at once.