The Data-Driven Approach to Finding Similar Companies


Finding Similar Companies

One of the features of our product that our customers love is our
“Related Companies” module. On the profile of any company in our database, we provide our users with a list of other companies that we suggest might be
related. A lot goes into these suggestions, including both some manual effort
and a lot of machine learning. An army of analysts could not comb through
our half million companies and come up with a good list of related companies for
each from the other half million (minus one) companies. And with an
ever-increasing number of companies in our database, we knew this was a problem
fit for computer science, so we applied a combination of engineering and
machine power to the problem.

Our related companies algorithm takes a number of inputs into account (company
sector, company size, news co-mentions, participation in conferences, etc.), but
here we will focus on one aspect
of the algorithm: company keywords. Every company we have in our database is
associated with some list of keywords, which we index from their
home pages, infer from their presence in news articles and directories, and
collect from the training lists our analysts build in-house. This blend of how a
company describes itself, often with the aim of luring visitors from organic web
searches, and how others classify it gives us a useful set of words summarizing
the company’s product and/or industry. By comparing any pair of companies based
on these keywords, we get a highly useful facet for our overall related
companies algorithm.

However, how to compare two lists of keywords is not entirely obvious. We could look
at some metric based upon exact matches between keyword lists, such as Jaccard
similarity or cosine similarity, but keywords are a very sparse feature. Out of
the thousands of keywords in our database, any given company will list only about
a dozen. To deal with this, we want to be able to harness a measure of similarity
between keywords. After all, if company A lists “cloud storage” as a keyword,
then we should have more confidence that they are related to a company listing
“file sharing” as a keyword than to a company listing “mobile payments” as a keyword.
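
To make the sparsity problem concrete, here is a minimal sketch of an exact-match comparison (the keyword lists and the helper are made up for illustration):

```python
def jaccard_similarity(a, b):
    """Exact-match Jaccard similarity between two keyword sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical keyword lists: roughly a dozen keywords out of thousands.
company_a = {"cloud storage", "file sync", "online backup", "collaboration"}
company_b = {"file sharing", "document collaboration", "cloud backup"}

# The companies are clearly related, but exact matching finds no overlap,
# so the score is 0.0 -- the sparsity problem in action.
print(jaccard_similarity(company_a, company_b))
```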

While it can be hard to determine how to compare two keywords, we all know many
ways to compare two vectors, whether by Euclidean ((L_2)) distance, Manhattan
((L_1)) distance, cosine distance, or any other crazy metric you want to come up
with. If only we could translate our keywords to vectors…
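
For reference, once we do have vectors, these comparisons are one-liners; a quick sketch with made-up three-dimensional vectors:

```python
import numpy as np

# Hypothetical vector encodings of two keywords.
u = np.array([0.2, 0.7, 0.1])
v = np.array([0.3, 0.6, 0.2])

l2 = np.linalg.norm(u - v)                                    # Euclidean (L2) distance
l1 = np.abs(u - v).sum()                                      # Manhattan (L1) distance
cosine = 1 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))  # cosine distance

print(l2, l1, cosine)
```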

word2vec

A very interesting recent project out of Google does
just that. Training a neural network on word-context occurrences in a huge corpus
of text, word2vec encodes words as vectors.
The basics are described at the
Google Code repository
for their Python implementation, and you can find more detail in a
series of papers
linked to from the word2vec Google Code page.
The vector encodings they produce have some surprising and desirable arithmetic
properties. Using an implementation trained on the Google News data set, they
provide a couple of interesting examples of how arithmetic operations on the
vectors capture some semantic meaning:

  • word2vec('king') - word2vec('man') + word2vec('woman') is close in cosine similarity to word2vec('queen')
  • word2vec('france') is close in cosine similarity to word2vec('spain'), word2vec('belgium'), etc.
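
As a concrete illustration (our choice of tooling here, not something the word2vec page prescribes), the first example can be reproduced by loading the pre-trained Google News vectors through the gensim library:

```python
from gensim.models import KeyedVectors

# Assumes the pre-trained Google News vectors have been downloaded locally.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# word2vec('king') - word2vec('man') + word2vec('woman') ~ word2vec('queen')
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```
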
It would be great if we could harness these same ideas in our context. Unfortunately, using the default word2vec implementation directly is not optimal: the data it is trained on is too broad and misses the nuances of our domain-specific keywords.

Neural Networks, or Matrix Factorization

We want to train a “keyword2vec” model so we can more robustly compare both
keywords themselves and companies based on their listed keywords.
The word2vec project provides
a way to train using a new data set, but training neural networks is never easy:
they are sensitive to parameter tuning (step-size schedules for stochastic gradient
descent, number of hidden layers, number of hidden units in each layer, etc.),
they are not very interpretable, and they are often slow to train.
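
For completeness, training on our own keyword lists with a library like gensim would look roughly like the following sketch (hypothetical sample data and parameters; we did not end up taking this route, for the reasons above):

```python
from gensim.models import Word2Vec

# Each company's keyword list is treated as one "sentence" of tokens.
# Hypothetical sample data and parameters; these would need careful tuning.
keyword_lists = [
    ["cloud storage", "file sync", "online backup"],
    ["mobile payments", "mwallet", "card payments"],
]

model = Word2Vec(
    sentences=keyword_lists,
    vector_size=100,  # dimension of the keyword vectors
    window=5,         # context window within a keyword list
    min_count=1,      # keep even rare keywords
    sg=1,             # skip-gram, as in the original word2vec papers
)
```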

A recent paper from NIPS 2014,
Neural Word Embedding as Implicit Matrix Factorization, shows
that the neural
network implementation of word2vec is actually very similar to a low-rank
factorization of a certain matrix. In particular, the word vectors represent
the loadings found in a matrix factorization of a positive
pointwise mutual information ((PPMI)) matrix
(P \in \mathbb{R}^{V \times V}), where

(P_{w,v} = PPMI(w,v) = \max\left\{0, PMI(w,v)\right\},\;\;\; PMI(w,v) = \log \frac{p(w,v)}{p(w)p(v)})

in which (V) is the size of the vocabulary, (w) and (v) are words,
(p(w)) is the probability of seeing word (w), and (p(w,v)) is the probability of seeing words (w) and (v)
together.

In our setting, we find (p(w,v)) by counting co-occurrences of words (w)
and (v) in a company’s keywords set, while (p(w)) and (p(v)) are
found from total occurrences across all companies’ keywords.
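
A minimal sketch of building the ((PPMI)) matrix from company keyword lists, assuming one list of keywords per company (illustrative code, not our production pipeline):

```python
import numpy as np
from collections import Counter
from itertools import combinations

def ppmi_matrix(keyword_lists):
    """Build the PPMI matrix from keyword lists, one list per company."""
    vocab = sorted({kw for kws in keyword_lists for kw in kws})
    index = {kw: i for i, kw in enumerate(vocab)}

    # Count occurrences and within-company co-occurrences.
    word_counts = Counter()
    pair_counts = Counter()
    for kws in keyword_lists:
        unique = set(kws)
        word_counts.update(unique)
        pair_counts.update(combinations(sorted(unique), 2))

    # Simple normalization by the number of companies; other normalizations
    # (e.g. by total co-occurrence counts) are possible.
    n = len(keyword_lists)
    P = np.zeros((len(vocab), len(vocab)))
    for (w, v), c in pair_counts.items():
        pmi = np.log((c / n) / ((word_counts[w] / n) * (word_counts[v] / n)))
        P[index[w], index[v]] = P[index[v], index[w]] = max(0.0, pmi)
    return P, vocab
```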

To obtain our word vectors, we now factorize the (PPMI) matrix. Provided a
desired dimension (d) for the word vectors, we find a factorization of
(P) so that

(P \approx W W^T)

where (W \in \mathbb{R}^{V \times d}). Notice that a factorization of this form
((W W^T)) makes sense because (P) is a symmetric matrix
– i.e. (P_{w,v} = P_{v,w}). Finding the best rank-(d) approximation in the
Frobenius norm is feasible by taking the
singular value decomposition (SVD) of the matrix,
(P = U \Lambda U^T), and keeping only the first (d) elements of the diagonal
matrix (\Lambda). That is, (\tilde{P} = U \tilde\Lambda U^T), where
(\tilde\Lambda) retains only the (d) largest elements of the diagonal matrix
(\Lambda), is the rank-(d) matrix closest to (P) in the Frobenius norm – i.e.

(\tilde{P} \in \arg\min_{\hat{P}\;:\;\textrm{rank}(\hat{P}) = d} ||P - \hat{P}||_F)

Using this, we go back to our earlier problem of finding (W) so that
(P \approx W W^T). If we take

(W = U \tilde{\Lambda}^{1/2}), dropping the zero columns so that (W) has exactly (d) columns,

we now have a matrix (W) such that (WW^T = U \tilde\Lambda U^T) is very
close to our original
(PPMI) matrix (P). Furthermore, each row of (W) can now be used as a vector
representing the corresponding word – i.e. (W_{i\cdot}) is the
(d)-dimensional vector corresponding to the (i^{th}) word. And, it turns out,
these vectors have similar properties to the vectors obtained from word2vec.
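
Continuing the sketch above, the factorization step is only a few lines (again illustrative; the symbols follow the derivation):

```python
import numpy as np

def keyword_vectors(P, d):
    """Factorize the symmetric PPMI matrix P so that W @ W.T approximates P."""
    # For a symmetric matrix, the SVD coincides with the eigendecomposition
    # up to signs; the singular values come back in decreasing order.
    U, s, _ = np.linalg.svd(P)
    # W = U * Lambda^{1/2}, truncated to the top d dimensions.
    return U[:, :d] * np.sqrt(s[:d])
```

Each row of the returned matrix is then the keyword vector for the corresponding entry of vocab from the earlier snippet.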

Using the Keyword Vectors

Now that we have our own trained vectors for the keywords obtained for our
companies, we can have some fun!

Similar Keywords

We can find similar keywords by comparing the vector representation of one
keyword to the vector representations of all other keywords (a sketch of this
lookup appears after the lists below). Looking, for
example, at the keyword cloud storage, we find that the closest other
keywords in terms of cosine similarity are

  • flash storage: 0.981116179237997
  • idrive: 0.978347988316545
  • online file sync: 0.977817384377984
  • your external hard drive in the cloud: 0.973818440471018
  • backup online: 0.973174935235816
  • online storage: 0.970892387415698
  • free online storage: 0.970609131103706
  • online backup solutions: 0.969503062603722
  • online file backup: 0.968670947689898
  • online sync: 0.965037194231968
  • etc…

and for the keyword mobile payments,

  • mobile payment: 0.985906495453295
  • payments: 0.982129350125816
  • android payments: 0.979123374750015
  • ivr payments: 0.978475369212524
  • mwallet: 0.977912206702434
  • fintech payments: 0.977378781616218
  • mobile payment enabler: 0.974429434165879
  • mobile payment solution: 0.974288467923257
  • credit cards & transaction processing: 0.974219772716241
  • card payments: 0.972751346246811
  • etc…
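
The lookup behind these lists is just a cosine-similarity ranking over the rows of (W); a minimal sketch, reusing the W and vocab from the earlier snippets:

```python
import numpy as np

def most_similar_keywords(keyword, W, vocab, topn=10):
    """Rank all other keywords by cosine similarity to the given keyword."""
    index = {kw: i for i, kw in enumerate(vocab)}
    target = W[index[keyword]]
    # Cosine similarity of the target row against every row of W.
    sims = W @ target / (np.linalg.norm(W, axis=1) * np.linalg.norm(target) + 1e-12)
    ranked = np.argsort(sims)[::-1]
    return [(vocab[i], float(sims[i])) for i in ranked if vocab[i] != keyword][:topn]
```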

Similar Companies

Although we incorporate much more into our “Related Companies” algorithm, the
keywords alone can provide pretty good similarity scores. If we encode each
company as the mean of its keyword vectors, we can look again at the cosine
similarity between each company and all other companies.
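
A minimal sketch of that encoding and the resulting company-to-company score, again reusing W and a keyword-to-row index built from the earlier vocab:

```python
import numpy as np

# vocab_index maps each keyword to its row in W, e.g.:
# vocab_index = {kw: i for i, kw in enumerate(vocab)}

def company_vector(keywords, W, vocab_index):
    """Encode a company as the mean of its keyword vectors."""
    rows = [W[vocab_index[kw]] for kw in keywords if kw in vocab_index]
    return np.mean(rows, axis=0)

def company_similarity(keywords_a, keywords_b, W, vocab_index):
    """Cosine similarity between two companies' mean keyword vectors."""
    a = company_vector(keywords_a, W, vocab_index)
    b = company_vector(keywords_b, W, vocab_index)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```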

Some examples are the nearest companies we find for Dropbox (staying with the
cloud storage / collaboration theme) and for BOKU (staying with the mobile
payments theme).

Using these methods, we can build a set of similar-company weights for a single
company. Running this portion of the algorithm across our more than 500,000
companies is an engineering feat of its own (the pipeline started in
CoffeeScript, was rewritten in Scala, and then rewritten again in R), but that’s
a subject for a separate blog post. Similarly, determining how to weight the
keyword similarity against the other facets we use in the overall algorithm has
been an interesting process in its own right.

Here, we have shown how we use some fun frameworks, along with NLP and matrix
factorization, to feed clusters of related companies into our application,
helping our customers identify new prospects and competitors for any company in
our database.