# The Data-Driven Approach to Finding Similar Companies

## Finding Similar Companies

One of the features of our product that our customers love is our

“Related Companies” module. On the profile of any company in our database, we
provide our users with a list of other companies that we suggest might be

related. A lot goes into these suggestions, including both some manual effort

and a lot of machine learning. An army of analysts could not comb through

our half million companies and come up with a good list of related companies for

each from the other half million (minus one) companies. And with an ever

increasing number of companies in our database, we knew it was a problem fit

for computer science, so we apply a combination of engineering and

machine power to the problem.

Our related companies algorithm takes a number of inputs into account (company

sector, company size, news co-mentions, participation in conferences, etc), but

here we will focus on one aspect

of the algorithm: company keywords. Every company we have in our database is

associated with some list of keywords, which we index from their

home pages, infer from their presence in news articles and directories, and

collect from the training lists our analysts build in house. By using a blend of

how a company describes itself, mostly with the purpose

of luring visitors from organic web searches, and how others classify the company,

it is a useful array of words

to summarize the company’s product and/or industry. By

comparing any pair of companies based on this array of keywords,

we can create a highly useful facet for our overall related companies algorithm.

However, comparing two lists of keywords is not entirely obvious. We could look

at some metric based upon exact matches between keyword lists, such as Jaccard

similarity or cosine similarity, but keywords are a very sparse feature. Out of

the thousands of keywords in our database, any given company will list only about

a dozen. To deal with this, we want to be able to harness a measure of similarity

between keywords. After all, if company A lists “cloud storage” as a keyword,

then we should have more confidence they are related to a company listing

“file sharing” as a keyword than a company listing “mobile payments” as a
keyword.

While it can be hard to determine how to compare two keywords, we all know many

ways to compare two vectors, whether by Euclidean ((L*2)) distance, Manhattan
((L*1)) distance, cosine distance, or any other crazy metric you want to
come up with. If only we could translate our keywords to vectors…

## word2vec

A very interesting recent project out of Google does

just that. Training a neural network on word-context occurrences in a huge corpus

of text, *word2vec* encodes words as vectors.

The basics are described at the

Google Code repository

for their Python implementation, and you can find more detail in a

series of papers

linked to from the word2vec Google code page.

The vector encodings they produce have some surprising and desirable arithmetic

properties. Using an implementation trained on the Google News data set, they

provide a couple of interesting examples of how arithmetic operations on the

vectors capture some semantic meaning:

`word2vec('king') - word2vec('man') + word2vec('woman')`

is close in cosine similarity to`word2vec('queen')`

`word2vec('france')`

is close in cosine similarity to`word2vec('spain')`

,`word2vec('belgium')`

, etc.

*word2vec*implementation directly is not optimal because the data it is trained is too broad and misses out on the nuances of domain-specific training data.

## Neural Networks, or Matrix Factorization

We want to train a “*keyword2vec*” model so we can more robustly compare both

keywords themselves and companies based on their listed keywords.

The *word2vec* project provides

a way to train using a new data set, but training neural networks is never easy:

they are sensitive to parameter tuning (step-size schedules for stochastic gradient

descent, number of hidden layers, number of hidden units in each layer, etc.),

they are not very interpretable, and they are often slow to train.

A recent paper from NIPS 2014,

Neural Word Embedding as Implicit Matrix Factorization, shows

that the neural

network implementation of *word2vec* is actually very similar to a low-rank

factorization of a certain matrix. In particular, the word vectors represent

the loadings found in a matrix factorization of a positive

pointwise mutual information ((PPMI)) matrix

(P \in \mathbb{R}^{V \times V}), where

(P_{w,v} = PPMI(w,v) = \max\left{0, PMI(w,v)\right},\;\;\; PMI(w,v) = \log \frac{p(w,v)}{p(w)p(v)})

in which (V) is the size of the vocabulary, (w) and (v) are words,

(p(w)) is the probability of seeing word (w),
and (p(w,v)) is the probability of seeing words (w) and (v)

together.

In our setting, we find (p(w,v)) by counting co-occurrences of words (w)

and (v) in a company’s keywords set, while (p(w)) and (p(v)) are

found from total occurrences across all companies’ keywords.

To obtain our word vectors, we now factorize the (PPMI) matrix. Provided a

desired dimension (d) for the word vectors, we find a factorization of

(P) so that

(P \approx W W^T)

where (W \in \mathbb{R}^{V \times d}). Notice that we can motivate a

factorization of this form ((W W^T)) because (P) is a symmetric matrix

– i.e. (P(*{w,v} = P*{v,w}). As it turns out, finding an optimal solution
in terms of the (L_2) norm is feasible by taking the

singular value decomposition (SVD) of the matrix,

(P = U \Lambda U^T), and keeping only the first (d) dimensions of the
diagonal matrix (\Lambda). It turns out that

(\tilde{P} = U \tilde\Lambda U^T), where (\tilde\Lambda) contains only
the first (d) elements of the diagonal matrix (\Lambda), is the

rank-(d) matrix that minimizes the Frobenius norm of the difference between

(P) and any other rank-(d) matrix – i.e.

(tilde{P} \in \arg\min*{\hat{P}\;:\;\textrm{rank}(\hat{P}) = d} ||P - \hat{P}||*F
)

Using this, we go back to our earlier problem of finding (W) so that

(P \approx W W^T). If we take

(W = U \tilde{\Lambda}^{1/2} )

we now have a matrix (W) such that (WW^T = U \tilde\Lambda U^T) is very

close to our original

(PPMI) matrix (P). Furthemore, each row of (W) can now be used as a
vector representing the corresponding word – i.e. (W_{i\cdot}) is the

(d)-dimensional vector corresponding to the (i^{th}) word. And, it turns
out, these vectors have similar properties with the vectors obtained from

*word2vec*.

## Using the Keyword Vectors

Now that we have our own trained vectors for the keywords obtained for our

companies, we can have some fun!

# Similar Keywords

We can find similar keywords by comparing the vector representation of one

keyword to the vector representations of all other keywords. Looking, for

example, at the keyword **cloud storage**, we find that the closest other

keywords in terms of cosine similarity are

**flash storage**: 0.981116179237997**idrive**: 0.978347988316545**online file sync**: 0.977817384377984**your external hard drive in the cloud**: 0.973818440471018**backup online**: 0.973174935235816**online storage**: 0.970892387415698**free online storage**: 0.970609131103706**online backup solutions**: 0.969503062603722**online file backup**: 0.968670947689898**online sync**: 0.965037194231968- etc…

and for the keyword **mobile payments**,

**mobile payment**: 0.985906495453295**payments**: 0.982129350125816**android payments**: 0.979123374750015**ivr payments**: 0.978475369212524**mwallet**: 0.977912206702434**fintech payments**: 0.977378781616218**mobile payment enabler**: 0.974429434165879**mobile payment solution**: 0.974288467923257**credit cards & transaction processing**: 0.974219772716241**card payments**: 0.972751346246811- etc…

# Similar Companies

Although we incorporate much more into our “Related Companies” algorithm, the

keywords alone can provide pretty good similarity scores. If we encode each

company as the mean of its keyword vectors, we can look again at the cosine

similarity between each company and all other companies.

Some examples are, for Dropbox (staying with the

cloud storage / collaboration theme),

- Box: 0.988772243220449
- SugarSync: 0.973269196861258
- Hightail: 0.972030152842286
- SkyDox: 0.964419747675301
- filobite: 0.961857655119854
- Firedrive Media: 0.959809673680361
- Yakimbi: 0.958234583938729
- Mediafire: 0.956988047920301
- Filesgateway.com: 0.956174150033562
- Zoolz: 0.955439684268182
- etc…

and, for BOKU (staying with the mobile payments

theme),

- PayAnywhere: 0.967311639462488
- Zong: 0.966622717338347
- Square: 0.962493155772685
- VeriFone: 0.957852113994113
- Fortumo: 0.956916229626648
- QFPay: 0.956654558103476
- Onebip (acquired by Neomobile): 0.9550171271087
- PaybyMe: 0.954839544810337
- Zaypay.com: 0.953663005838281
- obopay: 0.953425806463978
- etc…

Using these methods, we can build a set of similar company weights for a single

company. The engineering feat of running this portion of the algorithm across our

more than 500,000 companies is an another feat entirely, which started in

CoffeeScript, then re-wrote in Scala, and then re-wrote in R, but that’s a subject

for a separate blog post. Similarly, the algorithm for determining how to weight

the keyword similarity compared with the other facets we use in the overall

algorithm has been an interesting process.

Here, we have shown how we are able to to use some fun frameworks and an

interesting use of NLP and matrix factorization to feed clusters of related

companies into our application to help our customers identify new prospects and

competitors for any company in our database.