At DataFox, our mission is to provide data-driven business insights to salespeople, analysts, executives, and investors. One of our proprietary systems is the DataFox scoring system, which uses machine learning to quantify hard-to-define traits like financial stability and management quality, and most importantly, how those traits can predict a company's growth.
Using the companies in our database, we've developed models to provide a leading indicator of success. This predictive method is particularly important when evaluating private companies, where traditional factors like revenue and past performance paint an inaccurate picture of future growth.
To accomplish this, we’ve built a series of algorithms to evaluate companies based on growth, influence, finance, management and overall quality. These scores allow DataFox users to quickly search our company database and identify the best or most suitable companies in any sector, location, or stage. Our scores identify the best companies, just as Google's PageRank algorithm identifies the best webpages.
How We Calculate our Five Distinct Scores
Our customers have different perspectives on how to score a company, such as overall size, growth, funding strength, team quality, and more. Recognizing this, we calculate five different scores based on dozens of criteria.
Like Google’s search algorithm, DataFox scores are proprietary and we do not disclose our exact formulas. To give you an idea of how they work, here are some of the features that factor into them.
Growth Score
Is the company likely to experience revenue growth?
To make our growth predictions, we analyze eight factors, including:
- Headcount growth: The increase in the number of employees at a firm over time.
- Investor score: A function that estimates the number of "prestigious" investment firms that have invested in the company. The set of prestigious institutions is curated by DataFox analysts.
- Job listings: The number of available positions the company has listed on Indeed and Jobvite.
- Growth factor: A function of overall quality score with respect to time.
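To make two of these factors concrete, here is a minimal sketch of how headcount growth and an investor score could be computed. It is not our actual formula: the function names, the simple period-over-period growth rate, and the placeholder fund names are assumptions for illustration only.

```python
def headcount_growth(previous, current):
    """Relative change in employee count between two points in time."""
    if previous <= 0:
        return None  # growth is undefined without a positive baseline
    return (current - previous) / previous


def investor_score(investors, prestigious):
    """Count how many of a company's investors are in the curated set."""
    return len(set(investors) & set(prestigious))


# Placeholder names; a real curated list would come from analysts.
curated = {"Fund A", "Fund C", "Fund Z"}
growth = headcount_growth(previous=40, current=50)
prestige_count = investor_score(["Fund A", "Fund B", "Fund C"], curated)
```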
Finance Score
What is the financial strength of the company?
To evaluate the financial stability and upside of a company, we analyze eight factors, including:
- Investor score: A function that estimates the number of "prestigious" investment firms that have invested in the company. The set of prestigious institutions is curated by DataFox analysts.
- Estimated revenues: While revenue estimates for private companies are imprecise, our analyses have shown them to be valuable indicators of the relative size of the business.
- Liquidity score: A formula we developed that calculates how liquid the company’s finances are based on funds raised and the recency of those financings.
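As an illustration of a score that depends on both funds raised and the recency of each financing, the sketch below discounts every round by its age. The exponential decay, the one-year half-life, and the name `liquidity_score` are hypothetical choices made for this example, not the formula we use.

```python
from datetime import date


def liquidity_score(financings, today, half_life_days=365.0):
    """Sum funds raised, halving a round's weight every half_life_days."""
    total = 0.0
    for amount, closed_on in financings:
        age_days = (today - closed_on).days
        total += amount * 0.5 ** (age_days / half_life_days)
    return total


# Made-up rounds: (amount raised in dollars, closing date).
rounds = [(10_000_000, date(2014, 1, 1)), (2_000_000, date(2013, 1, 1))]
score = liquidity_score(rounds, today=date(2015, 1, 1))
```

With these inputs, the one-year-old round counts at half its value and the two-year-old round at a quarter, so recent financings dominate the result.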
Management Score
How strong is the company's executive team?
To determine whether a company has a quality leadership team, we analyze seven factors, including:
- Retention rate: The average number of previous jobs held by members of the company’s executive team.
- Educational prestige: The fraction of degrees earned by the company’s executive team that come from "prestigious" schools, defined to mean schools ranked in the top 30 in the US News & World Report rankings.
- Change in LinkedIn followers: The rate at which the company is gaining or losing LinkedIn followers. This indicates whether a company is gaining or losing cachet in the job market.
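The educational prestige factor is a simple fraction, which can be sketched as follows. The function name and the placeholder school names are assumptions; a real list would come from the rankings mentioned above.

```python
def educational_prestige(degree_schools, prestigious_schools):
    """Fraction of executive-team degrees that come from the curated list."""
    if not degree_schools:
        return None  # no degree data, so the fraction is undefined
    hits = sum(1 for school in degree_schools if school in prestigious_schools)
    return hits / len(degree_schools)


# Placeholder school names, one entry per degree held by the team.
top_schools = {"School A", "School B"}
team_degrees = ["School A", "School X", "School B", "School Y"]
prestige = educational_prestige(team_degrees, top_schools)
```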
Influence Score
How significant is the company's online presence?
To determine whether a company has a strong brand and marketing presence, we analyze seven factors, including:
- Website traffic: The amount of web traffic to the company’s website.
- News mentions: The frequency with which the company is mentioned in the news, as calculated by our news auditing algorithms.
- Twitter mentions and followers: The number of @mentions about the company on Twitter.
- Conference sponsorship: The number of conferences a company sponsors.
Overall DataFox Score
How successful is the company overall?
The DataFox Score leverages machine learning to build a model, selecting among all of the underlying features available across our four sub-scores.
What is the Distribution of our Scores?
Clients often ask us how they should interpret the DataFox score. Our scores are calculated on a scale between 0 and 1,250. The following is a histogram of the “Overall DataFox Score” for the top 230,734 businesses in our data set:
Why do we not present all of our companies here?
First, many companies are either too early, too small, or for other reasons have not hit enough milestones to earn a good score. Second, many smaller companies operate in stealth and have little to no publicly visible footprint.
Thus, we are especially careful about our score calculations, and we in fact calculate our confidence in a company’s score. When a key data point is missing for a given company, we do not publish that score. We still build a profile for the company and collect events for it, but we leave the score blank. Meanwhile, the distribution among the more established companies gives you a sense of what the score means.
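The rule of leaving a score blank when key data is missing can be sketched like this. The list of key fields, the definition of confidence as the fraction of fields present, and the function names are all hypothetical; they only illustrate the idea.

```python
KEY_FIELDS = ("headcount", "total_funding", "website_traffic")  # hypothetical


def score_confidence(company):
    """Fraction of key data points that are present for this company."""
    present = sum(1 for field in KEY_FIELDS if company.get(field) is not None)
    return present / len(KEY_FIELDS)


def published_score(company, raw_score):
    """Return the score only if no key data point is missing, else None."""
    return raw_score if score_confidence(company) == 1.0 else None


# Made-up profiles: one established company and one operating in stealth.
established = {"headcount": 120, "total_funding": 5_000_000, "website_traffic": 40_000}
stealth = {"headcount": 8, "total_funding": None}
```

Here `published_score(stealth, 410)` returns `None`, so the profile exists but the score stays blank.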
Harnessing Machine Learning to Score Companies
We use machine learning to train our algorithm to assign scores across companies. The process has three stages: defining training sets, picking a model, and iterating with cross-validation.
Defining Training Sets
First, we define and build training sets to classify what success and failure look like for the algorithm. A training set is a set of examples used to fit a model to predict a type of response based on the input variables or features. To accomplish this, we identify companies that exemplify the characteristics of the high- and low-achieving companies for each score. We also make sure our training sets include companies at all levels of growth, ranging from Early Stage to Late Stage companies.
Then, we prepare our data set. Raw data must be cleaned before it can be used. We start by removing outliers and noisy data, then move on to more complex calculations to normalize our data. For instance, part of our growth score is an intermediate "liquidity score" calculation that estimates the company's financial position on a basis that can be compared across industries.
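A minimal sketch of this kind of cleaning step is shown below, assuming an interquartile-range filter for outliers and min-max scaling for normalization. These are common textbook choices picked for the example, not necessarily the methods we use, and the headcount values are made up.

```python
import statistics


def remove_outliers(values, k=1.5):
    """Drop values more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


def min_max_normalize(values):
    """Rescale values linearly so the smallest is 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal, nothing to spread
    return [(v - lo) / (hi - lo) for v in values]


# Made-up headcounts with one obviously noisy entry.
headcounts = [12, 15, 18, 20, 22, 25, 30, 5000]
cleaned = remove_outliers(headcounts)
normalized = min_max_normalize(cleaned)
```

Removing the outlier first matters: if 5000 were left in, min-max scaling would squeeze every other company into a tiny range near zero.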
Picking the Model
Given the training data, we train the chosen model to learn the optimal combination and weightings of the input features. This learning phase helps us determine which quantitative factors differentiate the successes from the failures among examples provided in the training set. The most naïve possible model would be to apply a formula like this:
((1*A) + (1*B) + (1*C) + (1*D) + (1*E)) / 5
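In code, this naïve model is just an equal-weight average. The sketch assumes the five features have already been normalized to a common scale; the values are made up.

```python
def naive_score(features):
    """Equal-weight average: every feature gets a coefficient of 1."""
    return sum(features) / len(features)


# Made-up normalized values standing in for A, B, C, D, and E.
example = naive_score([0.2, 0.4, 0.6, 0.8, 1.0])
```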
After doing all of the manual work gathering data points and computing them, even very simple functions can help separate the high achievers from the rest. Meanwhile, in-house teams do not normally have the wherewithal to use sophisticated functions to normalize the features or scientifically calculate what the model and its coefficients ought to be.
We generate these scores so our clients don’t have to spend hours creating their own scores with less sophisticated models.
In the end, each of our scoring models takes a form that looks somewhat like the following:
(11.5*A^3) + (0.5*function(B)) + (32.3*function(C)) + (1.2*function(B&D)) + (19.2*function(J&C)) + (17.7*H^(1/2)) + (0.2*function(J))
In this hypothetical, A might be employee retention, B might be headcount growth, and so on. In reality, our models tend to be more complex polynomial equations than the one above, but this helps illustrate the point. The role of machine learning in this process is to rapidly run simulations and assist us in determining the appropriate values of the coefficients (11.5, 0.5, 32.3, and so on) so that the algorithm’s output matches the training sets. Once it finds the best-fitting model, it then applies that formula to all of the other companies in our data sets.
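To show what "determining the coefficients" means, the sketch below fits coefficients by ordinary least squares, solving the normal equations directly. This is a stand-in technique chosen for brevity and is not a description of our production method. The two feature transforms (a cube and a square root) and the training values are made up, with targets generated from known coefficients so the fit can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augment the matrix so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitute from the last row upward.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_coefficients(rows, targets):
    """Least-squares coefficients from the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def transform(a, h):
    """Hypothetical feature transforms: a cubed and the square root of h."""
    return [a ** 3, h ** 0.5]


# Made-up training set whose targets follow 2*a^3 + 3*sqrt(h) exactly.
raw = [(1.0, 4.0), (2.0, 9.0), (0.5, 16.0), (1.5, 1.0)]
rows = [transform(a, h) for a, h in raw]
targets = [2.0 * r[0] + 3.0 * r[1] for r in rows]
coefficients = fit_coefficients(rows, targets)
```

Because the targets were generated from the coefficients 2 and 3, the fit recovers those values up to floating-point error. Real training labels are noisy, so the fitted coefficients would only approximate the best match.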
Iteration and Cross-Validation
We output the results of the algorithm and examine its performance, using measures such as root mean squared error and confusion matrices to identify weaknesses in the input data, normalization, or training sets. We then optimize and repeat. Finally, we apply cross-validation to estimate how well the model will generalize to data it has not been trained on.
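The evaluation measures named above can be sketched in a few lines. The toy one-coefficient model (`fit_slope`), the fold assignment by index, and the data are assumptions for illustration; any real model could be swapped in.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual scores."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def confusion_matrix(predicted, actual):
    """Count outcomes for binary labels, where True marks a high achiever."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for p, a in zip(predicted, actual):
        if p:
            counts["tp" if a else "fp"] += 1
        else:
            counts["fn" if a else "tn"] += 1
    return counts


def k_fold_splits(n, k):
    """Yield (train, test) index lists so each example is held out once."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


def fit_slope(xs, ys):
    """Toy stand-in model: least-squares slope of a line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def cross_validated_rmse(xs, ys, k):
    """Average RMSE on held-out folds, each scored by a model fit without it."""
    errors = []
    for train, test in k_fold_splits(len(xs), k):
        slope = fit_slope([xs[i] for i in train], [ys[i] for i in train])
        predicted = [slope * xs[i] for i in test]
        errors.append(rmse(predicted, [ys[i] for i in test]))
    return sum(errors) / len(errors)


# Made-up data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
cv_error = cross_validated_rmse(xs, ys, k=3)
```

The key property is that every prediction is scored by a model that never saw that example during fitting, which is what makes the error a fair estimate of performance on new companies.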
Conclusion (and Room for Improvement)
More data. At the core of this algorithm, there are data inputs, transformations to organize the data, and formulas to create and output these scores. Improvements to the scores are therefore achievable in those same few areas: more data, better transformations, or better formulas.
How do we prioritize our efforts? We follow a golden rule when it comes to machine learning:
There are diminishing returns to formula improvements over time.
Tweaking a score’s algorithm to crunch the same data set may yield a 1% improvement in its accuracy (which we measure using our training sets). In contrast, adding a new and proprietary feature to the data set might yield an improvement of 5% or 10%. This has been reinforced by our previous work in machine learning and statistics, our own internal iterations and testing, and interviews with our advisors and customers.
As a result, we actually put more effort into collecting more proprietary data than we do fine-tuning our training sets and the polynomial equations or linear regressions that calculate our scores (although we do lots of that, too). Knowing where to start can be tough, so we prioritize our new data collection initiatives by their expected contribution to our score quality.