
Identifying Value Traps with Deep Learning

In our earlier work, we discussed how machine learning can improve quantitative investing, looking at how deep learning can be used to forecast future fundamentals, along with uncertainty estimates, to improve portfolio performance. In this post, we present a different application, where we use deep learning to remove companies from the portfolio that could be seen as 'traps'. This article was first published on Euclidean's Blog. For more background on machine learning and systematic investing, visit euclidean.com.


Quantitative Value Investing and Value Traps

Many studies have shown that quantitative value investing, the practice of building portfolios of stocks that are quantitatively inexpensive relative to some fundamental measure (e.g., earnings, book value, free cash flow, etc.), has outperformed market averages over long periods of time. This phenomenon has been significant, persistent, and global[1]. And it has been validated as far back as the data will allow us to investigate[2]. However, this style of investing is no panacea. For one, it is highly volatile. Second, there can be excruciatingly long periods of time where it underperforms the market. Also, many stocks that find their way into a quantitative value portfolio really do deserve their meager valuation. This occurs when a stock’s current fundamentals are a poor predictor of its future fundamentals. As an example, if a company is earning $100M today and has a market valuation of $1B, it might look inexpensive at its 10X multiple. However, if future earnings turn out to be $50M, then the 20X multiple at a $1B valuation looks fair or even expensive. In retrospect, the company is not cheap.


In this research, we investigate the idea of improving the performance of a specific quantitative value strategy by building a model that attempts to identify deservedly cheap stocks and remove them from a quantitatively constructed value portfolio. We demonstrate that a value strategy enhanced by removing these stocks achieved a simulated annualized return of 15.9% over the period 2000-2019, versus a 14% return for the non-enhanced strategy over the same period. We also show that the difference in performance is statistically significant across investment periods of varying start dates and horizons.


Long-Only Quantitative Value Investing

A common approach in a long-only quantitative investment strategy is to rank a given universe of stocks using a factor and pick the top-ranking stocks to form a portfolio. For example, a momentum strategy might use trailing 12-month price change to rank stocks and build a portfolio. Similarly, a value strategy might use EBIT/EV (operating income[3] divided by enterprise value[4]) as the ranking factor. Other popular strategies use factors such as profitability, size, and investment[5]. In practice, most quantitative strategies use a combination of factors.
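As a rough illustration of this ranking approach (a minimal sketch, not the implementation used in this research), the following snippet ranks a universe on EBIT/EV and keeps the top names. The column names `ticker`, `ebit`, and `enterprise_value` are assumed for illustration.

```python
import pandas as pd

def top_value_portfolio(df: pd.DataFrame, n: int = 50) -> pd.DataFrame:
    """Rank a stock universe on EBIT/EV and return the top-n names.

    Assumes one row per stock with hypothetical columns
    'ticker', 'ebit', and 'enterprise_value'.
    """
    df = df.copy()
    df["ebit_ev"] = df["ebit"] / df["enterprise_value"]
    # Higher EBIT/EV = more operating income per dollar of enterprise value.
    return df.sort_values("ebit_ev", ascending=False).head(n)

# Example usage with toy data (values in $M):
universe = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "ebit": [100.0, 50.0, 20.0],
    "enterprise_value": [1000.0, 250.0, 400.0],
})
print(top_value_portfolio(universe, n=2))
```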


Factors such as EBIT/EV are called value factors because, in theory, you receive more value (EBIT) per dollar spent (EV) on high-ranking stocks than you do on low-ranking stocks. There are two theories on why a value-factor strategy might work. One is that a high EBIT/EV indicates that the company is mispriced and a reversion to being appropriately priced will result in positive returns (i.e., EV will increase). The other is that a high EBIT/EV indicates that a stock is risky and that investors are rewarded for taking such risk in the form of higher returns (i.e., high risk, high reward). In this post, we focus more on the mispricing theory. However, value-factor outperformance likely results from both phenomena.


Value Traps

Consider a portfolio constructed on the value-factor EBIT/EV. If a company ranks highly on EBIT/EV and its EBIT (earnings) is a good proxy for future earnings, then it is reasonable to say that the company is attractively priced relative to companies that are ranked below it. However, if future earnings turn out to be much less than current earnings, then the company may not be attractively priced at all – it may deserve its low valuation. This type of opportunity, a quantitatively cheap stock that is a deservedly cheap stock, is sometimes called a value trap. If a quantitatively inexpensive value portfolio is littered with value traps, then the premise on which the value strategy works, as described above, is undermined.


We are thus motivated to investigate ways of identifying and removing highly ranked stocks that are value traps because, if we can improve the ratio of value opportunities to value traps in our portfolio, it seems likely that we will increase portfolio performance. Most quantitative strategies do use more than one factor, many of which (such as return-on-equity or amount of leverage) may weed out some value traps indirectly. For example, if a company is highly levered there may be a higher likelihood that it will have liquidity problems in the future and hence not be able to earn as much tomorrow as it does today. However, in the research that inspired this article, we sought to investigate the idea of identifying value traps more directly.


Deep Learning

This research was performed as part of a multi-year effort to investigate the application of deep learning to long-term investing. Recent applications of deep learning and recurrent neural networks have resulted in better-than-human performance by computers in many domains. However, there has been very little work in the application of these technologies to investment management. Nonetheless, there are several reasons why deep learning might achieve better results than traditional statistical methods or non-deep machine learning approaches when applied to long-term investing. These reasons include:

  • Machine learning approaches are typically structured to predict something from a fixed number of inputs. However, in the investment world, the input data typically come in sequences (for example, how a company’s operating results evolve over time), and the distribution of investment outcomes are conditioned by the evolution of those sequences. Recurrent neural networks, which have claimed many successes in recent years, are designed precisely for this type of sequenced data.

  • In quantitative investment research, a great deal of effort is put into “factor engineering” – the process of determining which features of a company are most valuable in forecasting the future. Deep learning provides the potential opportunity to let the algorithms discover the features based on raw financial data. That is, the “deep” in deep learning means that successive layers of a model are able to untangle important relationships in a hierarchical way from data as found “in the wild,” and these relationships may be stronger than the ones found via traditional approaches to factor engineering.

With these potential advantages over traditional statistical and non-deep machine learning techniques, we explored the idea of applying a deep learning approach to separating value opportunities from value traps with as high a degree of accuracy as possible.


The Value of Removing Value Traps

As discussed above, common sense tells us that if we can remove value traps from a portfolio of quantitatively inexpensive stocks, the performance of the portfolio should improve. To test the validity of our intuition, we ran investment simulations where we knew in advance (we could look into the future) which highly ranked stocks were value traps and which were not. Obviously, we can’t know the future, but we can use historical data to simulate how well we would have performed if we had known which stocks to remove. This process helps us quantify the value of identifying and removing value traps. It also puts an upper limit on how much we can improve portfolio performance by forecasting and removing value traps.


Before we begin, however, we must first give a concrete definition to the concept of a value trap. So far, we’ve simply defined it as a quantitatively inexpensive stock that performs poorly. This idea can be quantified in many ways, so we chose one that is simple, represents the idea accurately, and is a tractable learning problem for our models. To that end, we define a value trap as a stock chosen from those that rank highly on the factor EBIT/EV (value) but whose annual return is in the bottom decile (trap) of the universe. That is, the worst performing 10% of stocks.
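To make the definition concrete, here is a minimal labeling sketch. The bottom-decile return cutoff follows the text; the top-quintile EBIT/EV cutoff for "ranks highly" is our illustrative assumption, as are the column names `ebit_ev` and `fwd_12m_return`.

```python
import pandas as pd

def label_value_traps(df: pd.DataFrame,
                      value_quantile: float = 0.8,
                      trap_quantile: float = 0.1) -> pd.Series:
    """Label value traps: high EBIT/EV rank, bottom-decile 12-month return.

    `value_quantile` (top 20% on EBIT/EV) is an illustrative cutoff for
    "ranks highly"; the trap cutoff is the bottom decile of forward returns.
    Assumes hypothetical columns 'ebit_ev' and 'fwd_12m_return'.
    """
    is_cheap = df["ebit_ev"] >= df["ebit_ev"].quantile(value_quantile)
    is_bottom_decile = df["fwd_12m_return"] <= df["fwd_12m_return"].quantile(trap_quantile)
    return (is_cheap & is_bottom_decile).astype(int)
```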


Next, to empirically test the benefit of removing value traps, we rank all stocks by the value-factor EBIT/EV and then simulate portfolio performance where an increasing percentage (i.e., 5%, 10%, 15%, …, 100%) of known value traps are removed. Our expectation is that as the percentage goes up, portfolio performance should as well. This experiment helps confirm that expectation and also quantifies the amount by which performance goes up as an increasing number of value traps are removed.

Figure 1: Portfolio performance for varying amounts of value traps removed, for a 50-stock portfolio ranked on the EBIT/EV metric. For details on the simulation methodology and assumptions, see Appendix A.

In the above figure, we show how performance improves (with respect to a standard factor model) as the percentage of value traps removed increases. To simulate the average performance for each fraction of value traps removed, we randomly sample the fraction of value traps 100 times. For example, when removing 50% of the value traps, we randomly sample 50% of the value traps 100 times with different seeds, and average the portfolio performance.
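The resampling procedure can be sketched as follows. This is a simplified stand-in, assuming a hypothetical `simulate_portfolio` backtest function in place of the full methodology described in Appendix A.

```python
import numpy as np

def simulate_portfolio(stocks) -> float:
    """Placeholder for the full backtest described in Appendix A."""
    return float(np.random.default_rng(len(stocks)).normal(0.14, 0.02))

def avg_performance_after_removal(universe, traps, frac, n_trials=100) -> float:
    """Average performance when a random fraction `frac` of known traps is removed.

    For each trial, a different random subset of the known value traps is
    dropped from the universe before the portfolio is simulated.
    """
    rng = np.random.default_rng(0)
    n_remove = int(frac * len(traps))
    results = []
    for _ in range(n_trials):
        removed = set(rng.choice(traps, size=n_remove, replace=False))
        results.append(simulate_portfolio([s for s in universe if s not in removed]))
    return float(np.mean(results))
```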


As expected, the more value traps we remove, the better our portfolio performs. Hypothetically, if we could identify and remove all the value traps, we would have achieved a 3.5% increase in annualized return over the standard factor model during the simulation period.


Since 100% accuracy is not feasible, the blue region gives us an idea of the opportunity for varying degrees of accuracy. For example, if we remove 80% of the value traps, we can expect an increase of almost 2% in annualized performance. This motivates us to build a model that identifies value traps with a reasonable degree of accuracy and remove them from our universe.


The Deep Learning Model

In previous posts we demonstrated the successful use of deep learning to forecast future earnings using time series of historical fundamental data. Motivated by these results, we set out here to use historical five-year time series of fundamental data to train a deep neural network (DNN) to predict the likelihood that a company is a value trap. If being a value trap is defined as an inexpensive company whose stock price doesn’t perform well, then the most obvious approach to identifying them is to forecast future stock returns. However, early in our research with deep learning, we conducted several experiments where we attempted to forecast price changes directly and they failed to meaningfully outperform a simple linear model. Essentially, the price movement of stocks is extremely noisy, making it difficult to extract meaningful signals with a DNN.


So, instead, we hypothesized that a more tractable approach would be to set up the problem as a binary classification task. That is, rather than forecast price directly, we attempted to train a DNN to discriminate between two classes, ‘Value Trap’ and ‘Not Value Trap’, where ‘Value Trap’ is defined as the worst performing 10% of stocks. The intuitive justification for this approach is that, given the number of exogenous factors that affect a stock’s price (economic cycles, macro-events, industry regulation, etc.), forecasting whether a stock will do well or not relative to its peers seems like an easier task than forecasting its specific price in one year’s time.

Figure 2: A Deep (Recurrent) Neural Network that takes historical company data as input, and outputs the probability that a company is a ‘Value Trap’.
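In the spirit of Figure 2, a small recurrent binary classifier could be sketched as below. The layer sizes and framework (Keras) are illustrative guesses rather than the architecture used in this research; the input shape reflects the five annual time steps and roughly 80 features described in Appendix B.

```python
import tensorflow as tf

def build_value_trap_model(n_timesteps: int = 5, n_features: int = 80) -> tf.keras.Model:
    """A small recurrent binary classifier, illustrative only.

    Input: a 5-step (annual) sequence of ~80 features per company.
    Output: probability that the company is a value trap.
    """
    inputs = tf.keras.Input(shape=(n_timesteps, n_features))
    x = tf.keras.layers.LSTM(64)(inputs)                 # summarize the sequence
    x = tf.keras.layers.Dense(32, activation="relu")(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # P(value trap)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
    return model
```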

Training the Model

When training a model on time-dependent data, we make a distinction between in-sample and out-of-sample data. The in-sample data is a contiguous (in time) block of data that we use to train a model. The out-of-sample data is a contiguous block of data that begins after the in-sample period ends. The out-of-sample data is used to validate (test) the model trained on the in-sample data.


In this research, we use historical company data that spans the period 1970 to 2019 (See Appendix B for a detailed description of the data). We set the out-of-sample dataset to be all the data from the period 2000 to 2019. For each year in the out-of-sample period, we train a model using the last 30 years of data. For example, to predict for 2005, we use data from 1975-2004 as the in-sample dataset. And to predict for 2006, we use data from 1976-2005.
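The rolling in-sample/out-of-sample scheme amounts to a simple walk-forward split, sketched below with the windows described above.

```python
def walk_forward_splits(first_test_year=2000, last_test_year=2019, train_years=30):
    """Yield (train_start, train_end, test_year) for the rolling scheme.

    E.g., test year 2005 uses 1975-2004 as the in-sample window.
    """
    for test_year in range(first_test_year, last_test_year + 1):
        yield test_year - train_years, test_year - 1, test_year

# Example: the first few splits
for split in list(walk_forward_splits())[:3]:
    print(split)   # (1970, 1999, 2000), (1971, 2000, 2001), (1972, 2001, 2002)
```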


Since we define value traps as the companies that fall in the bottom decile of performance as measured by their 12-month returns, the number of data points representing value traps is much smaller than non-value traps. To resolve this imbalance, we augment the data by upsampling the value trap data points.
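One simple way to implement this upsampling is to resample the minority (value trap) class with replacement until the classes are balanced; this is a sketch of the idea, and the exact scheme used in the research may differ.

```python
import numpy as np

def upsample_minority(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """Balance classes by resampling value-trap examples (label 1) with replacement."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    resampled_pos = rng.choice(pos_idx, size=len(neg_idx), replace=True)
    idx = np.concatenate([neg_idx, resampled_pos])
    rng.shuffle(idx)
    return X[idx], y[idx]
```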


In training any deep learning model, we minimize some form of error between the predictions and the true values. For example, in a linear regression task it is common to use the squared error. As this is a classification task, the model tries to minimize the log loss which penalizes misclassifications.
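For reference, the binary log loss (cross-entropy) minimized here, for predicted probabilities $\hat{y}_i$ and labels $y_i \in \{0, 1\}$ over $N$ examples, is:

```latex
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \hat{y}_i + (1 - y_i)\log\big(1 - \hat{y}_i\big) \,\Big]
```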


Finally, we ensemble (combine) multiple deep learning models to make the predictions more robust.


Model Performance and Results

The DNN model outputs the probability of being a value trap which lies between 0.0 and 1.0. We typically pick a threshold above which we classify an output as a value trap. The value of this threshold depends on the final application. For example, if a classifier is predicting whether a tumor is benign or malignant, it is better to be conservative and have a lower threshold to classify tumors as malignant. This will reduce the number of unchecked malignant tumors. On the other hand, for a different application such as email spam classification, it is better to have a higher threshold so that important emails are not classified as spam.


To evaluate the performance of a classifier irrespective of the threshold value, we look at the AUC (area under the ROC curve) metric to see how well the model can discriminate between value traps and not value traps. AUC can take a maximum value of 1, where a higher value means that the model can discriminate between the two classes more effectively regardless of the threshold value. A value of 0.5 means that the model is no better than a random guess.
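In practice, both the ROC curve and the AUC can be computed directly from the labels and predicted probabilities, for example with scikit-learn (variable names below are illustrative):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# 0/1 value-trap labels and the model's predicted probabilities (toy values).
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

auc = roc_auc_score(y_true, y_prob)                 # threshold-independent summary
fpr, tpr, thresholds = roc_curve(y_true, y_prob)    # points on the ROC curve
print(f"AUC = {auc:.2f}")
```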


We show the model performance on the out-of-sample period in the following figures. The ROC curve for the entire out-of-sample period of 2000-2019 is plotted on the left in Figure 3. The AUC for every month in the out-of-sample period is plotted on the right. AUC for the full out-of-sample period is 0.76.

Figure 3: ROC curve for the full out-of-sample period (2000-2019) on the left. AUC is plotted on the right for every month in the out-of-sample period.

For metrics such as accuracy, precision, and recall (defined below), we use a threshold of 0.5, above which companies are classified as value traps. Dashed plots represent the standard value-factor model (naive model), in which companies are ranked on EBIT/EV. Since the value-factor model makes no attempt to classify value traps, we can treat it as a model that classifies every stock as not a value trap. Accuracy for the DNN model is always higher than the naive model across the full period. The average accuracy of the model is 72%.

Figure 4: Accuracy of the model on the out-of-sample period.

In the beginning of this section, we discussed two classification task examples – tumor and email spam classifiers. The nature of such applications dictates what performance metric we should focus on when judging a binary classifier. It could be accuracy, precision, recall, or some combination of these metrics. With our value trap classifier, accuracy is the percentage of correct classifications, precision is the percentage of stocks flagged as value traps that actually are value traps, and recall is the percentage of actual value traps the model is able to identify.
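In terms of true and false positives and negatives (TP, FP, TN, FN), these standard metrics are:

```latex
\text{accuracy}  = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall}    = \frac{TP}{TP + FN}
```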


In the tumor classifier example, it is safer to have more false positives than false negatives as an unchecked tumor would have more serious consequences than a false alarm. Here the recall statistic would be a good metric for measuring the classifier performance as it focuses on false negatives. The opposite is true for the email spam example. It is better to have more false negatives than false positives as wrongly classifying an important email as spam would be more problematic than missing a spam email. Hence, precision would be an appropriate metric for this case.


In our value trap case, both precision and recall are important. A false negative (a component of recall) would mean the model misses a value trap. At the same time, a false positive can remove a high performing stock from the portfolio and lower the returns. Hence, we evaluate both together. They are plotted in the chart below. Since the standard value-factor (naive) model assumes no value traps, its precision and recall are set to 0%.

Figure 5: Out-of-Sample Precision and Recall for 2000-2019 period

Model performance on the out-of-sample period shows the DNN is effective in learning how to identify value traps as demonstrated by the AUC, accuracy, and recall values in the table below. Portfolios constructed using the DNN to remove value traps had a simulated annualized return of 15.9%. From Figure 1, recall that removing all value traps (hypothetically) would give us an improvement of 3.5% over the standard value-factor model (14%). With the DNN model, we get an increase of 2% and capture more than half of the outperformance. Furthermore, the Sharpe ratio jumps from 0.52 to 0.69.

The improvement in the annualized return and the Sharpe ratio for the DNN model over the standard factor model is for the entire out-of-sample period of 2000-2019. However, it is possible that returns for the better performing strategy were higher only in certain periods. This would make the total return of one strategy superior, but if investors were to miss those periods, the outperformance might not be as big as the total return suggests.


To robustly test if our DNN model outperforms the standard model irrespective of the time periods, we construct the following two-sample t-test. We randomly sample 300 varying lengths of time periods between 2000 and 2019. For example, we sample the three-year period starting from June 1, 2005, the eight-year period starting from April 1, 2012, and so on. We then calculate the monthly return and Sharpe ratio for each sample period. We report the t-statistics for the outperformance of the DNN model in the table below.
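A simplified sketch of this period-resampling test is shown below. The exact sampling details (period lengths, pairing of samples) are our assumptions; the function names and monthly return arrays are hypothetical.

```python
import numpy as np
from scipy import stats

def random_period_tstat(dnn_monthly, base_monthly, n_samples=300, seed=0):
    """Two-sample t-test over randomly sampled sub-periods (illustrative only).

    `dnn_monthly` and `base_monthly` are aligned arrays of monthly returns for
    the two strategies over 2000-2019; window lengths are drawn in whole years.
    """
    rng = np.random.default_rng(seed)
    n_months = len(dnn_monthly)
    dnn_means, base_means = [], []
    for _ in range(n_samples):
        length = 12 * int(rng.integers(1, 11))            # 1- to 10-year windows
        start = int(rng.integers(0, n_months - length + 1))
        dnn_means.append(np.mean(dnn_monthly[start:start + length]))
        base_means.append(np.mean(base_monthly[start:start + length]))
    return stats.ttest_ind(dnn_means, base_means)
```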

This test shows that the performance improvement for the DNN model over the standard factor model is significant regardless of what time period is selected. As different investors have different investing periods, such a test offers a more robust way of comparing strategies.


Conclusion


To quantify our hypothesis that removing value traps from a portfolio is beneficial, we showed that, by using a perfect value trap classifier, we improve annualized returns in simulations by about 3.5% over a standard value-factor model. Of course, this is not possible because we cannot create a perfect classifier. However, motivated by this analysis, we attempted to predict the probability that a company is a value trap, as accurately as possible. We used deep learning to train a neural network to classify stocks as either being value traps or not. We then demonstrated that by removing forecasted value traps from a portfolio constructed using the value-factor EBIT/EV, we can meaningfully improve portfolio performance.


In future research we would like to investigate the explainability of the value trap model. That is, what features of historical company data are most influential in the model’s decision to classify a company as a value trap or not a value trap? Also, in this research, we used only highly structured and quantifiable data—e.g., fundamentals, price, industry classification—to learn from. However, there has been much success in the use of deep learning on textual and other unstructured data. To capitalize on these capabilities, in future research we’d like to explore the use of unstructured data, such as earnings transcripts and SEC filings, to improve the model’s ability to classify value traps.



Appendix A: Simulation Methodology and Assumptions


In these simulations, NYSE, AMEX, and NASDAQ companies were ranked according to the stated criteria. Non-US-based companies, companies in the financial sector, and companies whose market capitalization, when adjusted by the S&P 500 Index price to January 2010, is less than $100M were excluded from the ranking. The simulation results reflect assets under management (AUM) at the start of each month that, when adjusted by the S&P 500 Index price to January 2010, equal $100M.


We construct portfolios by ranking all stocks according to the factor of interest for the model being simulated, investing equal amounts of capital into the top 50 stocks, and rebalancing in this way annually. We limit the number of shares of a security bought or sold in a month to no more than 10% of the monthly volume for a security. Simulated prices for stock purchases and sales are based on the volume-weighted daily closing price of the security during the first 10 trading days of each month. If a stock paid a dividend during the period it was held, the dividend was credited to the simulated fund in proportion to the shares held. Transaction costs are factored in as $0.01 per share, plus an additional slippage factor that increases as the square of the simulation’s volume participation in a security. Specifically, if participating at the maximum 10% of monthly volume, the simulation buys at 1% more than the average market price and sells at 1% less than the average market price. This form of slippage is common in portfolio simulations as a way of modeling the fact that as an investor’s volume participation in a stock increases, it has a negative impact on the price of the stock for the investor.
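One functional form consistent with the description above (our reading of it, not a formula stated in the original) is a quadratic slippage penalty that reaches 1% at the 10% participation cap:

```latex
\text{slippage} = 1\% \times \left(\frac{\text{participation}}{10\%}\right)^{2},
\qquad 0 \le \text{participation} \le 10\%
```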

Due to how a portfolio is initially constructed and the timing of cash flows, two portfolio managers can get different investment results over the same period using the same quantitative model.


To account for this variation, we run 300 portfolio simulations for each model where each portfolio is initialized from a randomly chosen starting state. The portfolio statistics presented in this article, such as compound annual return and Sharpe ratio, are the mean of statistics generated by the 300 simulations.


Appendix B: The Training and Simulation Data


In this research, Standard & Poor's COMPUSTAT database is used as the source for all information about companies. The data spans the period 1970 to 2019. Since reported information arrives intermittently throughout a financial period, we discretize the raw data into monthly time steps. We are interested in long-term predictions, so to smooth out seasonality, we feed a time series of inputs with a one-year gap between time steps. In an earlier post, we tried to predict company earnings one year into the future from the last time step. Here, however, we use the same five-year input time series of fundamental features, but forecast whether a company is going to be a value trap or not.


We use three classes of time series data: fundamental features, momentum features, and auxiliary features. Fundamental features include the data from a company’s financial statements. Momentum features include one-, three-, six-, and nine-month relative momentum (change in price) for all stocks in the universe at the given time step. Auxiliary features include any additional information we thought was important, such as short interest, industry sector, company size classification, etc. In total, we used about 80 input features.


Our expectation in training a deep learning model was that, by providing a large number of examples of time series sequences spanning thousands of companies, the model could learn patterns that help identify value traps. However, there can be wide differences in the absolute value of these fundamental features when compared between companies and across time. For example, Walmart’s annual revenue for 2020 was $559 billion USD, while GoPro had a revenue of $892 million USD for the same period. Intuitively, these statistics are more meaningful when scaled by company size.

Hence, we scaled all fundamental features in each time series by the market capitalization in the last time step of the series. We also scaled all time steps by the same value so that the deep learning model could assess the relative change in fundamental values between time steps. While other notions of size have been used—such as EV and book value—we chose to avoid these measures because they can (although rarely) take negative values. We then further scaled the features so that each had zero mean and unit standard deviations.
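A minimal sketch of this two-step scaling is shown below. The array shapes and the choice to fit the z-score statistics on in-sample data only are our assumptions about the exact procedure.

```python
import numpy as np

def scale_by_market_cap(series: np.ndarray, market_cap_last: float) -> np.ndarray:
    """Scale a (timesteps, features) block of fundamentals by final market cap.

    Every time step is divided by the same value, so the relative change in
    fundamentals between time steps is preserved.
    """
    return series / market_cap_last

def zscore(train_block: np.ndarray, block: np.ndarray) -> np.ndarray:
    """Standardize features to zero mean and unit standard deviation.

    `train_block` has shape (n_samples, timesteps, features); statistics are
    fit per feature on the training data only (an assumption) and applied to `block`.
    """
    mu = train_block.mean(axis=(0, 1))
    sigma = train_block.std(axis=(0, 1)) + 1e-8
    return (block - mu) / sigma
```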


Footnotes

[1] Eugene Fama and Kenneth French’s paper, The Cross-Section of Expected Stock Returns, was an early piece that gave wide recognition to the value phenomenon. In their paper, What Has Worked in Investing, the investment firm Tweedy, Browne has collected many studies that detail the historical record of value investing’s outperformance.


Many early academic papers use book value as a measure of a company’s fundamental value, which has become somewhat controversial in recent years as modern companies rely more and more on intangible assets (e.g., intellectual property) as sources of value. So, more recent research has included various forms of earnings and cash flow as measures of value. A review of such research can be found here. There are also several books that explore the value phenomenon, including Joel Greenblatt’s The Little Book that Beats the Market, Wesley Gray and Tobias Carlisle’s Quantitative Value, and Carlisle’s The Acquirer’s Multiple.


Many studies have shown that the value phenomenon reveals itself globally. Two pieces that document this can be found here and here.


[2] High-quality, detailed fundamental data for a broad set of companies does not exist much further back than the 1950s and '60s. However, high-level data, such as a company's book value, can be obtained for most publicly traded companies all the way back to the 1920s. Kenneth French’s data library documents the value premium, when measured on book-to-market value, starting in 1927. The following chart depicts the 5-year rolling outperformance of value stocks vs growth stocks between 1931 and 2020. Here value is defined as the least expensive 20% of the market when measured on book-to-market, and growth is defined as the most expensive 20% on the same metric. As the chart shows, value outperformed growth more often than not, but it did not always outperform.



[3] EBIT is the acronym for earnings before interest and taxes. We will interchangeably use “operating income”, “earnings”, and EBIT throughout this article.


[4] Enterprise Value is the amount it would cost, based on market prices, to buy the entire business and pay off all liabilities.


[5] You can find a data library of common factors used in academic research at Kenneth French’s website here: https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
