Stock Market Winners: A first look at 12 forests (011.0)

This post looks at initial (spoiler alert: promising) results from 12 forests. Each forest was created from a single month of data. These first 12 months were used mainly to check that the program runs properly.

The evaluation uses 10-fold cross-validation (CV, explained below) with a custom function to measure performance. We average the observed returns of the 50 companies predicted to have the highest returns and subtract the average return of the entire universe. This simulates owning the 50 stocks the model identifies as best and comparing them to an equal-weighted universe. We call this “LongEx”, for long excess return.

The table below shows the 12 LongEx values and the standard deviation of each, based on the 10-fold CV results. The LongEx values range from just over 2% (that’s for a single month) to over 8%. The standard deviations are mostly between 1% and 2%. The mtry column is a tuning parameter explained in the notes section.

To be clear, each LongEx value is the average of the 10 results from the CV analysis, and the SD is the standard deviation of those 10. Also, we are averaging the actual (observed) returns, not the predicted returns.
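The LongEx calculation described above can be sketched as follows. This is a minimal illustration in Python rather than the code behind the post; the function names are hypothetical, and the use of the sample standard deviation across folds is an assumption.

```python
from statistics import mean, stdev


def long_ex(predicted, observed, n_top=50):
    """Mean observed return of the n_top highest-predicted stocks
    minus the mean observed return of the equal-weighted universe."""
    # Rank stock positions by predicted return, best first.
    ranked = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    top = ranked[:n_top]
    # Average the observed (not predicted) returns of the selected stocks.
    return mean(observed[i] for i in top) - mean(observed)


def summarize_folds(fold_values):
    """Average and sample standard deviation of the per-fold values
    (assumption: sample SD, i.e. divisor n - 1)."""
    return mean(fold_values), stdev(fold_values)


# Toy data: four stocks, pick the top two by predicted return.
predicted = [0.3, 0.1, 0.2, 0.0]
observed = [4.0, -2.0, 2.0, -4.0]
fold_value = long_ex(predicted, observed, n_top=2)
```

In the post's setup, `long_ex` would be evaluated once per held-out fold with `n_top=50`, and `summarize_folds` would then be applied to the 10 resulting values.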

##          mtry LongEx LongExSD
## 20030103   72   3.38     1.90
## 20030131 2583   3.83     0.83
## 20030228 2543   3.07     1.23
## 20030404 2598   6.63     1.73
## 20030502 2690   8.22     1.26
## 20030530   74   5.17     1.52
## 20030703 2918   5.93     1.24
## 20030801 2983   5.00     1.13
## 20030829   77   3.77     1.16
## 20031003 3074   6.38     1.79
## 20031031 3122   2.17     1.95
## 20031128   79   6.65     1.25
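Each row of this table is the best of the three mtry values tried for that month, as reported in the detailed output further down. As a rough sketch (Python for illustration only; `select_best` is a hypothetical helper), the first row can be reproduced from the 20030103 tuning results like this:

```python
def select_best(tuning_rows, metric="LongEx"):
    """Pick the tuning row with the largest value of `metric` and
    return (mtry, LongEx, LongExSD) rounded to two decimals."""
    best = max(tuning_rows, key=lambda row: row[metric])
    return best["mtry"], round(best["LongEx"], 2), round(best["LongExSD"], 2)


# Values copied from the "Random Forest for 20030103" output in this post.
rows_20030103 = [
    {"mtry": 2, "LongEx": 1.645823, "LongExSD": 1.105092},
    {"mtry": 72, "LongEx": 3.383879, "LongExSD": 1.899364},
    {"mtry": 2647, "LongEx": 3.000747, "LongExSD": 1.644549},
]

summary_row = select_best(rows_20030103)  # mtry 72, LongEx 3.38, SD 1.90
```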

This looks promising. However, I expect these results are overly optimistic. In any month, a model identifies characteristics (predictors) that worked well that month. These should also work on the fold that was left out, producing good results, and those results are likely to be better than what one would get when trying to predict the performance of stocks in future months. On the other hand, a real-life implementation will have one advantage (my gut says not enough to offset the negative). To produce these results, we have effectively picked 500 stocks each month, because we picked 50 stocks from each of the ten folds. In reality we would pick only 50 stocks from the entire universe. Picking 50 from a universe of 2000 offers more opportunity (and more risk) than picking 500 by selecting 50 from each of ten groups of 200.
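To make the difference concrete, here is a small arithmetic sketch (Python, illustrative only; `picks_and_rate` is a hypothetical helper) using the round numbers above. In the CV evaluation the model keeps the top quarter of each held-out group, whereas a live portfolio would keep only the top 2.5% of the universe.

```python
def picks_and_rate(universe_size, n_pick=50, n_groups=1):
    """Total number of stocks picked, and the fraction of the universe
    selected, when n_pick stocks are taken from each of n_groups groups."""
    total_picks = n_pick * n_groups
    return total_picks, total_picks / universe_size


# CV evaluation: 50 picks from each of ten groups of about 200 stocks.
cv_picks, cv_rate = picks_and_rate(2000, n_pick=50, n_groups=10)

# Live portfolio: 50 picks from the whole universe of about 2000 stocks.
live_picks, live_rate = picks_and_rate(2000, n_pick=50, n_groups=1)
```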

I’ll explore true out-of-sample results in a future post.

More details for each forest are shown below. In addition to LongEx, we calculate ShortEx (printed as ShortEX in the output), which is the average return of the universe minus the average return of the 50 stocks with the lowest predicted returns. Hedge is the average observed return of the 50 stocks with the highest predicted returns less the average observed return of the bottom 50, representing a long/short strategy. By construction, Hedge equals LongEx plus ShortEx.
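The three measures can be sketched together as below. This is an illustrative Python version under the definitions just given, not the custom function used to produce the output; `excess_returns` is a hypothetical name.

```python
from statistics import mean


def excess_returns(predicted, observed, n=50):
    """LongEx, ShortEx and Hedge from predicted and observed returns."""
    # Rank stock positions by predicted return, best first.
    ranked = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    top_mean = mean(observed[i] for i in ranked[:n])      # n highest predictions
    bottom_mean = mean(observed[i] for i in ranked[-n:])  # n lowest predictions
    universe_mean = mean(observed)
    return {
        "LongEx": top_mean - universe_mean,
        "ShortEx": universe_mean - bottom_mean,
        "Hedge": top_mean - bottom_mean,
    }


# Toy data: six stocks, two on each side.
predicted = [0.5, 0.4, 0.1, -0.2, -0.3, -0.6]
observed = [3.0, 1.0, 0.5, -0.5, -1.0, -3.0]
result = excess_returns(predicted, observed, n=2)
```

Because the universe average cancels, Hedge is always LongEx plus ShortEx. The output below is consistent with this: for 20030103 with mtry = 72, 3.383879 + 2.634079 = 6.017958, which matches the Hedge column.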

## [1] "Random Forest for 20030103"
## Random Forest 
## 
## 1841 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1657, 1657, 1657, 1657, 1657, 1656, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  1.645823  1.536707  3.182530  1.105092   1.094489    1.920719
##     72  3.383879  2.634079  6.017958  1.899364   1.192036    2.639495
##   2647  3.000747  3.000757  6.001504  1.644549   1.079754    2.359390
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 72. 
## [1] ""
## [1] "Random Forest for 20030131"
## Random Forest 
## 
## 1778 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1600, 1601, 1601, 1600, 1600, 1600, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  2.251012  2.069897  4.320909  0.7438784  1.4527422   1.926697
##     71  3.278089  3.313642  6.591731  1.0148935  0.9846684   1.524026
##   2583  3.832299  3.118804  6.951102  0.8342543  1.2259727   1.968768
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2583. 
## [1] ""
## [1] "Random Forest for 20030228"
## Random Forest 
## 
## 1737 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1564, 1564, 1565, 1562, 1564, 1563, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  2.278811  2.107003  4.385813  0.8000779  1.278999    1.741812
##     71  2.762551  2.622176  5.384727  1.0287755  1.628047    2.593365
##   2543  3.066078  2.566322  5.632400  1.2270250  1.095004    2.125386
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2543. 
## [1] ""
## [1] "Random Forest for 20030404"
## Random Forest 
## 
## 1793 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1613, 1614, 1614, 1614, 1614, 1614, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge      LongEx SD  ShortEX SD  Hedge SD
##      2  4.691011  4.315543   9.006554  1.232598   0.9963404   1.804397
##     72  6.574063  5.327072  11.901135  1.308748   0.7783539   1.815156
##   2598  6.631014  5.793130  12.424144  1.727432   1.1157856   2.352689
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2598. 
## [1] ""
## [1] "Random Forest for 20030502"
## Random Forest 
## 
## 1884 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1695, 1696, 1696, 1696, 1696, 1696, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  7.547904  4.709156  12.25706  1.423476   0.6656333   1.891159
##     73  8.012248  6.315012  14.32726  1.289156   0.9187280   1.994717
##   2690  8.216312  6.251946  14.46826  1.263155   0.7277428   1.897688
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2690. 
## [1] ""
## [1] "Random Forest for 20030530"
## Random Forest 
## 
## 1994 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1795, 1795, 1794, 1794, 1796, 1794, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  4.233718  2.271090  6.504807  0.9472787  0.9201049   1.576348
##     74  5.168139  3.385412  8.553551  1.5206274  1.0122496   2.093077
##   2800  4.891951  3.635727  8.527677  1.1751994  0.6730027   1.736448
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 74. 
## [1] ""
## [1] "Random Forest for 20030703"
## Random Forest 
## 
## 2112 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1900, 1901, 1901, 1900, 1901, 1901, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge      LongEx SD  ShortEX SD  Hedge SD
##      2  5.029364  3.578111   8.607475  1.342638   0.6247807   1.554427
##     76  5.545285  5.744639  11.289924  1.476264   1.1153937   2.399354
##   2918  5.931024  6.174434  12.105458  1.237745   1.1080703   1.625167
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2918. 
## [1] ""
## [1] "Random Forest for 20030801"
## Random Forest 
## 
## 2178 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1959, 1962, 1959, 1961, 1961, 1959, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  3.679001  3.617475  7.296476  2.217545   1.0722525   2.831624
##     77  4.755090  4.025980  8.781070  1.261864   0.7648514   1.604275
##   2983  4.999232  4.470548  9.469780  1.130875   0.9915790   1.806839
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2983. 
## [1] ""
## [1] "Random Forest for 20030829"
## Random Forest 
## 
## 2224 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2000, 2000, 2003, 2003, 2001, 2002, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  2.827245  3.218264  6.045510  1.284799   1.280548    2.410151
##     77  3.765488  3.690990  7.456478  1.163804   1.242862    2.094116
##   3030  3.362627  3.700911  7.063538  1.587360   1.233176    2.442162
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 77. 
## [1] ""
## [1] "Random Forest for 20031003"
## Random Forest 
## 
## 2268 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2042, 2042, 2041, 2041, 2041, 2041, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge      LongEx SD  ShortEX SD  Hedge SD
##      2  3.823929  3.591133   7.415062  1.595134   1.403108    2.706097
##     78  6.012630  5.384848  11.397478  1.780591   1.521213    2.774167
##   3074  6.376218  5.523964  11.900182  1.788174   1.958884    3.210630
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 3074. 
## [1] ""
## [1] "Random Forest for 20031031"
## Random Forest 
## 
## 2317 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2085, 2085, 2085, 2085, 2086, 2086, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx     ShortEX    Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  0.7249286  0.9280415  1.652970  0.8189695  1.432565    1.826758
##     79  2.1073070  2.2033438  4.310651  1.6530306  1.587118    3.005683
##   3122  2.1704221  2.2015307  4.371953  1.9472564  1.269238    2.975489
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 3122. 
## [1] ""
## [1] "Random Forest for 20031128"
## Random Forest 
## 
## 2377 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2140, 2139, 2138, 2139, 2138, 2139, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  5.218160  5.167228  10.38539  1.051254   0.8127089   1.163762
##     79  6.651044  6.186669  12.83771  1.253466   0.9298146   1.606550
##   3183  6.398450  6.323075  12.72152  1.114751   1.3537946   1.879923
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 79. 
## [1] ""

Notes on the data: The response (Y) is the next one-month return, adjusted to have an average value of zero. At this point, the companies have been filtered to those with prices of at least $5 and average daily trading volume (in dollars, not shares) of at least $2 million.
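A minimal sketch of this preparation step follows (Python, illustrative only). The field names `price`, `dollar_volume` and `next_return` are hypothetical, and centering the response after filtering, rather than before, is an assumption the post does not confirm.

```python
from statistics import mean


def prepare_universe(companies, min_price=5.0, min_dollar_volume=2_000_000):
    """Filter on price and dollar volume, then center the response."""
    kept = [
        c for c in companies
        if c["price"] >= min_price and c["dollar_volume"] >= min_dollar_volume
    ]
    # Assumption: the mean is taken over the filtered universe.
    avg_return = mean(c["next_return"] for c in kept)
    # Add the centered response "y" to a copy of each record.
    return [{**c, "y": c["next_return"] - avg_return} for c in kept]


companies = [
    {"ticker": "A", "price": 10.0, "dollar_volume": 5_000_000, "next_return": 4.0},
    {"ticker": "B", "price": 3.0, "dollar_volume": 9_000_000, "next_return": 10.0},
    {"ticker": "C", "price": 20.0, "dollar_volume": 1_000_000, "next_return": -5.0},
    {"ticker": "D", "price": 8.0, "dollar_volume": 2_000_000, "next_return": -2.0},
]
universe = prepare_universe(companies)  # keeps A and D only
```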

Description of 10-fold cross-validation (CV): As the output shows, in the first month there were 1841 samples (companies) and 692 predictors. The data are split into 10 folds of about 184 observations each. Nine folds (about 1657 observations, or 90% of 1841) are used to build a model, which is tested on the tenth fold (the remaining 10% not used to build the model). This is repeated so that each of the 10 folds is left out exactly once. CV serves two purposes. It gives an indication of how well our model might do out of sample, because it holds out some data. It also allows us to tune parameters for a model. In the case of a random forest, we tune the number of predictors randomly sampled as candidates at each split (mtry). We run several different values for this parameter and use CV to pick the best.
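The splitting scheme can be sketched as follows (Python, illustrative only; `k_fold_splits` is a hypothetical helper). It uses a plain random shuffle, which is an assumption: the software that produced the output may assign folds differently, so the exact fold sizes can vary slightly.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Return k (train, test) pairs of index lists; each sample is
    held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        splits.append((train, test))
    return splits


splits = k_fold_splits(1841, k=10)
train_sizes = sorted(len(train) for train, _ in splits)
test_sizes = sorted(len(test) for _, test in splits)
```

With 1841 samples, nine folds hold 184 observations and one holds 185, so the training sets have 1657 or 1656 observations, in line with the "Summary of sample sizes" line in the first month's output.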

In these results the CV tries three values of mtry. The lowest value is consistently inferior. The other two are close: their averages are within one standard deviation of each other. Given this consistency, we can probably skip the tuning in future models, which would speed up the process. It may also seem odd that mtry can exceed the number of predictors. This is because some of the predictors are factors. For example, the sector can take on 10 values, so the sector variable is converted into 9 dummy variables. The model therefore actually has more than 692 predictor columns.
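The dummy-variable expansion can be sketched like this (Python, illustrative only). The sector names, the choice of the first level as the baseline, and the split of 691 numeric predictors plus one factor are all hypothetical; the real data evidently contain more factors, since the largest mtry tried in the first month (2647) is presumably the expanded column count.

```python
def dummy_encode(value, levels):
    """Encode one factor value as len(levels) - 1 indicator columns,
    treating the first level as the baseline (all zeros)."""
    return [1 if value == level else 0 for level in levels[1:]]


def expanded_column_count(n_numeric, factor_level_counts):
    """Columns after expansion: each factor with L levels adds L - 1."""
    return n_numeric + sum(levels - 1 for levels in factor_level_counts)


# Hypothetical sector factor with 10 levels.
sectors = [f"sector_{i}" for i in range(10)]
encoded = dummy_encode("sector_3", sectors)   # 9 columns, a single 1
baseline = dummy_encode("sector_0", sectors)  # 9 columns, all zeros

# Hypothetical case: 691 numeric predictors plus the sector factor.
n_columns = expanded_column_count(691, [10])
```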

Wednesday, December 30, 2015
