Thursday, December 31, 2015

Random Forest Results (012.0)

In this post, we look at out-of-sample performance for some random forests. The purpose here is not an exhaustive analysis. We are only hoping to get an indication that our efforts might bear fruit.
Each random forest model we build uses predictor (x) data from one month (e.g., Dec 2002) and response (y) data from the subsequent month (e.g., Jan 2003). We refer to a model here by xmonth/ymonth (e.g., Dec02/Jan03) to indicate the months of the x and y data. The y variable is a company’s excess return, defined as the return of the company for the month less the average return of all companies for that month.
The first forest is Dec02/Jan03. The first set of predictions is for Jan 2004. It was created by feeding the Dec03 x data into each of the 12 models from Dec02/Jan03 through Nov03/Dec03 and averaging the 12 predicted returns for each company. The last prediction is for Oct 2015. We have 142 months of predicted returns, a bit shy of 12 years.
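As a rough illustration of the averaging step, here is a minimal Python sketch (not the code behind this post). The function name, the tickers and the numbers are all hypothetical, and it uses three models where the post uses 12.

```python
from statistics import mean

def average_predictions(per_model_predictions):
    """Average each company's predicted return across all models that scored it."""
    by_company = {}
    for predictions in per_model_predictions:
        for company, predicted in predictions.items():
            by_company.setdefault(company, []).append(predicted)
    return {company: mean(values) for company, values in by_company.items()}

# Hypothetical output of three models scoring the same month of x data.
model_outputs = [
    {"AAA": 1.0, "BBB": -2.0},
    {"AAA": 3.0, "BBB": 0.0},
    {"AAA": 2.0, "BBB": -1.0},
]
combined = average_predictions(model_outputs)
```

Each company ends up with a single predicted return, which is what gets ranked to pick the top and bottom 50.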
For each month with a predicted return, we calculate three values: LongEx, ShortEx and Hedge. Respectively, these roughly represent buying the 50 stocks with the top predicted returns, shorting the bottom 50, and doing both, which can be thought of as long, short and hedged portfolios. However, since our y variable is an excess return over the average stock, the value for LongEx is the return of the top 50 over the average. The value for ShortEx is the amount by which the bottom 50 fell short of the average. Hedge represents the long minus the short, which works out to LongEx plus ShortEx. In all cases, positive values are desirable.
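The three values can be sketched in Python as follows. This is an assumption-laden illustration, not the code used for the post: the function name and the tiny four-stock universe are made up, and n is set to 1 in the example where the post uses 50.

```python
from statistics import mean

def portfolio_values(predicted, actual_excess, n=50):
    """Return (LongEx, ShortEx, Hedge) for one month.

    predicted and actual_excess map company -> return in percent.
    """
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    top, bottom = ranked[:n], ranked[-n:]
    long_ex = mean(actual_excess[c] for c in top)
    # A short gains when the bottom names fall below the average, so flip the sign.
    short_ex = -mean(actual_excess[c] for c in bottom)
    return long_ex, short_ex, long_ex + short_ex

# Hypothetical four-stock universe, with n=1 to keep the example small.
predicted = {"AAA": 2.0, "BBB": 0.5, "CCC": -0.5, "DDD": -3.0}
actual = {"AAA": 1.5, "BBB": 0.5, "CCC": 0.0, "DDD": -2.0}
long_ex, short_ex, hedge = portfolio_values(predicted, actual, n=1)
```

Note that the ranking uses the predicted returns, while the averages use the observed excess returns.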
Below is a summary of the 142 observations for each.
##      LongEx            ShortEX            Hedge        
##  Min.   :-22.5162   Min.   :-24.762   Min.   :-40.723  
##  1st Qu.: -2.2450   1st Qu.: -1.205   1st Qu.: -2.135  
##  Median :  0.6743   Median :  1.375   Median :  2.676  
##  Mean   :  0.5615   Mean   :  1.750   Mean   :  2.311  
##  3rd Qu.:  3.7627   3rd Qu.:  5.406   3rd Qu.:  7.610  
##  Max.   : 20.7098   Max.   : 26.804   Max.   : 43.397
The means and medians are all positive, but there is substantial variation. Below we check whether the means are significantly different from zero. The standard error is the standard deviation divided by the square root of 142. That is the standard deviation of the sample mean, used to determine the significance of a sample mean. The z score is the mean divided by the standard error.
##           LongEx  ShortEX      Hedge
## Mean   0.5615240 1.749783  2.3113067
## SD     5.8138698 7.068079 10.9250335
## StdErr 0.4878891 0.593140  0.9168084
## z      1.1509254 2.950033  2.5210355
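The arithmetic behind the table can be checked with a short Python sketch. The function name is hypothetical and this is not the code that produced the table. The check at the bottom plugs in the LongEx mean and SD reported above.

```python
from math import sqrt
from statistics import mean, stdev

def mean_significance(values):
    """Return (mean, sd, standard error, z) for a list of monthly values."""
    m = mean(values)
    sd = stdev(values)            # sample standard deviation (n - 1 denominator)
    se = sd / sqrt(len(values))   # standard deviation of the sample mean
    return m, sd, se, m / se

# Check the LongEx column using the mean and SD from the table and n = 142.
se_long = 5.8138698 / sqrt(142)
z_long = 0.5615240 / se_long
```

With these inputs, se_long and z_long agree with the StdErr and z rows for LongEx to the digits shown.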
It appears that the model is better at finding short candidates than long. For fun, the average annual return for the hedged portfolio (without transaction costs or taxes) would have been 22.42%, which sounds nice, but it had an annualized standard deviation of 37.85%, which is much higher than that of a stock index.
The data follow in graphical and tabular form.
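The post does not say how the annual figures were derived. One common convention, assumed here, scales the monthly standard deviation by the square root of 12, and that reproduces the 37.85 figure from the monthly Hedge SD in the table. The function name in this Python sketch is hypothetical.

```python
from math import sqrt

def annualize_sd(monthly_sd, periods_per_year=12):
    """Scale a monthly standard deviation to an annual one.

    Assumes monthly values are independent, which is a simplification.
    """
    return monthly_sd * sqrt(periods_per_year)

hedge_annual_sd = annualize_sd(10.9250335)  # monthly Hedge SD from the table
```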
##                LongEx      ShortEX       Hedge           R2      RMSE
## 20040102   5.40387192   1.49061552   6.8944874 1.834194e-02 11.543642
## 20040130  -2.90584266  -1.20667423  -4.1125169 1.569290e-02  9.995714
## 20040227   0.13278078   0.98072400   1.1135048 7.969597e-05 10.180885
## 20040402  -9.13640289  -5.63940335 -14.7758062 6.364674e-02 12.002930
## 20040430  -2.62634212  -4.01278530  -6.6391274 9.988053e-03 11.767189
## 20040604   5.90600405   0.43858870   6.3445928 1.747279e-02  9.522831
## 20040702   6.20691840   6.41741864  12.6243370 5.745495e-03 10.558267
## 20040730  -2.46220377   5.16432893   2.7021252 9.654185e-04 10.214583
## 20040903   8.95935876  -3.82297670   5.1363821 1.086204e-05  9.665120
## 20040930  -2.25621028  -4.98650710  -7.2427174 1.922212e-02 10.730126
## 20041029  -4.71568260  -4.40054214  -9.1162247 1.233034e-01  4.586698
## 20041203  -5.28801760   0.52139770  -4.7666199 4.370066e-04 10.039059
## 20041231   2.99957129   8.19629732  11.1958686 5.723372e-02  9.451499
## 20050131   5.52407287   5.46746389  10.9915368 3.020293e-02  9.907540
## 20050228  -0.30067131   3.79045522   3.4897839 1.649513e-03  9.499445
## 20050331  -1.67862247   2.83974979   1.1611273 1.628908e-02  9.529142
## 20050429   0.99275765  -6.97082462  -5.9780670 2.219801e-02 11.282139
## 20050531   2.72670839  -2.16575073   0.5609577 1.069156e-03  9.180009
## 20050630   1.84195759  -0.69437808   1.1475795 3.426903e-03  9.782919
## 20050729   1.79737890  -2.47887147  -0.6814926 6.166191e-03  9.857890
## 20050831   6.17286384   2.24980071   8.4226646 2.554721e-02 10.014044
## 20050930  -5.98986592   4.41408379  -1.5757821 1.051827e-02  9.738044
## 20051031   2.85395424  -0.08926807   2.7646862 3.864877e-04  9.538442
## 20051130   2.02526217   3.97892719   6.0041894 6.130005e-03  8.198109
## 20051230  10.57817177  -1.16175125   9.4164205 3.635384e-02 11.298988
## 20060131 -12.31409780   1.11100556 -11.2030922 3.765661e-02  9.937796
## 20060228  -1.23882121   0.02364098  -1.2151802 5.932061e-03  9.790312
## 20060331   2.33144960   3.83521890   6.1666685 1.126032e-02  9.610501
## 20060428  -2.16512505  -0.25432596  -2.4194510 2.215926e-02  9.868287
## 20060531   0.57065155   3.44502314   4.0156747 5.118772e-03  8.882741
## 20060630  -0.73614571   6.70541092   5.9692652 1.451162e-03  9.924111
## 20060731  -4.82746900   1.06078641  -3.7666826 1.820947e-02 10.411808
## 20060831  -1.53685366   2.36372882   0.8268752 4.551253e-08  8.550465
## 20060929   1.34401623  -5.22733741  -3.8833212 4.963447e-03 10.091191
## 20061031   0.33179334  -1.79494054  -1.4631472 2.071015e-03  9.493776
## 20061130   3.23719076   7.11509833  10.3522891 1.899938e-02  8.077665
## 20061229   0.62651354   5.53317368   6.1596872 1.917711e-02  8.510653
## 20070131   1.79883810   1.91282218   3.7116603 8.551330e-05  8.662607
## 20070228   2.36532269   8.12193065  10.4872533 1.870687e-02  8.724430
## 20070330   0.66013174  -1.44985453  -0.7897228 4.948791e-05  8.340518
## 20070430   4.20283840   5.72909214   9.9319305 5.325003e-03  9.935763
## 20070531   2.55541558   4.78139656   7.3368121 9.077696e-03  8.271241
## 20070629   4.45754882   8.88819376  13.3457426 4.694815e-02 10.177925
## 20070731   0.35362342   4.90751993   5.2611434 3.905772e-04 10.864827
## 20070831   9.35609986   5.22171948  14.5778193 4.011274e-02  9.938294
## 20070928  13.02347206   9.71712419  22.7405962 7.581077e-02 12.143863
## 20071031  -3.29591884   6.85961644   3.5636976 1.028862e-02 11.033317
## 20071130   4.27899793   0.50451265   4.7835106 1.897805e-02 11.390148
## 20071231  -6.81048107 -16.58904093 -23.3995220 1.019013e-01 13.510965
## 20080131   9.62473902   6.67259437  16.2973334 6.698388e-02 12.362566
## 20080229  -1.30283115  10.77690007   9.4740689 1.629201e-03 11.750796
## 20080331   5.23295558   5.13420383  10.3671594 6.292338e-03 12.836256
## 20080430   9.51447558  15.06175678  24.5762324 4.709161e-02 11.662685
## 20080530  20.70976327  22.68674048  43.3965037 2.454185e-01 12.378729
## 20080630 -22.51623306  -9.16655632 -31.6827894 9.108577e-02 15.812902
## 20080731  -6.71746970  -6.45012430 -13.1675940 4.224404e-02 12.228179
## 20080829 -12.11853852  -6.15555256 -18.2740911 7.854400e-03 15.045762
## 20080930   7.70343500   2.98878497  10.6922200 6.351603e-02 16.550795
## 20081031   7.77065580  16.33398338  24.1046392 6.723788e-02 16.789974
## 20081128  -4.74044557  -8.85837340 -13.5988190 2.856071e-02 16.384498
## 20081231  -6.38976733   0.76137445  -5.6283929 1.109729e-03 15.911839
## 20090130   2.93786172  14.11290523  17.0507670 6.924414e-02 13.395856
## 20090227   0.28782440   1.44022007   1.7280445 8.413000e-03 13.855778
## 20090331 -19.96418207 -20.75861716 -40.7227992 1.752411e-01 21.165795
## 20090430  -1.22754181 -12.17448192 -13.4020237 3.708760e-02 15.749353
## 20090529  -5.97528124   4.49151171  -1.4837695 1.406668e-02 11.896696
## 20090630   0.99788090  -0.57775222   0.4201287 3.801494e-06 12.187126
## 20090731   7.49204705  -4.62934670   2.8627004 1.542391e-02 12.130585
## 20090831   6.02854870  -0.06153462   5.9670141 1.724183e-02 10.617830
## 20090930  -5.13539674   1.13381662  -4.0015801 1.873951e-02 10.037458
## 20091030   0.57892472  -0.02746684   0.5514579 6.325462e-05 10.795407
## 20091130   5.78601764   1.30962862   7.0956463 1.452356e-02  9.556327
## 20091231  -4.89740870  -5.58256908 -10.4799778 6.613241e-03  9.547761
## 20100129   0.68854332   4.15288100   4.8414243 5.792483e-03  9.218921
## 20100226   3.56453273   3.48810664   7.0526394 2.171415e-02 10.425906
## 20100331   3.76850032  -1.11862379   2.6498765 1.043408e-02  9.955274
## 20100430  -3.57297455   1.03427232  -2.5387022 1.200480e-02  8.437923
## 20100528  -9.19090511   1.02365515  -8.1672500 6.393829e-02  9.551787
## 20100630   3.30652374  -3.85553371  -0.5490100 5.207713e-04  9.849794
## 20100730  -2.30857883   0.16902594  -2.1395529 7.085925e-04 10.558207
## 20100831   0.22770813  -3.21858771  -2.9908796 6.146996e-03  9.606660
## 20100930   1.56357640   0.00938446   1.5729609 1.164398e-03  8.942725
## 20101029   4.96682807   2.13126222   7.0980903 1.386240e-02  9.579759
## 20101130   0.69697360  -1.92746175  -1.2304882 1.213165e-05  9.631950
## 20101231   0.41944075   1.76707161   2.1865124 1.905663e-04  9.111841
## 20110131   4.07249029   3.43658299   7.5090733 2.288144e-02  9.329095
## 20110228   1.57398135  10.42736372  12.0013451 1.629467e-02  9.152722
## 20110331  -3.90680955   2.42981982  -1.4769897 2.815608e-04  9.219599
## 20110429  -2.21139055   2.09828754  -0.1131030 8.730911e-04  8.526235
## 20110531   1.78352772  -0.71897356   1.0645542 2.200055e-03  7.684808
## 20110630   2.37557885  -1.57026100   0.8053178 8.728497e-07  9.042902
## 20110729  -1.78568333   9.32320511   7.5375218 1.033621e-05 10.826720
## 20110831   0.57745107  13.62326644  14.2007175 1.329892e-02 10.497113
## 20110930  -8.08491174 -12.58544459 -20.6703563 1.171755e-01 12.754920
## 20111031   0.73842683  11.92893829  12.6673651 4.522994e-02 10.313859
## 20111130   0.62039481   6.05542499   6.6758198 9.910000e-03  8.385709
## 20111230  -3.12637477 -10.77157383 -13.8979486 1.193010e-01 11.509917
## 20120131   3.83209651  -0.72662539   3.1054711 8.697444e-04  9.006328
## 20120229   2.78250343   4.20896233   6.9914658 1.259814e-02  8.232192
## 20120330   1.29603066   1.93316943   3.2292001 1.430192e-02  8.517169
## 20120430   0.70772341   7.06908371   7.7768071 6.766543e-02  9.671125
## 20120531   0.02516932   0.77521587   0.8003852 1.987880e-03  8.505345
## 20120629  -1.60434325   4.40654669   2.8022034 5.495914e-03 10.136726
## 20120731  -2.11993454  -0.92663903  -3.0465736 1.074480e-02  9.898911
## 20120831   0.52310340  -5.08074991  -4.5576465 4.650886e-03  7.295555
## 20120928   0.05659778   0.51117595   0.5677737 4.214609e-04  9.087631
## 20121031  -0.23779127   0.37185423   0.1340630 2.713459e-03  8.377153
## 20121130   2.00536228  -1.57489129   0.4304710 4.888700e-04  6.963249
## 20121231  -1.95429132   2.19968932   0.2453980 3.341025e-03  8.083581
## 20130131  -0.29090963   9.14237681   8.8514672 2.022032e-02  8.141436
## 20130228   3.74537978   2.28872860   6.0341084 2.562331e-02  7.258036
## 20130329   1.92061553   4.31622002   6.2368355 7.148353e-03  8.545509
## 20130430   4.83940591  -1.19840637   3.6409995 8.253624e-05 10.545379
## 20130531  -1.57245576   7.97683901   6.4043832 9.882217e-03  8.465437
## 20130628   6.69750147  -2.90917080   3.7883307 1.968991e-03  9.674799
## 20130731  -0.11998782  -1.07104708  -1.1910349 9.797558e-05  8.772422
## 20130830   1.34077359   9.87206502  11.2128386 3.540380e-02  9.101706
## 20130930  -3.80803191   6.17023329   2.3622014 2.317402e-05 10.142080
## 20131031   3.47095169   6.80909668  10.2800484 2.659929e-02  9.558073
## 20131129   4.82172181   4.48176530   9.3034871 9.850874e-03  9.418059
## 20131231   7.36582535  -9.48903691  -2.1232116 1.038420e-04 12.306525
## 20140131   8.71588554  -0.72788277   7.9880028 1.768703e-02  9.722693
## 20140228  -6.43266709   1.88881620  -4.5438509 4.598074e-02  9.251081
## 20140331  -9.38698918  -2.28673159 -11.6737208 6.641972e-02  9.699626
## 20140430  -4.21838698   0.75619705  -3.4621899 3.545542e-03  8.668348
## 20140530   5.98800987  -7.42259449  -1.4345846 1.341742e-03  9.899448
## 20140630  -0.98814354  -0.77379340  -1.7619369 2.488615e-03  9.389641
## 20140731   8.13736699  -4.25483957   3.8825274 4.045783e-04  8.749128
## 20140829  -4.64167473   4.30032614  -0.3413486 1.659678e-04  9.108906
## 20140930  -0.43646130   2.82873412   2.3922728 6.177938e-05 11.646524
## 20141031   3.24017972   5.88838519   9.1285649 4.314582e-02  9.929667
## 20141128   0.69624041   9.44601147  10.1422519 2.322371e-02 10.184646
## 20141231   3.32910254   8.04782296  11.3769255 2.426322e-02 10.459319
## 20150130   0.62160724  -6.84705564  -6.2254484 2.445408e-02 10.292621
## 20150227   6.53451777  12.04962323  18.5841410 3.950588e-02  9.150841
## 20150331  -3.83022242 -24.76201620 -28.5922386 2.128941e-01 11.148031
## 20150430  14.28192548  15.05480750  29.3367330 1.189118e-01  9.917572
## 20150529   6.50182847   6.96746221  13.4692907 5.988414e-02  8.753609
## 20150630   7.37529168  26.80369568  34.1789874 1.775662e-01 10.530589
## 20150731  -9.78038615  -1.04585056 -10.8262367 1.059950e-02 10.215449
## 20150831  -9.09721311  16.73157848   7.6343654 6.713816e-02 10.577300
## 20150930  -6.36406713  -0.36577902  -6.7298462 1.281589e-02 12.954967

Machine Learning Resolution for the New Year

First, happy new year.  And what's my resolution?  I resolve not to peek at my test (out-of-sample) data.  Will I keep it?  Not likely.

I get the point of out-of-sample test sets.  But I cheat a little.  The question is whether my cheating is too much.

Topping the ML commandments is: thou shalt not use test data more than once.  It's kind of a kosher law not to eat your test set, or a do-not-eat-the-apple of the test set tree.  Perhaps we can think of it as: thou shalt not adulterate one's test set out of marriage.  The test set must remain pure until you've made the sacred commitment and betrothed a final model.

Actually, I believe there is a new testament ML religion, growing in popularity, that says you don't need to keep kosher; trust in cross-validation (later I explain why I can't trust in CV - but I can still respect you if you do).

Why am I tempted to have pre-marital sex with my test data?  It's simple.  I don't want to end up with a model I can't love.  I spend a lot of time on the ritual of collecting and cleansing data.  Then I build a model.  The ML high priests preach that I should build lots of models and pick the one that works best.  Then I am to use the test set to estimate how well the model will work.  I hear it.  I read what I just wrote and it sounds pure.  My problem is that when I build my first model (so far I start with random forests), I want to see if it works on the test set just to see if it works at all.  If it works, it motivates me to try other models and to keep going.  Without peeking I am left with the worry that all my effort will be in vain.  So I peek.  Is this an unforgivable sin (your thoughts appreciated)?

It seems to me that my sin does not condemn me to a bad model (a model with no predictive ability), but that I must reduce my confidence in the out-of-sample error rate.

-- A note on my problem with cross-validation.
I am working with time series; specifically, predicting the returns of stocks.  As an example, I've thought about creating 12 folds with each fold representing a month of data (think Jan-Dec predictors with the subsequent month's return being the response).  It's ok to use Jan-Nov to predict Dec.  But I can't use Feb-Dec to predict Jan.  One of my predictors is past return.  Thus in Mar, the return for Feb is a predictor.  To use that to build a model in which the return for Feb is the response commits the time-series sin of looking ahead (a mortal sin leading to hellish models that have outstanding in-sample performance and no predictive ability).  Also, within a month, an industry or sector might do well.  A model for a month will typically identify an industry or sector predictor.  That will work great in a (within-month) cross-validation, but not nearly as well for a prediction for the future.  Thus cross-validation does not replace an out-of-sample test set in my ML bible.
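One way to respect that ordering is a walk-forward split, where every training month comes strictly before the test month. The Python sketch below is only an illustration of that idea under my own naming. It is not something I have run on the stock data.

```python
def walk_forward_splits(months, min_train=1):
    """Yield (train_months, test_month) pairs where training data is strictly earlier."""
    for i in range(min_train, len(months)):
        yield months[:i], months[i]

months = ["Jan", "Feb", "Mar", "Apr"]
splits = list(walk_forward_splits(months))
```

Unlike ordinary k-fold CV, no split ever trains on a month that follows the one being predicted, so a past-return predictor cannot leak the response.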




A Second First Look at 12 forests (011.1)

This is a post-script to the previous entry in which I looked at the performance of 12 forests.  While looking more deeply into the performance I came across a mistake.  I included "COMPANY_ID", which is a unique identifier for each company, as a variable.  Including this would not have helped the model performance.  However, because this was treated as a factor, it created over a thousand new variables (explaining the high variable count).  I suspect this had little influence on performance.  It probably slowed down the computer as it considered more variables.  It may have hurt performance as better variables were crowded out of some trees.  I am re-running and we will see.

There's another lesson I need to pay closer attention to.  When something seems off, it is off.  I noted the oddly high variable count in the previous post.  I was too quick to use sector, industry and other factor variables to explain the increase.  But the increase was too large for these.


Wednesday, December 30, 2015

A first look at 12 forests (011.0)

This post looks at initial (spoiler alert - promising) results from 12 forests. Each forest was created from a single month of data. The first 12 months were used mainly to see if the program is running properly. This evaluation uses 10-fold cross validation (explained below). In this analysis we have a custom function to evaluate the performance. We average the observed returns of the 50 companies predicted to have the highest return and subtract the average return of the entire universe. This is to simulate owning the 50 stocks identified as best by the model and comparing this to an equal-weighted universe. This is called “LongEx” for long excess return. In the table below we show the 12 LongEx values and the standard deviation of those values based on the 10-fold CV results. The LongEx values range from just over 2% (that’s for a single month) to over 8%. The standard deviations are generally between 1% and 2%. The mtry is a parameter explained in the notes section.
To be clear, each LongEx value is the average of 10 results from the CV analysis. The SD is the standard deviation of the 10. Also, we are averaging the actual (observed) returns, not the predicted returns.
##          mtry LongEx LongExSD
## 20030103   72   3.38     1.90
## 20030131 2583   3.83     0.83
## 20030228 2543   3.07     1.23
## 20030404 2598   6.63     1.73
## 20030502 2690   8.22     1.26
## 20030530   74   5.17     1.52
## 20030703 2918   5.93     1.24
## 20030801 2983   5.00     1.13
## 20030829   77   3.77     1.16
## 20031003 3074   6.38     1.79
## 20031031 3122   2.17     1.95
## 20031128   79   6.65     1.25
This looks promising. However, I expect these results are overly optimistic. In any month a model identifies characteristics (predictors) which worked well that month. These should work on the fold which was left out producing good results. And those results are likely to be better than the results if one were trying to predict the performance of stocks in future months. On the other hand a real-life implementation will have an advantage (my gut says not enough to offset the negative). To produce these results, we have effectively picked 500 stocks each month because we picked 50 stocks from each of the ten folds. In reality we would only be picking 50 stocks from the entire universe. Picking 50 from a universe of 2000 offers more opportunity (and more risk) than picking 500 by selecting 50 from ten groups of 200.
I’ll explore true out of sample results in a future post.
More details for each forest are shown below. In addition to LongEx, we calculate ShortEx which is the average return of the universe minus the average return of the 50 stocks with the lowest predicted returns. Hedge is the average actual returns of the top 50 predicted companies less the average of the bottom 50 to represent a long/short strategy.
## [1] "Random Forest for 20030103"
## Random Forest 
## 
## 1841 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1657, 1657, 1657, 1657, 1657, 1656, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  1.645823  1.536707  3.182530  1.105092   1.094489    1.920719
##     72  3.383879  2.634079  6.017958  1.899364   1.192036    2.639495
##   2647  3.000747  3.000757  6.001504  1.644549   1.079754    2.359390
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 72. 
## [1] ""
## [1] "Random Forest for 20030131"
## Random Forest 
## 
## 1778 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1600, 1601, 1601, 1600, 1600, 1600, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  2.251012  2.069897  4.320909  0.7438784  1.4527422   1.926697
##     71  3.278089  3.313642  6.591731  1.0148935  0.9846684   1.524026
##   2583  3.832299  3.118804  6.951102  0.8342543  1.2259727   1.968768
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2583. 
## [1] ""
## [1] "Random Forest for 20030228"
## Random Forest 
## 
## 1737 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1564, 1564, 1565, 1562, 1564, 1563, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  2.278811  2.107003  4.385813  0.8000779  1.278999    1.741812
##     71  2.762551  2.622176  5.384727  1.0287755  1.628047    2.593365
##   2543  3.066078  2.566322  5.632400  1.2270250  1.095004    2.125386
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2543. 
## [1] ""
## [1] "Random Forest for 20030404"
## Random Forest 
## 
## 1793 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1613, 1614, 1614, 1614, 1614, 1614, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge      LongEx SD  ShortEX SD  Hedge SD
##      2  4.691011  4.315543   9.006554  1.232598   0.9963404   1.804397
##     72  6.574063  5.327072  11.901135  1.308748   0.7783539   1.815156
##   2598  6.631014  5.793130  12.424144  1.727432   1.1157856   2.352689
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2598. 
## [1] ""
## [1] "Random Forest for 20030502"
## Random Forest 
## 
## 1884 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1695, 1696, 1696, 1696, 1696, 1696, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  7.547904  4.709156  12.25706  1.423476   0.6656333   1.891159
##     73  8.012248  6.315012  14.32726  1.289156   0.9187280   1.994717
##   2690  8.216312  6.251946  14.46826  1.263155   0.7277428   1.897688
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2690. 
## [1] ""
## [1] "Random Forest for 20030530"
## Random Forest 
## 
## 1994 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1795, 1795, 1794, 1794, 1796, 1794, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  4.233718  2.271090  6.504807  0.9472787  0.9201049   1.576348
##     74  5.168139  3.385412  8.553551  1.5206274  1.0122496   2.093077
##   2800  4.891951  3.635727  8.527677  1.1751994  0.6730027   1.736448
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 74. 
## [1] ""
## [1] "Random Forest for 20030703"
## Random Forest 
## 
## 2112 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1900, 1901, 1901, 1900, 1901, 1901, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge      LongEx SD  ShortEX SD  Hedge SD
##      2  5.029364  3.578111   8.607475  1.342638   0.6247807   1.554427
##     76  5.545285  5.744639  11.289924  1.476264   1.1153937   2.399354
##   2918  5.931024  6.174434  12.105458  1.237745   1.1080703   1.625167
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2918. 
## [1] ""
## [1] "Random Forest for 20030801"
## Random Forest 
## 
## 2178 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1959, 1962, 1959, 1961, 1961, 1959, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  3.679001  3.617475  7.296476  2.217545   1.0722525   2.831624
##     77  4.755090  4.025980  8.781070  1.261864   0.7648514   1.604275
##   2983  4.999232  4.470548  9.469780  1.130875   0.9915790   1.806839
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2983. 
## [1] ""
## [1] "Random Forest for 20030829"
## Random Forest 
## 
## 2224 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2000, 2000, 2003, 2003, 2001, 2002, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  2.827245  3.218264  6.045510  1.284799   1.280548    2.410151
##     77  3.765488  3.690990  7.456478  1.163804   1.242862    2.094116
##   3030  3.362627  3.700911  7.063538  1.587360   1.233176    2.442162
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 77. 
## [1] ""
## [1] "Random Forest for 20031003"
## Random Forest 
## 
## 2268 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2042, 2042, 2041, 2041, 2041, 2041, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge      LongEx SD  ShortEX SD  Hedge SD
##      2  3.823929  3.591133   7.415062  1.595134   1.403108    2.706097
##     78  6.012630  5.384848  11.397478  1.780591   1.521213    2.774167
##   3074  6.376218  5.523964  11.900182  1.788174   1.958884    3.210630
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 3074. 
## [1] ""
## [1] "Random Forest for 20031031"
## Random Forest 
## 
## 2317 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2085, 2085, 2085, 2085, 2086, 2086, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx     ShortEX    Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  0.7249286  0.9280415  1.652970  0.8189695  1.432565    1.826758
##     79  2.1073070  2.2033438  4.310651  1.6530306  1.587118    3.005683
##   3122  2.1704221  2.2015307  4.371953  1.9472564  1.269238    2.975489
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 3122. 
## [1] ""
## [1] "Random Forest for 20031128"
## Random Forest 
## 
## 2377 samples
##  692 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2140, 2139, 2138, 2139, 2138, 2139, ... 
## Resampling results across tuning parameters:
## 
##   mtry  LongEx    ShortEX   Hedge     LongEx SD  ShortEX SD  Hedge SD
##      2  5.218160  5.167228  10.38539  1.051254   0.8127089   1.163762
##     79  6.651044  6.186669  12.83771  1.253466   0.9298146   1.606550
##   3183  6.398450  6.323075  12.72152  1.114751   1.3537946   1.879923
## 
## LongEx was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 79. 
## [1] ""
Notes on the data: The response (Y) is the next one month return adjusted to have an average value of zero. At this point, the companies have been filtered to those with prices of at least $5 and average daily trading volume (dollars, not shares) of $2 million.
Description of 10-fold cross-validation (CV): As the output shows, in the first month there were 1841 samples (companies) and 692 predictors. The data are split into 10 sets of about 184 observations each. Nine sets (about 1657 observations, or 90% of 1841) are used to build a model, which is tested on the tenth set (the remaining 10% not used to build the model). This is repeated, leaving out each of the 10 sets one time. CV serves two purposes. It gives an indication of how well our model might do out of sample because it is holding out some data. It also allows us to tune parameters for a model. In the case of a random forest, we tune the number of predictors randomly sampled as candidates at each split (mtry). We run several different values for this parameter and use CV to pick the best.
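To make the fold sizes concrete, here is a minimal Python sketch of dealing 1841 rows into 10 folds. The function name and the shuffling scheme are assumptions for illustration and may differ from how the resampling was actually done.

```python
import random

def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = kfold_indices(1841, k=10)
# Each model trains on everything except its own held-out fold.
train_sizes = [1841 - len(fold) for fold in folds]
```

The training sizes come out as 1656 or 1657, in line with the "Summary of sample sizes" line in the output for 20030103.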
In these results the CV tries three values of mtry. The low value is consistently inferior. The other two are close, in that their averages are within a standard deviation of each other. Given this consistency, we can probably eliminate the tuning in future models, speeding up the process. Also, it may seem odd that mtry can exceed the number of predictors. This is because some of the predictors are factors. For example, the sector can take on 10 values. The sector variable is converted into 9 dummy variables. So the model actually has more than 692 predictors.
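The factor-to-dummy expansion can be sketched in Python as follows. The function name and the sector labels are hypothetical, and dropping the first level in sorted order as the baseline is an assumption. The point is only that a factor with k levels becomes k - 1 columns.

```python
def dummy_encode(values):
    """Encode a factor as k - 1 indicator columns, dropping the first level as baseline."""
    levels = sorted(set(values))
    kept = levels[1:]
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return rows, kept

# Hypothetical sector labels; a 10-level sector factor would yield 9 columns.
sectors = ["Energy", "Tech", "Utilities", "Tech"]
rows, columns = dummy_encode(sectors)
```

Here three levels give two columns, so a single factor with many levels can add a large number of predictors.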

Sunday, December 27, 2015

Y? More information on the Response variable (010.5)

This is a follow-up to the last post. In that one we looked at the monthly return for one month. In this one we look at all of them. For each month we look at the distribution of the 1 month return, calculating the mean (mu), median, standard deviation (stddev), skew, kurtosis (kurt), and maximum value. Below is a summary of each of these variables for the 154 months.
summary(ystats)
##        mu               median            stddev            skew        
##  Min.   :-22.1700   Min.   :-21.790   Min.   : 4.010   Min.   :-0.9000  
##  1st Qu.: -2.1950   1st Qu.: -1.465   1st Qu.: 9.180   1st Qu.: 0.0625  
##  Median :  1.1200   Median :  0.825   Median : 9.655   Median : 0.5400  
##  Mean   :  0.7871   Mean   :  0.498   Mean   :10.269   Mean   : 0.9528  
##  3rd Qu.:  3.8375   3rd Qu.:  3.265   3rd Qu.:10.758   3rd Qu.: 0.9950  
##  Max.   : 15.7700   Max.   : 12.490   Max.   :20.900   Max.   :30.0600  
##       kurt              max        
##  Min.   :   0.87   Min.   : 35.29  
##  1st Qu.:   5.27   1st Qu.: 71.39  
##  Median :   7.79   Median : 86.84  
##  Mean   :  23.71   Mean   :106.64  
##  3rd Qu.:  11.97   3rd Qu.:117.30  
##  Max.   :1313.57   Max.   :933.33
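For reference, the per-month statistics summarized above could be computed along these lines. This Python sketch uses population moments and the non-excess kurtosis convention (normal = 3). Both are assumptions, since the post does not say which estimators were used, and the function name is hypothetical.

```python
from statistics import mean, median

def cross_section_stats(returns):
    """Return mean, median, SD, skew, kurtosis and max for one month of returns."""
    n = len(returns)
    mu = mean(returns)
    # Population central moments; a normal distribution has skew 0 and kurtosis 3.
    m2 = sum((r - mu) ** 2 for r in returns) / n
    m3 = sum((r - mu) ** 3 for r in returns) / n
    m4 = sum((r - mu) ** 4 for r in returns) / n
    return {
        "mu": mu, "median": median(returns), "stddev": m2 ** 0.5,
        "skew": m3 / m2 ** 1.5, "kurt": m4 / m2 ** 2, "max": max(returns),
    }

stats = cross_section_stats([1.0, 2.0, 3.0, 4.0, 5.0])
```

Running this once per month and stacking the results would give a table with one row per month and one column per statistic.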
We note that the mean and median are not stationary. That is, the values move around a lot. No surprise. We see more evidence of the non-normality of the cross-section of returns. In over 75% of the months, the skewness is above 0 (a normal distribution has a value of 0). A normal distribution would have a kurtosis of 3.0, and these values are generally higher. The standard deviation is close to 10 for about one-half of the months. Finally, we note an outlier maximum return near 1000%. We examined this value, which was for the now delisted company Biocoral Inc (BCRA). While the return is computationally correct, it turns out this stock did not trade on many days. It is so illiquid that I plan to add a new filter for the data to screen out stocks with low trading volumes. What follows are time series plots of each variable.
##                mu median stddev  skew    kurt    max
## 2003-01-03  -2.27  -2.35  10.60  0.73   12.66 131.00
## 2003-01-31  -2.75  -2.00   9.67 -0.87    6.69  44.64
## 2003-02-28   0.47   0.45   9.54 -0.09    5.52  57.37
## 2003-04-04   9.22   8.07  11.49  1.00    5.97 109.64
## 2003-05-02   9.79   7.39  20.90 30.06 1313.57 933.33
## 2003-05-30   2.50   1.51   9.78  1.29    7.08  88.78
## 2003-07-03   5.10   3.89  12.49  1.44    8.42 135.14
## 2003-08-01   4.38   3.03   9.47  0.56    6.58  73.51
## 2003-08-29  -0.73  -0.92   9.37  0.76    5.40  71.55
## 2003-10-03   8.09   6.68  12.16  1.36    8.98 131.71
## 2003-10-31   3.14   2.56   9.29  0.43    7.79  69.62
## 2003-11-28   3.46   2.76   9.90  0.57    6.17  76.44
## 2004-01-02   3.69   2.59  11.14  1.21    6.27 114.89
## 2004-01-30   1.66   1.37   9.35  0.52    4.65  67.67
## 2004-02-27   0.08  -0.26   9.40  1.12   11.68 113.36
## 2004-04-02  -4.48  -3.97  10.78  0.07    5.08  92.19
## 2004-04-30   1.20   0.68  10.44  4.10   75.04 211.26
## 2004-06-04   3.46   3.33   9.38  0.34    4.99  69.00
## 2004-07-02  -5.53  -4.21  10.25 -0.76    2.28  36.29
## 2004-07-30  -0.37   0.13   9.68  0.01   16.62 129.68
## 2004-09-03   4.62   3.55   9.32  1.22    7.92  96.25
## 2004-09-30   2.26   1.91  10.32  0.31   12.42 124.49
## 2004-10-29   1.22   1.11   4.01  0.19   12.01  35.29
## 2004-12-03   3.54   2.83   9.60  4.51  116.11 246.37
## 2004-12-31  -3.36  -3.14   9.41  0.66   14.58 121.88
## 2005-01-31   2.23   1.41   9.67  0.46    6.27  98.78
## 2005-02-28  -2.85  -2.65   8.65 -0.29    5.38  56.48
## 2005-03-31  -4.69  -4.26   9.21 -0.36    2.46  40.09
## 2005-04-29   4.92   4.00   9.85  0.81   11.60 118.68
## 2005-05-31   3.42   2.96   8.89  0.41    7.21  73.27
## 2005-06-30   5.75   4.95   9.31  0.67    4.14  76.58
## 2005-07-29  -1.17  -1.48   9.69  0.90    9.17  89.60
## 2005-08-31   0.93   0.26   9.59 -0.25    7.95  54.43
## 2005-09-30  -3.25  -3.06   9.25 -0.16    4.79  51.12
## 2005-10-31   4.35   3.54   9.64  0.11    8.03  73.34
## 2005-11-30   0.62   0.11   8.45  0.54    8.57  57.27
## 2005-12-30   6.97   5.33  10.98  1.11    4.50  97.66
## 2006-01-31  -0.16   0.00   9.34  0.15    5.98  79.35
## 2006-02-28   3.36   2.79   9.55  0.43    4.78  64.51
## 2006-03-31   0.87   0.40   9.49  0.43    8.55  96.73
## 2006-04-28  -5.02  -4.48   9.48 -0.17    6.47  82.36
## 2006-05-31  -0.57  -0.48   8.91 -0.32    9.00  80.31
## 2006-06-30  -2.32  -1.42   9.64 -0.56    3.13  47.14
## 2006-07-31   2.77   2.44   9.64  0.12    6.25  71.06
## 2006-08-31   1.01   0.76   8.53  0.06    7.45  82.08
## 2006-09-29   4.94   4.43   9.57  0.83   13.67 126.75
## 2006-10-31   3.16   2.39   9.42  1.00    5.80  78.00
## 2006-11-30   1.16   0.88   8.28  0.20   15.20  71.41
## 2006-12-29   1.81   1.38   8.53  0.41    4.46  73.68
## 2007-01-31  -0.46  -0.77   8.67  0.80   16.42 121.52
## 2007-02-28   1.08   0.77   8.83  0.14   16.93 115.80
## 2007-03-30   2.77   2.22   8.54  0.52    4.59  57.71
## 2007-04-30   3.54   2.60   9.78  0.70   11.17 108.40
## 2007-05-31  -0.91  -1.39   8.35  0.36    4.31  50.16
## 2007-06-29  -5.42  -5.52  10.49  0.16    5.46  59.03
## 2007-07-31   0.20   0.24  10.89  0.01    6.06  93.16
## 2007-08-31   2.57   1.97  10.64  2.02   38.63 202.34
## 2007-09-28   2.50   2.14  12.07 -0.16    4.65  65.00
## 2007-10-31  -7.43  -6.38  10.99 -0.23    4.72  87.94
## 2007-11-30  -0.78  -1.20  11.86  3.77   67.79 254.61
## 2007-12-31  -6.37  -6.25  12.69 -0.01    3.38  88.89
## 2008-01-31  -2.67  -3.00  12.70  0.39    5.18 102.19
## 2008-02-29  -1.75  -0.87  11.90 -0.90    4.95  55.68
## 2008-03-31   4.46   4.14  12.68  0.09    4.14  79.00
## 2008-04-30   3.75   2.74  12.07  0.62    4.08  94.78
## 2008-05-30  -9.57  -9.74  12.81  0.58    5.74 107.54
## 2008-06-30   0.88   0.59  14.78  0.18    1.84  76.37
## 2008-07-31   1.86   1.45  11.60  1.19   17.16 173.52
## 2008-08-29 -10.23  -9.31  14.84 -0.27    2.51  71.39
## 2008-09-30 -22.17 -21.79  16.93  0.00    0.87  73.91
## 2008-10-31 -11.24 -10.35  16.96 -0.33    1.52  69.39
## 2008-11-28   5.42   4.31  15.77  0.63    3.71 106.61
## 2008-12-31  -8.57  -8.74  16.06  1.33   19.97 232.38
## 2009-01-30 -11.05 -10.24  13.86 -0.29    1.64  50.09
## 2009-02-27   8.74   8.11  13.82  0.35    3.13  91.95
## 2009-03-31  15.77  12.40  19.82  1.46    5.95 180.03
## 2009-04-30   6.27   4.56  15.60  1.10    4.87 129.92
## 2009-05-29   0.73   0.39  11.74  0.81    7.79 117.80
## 2009-06-30   9.40   8.27  12.26  0.73    3.94  90.06
## 2009-07-31   2.97   1.99  11.98  3.45   59.20 244.98
## 2009-08-31   6.35   5.11  10.82  0.77    5.31  78.38
## 2009-09-30  -4.97  -4.33   9.82 -0.28    2.80  49.92
## 2009-10-30   4.10   3.39  10.93  1.31   11.44 114.71
## 2009-11-30   6.11   5.35   9.34  0.80    4.79  80.06
## 2009-12-31  -3.53  -3.96   9.31  0.93    6.29  89.65
## 2010-01-29   3.84   3.15   9.34  0.90    6.88  82.88
## 2010-02-26   7.05   6.38  10.55  3.26   62.25 224.38
## 2010-03-31   4.27   3.29  10.04  1.05    5.59  89.01
## 2010-04-30  -8.15  -8.02   8.67  0.54   10.10  83.33
## 2010-05-28  -5.84  -5.39   9.29 -0.05    4.12  66.93
## 2010-06-30   7.27   6.93   9.86  0.55    6.71  86.75
## 2010-07-30  -5.71  -5.75  10.47  3.25   70.47 218.02
## 2010-08-31  11.06   9.99   9.91  0.77   10.38 100.82
## 2010-09-30   4.04   3.54   9.30 -0.06    9.17  64.48
## 2010-10-29   1.88   0.91   9.75  0.50    4.65  58.19
## 2010-11-30   7.38   6.52   9.83  2.53   42.78 189.81
## 2010-12-31   0.31   0.23   9.12  0.59    5.94  86.93
## 2011-01-31   4.07   3.19   9.49  0.70    7.02  71.51
## 2011-02-28   1.52   1.03   9.21  0.54    6.29  68.80
## 2011-03-31   2.66   2.31   9.35  1.67   22.78 119.18
## 2011-04-29  -2.24  -2.09   8.77  0.41    8.49  78.53
## 2011-05-31  -2.34  -2.11   7.70 -0.19    5.63  47.88
## 2011-06-30  -3.40  -3.39   8.80  0.38    5.46  67.07
## 2011-07-29  -8.02  -7.23  10.69 -0.52    4.77  68.54
## 2011-08-31 -11.02 -10.13  10.67 -0.03    6.89 105.30
## 2011-09-30  13.50  12.49  12.09  0.57    4.08 122.52
## 2011-10-31  -1.53  -0.78  10.17  0.98   27.93 167.46
## 2011-11-30  -0.38  -0.05   8.60  0.35    9.35  73.29
## 2011-12-30   7.33   6.09  10.62  1.55   10.93 133.27
## 2012-01-31   3.04   2.43   9.05  0.32    9.99  88.60
## 2012-02-29   1.44   1.35   8.69  0.76    9.64  84.45
## 2012-03-30  -1.26  -0.67   8.38  0.21    8.73  78.52
## 2012-04-30  -7.73  -6.92  10.08 -0.12    5.27  77.32
## 2012-05-31   3.88   3.51   8.33  0.61    7.52  73.81
## 2012-06-29  -0.54   0.45   9.85 -0.39    8.79  85.04
## 2012-07-31   2.74   1.96   9.12  1.30   15.81 115.27
## 2012-08-31   3.03   2.50   7.35  0.69   16.04 100.00
## 2012-09-28  -1.50  -1.12   8.67 -0.17    7.83  56.07
## 2012-10-31   0.62   0.50   8.31  0.13    5.25  51.65
## 2012-11-30   2.23   1.63   7.36  1.46   16.48  88.16
## 2012-12-31   5.81   5.41   8.22  0.45    9.52  78.46
## 2013-01-31   0.50   0.53   7.85 -0.07    9.76  55.28
## 2013-02-28   3.83   3.50   8.06  0.93    9.77  89.17
## 2013-03-29   0.39   0.44   8.69  0.29    8.14  72.70
## 2013-04-30   2.88   2.22  10.63  2.03   18.07 134.74
## 2013-05-31  -0.78  -0.92   8.48  0.32   12.18  83.11
## 2013-06-28   6.15   5.48   9.48  2.00   21.59 147.85
## 2013-07-31  -2.60  -3.34   9.39  3.02   44.54 175.82
## 2013-08-30   5.61   4.88   9.39  1.62   16.81 106.17
## 2013-09-30   2.72   3.15   9.94 -0.48    8.45  67.79
## 2013-10-31   3.01   2.47   9.77  0.55    5.71  69.31
## 2013-11-29   2.21   1.58   8.95  2.06   28.32 119.32
## 2013-12-31  -2.26  -3.36  11.69  8.01  212.58 340.66
## 2014-01-31   4.65   4.04   9.42  1.31   14.24 125.02
## 2014-02-28  -0.16   0.33   8.99 -0.43    9.94  80.79
## 2014-03-31  -2.66  -1.86   9.30 -0.36    4.49  58.98
## 2014-04-30   0.90   1.07   8.70 -0.10    8.37  81.01
## 2014-05-30   4.49   3.44   9.55  8.06  211.95 284.37
## 2014-06-30  -4.67  -4.34   9.17  4.58  135.29 235.94
## 2014-07-31   4.14   3.62   8.98  0.67   10.30  84.72
## 2014-08-29  -5.07  -4.97   9.09  0.72   14.10  91.64
## 2014-09-30   3.49   3.84  12.33  0.51   20.64 192.24
## 2014-10-31  -0.15   0.57  10.02 -0.27    6.20  67.68
## 2014-11-28   0.71   0.76  10.55  1.39   16.13 122.46
## 2014-12-31  -3.14  -3.30  10.35  0.50    8.98 114.49
## 2015-01-30   5.83   5.42   9.92  0.80    5.27  84.17
## 2015-02-27   0.26   0.13   9.47  0.84   11.87 112.53
## 2015-03-31   0.21  -0.91  10.60  0.77   11.15  91.30
## 2015-04-30   0.92   0.46  10.33  2.03   21.15 132.07
## 2015-05-29  -0.70  -1.41   9.14  0.67    7.70  68.04
## 2015-06-30  -2.06  -1.27  10.90 -0.28    5.31  92.68
## 2015-07-31  -6.18  -5.77  10.04 -0.37    8.95  83.30
## 2015-08-31  -5.78  -4.24  10.91 -0.60    5.25  77.63
## 2015-09-30   6.24   6.22  12.08  0.52   10.40 141.07
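The low-volume filter mentioned above has not been specified yet, so the following is only one possible shape for it, sketched in base R. The data frame `daily`, its columns, the tickers, and both thresholds are hypothetical and chosen purely for illustration. The idea is to keep a stock only if it trades on most days and has a reasonable median daily volume.

```r
# Hypothetical daily data: one row per ticker per trading day.
daily <- data.frame(
  ticker = rep(c("AAA", "BBB", "CCC"), each = 5),
  volume = c(120000, 95000, 130000, 110000, 105000,  # AAA trades every day
             0, 0, 400, 0, 0,                        # BBB rarely trades
             20000, 0, 25000, 18000, 22000)          # CCC misses one day
)

# Illustrative thresholds only; the post does not specify any.
min_traded_share  <- 0.75   # share of days with nonzero volume
min_median_volume <- 10000  # median daily share volume

# Per-ticker liquidity measures (both ordered by ticker).
traded_share  <- tapply(daily$volume > 0, daily$ticker, mean)
median_volume <- tapply(daily$volume, daily$ticker, median)

# Tickers that pass both screens.
keep <- names(traded_share)[traded_share >= min_traded_share &
                            median_volume >= min_median_volume]
keep
```

A screen of this kind would drop a stock like BCRA before its return enters the monthly statistics, rather than trimming the extreme return after the fact.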