Tuesday, December 15, 2015

Predicting Stock Market Returns - Part II (Initial Results)

Let’s see if we can make some useful predictions about future daily returns of the stock market. Our dataset is derived entirely from daily prices of the S&P 500 since 1950. This analysis ignores dividends and transaction costs. Even the weak form of the Efficient Market Hypothesis holds that past prices should give no information about future returns. This analysis casts doubt on that claim.

Description of Features

The features used in this model closely follow those described in “A Look At Daily Returns of the S&P 500”. Six variables are defined as the closing price divided by its 2d, 5d, 10d, 20d, 50d, and 200d simple moving averages (SMA). We also calculate the ratio SMA50/SMA200. Another variable is the %B calculation associated with Bollinger bands, scaled so that 0 represents the price at the lower band and 1 the price at the upper band. We have factor variables representing the day of the week and the month. We measure the trading day of the month, the trading days left in the month, the days since the last holiday, and the days until the next holiday. We calculate the volatility (standard deviation) over the last 5d and 10d and the change in volatility (vol5 - vol10). For each of the SMA variables, we define two event variables: one is true if the ratio crosses 1.0 from below and the other is true if it crosses 1.0 from above.
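A few of these features can be sketched in base R as follows. This is only an illustration, not the code behind the post: the helper names (sma, roll_sd, cross_from_below, cross_from_above, percent_b) are made up here, the 20-day window and 2-standard-deviation bands for %B are assumptions, measuring volatility on daily percent returns is an assumption, and so is the handling of a ratio that lands exactly on 1.0.

```r
# Trailing simple moving average over the last n values (NA until n values exist).
sma <- function(x, n) as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))

# Trailing standard deviation over the last n values.
roll_sd <- function(x, n) {
  out <- rep(NA_real_, length(x))
  if (length(x) >= n) {
    for (i in n:length(x)) out[i] <- sd(x[(i - n + 1):i])
  }
  out
}

# Event flags on a price/SMA ratio r.
# ASSUMPTION: touching exactly 1.0 counts as completing the cross.
cross_from_below <- function(r) {
  prev <- c(NA, head(r, -1))
  as.integer(!is.na(prev) & !is.na(r) & prev < 1 & r >= 1)
}
cross_from_above <- function(r) {
  prev <- c(NA, head(r, -1))
  as.integer(!is.na(prev) & !is.na(r) & prev > 1 & r <= 1)
}

# %B: 0 at the lower Bollinger band, 1 at the upper band.
# ASSUMPTION: 20-day window and bands at k = 2 standard deviations.
percent_b <- function(px, n = 20, k = 2) {
  mid <- sma(px, n)
  s <- roll_sd(px, n)
  lower <- mid - k * s
  upper <- mid + k * s
  (px - lower) / (upper - lower)
}

# Toy usage on a made-up price series.
px <- c(10, 11, 12, 11, 10, 9, 10, 12, 13, 12, 11, 12)
sma5_ratio <- px / sma(px, 5)                       # analogous to the SMA5 feature
e_sma5_xb <- cross_from_below(sma5_ratio)           # analogous to E_SMA5xb
e_sma5_xa <- cross_from_above(sma5_ratio)           # analogous to E_SMA5xa
ret_pct <- c(NA, 100 * diff(px) / head(px, -1))     # daily percent returns
volchg <- roll_sd(ret_pct, 5) - roll_sd(ret_pct, 10) # vol5 - vol10
```

Every feature here is trailing: the value for a given day uses only closes up to and including that day.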

Description of the Response (Y) Variable

In this model we are trying to predict the one-day return two days in the future. For example, if today is Monday, we predict Wednesday’s return. We don’t predict Tuesday’s return because the features require Monday’s close, so we cannot also execute at Monday’s close; the earliest we can trade is Tuesday. We considered using Tuesday’s opening price, but the open data was sparse. To an extent we are being conservative (but we still need to consider transaction costs).
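The alignment of the response with the features can be sketched like this. The function name two_day_ahead_return is hypothetical, and the sketch assumes close-to-close percent returns with execution at the next day's close, which the text implies but does not state outright.

```r
# Target for day t: percent return from the close of day t+1 to the close of day t+2.
# ASSUMPTION: close-to-close percent returns; the last two days have no target yet.
two_day_ahead_return <- function(px) {
  n <- length(px)
  stopifnot(n >= 3)
  daily <- c(NA, 100 * (px[-1] / px[-n] - 1))  # return realized on each day
  c(daily[-(1:2)], NA, NA)                      # shift back by two days
}

# Toy usage: Monday's row is paired with Wednesday's return.
px <- c(Mon = 100, Tue = 110, Wed = 121, Thu = 108.9, Fri = 108.9)
target <- two_day_ahead_return(px)
```

With this alignment, features computed from Monday's close are only ever paired with a return that starts accruing at Tuesday's close.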

Summary of the Data

summary(data)
##      Close             SMA2             SMA5            SMA10       
##  Min.   :  19.0   Min.   :0.8860   Min.   :0.7887   Min.   :0.7500  
##  1st Qu.:  85.4   1st Qu.:0.9979   1st Qu.:0.9952   1st Qu.:0.9926  
##  Median : 153.7   Median :1.0002   Median :1.0012   Median :1.0026  
##  Mean   : 487.4   Mean   :1.0001   Mean   :1.0006   Mean   :1.0013  
##  3rd Qu.: 963.4   3rd Qu.:1.0025   3rd Qu.:1.0065   3rd Qu.:1.0109  
##  Max.   :2130.8   Max.   :1.0547   Max.   :1.0697   Max.   :1.0977  
##      SMA20            SMA50           SMA200         SMA50_200     
##  Min.   :0.7220   Min.   :0.703   Min.   :0.6035   Min.   :0.7478  
##  1st Qu.:0.9903   1st Qu.:0.987   1st Qu.:0.9862   1st Qu.:0.9891  
##  Median :1.0051   Median :1.011   Median :1.0385   Median :1.0285  
##  Mean   :1.0028   Mean   :1.007   Mean   :1.0305   Mean   :1.0224  
##  3rd Qu.:1.0171   3rd Qu.:1.031   3rd Qu.:1.0822   3rd Qu.:1.0608  
##  Max.   :1.1112   Max.   :1.154   Max.   :1.2313   Max.   :1.1554  
##        BB               wday        tdayofmonth   tdaysleft    
##  Min.   :-0.4615   Min.   :1.000   Min.   : 1   Min.   : 1.00  
##  1st Qu.: 0.2902   1st Qu.:2.000   1st Qu.: 7   1st Qu.: 6.00  
##  Median : 0.6222   Median :3.000   Median :12   Median :11.00  
##  Mean   : 0.5625   Mean   :3.013   Mean   :12   Mean   :11.03  
##  3rd Qu.: 0.8385   3rd Qu.:4.000   3rd Qu.:17   3rd Qu.:16.00  
##  Max.   : 1.3609   Max.   :5.000   Max.   :24   Max.   :23.00  
##      month             vol5              vol10            volchg        
##  Min.   : 1.000   Min.   : 0.04066   Min.   :0.1086   Min.   :-5.21398  
##  1st Qu.: 4.000   1st Qu.: 0.43925   1st Qu.:0.5024   1st Qu.:-0.15074  
##  Median : 7.000   Median : 0.64743   Median :0.6943   Median :-0.01455  
##  Mean   : 6.527   Mean   : 0.77987   Mean   :0.8156   Mean   :-0.03571  
##  3rd Qu.:10.000   3rd Qu.: 0.95903   3rd Qu.:0.9716   3rd Qu.: 0.10591  
##  Max.   :12.000   Max.   :11.47204   Max.   :8.4723   Max.   : 3.40415  
##   daysfromhday     daystilhday     SMA50_200Factor     E_SMA2xb     
##  Min.   :  0.00   Min.   :  0.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:  7.00   1st Qu.:  8.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 16.00   Median : 17.00   Median :1.0000   Median :0.0000  
##  Mean   : 18.47   Mean   : 19.36   Mean   :0.7016   Mean   :0.2326  
##  3rd Qu.: 27.00   3rd Qu.: 28.00   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :107.00   Max.   :108.00   Max.   :1.0000   Max.   :1.0000  
##     E_SMA2xa         E_SMA5xb         E_SMA5xa        E_SMA10xb      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.2326   Mean   :0.1175   Mean   :0.1175   Mean   :0.07649  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##    E_SMA10xa         E_SMA20xb         E_SMA20xa         E_SMA50xb      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.07649   Mean   :0.05293   Mean   :0.05293   Mean   :0.03077  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##    E_SMA50xa         E_SMA200xb        E_SMA200xa      E_SMA50_200xb     
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.000000  
##  Mean   :0.03077   Mean   :0.01154   Mean   :0.01154   Mean   :0.001953  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.000000  
##  E_SMA50_200xa           ret           
##  Min.   :0.000000   Min.   :-20.46693  
##  1st Qu.:0.000000   1st Qu.: -0.41200  
##  Median :0.000000   Median :  0.04562  
##  Mean   :0.002014   Mean   :  0.03310  
##  3rd Qu.:0.000000   3rd Qu.:  0.49694  
##  Max.   :1.000000   Max.   : 11.58004
# Chronological split: first half of the rows for training, the rest for testing
train.idx <- seq_len(floor(nrow(data) / 2))
train <- data[train.idx, ]
test <- data[-train.idx, ]

The First Model

We have 16381 observations with complete data. We use the first 8190 to train our model. The other 8191 are reserved for testing. For our first model we build a random forest named rf1. The model output is shown here, followed by a measure of variable importance (IncNodePurity).
## Random Forest 
## 
## 8190 samples
##   33 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 8190, 8190, 8190, 8190, 8190, 8190, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared    RMSE SD     Rsquared SD
##    2    0.7571588  0.06379952  0.01136876  0.01300077 
##   17    0.7482923  0.08316171  0.01311734  0.01443278 
##   33    0.7523365  0.07760161  0.01434747  0.01598202 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 17.
##                 IncNodePurity
## Close             252.1173196
## SMA2              295.0860983
## SMA5              377.6223970
## SMA10             270.7152864
## SMA20             243.0050932
## SMA50             275.1455728
## SMA200            248.6737994
## SMA50_200         233.8618824
## BB                199.7404012
## wday              109.7844505
## tdayofmonth       147.3892576
## tdaysleft         145.0328119
## month             109.9277272
## vol5              499.4715378
## vol10             388.8036401
## volchg            423.6912030
## daysfromhday      175.4385722
## daystilhday       189.6794864
## SMA50_200Factor     9.8764698
## E_SMA2xb           21.3110409
## E_SMA2xa           21.2266711
## E_SMA5xb           13.8492434
## E_SMA5xa           12.9115335
## E_SMA10xb           7.7894632
## E_SMA10xa           7.6872256
## E_SMA20xb           6.2988590
## E_SMA20xa           8.5097551
## E_SMA50xb           4.5850951
## E_SMA50xa           5.5731876
## E_SMA200xb          2.7281166
## E_SMA200xa          4.7138217
## E_SMA50_200xb       0.6961828
## E_SMA50_200xa       0.6759492
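The printout above is consistent with what caret's defaults would produce (bootstrap resampling with 25 reps and a three-value mtry grid), so a fit along the following lines is plausible. This is a sketch under that assumption, not the original code: the helper name fit_rf, the toy data, and the reduced settings in the usage lines are illustrative only.

```r
library(caret)  # method = "rf" also needs the randomForest package installed

# ASSUMPTION: `train_df` holds the response in a column named `ret`
# and every other column is a predictor.
fit_rf <- function(train_df,
                   ctrl = trainControl(method = "boot", number = 25),
                   ...) {
  train(ret ~ ., data = train_df, method = "rf",
        trControl = ctrl, tuneLength = 3, ...)
}

# Toy usage on synthetic data, with fewer resamples and trees to keep it fast.
set.seed(1)
toy <- data.frame(x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80),
                  x4 = rnorm(80), x5 = rnorm(80))
toy$ret <- 0.5 * toy$x1 + rnorm(80, sd = 0.2)
rf_toy <- fit_rf(toy, ctrl = trainControl(method = "boot", number = 3), ntree = 50)

# Node-purity importance, the same measure as in the listing above.
imp <- randomForest::importance(rf_toy$finalModel)
```

On the real data the call would be something like fit_rf(train), which takes far longer with 8190 rows and 33 predictors.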

Performance of the model

Our model predicts the return for the day after tomorrow. For evaluation purposes, we want to consider an approach that might allow us to profit from the prediction. The obvious rule is to be long the market when the prediction is non-negative and short when the prediction is negative. This means our focus should be on how well we can predict up and down days in the market. We start with results on our training (in-sample) data: we feed the training data back into the rf1 model to predict the returns. Each row of the table below summarizes the actual returns for one group of predictions, and % Neg is the percentage of those actual returns that were negative. Not surprisingly, since the model has already seen these observations, the results are excellent.
##             Min. 1st Qu.  Median     Mean 3rd Qu.   Max. % Neg
## Pred Neg -6.6760 -0.8007 -0.4234 -0.56670 -0.1908 0.7766  93.9
## Pred Pos -0.4193  0.1675  0.4059  0.53330  0.7368 5.0220   6.2
## All      -6.6760 -0.3776  0.0388  0.02868  0.4451 5.0220  46.4
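A table of this shape can be built from a vector of predictions and a vector of actual returns. The sketch below is one way to do it; the function name summarize_by_sign is made up, and it assumes non-negative predictions are grouped as "Pred Pos" to match the long/short rule.

```r
# Summarize actual returns grouped by the sign of the prediction.
# ASSUMPTION: a prediction of exactly 0 is treated as positive (long).
summarize_by_sign <- function(pred, actual) {
  one_row <- function(x) {
    q <- quantile(x, c(0, 0.25, 0.5, 0.75, 1), names = FALSE)
    c("Min." = q[1], "1st Qu." = q[2], "Median" = q[3], "Mean" = mean(x),
      "3rd Qu." = q[4], "Max." = q[5], "% Neg" = 100 * mean(x < 0))
  }
  rbind(
    "Pred Neg" = one_row(actual[pred < 0]),
    "Pred Pos" = one_row(actual[pred >= 0]),
    "All"      = one_row(actual)
  )
}

# Toy usage with made-up numbers.
pred   <- c(-1, -0.5, 0.2, 0.3, 0)
actual <- c(-2, 1, 3, -1, 2)
tab <- summarize_by_sign(pred, actual)
```

A model with no skill would show roughly the same % Neg in all three rows.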
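We used the caret package to generate the model. It returns a set of predictions different from those above. I believe, but am not positive, that these predictions are the result of 10-fold cross-validation, where 9/10 of the training data is used to build a model that predicts the other 1/10, repeated 10 times so that every observation has a prediction. Note, however, that the model printout above reports bootstrap resampling (25 reps) rather than 10-fold cross-validation, so these may instead be hold-out predictions from the bootstrap or the random forest’s out-of-bag predictions. Either way, each prediction comes from trees or models that did not see that observation, so this approach should give us an indication of how well the model will perform out of sample. The results of these predictions follow. They are worse than those above but still promising.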
##            Min. 1st Qu.  Median     Mean 3rd Qu.  Max. % Neg
## Pred Neg -6.676 -0.6443 -0.1746 -0.18120  0.2630 4.756  59.8
## Pred Pos -6.618 -0.1631  0.1654  0.19400  0.5352 5.022  35.8
## All      -6.676 -0.3776  0.0388  0.02868  0.4451 5.022  46.4

Out of Sample Results

Now we turn to using the rf1 model to predict returns on the test data, which the model has not seen before. It’s worth noting that our training and test sets each span over 30 years. Thus over half the predictions below are based on a model using data over a decade old. Even so, the results indicate that we are able to discriminate between positive and negative days. There is a higher percentage of negative days among those we predict will be negative (52.5%) than in the overall data (46.5%). Only 41.5% of the returns are negative for the days we predict will be positive. The model did fail to predict that Black Monday in October 1987 would be negative; it appears as the -20.47 minimum in the Pred Pos row.
##             Min. 1st Qu.   Median     Mean 3rd Qu.   Max. % Neg
## Pred Neg  -8.807 -0.6364 -0.04975 -0.10640  0.4488  5.417  52.5
## Pred Pos -20.470 -0.3246  0.13530  0.15580  0.6528 11.580  41.5
## All      -20.470 -0.4579  0.05660  0.03752  0.5656 11.580  46.5
From an economic standpoint, $1 invested in the market (buy and hold) at the start of the test period would have grown to $12.74. Over the same period, $1 invested in a strategy that was long the market when our prediction was positive and short when it was negative would have grown to $830.68. As noted at the outset, both figures ignore dividends and transaction costs.
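The growth figures come from compounding daily returns. A minimal sketch of that arithmetic, assuming daily percent returns, a fully invested long or short position each day, and no costs or dividends (the function name growth_of_dollar is hypothetical):

```r
# Growth of $1 under buy-and-hold versus the long/short rule.
# ASSUMPTIONS: `ret_pct` is the daily percent return earned on each day,
# the position is +1 or -1, returns compound daily, costs are ignored.
growth_of_dollar <- function(pred, ret_pct) {
  position <- ifelse(pred >= 0, 1, -1)  # long when non-negative, else short
  r <- ret_pct / 100
  c(buy_hold = prod(1 + r), long_short = prod(1 + position * r))
}

# Toy usage: three days with returns of +10%, -10%, and +5%.
g <- growth_of_dollar(pred = c(1, -1, 0.5), ret_pct = c(10, -10, 5))
```

In the toy example the second day is correctly shorted, so the strategy earns +10% on a day when buy-and-hold loses 10%.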

Remarks

In the variable importance, I note that Close (the price level) was left in and seemed to be important. I find it peculiar that this variable would be of value. Another model leaving out many variables will produce almost identical performance.
Transaction costs are an issue. We need to analyze whether there are streaks of predicted negative value. Streaks reduce transaction costs. If every other day is predicted to be negative, we’d execute more frequently and incur higher costs than if we had 3 days of negative followed by 3 days of positive predictions.
This model is only a simple random forest. Other algorithms might do better.
This approach uses one period of data to predict returns. We’d likely be better off expanding our window. Roughly this means using the data from 1950-1980 to predict 1981; 1950-1981 to predict 1982 and so forth.
We might be able to profit more with the use of leverage or scaling our investment based on the size of the predicted return.
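Two of these remarks are easy to make concrete. The sketch below counts how often the long/short position would flip for a given prediction series, and builds expanding-window train/test index sets by calendar year. Both helper names (count_switches, expanding_splits) are hypothetical, and a `years` vector giving the calendar year of each row is assumed to be available.

```r
# Number of position changes implied by a prediction series.
# Fewer changes means fewer trades and lower transaction costs.
count_switches <- function(pred) {
  position <- ifelse(pred >= 0, 1, -1)
  sum(diff(position) != 0)
}

# Expanding-window splits: train on every row before year y, test on year y.
# ASSUMPTION: `years` holds the calendar year of each row, in time order.
expanding_splits <- function(years, first_test_year) {
  test_years <- sort(unique(years[years >= first_test_year]))
  lapply(test_years, function(y) {
    list(test_year = y, train = which(years < y), test = which(years == y))
  })
}

# Toy usage.
alternating <- count_switches(c(1, -1, 1, -1, 1, -1))  # flips every day
streaky     <- count_switches(c(-1, -1, -1, 1, 1, 1))  # 3 negative, then 3 positive
splits <- expanding_splits(c(1950, 1950, 1951, 1952, 1952), first_test_year = 1951)
```

Each element of splits could then be used to refit the model on the train rows and predict only the test rows, so no prediction relies on a model more than a year out of date.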
