Let’s see if we can make some useful predictions about future daily returns of the stock market. Our dataset is derived entirely from daily prices of the S&P 500 since 1950. This analysis ignores dividends and transaction costs. Even the weak form of the Efficient Market Hypothesis holds that past prices should give no information about future returns. This analysis casts doubt on that claim.
Description of Features
The features used in this model closely follow those described in “A Look At Daily Returns of the S&P 500”. Six variables are the closing price divided by its 2d, 5d, 10d, 20d, 50d, and 200d simple moving averages (SMAs). We also calculate the ratio SMA50/SMA200. Another variable is the %B calculation associated with Bollinger bands, scaled so that 0 represents the price at the lower band and 1 the price at the upper band. We have factor variables representing the day of the week and the month. We measure the trading day of the month, the trading days left in the month, the days since the last holiday, and the days until the next holiday. We calculate the volatility (standard deviation) over the last 5d and 10d and the change in volatility (vol5 - vol10). For each of the SMA variables, we define two event variables: one is true if the variable crosses 1.0 from below and the other is true if it crosses 1.0 from above.
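As a rough illustration of how a few of these features might be computed, here is a minimal sketch in base R. The helper names (`sma`, `roll_sd`, `sma_features`, `percent_b`) are hypothetical, and the 20-day window with 2 sample standard deviations for %B, as well as the handling of ties at exactly 1.0, are assumptions rather than details taken from the original analysis.

```r
# Trailing simple moving average; the first n - 1 values are NA.
sma <- function(x, n) {
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

# Trailing sample standard deviation over the last n values.
roll_sd <- function(x, n) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= n) out[i] <- sd(x[(i - n + 1):i])
  }
  out
}

# Close / SMA ratio plus the two crossing events for one window length.
sma_features <- function(close, n) {
  ratio <- close / sma(close, n)
  prev  <- c(NA, head(ratio, -1))  # yesterday's ratio
  data.frame(
    ratio      = ratio,
    cross_up   = !is.na(prev) & prev <= 1 & ratio > 1,  # crossed 1.0 from below
    cross_down = !is.na(prev) & prev >= 1 & ratio < 1   # crossed 1.0 from above
  )
}

# %B: 0 at the lower Bollinger band, 1 at the upper band.
# Assumption: bands are SMA(n) +/- k sample standard deviations.
percent_b <- function(close, n = 20, k = 2) {
  mid   <- sma(close, n)
  width <- k * roll_sd(close, n)
  (close - (mid - width)) / (2 * width)
}

close <- c(100, 102, 101, 99, 98, 100, 103)  # toy prices, not S&P 500 data
sma_features(close, 2)
```

Calling `sma_features` once per window length (2, 5, 10, 20, 50, 200) would give the six ratio variables and their event pairs.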
Description of the Response (Y) Variable
In this model we are trying to predict the one-day return two days in the future. For example, if today is Monday, we predict Wednesday’s return. We don’t predict Tuesday’s return because the features depend on Monday’s close, so we can’t also execute at Monday’s close; the earliest we can trade is Tuesday. We considered using Tuesday’s opening price, but the open data was sparse. To an extent we are being conservative (but we still need to consider transaction costs).
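The alignment described above can be sketched as follows. This is one reading of the text, not the original code: the function name `make_target` is hypothetical, and it assumes the response is the close-to-close percentage return of day t+2 stored on the row for day t.

```r
# Response for row t: the percentage return realised on day t + 2,
# i.e. from the close of day t + 1 to the close of day t + 2.
make_target <- function(close) {
  n <- length(close)
  stopifnot(n >= 3)
  daily <- c(NA, 100 * (close[-1] / close[-n] - 1))  # return realised on day t
  c(daily[-(1:2)], NA, NA)                           # shift forward by two days
}

close <- c(100, 110, 121, 108.9)  # toy prices, not S&P 500 data
make_target(close)                # the last two rows have no target yet
```

The last two rows have no response and would be dropped before modelling.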
Summary of the Data
summary(data)
## Close SMA2 SMA5 SMA10
## Min. : 19.0 Min. :0.8860 Min. :0.7887 Min. :0.7500
## 1st Qu.: 85.4 1st Qu.:0.9979 1st Qu.:0.9952 1st Qu.:0.9926
## Median : 153.7 Median :1.0002 Median :1.0012 Median :1.0026
## Mean : 487.4 Mean :1.0001 Mean :1.0006 Mean :1.0013
## 3rd Qu.: 963.4 3rd Qu.:1.0025 3rd Qu.:1.0065 3rd Qu.:1.0109
## Max. :2130.8 Max. :1.0547 Max. :1.0697 Max. :1.0977
## SMA20 SMA50 SMA200 SMA50_200
## Min. :0.7220 Min. :0.703 Min. :0.6035 Min. :0.7478
## 1st Qu.:0.9903 1st Qu.:0.987 1st Qu.:0.9862 1st Qu.:0.9891
## Median :1.0051 Median :1.011 Median :1.0385 Median :1.0285
## Mean :1.0028 Mean :1.007 Mean :1.0305 Mean :1.0224
## 3rd Qu.:1.0171 3rd Qu.:1.031 3rd Qu.:1.0822 3rd Qu.:1.0608
## Max. :1.1112 Max. :1.154 Max. :1.2313 Max. :1.1554
## BB wday tdayofmonth tdaysleft
## Min. :-0.4615 Min. :1.000 Min. : 1 Min. : 1.00
## 1st Qu.: 0.2902 1st Qu.:2.000 1st Qu.: 7 1st Qu.: 6.00
## Median : 0.6222 Median :3.000 Median :12 Median :11.00
## Mean : 0.5625 Mean :3.013 Mean :12 Mean :11.03
## 3rd Qu.: 0.8385 3rd Qu.:4.000 3rd Qu.:17 3rd Qu.:16.00
## Max. : 1.3609 Max. :5.000 Max. :24 Max. :23.00
## month vol5 vol10 volchg
## Min. : 1.000 Min. : 0.04066 Min. :0.1086 Min. :-5.21398
## 1st Qu.: 4.000 1st Qu.: 0.43925 1st Qu.:0.5024 1st Qu.:-0.15074
## Median : 7.000 Median : 0.64743 Median :0.6943 Median :-0.01455
## Mean : 6.527 Mean : 0.77987 Mean :0.8156 Mean :-0.03571
## 3rd Qu.:10.000 3rd Qu.: 0.95903 3rd Qu.:0.9716 3rd Qu.: 0.10591
## Max. :12.000 Max. :11.47204 Max. :8.4723 Max. : 3.40415
## daysfromhday daystilhday SMA50_200Factor E_SMA2xb
## Min. : 0.00 Min. : 0.00 Min. :0.0000 Min. :0.0000
## 1st Qu.: 7.00 1st Qu.: 8.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 16.00 Median : 17.00 Median :1.0000 Median :0.0000
## Mean : 18.47 Mean : 19.36 Mean :0.7016 Mean :0.2326
## 3rd Qu.: 27.00 3rd Qu.: 28.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :107.00 Max. :108.00 Max. :1.0000 Max. :1.0000
## E_SMA2xa E_SMA5xb E_SMA5xa E_SMA10xb
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.2326 Mean :0.1175 Mean :0.1175 Mean :0.07649
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## E_SMA10xa E_SMA20xb E_SMA20xa E_SMA50xb
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.07649 Mean :0.05293 Mean :0.05293 Mean :0.03077
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## E_SMA50xa E_SMA200xb E_SMA200xa E_SMA50_200xb
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.000000
## Mean :0.03077 Mean :0.01154 Mean :0.01154 Mean :0.001953
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.000000
## E_SMA50_200xa ret
## Min. :0.000000 Min. :-20.46693
## 1st Qu.:0.000000 1st Qu.: -0.41200
## Median :0.000000 Median : 0.04562
## Mean :0.002014 Mean : 0.03310
## 3rd Qu.:0.000000 3rd Qu.: 0.49694
## Max. :1.000000 Max. : 11.58004
# Chronological split: the first half trains the model, the second half tests it
train.idx <- seq_len(floor(nrow(data) / 2))
train <- data[train.idx, ]
test <- data[-train.idx, ]
The First Model
We have 16381 observations with complete data. We use the first 8190 to train our model. The other 8191 are reserved for testing. For our first model we build a random forest named rf1. The model output also includes a measure of variable importance shown here.
## Random Forest
##
## 8190 samples
## 33 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8190, 8190, 8190, 8190, 8190, 8190, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared RMSE SD Rsquared SD
## 2 0.7571588 0.06379952 0.01136876 0.01300077
## 17 0.7482923 0.08316171 0.01311734 0.01443278
## 33 0.7523365 0.07760161 0.01434747 0.01598202
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 17.
## IncNodePurity
## Close 252.1173196
## SMA2 295.0860983
## SMA5 377.6223970
## SMA10 270.7152864
## SMA20 243.0050932
## SMA50 275.1455728
## SMA200 248.6737994
## SMA50_200 233.8618824
## BB 199.7404012
## wday 109.7844505
## tdayofmonth 147.3892576
## tdaysleft 145.0328119
## month 109.9277272
## vol5 499.4715378
## vol10 388.8036401
## volchg 423.6912030
## daysfromhday 175.4385722
## daystilhday 189.6794864
## SMA50_200Factor 9.8764698
## E_SMA2xb 21.3110409
## E_SMA2xa 21.2266711
## E_SMA5xb 13.8492434
## E_SMA5xa 12.9115335
## E_SMA10xb 7.7894632
## E_SMA10xa 7.6872256
## E_SMA20xb 6.2988590
## E_SMA20xa 8.5097551
## E_SMA50xb 4.5850951
## E_SMA50xa 5.5731876
## E_SMA200xb 2.7281166
## E_SMA200xa 4.7138217
## E_SMA50_200xb 0.6961828
## E_SMA50_200xa 0.6759492
Performance of the model
Our model essentially predicts the return for the day after tomorrow. For evaluation purposes, we want to consider an approach which might allow us to profit from the prediction. The obvious rule is to be long the market when the prediction is non-negative and short when the prediction is negative. This means our focus should be on how well we predict up and down days in the market. In the tables that follow, each row summarizes the actual returns for the days in that group, and % Neg is the percentage of those days with a negative return. We start with the results on our training (in-sample) data: we feed the training data into the rf1 model to predict the returns. Not surprisingly, since the model has already seen these observations, the results are excellent.
## Min. 1st Qu. Median Mean 3rd Qu. Max. % Neg
## Pred Neg -6.6760 -0.8007 -0.4234 -0.56670 -0.1908 0.7766 93.9
## Pred Pos -0.4193 0.1675 0.4059 0.53330 0.7368 5.0220 6.2
## All -6.6760 -0.3776 0.0388 0.02868 0.4451 5.0220 46.4
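A table of this shape can be produced by splitting the actual returns on the sign of the prediction. The sketch below is an assumption about how that might be done, not the original code; `sign_table`, `pred`, and `actual` are hypothetical names.

```r
# Summarise actual returns separately for predicted-negative and
# predicted-non-negative days, plus the share of days that were down.
sign_table <- function(pred, actual) {
  one_row <- function(x) {
    q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
    c("Min." = min(x), "1st Qu." = q[1], "Median" = q[2], "Mean" = mean(x),
      "3rd Qu." = q[3], "Max." = max(x), "% Neg" = 100 * mean(x < 0))
  }
  out <- rbind("Pred Neg" = one_row(actual[pred < 0]),
               "Pred Pos" = one_row(actual[pred >= 0]),  # non-negative counts as long
               "All"      = one_row(actual))
  round(out, 4)
}

pred   <- c(-1, -1, -1, 1, 1)  # toy predictions
actual <- c(-2, -1, 1, 3, 2)   # toy realised returns, in percent
sign_table(pred, actual)
```

With the fitted model, `pred` would be something like the output of `predict(rf1, train)` and `actual` the `ret` column of the same rows.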
We used the caret package to generate the model. It returns a set of predictions different from those above. I believe, but am not positive, that these predictions come from 10-fold cross-validation, in which 9/10 of the training data is used to build a model that predicts the remaining 1/10, repeated 10 times so that every observation has a prediction. Note, however, that the model output above reports bootstrapped resampling (25 reps), so these may instead be predictions for the observations left out of each bootstrap sample. Either way, if each prediction comes from a model that did not see that observation, this approach should give us an indication of how well the model will perform out of sample. The results of these predictions follow. They are worse than those above but still promising.
## Min. 1st Qu. Median Mean 3rd Qu. Max. % Neg
## Pred Neg -6.676 -0.6443 -0.1746 -0.18120 0.2630 4.756 59.8
## Pred Pos -6.618 -0.1631 0.1654 0.19400 0.5352 5.022 35.8
## All -6.676 -0.3776 0.0388 0.02868 0.4451 5.022 46.4
Out of Sample Results
Now we use the rf1 model to predict returns for the test data, which the model has not seen before. It’s worth noting that our training and test sets each span over 30 years, so over half the predictions below come from a model trained on data more than a decade old. Even so, the results indicate that we are able to discriminate between positive and negative days. Negative days make up a higher percentage of the days we predict will be negative (52.5%) than of the overall data (46.5%), and only 41.5% of the returns are negative on the days we predict will be positive. The model did fail to predict that Black Monday in October 1987 would be negative.
## Min. 1st Qu. Median Mean 3rd Qu. Max. % Neg
## Pred Neg -8.807 -0.6364 -0.04975 -0.10640 0.4488 5.417 52.5
## Pred Pos -20.470 -0.3246 0.13530 0.15580 0.6528 11.580 41.5
## All -20.470 -0.4579 0.05660 0.03752 0.5656 11.580 46.5
From an economic standpoint, if we had invested $1 in the market (buy and hold) it would have grown to $12.74. If we had invested $1 in a strategy that invested in the market when our prediction was positive and shorted it when it was negative, it would have grown to $830.68 over the same period.
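The growth figures come from compounding daily returns. Here is a minimal sketch of that arithmetic, assuming daily rebalancing, percentage returns, and no dividends, transaction costs, or borrowing costs on the short side; the function names are hypothetical.

```r
# Growth of $1 from holding the market every day.
buy_hold_growth <- function(actual_pct) {
  prod(1 + actual_pct / 100)
}

# Growth of $1 from being long on non-negative predictions and short otherwise.
strategy_growth <- function(pred, actual_pct) {
  position <- ifelse(pred >= 0, 1, -1)
  prod(1 + position * actual_pct / 100)
}

pred   <- c(1, -1, 1)      # toy predictions
actual <- c(10, -10, -10)  # toy realised returns, in percent
buy_hold_growth(actual)
strategy_growth(pred, actual)
```

Because the short side earns the negative of the market return, every correctly called down day compounds in the strategy's favor, which is how a modest edge in classifying days can grow into a large gap over several decades.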
Remarks
In the variable importance, I note that the price (Close) was left in and seems to be important. I find it peculiar that this variable would be of value. Another model that leaves out many variables produces almost identical performance.
Transaction costs are an issue. We need to analyze whether there are streaks of predicted negative values, because streaks reduce transaction costs. If every other day is predicted to be negative, we’d execute more frequently and incur higher costs than if we had 3 days of negative followed by 3 days of positive predictions.
This model is only a simple random forest. Other algorithms might do better.
This approach uses one period of data to predict returns. We’d likely be better off expanding our window. Roughly, this means using the data from 1950-1980 to predict 1981, 1950-1981 to predict 1982, and so forth.
We might be able to profit more with the use of leverage or by scaling our investment based on the size of the predicted return.
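Two of the remarks above, streaks of predictions and the expanding window, can be sketched in base R. Both helpers are hypothetical illustrations under simple assumptions (one position change counts as one trade; splits are by calendar year), not part of the original analysis.

```r
# Count how often the long/short position flips, and the average streak length.
count_trades <- function(pred) {
  position <- ifelse(pred >= 0, 1, -1)
  runs <- rle(position)
  list(switches    = length(runs$lengths) - 1,
       mean_streak = mean(runs$lengths))
}

# Expanding-window splits: train on every year before y, test on year y.
expanding_splits <- function(years, first_test_year) {
  test_years <- sort(unique(years[years >= first_test_year]))
  lapply(test_years, function(y) {
    list(test_year = y,
         train     = which(years < y),
         test      = which(years == y))
  })
}

count_trades(c(-1, 1, -1, 1, -1, 1))$switches  # alternating signs: 5 switches
count_trades(c(-1, -1, -1, 1, 1, 1))$switches  # 3 down then 3 up: 1 switch

years <- c(1950, 1950, 1951, 1952, 1952)       # toy year labels, one per row
str(expanding_splits(years, 1951))
```

Each element of the list returned by `expanding_splits` gives the row indices for one refit, so a model would be trained on `train` and scored on `test` for each year in turn.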