Thursday, November 5, 2015

Issues to Ponder

Thank you to those who have offered to collaborate or review. Here are some areas where I have run into questions, along with tasks I need to do soon. I'm probably missing some.

  • What to do about missing values (NAs)? I have a lot of them. Currently I replace them with median values. I briefly experimented with imputation using the MICE package and others, but the computation time appears to be enormous.
  • Parallel processing on a 4-core Windows 7 machine. I haven't gotten this to work yet, and it would help speed up the process.
  • Variable importance and feature selection. I have a lot of variables. I spent some time thinking about them, but I probably missed some. I'm also thinking that I should use the variable importance reported by the random forests to eliminate variables, but I'm not sure how to do this.
  • I'm using price appreciation, not total return, for the Y variable. Each month Stock Investor Pro provides the last 120 monthly prices for a stock, along with the dividends for the last 8 fiscal quarters. It might be OK the way it is, but it would be better to have total returns. I'll post more on this.
  • RStudio and GitHub (testing before merging). One collaborator has contributed code changes. If I understand the process, I need to fork those changes, test them, and then do something to merge them back.
  • Right now I equally weight the stock forecasts from the models built on each of the previous 12 months. Should I be using 12? Should I weight them equally? If not, how should I weight them? Let's say X(t) represents the features at a point in time. Perhaps I should find the X data prior to t that is most like X(t). That sounds like a nearest-neighbor approach. While the columns in X will represent the same features, the rows will vary as companies enter and leave.
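
On the total-return question, here is a minimal sketch of the arithmetic. It is written in Python only to keep it short, and the same calculation carries over to R. It assumes dividends are simply added to the ending price with no reinvestment, and that the dividends passed in were paid inside the same window as the prices. The function names and the sample numbers are made up for illustration and do not come from Stock Investor Pro.

```python
def price_appreciation(prices):
    """Simple price return over the window: last price / first price - 1."""
    return prices[-1] / prices[0] - 1.0


def total_return(prices, dividends):
    """Price return plus cash dividends paid during the window.

    Assumption: no reinvestment; dividends are just added to the ending price.
    prices: month-end prices, oldest first.
    dividends: per-share dividends paid inside the same window.
    """
    return (prices[-1] + sum(dividends) - prices[0]) / prices[0]


# Hypothetical one-year example: 13 month-end prices and 4 quarterly dividends.
prices = [50.0, 51.0, 49.5, 52.0, 53.0, 52.5, 54.0,
          55.0, 54.5, 56.0, 57.0, 56.5, 55.0]
dividends = [0.5, 0.5, 0.5, 0.5]

print(round(price_appreciation(prices), 4))       # price-only return
print(round(total_return(prices, dividends), 4))  # return including dividends
```

With only eight quarters of dividends available, a calculation like this could cover roughly the last two years of the 120-month price history, and fiscal quarters may not line up exactly with calendar months.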
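
On the weighting question, one possible way to make the nearest-neighbor idea concrete is sketched below, again in Python for brevity. Everything in it is an assumption rather than something I have tested on the real data: each month's X is summarized by its column medians (so months with different sets of companies can still be compared), months are compared by Euclidean distance between those summaries, and each monthly model gets a weight of 1 / (1 + distance). It also assumes NAs are already filled in and that the features are on comparable scales. The function names are hypothetical.

```python
import math
from statistics import median


def column_medians(rows):
    """Summarize a month's feature matrix by the median of each column."""
    return [median(col) for col in zip(*rows)]


def similarity_weights(past_months, current_month):
    """Weight each past month by how close its profile is to the current one.

    Assumption: weight = 1 / (1 + Euclidean distance between column medians),
    then normalized so the weights sum to 1.
    """
    target = column_medians(current_month)
    raw = [1.0 / (1.0 + math.dist(column_medians(rows), target))
           for rows in past_months]
    total = sum(raw)
    return [w / total for w in raw]


def combine(forecasts, weights):
    """Weighted average of the per-model forecasts for one stock."""
    return sum(f * w for f, w in zip(forecasts, weights))


# Toy data: three past months (instead of 12) with different numbers of companies.
past_months = [
    [[2.0, 3.0], [4.0, 5.0]],
    [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]],
    [[0.0, 0.0]],
]
current_month = [[-1.0, -2.0], [1.0, 2.0]]
forecasts = [0.02, 0.04, 0.06]  # one forecast per monthly model, same stock

weights = similarity_weights(past_months, current_month)
print([round(w, 3) for w in weights])
print(round(combine(forecasts, weights), 4))
```

Equal weighting is the special case where every weight is 1/12, so a scheme like this can be compared directly against what I do now.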

