Thursday, September 17, 2015

Data source (002.0)

Summary: AAII Stock Investor Pro is the primary data source. It includes data on over 8,000 U.S. companies (and ADRs). The data begins in 2003.
The stock data come from the American Association of Individual Investors' Stock Investor Pro (SIP) software. I’ve had a lifetime membership to the AAII for many years. For an additional, but more than reasonable, $198/yr, one can license SIP. What makes this source valuable is that it provides survivorship-bias-free historical data. Subscribers have access to the old software and data exactly as they were distributed, going back to 2003. The data include balance sheet, income statement, cash flow, price, and many calculated fields. The list of fields runs to 22 pages. In 2003, over 8,500 companies were covered.
For info on SIP, check out the AAII webpage and this presentation. I downloaded about 150 install files from the AAII archives page, access to which requires both a membership ($29) and a subscription. I installed them one by one, putting each into its own directory. I downloaded the month-end updates, though weekly data were sometimes available. I watched an entire season of Friends while doing this and probably lost three IQ points. Because this is licensed data, I am not making the raw data available on GitHub.
The AAII data files are in FoxPro/DBF format. Fortunately, R has the read.dbf function in the foreign package to handle this.
Biases in the data
*Survivorship - The AAII data themselves are free of survivorship bias in that they include companies that have since disappeared. This follows from using the data exactly as they were released at the time. That is, the January 31, 2003 data are just as AAII released them then. I do introduce some bias as I wrangle the data. Specifically, when I create the Y (dependent) variables, which are future returns, I struggle with companies that disappear before the evaluation horizon. For example, if working with data from the end of June 2003, I look at the September 2003 file to calculate the 3 month return. If a company in the June file is not in the September file, then I use August (and July if it’s not in August). These data files only have monthly data, so the future return calculations are problematic when companies disappear.
*Look-ahead - The data in the AAII files are the data that existed at the time of their release, so AAII is not committing any look-ahead bias. However, I use the month-end data files, and these may not have been available exactly at month end. I’m assuming buys and sells at month end even though the AAII data were not available then, so my own use of the data introduces a small look-ahead bias.
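The fallback rule for disappearing companies can be sketched in a few lines. This is a minimal illustration in Python, not the project's R code. The snapshot layout (one ticker-to-price dictionary per month-end file, keyed by a month index), the function name forward_return, and the sample prices are all assumptions made for the example.

```python
from typing import Dict, Optional

# Hypothetical layout: one {ticker: price} dict per month-end file,
# keyed by a month index (0 = June 2003, 1 = July 2003, ...).
Snapshots = Dict[int, Dict[str, float]]


def forward_return(snapshots: Snapshots, ticker: str,
                   start: int, horizon: int = 3) -> Optional[float]:
    """Return the simple price return from `start` to `start + horizon`.

    If the ticker is missing from the target month, fall back to the
    latest earlier month in which it still appears (September, then
    August, then July in the June 2003 example). Returns None when the
    ticker has no usable start price or never appears after `start`.
    """
    start_price = snapshots.get(start, {}).get(ticker)
    if start_price is None or start_price <= 0:
        return None
    # Try the target month first, then step back one month at a time.
    for month in range(start + horizon, start, -1):
        end_price = snapshots.get(month, {}).get(ticker)
        if end_price is not None:
            return end_price / start_price - 1.0
    return None


# Made-up prices: BBB drops out after August, CCC after July.
snapshots: Snapshots = {
    0: {"AAA": 10.0, "BBB": 20.0, "CCC": 5.0},  # June
    1: {"AAA": 11.0, "BBB": 18.0, "CCC": 4.0},  # July
    2: {"AAA": 12.0, "BBB": 15.0},              # August
    3: {"AAA": 12.5},                           # September
}

for name in ("AAA", "BBB", "CCC", "DDD"):
    print(name, forward_return(snapshots, name, start=0))
```

Note that a fallback return covers fewer than three months but is still recorded as the 3 month return, and the last available price may not reflect what a holder actually received when the company disappeared. That is one way to see where the bias described above comes from.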
Code Reference
convertdbf2rdata.r converts the dbf files to rdata files.

Finding Stock Market Winners (and Losers) - An Introduction (001.0)

Summary: Develop a model to predict winners and losers in the U.S. stock market using machine learning techniques.
This blog documents a “stock market winners” research project which marries machine learning and finance. The inspiration for this project is the 1988 article “The Anatomy of a Stock Market Winner” by Marc Reinganum published in the Financial Analysts Journal. The goal is to improve upon it using machine learning techniques.
Many stock screens have been created based on the premise that stocks with certain characteristics should outperform. For example, people might screen for stocks with low price-to-book ratios. Generally, people have an idea of what characteristics might lead to good future stock performance and then screen for those. The American Association of Individual Investors website has many such screens and their historical performance.
In contrast, Reinganum first identified stocks which had doubled over a 12 month period, and then searched for common characteristics among those. He identified nine. The goal here is to apply the power of today’s computers, along with machine learning techniques such as random forests, in a fashion similar to that of Reinganum. This research will consider time frames other than 12 months and will also search for stocks that are likely to perform badly (“Losers”) as candidates for short sales.