This is a postscript to the previous entry, in which I looked at the performance of 12 forests. While looking more deeply into the performance, I came across a mistake: I included "COMPANY_ID", a unique identifier for each company, as a variable. Including it could not have helped model performance. However, because it was treated as a factor, it created over a thousand new variables (which explains the high variable count). I suspect this had little influence on performance. It probably slowed the run, since more variables had to be considered. It may also have hurt performance, as better variables were crowded out of some trees. I am re-running the models and we will see.
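To illustrate how an identifier treated as a factor inflates the variable count, here is a minimal sketch in plain Python. It is not the code used for these forests: the toy data (1,200 companies across 10 sectors), the "SECTOR" column, and the helper names `dummy_column_counts` and `flag_identifier_like` are all made up for illustration. It assumes one indicator column per distinct level; some encodings drop one reference level, which would give one fewer.

```python
from collections import defaultdict


def dummy_column_counts(rows, factor_columns):
    """Count the indicator columns a one-per-level encoding would create for each factor column."""
    levels = defaultdict(set)
    for row in rows:
        for col in factor_columns:
            levels[col].add(row[col])
    return {col: len(levels[col]) for col in factor_columns}


def flag_identifier_like(rows, factor_columns, threshold=0.5):
    """Return factor columns whose distinct-level count exceeds threshold * number of rows.

    The 0.5 threshold is an arbitrary assumption, not a recommended value.
    """
    n = len(rows)
    counts = dummy_column_counts(rows, factor_columns)
    return [col for col, k in counts.items() if n > 0 and k / n > threshold]


# Hypothetical toy data: 1,200 companies spread evenly over 10 sectors.
rows = [{"COMPANY_ID": f"C{i:04d}", "SECTOR": f"S{i % 10}"} for i in range(1200)]

counts = dummy_column_counts(rows, ["COMPANY_ID", "SECTOR"])
flagged = flag_identifier_like(rows, ["COMPANY_ID", "SECTOR"])

print(counts)   # levels per factor column
print(flagged)  # columns that look like unique identifiers
```

In this toy data the identifier contributes one variable per company while the sector contributes only ten, so a check like this run before fitting would single out the identifier column.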
There's another lesson I need to pay closer attention to: when something seems off, it is off. I noted the oddly high variable count in the previous post, but I was too quick to attribute the increase to sector, industry, and other factor variables. The increase was too large for those to explain.