How do I deal with non-IID data in gradient boosting?

Dear community,
I am working on a stock market decision system, and I have currently settled on gradient boosting as the most promising machine learning approach for the problem.
However, I have two fundamental issues with my data, both stemming from the fact that stock market data are not IID.
First, because some of my indicators are averages over long windows, nearby data points are highly correlated. For example, the 2-year trailing return of a stock is not very different from its value a month earlier. My understanding is that this calls for a sampling scheme (for ensembles) in which I choose data points that are "far apart" in time, so that the individual trees are more independent. As far as I can tell, MATLAB does not have built-in functionality to draw random subsamples under that kind of constraint. When I was previously considering simple bagging, I figured I would just build the trees myself on custom subsamples and aggregate them into an ensemble, but that won't work for gradient boosting, because each tree is fit to the residuals of the previous ones.
That said, I am not completely sure that having the samples "far apart" is critical. My intuition is that it helps, but perhaps choosing the right subsampling fraction and using enough trees gives much the same result even with ordinary random sampling. I would love any insight on that, and on how I might use LSBoost in MATLAB with custom samples; I have sketched the two approaches I am weighing below.
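Here is a minimal sketch of what I mean, assuming my data sit in a table T whose rows are ordered by time, with the response in a column Ret (the table name, column names, and all the tuning numbers are just placeholders). Option A relies on the 'Resample'/'FResample'/'Replace' options of fitrensemble, which, if I read the documentation correctly, make LSBoost subsample a random fraction of rows at each boosting iteration; Option B thins the data by hand so the retained rows are spaced out in time.

% Option A: stochastic gradient boosting. As far as I can tell from the docs,
% 'Resample','FResample','Replace' make LSBoost draw a random 50% of the rows
% (without replacement) at every boosting iteration.
mdlA = fitrensemble(T, 'Ret', ...
    'Method', 'LSBoost', ...
    'NumLearningCycles', 500, ...
    'LearnRate', 0.05, ...
    'Learners', templateTree('MaxNumSplits', 8), ...
    'Resample', 'on', 'FResample', 0.5, 'Replace', 'off');

% Option B: thin the data myself so the retained rows are spaced in time,
% which reduces the overlap caused by long trailing-window indicators,
% then boost on the thinned set. The spacing of 21 rows (~1 trading month)
% is just a guess.
gap  = 21;
idx  = 1:gap:height(T);
mdlB = fitrensemble(T(idx, :), 'Ret', ...
    'Method', 'LSBoost', ...
    'NumLearningCycles', 500, ...
    'LearnRate', 0.05, ...
    'Learners', templateTree('MaxNumSplits', 8));

Does the random subsampling in Option A effectively substitute for the "far apart" sampling in Option B, or is the explicit time spacing still worth doing?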
The second fundamental problem is that data from a given stock are correlated with themselves. After thinking about it, I realized this is of critical importance. Consider: given enough data, it would likely be better to make a prediction for stock A from training data drawn only, or mostly, from stock A than from the entire market. So I had been thinking of a "system" where I train three models, one on stock-specific data, one on stock-group data (where I use a special algorithm to group stocks), and one on the entire market, and then use a calculation (I can elaborate if anyone is interested) that determines which of these models is more likely to give the better result. For example, if the input looks very different from the stock-specific training data, the system would fall back to the group or whole-market model. I am fairly convinced that taking into account which stock the system is looking at, in some form, is important for optimizing performance. A sketch of the training side of this idea is below.
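To make this concrete, here is roughly what I am imagining for the training side. Everything here is a placeholder: I assume T has a categorical Ticker column, a GroupID column produced by my grouping algorithm, the predictor columns, and the response Ret; 'AAPL' is just an example ticker.

tkr   = 'AAPL';                                     % example ticker, placeholder
preds = setdiff(T.Properties.VariableNames, {'Ret', 'Ticker', 'GroupID'});

isStock = T.Ticker == tkr;                          % rows for this stock only
isGroup = T.GroupID == T.GroupID(find(isStock, 1)); % rows for its group

% Same boosting settings for all three models (the numbers are placeholders);
% Ticker and GroupID are deliberately excluded from the predictors.
fitLS = @(rows) fitrensemble(T(rows, :), 'Ret', ...
    'Method', 'LSBoost', 'NumLearningCycles', 300, 'LearnRate', 0.05, ...
    'PredictorNames', preds);

mdlStock  = fitLS(isStock);            % stock-specific model
mdlGroup  = fitLS(isGroup);            % stock-group model
mdlMarket = fitLS(true(height(T), 1)); % whole-market model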
Now, on this second issue, the question is how best to organize it. Naively, it would be nice to simply feed the predictor a categorical variable indicating which stock it is looking at. However, from what I know about these algorithms, I believe this would generalize poorly to new data, because the model would implicitly assume it has seen the full range of outcomes for each stock, which often isn't the case. (Say a stock has only a one-year history consisting of a big rally: the system will expect the rally to continue no matter how different the new data look.) So I feel I have to do something like the scheme in the previous paragraph. What I don't know is whether there is a way for the system to "automatically" recognize when new data are sufficiently similar to the stock-specific training data to trust the stock-specific prediction, versus when they are different enough that it should fall back to the default multi-stock model. One possibility I have been toying with is sketched below.
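This continues the sketch above (mdlStock, mdlGroup, isStock, and preds as defined there; Tnew is a one-row table of new data with the same predictor columns). The idea is a Mahalanobis-distance check of the new feature vector against the stock-specific training cloud, with a chi-square quantile as the threshold; the 0.95 level and the fallback order are guesses on my part.

Xstock = T{isStock, preds};              % stock-specific training features
xnew   = Tnew{1, preds};                 % new observation, same predictors

% mahal needs more stock-specific rows than predictors for the covariance
% estimate to be usable.
d2  = mahal(xnew, Xstock);               % squared Mahalanobis distance
thr = chi2inv(0.95, numel(preds));       % "typical" distance at this dimension

if d2 <= thr
    yhat = predict(mdlStock, Tnew);      % new input looks like familiar territory
else
    yhat = predict(mdlGroup, Tnew);      % otherwise fall back to the group model
end

Is something along these lines reasonable, or is there a more standard way to gate between models like this in MATLAB?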
If you have any insights on these issues and/or how to address them in MATLAB, I would very much appreciate it. Thanks in advance.
Best, Mike
