What and why?

Imbalanced data (a dataset in which one class occurs far more often than the comparison class) can bias machine learning analyses. A few methods can remedy the issue, such as 1) oversampling the minority class and 2) undersampling the majority class. Oversampling, however, can lead to overfitting, which is itself problematic. Here is a solution I found on StackExchange:



I have modified and added more code on top of the StackExchange solution. Here we have 9 subjects (2 unhealthy and 7 healthy), making the data imbalanced. The approach is to take all of the minority-class samples (N=2) and use that count (N=2) as the number of samples to draw randomly from the majority class. At the end, the two subsamples are merged into one balanced dataset. However, undersampling has its own problems: the new sample is much less generalizable to the real world, since the class proportions are artificially constructed, and dropping majority-class cases risks losing too much relevant information. For a more detailed read on ways of dealing with imbalanced data, visit http://www.chioka.in/class-imbalance-problem/ and http://stats.stackexchange.com/questions/61622/by-using-smote-the-classification-of-the-validation-set-is-bad.
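Since the original StackExchange snippet is not reproduced here, below is a minimal sketch in plain Python of the undersampling step described above. The subject IDs and labels are made up for illustration; this is not the exact code from the post:

```python
import random

# Hypothetical data: 9 subjects, 2 unhealthy (minority) and 7 healthy (majority).
subjects = [("s1", "unhealthy"), ("s2", "unhealthy")] + [
    (f"s{i}", "healthy") for i in range(3, 10)
]

# Take every minority-class subject.
minority = [s for s in subjects if s[1] == "unhealthy"]
majority = [s for s in subjects if s[1] == "healthy"]

# Randomly draw the same number (N=2) from the majority class.
undersampled_majority = random.sample(majority, len(minority))

# Merge the two subsamples into one balanced dataset (N=4, 2 per class).
balanced = minority + undersampled_majority
```

Setting a seed (e.g. `random.seed(0)`) before sampling makes the subsample reproducible, which matters if you want to rerun the analysis on the same undersampled data.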
