A BOOTSTRAP APPROACH FOR IMPROVING LOGISTIC REGRESSION PERFORMANCE IN IMBALANCED DATA SETS
DOI:
https://doi.org/10.20319/mijst.2018.43.1124
Keywords:
Binary Response, Prediction, SMOTE, Under-sampling, Over-sampling, Confusion Matrix, Accuracy, Precision, Recall, F1-measure
Abstract
In an imbalanced data set with a binary response, the percentages of successes and failures are not approximately equal. In many real-world situations, the majority of the observations are “normal” (i.e., successes), with a much smaller fraction of failures. For extremely imbalanced data sets, the overall probability of correct classification can be very high while the probability of correctly predicting the minority class is very low. Consider a fictitious data set with 1,000,000 observations, of which 999,000 are successes and 1,000 are failures. A rule that classifies all observations as successes has a very high prediction accuracy (99.9%), but its probability of correctly predicting a failure is 0. In many situations, the cost of incorrectly predicting a failure is high, so it is important to improve the prediction accuracy for failures as well. The literature suggests that over-sampling the minority class with replacement does not necessarily yield higher prediction accuracy for the minority class. In this article, we propose a simple over-sampling method that bootstraps a subset of the minority class, and we illustrate it with several examples. In each of these examples, an improvement in prediction accuracy is seen.
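The arithmetic of the fictitious example, together with the general idea of bootstrapping a subset of the minority class, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the article: the abstract does not say how the subset is chosen or how many bootstrap draws are added, so the function bootstrap_oversample and its parameters subset_fraction, n_extra, and seed are hypothetical. Successes are coded as 1 and failures as 0.

```python
import random


def accuracy_and_recall(actual, predicted, positive):
    """Return (overall accuracy, recall for the class `positive`)."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    true_pos = sum(1 for a, p in zip(actual, predicted)
                   if a == positive and p == positive)
    n_pos = sum(1 for a in actual if a == positive)
    accuracy = correct / len(actual)
    recall = true_pos / n_pos if n_pos else 0.0
    return accuracy, recall


def bootstrap_oversample(rows, labels, minority,
                         subset_fraction=0.5, n_extra=100, seed=0):
    """Append n_extra minority rows drawn with replacement from a random
    subset of the minority class.

    Assumption: subset_fraction and n_extra are illustrative tuning
    parameters; the abstract does not specify them.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    if not minority_idx:
        raise ValueError("no observations of the minority class")
    subset_size = max(1, int(subset_fraction * len(minority_idx)))
    # Step 1: pick a subset of the minority class (without replacement).
    subset = rng.sample(minority_idx, subset_size)
    # Step 2: bootstrap that subset (sampling with replacement).
    drawn = rng.choices(subset, k=n_extra)
    new_rows = list(rows) + [rows[i] for i in drawn]
    new_labels = list(labels) + [labels[i] for i in drawn]
    return new_rows, new_labels


# The fictitious data set: 999,000 successes (1) and 1,000 failures (0).
actual = [1] * 999_000 + [0] * 1_000
predicted = [1] * len(actual)  # rule: classify every observation as a success
acc, recall_failure = accuracy_and_recall(actual, predicted, positive=0)
print(acc, recall_failure)  # accuracy 0.999, recall for failures 0.0

# Toy over-sampling run: 8 successes, 2 failures, 6 bootstrap draws added.
toy_rows = [[float(i)] for i in range(10)]
toy_labels = [1] * 8 + [0] * 2
big_rows, big_labels = bootstrap_oversample(
    toy_rows, toy_labels, minority=0, subset_fraction=0.5, n_extra=6)
print(len(big_rows), big_labels.count(0))
```

In practice the over-sampling would be applied to the training data only, so that the recall for failures is measured on observations the model has not seen in duplicated form.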
License
Copyright of Published Articles
Author(s) retain the article copyright and publishing rights without any restrictions.
All published work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.