A BOOTSTRAP APPROACH FOR IMPROVING LOGISTIC REGRESSION PERFORMANCE IN IMBALANCED DATA SETS

Authors

  • Michael Chang Department of Mathematical Sciences, University of Nevada Las Vegas, Las Vegas, United States of America (USA)
  • Rohan J. Dalpatadu Department of Mathematical Sciences, University of Nevada Las Vegas, Las Vegas, United States of America (USA)
  • Dieudonne Phanord Department of Mathematical Sciences, University of Nevada Las Vegas, Las Vegas, United States of America (USA)
  • Ashok K. Singh William F. Harrah College of Hotel Administration, University of Nevada Las Vegas, Las Vegas, United States of America (USA)

DOI:

https://doi.org/10.20319/mijst.2018.43.1124

Keywords:

Binary Response, Prediction, SMOTE, Under-sampling, Over-sampling, Confusion Matrix, Accuracy, Precision, Recall, F1-measure

Abstract

In an imbalanced dataset with a binary response, the percentages of successes and failures are not approximately equal. In many real-world situations, the majority of observations are “normal” (i.e., successes), with a much smaller fraction of failures. The overall probability of correct classification for an extremely imbalanced data set can be very high, yet the probability of correctly predicting the minority class can be very low. Consider a fictitious example of a dataset with 1,000,000 observations, of which 999,000 are successes and 1,000 are failures. A rule that classifies all observations as successes will have a very high prediction accuracy (99.9%), but the probability of correctly predicting a failure will be 0. In many situations, the cost associated with incorrectly predicting a failure is high, and it is therefore important to improve the prediction accuracy for failures as well. The literature suggests that over-sampling the minority class with replacement does not necessarily predict the minority class with higher accuracy. In this article, we propose a simple over-sampling method that bootstraps a subset of the minority class, and we illustrate this bootstrap over-sampling method with several examples. In each of these examples, an improvement in prediction accuracy is seen.
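To make the idea concrete, the following is a minimal sketch in Python (assuming scikit-learn and NumPy), not the authors' exact procedure: it bootstraps (samples with replacement) minority-class training rows before refitting a logistic regression. The synthetic data, the decision to resample to exact class balance, and all parameter settings are illustrative assumptions; the paper itself bootstraps a subset of the minority class.

# Illustrative sketch only: bootstrap the minority class of the training data
# before fitting logistic regression, then compare per-class metrics.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced data: roughly 2% of observations in the minority class (y = 1).
X, y = make_classification(n_samples=20000, n_features=10, weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: logistic regression fitted to the imbalanced training data.
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Bootstrap over-sampling: draw minority-class rows with replacement until the
# two classes are balanced, then append the resampled rows to the training set.
minority = np.where(y_tr == 1)[0]
majority = np.where(y_tr == 0)[0]
boot_idx = rng.choice(minority, size=len(majority) - len(minority), replace=True)
X_bal = np.vstack([X_tr, X_tr[boot_idx]])
y_bal = np.concatenate([y_tr, y_tr[boot_idx]])
boot = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Overall accuracy alone can stay high while minority-class recall is near zero,
# so report precision, recall, and F1 for each class under both fits.
print("Baseline:\n", classification_report(y_te, base.predict(X_te), digits=3))
print("Bootstrap over-sampled:\n", classification_report(y_te, boot.predict(X_te), digits=3))

On a held-out test set, the baseline fit will typically show high overall accuracy but low recall for the minority class, while the bootstrapped fit trades a small amount of overall accuracy for substantially higher minority-class recall, which is the behavior the proposed method is designed to produce.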

References

Allison, P. (2012, February 13). Logistic regression for rare events. Statistical Horizons. Retrieved from https://statisticalhorizons.com/logistic-regression-for-rare-events

Bozorgi, M., Taghva, K., & Singh, A. (2017). Cancer Survivability with Logistic Regression. Computing Conference 2017 (18–20 July 2017), London, UK. https://ieeexplore.ieee.org/document/8252133/citations

Catanghal, R. A., Jr., Palaoag, T. D., & Malicdem, A. R. (2017). Crowdsourcing approach for disaster response assessment. Matter: International Journal of Science and Technology.

Chawla, N. V. (2005). Data mining for imbalanced data: An overview. In O. Maimon & L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook (pp. 853–867). New York: Springer.

Chawla, N. V., Bowyer, K., Hall, L., & Kegelmeyer, W. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357.

Concato, J., & Feinstein, A. R. (1997). Monte Carlo methods in clinical research: Applications in multivariable analysis. Journal of Investigative Medicine, 45(6), 394–400.

Crone, S. F., & Finlay, S. (2012). Instance sampling in credit scoring: An empirical study of sample size and balancing. International Journal of Forecasting, 28, 224–238.

Efron, B., & Tibshirani, R. (1986). Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1(1), 54–75.

Efron, B., & Tibshirani, R. (1991). Statistical Data Analysis in the Computer Age. Science, 253, 390–395.

Fox, J., & Monette, G. (1992). Generalized collinearity diagnostics. Journal of the American Statistical Association, 87(417), 178–183.

Guillet, F., & Hamilton, H. J. (Eds.). (2007). Quality measures in data mining (Vol. 43). New York: Springer.

Keleş, M. K. (2017). An overview: The impact of data mining applications on various sectors. Technical Journal, 11(3), 128–132.

King, G., & Zeng, L. (2001). Logistic Regression in Rare Events Data. Political Analysis, 9(2), 137–163.

Namvar, A., Siami, M., Rabhi, F., & Naderpour, M. (2018). Credit risk prediction in an imbalanced social lending environment. arXiv preprint arXiv:1805.00801.

Peduzzi, P., Concato, J., Kemper, E., Holford, T. R., & Feinstein, A. R. (1996). A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology, 49(12), 1373–1379.

Ramadhan, M. M., Sitanggang, I. S., & Anzani, L. P. (2017). Classification model for hotspot sequences as indicator for peatland fires using data mining approach. Matter: International Journal of Science and Technology, Special Issue, 3(2), 588–597.

Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45(4), 427–437.

Syaifudin, Y. W., & Puspitasari, D. (2017). Twitter data mining for sentiment analysis on people's feedback against government public policy. Matter: International Journal of Science and Technology, Special Issue, 3(1), 110–122.

Vittinghoff, E., & McCulloch, C. E. (2007). Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression. American Journal of Epidemiology, 165(6), 710–718. https://doi.org/10.1093/aje/kwk052

Wei, Q., & Dunbrack, R. L., Jr. (2013). The Role of Balanced Training and Testing Data Sets for Binary Classifiers in Bioinformatics. PLoS ONE, 8(7), e67863.


Published

2018-11-15

How to Cite

Chang, M., Dalpatadu, R. J., Phanord, D., & Singh, A. K. (2018). A BOOTSTRAP APPROACH FOR IMPROVING LOGISTIC REGRESSION PERFORMANCE IN IMBALANCED DATA SETS. MATTER: International Journal of Science and Technology, 4(3), 11–24. https://doi.org/10.20319/mijst.2018.43.1124