Use of machine learning in bankruptcy prediction with highly imbalanced datasets: The impact of sampling methods
Dalarna University, School of Information and Engineering.
2024 (English). Independent thesis, Advanced level (degree of Master (One Year)), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

Since Altman’s 1968 discriminant analysis model for corporate bankruptcy prediction, numerous studies have applied statistical and machine learning (ML) models to predict bankruptcy in various contexts. ML models have proven more accurate than statistical models at predicting bankruptcy up to three years before the event. A major limitation of ML models is that they handle highly imbalanced datasets poorly, which has led to the development of a plethora of oversampling and undersampling methods for addressing class imbalance. However, current research on the impact of different sampling methods on the predictive performance of ML models is fragmented, inconsistent, and limited. This thesis investigated whether the choice of sampling method led to significant differences in the performance of five predictive algorithms: logistic regression, multiple discriminant analysis (MDA), random forests, Extreme Gradient Boosting (XGBoost), and support vector machines (SVM). Four oversampling methods (random oversampling (ROWR), synthetic minority oversampling technique (SMOTE), oversampling based on propensity scores (OBPS), and oversampling based on weighted nearest neighbour (WNN)) and three undersampling methods (random undersampling (RU), undersampling based on clustering from nearest neighbour (CFNN), and undersampling based on clustering from Gaussian mixture methods (GMM)) were tested. The dataset consisted of non-listed Swedish restaurant businesses (1998–2021) obtained from the business registry of Sweden, comprising 10,696 companies with 335 bankrupt instances. Results, assessed through 10-fold cross-validated AUC scores, reveal that oversampling methods generally outperformed undersampling methods. SMOTE performed highest in four of five algorithms, while WNN performed highest with the random forest model.
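To make the two simplest sampling methods named above concrete, the following is a minimal sketch of random oversampling with replacement (ROWR) and random undersampling (RU). It is not the thesis's implementation: the function names, the single placeholder feature, and the choice to balance the classes exactly 1:1 are assumptions made for illustration. Only the class counts (10,696 companies, 335 bankrupt) come from the abstract.

```python
import random
from collections import Counter


def random_oversample(X, y, minority=1, seed=0):
    """ROWR sketch: duplicate random minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    # Draw the missing minority rows with replacement.
    extra = rng.choices(minority_idx, k=len(majority_idx) - len(minority_idx))
    keep = majority_idx + minority_idx + extra
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]


def random_undersample(X, y, minority=1, seed=0):
    """RU sketch: keep all minority rows and a random majority subset of the same size."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    # Sample majority rows without replacement.
    kept_majority = rng.sample(majority_idx, k=len(minority_idx))
    keep = minority_idx + kept_majority
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy labels with the class counts reported in the abstract.
n_total, n_bankrupt = 10_696, 335
y = [1] * n_bankrupt + [0] * (n_total - n_bankrupt)
X = [[float(i)] for i in range(n_total)]  # placeholder feature, NOT real company data

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print("original:    ", Counter(y))
print("oversampled: ", Counter(y_over))
print("undersampled:", Counter(y_under))
```

When a sketch like this is combined with cross-validation, a common practice is to resample only the training portion of each fold so that duplicated rows do not leak into the evaluation data. The abstract does not state how the thesis handled this.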
Results of Wilcoxon’s signed-rank test showed that some differences between oversampling and undersampling were statistically significant, but differences within each group were not. Further, results showed that while XGBoost had the highest AUC score of all predictive algorithms, it was also the most sensitive to the choice of sampling method, while MDA was the least sensitive. Overall, it was concluded that the choice of sampling method can significantly impact the performance of different algorithms, and thus users should consider both the algorithm’s sensitivity and the comparative performance of the sampling methods. The thesis’s results challenge some prior findings and suggest avenues for further exploration, highlighting the importance of selecting appropriate sampling methods when working with highly imbalanced datasets.
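As an illustration of how paired per-fold AUC scores can be compared with Wilcoxon's signed-rank test, here is a minimal sketch. The abstract does not say which variant of the test was used; this version assumes an exact two-sided test that drops zero differences and gives tied absolute differences their average rank. The function name and the two lists of AUC scores are invented for the example and are not results from the thesis.

```python
from itertools import product


def wilcoxon_signed_rank(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W, p) where W = min(W+, W-). Zero differences are dropped
    and tied absolute differences receive their average rank.
    """
    # Round to suppress floating-point noise before comparing differences.
    diffs = [round(x - y, 10) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Exact null distribution: every rank is positive or negative with
    # probability 1/2, so enumerate all 2**n sign assignments.
    total = sum(ranks)
    as_extreme = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= w:
            as_extreme += 1
    return w, as_extreme / 2 ** n


# HYPOTHETICAL 10-fold AUC scores for one algorithm under two sampling methods.
auc_oversampling = [0.86, 0.84, 0.88, 0.85, 0.87, 0.83, 0.89, 0.86, 0.85, 0.88]
auc_undersampling = [0.81, 0.82, 0.80, 0.84, 0.79, 0.80, 0.83, 0.78, 0.81, 0.79]

w_stat, p_value = wilcoxon_signed_rank(auc_oversampling, auc_undersampling)
print(f"W = {w_stat}, exact two-sided p = {p_value:.6f}")
```

The full enumeration is only practical for small samples such as ten folds. For larger samples a normal approximation or a library routine would be the usual choice.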

Place, publisher, year, edition, pages
2024.
Keywords [en]
Bankruptcy, class imbalance, oversampling, undersampling
National Category
Business Administration
Identifiers
URN: urn:nbn:se:du-48512
OAI: oai:DiVA.org:du-48512
DiVA, id: diva2:1857805
Subject / course
Microdata Analysis
Available from: 2024-05-14 Created: 2024-05-14 Last updated: 2025-03-11

Open Access in DiVA

Full text (1008 kB), 120 downloads
File information
File name: FULLTEXT01.pdf
File size: 1008 kB
Checksum (SHA-512): f639d0b9ed289b2bce5e8089f2c6c0d3a80b7d4573c4117e9f4c14cf7cbc59c56beeabf92e956672dc316892693529fe50be8059350523d17ba7b809b9cbf881
Type: fulltext
Mimetype: application/pdf

Total: 120 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 446 hits