Evaluation of Calibration Methods to Adjust for Infrequent Values in Data for Machine Learning
Dalarna University (Högskolan Dalarna), School of Technology and Business Studies, Microdata Analysis.
2018 (English). Independent thesis, advanced level (master's degree), 20 credits / 30 HE credits. Student thesis (degree project).
Abstract [en]

The performance of supervised machine learning algorithms is highly dependent on the distribution of the target variable. Infrequent values are more difficult to predict, as there are fewer examples for the algorithm to learn patterns that contain those values. These infrequent values are a common problem with real data, being the object of interest in many fields such as medical research, finance and economics, just to mention a few. Problems regarding classification have been comprehensively studied. For regression, on the other hand, few contributions are available. In this work, two ensemble methods from classification are adapted to the regression case. Additionally, existing oversampling techniques, namely SmoteR, are tested. Therefore, the aim of this research is to examine the influence of oversampling and ensemble techniques over the accuracy of regression models when predicting infrequent values. To assess the performance of the proposed techniques, two data sets are used: one concerning house prices, while the other regards patients with Parkinson's Disease. The findings corroborate the usefulness of the techniques for reducing the prediction error of infrequent observations. In the best case, the proposed Random Distribution Sample Ensemble reduced the overall RMSE by 8.09% and the RMSE for infrequent values by 6.44% when compared with the best performing benchmark for the housing data set.
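The abstract reports two kinds of error: an overall RMSE and an RMSE restricted to infrequent values, each expressed as a percentage reduction against a benchmark. The following is a minimal sketch of how such figures could be computed. It is not the thesis's code: the function names are hypothetical, the numbers are toy values rather than thesis data, and treating "infrequent" as "true value above a fixed threshold" is an assumption made only for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error over paired observations."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_infrequent(y_true, y_pred, threshold):
    """RMSE over observations whose true value exceeds `threshold`.

    Assumption: "infrequent" is approximated by a simple cutoff on the
    target; the thesis may define infrequent values differently.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t > threshold]
    if not pairs:
        raise ValueError("no observations above threshold")
    return rmse([t for t, _ in pairs], [p for _, p in pairs])


def relative_reduction(baseline, candidate):
    """Percentage by which `candidate` error is lower than `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


# Toy data (illustrative only): one rare, large target value.
y_true = [1.0, 2.0, 3.0, 10.0]
benchmark_pred = [1.0, 2.0, 3.0, 6.0]
candidate_pred = [1.0, 2.0, 3.0, 8.0]

overall_gain = relative_reduction(
    rmse(y_true, benchmark_pred), rmse(y_true, candidate_pred)
)
infrequent_gain = relative_reduction(
    rmse_infrequent(y_true, benchmark_pred, threshold=5.0),
    rmse_infrequent(y_true, candidate_pred, threshold=5.0),
)
print(overall_gain, infrequent_gain)
```

Reporting both figures side by side shows whether a gain on rare observations comes at the cost of accuracy on the rest of the data.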

Place, publisher, year, edition, pages
2018.
Keywords [en]
Data mining, resampling, ensemble.
National subject category
Economics and Business
Identifiers
URN: urn:nbn:se:du-28134
OAI: oai:DiVA.org:du-28134
DiVA, id: diva2:1231432
Available from: 2018-07-06 Created: 2018-07-06

Open Access in DiVA

Full text (1537 kB), 51 downloads
File information
File name: FULLTEXT01.pdf. File size: 1537 kB. Checksum (SHA-512):
c383b9bee92cb6b21b858561543a28fb4587434b16cd33e873e531fade40de6ba333399d47e1235b91ef1489f5106a1a16284c7e5af8e65f5d921be75511c0a8
Type: fulltext. Mimetype: application/pdf

Total: 51 downloads
The number of downloads is the sum of downloads for all full texts. It may include, for example, earlier versions that are no longer available.

Total: 112 hits