An NLP-based method for information retrieval and integration in Swedish Language
Dalarna University, School of Information and Engineering, Microdata Analysis.
2024 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

This research explores the feasibility of extracting information from Swedish PDF files and providing specific answers to user questions related to the Swedish Transport Administration (Trafikverket). It proposes an open-domain question-answering (QA) system based on the Retrieval-Augmented Generation (RAG) architecture and compares the relevance of two Large Language Models (LLMs): ChatGPT 3.5 and BERT. The implementation involves extracting text from Swedish PDF files, splitting the text into manageable chunks, and leveraging the LLMs to deliver accurate answers. Additionally, a verification dataset comprising 23 questions, 23 answers generated by the solution, and 23 reference answers was created to evaluate the answers. The solution's accuracy, as measured by the ROUGE score, was 82% for the GPT model. The BERT model was less effective, producing general and lengthy answers rather than specific ones, and was therefore discarded after its output was assessed. In conclusion, the study found GPT 3.5 to be the more effective LLM for answering questions about the Swedish Transport Administration (Trafikverket) in Swedish, demonstrating higher accuracy and relevance in delivering specific answers.
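The ROUGE evaluation described above can be illustrated with a minimal sketch. The abstract does not specify which ROUGE variant was used, so this example computes ROUGE-1 (unigram overlap F-measure), one common choice, comparing a generated answer against a reference answer:

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """ROUGE-1 F-measure: unigram overlap between a reference
    answer and a candidate (generated) answer."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per shared token
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

# An identical answer scores 1.0; a fully disjoint one scores 0.0.
print(rouge1_f("banvallen ska dräneras", "banvallen ska dräneras"))  # → 1.0
```

In the thesis setting, this score would be averaged over the 23 (generated, reference) answer pairs; the sample sentences here are hypothetical.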

Place, publisher, year, edition, pages
2024.
Keywords [en]
Large Language Model (LLM); Generative Pre-trained Transformer (GPT); Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores; Retrieval-Augmented Generation (RAG); Open-Domain Question Answering (Open QA) solution
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:du-49225
OAI: oai:DiVA.org:du-49225
DiVA, id: diva2:1889885
Subject / course
Microdata Analysis
Available from: 2024-08-16 Created: 2024-08-16

Open Access in DiVA

fulltext (2204 kB), 183 downloads
File information
File name: FULLTEXT01.pdf
File size: 2204 kB
Checksum SHA-512: 120df8571a6515a430c1ea2882d1076cf33cae2b2d9fecf7135a66aa54fbd38cc224e4baf927969048399476d7488897b2ecde4dca562ca8c03d50dfd4bd1167
Type: fulltext
Mimetype: application/pdf

By organisation
Microdata Analysis
Computer and Information Sciences

Total: 183 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 381 hits