The First International Conference on Machine Learning and Knowledge Discovery (MLKD 2024), 2024-12-18

Title: Impact of Feature Types on Boundary Detection Between Metadata and Body in Persian Theses

Authors: Nima Shadman, Jalal A. Nasiri

Citation: BibTeX | EndNote

Abstract

This paper addresses the challenge of distinguishing header metadata from body content in Persian electronic theses and dissertations (ETDs). Accurate classification of these sections supports tasks such as metadata extraction from scientific documents and plays a crucial role in improving the efficiency of information retrieval in digital libraries. Several machine learning models were employed to achieve this goal, using five distinct feature types: Heuristic, Sequential, Lexical, Formatting, and Geometric. The dataset consisted of nearly 230,000 paragraphs extracted from 106 Persian ETDs, with the metadata class representing only 8.6% of them. After preprocessing, Random Forest slightly outperformed SVM and Naïve Bayes. Moreover, our findings indicate that sequential features notably affect the classification metrics.
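
The class proportions quoted in the abstract can be turned into a quick arithmetic check of why plain accuracy is a weak yardstick for this task. The sketch below is illustrative only and is not taken from the paper: the paragraph total is the rounded figure from the abstract, and the "always predict body" baseline and the helper name `precision_recall_f1` are assumptions introduced here.

```python
# Figures from the abstract; "nearly 230,000" is rounded, so counts are approximate.
TOTAL_PARAGRAPHS = 230_000
METADATA_SHARE = 0.086

metadata = round(TOTAL_PARAGRAPHS * METADATA_SHARE)  # approximate metadata paragraphs
body = TOTAL_PARAGRAPHS - metadata                   # approximate body paragraphs


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the metadata class (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical baseline (not from the paper): label every paragraph as body.
baseline_accuracy = body / TOTAL_PARAGRAPHS
baseline_prf = precision_recall_f1(tp=0, fp=0, fn=metadata)

print(f"metadata paragraphs ~ {metadata}, body paragraphs ~ {body}")
print(f"baseline accuracy = {baseline_accuracy:.3f}")
print(f"baseline precision/recall/F1 on metadata = {baseline_prf}")
```

Under these assumptions the trivial baseline is right on about 91% of paragraphs while finding no metadata at all, which is why per-class precision, recall, and F1 are more informative than accuracy when comparing classifiers on data this imbalanced.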

Keywords

Paragraph Classification, Metadata Extraction, Persian Scientific Documents, Features Fusion

@inproceedings{paperid:1102452,
author = {Shadman, Nima and Nasiri, Jalal A.},
title = {Impact of Feature Types on Boundary Detection Between Metadata and Body in Persian Theses},
booktitle = {The First International Conference on Machine Learning and Knowledge Discovery (MLKD 2024)},
year = {2024},
location = {Tehran, Iran},
keywords = {Paragraph Classification; Metadata Extraction; Persian Scientific Documents; Features Fusion},
}


%0 Conference Proceedings
%T Impact of Feature Types on Boundary Detection Between Metadata and Body in Persian Theses
%A Shadman, Nima
%A Nasiri, Jalal A.
%J The First International Conference on Machine Learning and Knowledge Discovery (MLKD 2024)
%D 2024
