Title: Impact of Feature Types on Boundary Detection Between Metadata and Body in Persian Theses
Authors: Nima Shadman, Jalal A. Nasiri
Abstract
This paper addresses the challenge of distinguishing header metadata from body content in Persian electronic theses and dissertations (ETDs). Accurately classifying these sections supports tasks such as metadata extraction from scientific documents and plays a crucial role in improving the efficiency of information retrieval in digital libraries. Several machine learning models were employed to achieve this goal, using five distinct feature types: heuristic, sequential, lexical, formatting, and geometric. The dataset consisted of nearly 230,000 paragraphs extracted from 106 Persian ETDs, with the metadata class representing only 8.6%. After preprocessing, Random Forest slightly outperformed SVM and Naïve Bayes. Moreover, our findings indicate that sequential features notably affect the classification metrics.
Keywords
Paragraph Classification, Metadata Extraction, Persian Scientific Documents, Features Fusion
@inproceedings{paperid:1102452,
author = {Shadman, Nima and Nasiri, Jalal A.},
title = {Impact of Feature Types on Boundary Detection Between Metadata and Body in Persian Theses},
booktitle = {The First International Conference on Machine Learning and Knowledge Discovery (MLKD 2024)},
year = {2024},
location = {Tehran, Iran},
keywords = {Paragraph Classification; Metadata Extraction; Persian Scientific Documents; Features Fusion},
}
%0 Conference Proceedings
%T Impact of Feature Types on Boundary Detection Between Metadata and Body in Persian Theses
%A Shadman, Nima
%A Nasiri, Jalal A.
%B The First International Conference on Machine Learning and Knowledge Discovery (MLKD 2024)
%D 2024