Optimizing Urdu Text Tokenization: Morphological Rules for Compound Word Identification

Authors

  • Saquib Khushhal, Dept. of Computer Science, University of Azad Jammu & Kashmir, Pakistan (Author)
  • Abdul Majid, Dept. of Computer Science, University of Azad Jammu & Kashmir, Pakistan (Author)
  • Ali Abbas, Dept. of Computer Science, University of Azad Jammu & Kashmir, Pakistan (Author)
  • Umza Naqvi, Dept. of Computer Science, University of Azad Jammu & Kashmir, Pakistan (Author)
  • Mohammad Babar, Department of Computing and Electronics Engineering, Middle East College, Muscat, Oman (Author)

DOI:

https://doi.org/10.5281/zenodo.15069872

Keywords:

Bigram, Compound words, Sentiment analysis, Tokenization, Word segmentation

Abstract

Tokenization is a primary natural language processing task for feature generation and plays a vital role in sentiment analysis, information retrieval, part-of-speech tagging, and named entity recognition. Urdu is spoken by around 170.2 million people worldwide as a first or second language, and it is morphologically and orthographically rich. Word tokenization of Urdu text is especially challenging because, unlike in many other languages, spaces alone do not reliably mark word boundaries. Tokens are the minimal units of a language that carry a suitable semantic structure. A compound word, a kind of multi-word expression, is a more complex word made up of multiple strings or independent base words. Traditionally, bigram or trigram approaches are used to represent compound words during tokenization. This research proposes a morphological rule-based approach to identifying compound words in Urdu text for tokenization. The proposed technique is thoroughly evaluated against the traditional approaches on a dataset of reasonable size. Results show that the proposed method accurately identifies compound words when tokenizing Urdu text documents. Notably, using morphological rule-based techniques for compound words reduces the number of extracted features.
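To make the contrast between bigram features and rule-based compound merging concrete, here is a minimal Python sketch. It is not the authors' rule set, which this page does not list. The two rules (a linking "و" between two words, and immediate reduplication of a word) and the function names `bigrams` and `merge_compounds` are assumptions chosen purely for illustration.

```python
# Hypothetical linker set: the conjunction "و" (U+0648) that joins two words
# into one compound, as in "شب و روز". An assumption for illustration only.
LINKERS = {"و"}


def bigrams(tokens):
    """Traditional approach: every adjacent token pair becomes a feature."""
    return list(zip(tokens, tokens[1:]))


def merge_compounds(tokens):
    """Merge tokens that match one of two illustrative compound rules."""
    merged = []
    i = 0
    while i < len(tokens):
        # Rule 1 (assumed): word + linker + word forms a single compound.
        if (
            i + 2 < len(tokens)
            and tokens[i] not in LINKERS
            and tokens[i + 1] in LINKERS
            and tokens[i + 2] not in LINKERS
        ):
            merged.append(" ".join(tokens[i : i + 3]))
            i += 3
        # Rule 2 (assumed): a word repeated immediately is one compound.
        elif i + 1 < len(tokens) and tokens[i] == tokens[i + 1]:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Example sentence split on whitespace; real Urdu text would need a more
# careful segmenter, since spaces do not reliably mark word boundaries.
tokens = "وہ شب و روز آہستہ آہستہ کام کرتا ہے".split()
bigram_features = bigrams(tokens)
compound_tokens = merge_compounds(tokens)
print(len(tokens), len(bigram_features), len(compound_tokens))
```

In this toy sentence the nine whitespace tokens yield eight bigram features, while rule-based merging leaves six tokens, two of which are compounds. That is only an illustration of how compound merging can shrink the feature set, not a reproduction of the reported results.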

Published

2025-03-22

Section

Articles