Optimizing Urdu Text Tokenization: Morphological Rules for Compound Word Identification

Saquib Khushhal; Abdul Majid; Ali Abbas; Umza Naqvi; Mohammad Babar

doi:10.5281/zenodo.15069872

Authors

Saquib Khushhal, Dept. of Computer Science, University of Azad Jammu & Kashmir, Pakistan
Abdul Majid, Dept. of Computer Science, University of Azad Jammu & Kashmir, Pakistan
Ali Abbas, Dept. of Computer Science, University of Azad Jammu & Kashmir, Pakistan
Umza Naqvi, Dept. of Computer Science, University of Azad Jammu & Kashmir, Pakistan
Mohammad Babar, Department of Computing and Electronics Engineering, Middle East College, Muscat, Oman

DOI:

https://doi.org/10.5281/zenodo.15069872

Keywords:

Bigram, Compound words, Sentiment analysis, Tokenization, Word segmentation

Abstract

Tokenization of text documents is a primary natural language processing task for feature generation, and it plays a vital role in sentiment analysis, information retrieval, part-of-speech tagging, and named entity recognition. Urdu is spoken by around 170.2 million people worldwide as a first or second language, and it is morphologically and orthographically rich. Word tokenization in Urdu text is particularly challenging because word boundaries are not marked by spaces alone, as they are in many other languages. A token is the minimal unit of a language that carries a suitable semantic structure. A compound, a kind of multi-word expression, is a more complex word made up of multiple strings or independent base words, so it should be treated as a single token even when it contains spaces. Traditionally, bigram or trigram approaches are used to represent compound words during tokenization. This research proposes a morphological rule-based approach to identify compound words in Urdu text for tokenization. A thorough evaluation on a dataset of reasonable size compares the performance of the proposed technique with the traditional approaches. Results show that the proposed method can accurately identify compound words for the tokenization of Urdu text documents. Notably, using morphological rule-based techniques for compound words also reduces the number of extracted features.
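The contrast between n-gram features and rule-based compound merging can be illustrated with a minimal Python sketch. It is not the authors' rule set: the single rule used here (joining two words linked by the conjunctive "و", as in "شب و روز", "day and night"), the function names, and the sample sentence are all assumptions chosen for illustration only.

```python
# Illustrative sketch only. The linker rule below is an assumed example of a
# morphological rule; it is NOT taken from the paper.
LINKER = "و"  # conjunctive "vao", as in the compound "شب و روز"


def whitespace_tokens(text):
    """Naive baseline: split on whitespace only."""
    return text.split()


def bigrams(tokens):
    """Traditional approach: emit every adjacent pair as a candidate feature."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def merge_linker_compounds(tokens):
    """Hypothetical rule: merge 'X و Y' into a single compound token."""
    merged = []
    i = 0
    while i < len(tokens):
        has_linker_pattern = (
            i + 2 < len(tokens)
            and tokens[i] != LINKER
            and tokens[i + 1] == LINKER
        )
        if has_linker_pattern:
            # Keep the three surface words together as one token.
            merged.append(" ".join(tokens[i:i + 3]))
            i += 3
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    sentence = "وہ شب و روز محنت کرتا ہے"  # "He works hard day and night"
    tokens = whitespace_tokens(sentence)
    print(len(tokens), tokens)
    print(len(bigrams(tokens)), bigrams(tokens))
    print(len(merge_linker_compounds(tokens)), merge_linker_compounds(tokens))
```

In this toy case the bigram baseline emits a feature for every adjacent pair, most of which are not compounds, while the rule emits one token for the compound and leaves the remaining words untouched. That is the sense in which a rule-based approach can yield fewer features; the actual rules and measured reduction are those reported in the paper, not this sketch.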

Published

2025-03-22

Issue

Vol. 10 No. 10 (2025): 10

Section

Articles

License

This article is licensed under the Attribution-NonCommercial-NoDerivatives 4.0 International license.

All articles published in International Journal of Technology are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). This license allows others to share, copy, distribute, and adapt the work for any purpose, even commercially, as long as appropriate credit is given to the original authors. Authors retain the copyright and agree to have their work published under this license, ensuring the broadest possible dissemination and reuse of their research.

For more information or licensing inquiries, contact mossdigital77@gmail.com.
