Optimizing Urdu Text Tokenization: Morphological Rules for Compound Word Identification
DOI:
https://doi.org/10.5281/zenodo.15069872
Keywords:
Bigram, Compound words, Sentiment analysis, Tokenization, Word segmentation
Abstract
Tokenization of a text document is a primary natural language processing task for feature generation, and it plays a vital role in sentiment analysis, information retrieval, part-of-speech tagging, and named entity recognition. Urdu is spoken by around 170.2 million people worldwide as their first or second language. It is a morphologically and orthographically rich language. Word tokenization in Urdu text is challenging because, unlike in many other languages, a space alone does not reliably mark word boundaries. A compound, or multi-word expression, is a complex word made up of multiple strings or independent base words. A token is the minimal unit of a language that carries a suitable semantic structure. Traditionally, compound words are represented with bigram or trigram approaches during tokenization. This research proposes a morphological rule-based approach to identify compound words in Urdu text for tokenization. A thorough evaluation is performed on a dataset of reasonable size to compare the performance of the proposed technique with traditional approaches. Results show that the proposed method can accurately identify compound words for the tokenization of Urdu text documents. Notably, using morphological rule-based techniques for compound words reduces the number of extracted features.
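The contrast between bigram features and rule-based compound merging can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not list the morphological rules, so the single rule used here (two words joined by the conjunction "و", as in "امن و امان") is an illustrative assumption, and the names `LINKERS`, `whitespace_tokens`, `bigrams`, and `merge_compounds` are hypothetical.

```python
# Hypothetical rule set: a lone conjunction "و" between two words is
# assumed to mark a compound. This is an assumption for illustration,
# not a rule taken from the paper.
LINKERS = {"و"}


def whitespace_tokens(text):
    """Split on whitespace only (the naive baseline)."""
    return text.split()


def bigrams(tokens):
    """Return every pair of adjacent tokens as a space-joined string."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def merge_compounds(tokens, linkers=LINKERS):
    """Merge each 'word linker word' sequence into a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        # A compound needs a word on both sides of the linker.
        if i + 2 < len(tokens) and tokens[i + 1] in linkers:
            merged.append(" ".join(tokens[i:i + 3]))
            i += 3
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    sentence = "شہر میں امن و امان ہے"
    tokens = whitespace_tokens(sentence)
    # Bigram approach: keep all unigrams plus all adjacent pairs.
    bigram_features = set(tokens) | set(bigrams(tokens))
    # Rule-based approach: keep only the merged tokens.
    rule_features = set(merge_compounds(tokens))
    print(len(bigram_features), len(rule_features))
```

In this toy case the rule-based feature set is smaller because the compound is stored once as a single token, whereas the bigram approach keeps every adjacent pair whether or not it forms a compound. A real rule set would need to cover far more patterns than this one linker.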
License
Copyright (c) 2025 International Journal of Technology

This volume is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.
All articles published in International Journal of Technology are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). This license allows others to share, copy, distribute, and adapt the work for any purpose, even commercially, as long as appropriate credit is given to the original authors. Authors retain the copyright and agree to have their work published under this license, ensuring the broadest possible dissemination and reuse of their research.
For more information or licensing inquiries, contact mossdigital77@gmail.com.