Issues in Dari Transcription
Abstract
Orthographic transcription is used to represent spoken data and target-language translations when creating parallel corpora. Developing guidelines and standards is important for ensuring a consistent set of transcriptions, especially in machine learning applications. However, for languages with a complex writing system and no official orthographic standard, developing transparent and consistent transcription guidelines becomes very difficult. This paper presents the issues encountered in creating transcription guidelines for Dari for speech translation projects.
In developing guidelines, efforts are typically made to keep transcriptions consistent by following standard orthographic conventions as much as possible. Guidelines should also be transparent so that transcribers can follow them easily and produce accurate transcriptions of the data. These criteria are difficult to meet in the Persian writing system, which has an opaque orthography and considerable variation in how affixes and compounds are written, making word and morpheme boundaries difficult to detect. In addition, Dari displays diglossia, dialectal variation, a lack of standardization, and low societal literacy. All of these conditions affected the quality of the transcriptions in the Dari corpus and hindered the development of transcription standards.
[This document was not publicly released and is not shareable]