Published in Journal of Data Mining & Digital Humanities: Machine transliteration of long text with error detection and correction

By Mohamed Abdellatif, Joel U. Bretheim, and Marina Rustow

Different writing systems have been used, historically and contemporarily, to write the same language. This is typically done by substituting letters (or symbols, in the case of non-alphabetic systems). However, depending on the language and the writing systems involved, the process may not be purely deterministic. Quoting Becker and Becker [2000]:


even such basic acts as transliteration involve interpretation – to the extent that there is meaning in the medium itself. [...] In transliteration itself there is exuberance (that is, meaning is added) and deficiency (meaning is lost).


This gives significance to the problem of Machine Transliteration at the intersection of Digital Humanities and Natural Language Understanding. Transformer-based models have achieved notable success in modeling human languages, but many of them can handle inputs of at most 512 tokens. To reuse a pre-trained model with this limitation for downstream tasks (e.g., Machine Transliteration) on sequences longer than 512 tokens, we propose a method that segments the input into overlapping (not mutually exclusive) pieces, invokes the model piecewise, and reconstructs the full result from the pieces. To consolidate the result, we propose a method to detect and correct potential duplication and elimination errors at the segment boundaries, which reduces Word Error Rate from 0.0985 to 0.0.
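To make the segment-and-stitch idea concrete, here is a minimal Python sketch. It assumes a token-level transliteration callable capped at 512 tokens and that transliteration roughly preserves token boundaries within the overlap region; the function names, the window and overlap sizes, and the longest-match merge heuristic are illustrative assumptions, not the paper's exact method or parameters.

```python
# A minimal sketch of piecewise transliteration over overlapping windows.
# `model` stands in for any transliteration callable limited to ~512 tokens;
# the merge heuristic below is an illustrative assumption, not the paper's
# exact error-correction procedure.

def segment(tokens, window=512, overlap=64):
    """Split tokens into overlapping windows so no model call exceeds the limit."""
    if not tokens:
        return []
    step = window - overlap
    return [tokens[i:i + window] for i in range(0, len(tokens), step)]

def merge(prev, nxt, overlap=64):
    """Stitch two piecewise outputs, resolving errors at the seam.

    The longest suffix of `prev` matching a prefix of `nxt` is kept once:
    repeated text at the seam (a duplication error) collapses to a single
    copy, while the absence of any match can flag a possible elimination
    error for review.
    """
    tail = prev[-overlap:]
    for k in range(min(len(tail), len(nxt)), 0, -1):
        if tail[-k:] == nxt[:k]:
            return prev + nxt[k:]
    return prev + nxt  # no overlap found: concatenate and flag for review

def transliterate_long(tokens, model, window=512, overlap=64):
    """Transliterate a sequence of arbitrary length via overlapping windows."""
    pieces = [model(seg) for seg in segment(tokens, window, overlap)]
    out = pieces[0]
    for piece in pieces[1:]:
        out = merge(out, piece, overlap)
    return out
```

Intuitively, the shared overlap between adjacent windows is what makes duplication and elimination errors detectable: any mismatch between the two renderings of the same span marks a seam that needs correction.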

Read the paper: https://zenodo.org/records/14982300
