Some thoughts on alignment levels in parallel corpora

xiao4zhu · posted 2008-10-11 18:24:53

The entry below is copied from A Glossary of Corpus Linguistics by Paul Baker, Andrew Hardie and Tony McEnery (Edinburgh University Press, 2006).

alignment (Chinese: 对齐)

When working on a parallel corpus, it is useful to know exactly which parts of a text in language A correspond to the equivalent corresponding text in language B. The process of adding such information to parallel texts is called alignment.

Alignment can be carried out at the sentence level, in which case each sentence is linked to the sentence it corresponds to in the other language(s). This is not straightforward, as the sentence breaks are not necessarily in the same place in a translation as they are in the original text.

Alternatively, alignment can be done at the word level, in which case each word must be linked to a word or words in the parallel text. This is much more complex, as a given word may correspond to one word, more than one word, or no word at all in the other language, and the word order may be different as well. For example, English I saw it would correspond to French je l’ai vu, where I = je, saw = ai vu, and it = l’. However, word alignment is also much more useful than sentence alignment, for example, for finding translation equivalents and compiling bilingual lexicons.
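(An aside, not part of the glossary entry: the I saw it / je l’ai vu example can be written down as links between token positions. This is only a minimal sketch. Splitting the French into four tokens and the names `alignment` and `aligned_pairs` are my own assumptions, not something the glossary prescribes.)

```python
# Tokens from the glossary's example; the tokenization is an assumption.
english = ["I", "saw", "it"]
french = ["je", "l’", "ai", "vu"]

# Each English token index maps to the French token indices it is linked to.
# An empty list would mean "no counterpart in the other language".
alignment = {
    0: [0],     # I   = je
    1: [2, 3],  # saw = ai vu (one word linked to two)
    2: [1],     # it  = l’   (word order differs)
}


def aligned_pairs(src, tgt, links):
    """Return (source word, [linked target words]) for every source word."""
    return [(word, [tgt[j] for j in links.get(i, [])])
            for i, word in enumerate(src)]


for word, targets in aligned_pairs(english, french, alignment):
    print(word, "=", " ".join(targets) if targets else "(nothing)")
```

The one-to-many link for saw and the crossed order of it and l’ are the kind of cases that make word alignment harder than sentence alignment.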

When a spoken corpus is released alongside the sound recordings from which it was created, the text may contain markup to show the point in time in the recording to which each chunk of text corresponds. This is also referred to as alignment (more specifically, time-alignment or temporal alignment).

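Brief comment:
Alignment is a key issue for parallel corpora. Generally speaking, the smaller the linguistic unit at which the texts are aligned, the more valuable the corpus, but the technical difficulty is greater and the work takes longer. In my experience, from a practical point of view, when an individual builds a parallel corpus on their own, paragraph-level alignment is already enough to make it quite useful.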
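
To illustrate paragraph-level alignment, here is a rough sketch in Python. It assumes both files separate paragraphs with blank lines and that the translation keeps the same number of paragraphs in the same order; real texts often break that assumption and need manual checking. The function names are made up for this example.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and drop empty chunks."""
    chunks = re.split(r"\n\s*\n", text.strip())
    return [c.strip() for c in chunks if c.strip()]


def align_paragraphs(source_text, target_text):
    """Pair paragraphs by position; fail loudly if the counts differ."""
    src = split_paragraphs(source_text)
    tgt = split_paragraphs(target_text)
    if len(src) != len(tgt):
        raise ValueError(
            f"paragraph counts differ: {len(src)} vs {len(tgt)}; "
            "fix the paragraph breaks by hand before aligning"
        )
    return list(zip(src, tgt))


source_text = "First paragraph.\n\nSecond paragraph."
target_text = "第一段。\n\n第二段。"

for en, zh in align_paragraphs(source_text, target_text):
    print(en, "|||", zh)
```

Raising an error when the counts differ is deliberate: pairing by position silently goes wrong after the first merged or split paragraph, so it is safer to stop and correct the files by hand.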

