Techniques for extracting bilingual lexis have so far focused mostly on parallel corpora and translation memories. This strategy has two weaknesses. Firstly, the source material consists of phrases that were translated manually beforehand, so the equivalences obtained are the product of a translation process, which can lead to a certain lack of naturalness. Secondly, parallel corpora are a scarce resource, and scarcer still for minority languages. Consequently, interest has been growing in comparable corpora and in the techniques for exploiting them.
Multilingual comparable corpora are collections of texts written in at least two languages. Unlike parallel corpora, the texts are not translations of each other, but they share features such as subject matter, publication date, text genre or register. Which features they share determines what kind of information can be extracted from them. For example, if the texts deal with the same subject (medicine, say), multilingual terminology for that field can be extracted.
The main aim of AzerHitz is to research and develop techniques for the semi-automatic extraction of bilingual lexis from comparable corpora. The work done in the Erauzterm and ELexBI projects has been taken as the starting point. Although the raw material of these projects and the techniques for exploiting it are very different, the two types of extraction share some aspects.
The idea underpinning AzerHitz is this: “The words appearing in the vicinity of a specific word are similar in two languages.” In other words, a word and its translation tend to occur in similar contexts. AzerHitz searches for the translations of words by taking advantage of this similarity.
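To make the context-similarity idea concrete, here is a minimal Python sketch. It is not the AzerHitz implementation: the window size, the cosine similarity, the seed dictionary and the English/Spanish toy sentences are all assumptions chosen for illustration. The context of the source word is collected, projected into the target language through the seed dictionary, and compared with the context of each candidate word in the target corpus.

```python
from collections import Counter
from math import sqrt


def context_vector(tokens, target, window=2):
    """Count the words found within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def translate_vector(vec, seed_dict):
    """Project a source-language context vector into the target language.

    Context words missing from the (hypothetical) seed dictionary are dropped.
    """
    out = Counter()
    for word, n in vec.items():
        if word in seed_dict:
            out[seed_dict[word]] += n
    return out


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(src_tokens, tgt_tokens, source_word, seed_dict, window=2):
    """Rank target words not already in the seed dictionary by context similarity."""
    src_vec = translate_vector(context_vector(src_tokens, source_word, window), seed_dict)
    candidates = set(tgt_tokens) - set(seed_dict.values())
    scored = [(cosine(src_vec, context_vector(tgt_tokens, c, window)), c) for c in candidates]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Toy comparable "corpora" (assumed data, not translations of real texts).
src_tokens = ["doctor", "prescribes", "medicine", "patient", "doctor", "examines", "patient"]
tgt_tokens = ["médico", "receta", "medicina", "paciente", "médico", "examina", "paciente"]
seed_dict = {"doctor": "médico", "patient": "paciente", "prescribes": "receta"}

ranked = rank_candidates(src_tokens, tgt_tokens, "medicine", seed_dict, window=1)
for score, word in ranked:
    print(f"{word}: {score:.2f}")
```

In this toy setting "medicina" ranks above "examina" because its neighbours match the translated neighbours of "medicine". On real corpora the raw counts would normally be replaced by an association measure and the candidate list filtered, which is where the semi-automatic step comes in.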
The second aim of the project is to design measures that quantify how comparable a comparable corpus is. If we can measure how similar a collection of documents in one language is to a collection in another, we will be able to build comparable corpora that are better suited to extraction.
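The text does not say which comparability measures are intended, so the following Python sketch only illustrates what such a measure could look like. It assumes a hypothetical bilingual dictionary and scores a pair of collections by the share of dictionary-covered vocabulary in each one whose translation also occurs in the other, averaged over both directions.

```python
def coverage(tokens_a, tokens_b, dictionary):
    """Fraction of dictionary-covered word types in A whose translation occurs in B."""
    vocab_b = set(tokens_b)
    translatable = {w for w in set(tokens_a) if w in dictionary}
    if not translatable:
        return 0.0
    found = sum(1 for w in translatable if dictionary[w] in vocab_b)
    return found / len(translatable)


def comparability(src_tokens, tgt_tokens, src_to_tgt):
    """Symmetric score in [0, 1]: the mean of the coverage in both directions."""
    # Invert the dictionary; this assumes one translation per word (a simplification).
    tgt_to_src = {t: s for s, t in src_to_tgt.items()}
    forward = coverage(src_tokens, tgt_tokens, src_to_tgt)
    backward = coverage(tgt_tokens, src_tokens, tgt_to_src)
    return 0.5 * (forward + backward)


# Assumed toy data: one medical collection per language plus a sports collection.
dictionary = {"doctor": "médico", "patient": "paciente", "goal": "gol", "match": "partido"}
medical_en = ["doctor", "treats", "patient"]
medical_es = ["médico", "trata", "paciente"]
sports_es = ["gol", "partido", "equipo"]

same_topic = comparability(medical_en, medical_es, dictionary)
different_topic = comparability(medical_en, sports_es, dictionary)
print(same_topic, different_topic)
```

A score of this kind could be used to choose, among several candidate document collections, the pair most likely to yield good extraction results. Frequency-weighted variants are an obvious refinement.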
Funded under Saiotek 2007 (the Saiotek grant application period for 2007) of the Department for Industry, Tourism and Trade of the Basque Autonomous Community Government.
Copyright © 2007 Elhuyar Fundazioa