The Science and Technology Corpus Project showed how laborious it is to build a corpus by collecting texts, and that the growing international trend is to obtain such texts from the Internet. CorpEus, a service that allows the web to be consulted as if it were a huge Basque corpus, was developed in line with this strategy. The use of corpora in language technologies is becoming ever more widespread, which is why suitable corpora need to be obtained in the short term.
The main aim of Co3 is to develop a tool for obtaining comparable corpora using the Web as the starting point. Beyond this general aim, it has more specific intermediate aims:
Tools that attempt to obtain corpora for languages with limited resources are oriented towards obtaining the largest corpora possible and are not designed for obtaining specialized corpora. Moreover, the few attempts made to obtain specialized corpora have used languages with a large presence on the Internet, without addressing the problems that small languages pose for obtaining corpora of adequate size.
As already pointed out, Co3 attempts to build corpora using the Internet as the source. To do this, documents are gathered from the Internet and their suitability is analysed by applying a range of techniques, until a corpus of suitable size has been obtained.
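The gather-and-filter loop described above can be illustrated with a minimal sketch. It is not Co3's implementation: the text does not say which suitability techniques are applied, so the keyword-overlap filter here is a stand-in chosen only for illustration, and the names `keyword_suitability`, `build_corpus`, `topic_terms`, `min_hits` and `target_tokens` are all hypothetical. Candidate documents are passed in as plain strings, so no web access is involved.

```python
from typing import Callable, Iterable, List, Set


def keyword_suitability(text: str, topic_terms: Set[str], min_hits: int = 2) -> bool:
    """Hypothetical filter: accept a document that mentions enough topic terms.

    This stands in for the unspecified "range of techniques" in the text.
    """
    tokens = {token.strip(".,;:()").lower() for token in text.split()}
    return len(tokens & topic_terms) >= min_hits


def build_corpus(
    candidates: Iterable[str],
    is_suitable: Callable[[str], bool],
    target_tokens: int,
) -> List[str]:
    """Collect suitable documents until the corpus reaches the target size."""
    corpus: List[str] = []
    size = 0  # corpus size, counted in whitespace-separated tokens
    for text in candidates:
        if size >= target_tokens:
            break  # a corpus of suitable size has been obtained
        if is_suitable(text):
            corpus.append(text)
            size += len(text.split())
    return corpus


# Toy usage with three invented documents and an invented topic vocabulary.
docs = [
    "The laser emits coherent light at a fixed wavelength.",
    "Our football team won the cup last night.",
    "Wavelength and frequency of light determine the laser colour.",
]
terms = {"laser", "light", "wavelength"}
corpus = build_corpus(docs, lambda t: keyword_suitability(t, terms), target_tokens=15)
```

In this toy run the off-topic second document is rejected and the other two are kept. Running the same loop per language with topic terms for the same subject would give one corpus per language, which is the sense in which the resulting corpora could be comparable.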
If this process is used to obtain corpora of different languages but always on the same subject, it is possible to obtain comparable corpora. In this respect, the research that the AzerHitz project is carrying out on techniques to measure the level of comparability of corpora could be very useful.
Copyright © 2007 Elhuyar Fundazioa