Tesouro Informatizado da Lingua Galega


TILG, Tesouro Informatizado da Lingua Galega, came into being as an auxiliary tool for a dictionary-making project and was at first called the "Lexicographical database for a Galician dictionary". Its original aim was to complement the (mainly dialectal) lexicographical index which the Instituto da Lingua Galega had been compiling since 1971 using old-fashioned file cards. TILG's data was drawn from Galician texts written in the Modern and Present-Day Period.

However, the TILG’s own story really began in 1986, when these texts were digitized by scanning with OCR or, where that was not practical, by copy-typing by hand. Once in digital format, the texts were processed to reduce spelling variants and thereby facilitate further computer processing. The lemmatization and parsing process was semi-automated, using custom software. Lemmatizing tools at that time were nevertheless subject to serious limitations, particularly for an incompletely standardized language. In addition to ambiguities of the kind found in any language, Galician presented a daunting jumble of morphological and phonetic variants, which were often dialectal but were sometimes also the fruit of the purist or hyper-purist prejudices characterising certain periods in the history of written Galician. Despite these challenges and the substantial effort required to lemmatize and annotate the corpus, the work was deemed worthwhile, given the benefits of having all spelling variants of every lexical unit grouped under a single canonical form.
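
The idea of grouping spelling variants under a single canonical form can be illustrated with a minimal Python sketch. It is not the custom software used for TILG: the function name `group_variants` and the small variant table are hypothetical, and the word forms are chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical variant-to-lemma table, for illustration only;
# it is not taken from the TILG data.
VARIANT_TO_LEMMA = {
    "liberdade": "liberdade",
    "libertade": "liberdade",
    "galego": "galego",
    "gallego": "galego",
}

def group_variants(tokens):
    """Group the attested spellings of each token under one canonical form."""
    groups = defaultdict(set)
    for token in tokens:
        form = token.lower()
        # Forms missing from the table are treated as their own lemma.
        lemma = VARIANT_TO_LEMMA.get(form, form)
        groups[lemma].add(form)
    return {lemma: sorted(forms) for lemma, forms in groups.items()}

tokens = ["Libertade", "liberdade", "gallego", "galego", "terra"]
for lemma, forms in group_variants(tokens).items():
    print(lemma, forms)
```

A lookup table of this kind only handles forms that are already known; ambiguous or unseen forms still call for manual review, which is one reason a semi-automated process of the sort described above is needed.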

The first version of TILG took the shape of an on-line searchable database containing 1464 texts; it was made publicly available in 2003. José Ramom Pichel from Imaxin Software was in charge of programming and software maintenance on that occasion.

Starting in 2006, the texts in the corpus were reprocessed and converted into XML documents in order to make the corpus more useful. This was achieved in a joint project with the Universidade de Vigo’s Seminario de Lingüística Informática under the leadership of Xavier Gómez Guinovart. It resulted in a new version of the corpus, TILG 2.0, which became accessible on-line in 2010 through the RILG portal (Recursos integrados da lingua galega) maintained by the Seminario de Lingüística Informática in Vigo.

That was followed by a new revision begun in 2013, TILG 3.0, which mainly involved updating the search engine and user interface. New features introduced at that time included concordancing of search items and basic statistical information on absolute and relative frequency. César Osorio was in charge of the computing aspects of this project.
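
For readers unfamiliar with these features, the following Python sketch shows one simple way to compute absolute and relative frequencies and a keyword-in-context concordance. It is an assumption-laden illustration, not the TILG search engine: the function names `frequencies` and `concordance` are hypothetical, and relative frequency is expressed here as a plain proportion of all tokens, whereas a corpus tool may normalise differently (for example, per million words).

```python
from collections import Counter

def frequencies(tokens):
    """Return {word: (absolute count, share of all tokens)}."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: (n, n / total) for word, n in counts.items()}

def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Toy token list, invented for the example.
tokens = "a terra e o mar e a terra".split()
print(frequencies(tokens)["terra"])
for left, key, right in concordance(tokens, "terra"):
    print(f"{left:>10} [{key}] {right}")
```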

In the new version presented here, TILG 4.1, which has been developed together with NLPgo Technologies, S.L., the texts have again been reprocessed to enhance the usefulness of the corpus. The system of tags used for morphosyntactic parsing has been modified and the information in headers, which was formerly quite limited, has been expanded substantially. Consequently, in addition to the two basic kinds of simple search (by Headwords / Grammatical units and Graphic words), TILG is now able to offer an advanced search configuration that permits the refinement of searches and filtering of search results.

The work briefly described here was achieved thanks to a great deal of effort and expenditure in human and economic resources. Funding has been supplied throughout by the Xunta de Galicia’s Dirección Xeral de Política Lingüística (and later by the Secretaría Xeral de Política Lingüística), which has supported this project since it began around 1986. The human resources comprise all those people who have participated directly or indirectly in the project over a period of more than thirty years. Our thanks to all of them!