Galician and Portuguese word bank

USC. Instituto da Lingua Galega

The project


Tesouro's aims

Tesouro do léxico patrimonial galego e portugués is a joint initiative by Brazilian, Galician and Portuguese universities coordinated by Rosario Álvarez (Instituto da Lingua Galega, Universidade de Santiago de Compostela). The initial purpose of the project is to develop a digital corpus of lexical material linked to traditional life, with special emphasis on lexis associated with activities and lore which are dying out or extinct owing to cultural and social change. The project also aims to supply all the material's geographical details in order to support comparative analyses and studies of the distribution and spread of lexical items.

The project also hopes to contribute to the study of the inherited traditions of the participating countries, both material and immaterial, and thus to be of significant value to ethnographic studies. The culture and technology associated with rural life have already disappeared from most parts of our countries owing to the far-reaching economic and social changes of recent years.

Another of Tesouro's objectives is to provide a broad lexical corpus of use in a wide variety of lines of research, from either a synchronic or a diachronic perspective. Apart from its obvious interest for dialectology and lexicology, the package can supply pertinent data to many other research initiatives in fields ranging from etymology, phonetics and phonology to lexicography and morphology.

Last but not least, Tesouro aims to function as a dialectal lexical portal which not only provides an electronic edition of heritage vocabulary but also contains a wide range of information of diverse kinds for researchers in different disciplines. To this end, besides being an open data base which includes photographs, drawings and various types of ethnographic information, the Tesouro package also contains an exhaustive inventory of Galician and Portuguese dialect sources with bibliographical data for all publications that describe the dialect lexicon in these areas from any perspective. This inventory is searchable by various criteria such as date, author or region.


The sources

Tesouro do léxico patrimonial galego e portugués covers any kind of lexical study containing dialect material from within the geographical boundaries of Brazil, Galicia or Portugal. The data base includes material that is highly diverse in appearance, external and internal structure and level of territorial variation. On the basis of the materials incorporated so far, three chief types of study can be identified:

a) Ethnographical monographs about the speech of a given town, village or district. Such studies were generally conceived as supervised academic dissertations undertaking to study the language variety of a narrowly defined area, typically a parish, municipality or group of municipalities.

b) Language atlases and dialect surveys. This second group includes material collected in the field from a broad network of geographical points spread across one or more of the countries forming part of Tesouro. Language atlases occupy a place of special significance within this set of materials.

c) Other studies of dialect vocabulary. There are many studies in the ambit of Galician and European Portuguese which examine different aspects of local culture from an onomastic or ethnolinguistic perspective. Typically these studies have been published as articles in scientific journals or as monographs.


Data treatment

It was agreed from early on in the project design stage that the computer package providing access to the Tesouro do léxico patrimonial galego e portugués should retain all the information contained in the original sources while at the same time providing quick and easy access to that information in a way that is useful to a variety of types of user.

Despite the challenges posed to this task of data organisation and classification by the varied nature of the sources, the structure of the data base is sufficiently comprehensive and flexible to accommodate all the information found in the sources.

There follows a descriptive list of the chief fields constituting the Tesouro data base that are used to codify that information:

a) Variant. This is the form that constitutes the heading of a given entry in each of the glossaries in Tesouro. Details of the spelling used by the author are scrupulously respected, since this often provides pertinent information. When the original source only cites the form in phonetic transcription, the corresponding item in conventional spelling has been constructed.

b) Phonetic transcription. Although the project's main focus is lexicographical, it was decided to retain pertinent phonetic information too when the documents in question provide any. Since the conventions used to indicate the pronunciation vary widely, it was necessary to standardise and adapt phonetic transcriptions to IPA standards.

c) Part of speech. The part of speech categorisation that appears in the source is reproduced verbatim. As a result, the same word might be labelled in different ways, e.g. s, sm, subst, subst m and so on. To help classify and order material from different sources, a standard part of speech label is also added according to the headword and stored in its own field in the database.

d) Headword. Different phonetic or phonological variants found in sources are brought together under a single headword which makes it easier to see all such variants at a glance. The headword is distinct for each of the language varieties (Galician, European Portuguese and Brazilian Portuguese), so for example the variant forms dereito, dreito, direito in Galician are grouped under the headword dereito and in Portuguese under direito. Derived forms are treated as separate headwords (e.g. queixo, queixelo, queixal).

e) Examples. In some cases sources provide examples of the use of forms. Sometimes grammatical information such as government is provided, other times collocational information is given, and very often verses, sayings or other constructions that contain the word under consideration are cited.

f) Cross-references. It is quite usual in some of the sources for an entry to include cross-references to related items occurring in the same source. Such cross-references may serve, for instance, to link two formally related variants of the same word (e.g. albitanas/albitanes); or the cross-reference may indicate a meaning relationship, for example of hyponymy or heteronymy (e.g. jugo ~ tchabielha, canzile, molida, solada, solinho, temoeiro, canga...); yet again, cross-references may link a number of forms considered secondary to a primary entry which contains in one place all the information about meanings, examples and other aspects.

g) Definitions. Into this field goes semantic information, considered the fundamental part of the sources we compile. Definition is to be understood in a broad sense, including not only the components normally thought of as a definition in the narrow, dictionary sense, but also information that may be interpreted as explaining the meaning of the item in question. Information that does not fit into other, more specific fields is also put in this field, such as footnotes, geographical indications about the place where the variant was collected or observations about whether or not the form is listed in the dictionaries.

h) Semantic classification. When designing the present project it was considered desirable to carry out a data-oriented semantic classification of all the incorporated materials. This classification should make it possible to extract information grouped by semantic fields, such that users can obtain a listing of all words linked to a single semantic cluster (such as weather, types of agricultural terrain, plants and trees, buildings, etc.). To this end a system of semantic classification based on earlier studies was developed, resulting in twelve major types with internal subdivisions. In the case of sources that already have some kind of ordering of items, semantic classifiers are always assigned following the lead of the author of the material, and this may explain some divergences in the way a given word has been classed semantically in different sources.

i) Geographical index. One of the conditions that must be met by materials to be incorporated into the Tesouro's data base is geographical specification. This requirement implies the attribution of all lexicographical data to a village, parish, municipality or other identifiable geographical entity. Of the two options for indicating the origin of forms, geographical point or area, the latter was chosen because it admits of mapping and gives an idea at a glance of the distribution of forms across the three countries' territories. In view of the differences in size between the three territories, we agreed to use different administrative entities to represent the data. For Galicia and Portugal the concello or municipality is used as the entity of reference. For Brazil we chose the mesorregião, an administrative division covering several municipalities with similar economic and social characteristics.

j) Pictures and drawings. Many of the materials that have gone into the Tesouro have graphic content serving to illustrate the objects that are described and defined. These photographs and sketches are also placed in the data base and can be consulted at the same time as the textual information is being accessed.
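
As a rough illustration of how the fields listed above might fit together, the following Python sketch models one entry as a record and shows two of the groupings the text describes: variants collected under a headword for each language variety, and the areas in which each variant is attested. This is only a sketch under stated assumptions, not the actual schema of the Tesouro data base. The class name, field names and function names are hypothetical, the standard label "noun" is an assumed value, and the area names and definition text are placeholders rather than real data. Only the forms dereito, dreito, direito and the source labels sm, subst m, s are taken from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    """One glossary entry; field names are hypothetical, modelled on fields a) to j)."""
    variant: str        # a) form as spelled in the source
    headword: str       # d) headword grouping the variants, per language variety
    variety: str        # "Galician", "European Portuguese" or "Brazilian Portuguese"
    source_pos: str     # c) part of speech label, reproduced verbatim from the source
    standard_pos: str   # c) standardised label, stored in its own field
    definition: str     # g) semantic information in the broad sense
    area: str           # i) concello/municipality, or mesorregião for Brazil
    phonetic: Optional[str] = None                    # b) transcription adapted to IPA
    examples: list = field(default_factory=list)      # e) usage examples
    cross_refs: list = field(default_factory=list)    # f) related items in the same source
    semantic_class: Optional[str] = None              # h) semantic classifier
    images: list = field(default_factory=list)        # j) photographs and sketches


def variants_by_headword(entries):
    """Group the attested variants under each (variety, headword) pair."""
    grouped = defaultdict(set)
    for e in entries:
        grouped[(e.variety, e.headword)].add(e.variant)
    return {key: sorted(forms) for key, forms in grouped.items()}


def areas_by_variant(entries, variety, headword):
    """For one headword, list the areas in which each variant is recorded."""
    areas = defaultdict(set)
    for e in entries:
        if e.variety == variety and e.headword == headword:
            areas[e.variant].add(e.area)
    return {form: sorted(places) for form, places in areas.items()}


# Placeholder data: the areas and definition are invented for illustration only.
PLACEHOLDER = "(definition as given in the source)"
entries = [
    Entry("dereito", "dereito", "Galician", "sm", "noun", PLACEHOLDER, "concello A"),
    Entry("dereito", "dereito", "Galician", "subst m", "noun", PLACEHOLDER, "concello B"),
    Entry("dreito", "dereito", "Galician", "s", "noun", PLACEHOLDER, "concello B"),
    Entry("direito", "dereito", "Galician", "sm", "noun", PLACEHOLDER, "concello C"),
    Entry("dreito", "direito", "European Portuguese", "sm", "noun", PLACEHOLDER, "concelho D"),
]

grouped = variants_by_headword(entries)
distribution = areas_by_variant(entries, "Galician", "dereito")
```

Keeping the verbatim source label and the standardised label in separate fields mirrors the description in c): the original information is preserved, while the standard label allows material from different sources to be ordered together. Keying the grouping on both variety and headword reflects d), where the same variant can fall under dereito in Galician and direito in Portuguese.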


Aid and funding


Ministry of Science and Innovation (Spain), FFI2009-12110.

Science and Technology Foundation. Ministry of Science, Technology and Higher Education (Portugal). PTDC/CLE-LIN/102650/2008.


AECID (Spanish Agency for International Cooperation for Development). Ministry of Foreign Affairs and Cooperation (Spain) 2009.


Aid for consolidation and organisation of research units of the Galician University System (SUG), General Secretariat of Universities, Education and University Regulation Council (Galicia).

© 2018 Instituto da Lingua Galega - USC