Piek Vossen

CECL

From WordNet, EuroWordNet to the Global Wordnet Grid

 



Prof. Dr. Piek Vossen holds the chair of Computational Lexicology at the Faculty of Arts of the Vrije Universiteit Amsterdam, The Netherlands, and is CTO of Irion Technologies, a language-technology company in Delft, the Netherlands. He is also co-founder and co-president of the Global Wordnet Association (GWA), together with Dr. Christiane Fellbaum of Princeton University. GWA supports the development of wordnets for all languages in the world.

 

 

In this presentation, I will give an overview of the English WordNet and EuroWordNet and sketch a perspective on the future Global Wordnet Grid. The English WordNet is the most widely used resource in language technology; it has had, and still has, an enormous impact on the development of the field. The English WordNet is organized around the notion of a synset: a set of synonymous words and expressions in a language. Each synset represents a concept, and lexical semantic relations, such as hyponymy and meronymy, are expressed between synsets. WordNet thus deviates from traditional lexical resources, which take individual word meanings as their basis.
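To make this concrete, the minimal sketch below uses NLTK's interface to the Princeton WordNet; the synsets and relation targets mentioned in the comments are illustrative and depend on the WordNet data that is installed.

```python
# Minimal sketch: synsets and lexical semantic relations with NLTK's
# WordNet interface (requires the nltk 'wordnet' data package).
from nltk.corpus import wordnet as wn

# All synsets (concepts) that the word "car" can express.
for synset in wn.synsets('car'):
    print(synset.name(), synset.lemma_names())   # e.g. car.n.01 ['car', 'auto', ...]

# Relations hold between synsets, not between individual words.
car = wn.synset('car.n.01')
print(car.hypernyms())        # hyponymy: more general concepts (e.g. motor_vehicle.n.01)
print(car.part_meronyms())    # meronymy: parts of a car (e.g. accelerator.n.01)
```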

EuroWordNet not only extended the model to other languages but also added a cross-lingual perspective to lexical semantic resources. In EuroWordNet, the synsets of all languages are related through equivalence relations to the synsets of the English WordNet, which thus functions as an interlingua between the different languages. Through English, the vocabulary of any language can be mapped to that of any other language in the model. This raises fundamental questions about the cross-lingual status of semantic information: is it part of the system of a language or part of our knowledge of the world? Many lexical resources duplicate semantic information and knowledge that is not specific to a language. This not only leads to inconsistencies across resources but also complicates specifying the relations across the vocabularies of different languages.
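EuroWordNet itself is distributed in its own format, but the Open Multilingual Wordnet data in NLTK illustrates the same pivot idea in a few lines: synsets of other languages are aligned to Princeton WordNet synsets, so English acts as the hub. The language codes and example outputs below are illustrative and depend on the OMW data that is installed.

```python
# Sketch of the interlingua idea with the Open Multilingual Wordnet in NLTK
# (requires the nltk 'wordnet' and 'omw-1.4' data packages).
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')          # English synset acting as the pivot
print(dog.lemma_names('ita'))        # Italian lemmas aligned to it, e.g. ['cane', ...]
print(dog.lemma_names('spa'))        # Spanish lemmas aligned to it, e.g. ['perro', ...]
# Italian and Spanish are thus connected through the shared English synset.
```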

This matter is taken a step further in the Global Wordnet Grid, where wordnets are related to a shared ontology that makes a common world-knowledge model explicit. An ontology as an interlingua has many advantages over using a real language, as sketched after the list below:

  1. specific features of English, both cultural and linguistic, do not complicate the definition of the equivalence relations of languages to the index;
  2. concepts that do not occur in English can easily be added to the ontology;
  3. the meaning of the concepts can be defined by formal axioms in logic;
  4. it becomes possible to make a more fundamental distinction between knowledge of the world and knowledge of a language;
  5. the ontology can be used by computer programs to make semantic inferences in a uniform way, regardless of the language that is linked to it.
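As a rough sketch of this design, the fragment below links synsets of several languages to a single language-neutral ontology concept that carries a formal axiom; the concept name, the SUMO-style axiom, the relation labels and the data structures are invented for illustration and are not taken from an actual ontology or wordnet release.

```python
# Hypothetical sketch: synsets of different languages are mapped to a shared,
# language-neutral ontology concept instead of to English synsets.
from dataclasses import dataclass, field

@dataclass
class OntologyConcept:
    identifier: str
    axioms: list[str] = field(default_factory=list)   # formal definitions in logic

@dataclass
class SynsetLink:
    language: str
    words: list[str]
    relation: str              # e.g. "equivalent", "subclass_of"
    concept: OntologyConcept

# One concept, defined once by a formal axiom rather than by an English gloss.
RIVER = OntologyConcept(
    "River",
    axioms=["(forall (?x) (=> (instance ?x River) (instance ?x WaterArea)))"],
)

links = [
    SynsetLink("en", ["river"], "equivalent", RIVER),
    SynsetLink("fr", ["fleuve"], "subclass_of", RIVER),  # a concept English does not lexicalize separately
    SynsetLink("nl", ["rivier"], "equivalent", RIVER),
]
```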

Obviously, developing such an ontology and defining the mappings from the vocabularies to the ontology will be a long and painstaking process. In recent years, though, a lot of progress has been made in the area of ontology development, which makes such an enterprise more realistic. Many ontologies and semantic lexicons have been developed and are increasingly represented in standardized formats. Proposals are being developed in ISO working groups on how to structure each of them and how to relate lexicons to ontologies. Distributed resources are published on the web and are used intensively, both in the Web 2.0 community of social networks and in the Semantic Web (Web 3.0) community of active knowledge repositories. The time is therefore ripe for a project such as the Global Wordnet Grid.

A first implementation of the Global Wordnet Grid is being built in the current FP7 project KYOTO. An important feature of KYOTO is that the building of the wordnets and the ontology is done by communities in specific domains through a Wiki environment. In this environment, the people in these communities discuss and define the meanings and concepts of the terms in their field, even across languages. As a starting point, the Wiki environment is pre-loaded with terms that are automatically derived from documents that can be uploaded. This rich term database, with pointers to textual occurrences of the terms, will make it easier to define the meanings in a formal way. The Wiki uses textual examples and paraphrases in interviews to validate the relations and formal definitions. The derived knowledge structures are hidden from the user but can be applied directly by other computer programs to mine important facts and data from the sources provided by the community. KYOTO is therefore not just another Wikipedia but a platform for defining and anchoring meaning across languages and across people and computers. Likewise, KYOTO allows communities to build lexicons as a form of knowledge and language acquisition from which they benefit directly when handling knowledge and facts. Since the lexicons and ontologies that are built are also anchored to generic wordnets and a generic ontology, this distributed community effort will eventually lead to the development of the Global Wordnet Grid.
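As a toy illustration of the pre-loading step only, the sketch below derives candidate terms from uploaded documents and records pointers to their textual occurrences; the crude pattern-based filter and the index layout are invented for the example and are much simpler than the term extraction KYOTO actually performs.

```python
# Toy sketch: derive candidate terms from uploaded documents and keep pointers
# to where they occur, so a Wiki could be pre-loaded with them for discussion.
import re
from collections import defaultdict

def build_term_index(documents: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    """Map each candidate term to a list of (document id, character offset) pointers."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Very naive candidate filter: lowercase words of four letters or more.
        for match in re.finditer(r"[a-z][a-z-]{3,}", text.lower()):
            index[match.group()].append((doc_id, match.start()))
    return dict(index)

docs = {
    "report-1": "The river basin suffers from eutrophication of the wetlands.",
    "report-2": "Wetlands near the river store water during floods.",
}
term_index = build_term_index(docs)
print(term_index["wetlands"])   # [('report-1', ...), ('report-2', ...)]
```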