CENTAL Seminars (2016-17)

The goal of the CENTAL seminars is to gather teachers, students and researchers (from both the academic and the industrial sectors) who are interested in the field of Natural Language Processing. The seminars are free and open to anyone and usually take place on Fridays from 2pm to 3pm in room c.142 of the Collège Erasme (CENTAL's seminar room). If you wish to stay up to date with the upcoming seminars, please subscribe to our newsletter by entering your email address in the form.

Contact

Serge Bibauw
Anaïs Tack

Programme for 2016-2017

FIRST TERM

Friday 7 Oct 2016, 14:00-15:00

Leonardo Zilio (INF, Universidade Federal do Rio Grande do Sul & CENTAL, UCL)

Semantic Role Labeling and Lexical Simplification: two samples of NLP applications

This seminar will present two studies that have different goals. In the first part, we will show the processes behind a semantic role labeling study, starting from corpus selection and parsing, and moving to argument extraction and semantic annotation. This includes the development of a subcategorization frame extractor for Brazilian Portuguese, and an annotation process that was carried out manually. The resulting resource contains more than 15 thousand arguments annotated for 192 verbs. In the second part, we will show two experiments that were developed around an overarching lexical simplification project. The first one deals with word embeddings and semantic relations among words, where the objective was to use word embeddings and a lexical resource (BabelNet) to generate a dictionary of synonyms, hypernyms and antonyms. The resource was automatically and manually validated, with 60.7% of the entries validated, resulting in a dictionary of 2,875 validated relations. The second experiment was the generation of a gold standard, a training/test set for complex word identification and a dictionary for lexical simplification, using classic literature texts as the corpus. The texts were processed using parsing and a frequency list index to facilitate the manual annotation process. A total of 3,720 manual annotations were carried out and later transformed into each of the resources.

Friday 28 Oct 2016, 14:00-15:00

Orphée De Clercq (LT3, UGent)

UCL: great beer at the “cercles”, but very dirrrrty!!
Aspect-based sentiment analysis of customer reviews: an overview of the task and its main challenges

The original objective of sentiment analysis, a very popular NLP task, has been to automatically classify an entire document or sentence as positive, negative or neutral. This, however, does not reveal what exactly people like and dislike. Often, users are not only interested in people’s general sentiments about a certain product, but also in their opinions about specific features, i.e., parts or attributes of that product. This comes down to a very fine-grained task, known as aspect-based sentiment analysis (ABSA), which is the topic of this seminar.

We will see that ABSA actually comprises several subtasks (aspect term extraction, aspect term classification and aspect polarity classification), each requiring a different approach. We will have a closer look at the current state of the art for each of these subtasks and focus on supervised machine learning techniques for processing English and Dutch restaurant reviews. To conclude, we will discuss some of the main challenges the domain is still facing, which illustrate that this task is far from solved.

Friday 18 Nov 2016, 14:00-15:00

Cédric Lopez (VISEO, Grenoble)

SMILK: from NLP to LOD. Knowledge representation, entity and relation extraction, linking and visualisation

One of the goals of the joint laboratory SMILK (Social Media Intelligence and Linked Knowledge, LabCom ANR) is to study the coupling of Natural Language Processing (NLP) with Linked Open Data (LOD). To this end, our research focuses on: 1) extracting entities of interest and their relations from unstructured text, 2) representing the extracted knowledge, 3) linking the extracted data with LOD data, 4) visualising and exploring the linked data.

The presentation will report on the state of our research, and we will demonstrate what the research results make possible through a prototype in the form of a browser plugin whose main ambition is to enrich the knowledge of users browsing the Web. As the user browses, the system populates the knowledge base and weaves links to the Web of open data, which the user can then explore.

Friday 25 Nov 2016, 14:00-15:00

Mathieu Constant (ATILF, Université de Lorraine)

Multiword expression identification and dependency parsing

Multiword expressions (MWEs) are sequences of several words characterised by some degree of non-compositionality, whether at the morphological, lexical, syntactic, semantic and/or pragmatic level. Identifying them is crucial for the various applications of natural language processing.

In this talk, we focus on integrating MWE identification into statistical dependency parsing. After outlining the challenges of automatic MWE identification, we will address two questions: (1) finding the richest possible representation of multiword expressions with respect to syntactic parsing; (2) adapting existing parsing algorithms to jointly predict the lexical and syntactic analysis of a sentence in this representation. In particular, we will present new representations factorised over two dimensions, as well as new parsing algorithms incorporating specific mechanisms for MWE identification.

This presentation is the result of joint work with Marie Candito (Univ. Paris-Diderot), Joseph Le Roux (Univ. Paris-Nord), Joakim Nivre (Uppsala University) and Nadi Tomeh (Univ. Paris-Nord).

Friday 2 Dec 2016, 14:00-15:00 (COUB 01 auditorium)
Seminar co-organised with the Louvain School of Translation and Interpreting (LSTI)

Ruslan Mitkov (RGCL, University of Wolverhampton)

The new generation of translation memories

SECOND TERM

Wednesday 15 Feb 2017, 11:00-12:00
Extraordinary session

Victoria Yaneva (RGCL, University of Wolverhampton)

Do You See What I Mean?
The Use of Eye Tracking Data in Readability and Accessibility Research

Gaze data has received a lot of interest in the NLP community recently, as a means both to evaluate and to induce our models. This is based on findings that eye tracking data reveals important information about the cognitive effort of readers, their level of comprehension and their reading patterns. Gaze data is particularly valuable for studying reading in neurodiverse populations such as people with autism, who are often reported to exhibit idiosyncratic reading strategies and lower comprehension levels.

This talk introduces a collection of parallel gaze data and comprehension scores obtained from readers with autism and a control group of neurotypical participants during a natural reading task. It presents studies using gaze data for document-level and sentence-level readability estimation, comprehension prediction based on gaze (within groups and across groups), as well as how lexical properties influence the cognitive effort required to understand a text. These findings are discussed from the perspective of improving readability and text accessibility for people with autism. We will also open the debate about the hidden misconceptions when using gaze data.

Friday 17 Feb 2017, 14:00-15:00
Seminar co-organised with the Centre d'études sur le Moyen Âge et la Renaissance (CEMR)

Gilles Souvay (ATILF, Université de Lorraine)

LGeRM: a tool for handling earlier stages of French

LGeRM (Lemmes Graphies et Règles Morphologiques, pronounced "elle germe") was originally a lemmatiser designed to handle inflection and spelling variation in medieval French. Its purpose was to facilitate consultation of the Dictionnaire du Moyen Français (1330-1500).

The tool subsequently evolved to process editions of medieval texts in order to help build the glossary. LGeRM glossaire is an online tool for checking the text, correcting lemmatisation errors, disambiguating homographs, selecting the words to gloss and, finally, generating the glossary. The tool thus makes it possible to produce a lemmatised edition online.

The tool has been adapted to handle the language of the 16th-17th centuries, whose inflections and spelling variants differ from those of earlier stages of French. A morphological lexicon is distributed for each of these language stages. These lexicons are used in the Frantext textual database and add value to the older texts of the corpus by allowing queries by lemma.

This talk will present the theoretical concepts behind the tool and show implementations and applications. It will also be an opportunity to present, in addition to LGeRM, two resources developed at ATILF: the DMF and Frantext.

Friday 24 Feb 2017, 14:00-15:00

Leen Sevens (CCL, KU Leuven)

Text-to-Pictograph Translation and Vice Versa for People with Intellectual Disabilities

We describe, demonstrate and evaluate a Text-to-Pictograph translation system that is used in an online platform for Augmentative and Alternative Communication (AAC), which is intended for people who are not able to read and write, but who still want to communicate with the outside world (Vandeghinste et al., 2015). The system is set up to translate from Dutch, English and Spanish text into Sclera and Beta, two publicly available pictograph sets consisting of several thousand pictographs each. We have linked large numbers of these pictographs to synsets or combinations of synsets in WordNets, which are lexical-semantic databases. We also describe the other direction, namely how text is generated from sequences of pictographs (Sevens et al., 2015).

Sevens, L., Vandeghinste, V., Schuurman, I., and Van Eynde, F. (2015). Natural Language Generation from Pictographs. In: Proceedings of the 15th European Workshop on Natural Language Generation (ENLG), pp. 71-75. Brighton. September 2015. Association for Computational Linguistics.

Vandeghinste, V., Schuurman, I., Sevens, L., and Van Eynde, F. (2015). Translating Text into Pictographs. Natural Language Engineering. Cambridge University Press.

Friday 17 Mar 2017, 14:00-15:00

Thomas Drugman (Amazon Development Center Germany)

Active and Semi-Supervised Learning in Automatic Speech Recognition

This presentation focuses on Automatic Speech Recognition (ASR), as used in various Amazon products such as Alexa (Amazon Echo) and FireTV. For such applications, a lot of data is available but only a small portion of it can be labeled.

Because speech data labeling is a time-consuming and hence costly process, it is crucial to find an optimal strategy to select the data to be transcribed via Active Learning (AL). In addition, the unselected data might also be helpful in improving the performance of the ASR system by Semi-Supervised Training (SST).

After an overview of the ASR technology, we will investigate the benefits of jointly applying AL and SST. Our data selection approach relies on confidence filtering, and its impact on the two main ASR modules (acoustic and language models) will be studied. Our results indicate that, while SST is crucial at the beginning of the labeling process, its gains degrade rapidly as AL is put in place. The final simulation reports that AL allows a transcription cost reduction of about 70% over random selection. Alternatively, for a fixed transcription budget, the proposed approach improves the word error rate by about 12.5% relative.
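
As a rough illustration of confidence filtering, the sketch below splits a pool of automatically transcribed utterances into a batch sent for manual transcription (AL) and a batch whose automatic transcripts are reused as they are (SST). The abstract does not specify the selection rule, so everything here is an assumption: the `Utterance` type, the choice to transcribe the least confident utterances first, the 0.9 threshold, and the example data. The helper `relative_wer_improvement` only shows how a "relative" improvement is computed, using made-up numbers.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    utt_id: str
    hypothesis: str    # automatic transcript from the current ASR system
    confidence: float  # utterance-level confidence in [0, 1]


def split_by_confidence(utterances, budget, sst_threshold=0.9):
    """Return (to_transcribe, for_sst).

    Assumed rule: the `budget` least confident utterances go to human
    transcribers (AL); of the rest, only those at or above `sst_threshold`
    keep their automatic transcript as a training label (SST).
    """
    ranked = sorted(utterances, key=lambda u: u.confidence)
    to_transcribe = ranked[:budget]
    for_sst = [u for u in ranked[budget:] if u.confidence >= sst_threshold]
    return to_transcribe, for_sst


def relative_wer_improvement(baseline_wer, new_wer):
    """Relative improvement: the WER reduction divided by the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical pool; identifiers, transcripts and scores are invented.
pool = [
    Utterance("u1", "play some music", 0.97),
    Utterance("u2", "turn of the light", 0.41),
    Utterance("u3", "what is the weather", 0.93),
    Utterance("u4", "set a timer", 0.62),
]

to_transcribe, for_sst = split_by_confidence(pool, budget=2)
print([u.utt_id for u in to_transcribe])  # least confident utterances
print([u.utt_id for u in for_sst])        # confident enough for SST

# Made-up numbers: a WER going from 8.0% to 7.0% is a 12.5% relative gain.
print(relative_wer_improvement(8.0, 7.0))
```

With a relative measure, the absolute gain depends on the starting point: the same 12.5% relative improvement is one absolute point at a WER of 8% but two points at 16%.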

Friday 28 Apr 2017, 14:00-15:00

Damien De Meyere (CENTAL - Social Media Lab, UCL Mons)

The iMediate annotator, a tool for encoding medical records in SNOMED-CT

One of the major challenges currently facing healthcare stakeholders is the deployment of electronic medical record systems. While these systems aim to organise and facilitate access to the information collected throughout a patient's medical history, important information is often scattered across many texts with little or no structure, which makes it hard for software tools such as search engines to exploit. This is the context of the multidisciplinary iMediate project (Innoviris), which aims to develop a set of resources and tools that can be deployed in French-speaking Belgian hospital departments.

This seminar will present the successive stages in the development of the iMediate annotator, which can produce a structured summary of medical texts based on the international SNOMED-CT nomenclature. The software combines a dedicated terminological resource with a flexible extraction algorithm able to account for some of the linguistic variation inherent in any language use. The presentation will also be an opportunity to raise awareness of the many challenges involved in exploiting medical data.