In the framework of a contrastive approach to fluency, the recordings that form the corpus will be selected according to the following criteria: modality (spoken language vs. sign language); language (French, English); type/genre (scripted vs. unprepared speech, monologue vs. dialogue). Due to the limitations of annotation work (see below), it will not be possible to annotate all the (dis)fluency phenomena in each of the corpora, but comparable sub-corpora will be constituted for different languages and modalities.
For each criterion, the sub-corpus will include enough speakers to observe individual variation and idiosyncratic behaviors,
in line with the hypothesis that there are various fluency profiles (Götz 2011).
Our approach to fluency focuses on three crucial properties that have an impact on corpus annotation.
First, “a single measure taken in isolation is not necessarily a reliable indication of proficiency, so that overall fluency is best measured as a group of features” (Osborne 2011: 255, cited by Götz 2011: 141).
Second, the same feature (e.g. a filled pause, a discourse marker) may be a cue to fluent or disfluent speech, depending on its location, function or
frequency.
Third, we aim to contrast fluency markers across modalities (spoken vs. sign languages), languages and genres.
Consequently, we need a multi-level annotation scheme (prosody, lexis, grammar, discourse) adaptable to each language and modality, and a flexible
framework for integrating annotations from different tools, and even different tag sets (Chiarcos et al. 2008).
Working on various languages and modalities requires multiple annotation schemes with “either different terms [being] used for the same phenomenon, or
the phenomenon [being] conceptualized in different ways” (Chiarcos et al. 2008: 222).
The corpus annotation WP takes this conceptual complexity into account by linking annotation sets and reference concepts (meta-tags) within an explicit
ontology. As a result, it will be possible to search across heterogeneously annotated data by means of simple queries formulated in
the query language of the database.
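The linking of tool-specific tags to shared meta-tags can be illustrated with a minimal Python sketch. The tag-set names (`scheme_a`, `scheme_b`), the tags and the meta-tag labels below are hypothetical, chosen only for illustration; the actual ontology and the database query language are not shown here.

```python
from dataclasses import dataclass

# Hypothetical mapping from (tag set, tag) pairs to shared meta-tags.
# In the project this role would be played by the explicit ontology.
META_TAGS = {
    ("scheme_a", "FP"): "filled_pause",
    ("scheme_b", "euh"): "filled_pause",
    ("scheme_a", "DM"): "discourse_marker",
    ("scheme_b", "MD"): "discourse_marker",
}


@dataclass(frozen=True)
class Annotation:
    tagset: str   # name of the annotation scheme (assumed)
    tag: str      # tag as written in that scheme
    start: float  # start time in seconds
    end: float    # end time in seconds


def find_by_meta_tag(annotations, meta_tag):
    """Return all annotations whose tag maps to the given meta-tag."""
    return [
        a for a in annotations
        if META_TAGS.get((a.tagset, a.tag)) == meta_tag
    ]


# Two recordings annotated with different tag sets.
annotations = [
    Annotation("scheme_a", "FP", 1.20, 1.55),
    Annotation("scheme_b", "euh", 3.10, 3.40),
    Annotation("scheme_b", "MD", 4.00, 4.35),
]

# One query retrieves filled pauses regardless of the original scheme.
hits = find_by_meta_tag(annotations, "filled_pause")
```

Under these assumptions, a single search on the meta-tag returns the filled pauses from both schemes, although each scheme names them differently.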
From a technical point of view, annotations will be stored in a stand-off, time-aligned format and imported into a database such as the open-source linguistic
information system ANNIS (Chiarcos et al. 2008), which:
● supports audio and video annotations created in different tool formats;
● helps to visualize multi-level annotations;
● allows the user to query simultaneously on different levels of annotations.
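To make the idea of stand-off, time-aligned annotation and multi-level querying more concrete, here is a minimal Python sketch. It does not use ANNIS or its query language; the tier names, labels and time values are hypothetical, and the query simply pairs spans from two tiers that overlap in time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    tier: str     # annotation level, e.g. "lexis" or "prosody" (assumed names)
    label: str    # annotation value on that tier
    start: float  # start time in seconds, relative to the recording
    end: float    # end time in seconds


def overlaps(a, b):
    """True if the two spans share some stretch of time."""
    return a.start < b.end and b.start < a.end


def query(spans, tier_a, label_a, tier_b, label_b):
    """Pair spans from two tiers that carry the given labels and overlap in time."""
    left = [s for s in spans if s.tier == tier_a and s.label == label_a]
    right = [s for s in spans if s.tier == tier_b and s.label == label_b]
    return [(a, b) for a in left for b in right if overlaps(a, b)]


# Stand-off annotations: each span points to the signal by time only,
# so tiers produced by different tools can coexist.
spans = [
    Span("lexis", "filled_pause", 1.20, 1.55),
    Span("lexis", "filled_pause", 6.00, 6.30),
    Span("prosody", "lengthening", 1.40, 1.70),
    Span("discourse", "turn_start", 5.90, 6.50),
]

# Filled pauses that co-occur with a prosodic lengthening.
matches = query(spans, "lexis", "filled_pause", "prosody", "lengthening")
```

Because every span refers to the recording only through its time stamps, adding a new tier does not require changing the existing ones, which is the main appeal of a stand-off design.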
The data annotation procedure will concern the fluencemes identified in WP1.
Promoters:
C. Fairon & A.C. Simon
Researchers involved:
S. Roekhaut