Twenty Years of Learner Corpus Research. Looking Back, Moving Ahead

IL&C Louvain-La-Neuve, Mons

To purchase this volume, visit the website of our publisher, Presses Universitaires de Louvain.

TABLE OF CONTENTS

Katherine ACKERLEY
A comparison of learner and native speaker writing in online self-presentations:
Pedagogical applications

Theodora ALEXOPOULOU, Helen YANNAKOUDAKIS & Angeliki SALAMOURA
Classifying intermediate learner English: A data-driven approach to learner
corpora

Margit BRECKLE & Heike ZINSMEISTER
L1 transfer versus fixed chunks: A learner corpus-based study of L2 German

Julian BROOKE & Graeme HIRST
Native language detection with ‘cheap’ learner corpora

Marcus CALLIES & Ekaterina ZAYTSEVA
The Corpus of Academic Learner English (CALE) – A new resource for the study and assessment of advanced language proficiency

Erik CASTELLO
Integrating learner corpus data into the assessment of spoken interaction in English in an Italian university context

Evelyne CAUVIN
Intonational phrasing as a potential indicator for establishing prosodic learner profiles

Meilin CHEN
Phrasal verbs in a longitudinal learner corpus: Quantitative findings

Pieter DE HAAN & Monique VAN DER HAAGEN
The search for sophisticated language in advanced EFL writing: A longitudinal study

Deise P. DUTRA & Tony Berber SARDINHA
Referential expressions in English learner argumentative writing

Anna ESPUNYA
Investigating lexical difficulties of learners in the error-annotated UPF learner translation corpus

Michael FLOR & Yoko FUTAGI
Producing an annotated corpus with automatic spelling correction

Costas GABRIELATOS
If-conditionals in ICLE and the BNC: A success story for teaching or learning?

Thomas GAILLAT
This and that in native and learner English: From typology of use to tagset characterisation

Francesca GALLINA
The Lexicon of Spoken Italian by Foreigners: A study on the acquisition of vocabulary by L2 Italian learners between measures of lexical richness and lexical fields

Pascale GOUTERAUX
Learners of English and conversational proficiency

Jonė GRIGALIŪNIENĖ & Rita JUKNEVIČIENĖ
Recurrent formulaic sequences in the speech and writing of the Lithuanian learners of English

Hagen HIRSCHMANN, Anke LÜDELING, Ines REHBEIN, Marc REZNICEK & Amir ZELDES
Underuse of syntactic categories in Falko: A case study on modification

Jarmo Harri JANTUNEN & Sisko BRUNNI
Morphology, lexical priming and second language acquisition: A corpus-study on learner Finnish

Régis KAWECKI
A beginner French learner corpus

Elma KERZ
Concessive adverbial clauses in L2 academic writing

Yuichiro KOBAYASHI
A comparison of spoken and written learner corpora: Analyzing developmental
patterns of vocabulary used by Japanese EFL learners

Sun-Hee LEE, Markus DICKINSON & Ross ISRAEL
Corpus-based error analysis of Korean particles

Stéphanie LOPEZ, Anne CONDAMINES & Amélie JOSSELIN-LERAY
An LSP learner corpus to help with English radiotelephony teaching

Cristóbal LOZANO & Amaya MENDIKOETXEA
Corpus and experimental data: Subjects in second language research

Akira MURAKAMI
Cross-linguistic influence on the accuracy order of L2 English grammatical morphemes

Susana MURCIA-BIELSA & Penny MACDONALD
The TREACLE project: Profiling learner proficiency using error and syntactic analysis

Susan NACEY & Anne-Line GRAEDLER
Communication strategies used by Norwegian students of English

Masumi NARITA
The use of articles in Japanese EFL learners’ essays

Barbara Malveira ORFANÒ
Analysing the use of vague language in spoken interlanguage: A corpus-based study of a group of Brazilian university students learning English as a second language

Magali PAQUOT, Hilde HASSELGÅRD & Signe OKSEFJELL EBELING
Writer/reader visibility in learner writing across genres: A comparison of the French and Norwegian components of the ICLE and VESPA learner corpora

Nina RESHÖFT & Linn GRALLA
On the use of spatial prepositions: Differences in L1 and L2 English

Sylvi RØRVIK & Thomas EGAN
Connectors in the argumentative writing of Norwegian novice writers

Christine S. SING
Shell noun patterns in student writing in English for specific academic purposes

Marianne SPOELMAN
The (under)use of partitive objects in Estonian, German and Dutch learners of Finnish

Barbora ŠTINDLOVÁ, Svatava ŠKODOVÁ, Alexandr ROSEN & Jirka HANA
A learner corpus of Czech: Current state and future directions

Misuzu TAKAMI & Naoko AKAHORI
Inappropriate uses of psychological verbs by Japanese learners of English

Hiroko USAMI
Using a learner corpus to improve distractors in multiple choice grammar
questions

Elaine W. VINE
Corpora and coursebooks compared: Category ambiguous words

Nina VYATKINA
Analyzing part-of-speech variability in a longitudinal learner corpus and a pedagogic corpus

Leo WANNER, Margarita ALONSO RAMOS, Orsolya VINCZE, Rogelio NAZAR, Gabriela FERRARO, Estela MOSQUEIRA & Sabela PRIETO
Annotation of collocations in a learner corpus for building a learning environment

Chih-Yi WU, Hintat CHEUNG & Zao-Ming GAO
The adverbialization of BUT in Taiwan EFL writing

 _______________________________________________

Katherine ACKERLEY
A comparison of learner and native speaker writing in online self-presentations:
Pedagogical applications

This paper investigates the language used by both learners and native speakers of English when introducing themselves to peers in an online community, and then goes on to discuss the pedagogical potential of the findings. A small corpus of self-presentations written by 220 first-year students majoring in English at an Italian university was compiled during the 2009-2010 academic year. The learner corpus was compared with a reference corpus consisting of self-presentations produced by native speaker students in higher education in English-speaking countries and posted on online forums. The paper first considers why it is important that language majors aim to write in a way that is appropriate to a given genre, rather than merely focusing on morpho-syntactic accuracy. It then focuses on aspects of divergence between learner and native speaker production, presenting some of the linguistic choices made by learners when presenting themselves to peers. It goes on to discuss how the creation of awareness-raising materials based on the analysis can enhance learning by directing students’ attention towards the differences between their texts and those of native speaker students.

Theodora ALEXOPOULOU, Helen YANNAKOUDAKIS & Angeliki SALAMOURA
Classifying intermediate learner English: A data-driven approach to learner
corpora

We demonstrate how data-driven approaches to learner corpora can support Second Language Acquisition research when integrated with visualisation tools. We employ a visual user interface supporting the investigation of a set of automatically determined features discriminating between pass and fail First Certificate in English (FCE) exam scripts. We illustrate how the interface can support the investigation of individual features. The analysis of the most discriminative features indicates that the development of grammatical categories allowing reference to complex events, referents and discourse relations is a crucial property of the upper-intermediate level.

Margit BRECKLE & Heike ZINSMEISTER
L1 transfer versus fixed chunks: A learner corpus-based study of L2 German

This study deals with the question of what strategies Chinese L2 learners of German follow when starting a declarative sentence in German. The investigation is based on the ALeSKo corpus, a linguistically annotated learner corpus of written German. In previous studies, we observed that the L2 texts show a significant overuse of sentences that start with an information-structural function in comparison to comparable L1 texts. In this paper, we pursue an alternative line of explanation that explores whether the observed difference is due to an overuse of chunks in the L2 texts. We perform a chunk classification and also automatically detect all material copied from the title and the task description – a particular type of chunk. Our findings indicate that although L2 learners use chunks to a substantial degree, an overuse with respect to the beginnings of the sentences could not be confirmed.  

Julian BROOKE & Graeme HIRST
Native language detection with ‘cheap’ learner corpora

We begin by showing that the best publicly available, multiple-L1 learner corpus, the International Corpus of Learner English (Granger et al. 2009), has issues when used directly for the task of native language detection (NLD). The topic biases in the corpus are a confounding factor that results in cross-validated performance that appears misleadingly high, for all the feature types which are traditionally used. Our approach here is to look for other, cheap ways to get training data for NLD. To that end, we present the web-scraped Lang-8 learner corpus, and show that it is useful for the task, particularly if large quantities of data are used. This also seems to facilitate the use of lexical features, which have been previously avoided. We also investigate ways to do NLD that do not involve having learner corpora at all, including double-translation and extracting information from L1 corpora directly. All of these avenues are shown to be promising.

 Marcus CALLIES & Ekaterina ZAYTSEVA
The Corpus of Academic Learner English (CALE) – A new resource for the study and assessment of advanced language proficiency

This paper introduces the Corpus of Academic Learner English (CALE), a Language for Specific Purposes learner corpus that is currently being compiled for the quantitative and qualitative study of advanced learners' written academic English. CALE is designed to comprise seven academic genres produced by learners of English as a foreign language in a university setting and thus contains discipline- and genre-specific texts. The corpus will serve as an empirical basis to produce detailed case studies that examine linguistic determinants of lexico-grammatical variation, i.e. semantic, structural, discourse-motivated and processing-related factors that influence constituent order and the choice of structural variants, but also those that are potentially more specific to the acquisition of L2 academic writing such as task setting, genre and writing proficiency. Another major goal is to develop a set of linguistic criteria for the assessment of advanced proficiency conceived of as "sophisticated language use in context". 

 Erik CASTELLO
Integrating learner corpus data into the assessment of spoken interaction in English in an Italian university context

This paper reports on ongoing research conducted at the University of Padua on the teaching and assessment of spoken interaction in English at level B2 of the Common European Framework of Reference for Languages (CEFR, Council of Europe 2001). The study is mainly based on a small learner corpus (about 18,000 words) composed of transcripts of interactions between second-year English as a Foreign Language (EFL) students recorded during assessment sessions. It presents the context of the interactions, the corpora used and the results of a series of investigations carried out into some pragmatic aspects of the interactions. The paper then explores how these findings can help us to flesh out the construct for ‘Discourse Management’ and, ultimately, to set more reliable scoring criteria.

 Evelyne CAUVIN
Intonational phrasing as a potential indicator for establishing prosodic learner profiles

Prosodic profiles have been extensively used in forensics and language pathology. However, they are as yet rarely used in second language acquisition. The aim of this paper is to show how prosody can be used to define learner profiles, possibly their learning styles and their different cognitive abilities. It is our claim that different segmentation modes of utterances define different prosodic learner profiles and we aim to characterise these. We will show that prosodic profiles of French learners of English can be drawn on the basis of phrasing and that a cluster of prosodic properties corroborates this typology. Our analysis is first based on read speech and the subsequent classifications on recorded interviews of the same speakers. It reveals the limitations of the phonological assessment criteria advocated by the Common European Framework of Reference for Languages (CEFRL) (Council of Europe 2001) and makes a good case for reconsidering them.

Meilin CHEN
Phrasal verbs in a longitudinal learner corpus: Quantitative findings

This study analyses Chinese learners’ use of phrasal verbs from a longitudinal perspective. Through a comparison of the learners’ output of phrasal verbs with that of two groups of native English speakers (American university students and British secondary school leavers), Chinese learners were found to be capable of producing an adequate number of phrasal verbs. Yet, they did not demonstrate appropriate choice of phrasal verbs. The longitudinal data reveal that the learners’ acquisition of phrasal verbs during their three years of study was not always linear. A considerable decrease in the number of phrasal verbs used in the students’ writing in their second year was noticed. No considerable increase in the use of phrasal verbs was observed at the end of their third year. Another important finding of this study is that the American students tend to use far more phrasal verbs than their British and Chinese counterparts.

 Pieter DE HAAN & Monique VAN DER HAAGEN
The search for sophisticated language in advanced EFL writing: A longitudinal study

Even very advanced EFL writing tends to be less sophisticated than native writing. One of the problems seems to be finding the right collocations and the correct register. The aim of this article is to pinpoint what characterizes the development in very advanced Dutch EFL students’ written language production, more specifically the use of appropriate intensifiers. Compared to their native English speaking contemporaries, the Dutch students initially tend to use intensifiers that are found typically in spoken English, such as really and a bit, but these gradually disappear. Alternatively, as students progress, the use of the intensifiers so, quite, and rather, becomes more native-like. A qualitative analysis of a selection of essays written by four individual students shows that some students get more out of academic input than others. 

Deise P. DUTRA & Tony Berber SARDINHA
Referential expressions in English learner argumentative writing

The aim of this paper is to report the findings of our investigation into lexical bundle types in learner argumentative writing. Our data consisted of the International Corpus of Learner English (ICLE), the Louvain Corpus of Native English Essays (LOCNESS), and Br-ICLE, the Brazilian sub-corpus of ICLE. Our classification followed the functional taxonomy proposed by Biber et al. (2004) and expanded by Simpson-Vlach & Ellis (2010). The research methodology included the extraction of 3-, 4- and 5-word bundles followed by manual and automatic categorization in broad categories (referential expressions, stance expressions and discourse organizing functions) as well as 18 specific subcategories (e.g. intangible and tangible framing attributes and quantity specification). Second, the most frequent categories in each corpus were identified. Third, we focused on the most frequent one: referential expressions. Fourth, the chi-square test, cluster analysis and ANOVA were used to detect significant differences across corpora. The subcategories that contributed the most to statistically significant differences across corpora were: specification of intangible framing attributes, identification and focus, and contrast and comparison. The results also show that there is more internal lexical variation of nouns in the intangible framing attribute bundles produced by native than non-native speakers. The conclusions are that referential expressions might need to receive more attention in pedagogical contexts so their discourse functions become more salient to learners.

 Anna ESPUNYA
Investigating lexical difficulties of learners in the error-annotated UPF learner translation corpus

The aim of this article is two-fold. First, it describes the learner translation corpus developed at the Universitat Pompeu Fabra School of Translation and Interpreting (UPF-LTC). A learner translation corpus is a corpus of translations written by students; the UPF-LTC has two search configurations: as a bilingual, sentence-aligned, English-Catalan translation corpus and as a monolingual Catalan translation corpus. It has been annotated both with linguistic information and with error tags according to a set taxonomy of translation errors. The second aim is to illustrate the applications of the corpus for research into the types of translation errors involving lexical use such as false friends and deficient or imprecise lexical choices. The results are relevant not only for the didactics of translation but also for translation-oriented bilingual lexicography.

Michael FLOR & Yoko FUTAGI
Producing an annotated corpus with automatic spelling correction

This paper describes ConSpel, a software system for automatic detection and correction of non-word misspellings. We also present an ongoing research project for constructing an ETS (Educational Testing Service) Spelling Corpus. The corpus consists of essays written by native and non-native speakers of English to the writing prompts of TOEFL® and GRE® tests. Essays are annotated for misspellings by trained annotators, using a semi-automated methodology. An evaluation of the ConSpel system was conducted, using the data from the completed phase of the annotation project. The ConSpel system achieves above 95% accuracy in error detection. The evaluation also indicates that an advanced correction algorithm, which takes into account the local context of misspellings, achieves correction accuracy of 77% and consistently outperforms a baseline context-blind approach.

 Costas GABRIELATOS
If-conditionals in ICLE and the BNC: A success story for teaching or learning?

This paper aims to contribute to the methodological toolbox of “pedagogy-driven corpus-based research” (Gabrielatos 2006), that is, research which is situated at the intersection of language description, pedagogical lexicogrammar, and pedagogical materials evaluation (e.g. Harwood 2005; Hunston & Francis 1998; Kennedy 1992; Owen 1993). The contribution of the present paper mainly lies in proposing a method of triangulating the corpus-based evaluation of lexicogrammatical information in English as a Foreign Language coursebooks, by way of examining a relevant corpus sample of learner written output.

 Thomas GAILLAT
This and that in native and learner English: From typology of use to tagset characterisation

Learner corpus research is now faced with a multiplicity of tagsets. It is therefore difficult to carry out cross-corpus analysis due to the variety of tags used for each part-of-speech (POS). In this paper, we envisage this issue through a specific linguistic point. We propose a typology of uses in both native and non-native corpora. Various tagsets are analysed so as to measure the relevance of the linguistic information provided for this and that. Overall, a comparative analysis of this and that in tagsets is proposed and the benefits and flaws of manual fine-grained annotation versus automatic annotation are assessed. This study comes as a first step towards automated annotation of this and that in various corpora as this process would pave the way to corpus interoperability at POS level.

 Francesca GALLINA
The Lexicon of Spoken Italian by Foreigners: A study on the acquisition of vocabulary by L2 Italian learners between measures of lexical richness and lexical fields

The aim of this paper is to present a corpus-based study of the acquisition of the vocabulary by learners of L2 Italian. The goal of the research is to study the lexical uses of non-native speakers and the processes of lexical acquisition underlying these uses, applying some measures of lexical richness and analysing the lexical fields of the corpus. The informants of the corpus were non-native speakers with different proficiency levels, learning Italian both in Italy and outside of it. The main results show how lexical competence develops above all quantitatively at the beginning and intermediate levels, as well as how it develops qualitatively at more advanced levels in particular. Different learning inputs greatly affect the development of lexical competence: learners acquiring Italian in Italy have a deeper knowledge of the Italian vocabulary compared to learners learning Italian outside of Italy. Regardless of the learning context or proficiency level, the most relevant categories among the lexical fields are those linked to everyday life, whereas those categories linked to more abstract domains are less relevant, but show a higher level of lexical richness compared to categories linked to daily life.

 Pascale GOUTERAUX
Learners of English and conversational proficiency

This study focuses on the inter-relatedness of fluency and complexity as explanatory factors and criteria for the assessment of conversational proficiency within the framework of two current cognitive models. It has been carried out on a cross-sectional corpus of 28 one-to-one conversations between native English teaching assistants and French English as Foreign Language (EFL) university students from the DIDEROT-LONGDALE project. 

Jonė GRIGALIŪNIENĖ & Rita JUKNEVIČIENĖ
Recurrent formulaic sequences in the speech and writing of the Lithuanian learners of English

The present article reports an investigation of recurrent formulaic sequences (FSs) in the speech and writing of Lithuanian learners of English as a foreign language (EFL). Evidence from corpus research has shown that language makes an extensive use of recurrent multi-word units whose successful acquisition contributes to the naturalness of expression and is thus very important in language teaching and learning. The aim of this study is to identify and describe the recurrent FSs in the spoken and written English of Lithuanian EFL learners both quantitatively and qualitatively, and to check whether the current hypothesis that FSs are more frequent in speech than in writing is applicable to the Lithuanian EFL learner language as well. The data for the research comes from the Lithuanian component of the International Corpus of Learner English (ICLE), viz. LICLE, and a pilot version of LINDSEI-LITH, the Lithuanian component of the Louvain International Database of Spoken Interlanguage. The findings of the study show that although the speech of Lithuanian EFL learners is more formulaic than their written language, there is a considerable overlap between spoken and written language in terms of formulaicity. The learners have built a core set of FSs which recur both in speech and writing. The most frequent FSs in writing are expressions of discourse organization while high-frequency FSs in spoken language, which often appear in clusters of several FSs, usually indicate the speaker’s hesitation and uncertainty.

Hagen HIRSCHMANN, Anke LÜDELING, Ines REHBEIN, Marc REZNICEK & Amir ZELDES
Underuse of syntactic categories in Falko: A case study on modification

This paper shows how the automatic syntactic analysis of a corpus of advanced learners of German as a foreign language helps in understanding the acquisition of modification. In former corpus research modification has been studied only by comparing the distributions of single words (or groups of words) in learner and native speaker data. We argue that in order to study modification as a syntactic category it is necessary to work with syntactically analyzed corpora. In this vein, we sketch out our approach to parsing learner language and conduct two contrastive interlanguage studies on modification in the syntactically annotated corpus, showing that not only lexical modifiers can be underused (as shown in many other studies), but that modification as a whole category (including multi-word modifiers such as prepositional phrases, and clausal modifiers such as relative clauses) is underused in our learner corpus data.  

Jarmo Harri JANTUNEN & Sisko BRUNNI
Morphology, lexical priming and second language acquisition: A corpus-study on learner Finnish

The present article discusses morphological priming in the context of second language acquisition. Morphological priming is a characteristic of both the core and cotextual items in a phraseological unit. It occurs when a word is repeatedly encountered in certain inflectional forms. Similarly to lexical priming on the whole (e.g. collocations and other cotextual qualities), it poses challenges for language learners. The paper focuses on atypicalities in morphophonological forms and, in addition, describes errors in inflection. It is hypothesized that learners of Finnish have problems in morphological priming, and that learners whose mother tongue is closely related to the target language and has inflection produce more target-language-like phraseological units.

 Régis KAWECKI
A beginner French learner corpus

This paper introduces the beginner French learner corpus built at the Centre for Language Learning at the University of the West Indies in Trinidad and Tobago. The primary objective of this project is to improve the way French is taught in this particular Caribbean context. It is original in the sense that it targets learners with a low or intermediate proficiency in French. Since it was collected during a period of two and a half years, the corpus allows for both longitudinal and same-level studies. The interlanguage associated with this specific population of students shows the influence played by the L1 (English) that is sometimes reinforced by that of another prevalent L2 (Spanish). The learners’ productions also point to the strong impact that the textbooks and pedagogical approach to language teaching have on the students’ written production. This research project calls for adapting the teachers’ pedagogy and textbooks in order to help these beginner learners write more accurately and originally right from the beginning of instruction.

Elma KERZ
Concessive adverbial clauses in L2 academic writing

In a recent study, Wulff & Gries (2011) put forward the constructionist definition of accuracy in L2 production as the selection of a construction in its preferred context within a particular target variety and genre. By focusing on the use of concessive adverbial clauses in L2 academic writing, the current study takes up this definition of accuracy in L2 production and sets out to explore whether, and to what extent, the ‘genre-specific construction’ (i.e. genre-specific repository of symbolic form-function alignments) of advanced German learners of academic English is similar/different to that of native expert academic writers of English. To this end, all instances of concessive adverbial clauses were extracted from a 216,418 word-token learner corpus and coded for the various factors proposed in the literature. For comparison purposes, a data set of all relevant data points was distilled from a native expert corpus of the same size and annotated in terms of the same factors. The two annotated data sets were then submitted to a Hierarchical Configural Frequency Analysis (Gries 2009). A comparison of the findings revealed a slightly different set of ‘entrenched’ adverbial concessive clauses in the learner corpus, suggesting that the learners’ genre-specific panoply of certain constructional types is still not fully established. In accordance with Wulff & Gries (2011), the findings presented here give support to a usage-based constructionist approach as a promising and viable way of measuring accuracy in L2 production. 

Yuichiro KOBAYASHI
A comparison of spoken and written learner corpora: Analyzing developmental
patterns of vocabulary used by Japanese EFL learners

The purpose of this study is to compare the spoken and written language of Japanese learners of English. The main focus is on the developmental patterns of vocabulary in the different production modes. Two types of learner data were compared in this study. The spoken data were extracted from the National Institute of Information and Communications Technology Japanese Learner English Corpus (NICT JLE Corpus), and the written data were extracted from the Japanese EFL Learner Corpus (JEFLL Corpus). The approach adopted in this research has three characteristics. First of all, it is corpus-based. Second, it focuses on very common word-types. Third, it is based on multivariate analysis. Using these 100 common word-types, I will conduct a correspondence analysis in order to explore complex interrelationships between the word-types and subcorpora in the spoken and written data. The result of this study shows a contrast between spoken and written data as well as a contrast between novice and advanced learners.

Sun-Hee LEE, Markus DICKINSON & Ross ISRAEL
Corpus-based error analysis of Korean particles

We discuss the development of a corpus of learner Korean, performing an error analysis of particle usage with it. Although the corpus was largely developed for the evaluation of natural language processing (NLP) systems – as discussed in Lee et al. (2012) – there are two major design decisions which affect the use of the corpus and its annotation for qualitatively and quantitatively studying learner behavior and which have not been fully discussed before. First is the composition of the corpus, specifically what learner data to include. Second is how we define grammaticality, a particularly thorny problem for error annotation of Korean particles, which are, to some extent, optional. After explaining the nuances of particles in Korean in general, we turn to these two issues and then provide an error analysis, showing the differential error patterns between heritage and non-heritage learners. In particular, particle omission rates differ, illustrating the importance of clearly defining grammaticality for (sometimes) optional elements, both for annotation and for pedagogy.

 Stéphanie LOPEZ, Anne CONDAMINES & Amélie JOSSELIN-LERAY
An LSP learner corpus to help with English radiotelephony teaching

The French Civil Aviation University (ENAC) is in charge of the French controllers’ initial training in English and therefore has specific needs in terms of English radiotelephony teaching. Consequently, an observation of the usage of English made by French controllers with international pilots, that is to say ongoing foreign language learners, was initiated. The aim of this project is to describe and categorise the different uses of English within pilot-controller communications by means of a comparative study between two corpora. The ultimate purpose of this comparative analysis is foreign language (English for Specific Purposes) teaching.

Cristóbal LOZANO & Amaya MENDIKOETXEA
Corpus and experimental data: Subjects in second language research

This paper shows how corpus and experimental data can be combined to gain an insight into the processes that shape and constrain second language (L2) acquisition, by focusing on the L1 Spanish – L2 English acquisition of preverbal vs. post-verbal subject position: S-V vs. (XP-)V-S. The initial corpus study (Lozano & Mendikoetxea 2010) revealed that subject position in L1 Spanish – L2 English is constrained by the same principles as in native English (verb type, information structure and phonological weight), but learners show difficulties with the preverbal XP constituent: even advanced learners overuse it as the generic expletive (It occurred many important events) or omit XP (i.e., they use Ø as in Exist other means of obtaining money), while the use of there with verbs other than be is highly limited (There exist about two hundred organizations). To (dis)confirm these corpus findings, a follow-up online experiment was designed to test learners’ (N=250) knowledge of the preverbal XP element in XP-V-S structures whose design was structurally similar to those produced in the corpora (Ø/it/there/PP-V-S). The experimental results show a very robust pattern, which mostly confirms the corpus results. In the conclusion we advocate for the combined use of naturalistic and experimental data in a cyclic fashion.

Akira MURAKAMI
Cross-linguistic influence on the accuracy order of L2 English grammatical morphemes

Contrary to the accepted notion of the ‘natural order’, which posits a fixed L2 acquisition order of English grammatical morphemes, Luk & Shirai (2009) reviewed the literature and argued that the order may differ depending on learners’ L1. The present study empirically investigates whether the accuracy order of L2 English grammatical morphemes varies across L1 groups. By targeting over 3,000 essays across seven L1 groups in the Cambridge Learner Corpus, the study computed the accuracy of six morphemes in each L1 group and clustered them through statistical bootstrapping. The study then compared the accuracy order of the morphemes between L1 groups and demonstrated clear L1 influence. Overall, the groups whose L1s do not obligatorily mark the morpheme tend to have a lower accuracy order with respect to the morpheme compared to those whose L1s mark it. This was particularly the case for articles.

Susana MURCIA-BIELSA & Penny MACDONALD
The TREACLE project: Profiling learner proficiency using error and syntactic analysis

This article describes ongoing research within the TREACLE project. TREACLE aims to profile the specific grammatical skills of Spanish university learners of English at various proficiency levels, and, on the basis of these profiles, develop proposals for re-designing curriculum and teaching materials particularly focused on the real needs of Spanish students at distinct proficiency levels. To this end, we are developing a methodology for grammatical profiling of proficiency levels using learner corpora. Some approaches (e.g. Dagneaux et al. 1998) have explored grammatical competence of learners by looking at the errors they make at each proficiency level. However, we believe that to get a clear picture of learner competence, we need to measure not only what they do wrong (errors), but also what they do right. We thus take a two-pronged approach, involving automatic syntactic tagging of the corpus to see what structures students are attempting, and manual error annotation to see what they do wrong. This paper presents our approach and reports on some preliminary results in profiling provided by our combined approach.

Susan NACEY & Anne-Line GRAEDLER
Communication strategies used by Norwegian students of English

This paper investigates the use of communication strategies by Norwegian learners of English, based on transcribed interviews recorded as part of the Louvain International Database of Spoken English Interlanguage (LINDSEI) (Gilquin et al. 2010). The data consists of 380 instances of communication strategies which have been categorized according to a taxonomy compiled from various pre-existing taxonomies of such strategies. The study reveals that the learners resort to achievement strategies in 96% of the cases. Among the achievement strategies, L2-based strategies are the most common, which makes sense considering the learners’ fairly high competence level in English. A substantial number of instances of L1-based strategies, such as code switching, can be attributed to the fact that the interviewers understand Norwegian perfectly despite being native speakers of English. This strategy type thus contributes positively to fluency, rather than disrupting communication. Other aspects that are analyzed include the tendency for different strategy types to occur in clusters, and the success of different types of cooperation strategies, where the learner implicitly or explicitly appeals to the interviewer for assistance.

Masumi NARITA
The use of articles in Japanese EFL learners’ essays

This paper explores how article use changes according to the development of L2 writing proficiency. Argumentative essays were collected from 61 Japanese EFL learners who were in their first year at Tokyo International University. Writing proficiency was evaluated on the basis of the essay scores given by two human raters and all the essays were manually annotated with descriptions of article errors. Detailed analyses of article errors showed that omission-type errors were prevalent, regardless of the writing proficiency, and that the present L2 learners tended to make fewer errors as their writing skill level became higher. The definite article presented a far more complex picture of article acquisition in L2 than did the indefinite articles and remained problematic even for advanced-level learners. The present learner corpus-based study revealed non-uniform aspects of L2 development in article use, necessitating further qualitative investigation.

Barbara Malveira ORFANÒ
Analysing the use of vague language in spoken interlanguage: A corpus-based study of a group of Brazilian university students learning English as a second language

This paper will look at the issue of vague language, and in particular at vague category markers (VCMs), comparing a learner corpus of a group of Brazilian university students with a sub-corpus from the Santa Barbara Corpus of Spoken American English. Fundamental to the analysis is the backdrop of research within the field of corpus linguistics, discourse analysis, and pragmatics. Analytical tools from the area of corpus linguistics will be employed, as well as insights from the areas of pragmatics, discourse analysis and conversation analysis. By comparing and contrasting the VCMs in the two corpora, the analysis aims to determine which forms are most prevalent in the data and whether their uses and functions are the same.

Magali PAQUOT, Hilde HASSELGÅRD & Signe OKSEFJELL EBELING
Writer/reader visibility in learner writing across genres: A comparison of the French and Norwegian components of the ICLE and VESPA learner corpora

Previous studies have shown that learner writing is often characterized by a more involved style than the writing of their native peers, as evidenced by a high number of writer/reader (W/R) visibility features such as first and second person pronouns, let’s imperatives, epistemic modal adverbs (e.g. certainly, maybe) and questions (cf. e.g. Petch-Tyson 1998; Altenberg & Tapper 1998). The aim of this study is to analyse French and Norwegian learners’ use of W/R visibility features across genres to investigate whether learners are generally more overtly present within their academic writing or whether the features commonly attributed to EFL learners’ involved style are prompted by the argumentative type of texts that has usually been analysed in learner corpus research. We compare argumentative texts from the International Corpus of Learner English (ICLE) and discipline-specific texts from the Varieties of English for Specific Purposes dAtabase (VESPA). Results show that, when compared to native speakers’ writing within the same discipline, texts produced by French and Norwegian learners display an overuse of W/R visibility features. There are, however, generally fewer features of W/R visibility in the discipline-specific texts, thus suggesting that learners adapt to genre requirements to some extent. 

Nina RESHÖFT & Linn GRALLA
On the use of spatial prepositions: Differences in L1 and L2 English

This study looks at the different ways in which spatial prepositions are used by native speakers and German learners of English. It is based on learner and reference corpora containing elicited written narratives of a wordless picture story (Mayer 1969). A Contrastive Interlanguage Analysis reveals that native speakers of English used spatial prepositions significantly more often in dynamic contexts than did German learners of English. Moreover, we found considerable differences in the use of behind and over. A closer look at the contexts in which these two prepositions occurred showed that they were mainly used for descriptions of the same scene. The results suggest that certain events are conceptualized in different ways by speakers of English and German. We will discuss L1 transfer and teaching-induced factors as possible explanations for these differences and propose to incorporate corpus methodology into the foreign language classroom.

Sylvi RØRVIK & Thomas EGAN
Connectors in the argumentative writing of Norwegian novice writers

This paper investigates the use of two categories of connectors, i.e. coordinating conjunctions and adverbial conjuncts, in argumentative texts written in English by Norwegian novice writers. These are compared to texts from four other text categories: texts written by expert L1 writers of English and Norwegian and texts written by L1 novice writers of English and Norwegian. The investigation is carried out according to the principles of the Integrated Contrastive Model. The results show that there is no transfer from Norwegian to English in the use of connectors by the L2 novice writers. However, novices whose L1 is Norwegian do overuse connectors, both when writing in their L1 and when writing in English. Further research is required to determine why this should be the case.

Christine S. SING
Shell noun patterns in student writing in English for specific academic purposes

This case study examines the uses of a particular class of English abstract nouns in two learner corpora. These ‘shell nouns’ (Schmid 2000) are typified by their occurrence in specific lexico-grammatical patterns, which trigger co-interpretation. This reliance on contextual information for meaning interpretation implies that the shell nouns themselves tend to be semantically underdetermined and require postnominal clauses to fill the conceptual shells they provide with content. It was hypothesized that two groups of students, using English L1 vs. L2 in writing in English for specific purposes (ESP), would both rely on the patterns, with L1 writers showing more aptitude for choosing a wider range of prepositions to use with the nouns in question. In part, the results corroborate Schmid’s (2007) earlier finding while also suggesting a number of learner specific variables.

Marianne SPOELMAN
The (under)use of partitive objects in Estonian, German and Dutch learners of Finnish

The use of the partitive case has often been acknowledged as problematic for L2 learners of Finnish. The current study, focusing on Estonian, German and Dutch learners of Finnish as a foreign language, therefore aimed to explore the influence of prior linguistic knowledge on learners' (under)use of partitive objects. Research materials were selected from the International Corpus of Learner Finnish and aligned with the CEFR proficiency scales. Considering the outcomes, the finding that partitive object underuse errors occurred significantly less frequently in the Estonian learner corpus than in the other learner corpora (positive L1 influence resulting from the similarities between object case-marking in Finnish and Estonian) seemed to be contradicted by the frequent replacement of partitive singular by nominative singular objects in all learner corpora, including the Estonian learner corpus. However, it was shown that the underuse errors of this type observed from the Estonian learner corpus particularly reflected negative influence of L1 morphology (triggered by phonological similarity between Finnish nominative singular and Estonian partitive singular forms), while those observed from the other learner corpora indicated a tendency to leave the object uninflected for the sake of simplification. The rapid decrease of this underuse error category with increasing L2 proficiency observed from all learner corpora provides supporting evidence suggesting an inverse relation between L2 proficiency and the likelihood of simplification strategies or negative L1 influence to occur.

Barbora ŠTINDLOVÁ, Svatava ŠKODOVÁ, Alexandr ROSEN & Jirka HANA
A learner corpus of Czech: Current state and future directions

The paper describes CzeSL, a learner corpus of Czech as a Second Language, together with its design properties. We start with a brief introduction of the project within the context of AKCES, a programme addressing Acquisition Corpora of Czech; in connection with the programme we are also concerned with the groups of respondents, including differences due to their L1; further we comment on the choice of the sociocultural metadata recorded with each text and related both to the learner and the text production task. Next we describe the intended uses of CzeSL. The core of the paper deals with transcription and annotation. We explain issues involved in the transcription of handwritten texts and present the concept of a multi-level annotation scheme including a taxonomy of captured errors. We conclude by mentioning results from an evaluation of the error annotation and presenting plans for future research.

Misuzu TAKAMI & Naoko AKAHORI
Inappropriate uses of psychological verbs by Japanese learners of English

By ‘psychological verbs’ here are meant verbs like astonish, comfort, disappoint, excite, interest, please, satisfy, surprise and thrill, which refer to a change in the mental/emotional state of a person. Data from a learner corpus clearly show that there are two types of inappropriate uses of these verbs made by Japanese learners of English. Firstly, verbs are used intransitively rather than transitively, as in ‘I surprised at the news’ instead of ‘I was surprised at the news’. Secondly, Japanese learners of English rarely use the inanimate-subject construction of the type ‘The news surprised me’. These two characteristically unnatural features on the part of Japanese learners of English are correlated and are apparently derived from the fact that psychological processes are construed differently by English and Japanese speakers. Considering the construal differences, we compare the sentences involving psychological verbs in the Japanese sub-corpus of the International Corpus of Learner English (ICLE-JP) with those from a collection of English essays written by American university students, the Louvain Corpus of Native English Essays (LOCNESS), focusing on the differences between the two corpora in the ratio of psychological verbs used in active and passive voice, and in animate and inanimate subjects followed by these verbs.

Hiroko USAMI
Using a learner corpus to improve distractors in multiple choice grammar questions

This paper shows that multiple choice questions containing distractors based on frequent Japanese learners’ errors are more effective for the acquisition of English as a foreign language than traditional non-corpus-based questions. The two types of tests have been compared using a concrete sample question. Statistical analyses of item facility and discrimination index are provided; a detailed analysis of the distractors is also given.

Elaine W. VINE
Corpora and coursebooks compared: Category ambiguous words

Three types of corpora are drawn on in an investigation of category ambiguity in high frequency words: general English corpora, learner English corpora and a corpus of English language teaching coursebooks. Four words are analysed and discussed: down, like, round, up. Variation is found within and across the corpora, which gives rise to discussion of some pedagogical implications. 

Nina VYATKINA
Analyzing part-of-speech variability in a longitudinal learner corpus and a pedagogic corpus

This study investigates the development of part-of-speech variety in the writing of a cohort of beginning college-level learners of German over three semesters of study in comparison with the pedagogical input they received from their workbook. The study fills existing gaps in Second Language Acquisition research by targeting beginner learners of German as a foreign language, analyzing semi-automatically annotated corpora (a learner corpus and a corresponding workbook corpus), and eliciting learner data over a long period of time at dense time intervals. As a result, it presents a developmental Second Language (L2) profile of the target learner population in terms of verb classes and verb morphology. The study shows how participants gradually enrich their verb form repertoire, both in accordance with and diverging from the pedagogical input they receive.

Leo WANNER, Margarita ALONSO RAMOS, Orsolya VINCZE, Rogelio NAZAR, Gabriela FERRARO, Estela MOSQUEIRA & Sabela PRIETO
Annotation of collocations in a learner corpus for building a learning environment

Collocations in the sense of idiosyncratic lexical co-occurrences are one of the main barriers and challenges for any second language (L2) learner. In Computer Assisted Language Learning (CALL), a number of works deal with the automatic recognition of collocation errors and compilation of candidate lists for their correction. However, this is not sufficient. Firstly, to obtain a clear picture of the difficulties experienced by learners in order to be able to offer targeted aid to learners, a fine-grained linguistic analysis of collocation errors and their annotation in learner corpora is necessary. Secondly, programs must be developed that make concrete correction suggestions, besides providing correction candidate lists, and supply a learner with illustration and didactic material that is oriented towards the types of collocations with which this learner has difficulties. In our work, we attempt to push the state of the art one step further in both of these strands of research, focusing on Spanish as L2. Within the first strand, we carry out a detailed collocation-oriented annotation of a fragment of the corpus of learners of Spanish (CEDEL2). Within the second strand, we experiment with a number of strategies for choosing the most likely correction of a collocation error. 

Chih-Yi WU, Hintat CHEUNG & Zao-Ming GAO
The adverbialization of BUT in Taiwan EFL writing

In EFL writing, BUT occurs regularly in sentence-initial position, a non-standard use that is prescriptively prohibited in written texts. In the present study, we use a corpus approach to first identify the frequency pattern of the positional variation of BUT in learners’ writing. The pattern is then analyzed in terms of its variance between different proficiency levels to see if there is any developmental difference. We also compare the learners’ positional pattern with native speakers’ usage with reference to the British National Corpus. The results show that the positional pattern of BUT in EFL writing resembles that in native speakers’ oral (rather than written) production. As we further examine the use of sentence-initial BUT, we detect grammatical and functional features that point to the adverbialization of BUT. We further argue that such adverbialization is discourse-motivated, which in part explains the increasing colloquialism in EFL writing. As revealed, the lack of register awareness continues to be a major problem for Taiwan EFL learners.