Learner corpora around the world

CECL

 

This list is still a work in progress. We would like it to be as comprehensive as possible. If you have a learner corpus or know of one that is not listed on this webpage, send a message to Magali Paquot and we'll add it to the list. We hope you will find the list useful for your research!

The list below only contains learner corpora, i.e. electronic collections of continuous written or spoken data produced by foreign or second language learners.
Learner corpus-based datasets (treebanks, error lists, etc.) are listed separately.

To refer to this list:

Centre for English Corpus Linguistics (date of access): Learner Corpora around the World. Louvain-la-Neuve: Université catholique de Louvain. https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html

 

© 2009, Université catholique de Louvain

Learner corpora

Corpus | Target language | First language | Medium | Text type/task type | Proficiency level | Size in words | Project director | Availability
The Arabic Learner Corpus
(ALC)
Arabic 66 languages written and spoken Narrative and discussion Intermediate and advanced

written:
c. 283,000

audio:
c. 3h30

Abdullah Alfaifi & Eric Atwell

Available
The Pilot Arabic Learner Corpus Arabic English written Narrative Intermediate and advanced c. 9,000 Ghazi Abuhakema
Reem Faraj
Anna Feldman
Eileen Fitzpatrick
Montclair State University, USA
 
The Jinan Chinese Learner Corpus
(JCLC)
Chinese 50 languages written Exams and assignments Beginners, intermediate and advanced

c. 6 m. Chinese characters

c. 9,000 texts

Maolin Wang
Shervin Malmasi
Mingxuan Huang

 
Croatian Learner Text Corpus (CroLTeC)  Croatian 36 languages (Afrikaans, Arabic, Bulgarian, Catalan, Czech, Danish, German, English, Estonian, Persian, Finnish, French, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Lari, Mandinka, Dutch, Norwegian, Polish, Portuguese, Russian, Slovak, Slovenian, Spanish, Albanian, Swedish, Thai, Turkish, Ukrainian, Vietnamese, Chinese, Malay) written exam essays, argumentative and literary essays, letters, diaries, picture descriptions, book reviews, short dialogues, etc. A1-C2 c. 1 million Nives Mikelic Preradovic, University of Zagreb, Croatia Freely available
The AKCES/CZESL corpus
(Acquisition corpora of Czech/Czech as a second language)
Czech Various written and spoken Student essays and
interviews
Various 2 m. Karel Sebesta
Charles University in Prague
Technical University in Liberec, Czech Republic
Available
Leerdercorpus Nederlands als Vreemde Taal Dutch French written       Liesbeth Degand
Université catholique de Louvain, Belgium
 
Arab Learner English Corpus (ALEC) English Arabic written Essays written by freshman students as part of first level college writing course University students (second language learners)
Analysis 184749 
Narrative 67527 
Synthesis 66015 
Argumentation 192298
Dr. Inas Mahfouz, American University of Kuwait (imahfouz@auk.edu.kw)
https://dspace.auk.edu.kw/handle/11675/1757

The Aachen Corpus of Academic Writing
(ACAW)

English German written Academic research writing Advanced

c. 240,000 words

c. 225,000 words (L1 component)

Elma Kerz, RWTH Aachen University Under development
The Advanced Learner English Corpus
(ALEC)
English Mainly Swedish written Essays written by university students of English linguistics and English literature Advanced c. 1,3 m. Tove Larsson, Uppsala University Not freely available
The ANGLISH corpus English French spoken Readings of texts and sentences, spontaneous oral language. Various c. 5h30 Anne Tortel
University of Provence, France.
 Freely available
Asao Kojiro’s Learner Corpus Data English Japanese written Essays and stories written or reproduced by Japanese college students.     Asao Kojiro Texts available for download
The Barcelona English Language Corpus
(BELC)
English Spanish
Catalan
spoken and written

4 tasks:
Written composition
Oral narrative
Oral interview
Role-play

Longitudinal data (children and young adults learning English)

 Various   Carmen Muñoz
University of Barcelona, Spain
 
The BATMAT Corpus English Swedish
Finnish
written BA dissertations
MA dissertations
Advanced c. 2,5 m. (expanding) Signe-Anita Lindgrén, English language and literature, Åbo Akademi University, Finland Under development
Belarussian Learner Corpus of English (BELLCE) English Russian; Belarussian written argumentative essays High intermediate to advanced unknown Anastasia Rakhuba  
The Bilingual Corpus of Chinese English Learners
(BICCEL)
English Chinese spoken and written

Spoken: National Oral English test.

Written: in-class assignments

  c. 2 m. Wen Qiufang
National Research Center for Foreign Language Education Beijing Foreign Studies University, China
 
The British Academic Written English (BAWE) corpus English

Mainly L1 speakers

Also includes data produced by L2 speakers

written ESP papers

4 levels of study (from undergraduate levels to final year and taught masters level)

 

c. 6,5 m. Hilary Nesi
Sheena Gardner
Warwick, UK
Paul Thompson
University of Birmingham, UK
Paul Wickens
Oxford Brookes, UK
baseplus@warwick.ac.uk

The BAWE corpus can be accessed through the corpus analysis interface, Sketch Engine.

A prototype interface that allows filtered searching of the BAWE corpus files is available.

The BUiD Arab Learner Corpus (BALC) English Arabic written School examination essays Various c. 290,000 Mick Randall
The British University in Dubai,
United Arab Emirates
Nicholas Groom
University of Birmingham, UK
At present, copies of the current version of the corpus are available on request from mick.randall@buid.ac.ae
The Cambridge Learner Corpus (CLC) English Various written Exam scripts Various c. 50 m. Cambridge University Press and Cambridge ESOL, UK Commercial
The Corpus of Academic Learner English
(CALE)
English German written Various academic text types that are typically produced in university courses of English, e.g. term papers, reading reports, research plans, abstracts, reviews, and summaries. Advanced under development Marcus Callies
University of Bremen, Germany
 
The Corpus of English Essays Written by Asian University Students (CEEAUS) English Various written Student essays Various c. 200,000 Shin Ishikawa
Kobe University, Japan
Freely downloadable from the website
The Chinese Academic Written English corpus
(CAWE)
English Chinese written Dissertations written by Chinese undergraduates majoring in English linguistics or applied linguistics.   c. 400,000 David Yong Wey Lee
City University of Hong Kong, Hong Kong
 
The Chinese Learner English Corpus
(CLEC)
English Chinese written   Various c. 1 m. Gui Shichun
Guangdong University of Foreign Studies & Yang Huizhong, Shanghai Jiao Tong University, China
The corpus can only be accessed by users in the Department of English at HKPU.
The City University Corpus of Academic Spoken English (CUCASE) English

Chinese

Also includes data produced by L1 speakers

multimedia     c. 2 m. David Yong Wey Lee
City University of Hong Kong, Hong Kong
 
The Cologne-Hanover Advanced Learner Corpus (CHALC) English German written term papers and essays Advanced c. 210,000 Ute Römer
University of Michigan, USA
 
The College Learners’ Spoken English Corpus
(COLSEC)
English Chinese spoken National spoken English test for non-English majors.   c. 700,000 Yang and Wei  
The Corpus Archive of Learner English in Sabah/Sarawak (CALES) English Malay written Argumentative essays Various c. 400,000 Simon Botley@Faizal Hakim
Doreen Dillah
Universiti Teknologi MARA Sarawak, Malaysia
 
CORpus del ESPañol de los Italianos (CORESPI) Spanish Italian Written Written compositions A1 to B2 c. 125,000

Sonia Bailini
sonia.bailini@unicatt.it
Università Cattolica del Sacro Cuore, Milan, Italy

Online access

CORpus del ITaliano de los Españoles (CORITE) Italian Spanish Written Written compositions A1 to B2 c. 103,000 Sonia Bailini
sonia.bailini@unicatt.it
Università Cattolica del Sacro Cuore, Milan, Italy

Online access

The Corpus of Business Letters English Italian written

Tagged part: BEC1 writing tests (letters, emails, faxes, memos, reports)

Untagged part: business writing exam tests

  c. 32,000 Anna Romagnuolo  
The Corpus of Multilingual Opinion Essays by College Students (MOECS) English varied written opinion essays college students unknown Megumi Okugiri available
Corpus of writing, pronunciation, reading, and listening by learners of English as a Foreign Language English Japanese written and spoken varied beginners to advanced 29h audio + 30,000 words Katsunori Kotani  
The Corpus of Young Learner Interlanguage (CYLIL) English

Dutch
French
Greek
Italian

spoken English L2 data elicited from European School pupils.
Longitudinal data
Various c. 500,000 Alex Housen
Vrije Universiteit Brussel, Belgium
 
The Eastern European English learner corpus English Russian
Ukrainian
Polish
Slovak
spoken Spontaneous spoken production data elicited by means of a semi-structured interview Various c. 60,000 Elena Salakhian
Eberhard Karls University of Tübingen, Germany
 
The EFL Teacher Corpus
(ETC)
English Korean
 
spoken Teacher talks in language classrooms Upper-intermediate to advanced c. 123,000 Ye-eun Kwon
Eun-Joo Lee
Under development
The English of Malaysian School Students corpus (EMAS) English Malay written Student essays + oral interviews various c. 500,000 Arshad Abd. Samad et al.
Universiti Putra Malaysia, Malaysia
 
The English Speech Corpus of Chinese Learners
(ESCCL)
English Chinese spoken Dialogue reading-aloud Middle school and college   Chen Hua
Nantong University, China
Wen Qiufang
Beijing Foreign Studies University, China
Li Aijun
Chinese Academy of Social Sciences, China
 
The ETS Corpus of Non-Native Written English English 11 languages written 12,100 TOEFL English essays /   Daniel Blanchard

Information about the score level is available for each essay

Samples are available

The Europarl corpus of Native Non-native and Translated Texts
(ENNTT)
English 24 EU languages written Proceedings of the European Parliament Advanced

NNS: c. 780,000

NS: c. 3 m.

Translated: c. 22m.

Sergiu Nisioi Available
The EVA Corpus of Norwegian School English English Norwegian spoken Picture-based tasks  / c. 35,000 Angela Hasselgren
University of Bergen, Norway
 
The Gachon Learner Corpus English Korean
(+ a few Chinese & Spanish speaking students) 
written Written Journal Assignments Lower intermediate c. 2,5 m. Brian Carlstrom Freely available
The GICLE corpus (German component of ICLE) English German written Mainly non-academic argumentative essays Advanced c. 234,000    
The Giessen-Long Beach Chaplin Corpus
(GLBCC)
English German spoken Transcribed interactions between native English speakers, ESL and EFL speakers Various c. 350,000 Andreas Jucker
Sara Smith
University of Giessen, Germany
Restricted use: apply for approval to get a copy.
The Hong Kong University of Science & Technology learner corpus
(HKUST)
English Chinese - mostly Cantonese written Untimed assignments written for EFL courses and school leaving exams University and advanced high school students c. 25 m. John Milton
Hong Kong University of Science & Technology, Hong Kong
 
The Indianapolis Business Learner Corpus
(IBLC)
English Various written Job application letters and résumés of business communication students from the U.S., Belgium, Finland, Germany, and Thailand, spanning the years 1990-1998     Ulla Connor
Kristen Precht
Thomas Albin Upton
Indiana University, USA
 
The International Corpus of Crosslinguistic Interlanguage (ICCI) English Various written Essays (20-min in-class tasks without the use of a dictionary)  Beginner to lower-intermediate 9,000 essays Yukio Tono
Tokyo University of Foreign Studies, Japan
Freely available
The International Corpus Network of Asian Learners of English
(ICNALE)
English Chinese
Indonesian
Japanese
Korean
Malay
etc.
written and spoken

Controlled speeches and essays

L1 productions by 350 NS

Various c. 1,8 m. Shin'ichiro Ishikawa
Kobe University, Japan
Freely available
The International Corpus of Learner English
(ICLE)
English Various written Argumentative and literary essays High-intermediate to advanced c. 3 m. Sylviane Granger
Centre for English Corpus Linguistics
Université catholique de Louvain, Belgium
CD-Rom + handbook: order online.
The International Teaching Assistants corpus
(ITAcorp)
English Various spoken Learner language from a variety of spoken classroom tasks: office hours role plays, presentations, discussions   c. 500,000 Steven L. Thorne
Paula Golombek
Jonathon Reinhardt
Pennsylvania State University, USA
 
The Iranian Corpus of Learner English English Farsi written Expository essays University students (English majors) 436,035 Parviz Maftoon, Parviz Birjandi, Hossein Khazaee CD-ROM, data gathered for PhD dissertation by Hossein Khazaee; this corpus is the intellectual property of Science and Research Branch, Islamic Azad University, Tehran, Iran
The ISLE speech corpus English German
Italian
spoken Recorded sentences from several blocks of differing types (reading simple sentences, using minimal pairs, giving answers to multiple choice questions) Intermediate  c. 18h ecisle@nats.informatik.uni-hamburg.de CD-Rom
The Israeli Learner Corpus of Written English English Hebrew written Argumentative and descriptive essays   c. 750,000 Tina Waldman
Kibbutzim College of Education, Israel
 
The Japanese English as a Foreign Language Learner Corpus
(JEFLL)
English Japanese written Student essays From beginning to intermediate c. 700,000

Yukio Tono, Meikai University, Japan

jefll.inquiry@corpuscobo.net

The JEFLL Corpus will be freely available for research, first via the web query system (already available in Japanese) and then the entire data will be distributed under license in the future.
The Janus Pannonius University Corpus
(JPU)
English Hungarian written Essays and research papers University students c. 500,000 József Horváth
University of Pécs, Hungary
Searchable online
Lancaster Corpus of Academic Written English
(LANCAWE)
English various written IELTS academic writing tests (descriptive and argumentative tasks); assignments.
Longitudinal data.
       
The Lang-8 Learner Corpora English Various written texts from Lang-8, a social networking site for language learning / / Toshikazu Tajiri & Mamoru Komachi Available
The LeaP Corpus : Learning Prosody in a Foreign Language English German spoken Four types of speech styles were recorded:
  • nonsense word lists
  • readings of a short story
  • retellings of the story
  • free speech in an interview situation
Various  c. 12h Ulrike Gut
Albert-Ludwigs-University Freiburg, Germany

The annotated corpus is available to the scientific community. Please contact Ulrike Gut at the University of Augsburg.

LeaP manual

The Learner Corpus of Engineering Abstracts
(LCEA)
English Malaysian written Abstracts of the Computer and Communication Systems Engineering Final Year Projects Various

c. 550,000

998 abstracts

Helen Tan, University Putra Malaysia

Chan Swee Heng

Ain Nadzimah

Syamsiah bt Mashohor

Available
The Learner Corpus of English for Business Communication English Chinese written Different types of business correspondence written for simulated business situations, including memos, faxes, reports, letters of enquiry and complaint letters   c. 117,500 Li Lan
Hong Kong Polytechnic University, Hong Kong
Searchable online
The Learner Corpus of Essays and Reports English  Chinese written Essays and project reports covering a range of topics from Science, IT and New Media to Nursing, Business and Economics, and the Social Sciences   c. 188,000

Sima Sengupta
Hong Kong Polytechnic University, Hong Kong

 

Searchable online
A Learners' Corpus of Reading Texts English French spoken Unprepared reading of English texts.
The texts are short abstracts of fiction or made-up dialogues.
 University students   Sophie Herment
Valérie Kerfelec
Laetitia Leonarduzzi
Gabor Turcsan
Freely available
The LONGDALE project: LONGitudinal DAtabase of Learner English English Various spoken and written Range of text types/task types.
Longitudinal data.
From intermediate to advanced   Fanny Meunier
Centre for English Corpus Linguistics
Université catholique de Louvain, Belgium
Under development
The Longman Learners' Corpus English Various written Essays and exam scripts Various c. 10 m. Longman Commercial
The Louvain International Database of Spoken English Interlanguage (LINDSEI) English Various spoken Interviews and picture descriptions High-intermediate to advanced c. 800,000 Gaëtanelle Gilquin
Centre for English Corpus Linguistics
Université catholique de Louvain, Belgium
CD-Rom and handbook: order online
The Malaysian Corpus of Learner English
(MACLE)
English Malay written       Gerry Knowles
Zuraidah Mohd. Don
University of Malaya, Malaysia
 
The Malaysian Corpus of Students' Argumentative Writing
(MCSAW)
English Malay
Chinese Indian
written Argumentative essays

Form 4
Form 5
College

c. 565,500

Seyed Ali Rezvani Kalajahi
Jayakaran Mukundan
University Putra Malaysia

Available from developers
The Michigan Corpus of Academic Spoken English (MICASE) English Mainly L1 speakers but also includes data produced by L2 speakers spoken Transcripts of academic speech events   c. 1,8 m.

Ute Römer
University of Michigan, USA

micase@umich.edu

Searchable online
The Michigan Corpus of Upper-level Student Papers (MICUSP) English Semi-balanced sample of native and non-native speakers of English written ESP papers
A-grade papers or ungraded papers that have been assessed and accepted (such as research proposals), but not published
  c. 2,6 m.

Ute Römer
University of Michigan, USA

micusp@umich.edu

Searchable online
The Montclair Electronic Language Database
(MELD)
English Various written Student essays Various c. 100,000 Eileen Fitzpatrick
Milton S. Seegmiller
Montclair State University, USA

Searchable online

Includes error annotations

The Multimedia Adult ESL Learner Corpus
(MAELC)
English ESL environment multimedia Video of classroom interaction and associated written materials Beginner to upper-intermediate  

Stephen Reder
Kathryn Harris
Kristen Setzler
Portland State University, USA

labschool@pdx.edu

The Lab School would like to share the extensive resources from MAELC with interested researchers and teacher trainers. Those interested should make inquiries to the Lab School by e-mail.
The Neungyule Interlanguage Corpus of Korean Learners of English (NICKLE) English Korean spoken and written

Written part: student essays
Spoken part: student interviews and oral speech tests transcriptions

Mainly from beginning to intermediate 

Written:
c. 890,000

Spoken:
c. 100,000

 Ji-Myoung Choi
Yonsei University, Seoul, Korea
The corpus will be available to the scientific community for research purposes upon request.
The Japanese Learner English Corpus
(NICT JLE)
English Japanese spoken English oral proficiency interview test various 2 m. Emi Izumi
Kiyotaka Uchimoto
Hitoshi Isahara
National Institute of Information and Communications Technology, Kyoto, Japan.
Freely available (downloadable)
The NOn-native Spanish corpus of English
(NOSE)
English Spanish written Argumentative and descriptive student essays Intermediate and upper-intermediate c. 300,000 words  Ana Diaz-Negrillo
Universidad de Granada, Spain
 
The NUS Corpus of Learner English English Several East Asian languages, predominantly Chinese written Student essays on a wide range of topics including environmental pollution, healthcare, etc.   various c. 1 m. Hwee Tou Ng
Siew Mei Wu
Daniel Dahlmeier
National University of Singapore, Singapore.
Freely available
The PELCRA Learner English Corpus
(PLEC)
English Polish spoken and written Written: Argumentative, descriptive, narrative and quasi-academic essays; formal letters From beginning to post-advanced

Under development

Aim spoken:
c. 200,000

Aim written:
c. 2,8 m.

Piotr Pęzik
Barbara Lewandowska-Tomaszczyk
University of Lodz, Poland

Online search engine and corpus analysis tools
The PICLE corpus (Polish component of ICLE) English Polish written Student essays Advanced c. 330,000 Przemyslaw Kaszubski
AMU, Poznan, Poland
Searchable online
The Qatar learner corpus English Arabic (mostly from Qatar) spoken Spoken interviews with Qatari learners of English     Yun Zhao Helen
Carnegie Mellon University, USA
Freely available
The Québec learner corpus English French (from Québec) written Argumentative essays Intermediate and advanced c. 250,000 Tom Cobb
Université du Québec à Montréal, Canada
 
The Romanian Corpus of Learner English
(RoCLE)
English Romanian written Student essays     Chitez Madalina
Zurich University, Switzerland
 
Russian Error-Annotated English Learner Corpus English Russian written

examination essays of the kind similar to IELTS Task 1 and Task 2, with errors annotated manually

Intermediate to Advanced

c. 800,000 by November 2017 and growing (c. 2,000,000 together with the older part of the corpus, which is less consistently annotated or not annotated, available at http://realec.org/index.xhtml#/)

Olga Vinogradova, School of Linguistics, Research University Higher School of Economics

freely available

The Russian Learner Translator Corpus
(RusLTC)
English
Russian
Russian written Translations produced by trainee translators Trainee translators c. 1.5 m. tokens Project directors: Andrey Kutuzov and Maria Kunilovskaya Freely available
The Santiago University Learner of English Corpus (SULEC) English Spanish spoken and written

Written: compositions or argumentative essays.

Spoken: semi-structured interviews, short oral presentations and brief story descriptions.

Various Aim: c. 1 m. words Ignacio M. Palacios Martínez, Santiago University Available after registration
The Scientext English Learner Corpus English French written Academic argumentative texts    c. 1.1 m. scientext@u-grenoble3.fr Searchable online
Second Language Research Tasks
(SLRT)
English Various

written

spoken

written paragraphs

various oral tasks

Various c. 300,000

Bill Crawford (Northern Arizona University)

Kim McDonough (Concordia University)

Under development
The Seoul National University Korean-speaking English Learner Corpus (SKELC) English Korean written Student essays Various c. 900,000 Heokseung Kwon
Seoul National University
Korea
 
The SILS Learner Corpus of English English Various (mainly Japanese) written Student essays Basic, intermediate and advanced

 c. 3.2 m.

(first and second drafts included)

Victoria Muehleisen
Waseda University, Japan
 
The Soochow Colber Student Corpus (SCSC) English Chinese written Student essays   c. 227,000 Colman Bernath
Soochow University, Taiwan
 
The Spoken and Written English Corpus of Chinese Learners
(SWECCL)
English Chinese spoken (SECCL)
and written (WECCL)

Written: argumentative and narrative essays.

Spoken: National Spoken English Test – longitudinal data

  c. 2 m. Wen Qiufang
Liang Maocheng
Wang Lifei

CD-rom

The Taiwanese Corpus of Learner English
(TLCE)
English Chinese written Journals and essays (descriptive, narrative, expository, argumentative) Intermediate to advanced c. 2 m. Rebecca Hsue-Huch Shih
Sun Yat-sen University, Taiwan
 
The Taiwanese learner academic writing corpus (TaiwanLAWC) English Chinese written Theses and dissertations written by Taiwanese graduate students.     Howard Chen
National Taiwan Normal University, Taiwan
 

The TELEC Secondary Learner Corpus
(TSLC) 

English Chinese written and spoken Compositions from secondary classroom   c. 2 m. Quentin Allan
University of Hong Kong, Hong Kong
 
The Telecollaborative Learner Corpus of English and German Telekorp English German written Bilingual, longitudinal database comprising computer-mediated NS-NNS interactions between approximately 200 Americans and Germans collected during six different telecollaborative partnerships from 2000-2005.   c. 1,5 m. Julie Belz
Pennsylvania State University, USA.
Not publicly available
The Ten-Thousand English Compositions of Chinese Learners
(TECCL)
English Chinese written Essays (various topics) written in and after class, and in testing context. Also contains some collaborative writing samples. Various (mainly undergraduates) c. 1,8 m. Project initiator: Jiajin Xu, National Research Centre for Foreign Language Education, Beijing Foreign Studies University Raw texts and part-of-speech tagged texts are available
The Tswana Learner English Corpus (TLEC) English Tswana written Argumentative essays Advanced c. 200,000 Bertus Van Rooy
North-West University, South Africa
Available in ICLE
The Uppsala Student English Corpus
(USE)
English Swedish written Student essays Various c. 1,200,000 Ylva Berglund Prytz
Margareta Westergren Axelsson
Uppsala University, Sweden
The corpus can be used for research and educational purposes. It can be accessed on the Internet from the Oxford Text Archive.
The Uppsala WordReference Corpus English, Spanish, French, Italian Various Written Forum posts

English learner subcorpus: 38M

English native subcorpus: 50M

Spanish learner subcorpus: 5M

Spanish native subcorpus: 22M

French learner subcorpus: 4M

French native subcorpus: 7M

Italian learner subcorpus: 1M

Italian native subcorpus: 3M

Aleksandrs Berdicevskis
Uppsala University
Freely available 
The UPF Learner Translation Corpus English Catalan written Translations written by the students of the Translation and Interpreting degree at UPF.    c. 200,000 Anna Espunya
Pompeu Fabra University, Barcelona, Spain 
 
The UPV Learner Corpus English Catalan written essays Various c. 150,000 Universitat Politècnica de València, Spain  
The Varieties of English for Specific Purposes dAtabase learner corpus
(VESPA)
English Various written ESP texts (term papers, reports, MA dissertations) Various c. 220,000 (under development) Magali Paquot
Centre for English Corpus Linguistics
Université catholique de Louvain, Belgium
 
The Written Corpus of Learner English corpus
(WriCLE)
English Spanish written Essays Various c. 750,000 Paul Rollinson
Universidad Autonoma de Madrid, Spain
The corpus is available for free, and can be downloaded from this website. There is also a search interface to retrieve sentences and clauses.
The Yonsei English Learner Corpus (YELC) English Korean written Yonsei University English Diagnostic Tests (Part 1: Descriptive task, max. 100 words; Part 2: Argumentative task, max. 300 words) 9 levels
(A1, A1+, A2, B1, B1+, B2, B2+, C1, C2)
c. 1 m. Seok-Chae Rhee
CK Jung
Yonsei University, Korea
The YELC corpus will be available to the scientific community for research purposes from 31 March 2012.
The Young Learner Corpus of English
(YOLECORE)
English Greek spoken Pedagogic Corpus of video-recorded EFL language classes.  

170 school hours (126  hours of videotaped material)

1,5 m. types

Project director: Marina Mattheoudakis, Aristotle University of Thessaloniki, Greece

Thomas Zapounidis

 
The Estonian Interlanguage Corpus of Tallinn University
(EIC)
Estonian Russian
Finnish
English
German
Latvian
Lithuanian
Ukrainian
Belorussian
written Spontaneously produced texts in language learning situations: argumentative and literary essays, written stories, letters, term papers, reading reports. A1-C2 c. 1 m. Project director: Pille Eslon
Tallinn University, Estonia
Restricted online access
Linguistic Basis of the Common European Framework for L2 English and L2 Finnish
(CEFLING)
Finnish
English
Various written Various Various  

Maisa Martin, University of Jyväskylä, Finland

 
Paths in Second Language Acquisition
(TOPLING)
Finnish
English
Swedish
Various written Various Various  

Maisa Martin, University of Jyväskylä, Finland

 
The Advanced Finnish Learner Corpus
(LAS2)
Finnish  Russian
Czech
Swedish
Estonian
Lithuanian
Komi
English
Hungarian
German
Icelandic
Japanese
written Exam essays, theses, essays and writings Advanced c. 630,000

Kirsti Siitonen, University of Turku, Finland

Ilmari Ivaska, University of Turku, Finland

 
The Finnish National Foreign Language Certificate Corpus (YKI) Finnish

English
Finnish
French
German
Italian
Lappish (Sami)
Spanish
Swedish
Russian

written

spoken

Various Beginner, intermediate and advanced  

Ari Maijanen, Centre for Applied Language Studies, University of Jyväskylä, Finland

Tiina Lammervo, Centre for Applied Language Studies, University of Jyväskylä, Finland

Available with user ID and Password
The International Corpus of Learner Finnish
(ICLFI)
Finnish Various written Finnish learners’ spontaneously produced texts in language learning situations, large variety of text types Beginner, intermediate and advanced Under development

Jarmo Harri Jantunen

University of Oulu, Finland

Free download after applying for a user licence
The Chy-FLE (Cypriot Learner Corpus of French) French Modern Greek
(and Cypriot Greek)
written Argumentative and descriptive essays From intermediate to advanced c. 250,000 (under development) Freiderikos Valetopoulos
Université de Poitiers, France
In collaboration with the University of Cyprus
 
The COREIL corpus French
English
  spoken       Elisabeth Delais-Roussarie
Hiyon Yoo
Université Paris-Diderot, France
 
The "Dire Autrement" corpus French (Second Language) Mainly L1 speakers of English written Narrative, injunctive, persuasive and informative texts   c. 50,000 Marie-Josée Hamel
Jasmina Milicevic
Dalhousie University, Canada
Available after registration
French Interlanguage Database
(FRIDA)
French Various written Free compositions: descriptive, argumentative and narrative texts, news & mail  Intermediate   Sylviane Granger
Centre for English Corpus Linguistics
Université catholique de Louvain, Belgium
 
French Learner Language Oral Corpora
(FLLOC)
French Various spoken See description of the 7 corpora Various   Florence Myles
Newcastle University
Rosamund Mitchell
University of Southampton, UK

The contents of the database are being made freely available to the research community, in the form of digital sound files and related transcripts formatted using CHILDES software.

Searchable online

The InterFra corpus French Swedish spoken Interviews, retellings of video clips and picture stories Various  

Inge Bartning 
Stockholm University, Sweden.

interfra@fraita.su.se

Available
The "Interphonologie du Français Contemporain" corpus
(IPFC)
French Cypriot Greek
Dutch
English (Canada)
German
Japanese
Norwegian
Spanish

 

spoken Reading aloud, repeating words, guided interviews, interactions between two learners. Various Under development Sylvain Detey
Waseda University, Japan
Université de Rouen, France
Isabelle Racine
Université de Genève, Switzerland
Yuji Kawaguchi
Tokyo University of Foreign Studies, Japan
Under development; samples available
The Learner Corpus French
(LCF)
French Dutch written

Argumentative essays
Informative texts
Journalistic texts
Formal letters
Summaries

Written compositions by Flemish students of French

Intermediate to advanced c. 500,000 K.U.Leuven Campus Kortrijk, UGent and Lessius
Hans Paulussen
Under development
The Lund CEFLE Corpus (Corpus Écrit de Français Langue Étrangère) French Swedish written Descriptive and narrative essays; picture-based stories. Various c. 100,000 Malin Ågren
Lund University, Sweden
A sub-part of the corpus is available online.
The University of the West Indies learner corpus
(UWi)
French

English

Jamaican Creole

spoken Conversations during oral exams and in informal contexts Various   Hugues Peters
University of New South Wales, Sydney, Australia
 
Comasan Labhairt ann an Gàidhlig (CLAG)
-
Gaelic Adult Proficiency
(GAP)
Gaelic Various spoken

Conversation task

Narrative

Elicited oral imitation task

Question and answer activity

Various  

Roibeard Ó Maolalaigh (University of Glasgow)

Nicola Carty (University of Glasgow)

 
The AleSKO corpus German

Chinese

Also German L1 data from the FALKO corpus

written Argumentative essays    c. 13,600 Heike Zinsmeister
University of Konstanz, Germany
Margrit Breckle
Vilnius Pedagogical University, Lithuania.
 
Analyzing Discourse Strategies: A Computer Learner Corpus German English
(mainly American English)
written Threaded Discussion
Chat
Essays
Longitudinal data
From beginner to intermediate-mid Under development Christina Frei
Edward Nixon
University of Pennsylvania, USA
 
The Corpus of Learner German (CLEG13) German English written Argumentative, free compositions
Longitudinal over 4 years, undergraduate students
Intermediate to advanced c. 320,000 Ursula Maden-Weinberger

Online access through the FALKO platform.
The corpus is also available as txt files to the scientific community. Please contact Ursula Maden-Weinberger

The deL1L2IM corpus German

Russian-Belorussian bilinguals

written Instant messaging dialogues Advanced c. 52,000

Sviatlana Höhn
University of Luxembourg

Available
The Fehlerannotiertes Lernerkorpus (‘error annotated learner corpus’)
(FALKO)
German

Learner subcorpus: various

Native subcorpus: German

written

1. Summaries

2. Essays

3. Letters, fiction writing, journal articles, book reviews (= longitudinal data from American learners)

1. Advanced

2. Advanced

3. Beginners - advanced

 

1. c. 40,000 (learner subcorpus) + c. 20,000 (native subcorpus)

2. c. 150,000 (learner corpus) + c. 70,000 (native subcorpus)

3. c. 78,000 (learner subcorpus)

Anke Lüdeling
Maik Walter
Humboldt-Universität zu Berlin
Institut für deutsche Sprache und Linguistik, Germany

falko-korpus@hu-berlin.de

Online access
The KOLIPSI corpus German Italian written Two written language production tasks of a standardized test (email/letter) A2-C1 under development Andrea Abel
Aivars Glaznieks
European Academy Bolzano/Bozen, Italy
 
The Learning the Prosody of a Foreign Language
(LeaP)
German Various spoken The LeaP corpus covers four different types of speech:
- read speech
- prepared speech
- free speech
- nonsense word lists
Various  62 speakers Ulrike Gut
University of Augsburg, Germany

The annotated corpus is available to the scientific community. Please contact Ulrike Gut at the University of Augsburg.

Manual

The LeKo (Lernerkorpus) corpus German         c. 55,000 Anke Lüdeling, Humboldt-Universität Berlin, Germany

Online access (password protected)

Register here

The LINCS Corpus

1. German

2. German

3. German

1. English

2. German

1. Written

2. Written

3. Written

1. Essays, examination answers.
Longitudinal and cross-sectional data.

2. Essays

3. Teaching output

1. Intermediate to Advanced

2. Advanced

Under development Elizabeth Thoday
Heriot-Watt University Edinburgh, UK
Not currently publicly available
Multilingual Platform for the European Reference Levels: Exploring Interlanguage in Context
(MERLIN)

German

Italian

Czech

Various written writing tasks from standardized tests (telc/UJOP) A1 to C1 c. 280,000 Katrin Wisniewski Available
Rhodes University Deutsch als Fremdsprache (RUDaF) German  English, Afrikaans, isiXhosa, XiTsonga written Short descriptive and argumentative writing paragraphs (300 words each) A2-B2 34,000

Gwyndolen Ortner

Dr Undine S. Weber

Rhodes University, South Africa

Not available
The Telecollaborative Learner Corpus of English and German Telekorp German English written Bilingual, longitudinal database comprising computer-mediated NS-NNS interactions between approximately 200 Americans and Germans collected during six different telecollaborative partnerships from 2000 to 2005.   c. 1.5 m.

Julie Belz
Pennsylvania State University, USA.

 

Not publicly available
The Langman corpus Hungarian Chinese spoken Interviews conducted in 1994 with 11 Chinese immigrants living in Hungary.
Interviews focused on issues related to their arrival in Hungary as well as their daily life activities
    Juliet Langman
University of Texas at San Antonio, USA
Freely available
Corpus di Apprendenti di Italiano L2
(CAIL2)
Italian Various written Essays Intermediate to advanced c. 237,000 Stefania Spina, Università per Stranieri di Perugia Searchable via CQPweb
Corpus parlato di italiano L2 Italian English
German
Japanese
spoken Transcriptions of interviews Various   Stefania Spina
Silvio Pazzaglia
Mirco Perini
Università per Stranieri di Perugia, Italy
Searchable online
The KOLIPSI corpus Italian German written Two written language production tasks of a standardized test (email/letter) A2-C1 Under development Andrea Abel
European Academy Bolzano/Bozen, Italy
 
The Lexicon of Spoken Italian by Foreigners
(LIPS)
Italian Various spoken Proficiency exams of the Certification of Italian as a Foreign Language (CILS) A1-C2 c. 700,000

Francesca Gallina
Università per Stranieri di Siena, Italy

Freely available
MISTiC (Multiple Italian Student TranslatIon Corpus) Italian English, French written translations produced by trainee translators (mainly specialised texts) post-graduate trainee translators ca. 125,000 (English-Italian), ca. 50,000 (French-Italian) Sara Castagnoli, University of Bologna, Italy not available
Varietà di Apprendimento della Lingua Italiana: Corpus Online
(VALICO)
Italian Various written   Various c. 570,000 Manuel Barbera: manuel.barbera@bmanuel.org
Carla Marello
Freely available and searchable online.
Longitudinal Corpus of Chinese Learners of Italian (LOCCLI) Italian Chinese written essays beginners and pre-intermediate 97,000  The LOCCLI is part of a joint project between Stefania Spina (University for Foreigners of Perugia, Italy) and Anna Siyanova-Chanturia (Victoria University of Wellington, New Zealand). It is freely searchable via CQPweb (registration required) from https://www.unistrapg.it/cqpweb/
Corpus of Chinese Learners of Italian (COLI) Italian Chinese written and spoken

essays and answers to open questions

interviews

intermediate and advanced 82,300

 

Contact: Stefania Spina

The COLI is freely searchable via CQPweb (registration required) from https://www.unistrapg.it/cqpweb/
The Korean learner corpus Korean Various written Various: letters, essays, formal writing... Beginner and intermediate c. 10,000 Seok Bae Jang
Georgetown University, USA
Sun Hee Lee
Wellesley College, USA
Sang kyu Seo
Yonsei University, South Korea
 
ESAM Latvian and Lithuanian Latvian and Lithuanian written   Beginner 52,000 Inga Znotiņa Available online 
The ASK corpus Norwegian German
Dutch
English
Spanish
Russian
Polish
Bosnian-Croatian-Serbian
Albanian
Vietnamese
Somali
written Essays from language tests  B1 and B2   Kari Tenfjord
University of Bergen, Norway
 
The Persian Learner Corpus
(PLC)
Persian (Farsi) Various written Narratives and essays Intermediate and advanced Academic/Restricted online access

Saeed Safari

University of Belgrade, Faculty of Philology

Academic/Restricted online access
The Salam Farsi Learner Corpus
(SFLC)
Persian (Farsi) Serbian written Narratives, descriptive essays Beginner and upper-intermediate Under development

Saeed Safari

University of Belgrade, Faculty of Philology

Academic, under development
Learner Corpus of Portuguese L2 (COPLE2) Portuguese 15 languages: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian and Swedish Written and spoken Exams and assignments A1-C1 written: 171,461
oral: 25,783
Iria del Río Available
Russian Learner Corpus Russian varied written and spoken academic and non-academic teachers and heritage speakers unknown Ekaterina Rakhilina Available online
The PIKUST pilot learner corpus Slovene Various written Mostly argumentative essays Majority advanced – but also intermediate and beginner c. 35,000 Mojca Stritar
University of Ljubljana, Slovenia
 
The Anglia Polytechnic University (APU) Learner Spanish Corpus Spanish Various written     c. 120,000 Anne Ife
Anglia Ruskin University, UK
 
Aprescrilov ("Aprender a Escribir en Lovaina") Spanish Dutch written Written assignments and tests; several text types (letters, expository, descriptive, argumentative, narrative) A1 to C1 c. 1 m.

Kris Buyse
KU Leuven, Belgium

Restricted online access

The Corpus de aprendices de español
(CAES)

Spanish Various written   A1 to C1

c. 575,000

CAES team

Universidade de Santiago de Compostela

Online access
Corpus Escrito del Español L2
(CEDEL2)
Spanish English written Written compositions by learners of Spanish   c. 730,000

Amaya Mendikoetxea
Universidad Autónoma de Madrid, Spain
Cristobal Lozano
Universidad de Granada, Spain

Please contact Cristobal Lozano to get a free sample of the corpus.

Corpus de textos escritos para el análisis de errores de aprendices de E/LE
(CORANE)

Spanish Various written Essays A2 to C1 /

Cestero Mancera, A. M.
Penadés Martínez, I.

Universidad de Alcalá de Henares

CD-ROM available
The Corpus of Taiwanese Learners of Spanish (Corpus de Aprendices Taiwaneses de Español)
(CATE)
Spanish Chinese written Student essays Various c. 340,000 hclu@mail.ncku.edu.tw Under development
The DIAZ corpus Spanish

German
Swedish
Icelandic
Korean
Chinese

spoken Semi-spontaneous (structured interviews) and experimental (structured questionnaires) Adult Spanish L2/L3 oral data Various   Lourdes Diaz Rodriguez
Universitat Pompeu Fabra, Spain
Freely available
The Japanese learner corpus of Spanish Spanish Japanese written Student essays   c. 83,400 Yoshihito Kamakura
University of Birmingham, UK
 
The Spanish Corpus Proficiency Level Training
(SPT)
Spanish English (heritage language learners) spoken Dialogues about a given set of questions Beginner to advanced   Dr Dale Koike, University of Texas, Austin Liberal Arts Instructional Technology Center

Videos are available

Spanish Learner Language Oral Corpus
(SPLLOC)
Spanish English spoken Learner narratives, interviews and picture description tasks Beginner to advanced c. 50,000 Laura Dominguez
University of Southampton, UK
Searchable online
Data freely available for download
Spanish Learner Oral Corpus Spanish Various
(9+ languages - especially Portuguese, French, Italian)
spoken Semi-spontaneous interviews, narrative and descriptive tasks A2-B1 c. 50,000 words Leonardo Campillos Llanos
Laboratorio de Lingüística Informática
Universidad Autónoma de Madrid, Spain
Online access
The Tartu Learner Corpus of Spanish as a L3+ Spanish Estonian written Academic research writing Advanced c. 885,000 Mari Kruse, University of Tartu, Estonia  
The ASU corpus Swedish  Chinese
English
German
Greek
Polish
Portuguese
Spanish
...
spoken and written Transcribed audio-recorded conversations and written texts from adult learners of Swedish – longitudinal data   c. 490,000 words
(c. 415,000 spoken and c. 75,000 written)
Björn Hammarberg
Stockholm University, Sweden
Manual
Leiden Learner Corpus Multilingual (Dutch, French, Italian, Portuguese and Spanish) various written and spoken written data: short essays; oral data: picture-based story telling various 200 participants M. Carmen Parafita Couto  

The European Science Foundation Second Language Database
(ESF database)

Multilingual:

Dutch
English
French
German
Swedish

Punjabi
Italian
Turkish
Arabic
Spanish
Finnish

spoken Spontaneous second language acquisition of forty adult immigrant workers living in Western Europe, and their communication with native speakers in the respective host countries Various   Wolfgang Klein
Clive Perdue
Max Planck Institut, Nijmegen, Netherlands
Freely available
The Foreign Language Examination Corpus
(FLEC)
Multilingual Polish written Data from the Warsaw University
Certification Exams
Various Under development Piotr Banski
Romuald Gozdawa-Golebiowski
Warsaw University, Poland
 
The MeLLANGE Learner Translator Corpus
(LTC)
Multilingual various written Legal, technical, administrative and journalistic texts Trainee translators  

Natalie Kübler
Université Paris Diderot, France.

mellange_p7@eila.univ-paris-diderot.fr

Searchable online
The MiLC Corpus

Multilingual:

Catalan
English
French
Spanish

Catalan written Formal and informal letters, summaries, curriculum vitae, essays, reports, translations, synchronous and asynchronous communication exchanges, business letters   c. 150,000 Angeles Andreu Andrés et al
Universidad Politécnica de Valencia, Spain
 
The Multilingual Learner Corpus (MLC)

Multilingual:

English
German
Italian
Spanish

Brazilian Portuguese written Argumentative and narrative essays    Aim: c. 200,000 Stella E.O. Tagnin
University of São Paulo, Brazil
Accessible online to registered researchers
The Padova Learner Corpus

Multilingual:

English
French
Spanish

Italian CMC
(Computer-Mediated Communication)

Student work produced in blended language courses using FirstClass conferencing software.
Variety of genres: diaries, debate contributions, formal reports, résumés etc. 
Longitudinal data

 

  Under development Fiona Dalziel
Francesca Helm
University of Padua, Italy
 

The corpus PARallèle Oral en Langue Etrangère
(PAROLE)

 

Multilingual:

English
French
Italian

(Mainly L2 speakers but also includes data produced by L1 speakers)

Various spoken 5 oral production tasks Various   Heather Hilton
John Osborne
Marie-Jo Derive
Nejma Succo
Jean O'Donnell
Sandra Billard
Sandrine Rutigliano-Daspet
Université de Savoie, France
Manual
The University of Toronto Romance Phonetics Database
(RPD)

Multilingual:

English
French
Italian
Portuguese
Romanian
Spanish

Various
(including English, Mandarin, Russian, Spanish, etc.)
spoken Elicited production - sentence and passage reading, story narration, description of favourite meal Various   Laura Colantoni
Jeffrey Steele
University of Toronto, Canada
Password available from directors

  

Learner corpus-based datasets

 

  

Corpus Target language First language Medium Text type / task type Proficiency level Size in words Project director Availability
 The Treebank of Learner English
(TLE)
 English Various written  Sentences from the CLC FCE (annotated with syntactic trees)  Upper-intermediate

 97,681
(5,124 sentences)

Yevgeni Berzak Publicly available through the UD repository ('English-ESL')