This list is still a work in progress. We would like it to be as comprehensive as possible. If you have a learner corpus or know of one that is not listed on this webpage, send a message to Magali Paquot and we'll add it to the list. We hope you will find the list useful for your research!
The list below only contains learner corpora, i.e. electronic collections of continuous written or spoken data produced by foreign or second language learners.
For a list of learner corpus-based datasets (treebanks, error lists, etc.), click here.
To refer to this list: Centre for English Corpus Linguistics (date of access): Learner Corpora around the World. Louvain-la-Neuve: Université catholique de Louvain. https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html
© 2009, Université catholique de Louvain |
Learner corpora
Corpus | Target language | First language | Medium | Text type/task type | Proficiency level | Size in words | Project director | Availability |
---|---|---|---|---|---|---|---|---|
The Arabic Learner Corpus (ALC) |
Arabic | 66 languages | written and spoken | Narrative and discussion | Intermediate and advanced |
written: audio: |
Available | |
The Pilot Arabic Learner Corpus | Arabic | English | written | Narrative | Intermediate and advanced | c. 9,000 | Ghazi Abuhakema Reem Faraj Anna Feldman Eileen Fitzpatrick Montclair State University, USA |
|
The Jinan Chinese Learner Corpus (JCLC) |
Chinese | 50 languages | written | Exams and assignments | Beginners, intermediate and advanced |
c. 6 m. Chinese characters c. 9,000 texts |
Maolin Wang |
|
Croatian Learner Text Corpus (CroLTeC) | Croatian | 36 languages (Afrikaans, Arabic, Bulgarian, Catalan, Czech, Danish, German, English, Estonian, Persian, Finnish, French, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Lari, Mandinka, Dutch, Norwegian, Polish, Portuguese, Russian, Slovak, Slovenian, Spanish, Albanian, Swedish, Thai, Turkish, Ukrainian, Vietnamese, Chinese, Malay) | written | exam essays, argumentative and literary essays, letters, diaries, picture descriptions, book reviews, short dialogues, etc. | A1-C2 | c. 1 million | Nives Mikelic Preradovic, University of Zagreb, Croatia | Freely available |
The AKCES/CZESL corpus (Acquisition corpora of Czech/Czech as a second language) |
Czech | Various | written and spoken | Student essays and interviews |
Various | 2 m. | Karel Sebesta Charles University in Prague Technical University in Liberec, Czech Republic |
Available |
Leerdercorpus Nederlands als Vreemde Taal | Dutch | French | written | Liesbeth Degand Université catholique de Louvain, Belgium |
||||
Arab Learner English Corpus (ALEC) | English | Arabic | written | Essays written by freshman students as part of first level college writing course | University students (second language learners) |
Analysis 184749
Narrative 67527
Synthesis 66015
Argumentation 192298 |
Dr. Inas Mahfouz, American University of Kuwait (imahfouz@auk.edu.kw) |
https://dspace.auk.edu.kw/handle/11675/1757 |
The Aachen Corpus of Academic Writing |
English | German | written | Academic research writing | Advanced |
c. 240,000 words c. 225,000 words (L1 component) |
Elma Kerz, RWTH Aachen University | Under development |
The Advanced Learner English Corpus (ALEC) |
English | Mainly Swedish | written | Essays written by university students of English linguistics and English literature | Advanced | c. 1,3 m. | Tove Larsson, Uppsala University | Not freely available |
The ANGLISH corpus | English | French | spoken | Readings of texts and sentences, spontaneous oral language. | Various | c. 5h30 | Anne Tortel University of Provence, France. |
Freely available |
Asao Kojiro’s Learner Corpus Data | English | Japanese | written | Essays and stories written or reproduced by Japanese college students. | Asao Kojiro | Texts available for download | ||
The Barcelona English Language Corpus (BELC) |
English | Spanish Catalan |
spoken and written |
4 tasks: Longitudinal data (children and young adults learning English) |
Various | Carmen Muñoz University of Barcelona, Spain |
||
The BATMAT Corpus | English | Swedish Finnish |
written | BA dissertations MA dissertations |
Advanced | c. 2,5 m. (expanding) | Signe-Anita Lindgrén, English language and literature, Åbo Akademi University, Finland | Under development |
Belarussian Learner Corpus of English (BELLCE) | English | Russian; Belarussian | written | argumentative essays | High intermediate to advanced | unknown | Anastasia Rakhuba | |
The Bilingual Corpus of Chinese English Learners (BICCEL) |
English | Chinese | spoken and written |
Spoken: National Oral English test. Written: in-class assignments |
c. 2 m. | Wen Qiufang National Research Center for Foreign Language Education Beijing Foreign Studies University, China |
||
The British Academic Written English (BAWE) corpus | English |
Mainly L1 speakers Also includes data produced by L2 speakers |
written | ESP papers |
4 levels of study (from undergraduate levels to final year and taught masters level)
|
c. 6,5 m. | Hilary Nesi Sheena Gardner Warwick, UK Paul Thompson University of Birmingham, UK Paul Wickens Oxford Brookes, UK baseplus@warwick.ac.uk |
The BAWE corpus can be accessed through the corpus analysis interface, Sketch Engine. A prototype interface that allows filtered searching of the BAWE corpus files is available. |
The BUiD Arab Learner Corpus (BALC) | English | Arabic | written | School examination essays | Various | c. 290,000 | Mick Randall The British University in Dubai, United Arab Emirates Nicholas Groom University of Birmingham, UK |
At present, copies of the current version of the corpus are available on request from mick.randall@buid.ac.ae |
The Cambridge Learner Corpus (CLC) | English | Various | written | Exam scripts | Various | c. 50 m. | Cambridge University Press and Cambridge ESOL, UK | Commercial |
The Corpus of Academic Learner English (CALE) |
English | German | written | Various academic text types that are typically produced in university courses of English, e.g. term papers, reading reports, research plans, abstract, reviews, and summaries. | Advanced | under development | Marcus Callies University of Bremen, Germany |
|
The Corpus of English Essays Written by Asian University Students (CEEAUS) | English | Various | written | Student essays | Various | c. 200,000 | Shin Ishikawa Kobe University, Japan |
Freely downloadable from the website |
The Chinese Academic Written English corpus (CAWE) |
English | Chinese | written | Dissertations written by Chinese undergraduates majoring in English linguistics or applied linguistics. | c. 400,000 | David Yong Wey Lee City University of Hong Kong, Hong Kong |
||
The Chinese Learner English Corpus (CLEC) |
English | Chinese | written | Various | c. 1 m. | Gui Shichun Guangdong University of Foreign Studies & Yang Huizhong, Shanghai Jiaotong University, China |
The corpus can only be accessed by users in the Department of English at HKPU. | |
The City University Corpus of Academic Spoken English (CUCASE) | English |
Chinese Also includes data produced by L1 speakers |
multimedia | c. 2 m. | David Yong Wey Lee City University of Hong Kong, Hong Kong |
|||
The Cologne-Hanover Advanced Learner Corpus (CHALC) | English | German | written | term papers and essays | Advanced | c. 210,000 | Ute Römer University of Michigan, USA |
|
The College Learners’ Spoken English Corpus (COLSEC) |
English | Chinese | spoken | National spoken English test for non-English majors. | c. 700,000 | Yang and Wei | ||
The Corpus Archive of Learner English in Sabah/Sarawak (CALES) | English | Malay | written | Argumentative essays | Various | c. 400,000 | Simon Botley@Faizal Hakim Doreen Dillah Universiti Teknologi MARA Sarawak, Malaysia |
|
CORpus del ESPañol de los Italianos (CORESPI) | Spanish | Italian | Written | Written compositions | A1 to B2 | c. 125,000 |
Sonia Bailini |
|
CORpus del ITaliano de los Españoles (CORITE) | Italian | Spanish | Written | Written compositions | A1 to B2 | c. 103,000 | Sonia Bailini sonia.bailini@unicatt.it Università Cattolica del Sacro Cuore, Milan, Italy |
|
The Corpus of Business Letters | English | Italian | written |
Tagged part: BEC1 writing tests (letters, emails, faxes, memos, reports) Untagged part: business writing exam tests |
c. 32,000 | Anna Romagnuolo | ||
The Corpus of Multilingual Opinion Essays by College Students (MOECS) | English | varied | written | opinion essays | college students | unknown | Megumi Okugiri | available |
Corpus of writing, pronunciation, reading, and listening by learners of English as a Foreign Language | English | Japanese | written and spoken | varied | beginners to advanced | 29h audio + 30,000 words | Katsunori Kotani | |
The Corpus of Young Learner Interlanguage (CYLIL) | English |
Dutch |
spoken | English L2 data elicited from European School pupils. Longitudinal data |
Various | c. 500,000 | Alex Housen Vrije Universiteit Brussel, Belgium |
|
The Eastern European English learner corpus | English | Russian Ukrainian Polish Slovak |
spoken | Spontaneous spoken production data elicited by means of a semi-structured interview | Various | c. 60,000 | Elena Salakhian Eberhard Karls University of Tübingen, Germany |
|
The EFL Teacher Corpus (ETC) |
English | Korean |
spoken | Teacher talks in language classrooms | Upper-intermediate to advanced | c. 123,000 | Ye-eun Kwon Eun-Joo Lee |
Under development |
The English of Malaysian School Students corpus (EMAS) | English | Malay | written | Student essays + oral interviews | various | c. 500,000 | Arshad Abd. Samad et al. Universiti Putra Malaysia, Malaysia |
|
The English Speech Corpus of Chinese Learners (ESCCL) |
English | Chinese | spoken | Dialogue reading-aloud | Middle school and college | Chen Hua Nantong University, China Wen Qiufang Beijing Foreign Studies University, China Li Aijun Chinese Academy of Social Sciences, China |
||
The ETS Corpus of Non-Native Written English | English | 11 languages | written | 12,100 TOEFL English essays | / | Daniel Blanchard |
Information about the score level is available for each essay |
|
The Europarl corpus of Native Non-native and Translated Texts (ENNTT) |
English | 24 EU languages | written | Proceedings of the European Parliament | Advanced |
NNS: c. 780,000 NS: c. 3 m. Translated: c. 22m. |
Sergiu Nisioi | Available |
The EVA Corpus of Norwegian School English | English | Norwegian | spoken | Picture-based tasks | / | c. 35,000 | Angela Hasselgren University of Bergen, Norway |
|
The Gachon Learner Corpus | English | Korean (+ a few Chinese & Spanish speaking students) |
written | Written Journal Assignments | Lower intermediate | c. 2,5 m. | Brian Carlstrom | Freely available |
The GICLE corpus (German component of ICLE) | English | German | written | Mainly non-academic argumentative essays | Advanced | c. 234,000 | ||
The Giessen-Long Beach Chaplin Corpus (GLBCC) |
English | German | spoken | Transcribed interactions between native English speakers, ESL and EFL speakers | Various | c. 350,000 | Andreas Jucker Sara Smith University of Giessen, Germany |
Restricted use: apply for approval to get a copy. |
The Hong Kong University of Science & Technology learner corpus (HKUST) |
English | Chinese - mostly Cantonese | written | Untimed assignments written for EFL courses and school leaving exams | University and advanced high school students | c. 25 m. | John Milton Hong Kong University of Science & Technology, Hong Kong |
|
The Indianapolis Business Learner Corpus (IBLC) |
English | Various | written | Job application letters and résumés of business communication students from the U.S., Belgium, Finland, Germany, and Thailand, spanning the years 1990-1998 | Ulla Connor Kristen Precht Thomas Albin Upton Indiana University, USA |
|||
The International Corpus of Crosslinguistic Interlanguage (ICCI) | English | Various | written | Essays (20-min in-class tasks without the use of a dictionary) | Beginner to lower-intermediate | 9,000 essays | Yukio Tono Tokyo University of Foreign Studies, Japan |
Freely available |
The International Corpus Network of Asian Learners of English (ICNALE) |
English | Chinese Indonesian Japanese Korean Malay etc. |
written and spoken |
Controlled speeches and essays L1 productions by 350 NS |
Various | c. 1,8 m. | Shin'ichiro Ishikawa Kobe University, Japan |
Freely available |
The International Corpus of Learner English (ICLE) |
English | Various | written | Argumentative and literary essays | High-intermediate to advanced | c. 3 m. | Sylviane Granger Centre for English Corpus Linguistics Université catholique de Louvain, Belgium |
CD-Rom + handbook: order online. |
The International Teaching Assistants corpus (ITAcorp) |
English | Various | spoken | Learner language from a variety of spoken classroom tasks: office hours role plays, presentations, discussions | c. 500,000 | Steven L. Thorne Paula Golombek Jonathon Reinhardt Pennsylvania State University, USA |
||
The Iranian Corpus of Learner English | English | Farsi | written | Expository essays | University students (English majors) | 436,035 | Parviz Maftoon, Parviz Birjandi, Hossein Khazaee | CD-ROM, data gathered for PhD dissertation by Hossein Khazaee; this corpus is the intellectual property of the Science and Research Branch, Islamic Azad University, Tehran, Iran |
The ISLE speech corpus | English | German Italian |
spoken | Recorded sentences from several blocks of differing types (reading simple sentences, using minimal pairs, giving answers to multiple choice questions) | Intermediate | c. 18h | ecisle@nats.informatik.uni-hamburg.de | CD-Rom |
The Israeli Learner Corpus of Written English | English | Hebrew | written | Argumentative and descriptive essays | c. 750,000 | Tina Waldman Kibbutzim College of Education, Israel |
||
The Japanese English as a Foreign Language Learner Corpus (JEFLL) |
English | Japanese | written | Student essays | From beginning to intermediate | c. 700,000 |
Yukio Tono, Meikai University, Japan |
The JEFLL Corpus will be freely available for research, first via the web query system (already available in Japanese); the entire dataset will later be distributed under license. |
The Janus Pannonius University Corpus (JPU) |
English | Hungarian | written | Essays and research papers | University students | c. 500,000 | József Horváth University of Pécs, Hungary |
Searchable online |
Lancaster Corpus of Academic Written English (LANCAWE) |
English | various | written | IELTS academic writing tests (descriptive and argumentative tasks); assignments. Longitudinal data. |
||||
The Lang-8 Learner Corpora | English | Various | written | texts from Lang-8, a social networking site for language learning | / | / | Toshikazu Tajiri & Mamoru Komachi | Available |
The LeaP Corpus: Learning Prosody in a Foreign Language | English | German | spoken | Four types of speech styles were recorded: read speech, prepared speech, free speech, nonsense word lists |
Various | c. 12h | Ulrike Gut Albert-Ludwigs-University Freiburg, Germany |
The annotated corpus is available to the scientific community. Please contact Ulrike Gut at the University of Augsburg. |
The Learner Corpus of Engineering Abstracts (LCEA) |
English | Malaysian | written | Abstracts of the Computer and Communication Systems Engineering Final Year Projects | Various |
c. 550,000 998 abstracts |
Helen Tan, University Putra Malaysia Chan Swee Heng Ain Nadzimah Syamsiah bt Mashohor |
Available |
The Learner Corpus of English for Business Communication | English | Chinese | written | Different types of business correspondence written for simulated business situations, including memos, faxes, reports, letters of enquiry and complaint letters | c. 117,500 | Li Lan Hong Kong Polytechnic University, Hong Kong |
Searchable online | |
The Learner Corpus of Essays and Reports | English | Chinese | written | Essays and project reports covering a range of topics from Science, IT and New Media to Nursing, Business and Economics, and the Social Sciences | c. 188,000 |
Sima Sengupta
|
Searchable online | |
A Learners' Corpus of Reading Texts | English | French | spoken | Unprepared reading of English texts. The texts are short abstracts of fiction or made-up dialogues. |
University students | Sophie Herment Valérie Kerfelec Laetitia Leonarduzzi Gabor Turcsan |
Freely available | |
The LONGDALE project: LONGitudinal DAtabase of Learner English | English | Various | spoken and written | Range of text types/task types. Longitudinal data. |
From intermediate to advanced | Fanny Meunier Centre for English Corpus Linguistics Université catholique de Louvain, Belgium |
Under development | |
The Longman Learners' Corpus | English | Various | written | Essays and exam scripts | Various | c. 10 m. | Longman | Commercial |
The Louvain International Database of Spoken English Interlanguage (LINDSEI) | English | Various | spoken | Interviews and picture descriptions | High-intermediate to advanced | c. 800,000 | Gaëtanelle Gilquin Centre for English Corpus Linguistics Université catholique de Louvain, Belgium |
CD-Rom and handbook: order online |
The Malaysian Corpus of Learner English (MACLE) |
English | Malay | written | Gerry Knowles Zuraidah Mohd. Don University of Malaya, Malaysia |
||||
The Malaysian Corpus of Students' Argumentative Writing (MCSAW) |
English | Malay Chinese Indian |
written | Argumentative essays |
Form 4 |
c. 565,500 |
Seyed Ali Rezvani Kalajahi |
Available from developers |
The Michigan Corpus of Academic Spoken English (MICASE) | English | Mainly L1 speakers but also includes data produced by L2 speakers | spoken | Transcripts of academic speech events | c. 1,8 m. |
Ute Römer |
Searchable online | |
The Michigan Corpus of Upper-level Student Papers (MICUSP) | English | Semi-balanced sample of native and non-native speakers of English | written | ESP papers A-grade papers or ungraded papers that have been assessed and accepted (such as research proposals), but not published |
c. 2,6 m. |
Ute Römer |
Searchable online | |
The Montclair Electronic Language Database (MELD) |
English | Various | written | Student essays | Various | c. 100,000 | Eileen Fitzpatrick Milton S. Seegmiller Montclair State University, USA |
Searchable online Includes error annotations |
The Multimedia Adult ESL Learner Corpus (MAELC) |
English | ESL environment | multimedia | Video of classroom interaction and associated written materials | Beginner to upper-intermediate |
Stephen Reder |
The Lab School would like to share the extensive resources from MAELC with interested researchers and teacher trainers. Those interested should make inquiries to the Lab School by e-mail. | |
The Neungyule Interlanguage Corpus of Korean Learners of English (NICKLE) | English | Korean | spoken and written |
Written part: student essays |
Mainly from beginning to intermediate |
Written: Spoken: |
Ji-Myoung Choi Yonsei University, Seoul, Korea |
The corpus will be available to the scientific community for research purposes upon request. |
The Japanese Learner English Corpus (NICT JLE) |
English | Japanese | spoken | English oral proficiency interview test | various | 2 m. | Emi Izumi Kiyotaka Uchimoto Hitoshi Isahara National Institute of Information and Communications Technology, Kyoto, Japan. |
Freely available (downloadable) |
The NOn-native Spanish corpus of English (NOSE) |
English | Spanish | written | Argumentative and descriptive student essays | Intermediate and upper-intermediate | c. 300,000 words | Ana Diaz-Negrillo Universidad de Granada, Spain |
|
The NUS Corpus of Learner English | English | Several East Asian languages, predominantly Chinese | written | Student essays on a wide range of topics including environmental pollution, healthcare, etc. | various | c. 1 m. | Hwee Tou Ng Siew Mei Wu Daniel Dahlmeier National University of Singapore, Singapore. |
Freely available |
The PELCRA Learner English Corpus (PLEC) |
English | Polish | spoken and written | Written: Argumentative, descriptive, narrative and quasi-academic essays; formal letters | From beginning to post-advanced |
Under development Aim spoken: Aim written: |
Piotr Pęzik |
Online search engine and corpus analysis tools |
The PICLE corpus (Polish component of ICLE) | English | Polish | written | Student essays | Advanced | c. 330,000 | Przemyslaw Kaszubski AMU, Poznan, Poland |
Searchable online |
The Qatar learner corpus | English | Arabic (mostly from Qatar) | spoken | Spoken interviews with Qatari learners of English | Yun Zhao Helen Carnegie Mellon University, USA |
Freely available | ||
The Québec learner corpus | English | French (from Québec) | written | Argumentative essays | Intermediate and advanced | c. 250,000 | Tom Cobb Université du Québec à Montréal, Canada |
|
The Romanian Corpus of Learner English (RoCLE) |
English | Romanian | written | Student essays | Chitez Madalina Zurich University, Switzerland |
|||
Russian Error-Annotated English Learner Corpus | English | Russian | written |
examination essays of the kind similar to IELTS Task 1 and Task 2, with errors annotated manually |
Intermediate to Advanced |
c.800,000 by November 2017 and growing (together with the old part of the corpus less consistently annotated or not annotated, available at http://realec.org/index.xhtml#/ - c.2,000,000) |
Olga Vinogradova, School of Linguistics, Research University Higher School of Economics |
|
The Russian Learner Translator Corpus (RusLTC) |
English Russian |
Russian | written | Translations produced by trainee translators | Trainee translators | c. 1.5 m. tokens | Project directors: Andrey Kutuzov and Maria Kunilovskaya | Freely available |
The Santiago University Learner of English Corpus (SULEC) | English | Spanish | spoken and written |
Written: compositions or argumentative essays. Spoken: semi-structured interviews, short oral presentations and brief story descriptions. |
Various | Aim: c. 1 m. words | Ignacio M. Palacios Martínez, Santiago University | Available after registration |
The Scientext English Learner Corpus | English | French | written | Academic argumentative texts | c. 1.1 m. | scientext@u-grenoble3.fr | Searchable online | |
Second Language Research Tasks (SLRT) |
English | Various |
written spoken |
written paragraphs various oral tasks |
Various | c. 300,000 |
Bill Crawford (Northern Arizona University) Kim McDonough (Concordia University) |
Under development |
The Seoul National University Korean-speaking English Learner Corpus (SKELC) | English | Korean | written | Student essays | Various | c. 900,000 | Heokseung Kwon Seoul National University Korea |
|
The SILS Learner Corpus of English | English | Various (mainly Japanese) | written | Student essays | Basic, intermediate and advanced |
c. 3.2 m. (first and second drafts included) |
Victoria Muehleisen Waseda University, Japan |
|
The Soochow Colber Student Corpus (SCSC) | English | Chinese | written | Student essays | c. 227,000 | Colman Bernath Soochow University, Taiwan |
||
The Spoken and Written English Corpus of Chinese Learners (SWECCL) |
English | Chinese | spoken (SECCL) and written (WECCL) |
Written: argumentative and narrative essays. Spoken: National Spoken English Test – longitudinal data |
c. 2 m. | Wen Qiufang Liang Maocheng Wang Lifei |
CD-rom |
|
The Taiwanese Corpus of Learner English (TLCE) |
English | Chinese | written | Journals and essays (descriptive, narrative, expository, argumentative) | Intermediate to advanced | c. 2 m. | Rebecca Hsue-Huch Shih Sun Yat-sen University, Taiwan |
|
The Taiwanese learner academic writing corpus (TaiwanLAWC) | English | Chinese | written | Theses and dissertations written by Taiwanese graduate students. | Howard Chen National Taiwan Normal University, Taiwan |
|||
The TELEC Secondary Learner Corpus |
English | Chinese | written and spoken | Compositions from secondary classroom | c. 2 m. | Quentin Allan University of Hong Kong, Hong Kong |
||
The Telecollaborative Learner Corpus of English and German Telekorp | English | German | written | Bilingual, longitudinal database comprising computer-mediated NS-NNS interactions between approximately 200 Americans and Germans collected during six different telecollaborative partnerships from 2000-2005. | c. 1,5 m. | Julie Belz Pennsylvania State University, USA. |
Not publicly available | |
The Ten-Thousand English Compositions of Chinese Learners (TECCL) |
English | Chinese | written | Essays (various topics) written in and after class, and in testing context. Also contains some collaborative writing samples. | Various (mainly undergraduates) | c. 1,8 m. | Project initiator: Jiajin Xu, National Research Centre for Foreign Language Education, Beijing Foreign Studies University | Raw texts and part-of-speech tagged texts are available |
The Tswana Learner English Corpus (TLEC) | English | Tswana | written | Argumentative essays | Advanced | c. 200,000 | Bertus Van Rooy North-West University, South Africa |
Available in ICLE |
The Uppsala Student English Corpus (USE) |
English | Swedish | written | Student essays | Various | c. 1,200,000 | Ylva Berglund Prytz Margareta Westergren Axelsson Uppsala University, Sweden |
The corpus can be used for research and educational purposes. It can be accessed on the Internet from the Oxford Text Archive. |
The Uppsala WordReference Corpus | English, Spanish, French, Italian | Various | Written | Forum posts |
|
English learner subcorpus: 38M; English native subcorpus: 50M; Spanish learner subcorpus: 5M; Spanish native subcorpus: 22M; French learner subcorpus: 4M; French native subcorpus: 7M; Italian learner subcorpus: 1M; Italian native subcorpus: 3M |
Aleksandrs Berdicevskis Uppsala University |
Freely available |
The UPF Learner Translation Corpus | English | Catalan | written | Translations written by the students of the Translation and Interpreting degree at UPF. | c. 200,000 | Anna Espunya Pompeu Fabra University, Barcelona, Spain |
||
The UPV Learner Corpus | English | Catalan | written | essays | Various | c. 150,000 | Universitat Politècnica de València, Spain | |
The Varieties of English for Specific Purposes dAtabase learner corpus (VESPA) |
English | Various | written | ESP texts (term papers, reports, MA dissertations) | Various | c. 220,000 (under development) | Magali Paquot Centre for English Corpus Linguistics Université catholique de Louvain, Belgium |
|
The Written Corpus of Learner English corpus (WriCLE) |
English | Spanish | written | Essays | Various | c. 750,000 | Paul Rollinson Universidad Autonoma de Madrid, Spain |
The corpus is available for free, and can be downloaded from this website. There is also a search interface to retrieve sentences and clauses. |
The Yonsei English Learner Corpus (YELC) | English | Korean | written | Yonsei University English Diagnostic Tests (Part 1: Descriptive task, max. 100 words; Part 2: Argumentative task, max. 300 words) | 9 levels (A1, A1+, A2, B1, B1+, B2, B2+, C1, C2) |
c. 1 m. | Seok-Chae Rhee CK Jung Yonsei University, Korea |
The YELC corpus will be available to the scientific community for research purposes from 31 March 2012. |
The Young Learner Corpus of English (YOLECORE) |
English | Greek | spoken | Pedagogic Corpus of video-recorded EFL language classes. |
170 school hours (126 hours of videotaped material) 1,5 m. types |
Project director: Marina Mattheoudakis, Aristotle University of Thessaloniki, Greece |
||
The Estonian Interlanguage Corpus of Tallinn University (EIC) |
Estonian | Russian Finnish English German Latvian Lithuanian Ukrainian Belorussian |
written | Spontaneously produced texts in language learning situations: argumentative and literary essays, written stories, letters, term papers, reading reports. | A1-C2 | c. 1 m. | Project director: Pille Eslon Tallinn University, Estonia |
Restricted online access |
Linguistic Basis of the Common European Framework for L2 English and L2 Finnish (CEFLING) |
Finnish English |
Various | written | Various | Various |
Maisa Martin, University of Jyväskylä, Finland |
||
Paths in Second Language Acquisition (TOPLING) |
Finnish English Swedish |
Various | written | Various | Various |
Maisa Martin, University of Jyväskylä, Finland |
||
The Advanced Finnish Learner Corpus (LAS2) |
Finnish | Russian Czech Swedish Estonian Lithuanian Komi English Hungarian German Icelandic Japanese |
written | Exam essays, theses, essays and writings | Advanced | c. 630,000 |
Kirsti Siitonen, University of Turku, Finland Ilmari Ivaska, University of Turku, Finland |
|
The Finnish National Foreign Language Certificate Corpus (YKI) | Finnish |
English |
written spoken |
Various | Beginner, intermediate and advanced |
Ari Maijanen, Centre for Applied Language Studies, University of Jyväskylä, Finland Tiina Lammervo, Centre for Applied Language Studies, University of Jyväskylä, Finland |
Available with user ID and Password | |
The International Corpus of Learner Finnish (ICLFI) |
Finnish | Various | written | Finnish learners’ spontaneously produced texts in language learning situations, large variety of text types | Beginner, intermediate and advanced | Under development |
University of Oulu, Finland |
Free download after applying for a user licence |
The Chy-FLE (Cypriot Learner Corpus of French) | French | Modern Greek (and Cypriot Greek) |
written | Argumentative and descriptive essays | From intermediate to advanced | c. 250,000 (under development) | Freiderikos Valetopoulos Université de Poitiers, France In collaboration with the University of Cyprus |
|
The COREIL corpus | French English |
spoken | Elisabeth Delais-Roussarie Hiyon Yoo Université Paris-Diderot, France |
|||||
The "Dire Autrement" corpus | French (Second Language) | Mainly L1 speakers of English | written | Narrative, injunctive, persuasive and informative texts | c. 50,000 | Marie-Josée Hamel Jasmina Milicevic Dalhousie University, Canada |
Available after registration | |
French Interlanguage Database (FRIDA) |
French | Various | written | Free compositions: descriptive, argumentative and narrative texts, news & mail | Intermediate | Sylviane Granger Centre for English Corpus Linguistics Université catholique de Louvain, Belgium |
||
French Learner Language Oral Corpora (FLLOC) |
French | Various | spoken | See description of the 7 corpora | Various | Florence Myles Newcastle University Rosamund Mitchell University of Southampton, UK |
The contents of the database are being made freely available to the research community, in the form of digital sound files and related transcripts formatted using CHILDES software. |
|
The InterFra corpus | French | Swedish | spoken | Interviews, retellings of video clips and picture stories | Various |
Inge Bartning |
Available | |
The "Interphonologie du Français Contemporain" corpus (IPFC) |
French | Cypriot Greek Dutch English (Canada) German Japanese Norwegian Spanish
|
spoken | Reading aloud, repeating words, guided interviews, interactions between two learners. | Various | Under development | Sylvain Detey Waseda University, Japan Université de Rouen, France Isabelle Racine Université de Genève, Switzerland Yuji Kawaguchi Tokyo University of Foreign Studies, Japan |
Under development; samples available |
The Learner Corpus French (LCF) |
French | Dutch | written |
Argumentative essays Written compositions by Flemish students of French |
Intermediate to advanced | c. 500,000 | K.U.Leuven Campus Kortrijk, UGent and Lessius Hans Paulussen |
Under development |
The Lund CEFLE Corpus (Corpus Écrit de Français Langue Étrangère) | French | Swedish | written | Descriptive and narrative essays; picture-based stories. | Various | c. 100,000 | Malin Ågren Lund University, Sweden |
A sub-part of the corpus is available online. |
The University of the West Indies learner corpus (UWi) |
French |
English Jamaican Creole |
spoken | Conversations during oral exams and in informal contexts | Various | Hugues Peters University of New South Wales, Sydney, Australia |
||
Comasan Labhairt ann an Gàidhlig (CLAG) - Gaelic Adult Proficiency (GAP) |
Gaelic | Various | spoken |
Conversation task Narrative Elicited oral imitation task Question and answer activity |
Various |
Roibeard Ó Maolalaigh (University of Glasgow) Nicola Carty (University of Glasgow) |
||
The AleSKO corpus | German |
Chinese Also German L1 data from the FALKO corpus |
written | Argumentative essays | c. 13,600 | Heike Zinsmeister University of Konstanz, Germany Margrit Breckle Vilnius Pedagogical University, Lithuania. |
||
Analyzing Discourse Strategies: A Computer Learner Corpus | German | English (mainly American English) |
written | Threaded Discussion Chat Essays Longitudinal data |
From beginner to intermediate-mid | Under development | Christina Frei Edward Nixon University of Pennsylvania, USA |
|
The Corpus of Learner German (CLEG13) | German | English | written | Argumentative, free compositions Longitudinal over 4 years, undergraduate students |
Intermediate to advanced | c. 320,000 | Ursula Maden-Weinberger |
Online access through the FALKO platform. |
The deL1L2IM corpus | German |
Russian-Belorussian bilinguals |
written | Instant messaging dialogues | Advanced | c. 52,000 |
Sviatlana Höhn |
Available |
The Fehlerannotiertes Lernerkorpus (‘error annotated learner corpus’) (FALKO) |
German |
Learner subcorpus: various Native subcorpus: German |
written |
1. Summaries 2. Essays 3. Letters, fiction writing, journal articles, book reviews (= longitudinal data from American learners) |
1. Advanced 2. Advanced 3. Beginners - advanced
|
1. c. 40,000 (learner subcorpus) + c. 20,000 (native subcorpus) 2. c. 150,000 (learner corpus) + c. 70,000 (native subcorpus) 3. c. 78,000 (learner subcorpus) |
Anke Lüdeling |
Online access |
The KOLIPSI corpus | German | Italian | written | Two written language production tasks of a standardized test (email/letter) | A2-C1 | under development | Andrea Abel Aivars Glaznieks European Academy Bolzano/Bozen, Italy |
|
The Learning the Prosody of a Foreign Language (LeaP) |
German | Various | spoken | The LeaP corpus covers four different types of speech: - read speech - prepared speech - free speech - nonsense word lists |
Various | 62 speakers | Ulrike Gut University of Augsburg, Germany |
The annotated corpus is available to the scientific community. Please contact Ulrike Gut at the University of Augsburg. |
The LeKo (Lernerkorpus) corpus | German | c. 55,000 | Anke Lüdeling, Humboldt-Universität Berlin, Germany |
Online access (password protected) Register here |
||||
The LINCS Corpus |
1. German 2. German 3. German |
1. English 2. German |
1. Written 2. Written 3. Written |
1. Essays, examination, answers. 2. Essays 3. Teaching output |
1. Intermediate to Advanced 2. Advanced |
Under development | Elizabeth Thoday Heriot-Watt University Edinburgh, UK |
Not currently publicly available |
Multilingual Platform for the European Reference Levels: Exploring Interlanguage in Context (MERLIN) |
German Italian Czech |
Various | written | writing tasks from standardized tests (telc/UJOP) | A1 to C1 | c. 280,000 | Katrin Wisniewski | Available |
Rhodes University Deutsch als Fremdsprache (RUDaF) | German | English, Afrikaans, isiXhosa, XiTsonga | written | Short descriptive and argumentative writing paragraphs (300 words each) | A2-B2 | 34,000 |
Dr Undine S. Weber Rhodes University, South Africa |
Not available |
The Telecollaborative Learner Corpus of English and German Telekorp | German | English | written | Bilingual, longitudinal database comprising computer-mediated NS-NNS interactions between approximately 200 Americans and Germans collected during six different telecollaborative partnerships from 2000 to 2005. | c. 1.5 m. |
Julie Belz
|
Not publicly available | |
The Langman corpus | Hungarian | Chinese | spoken | Interviews conducted in 1994 with 11 Chinese immigrants living in Hungary. Interviews focused on issues related to their arrival in Hungary as well as their daily life activities |
Juliet Langman University of Texas at San Antonio, USA |
Freely available | ||
Corpus di Apprendenti di Italiano L2 (CAIL2) |
Italian | Various | written | Essays | Intermediate to advanced | c. 237,000 | Stefania Spina, Università per Stranieri di Perugia | Searchable via CQPweb |
Corpus parlato di italiano L2 | Italian | English German Japanese |
spoken | Transcriptions of interviews | Various | Stefania Spina Silvio Pazzaglia Mirco Perini Università per Stranieri di Perugia, Italy |
Searchable online | |
The KOLIPSI corpus | Italian | German | written | Two written language production tasks of a standardized test (email/letter) | A2-C1 | Under development | Andrea Abel European Academy Bolzano/Bozen, Italy |
|
The Lexicon of Spoken Italian by Foreigners (LIPS) |
Italian | Various | spoken | Proficiency exams of the Certification of Italian as a Foreign Language (CILS) | A1-C2 | c. 700,000 |
Francesca Gallina |
Freely available |
MISTiC (Multiple Italian Student TranslatIon Corpus) | Italian | English, French | written | translations produced by trainee translators (mainly specialised texts) | post-graduate trainee translators | ca. 125,000 (English-Italian), ca. 50,000 (French-Italian) | Sara Castagnoli, University of Bologna, Italy | not available |
Varietà di Apprendimento della Lingua Italiana: Corpus Online (VALICO) |
Italian | Various | written | Various | c. 570,000 | Manuel Barbera: manuel.barbera@bmanuel.org Carla Marello |
Freely available and searchable online. | |
Longitudinal Corpus of Chinese Learners of Italian (LOCCLI) | Italian | Chinese | written | essays | beginners and pre-intermediate | 97,000 | The LOCCLI is part of a joint project between Stefania Spina (University for Foreigners of Perugia, Italy) and Anna Siyanova-Chanturia (Victoria University of Wellington, New Zealand). | It is freely searchable via CQPweb (registration required) from https://www.unistrapg.it/cqpweb/ |
Corpus of Chinese Learners of Italian (COLI) | Italian | Chinese | written and spoken |
essays and answers to open questions interviews |
intermediate and advanced | 82,300 |
Contact: Stefania Spina |
The COLI is freely searchable via CQPweb (registration required) from https://www.unistrapg.it/cqpweb/ |
The Korean learner corpus | Korean | Various | written | Various: letters, essays, formal writing... | Beginner and intermediate | c. 10,000 | Seok Bae Jang Georgetown University, USA Sun Hee Lee Wellesley College, USA Sang kyu Seo Yonsei University, South Korea |
|
ESAM | Latvian and Lithuanian | Latvian and Lithuanian | written | Beginner | 52,000 | Inga Znotiņa | Available online | |
The ASK corpus | Norwegian | German Dutch English Spanish Russian Polish Bosnian-Croatian-Serbian Albanian Vietnamese Somali |
written | Essays from language tests | B1 and B2 | Kari Tenfjord University of Bergen, Norway |
||
The Persian Learner Corpus (PLC) |
Persian (Farsi) | Various | written | Narratives and essays | Intermediate and advanced | Academic/Restricted online access |
University of Belgrade, Faculty of Philology |
Academic/Restricted online access |
The Salam Farsi Learner Corpus (SFLC) |
Persian (Farsi) | Serbian | written | Narratives, descriptive essays | Beginner and upper-intermediate | Under development |
University of Belgrade, Faculty of Philology |
Academic, under development |
Learner Corpus of Portuguese L2 (COPLE2) | Portuguese | 15 languages: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian and Swedish | Written and spoken | Exams and assignments | A1-C1 | written: 171,461 oral: 25,783 |
Iria del Río | Available |
Russian Learner Corpus | Russian | varied | written and spoken | academic and non academic | teachers and heritage speakers | unknown | Ekaterina Rakhilina | Available online |
The PIKUST pilot learner corpus | Slovene | Various | written | Mostly argumentative essays | Majority advanced – but also intermediate and beginner | c. 35,000 | Mojca Stritar University of Ljubljana, Slovenia |
|
The Anglia Polytechnic University (APU) Learner Spanish Corpus | Spanish | Various | written | c. 120,000 | Anne Ife Anglia Ruskin University, UK |
|||
Aprescrilov ("Aprender a Escribir en Lovaina") | Spanish | Dutch | written | Written assignments and tests; several text types (letters, expository, descriptive, argumentative, narrative) | A1 to C1 | c. 1 m. |
Kris Buyse |
Restricted online access |
Spanish | Various | written | A1 to C1 |
c. 575,000 |
Universidade de Santiago de Compostela |
Online access | ||
Corpus Escrito del Español L2 (CEDEL2) |
Spanish | English | written | Written compositions by learners of Spanish | c. 730,000 |
Amaya Mendikoetxea |
Please contact Cristobal Lozano to get a free sample of the corpus | |
Corpus de textos escritos para el análisis de errores de aprendices de E/LE |
Spanish | Various | written | Essays | A2 to C1 | / |
Cestero Mancera, A. M. Universidad de Alcalá de Henares |
CD-ROM available |
The Corpus of Taiwanese Learners of Spanish (Corpus de Aprendices Taiwaneses de Español) (CATE) |
Spanish | Chinese | written | Student essays | Various | c. 340,000 | hclu@mail.ncku.edu.tw | Under development |
The DIAZ corpus | Spanish |
German |
spoken | Semi-spontaneous (structured interviews) and experimental (structured questionnaires) Adult Spanish L2/L3 oral data | Various | Lourdes Diaz Rodriguez Universitat Pompeu Fabra, Spain |
Freely available | |
The Japanese learner corpus of Spanish | Spanish | Japanese | written | Student essays | c. 83,400 | Yoshihito Kamakura University of Birmingham, UK |
||
The Spanish Corpus Proficiency Level Training (SPT) |
Spanish | English (heritage language learners) | spoken | Dialogues about a given set of questions | Beginner to advanced | Dr Dale Koike, University of Texas, Austin Liberal Arts Instructional Technology Center |
Videos are available |
|
Spanish Learner Language Oral Corpus (SPLLOC) |
Spanish | English | spoken | Learner narratives, interviews and picture description tasks | Beginner to advanced | c. 50,000 | Laura Dominguez University of Southampton, UK |
Searchable online Data freely available for download |
Spanish Learner Oral Corpus | Spanish | Various (9+ languages - especially Portuguese, French, Italian) |
spoken | Semi-spontaneous interviews, narrative and descriptive tasks | A2-B1 | c. 50,000 words | Leonardo Campillos Llanos Laboratorio de Lingüística Informática Universidad Autónoma de Madrid, Spain |
Online access |
The Tartu Learner Corpus of Spanish as a L3+ | Spanish | Estonian | written | Academic research writing | Advanced | c. 885,000 | Mari Kruse, University of Tartu, Estonia | |
The ASU corpus | Swedish | Chinese English German Greek Polish Portuguese Spanish ... |
spoken and written | Transcribed audio-recorded conversations and written texts from adult learners of Swedish – longitudinal data | c. 490,000 words (c. 415,000 spoken and c. 75,000 written) |
Björn Hammarberg Stockholm University, Sweden |
Manual | |
Leiden Learner Corpus | Multilingual (Dutch, French, Italian, Portuguese and Spanish) | various | written and spoken | written data: short essays; oral data: picture-based story telling | various | 200 participants | M. Carmen Parafita Couto | |
The European Science Foundation Second Language Database |
Multilingual: Dutch |
Punjabi |
spoken | Spontaneous second language acquisition of forty adult immigrant workers living in Western Europe, and their communication with native speakers in the respective host countries | Various | Wolfgang Klein Clive Perdue Max Planck Institut, Nijmegen, Netherlands |
Freely available | |
The Foreign Language Examination Corpus (FLEC) |
Multilingual | Polish | written | Data from the Warsaw University Certification Exams |
Various | Under development | Piotr Banski Romuald Gozdawa-Golebiowski Warsaw University, Poland |
|
The MeLLANGE Learner Translator Corpus (LTC) |
Multilingual | various | written | Legal, technical, administrative and journalistic texts | Trainee translators |
Natalie Kübler |
Searchable online | |
The MiLC Corpus |
Multilingual: Catalan |
Catalan | written | Formal and informal letters, summaries, curriculum vitae, essays, reports, translations, synchronous and asynchronous communication exchanges, business letters | c. 150,000 | Angeles Andreu Andrés et al. Universidad Politécnica de Valencia, Spain |
||
The Multilingual Learner Corpus (MLC) |
Multilingual: English |
Brazilian Portuguese | written | Argumentative and narrative essays | Aim: c. 200,000 | Stella E.O. Tagnin University of São Paulo, Brazil |
Accessible online to registered researchers | |
The Padova Learner Corpus |
Multilingual: English |
Italian | CMC (Computer-Mediated Communication) |
Student work produced in blended language courses using FirstClass conferencing software.
|
Under development | Fiona Dalziel Francesca Helm University of Padua, Italy |
||
The corpus PARallèle Oral en Langue Etrangère
|
Multilingual: English (Mainly L2 speakers but also includes data produced by L1 speakers) |
Various | spoken | 5 oral production tasks | Various | Heather Hilton John Osborne Marie-Jo Derive Nejma Succo Jean O'Donnell Sandra Billard Sandrine Rutigliano-Daspet Université de Savoie, France |
Manual | |
The University of Toronto Romance Phonetics Database (RPD) |
Multilingual: English |
Various (including English, Mandarin, Russian, Spanish, etc.) |
spoken | Elicited production - sentence and passage reading, story narration, description of favourite meal | Various | Laura Colantoni Jeffrey Steele University of Toronto, Canada |
Password available from directors |
Learner corpus-based datasets
Corpus | Target language | First language | Medium | Text type / task type | Proficiency level | Size in words | Project director | Availability |
The Treebank of Learner English (TLE) |
English | Various | written | Sentences from the CLC FCE (annotated with syntactic trees) | Upper-intermediate |
97,681 |
Yevgeni Berzak | Publicly available through the UD repository ('English-ESL') |