Michael Rundell has been a lexicographer since 1980. He has edited several learner's dictionaries, most recently the Macmillan English Dictionary (2002, 2007), and he has been at the forefront in the application of new technology to dictionary-making. As one third of the Lexicography MasterClass (with Sue Atkins and Adam Kilgarriff), he provides training in lexicography and lexical computing. He has published widely in the field of corpus-based lexicography, and is co-author (with Sue Atkins) of the Oxford Guide to Practical Lexicography (OUP 2008).

The road to automated lexicography: first banish the drudgery... then the drudges?

Dr. Johnson’s well-known definition of the lexicographer – a harmless drudge that busies himself in tracing the original, and detailing the signification of words – is mirrored in his dictionary’s subtitle:
In which words are deduced from their originals, and illustrated in their different significations by examples from the best writers.

Descriptive lexicography no longer insists on sourcing its data from “the best writers”, but that’s the only significant difference between then and now. Johnson has identified the three main tasks in traditional lexicography:

deducing the meaning of words from “their originals” (that is, from the evidence of words in use)
detailing their “signification”
illustrating them with examples

In Johnson’s time this process really did involve a lot of drudgery. The first of his tasks, for example, analysing the linguistic evidence, presupposes an earlier stage in which the data is collected – and data collection was a labour-intensive business until very recently. Think not only of the hundreds of volunteer readers whose hand-gathered, painstakingly-transcribed citations underpin the OED, but also of the heroic efforts of early corpus-builders at Brown or Birmingham, whose ambitions always outran the available technology. And corpus creation merely provides us with the raw materials for a dictionary. Each subsequent stage in the dictionary-making process involves a mixture of routine administrative tasks – the “drudgery” that Johnson complained of – and creative thinking, applied first to data analysis and then to entry-writing. What has changed since the 1960s, when Laurence Urdang pioneered the use of computers in lexicography (Hanks 2008, Nesi 2009), is the balance between these two elements, with machines taking on more of the routine jobs such as checking cross-references (and generally doing them better than humans).

It is convenient to think of subsequent developments in terms of two kinds of outcome:

technologies that have enabled us to do the same things we did before, but more efficiently and more systematically.
“game-changing” developments that have expanded the scope of what dictionaries can do and (in some respects) changed our view of what dictionaries are for.

Innovations in the first of these categories have helped to improve dictionaries and make them more internally consistent, while also releasing lexicographers from a great deal of drudgery. The second category is more interesting. Large corpora and sophisticated querying tools have spawned the new discipline of corpus linguistics, and this has led to a re-evaluation of how language works – with inevitable consequences for dictionaries. It is already a given that the object which German speakers call a Wörterbuch is no longer necessarily (or even primarily) a “book”. What is less obvious is that the currency of dictionaries is no longer just “words” in isolation: they now also deal with bigger language systems and syntagmatic networks.

Which brings us to the question posed in this paper’s title: given that computers have gradually taken over many lexicographic tasks which were previously performed by humans, is it plausible to foresee this process continuing to a point where lexicographers are, ultimately, replaced by machines? Back in 1998, Greg Grefenstette asked a similar question: “Will there be lexicographers in the year 3000?” (Grefenstette 1998). He showed how a series of computational procedures – some already in place, some on the horizon, and each “bootstrapping” from the previous one in the sequence – had the potential to progressively reduce the need for human intervention in lexicographic processes. In some respects, Grefenstette’s timescale looks over-optimistic – on present form, it’s unlikely human beings of any sort (let alone lexicographers) will still be around a thousand years from now. But from a technical point of view, the future he envisaged is already close at hand.

Grefenstette belongs to the same community (computational linguists) as my long-term collaborator Adam Kilgarriff, who has made innovative technical contributions to a number of projects I have been involved in over the past 15 years or so (see e.g. Kilgarriff & Rundell 2002). There has been plenty of trial and error, but the overall effect has been to transfer many lexicographic tasks from human to computer. This is inherently interesting (and challenging), and is also a good way of making yourself popular with budget-holders (who save money and usually get a better product). But it raises questions about the future of lexicography: will mechanization lead to lexicographers becoming deskilled? is there still a role for informed human judgement? have I been colluding in a process that will lead to my own redundancy? and so on.

This paper will survey a number of technologies that have been applied in the last 15 years to one or more of the key lexicographic tasks. It will conclude by speculating on what the end-point of this process might be (or indeed, whether there is an end-point at all).

References

Grefenstette, G. (1998). ‘The Future of Linguistics and Lexicographers: Will there be Lexicographers in the Year 3000?’, in Fontenelle et al. (Eds) Proceedings of the Eighth EURALEX Congress. Liege: University of Liege: 25-41. Reprinted in Fontenelle, T. (Ed.) Practical Lexicography: A Reader. OUP 2008.

Hanks, P.W. (2008). Obituary for Laurence Urdang. International Journal of Lexicography 21.4: 467-471.

Kilgarriff, A. and Rundell, M. (2002). ‘Lexical Profiling Software and its Lexicographic Applications: A Case Study’, in Braasch and Povlsen (Eds) Proceedings of the Tenth EURALEX Congress. Copenhagen: Center for Sprogteknologi: 807-819.

Nesi, H. (2009). ‘Dictionaries in Electronic Form’, in A.P. Cowie (Ed.) The Oxford History of English Lexicography. Oxford: The Clarendon Press. Vol. II: 458-478.