Corpora and Historical Linguistics

Historical linguistics can be seen as a species of corpus linguistics, since the texts of a historical period or a "dead" language form a closed corpus of data which can only be extended by the (re-)discovery of previously unknown manuscripts or books. In some cases it is possible to use (almost) all of the closed corpus of a language for research - something which can be done for ancient Greek, for example, using the Thesaurus Linguae Graecae corpus, which contains most of extant ancient Greek literature. In practice, however, historical linguistics has not tended to follow a strict corpus linguistic paradigm. Instead it has taken a selective approach to empirical data, looking for evidence of particular phenomena and making rough estimates of frequency. No real attempts were made to produce samples that were representative.

In recent years, however, some historical linguists have changed their approach, resulting in an upsurge in strictly corpus-based historical linguistics and the building of corpora for this purpose. The most widely known English historical corpus is the Helsinki corpus.

The Helsinki corpus contains approximately 1.6 million words of English dating from the earliest Old English period (before AD 850) to the end of the Early Modern English period (1710). It is divided into three main periods - Old English, Middle English and Early Modern English - and each period is subdivided into a number of 100-year subperiods (or 70-year subperiods in some cases). The Helsinki corpus is representative in that it covers a range of genres, regional varieties and sociolinguistic variables such as gender, age, education and social class. The Helsinki team have also produced "satellite" corpora of early Scots and early American English.

Other examples of English historical corpora in development are the Zürich Corpus of English Newspapers (ZEN), the Lampeter Corpus of Early Modern English Tracts (a sample of English pamphlets from between 1640 and 1740) and the ARCHER corpus (a corpus of British and American English from 1650-1990).

The work which is carried out on historical corpora is qualitatively similar to that which is carried out on modern language corpora, although it is also possible to carry out work on the evolution of language through time. For example, Peitsara (1993) used four subperiods from the Helsinki corpus and calculated the frequencies of the different prepositions introducing agent phrases. She found that the most common prepositions of this type throughout the period were "of" and "by". The two were of almost equal frequency at the beginning of the period, but by the fifteenth century "by" was three times as common as "of", and by 1640 it was eight times as common.
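The kind of comparison described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a reconstruction of Peitsara's method: the function name, the subperiod labels and all of the counts are invented, with the counts chosen simply so that the ratios echo the "equal", "three times" and "eight times" pattern reported in the text.

```python
from collections import Counter


def agent_preposition_ratio(counts, numerator="by", denominator="of"):
    """Return how many times as frequent `numerator` is as `denominator`."""
    denominator_count = counts.get(denominator, 0)
    if denominator_count == 0:
        raise ValueError(f"no occurrences of {denominator!r} to compare against")
    return counts.get(numerator, 0) / denominator_count


# Hypothetical counts of agent phrases per subperiod (NOT real corpus data).
hypothetical_counts = {
    "subperiod A": Counter({"of": 50, "by": 50}),
    "subperiod B": Counter({"of": 20, "by": 60}),
    "subperiod C": Counter({"of": 10, "by": 80}),
}

ratios = {
    period: agent_preposition_ratio(counts)
    for period, counts in hypothetical_counts.items()
}

for period, ratio in ratios.items():
    print(f"{period}: 'by' is {ratio:.1f} times as frequent as 'of'")
```

Working with a ratio inside each subperiod, rather than with raw counts, means that subperiods containing different amounts of text can still be compared with one another.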

Studies like this have particular importance in the context of Halliday's (1991) conception of language evolution as a motivated change in the probabilities of the grammar. However, it is important to be aware of the limitations of corpus linguistics, as Rissanen (1989) pointed out. Rissanen identifies three main problems associated with using historical corpora:

  1. The "philologist's dilemma" - the danger that the use of a corpus and a computer may supplant the in-depth knowledge of language history which is to be gained from the study of original texts in their context.
  2. The "God's truth fallacy" - the danger that a corpus may be used to draw conclusions about an entire language period as if it were fully representative, without understanding its limitations in terms of which genres it does and does not cover.
  3. The "mystery of vanishing reliability" - the more variables which are used in sampling and coding the corpus (periods, genres, age, gender, etc.), the harder it is to represent each one fully and achieve statistical reliability. The most effective way of solving this problem is, of course, to build larger corpora.
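
The arithmetic behind the third problem can be made concrete with a rough sketch. The total of 1.6 million words is the approximate size given above for the Helsinki corpus; the variables, the number of categories per variable and the even split across cells are all simplifying assumptions made up for illustration, not a description of how the corpus is actually sampled.

```python
from math import prod


def average_words_per_cell(total_words, categories_per_variable):
    """Return (number of cells, average words per cell) for a cross-classified corpus."""
    cells = prod(categories_per_variable.values())
    return cells, total_words / cells


TOTAL_WORDS = 1_600_000  # approximate corpus size quoted in the text

# Hypothetical numbers of categories for each sampling variable.
variables = {"subperiod": 10, "genre": 10, "gender": 2, "social class": 4}

selected = {}
for name, n_categories in variables.items():
    # Add one variable at a time and watch the average cell size shrink.
    selected[name] = n_categories
    cells, average = average_words_per_cell(TOTAL_WORDS, selected)
    print(f"+ {name}: {cells} cells, about {average:,.0f} words per cell")
```

Each added variable multiplies the number of cells, so the amount of text available for any one combination of categories falls quickly. This is why a larger corpus is the most direct remedy.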
Rissanen's reservations are valid and important, but they should not diminish the value of corpus-based linguistics. Rather, they should serve as warnings of possible pitfalls which scholars need to take on board, since with appropriate care they are surmountable.