Project Overview

Looking at text re-use in a corpus of seventeenth-century news reportage

It has been supposed by many scholars who have examined the newsbooks of the seventeenth century that newswriters indulged in a great deal of text re-use. This may mean the verbatim quoting of a source document in more than one newsbook; it may also refer to more direct forms of duplication. The issue of text re-use in this period is a complicated one. Many writers were responsible for more than one periodical. Successful newsbooks might be duplicated or emulated in a number of ways by other hands - as reprints, as counterfeits, or as imitations, to follow the classification used in the New Cambridge Bibliography of English Literature. Furthermore, as with today's newspapers, even totally independent publications were still reporting the same events; for instance, speeches in Parliament.

All judgements of how extensive such re-use was in this period are of necessity somewhat imprecise. The documents are available only on microfilm (with the exception of the elderly and delicate originals). Performing a large-scale quantitative study of the text re-use phenomenon would clearly be impractical using only pen, paper and the human eye: the process would be too lengthy and too prone to human error. However, using techniques associated with the methodology of computer corpus linguistics, it has now been possible to undertake such a study.

Using a corpus of text stored in machine-readable format (i.e. not as graphical scans) it is possible to compare two documents swiftly and accurately, and to produce very quickly a quantitative evaluation of their similarity - that is, the extent to which one has copied from another, or to which both have utilised the same third source. To this end, a project (funded by the British Academy) was begun here at the University of Lancaster with the following two aims:

Prepare an electronic edition of a set of newsbook texts from a given period within that covered by the Thomason Tracts;

Use an automatic algorithm to evaluate the extent to which text re-use occurs in this small corpus.

The period selected was that from December 1653 to May 1654. This period is of historical as well as linguistic interest, since it corresponds to the beginning of Cromwell's Protectorate. At this time much of England's attention was focussed on happenings in Scotland, where a Royalist uprising under Glencairn was threatening Cromwell's rule. A historical summary of the Glencairn Uprising, prepared for this project by Helen Baker, can be downloaded from this site.

Other items much in the news at this time included a peace treaty being negotiated with the Netherlands, and an embassy to the Queen of Sweden.

The corpus consists of approximately 800,000 words of running text drawn from all the newsbooks present in the Thomason Tracts that were published during the period in question. These documents were typed in an SGML-compatible format by transcribers during the middle part of 2002 (see also here for a description of the encoding scheme). It was then necessary to identify an appropriate method for examining the text re-use in the corpus.

A study of text re-use in modern newspaper reportage was conducted by a team at the University of Sheffield including Robert Gaizauskas and Scott Piao (see for instance P. Clough, R. Gaizauskas and S. Piao, "Building and annotating a corpus for the study of journalistic text reuse", in Proceedings, LREC 2002). On the basis of this investigation, Piao and Gaizauskas developed algorithms and software capable of evaluating the similarity of a pair of texts by scanning for similar strings of words. It is this program, called TESAS, which has been used in this project to look for text re-use in the Early Modern period.
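The general idea of scanning two texts for shared strings of words can be illustrated with a minimal sketch. This is not the TESAS algorithm, whose internals are not described here; it is a simple word n-gram overlap measure, and the function names, the choice of three-word sequences and the sample sentences are all assumptions made for illustration.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of word n-grams in a text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(candidate, source, n=3):
    """Fraction of the candidate's n-grams that also occur in the source.

    Returns a value between 0.0 (no shared n-grams) and 1.0 (every n-gram
    of the candidate is found in the source).
    """
    candidate_ngrams = word_ngrams(candidate, n)
    if not candidate_ngrams:
        return 0.0  # candidate too short to contain any n-gram
    shared = candidate_ngrams & word_ngrams(source, n)
    return len(shared) / len(candidate_ngrams)


# Invented sample sentences, not taken from the corpus.
text_a = "The Lord Protector received the ambassadors from Holland"
text_b = "This day the Lord Protector received the ambassadors with great state"
score = containment(text_a, text_b)
```

A measure of this kind is asymmetric: it asks how much of one text is found in the other, which suits the question of whether one newsbook has borrowed from another. It matches exact word forms only, so the variable spelling of the period would need further handling in any real application.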

To see some sample data from the corpus, and the workings of the text re-use evaluation algorithm, click here.

This process has now been applied to the entire corpus of newsbook text and the results are to be published in 2003.