LCMC basic information

1. Aims

The corpus approach is increasingly being recognized as a useful tool for linguistic investigation. However, in-depth monolinguistic studies of Chinese have proved difficult in the past because of the general lack of publicly available balanced corpora. Indeed, most of the existing resources relating to Chinese (e.g. the PH Corpus, the PFR People’s Daily Corpus released by the Institute of Computational Linguistics, Peking University, and the corpora released by the LDC) are composed exclusively of newswire texts, or of newswire texts and official documents. Balanced corpora do exist, but they, too, are problematic. The Sinica Corpus, for example, represents the language used in Taiwan, and is therefore not representative of modern Mandarin Chinese as written on the mainland of China. Another corpus that does represent mainland Chinese (see Zhou & Yu 1997) is not publicly available.

The LCMC corpus seeks to enable in-depth monolinguistic studies by making a diverse range of text-types publicly available to academic researchers. LCMC, in combination with FLOB, also provides a sound basis for contrastive studies of Chinese and English, whether one wishes to compare the two languages as a whole or compare them by text type.

2. Sampling frame and text collection

In the LCMC corpus, the FLOB sampling frame is followed strictly except for two minor variations. The first variation relates to the sampling frame – we replaced western and adventure fiction (category N) with martial arts fiction. There are three reasons for this decision. Firstly, there is simply no western fiction in China; secondly, martial arts fiction is a type of adventure fiction that, in China especially, is both popular and important, and therefore should be represented; thirdly, the language used in martial arts fiction is a distinctive language type, and thus worthy of study in its own right. Most stories of this type, even though they were published recently, are under the influence of vernacular Chinese, i.e. modern Chinese styled to appear like classical Chinese. While the inclusion of this text type has made the tasks of POS tagging and post-editing more difficult, it may also make it possible to compare representations of vernacular Chinese and modern Chinese. The second variation relates to our decision to modify the FLOB sampling period slightly by including samples within ±2 years of 1991 when an insufficient number of texts for a given category were produced in 1991 (we assume that such a time span will not influence a language significantly). Texts produced within ±2 years of 1991 represent no more than one third of the 500 samples included in the LCMC corpus.

The LCMC corpus has been constructed using written Mandarin Chinese texts published in Mainland China to ensure some degree of textual homogeneity. It should be noted that plain written texts alone have been transcribed, with tables, illustrations, pictures, formulae and special symbols omitted and replaced with a gap element marked by the wording ‘omission’. Long citations from translated texts or texts produced outside the sampling period were also omitted so that the effect of translationese could be excluded and L1 quality guaranteed.

A small number of samples that were conformant with our sampling frame were collected from the Internet. Most samples, however, were provided by the SSReader Digital Library in China. As each page of electronic books in the library comes in PDG format, these pages were transferred into text files using an OCR program provided by the digital library. This scanning process resulted in a 1-3% error rate, depending on the quality of the picture files. Each electronic text file was proofread and corrected independently by two native speakers of Mandarin Chinese so as to keep the transcribed raw texts as accurate as possible.

While the digital library has a very large collection of books, it does not provide newspapers. The only sources of newswire texts in the library are a dozen collections of news reports that have won awards at various levels. These collections, however, represent newswire texts from more than eighty newspapers and television or broadcasting stations. The samples from these sources account for around two thirds of the texts in the press categories (A-C). The remaining third was sampled from newswire texts from Xinhua News Agency (excerpted from the PH Corpus). Considering that Xinhua is the most important and representative news provider in China, we believe that this proportion is justified.

Unlike western languages such as English, in which words are typically separated by white space and can thus be counted relatively easily, Chinese is written as running characters. Consequently, while it is easy to count the number of characters, it is not possible to count the number of words in raw texts. As the proofreading of raw electronic texts is time-consuming and expensive, it was uneconomical to proofread an excessively large sample but use only around 2,000 words. Based on a pilot study of the ratio of words to characters, we decided to adopt a ratio of 1:1.6, which means that we needed a 3,200-character running text for a 2,000-word sample. When a text was shorter than the required length, texts of similar quality were combined into one sample. For longer samples, e.g. those from books, we adopted a random procedure so that beginning, middle and ending samples were included in all categories. Once the texts had been segmented and it was possible to count exact word numbers, they were automatically cut to around 2,000 words while keeping the final sentence complete. However, while the ratio that we decided on worked for most texts, a small number of texts finally yielded slightly fewer than 2,000 words. In these cases, the whole processed text was included. While some individual samples contain fewer than 2,000 words, and some more, the total number of words for each text type roughly conforms to our sampling frame.
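The sample-size arithmetic and the final trimming step can be illustrated with a minimal Python sketch. It is not the procedure actually used to build the corpus: the function names are hypothetical, and the sketch assumes that a segmented text is available as a list of sentences, each a list of word tokens, and that trimming stops at the end of the first sentence that reaches the target.

```python
WORDS_PER_SAMPLE = 2000   # target sample size in words
CHARS_PER_WORD = 1.6      # word-to-character ratio of 1:1.6 from the pilot study


def chars_needed(target_words=WORDS_PER_SAMPLE, ratio=CHARS_PER_WORD):
    """Estimate how many running characters to proofread for one sample."""
    return round(target_words * ratio)


def trim_sample(sentences, target_words=WORDS_PER_SAMPLE):
    """Keep whole sentences until the target word count is reached.

    `sentences` is a list of sentences, each a list of word tokens
    (an assumed representation of the segmented text). The last
    sentence kept is never cut, so the result may slightly exceed
    the target; a text shorter than the target is returned in full.
    """
    kept, count = [], 0
    for sentence in sentences:
        if count >= target_words:
            break
        kept.append(sentence)
        count += len(sentence)
    return kept, count


if __name__ == "__main__":
    print(chars_needed())  # characters required for a 2,000-word sample
    toy_text = [["a"] * 3, ["b"] * 3, ["c"] * 3]
    print(trim_sample(toy_text, target_words=5))
```

Because the final sentence is kept complete, samples trimmed in this way tend to run slightly over the target, while texts that fall short of it are included whole, which is consistent with the variation in sample length described above.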

3. Encoding and markup conventions

Unlike western languages such as English, whose characters can each be encoded in a single byte, Chinese characters each occupy two bytes in legacy encodings such as GB2312 and Big5. Currently there are three main encoding systems for Chinese characters: GB2312 for simplified Chinese, Big5 for traditional Chinese, and Unicode. While the original texts were encoded in GB2312, we decided to convert them to Unicode (UTF-8) for two reasons: (1) to ensure compatibility between non-Chinese operating systems and Chinese characters; and (2) to take advantage of the latest Unicode-compatible concordancers, such as Xara version 1.0 and WordSmith Tools version 4.0.
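The conversion from GB2312 to UTF-8 described above can be sketched in a few lines of Python. This is only an illustration using Python's built-in codecs, not the tool that was used for the corpus; the function name and the two-character sample string are assumptions made for the example.

```python
def gb2312_to_utf8(raw: bytes) -> bytes:
    """Re-encode GB2312 bytes as UTF-8 (hypothetical helper)."""
    # Decode the legacy bytes to Unicode text, then encode as UTF-8.
    return raw.decode("gb2312").encode("utf-8")


if __name__ == "__main__":
    sample = "中文"                       # illustrative two-character string
    gb_bytes = sample.encode("gb2312")    # two bytes per character in GB2312
    utf8_bytes = gb2312_to_utf8(gb_bytes)
    print(len(gb_bytes), len(utf8_bytes))
    # The text survives the round trip unchanged.
    print(utf8_bytes.decode("utf-8") == sample)
```

Note that these characters take three bytes each in UTF-8 rather than two, so the converted files are larger than the GB2312 originals; the gain is portability across systems and tools.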

In order to make it more convenient for users with an operating system earlier than Windows 2000 and without a language support pack to use our data, we have produced a Pinyin version of the LCMC corpus in addition to the standard version containing characters. While also encoded using UTF-8, the Pinyin version will be more compatible with older operating and concordance systems. This is also of assistance to users who can read Romanised Chinese but not Chinese characters.

Both versions of the corpus come in fifteen files. The corpus is XML conformant. Each file has two parts: a corpus header and the text. The header gives general information about the corpus. The text part is annotated at five levels of detail: (1) text category, (2) file identifier, (3) paragraph, (4) sentence and (5) word, punctuation/symbol and elements omitted in transcription (see List of codes). These annotations make it possible to restrict searches with XML-aware tools. At present, Xara version 1.0 is aware of XML markup; with this tool, users can either search the whole corpus or define a subcorpus containing a certain text type or a specific file. The POS tags allow users to search for a certain class of words and, in combination with tokens, to extract a specific word that belongs to a certain class.
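As a rough illustration of how the five levels of annotation can be used outside a concordancer, the following Python sketch searches a tiny XML fragment for words with a given POS tag, optionally restricted to one text category. The element and attribute names (text/TYPE, file/ID, p, s, w/POS, c/POS), the fragment itself and the tag values are assumptions made for the example; the List of codes should be consulted for the names actually used in the corpus files.

```python
import xml.etree.ElementTree as ET

# Invented fragment; element names, attributes and POS values are assumed.
SAMPLE = """<text TYPE="A">
  <file ID="A01">
    <p>
      <s n="1">
        <w POS="r">我们</w>
        <w POS="v">研究</w>
        <w POS="n">语言</w>
        <c POS="ew">。</c>
      </s>
    </p>
  </file>
</text>"""


def words_with_pos(root, pos, category=None):
    """Return (file ID, word) pairs for words tagged `pos`.

    If `category` is given, only text elements of that type are searched.
    """
    hits = []
    for text in root.iter("text"):
        if category is not None and text.get("TYPE") != category:
            continue
        for file_el in text.iter("file"):
            for word in file_el.iter("w"):
                if word.get("POS") == pos:
                    hits.append((file_el.get("ID"), word.text))
    return hits


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(words_with_pos(root, "n"))                # whole fragment
    print(words_with_pos(root, "n", category="B"))  # no such category here
```

Filtering on the category or the file identifier corresponds to defining a subcorpus, and filtering on the POS attribute together with the token text corresponds to extracting a specific word that belongs to a certain class.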