Textual versus Extra-textual Information

Textual information: the information encoded within the text, such as paragraph (or sentence) boundary markers, POS tags, etc.

Extra-textual information: additional information regarding the nature of the text, e.g. the author and the title of the text, the speaker information, the task information, etc. This information is usually encoded in the header.

Corpus file = [Header (extra-textual info)] + [text (textual info)]
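To make this layout concrete, here is a minimal Python sketch that splits the content of one corpus file into its header and its text. It assumes the header is wrapped in <head> ... </head> tags at the very top of the file; the function name split_corpus_file is only illustrative and not part of any corpus tool.

```python
def split_corpus_file(content: str) -> tuple[str, str]:
    """Split file content into (header, text).

    Assumption: the header, if present, is a single <head>...</head>
    block at the start of the file.
    """
    end_tag = "</head>"
    end = content.find(end_tag)
    if not content.lstrip().startswith("<head>") or end == -1:
        # No header found: treat the whole content as text.
        return "", content
    cut = end + len(end_tag)
    # Everything up to </head> is extra-textual; the rest is the text itself.
    return content[:cut].strip(), content[cut:].lstrip("\n")
```

Files without a header are returned unchanged as text, so the same function can be run over a mixed collection.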

Previously, we looked at simple file management using a directory structure. Often, however, you need more than one criterion to identify subcorpora. For instance, you may need to compare learner data by proficiency level as well as by task type. Directory-based file management is not suitable for this purpose, because you would have to create a separate set of subcorpora for each choice of variables.

We therefore strongly recommend using headers to distinguish different subcorpora.

Let’s practice making a short header for sample learner corpus data. First, download and have a quick look at the sample files.

Using File Manager, open the folder texts\101\llc and copy the 12 files into the C:\temp directory on your PC.

Open the files with Notepad.

Now attach header information at the beginning of each file. Using Notepad, type the following header at the top of the file "file1.txt":


<head>

<medium>written</medium>

<language>french</language>

<level>advanced</level>

</head>

Do the same for the rest of the files, changing the values as needed. Note that files 1-6 are by French speakers and files 7-12 are by Chinese speakers. Also, files 1-3 and 7-9 are at the advanced level, while files 4-6 and 10-12 are at the elementary level.
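Typing twelve headers by hand is good practice, but the same step can also be scripted. The following Python sketch is one possible way to do it, not a required part of the exercise. It assumes the files are named file1.txt through file12.txt (only file1.txt is named above) and that every file is written data; the function names build_header and add_headers are hypothetical. The header is written without blank lines between the tags.

```python
from pathlib import Path


def build_header(file_number: int) -> str:
    """Return the header for one file, following the rule given above."""
    # Files 1-6 are by French speakers, files 7-12 by Chinese speakers.
    language = "french" if file_number <= 6 else "chinese"
    # Files 1-3 and 7-9 are advanced; files 4-6 and 10-12 are elementary.
    level = "advanced" if file_number in (1, 2, 3, 7, 8, 9) else "elementary"
    return (
        "<head>\n"
        "<medium>written</medium>\n"
        f"<language>{language}</language>\n"
        f"<level>{level}</level>\n"
        "</head>\n"
    )


def add_headers(folder: str, total: int = 12) -> None:
    """Prepend a header to file1.txt ... file<total>.txt (assumed names)."""
    for n in range(1, total + 1):
        path = Path(folder) / f"file{n}.txt"
        body = path.read_text(encoding="utf-8")
        if body.lstrip().startswith("<head>"):
            continue  # header already present, do not add a second one
        path.write_text(build_header(n) + body, encoding="utf-8")
```

Calling add_headers(r"C:\temp") would process the copied files in place, so it is safer to keep a backup of the originals first.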

You have now created a number of instances of a very basic header.
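Once every file carries a header, subcorpora can be selected by any combination of criteria without rearranging directories. The Python sketch below illustrates the idea under the assumption that each header is a flat <head> block of simple <name>value</name> fields, like the one typed above; read_header and select_files are illustrative names, not functions of an existing corpus tool.

```python
import re
from pathlib import Path


def read_header(content: str) -> dict:
    """Return the header fields of one file as a dictionary."""
    head = re.search(r"<head>(.*?)</head>", content, re.DOTALL)
    if head is None:
        return {}  # file has no header
    # Each field looks like <name>value</name> on a single line.
    return dict(re.findall(r"<(\w+)>(.*?)</\1>", head.group(1)))


def select_files(folder: str, **criteria: str) -> list:
    """List the .txt files in folder whose header matches every criterion."""
    selected = []
    for path in sorted(Path(folder).glob("*.txt")):
        fields = read_header(path.read_text(encoding="utf-8"))
        if all(fields.get(key) == value for key, value in criteria.items()):
            selected.append(path.name)
    return selected
```

For example, select_files(r"C:\temp", language="french", level="advanced") would pick out the advanced French files, while select_files(r"C:\temp", level="elementary") would cut across both language groups.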