Lancaster University Department of Linguistics and Modern English Language

Text Collection: Your Own Texts

 

There are several basic ways to collect the texts for your DIY corpora:

  • Word-processed texts: save as a text-only file
  • Keyboard entry: speech transcription, students' handwritten essays, etc.
  • Scanning: copyright-protected novels, latest magazines, etc.
  • CD-ROM: newspaper, encyclopaedia, ICAME, etc.
  • Internet resources: email, chat, public documents, newspapers, magazines, etc.
  • Text archives: copyright-free (old) novels, essays, etc.
  • Copying from a large corpus: e.g. using sections of the BNC

This page covers how to convert an MS Word document into a text file (.txt) and how to save web pages as text-only files. The next page looks at how to download text materials from text archives. Page Three explains how to work on the downloaded files with WordSmith.

Converting a Word document into a text file

WordSmith and most other corpus processing tools are designed to work on plain text files (also known as ASCII files). MS Word documents have formatting information encoded alongside the text, so before we can use them in WordSmith and other corpus software we need to convert them to plain text files. Here is how to do it:

  1. Open Windows Explorer and find a Word file. Double-click on the file to open it with Word: see how it looks. Any file will do - if you don't have one to hand, open Microsoft Word, type a sentence or two, and then save it. When you have looked, close the file.
  2. Now open the same Word file using Notepad: [Start] - [Programs] - [Accessories] - [Notepad], and then drag the icon of the Word file onto the Notepad window. Note its appearance: the formatting information shows up as unreadable characters around the text.
  3. Go back to Microsoft Word, and reopen the file.
  4. Click on [File] - [Save as] and choose "text only" from the menu "Save as type". The file name should be the same as the Word file, but ending in ".txt" instead of ".doc".
  5. Now double click on the new ".txt" file to open it in Notepad and see what it looks like.

The MS Word file could not be used in WordSmith, but the new plain text file can.
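
The difference between the two files can also be checked programmatically. The following is a minimal Python sketch, not part of WordSmith or Word: the function names `looks_like_plain_text` and `check_file` are made up for illustration, and the test is deliberately strict (pure ASCII only), so a text file containing accented characters would also be rejected.

```python
from pathlib import Path


def looks_like_plain_text(data: bytes) -> bool:
    """Return True if the bytes are pure ASCII text.

    Allowed: printable ASCII characters plus tab, line feed and
    carriage return. Anything else (for example the binary formatting
    data inside a Word file) makes the check fail.
    """
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return all(ch in "\t\n\r" or " " <= ch <= "~" for ch in text)


def check_file(path: str) -> bool:
    """Read a file as raw bytes and apply the check above."""
    return looks_like_plain_text(Path(path).read_bytes())
```

Calling `check_file` on the original Word file and then on the new ".txt" file should give different answers, which mirrors what you saw in Notepad.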

 

Copying a web page as a text file

Method # 1: Simple copy and paste on the web page

  1. Go to a newspaper website. If you don't know any, try the Guardian website.
  2. Go to a recent UK news story (by clicking on a link on the main page).
  3. Click where the article starts and drag to highlight the text through to the end, then choose [Edit] - [Copy] (or use the keyboard shortcut [Ctrl] + [C]). This copies the selection to the clipboard.
  4. Open Notepad.
  5. Choose [Edit] - [Paste] (or use the keyboard shortcut [Ctrl] + [V]) to paste it. Save the file.

Method # 2: Save the entire web page as a text file

  1. Open a different article on the website.
  2. Click on [File] - [Save as...] in the browser.
  3. Choose .txt for the file type and give the file a name (use a different name!).
  4. Open the file with Notepad and see what it looks like.

Method # 3: Save the entire web page as an HTML file and clean it up in MS Word

  1. Open a different article on the website.
  2. Click on [File] - [Save as...] in the browser.
  3. Save as .htm or .html.
  4. You can open this file with MS Word and edit it any way you want. But be careful! HTML pages these days sometimes contain complicated tags which MS Word does not handle well. It is generally a lot easier to save the whole page as plain text.
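
If you have already saved a page as .htm or .html, the tags can also be stripped with a short script instead of MS Word. The following is a minimal sketch using Python's standard `html.parser` module. The names `TextExtractor` and `html_to_text` are made up for illustration, and the list of tags treated as line breaks is an assumption that will not suit every page. Menus, adverts and other page furniture will still come through as text and need removing by hand.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML page, skipping scripts and styles."""

    SKIPPED = {"script", "style"}
    # Tags assumed to start a new line in the output (illustrative list).
    BLOCK = {"title", "p", "div", "br", "li", "tr",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace on each line and drop empty lines.
        lines = (" ".join(line.split())
                 for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string as plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

To use it, read the saved page into a string, pass it to `html_to_text`, and write the result to a new ".txt" file for WordSmith.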