Text Collection: Your Own Texts
There are several basic ways to collect the texts for your DIY corpora:
- Word-processed texts: save as a text only file
- Keyboard entry: speech transcription, students' handwritten essays, etc.
- Scanning: copyright protected novels, latest magazines, etc.
- CD-ROM: newspaper, encyclopaedia, ICAME, etc.
- Internet resources: email, chat, public documents, newspapers, magazines etc.
- Text archives: copyright free (old) novels, essays, etc.
- Copying from a large corpus: e.g. using sections of the BNC
This page covers how to convert an MS Word document into a text file (.txt) and how to save web pages
as text only files. The next page looks at how to download text materials
from text archives. Page Three explains how to work on the downloaded
files with WordSmith.
Converting a Word document into a text file
WordSmith and most other corpus processing tools are designed to work on
plain text files (also known as ASCII files). MS-Word documents have
formatting information encoded in the text, so in order to use Word documents for text processing in WordSmith and other
corpus software we should convert them to plain text files. Here is how to do it:
- Open Windows Explorer and find a Word file. Double-click on the file to open it with Word: see how it looks. Any file
will do - if you don't have one to hand, open Microsoft Word, type a sentence or two, and then save it. When you have
looked, close the file.
- Now open the same Word file using Notepad: [Start] - [Programs] - [Accessories] - Notepad, and then drag the
icon of the Word file onto the Notepad window. Note its appearance!
- Go back to Microsoft Word, and reopen the file.
- Click on [File] - [Save as] and choose "text only" from the menu "Save as type". The file
name should be the same as the Word file, but ending in ".txt" instead of ".doc".
- Now double click on the new ".txt" file to open it in Notepad and see what it looks like.
The MS Word file cannot be used in WordSmith, but the new plain text file can.
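The difference you saw in Notepad can also be checked with a short script. The following is a minimal Python sketch, not part of WordSmith or Word: the function name is_plain_ascii is invented here for illustration, and it assumes a strict definition of "plain text" (printable ASCII plus tabs and line breaks only). The second sample is the 8-byte signature that old binary .doc files begin with.

```python
def is_plain_ascii(data: bytes) -> bool:
    """Return True if every byte is printable ASCII or common whitespace."""
    allowed_whitespace = {9, 10, 13}  # tab, line feed, carriage return
    return all(32 <= b <= 126 or b in allowed_whitespace for b in data)


# A plain text file holds only readable characters and line breaks.
print(is_plain_ascii(b"A sentence or two.\r\n"))  # True

# The first bytes of an old binary .doc file are not readable text.
print(is_plain_ascii(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"))  # False
```

To test a real file, read it in binary mode, e.g. open(path, "rb").read(), and pass the result to the function. A text file containing accented or other non-ASCII characters would also fail this strict check, so treat the result as a rough indication only.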
Copying a web page as a text file
Method # 1: Simple copy and paste on the web page
- Go to a newspaper website. If you don't know any, try the Guardian website.
- Go to a recent UK news story (by clicking on a link on the main page)
- Click where the article starts and drag to highlight the text through to the end, then choose [Edit] - [Copy]
(or use the keyboard shortcut [Ctrl] + [C]). This copies the selection to the clipboard.
- Open Notepad.
- Choose [Edit] - [Paste] (or use the keyboard shortcut [Ctrl] + [V]) to paste it. Save the file.
Method # 2: Save the entire web page as a text file
- Open a different article on the website.
- Click on [File] - [Save as...] in the browser.
- Choose .txt for file type and give the file a name (use a different name!).
- Open the file with Notepad and see what it looks like.
Method # 3: Save the entire web page as an html file and clean it up in MS Word
- Open a different article on the website.
- Click on [File] - [Save as...] in the browser.
- Save as .htm or .html.
- You can open this file with MS Word and edit it any way you want. But be careful! HTML pages these days sometimes contain
complicated tags which MS Word does not handle well. It is usually a lot easier to save the whole page as
plain text.
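If you have already saved a page as .htm or .html, the tags can also be stripped with a short script instead of MS Word. The following is a minimal sketch using Python's standard html.parser module; the class name TextExtractor, the function html_to_text and the sample page are invented here for illustration. It assumes you only want the visible text and can discard scripts and style sheets.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, ignoring scripts and styles."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside scripts and styles.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text chunks of the page, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Headline</h1><p>First paragraph.</p></body></html>"
)
print(html_to_text(sample))
```

For a saved page, read the file into a string first (for example with open(path, encoding="utf-8").read(), where the encoding is an assumption about the page) and write the result to a .txt file. Note that this sketch puts each chunk of text on its own line, so a sentence containing a link or bold word will be split across lines; menus and adverts on the page will also remain and need deleting by hand.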