Glossary of Useful Terms
MS Word file
A file in the format created by Microsoft Word. These files usually have names that end in the suffix .doc.
They cannot be used in corpus analysis programs until they have been converted to text
only format.
Text only file
A file containing letters and numbers but no proprietary formatting codes for things such as bold, italic, etc. Most text
only files have the ending .txt. They are also known as "Plain Text"
or "ASCII" files. Note that HTML and SGML files are essentially "text only" files: their formatting is
marked with tags in angle brackets < >, so no special character sets or proprietary codes need to be added to the file.
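As a rough illustration of what "text only" means in practice, the Python sketch below tests whether the raw bytes of a file are plain ASCII. The function name is_ascii_text is hypothetical, and the test is a simplification: it says nothing about other plain-text encodings, and it is not how any corpus analysis program actually checks its input.

```python
def is_ascii_text(data: bytes) -> bool:
    """Return True if the bytes are pure ASCII and contain no NUL bytes."""
    # NUL bytes are valid ASCII but almost never appear in real text files,
    # so treat them as a sign of a binary (proprietary) format.
    if b"\x00" in data:
        return False
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        # At least one byte is outside the 7-bit ASCII range.
        return False
    return True
```

A file could be checked by reading it in binary mode, for example open(path, "rb").read(), and passing the result to the function. Text containing tags such as <P> still passes, because the tags are themselves ordinary characters.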
SGML file
A kind of text only file that contains "mark-up tags" which show where formatting should appear in the file, such
as <P> to mark a new paragraph or <pause> to mark a pause. You can also put a lot of information in the "header" of an SGML
file, such as the date the text was created, the author, the number of words, etc. WordSmith lets you switch SGML tags on
and off (see Settings - Adjust Settings - Tags, and clear the button next to Activated). SGML files have the ending .sgm
or .sgml.
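To give a feel for what ignoring the tags involves, here is a minimal Python sketch that removes anything between angle brackets from a string. It is an assumption-laden illustration, not WordSmith's method: the name strip_tags is hypothetical, and a simple pattern like this will not cope with every SGML construct (comments or entity declarations, for example).

```python
import re

# Matches a "<", then any run of characters that are not angle brackets, then ">".
TAG_PATTERN = re.compile(r"<[^<>]*>")


def strip_tags(text: str) -> str:
    """Replace each tag with a space, then collapse runs of whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())
```

For example, strip_tags("<P>Well <pause> I think so") leaves only the words "Well I think so".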
HTML
The current language of the World Wide Web. Similar to SGML, in that tags like <P> and <TABLE> are common. The tags
are specific to displaying pages in a web browser. You cannot make up your own tags in HTML. HTML files have the ending
.htm or .html.
XML
A simplified descendant of SGML that is newer than HTML and, unlike HTML, lets you define your own tags. It is intended to be
the future language of the Internet, and is already widely used in corpus linguistics.
It is a kind of compromise between the sophistication of SGML and the simplicity of HTML.
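Because XML allows tags of your own, a corpus text can carry both header information and mark-up in one file. The Python sketch below reads such a file with the standard library module xml.etree.ElementTree. The sample document and its tag names (text, header, author, date, body, p, pause) are invented for illustration and do not follow any particular corpus standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the tag names are made up for this example.
SAMPLE = """<text>
  <header><author>A. Writer</author><date>2001-05-14</date></header>
  <body><p>Well <pause/> I think so.</p></body>
</text>"""

root = ET.fromstring(SAMPLE)

# Header information is read by following the path of tags from the root.
author = root.findtext("header/author")

# itertext() yields the text inside an element with the tags left out;
# split() and join() tidy the extra space left where <pause/> stood.
paragraphs = [" ".join("".join(p.itertext()).split()) for p in root.iter("p")]
```

Here author holds the value from the header, and paragraphs holds the words of each paragraph with the pause tag left out.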