LE-EAGLES WP4-1C.1
November 97

Survey of Research on Dialogue Acts

by Martin Weisser


Types of Dialogues:

- appointments (VERBMOBIL)
- airline/ travel inquiries
- developing plans to move trains and cargo from one city to another (TRAINS: "Coding Schemes for Spoken Dialogue Structure" & Dialog Act Markup in Several Layers [DAMSL])
- furnishing rooms (COCONUT)
- cooking recipes ("Instructions for Annotating Discourses")
- giving directions (Maptask)


Types of Analysis:

- Automatic: VERBMOBIL, TRAINS-93
- Manual:
Online marking tools like Nota Bene ("Instructions for Annotating Discourses"); dat (TRAINS); CLAN (CHILDES)
No tools (Maptask ?)


Techniques for Identifying Topics

- keyword spotting + ? (VERBMOBIL)
- Manual semantic/pragmatic analysis of topic initiation. General underlying principle: initiated subject has to be brought to a conclusion (cf. Maptask). Thus beginnings and ends of "transactions" can be marked, although there does not seem to be general agreement about marking both [see further below]. In general, it seems to be assumed that the goal of the dialogue is achieved in as straightforward a manner as possible, although digression or irrelevant material can occur.


Annotation of Speech/Dialogue acts

The semantic/pragmatic annotation of Speech/Dialogue Acts according to the DAMSL model can be performed at "macro" or "micro" level. At the macro level utterances can be combined into larger units of annotation (called Segments) if they form one complete functional unit.

[However, the use of the term segment may be slightly problematic, as it may be confused with segments identified at the phonetic level. The term functional unit or functional utterance unit would thus be preferable, as it would be less likely to cause misunderstanding.]

At the micro level, individual utterances can be identified through one or a combination of Utterance Tags. The following is a summary of those tags based on the categories described in Allen/Core. March 1997. Draft of DAMSL: Dialog Act Markup in several Layers.

Utterance tags:

Summarize the intentions of the speaker and the content of the utterance. Classified into four main categories.

  1. Communicative Status - Is the utterance intelligible and successfully completed?
    - Uninterpretable: Utterance is not comprehensible
    - Abandoned: Utterance is not completed and does not contribute to the dialogue
    - Self-talk: Speaker appears to not be intending to communicate what is being said
  2. Information Level - What is the semantic content of the utterance?
    - Task ("Doing the task"): This classification depends heavily on the specification of the domain. Basically, it covers anything that involves actions on the part of the participants to advance the task itself, such as questions, suggestions, etc., as opposed to talk about the problem-solving process
    - Task-management ("Talking about the task"): Utterances that are specifically concerned with solving the problem of how to go about the task as in establishing problem solving procedures, coordination between the participants and establishing the degree of progress made so far. They serve to establish a general framework of cooperation, but do not actively constitute task-related actions.
    - Communication-management ("Maintaining the communication"): Partly purely phatic elements and partly "conventional" elements that establish or maintain communication between the participants
    - Other-level: Dummy category for anything that does not neatly fit into any of the above categories, but is still supposed to be relevant to the task (???)
  3. Forward Looking (Communicative) Function - how does the current utterance constrain the future beliefs and actions of the participants and affect the discourse? This classification may be independent of the intention of the speaker and will largely depend upon the interpretation of the coder.
    - Statement: an utterance that "makes a claim about the world", even if the claim might not be a definite one as in the case of uttering a hypothesis.
    If the speaker is trying to change the belief of the addressee, the utterance is tagged as an Assert. If the speaker seems to think that the claim has already been made before, a Reassert tag is assigned; and if the utterance is clearly interpretable as a statement but does not qualify as either of the two assert categories, the Other-statement tag is assigned.
    [Note: It seems that the Reassert tag might rather indicate a backward-looking function.]
    - Influencing-addressee-future-action: this utterance type may take the form of requests or questions and distinguishes between Open-option [maybe better: Suggestion], which does not necessarily require the other participant to perform an action, and Directive categories such as Info-request and Action-directive.
    - Committing-speaker-future-action: This category distinguishes between Offer, where the potential future action of the speaker depends on the agreement of the other participant, and Commit, which does not depend on this agreement. However, the problem with the latter category is that it includes acceptance of a previous request, which may essentially also be considered a backward-looking function.
    - Other-forward-function: Dummy category for other relatively rare functions that involve the categories Conventional-opening and Conventional-closing, Explicit-performative and Exclamation. [In general, those functions might be considered redundant, as the first three could well be encapsulated under the Communication-management category and the latter hardly seems to contribute to the dialogue at all.]
  4. Backward Looking (Communicative) Function - how does the current utterance relate to the previous discourse?
    - Agreement: a type of utterance where one participant reacts to a statement or proposal made by the other. Six options can be coded, expressing various degrees of agreement or non-agreement. The speaker may accept fully (Accept) or in part (Accept-part), express hesitation or doubt (Maybe), reject in part (Reject-part) or fully (Reject), or delay acceptance (Hold) by asking for clarification or making a counter-proposal [although the latter should probably rather be counted as a Reject].
    - Understanding: this category may possibly be better referred to as backchanneling and may in some cases overlap with the Agreement category. It embraces on the one hand signals of non-understanding (Signal-non-understanding) and on the other signals of understanding (Signal-understanding) that can be further subdivided into Acknowledge, Repeat-rephrase or Completion. Another aspect of understanding may be expressed by one participant's signalling that the other made a mistake and correcting this (Correct-misspeaking).
    - Answer: most commonly used to comply with an Info-request.
    - Information-relation: this category is supposed to express how an utterance relates to a preceding one and is not fully elaborated in the DAMSL document.
    - Antecedents: In general, an utterance may relate to more than just the immediately preceding one and may be marked as relating to all of those antecedents.
  5. General remark on the above categories:
    The two categories (3) and (4) do not seem to be ‘inherently consistent’ categories, as there seems to be a lot of overlap between them, i.e. it is sometimes rather difficult to decide whether an utterance is completely forward- or backward-looking. It might therefore be better to refer to them as "Primarily Forward-looking (Communicative) Function" and "Primarily Backward-looking (Communicative) Function".
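To make the four-dimensional structure of the tagset concrete, the following sketch models a DAMSL-style annotation record as a Python data structure. The category names follow Allen/Core (1997); the data structure itself and the sample utterance are my own illustration, not part of any DAMSL tool.

```python
# A minimal sketch of a DAMSL-style utterance tag. Category names follow
# Allen/Core (1997); the dataclass layout is an illustration only.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class UtteranceTag:
    # 1. Communicative Status (None if interpretable and complete)
    communicative_status: Optional[str] = None  # "Uninterpretable" | "Abandoned" | "Self-talk"
    # 2. Information Level
    info_level: str = "Task"  # "Task" | "Task-management" | "Communication-management" | "Other-level"
    # 3. Forward-looking functions (an utterance may carry several)
    forward: List[str] = field(default_factory=list)   # e.g. "Assert", "Info-request", "Offer"
    # 4. Backward-looking functions, with indices of antecedent utterances
    backward: List[str] = field(default_factory=list)  # e.g. "Accept", "Answer", "Acknowledge"
    antecedents: List[int] = field(default_factory=list)

# A hypothetical "yeah, let's do that" responding to a proposal (utterance 3):
# it both commits the speaker (forward) and accepts the proposal (backward).
tag = UtteranceTag(info_level="Task",
                   forward=["Commit"],
                   backward=["Accept"],
                   antecedents=[3])
print(tag.forward, tag.backward)  # ['Commit'] ['Accept']
```

Note that the forward- and backward-looking dimensions are deliberately kept as separate lists, since DAMSL allows one utterance to carry functions in both dimensions at once.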


Dagstuhl Conference

[Note: Listed here are only those points or issues that are not covered by or divergent from the new DAMSL manual.]

Five groups discussed different aspects of the annotations. Out of those five categories discussed we shall only deal with four, namely (1) "Forward-looking Communicative Function", (2) "Backward-looking Communicative Function", (3) "Segmentation" and (4) "Information level and Information Status". The topic of "Coreference" has deliberately been excluded here, as it does not exclusively belong to the domain of dialogues and may thus be dealt with in a different work package:

  1. Forward looking communicative functions
    [no particularly noteworthy differences from the DAMSL manual]
  2. Backward looking communicative functions
  3. Unresolved issues:

    - Higher level discourse structures.
    - Integration of coreference coding with forward and backward communicative function.
    - Specifying a set of informational relations (or at least some very general categories).
    - Coding of floor and topic control issues.
    - Coding of significant non-linguistic signals, e.g., refusing to respond, silence as acceptance

  4. Segmentation

Solutions proposed for segment types and rules:

Three segment types:

- (1) regular segment boundaries: marked @; corresponds to what I refer to as a functional (utterance) unit above.
- (2) weak segment boundaries: optional subunits, marked *.
- (3) drop-in segment boundaries: marked $, which serve to indicate phenomena like self-repair or hesitations.

segmentation rules:

1. segment material that serves an illocutionary function (@).
2. when in doubt whether to segment or not, don’t segment.
3. if there are strong indicators, e.g. prosodic markers like a long pause, segment (@). (note: segment only in cases that are compatible with rule 1.)
4. in collaborative completions: segment at locations of speaker change (@).
5. optional: subsegment material into smaller units using weak boundaries (*) where the resulting units serve the same illocutionary function.
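As an illustration only, the three boundary markers could be applied to a transcribed turn as follows; the sample turn and the small helper function are invented here for demonstration, not taken from any Dagstuhl material.

```python
# Illustrative sketch: a turn annotated with the three Dagstuhl boundary
# markers (@ regular, * weak, $ drop-in) and a helper that counts them.

def boundaries(marked_turn):
    """Count each boundary type in a turn annotated with @, * and $."""
    return {sym: marked_turn.count(sym) for sym in "@*$"}

# Rule 1: "@" closes each unit with its own illocutionary function;
# rule 5: "*" optionally subsegments within such a unit;
# "$" brackets drop-ins such as hesitations or self-repair.
turn = "okay @ take the engine * the big engine $ uh $ to the station @"
print(boundaries(turn))  # {'@': 2, '*': 1, '$': 2}
```

The markers here are inline characters purely for readability; an SGML representation of the same boundaries is one of the open issues listed below.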

Open issues:

- segmentation and prosody
- data representation (SGML coding)
- segmentation principles valid for all languages
- notational problems, such as whether to indicate beginnings or ends of segments, or both, possibly by indexing

  5. Information level and information status

- In general, people at the Dagstuhl conference seemed to think that a 5-way distinction as proposed in the old version of the DAMSL manual [see my comment about this above] was too much and seemed to prefer a 3-way one, either TASK, ABOUT-TASK, NON-RELEVANT or TASK, COMMUNICATION, NON-RELEVANT.
- One thing that is not referred to in the DAMSL manual is information status. The group dealing with this issue at Dagstuhl proposes four schemes to make a distinction between old and new information:

(1) retain a simple distinction between old and new
(2) add a category irrelevant
(3) subdivide old into (a) repetition (→ anaphora), (b) reformulation (≠ paraphrase) and (c) inference (→ bridging anaphora).
(4) define four categories (a) repetition, (b) reformulation, (c) inference and (d) new
[somehow the latter two seem to coincide with each other, so I'm sure there must be some further distinction???]
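As a purely illustrative sketch of scheme (4), the following fragment assigns an information status to an utterance by comparing it with the prior discourse. The word-overlap heuristics are invented here for demonstration (real coding is done by hand), and the inference category is omitted, since bridging inferences require world knowledge.

```python
# Toy classifier for scheme (4): repetition / reformulation / new.
# The 50% word-overlap threshold is an arbitrary illustrative choice.

def info_status(utterance, prior_utterances):
    """Return one of: 'repetition', 'reformulation', 'new'."""
    words = set(utterance.lower().split())
    for prior in prior_utterances:
        prior_words = set(prior.lower().split())
        if words == prior_words:
            return "repetition"      # same content, same wording (-> anaphora)
        if len(words & prior_words) / len(words) > 0.5:
            return "reformulation"   # same content, different wording
    return "new"

history = ["take the engine to Corning"]
print(info_status("take the engine to Corning", history))  # repetition
print(info_status("move the engine to Corning", history))  # reformulation
print(info_status("load the boxcars first", history))      # new
```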


Prosodic Annotation

Prosodic labelling still presents one of the major problems in labelling any kind of spoken data. One of the main problems is that it is sometimes difficult to get access to appropriate tools for the operating system platform one is working on, but even if those tools are available to the researcher/transcriber, it still remains difficult to make proper use of the prosodic information.

Simplified systems describing the use of intonation often resort to oversimplified categories, such as attributing a fall to any kind of statement and rises to either questions or non-final items in a list, but in reality it is not always that simple to associate a certain type of pitch movement with a certain type of speech act or even sentence type. A one-to-one mapping between pitch contours and speech acts is thus hardly achievable, and this will therefore sometimes present an obstacle to the automatic analysis of content.

A broad classification may sometimes aid in deciding whether to transcribe punctuation marks (if one decides to use them at all) as full stops, question marks or commas/semicolons. In general, however, a fair amount of knowledge of the language to be transcribed and of the information contained in the text to be analysed is still required, and even for well-researched languages like English not enough is yet known about how the use of certain intonational features can be related to speech acts.

At the level of segmentation, knowledge of intonational features may sometimes aid in establishing distinctions between levels of tone group boundaries and thus indicate whether an utterance may be complete or not, but even there a word of caution is necessary, as information from the speech signal may be misleading, especially with regard to non-final utterance elements that may correspond to minor tone groups. Again, this probably applies more to automatic content analysis, as pauses, for example, can usually be identified easily by the transcriber perceptually, but may not necessarily be easily identifiable in the speech signal due to phenomena such as final lengthening.

As far as coding intonational information is concerned, a proper set of (preferably mnemonic) symbols needs to be found, and most of the existing systems present only a rather feeble attempt to capture what is actually going on. The ToBI system, although widely used nowadays, does not capture pitch levels relative to the speaker’s pitch range, and the combinations of symbols used to identify certain levels can be misleading, just as any multi-character sequence representing one phenomenon can be misleading. The latter is also the problem with John Wells’ X-SAMPA recommendations, where he proposes to use certain multi-character sequences to represent diacritics and intonation and even to establish a separate tier for intonational information, which is, however, not physically separated from the rest of the segmental information. Systems like these may be easy enough for the computer to handle, but they make segmental/intonational representation more and more unreadable for the human interpreter.

The main drawback in coding segmental/intonational information is therefore still the absence of a universal character set for all platforms, which necessitates this strange kind of transliteration and can probably only be overcome if a move towards Unicode is propagated within both the speech and corpus linguistics communities.


Bibliography

Allen, J. & Core, M. 1997. "Draft of DAMSL: Dialog Act Markup in Several Layers".

Carletta/Dahlbäck/Reithinger/Walker. 1997. "Standards for Dialogue Coding in Natural Language Processing". Dagstuhl Workshop Report.

Carletta/Isard/Isard/Kowtko/Newlands/Doherty-Sneddon/Anderson. 1995. "HCRC Dialogue Structure Coding Manual". Association for Computational Linguistics.

Jekat/Klein/Maier/Maleck/Mast/Quantz. 1995. "Dialogue Acts in VERBMOBIL". VM-Report 65.

Llisterri, J. 1996. "Preliminary recommendations on Spoken Texts". EAGLES Document EAG-TCWGSPT/P. Expert Advisory Group on Language Engineering Standards.

Nakatani/Grosz/Ahn/Hirschberg. 1995. "Instructions for Annotating Discourses". Technical Report Nr. TR-21-95. Center for Research in Computing Technology, Harvard University: Cambridge, MA.

Traum, D. 1996. "Coding Schemes for Spoken Dialogue Structure". University of Geneva.

Wells, J. C. "Computer-coding the IPA: a proposed extension of SAMPA". London: UCL.