Multimodal Human-Computer Interaction
New Interaction Techniques 22.1.2001
Roope Raisamo (rr@cs.uta.fi)
Department of Computer and Information Sciences
University of Tampere, Finland
Multimodal human-computer
interaction
A definition [Raisamo, 1999e, p. 2]:
”Multimodal interfaces combine many
simultaneous input modalities and may
present the information using synergistic
representation of many different output
modalities”
Multimodal interaction
techniques
Our definition of an interaction technique
[Raisamo, 2000]:
• An interaction technique is a way to carry out
an interactive task. It is defined at the binding,
sequencing, and functional levels, and is
based on using a set of input and output
devices or technologies.
– In a multimodal interaction technique, more
than one input or output is used for the same
task.
Two views
• A Human-Centered View
– common in psychology
– often considers human input channels, i.e.,
computer output modalities, and most often vision
and hearing
– applications: a talking head, audio-visual speech
recognition, ...
• A System-Centered View
– common in computer science
– a way to make computer systems more adaptable
Multimodal human-computer interaction
[Diagram: the interaction information flow. Human output channels
feed the computer input modalities; computer output media feed
human input channels. Human cognition and the computer's
”cognition” close the intrinsic perception/action loop.]
Senses and modalities
Sensory perception   Sense organ            Modality
Sense of sight       Eyes                   Visual
Sense of hearing     Ears                   Auditive
Sense of touch       Skin                   Tactile
Sense of smell       Nose                   Olfactory
Sense of taste       Tongue                 Gustatory
Sense of balance     Organ of equilibrium   Vestibular
[Silbernagel, 1979]
Design space for
multimodal user interfaces
[Design space diagram: systems are classified by the use of
modalities (sequential vs. parallel) and fusion (independent vs.
combined), at the ”meaning” level of abstraction:]

Fusion        Sequential    Parallel
Independent   EXCLUSIVE     CONCURRENT
Combined      ALTERNATE     SYNERGISTIC
[Nigay and Coutaz, 1993]
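To make the classification concrete, here is a minimal Python sketch
(the four class names are Nigay and Coutaz's; the enum and function
names are illustrative assumptions, not from the paper):

    # Sketch of the Nigay-Coutaz design space at the "meaning" level.
    # EXCLUSIVE/ALTERNATE/CONCURRENT/SYNERGISTIC come from the paper;
    # everything else here is a hypothetical encoding.
    from enum import Enum

    class Use(Enum):        # temporal use of modalities
        SEQUENTIAL = 1
        PARALLEL = 2

    class Fusion(Enum):     # whether modalities combine into one command
        INDEPENDENT = 1
        COMBINED = 2

    def classify(use: Use, fusion: Fusion) -> str:
        table = {
            (Use.SEQUENTIAL, Fusion.INDEPENDENT): "EXCLUSIVE",
            (Use.SEQUENTIAL, Fusion.COMBINED):    "ALTERNATE",
            (Use.PARALLEL,   Fusion.INDEPENDENT): "CONCURRENT",
            (Use.PARALLEL,   Fusion.COMBINED):    "SYNERGISTIC",
        }
        return table[(use, fusion)]

    # Put-That-There fuses parallel speech and pointing into one command:
    print(classify(Use.PARALLEL, Fusion.COMBINED))   # SYNERGISTIC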
An architecture for
multimodal user interfaces
Adapted from [Maybury and Wahlster, 1998]:
– Input processing: motor, speech, vision, …
– Media analysis: language, recognition, gesture, …
– Interaction management: media fusion, discourse modeling, plan
recognition and generation, user modeling, presentation design
– Media design: language, modality, gesture, …
– Output generation: graphics, animation, speech, sound, …
– Application interface and modeling connect these components to
the underlying application.
See also [Nigay and Coutaz, 1993].
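The flow through these components can be sketched as a simple
pipeline (a hypothetical illustration only: the stage names follow
the slide, the bodies are made-up stubs):

    # Hypothetical sketch of the Maybury-Wahlster processing chain.
    def input_processing(raw_events):       # motor, speech, vision, ...
        return [("speech", "put that there"), ("gesture", (120, 45))]

    def media_analysis(signals):             # language/gesture recognition
        return [{"mode": m, "data": d} for m, d in signals]

    def interaction_management(parsed):      # media fusion, discourse, ...
        return {"action": "put", "object": "that", "target": "there",
                "evidence": parsed}

    def media_design(response):              # choose output modalities
        return [("speech", response["action"] + " done"),
                ("graphics", "highlight " + response["target"])]

    def output_generation(plan):             # graphics, speech, sound, ...
        for medium, content in plan:
            print(medium + ": " + content)

    output_generation(media_design(interaction_management(
        media_analysis(input_processing(None)))))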
Put-That-There: voice and gesture at the graphics interface
[Bolt, 1980]
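The core idea of Bolt-style fusion is binding deictic words ("that",
"there") to pointing data close in time. A minimal sketch; the data,
names, and 0.3 s window are illustrative assumptions, not Bolt's
implementation:

    speech = [("put", 0.0), ("that", 0.4), ("there", 1.1)]    # (word, time s)
    pointing = [((0.35, 0.62), 0.45), ((0.80, 0.20), 1.15)]   # ((x, y), time s)

    def resolve(word, t, points, window=0.3):
        """Bind a deictic word to the pointing sample nearest in time."""
        if word not in ("that", "there"):
            return word
        near = [p for p in points if abs(p[1] - t) <= window]
        if not near:
            return word                   # no gesture to anchor the deixis
        (x, y), _ = min(near, key=lambda p: abs(p[1] - t))
        return (x, y)

    command = [resolve(w, t, pointing) for w, t in speech]
    print(command)   # ['put', (0.35, 0.62), (0.8, 0.2)]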
Potential benefits
A list by Maybury and Wahlster [1998, p. 15]:
– Efficiency
– Redundancy
– Perceptibility
– Naturalness
– Accuracy
– Synergy
– Mutual disambiguation of recognition errors
[Oviatt, 1999a]
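The last benefit can be illustrated with a toy sketch (the scores,
vocabulary, and compatibility table are made-up assumptions, not
Oviatt's system): the speech recognizer's top-1 hypothesis is wrong,
yet fusion with the gesture n-best list recovers the right command.

    # Each error-prone recognizer returns an n-best list; fusion picks
    # the jointly consistent, highest-scoring combination.
    speech_nbest = [("ditch", 0.50), ("d_line", 0.45)]    # top-1 is wrong
    gesture_nbest = [("point-at-line", 0.55), ("encircle-area", 0.40)]

    # Only semantically compatible pairs can fuse into a command.
    compatible = {("d_line", "point-at-line"), ("ditch", "encircle-area")}

    best = max(((s, g, ps * pg)
                for s, ps in speech_nbest
                for g, pg in gesture_nbest
                if (s, g) in compatible),
               key=lambda x: x[2])
    print(best)   # ('d_line', 'point-at-line', ~0.25): speech's top-1 is overridden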
Common misconceptions
A list by Oviatt [1999b]:
1. If you build a multimodal system, users will interact
multimodally.
2. Speech and pointing is the dominant multimodal
integration pattern.
3. Multimodal input involves simultaneous signals (see
the sketch after this list).
4. Speech is the primary input mode in any multimodal
system that includes it.
5. Multimodal language does not differ linguistically
from unimodal language.
Common misconceptions
6. Multimodal integration involves redundancy of
content between modes.
7. Individual error-prone recognition technologies
combine multimodally to produce even greater
unreliability.
8. All users’ multimodal commands are integrated in a
uniform way.
9. Different input modes are capable of transmitting
comparable content.
10. Enhanced efficiency is the main advantage of
multimodal systems.
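The sketch promised under misconception 3 (hypothetical names and a
made-up window size): a fusion engine groups signals by a temporal
integration window instead of requiring overlap, so sequentially
delivered inputs still fuse into one multimodal command.

    WINDOW = 2.0   # seconds; made-up value

    events = [("pen", "circle-park", 0.0), ("speech", "zoom here", 1.4),
              ("speech", "quit", 9.0)]     # (mode, data, time)

    def integrate(events, window=WINDOW):
        """Group time-sorted events whose gaps fit within the window."""
        groups, current = [], []
        for mode, data, t in sorted(events, key=lambda e: e[2]):
            if current and t - current[-1][2] > window:
                groups.append(current)
                current = []
            current.append((mode, data, t))
        if current:
            groups.append(current)
        return groups

    for g in integrate(events):
        print([m + ":" + d for m, d, t in g])
    # ['pen:circle-park', 'speech:zoom here']  <- fused, though sequential
    # ['speech:quit']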
Two paradigms for
multimodal user interfaces
1. Computer as a tool
– multiple input modalities are used to enhance
direct manipulation behavior of the system
– the machine is a passive tool and tries to
understand the user through all different input
modalities that the system recognizes
– the user is always responsible for initiating the
operations
– follows the principles of direct manipulation
[Shneiderman, 1982; 1983]
Two paradigms for
multimodal user interfaces
2. Computer as a dialogue partner
– the multiple modalities are used to increase the
anthropomorphism in the user interface
– multimodal output is important: talking heads and
other human-like modalities
– speech recognition is a common input modality in
these systems
– can often be described as an agent-based
conversational user interface
Two hypotheses on
combining modalities
1. The combination of human output channels
effectively increases the bandwidth of the
human-machine channel.
This effect has been observed in many empirical
studies of multimodal human-computer interaction
[Oviatt, 1999b].
Two hypotheses on
combining modalities
2. Adding an extra output modality requires more
neurocomputational resources and leads to
deteriorated output quality, resulting in
reduced effective bandwidth.
Two types of effects are usually observed:
a slow-down of all output processes, and
interference errors, because selective attention
cannot be divided across the increased number of
output channels.
Two examples: writing while speaking, and
speaking while driving a car.
Call for research
A summary in [Raisamo, 1999e] pointed out that more
research is needed to understand the following:
– how the brain works, and which modalities can best be used
to gain the synergy advantages that multimodal interaction
makes possible
– when a multimodal system is preferable to a unimodal one
– which modalities make up the best combination for a given
interaction task
– which interaction devices to assign to these modalities in a
given computing system
– how to use these interaction devices, that is, which
interaction techniques to select or develop for a given task
Touch’n’Speak
[Raisamo, 1998]
• Touch’n’Speak is a multimodal user interface
framework that makes use of combined touch and
speech input and different output modalities
– Input: touch buttons, touch lists, touch gestures in area
selection (time, location, pressure), speech commands
– Output: graphical, textual, and auditory (non-speech) output,
speech feedback
• The framework was used to implement a restaurant
information system that provides information on
restaurants in Cambridge, MA, USA.
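As a rough illustration of the input combination (a hypothetical
sketch, not the actual Touch'n'Speak API): a touch gesture selects
an area of the map, and a subsequent speech command operates on
that selection.

    # Made-up data and class; illustrates touch-selection + speech only.
    restaurants = {"Casa Maria": (3, 4), "Blue Room": (8, 1)}

    class Session:
        def __init__(self):
            self.selection = []

        def touch_select(self, x0, y0, x1, y1):
            """Area selection drawn with a touch gesture."""
            self.selection = [name for name, (x, y) in restaurants.items()
                              if x0 <= x <= x1 and y0 <= y <= y1]

        def speak(self, command):
            """Speech command applied to the current touch selection."""
            if command == "show details":
                for name in self.selection:
                    print("Details for", name)

    s = Session()
    s.touch_select(0, 0, 5, 5)   # touch gesture around Casa Maria
    s.speak("show details")      # speech acts on the selection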
A snapshot of Touch’n’Speak
Examples
• CHI 2000 Video Proceedings: The Efficiency of
Multimodal Interaction for a Map-Based Task
(8:18)
• SIGGRAPH Video Review 76, CHI’92
Technical Video Program: Multi-Modal Natural
Dialogue (10:25)
• SIGGRAPH Video Review 77, CHI’92
Technical Video Program: Combining
Gestures and Direct Manipulation (9:56)
• CHI’99 Video Proceedings: Embodiment in
Conversational Interfaces: Rea (2:08)
Homework
• Read Chapter 2 (Multimodal interaction)
in [Raisamo, 1999e].
– [Raisamo, 1999e] is available online at
http://granum.uta.fi/pdf/951-44-4702-6.pdf
– A printable version is available online at
http://www.cs.uta.fi/~rr/interact/dissertation.pdf