1.1 Adapting speech after evidence
of misrecognition: Global and local hyperarticulation.
We conducted a wizard-of-oz simulation study in which 16
naive English speakers answered questions by speaking to what they
believed was a computer speech recognizer. All speakers received text
feedback about misrecognition at the same pre-planned points in the
dialogue which they then repaired by repeating the utterance. This
enabled us to examine local and global effects on hyperarticulation in
a controlled manner. We measured both 'clear speech' (phonetic
exaggeration) and slowed speaking rate; these two measures were found
to be reliably correlated with one another and with pausing.
Measurements were based on a corpus of over 1300 utterances
that included paired utterances (utterances before and after error
messages) that were compared in a repeated measures design. As
predicted, repair utterances were spoken more slowly and more clearly
than initial utterances; some phonetic features were exaggerated more
than others. Content words were spoken more clearly during repairs,
and function words were not. The phonetic features included /t/
tapping, word-final /t/ release, vowel reduction in the indefinite
article, and word-final /d/ release in “and”. We found that
clear speech was targeted locally to the misrecognized part of the
utterance, and that slowed speech took about 4-7 utterances to return
to normal. Monolingual speakers produced more clear speech than
bilingual (English dominant) speakers. We then processed this corpus
through 2 kinds of speech recognizers (with and without a language
model) to see which aspects of hyperarticulation are most problematic
for speech recognition. For the system without a language model, word
error rates increased with slowed speaking rate (as expected), but there
was no effect with clear speech (which may overlap with what this
system was trained on). Somewhat surprisingly, for the system with a
language model, slowed speech were associated with lower word error
rates (although this system did do more poorly on clear speech, as
expected). The report of this work has been submitted to a journal
(Stent, Huffman, & Brennan).
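For reference, the word error rates above were computed in the standard
way, as the word-level edit distance between the reference transcript
and the recognizer hypothesis, normalized by the length of the
reference. A minimal sketch of the computation (the function and the
example strings are illustrative only, not drawn from our corpus or
recognition systems):

    def word_error_rate(reference, hypothesis):
        """Standard WER: (substitutions + deletions + insertions) / reference length,
        computed as a dynamic-programming edit distance over word tokens."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                          # delete i reference words
        for j in range(len(hyp) + 1):
            d[0][j] = j                          # insert j hypothesis words
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + sub,   # match or substitution
                              d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1)         # insertion
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # e.g., word_error_rate("put the dog in the basket", "put a dog in basket")
    # returns 2/6, i.e., about 0.33 (one substitution plus one deletion
    # over 6 reference words).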
1.2 Effects of speakers’ mental models
of an addressee upon hyperarticulation.
A second study using a similar method is underway that looks
for effects of speakers' mental models of their addressees upon
hyperarticulation. The idea is that hyperarticulated speech to human
partners should differ from that to machine partners, based on
people's desire to be polite to other people and their
willingness to express their frustration to machines, as well as an
expectation that people are more intelligent conversational partners
than are machines. This study uses the same wizard-of-oz setup as the
previous one, except that the feedback messages consist of
text-to-speech rather than text on the subjects' screen. As
before, error messages are scripted. There are 3 partner conditions: 2
kinds of machine partners (a limited one whose misrecognitions
do not preserve syntactic and semantic constraints, and
a more able one whose misrecognitions are more “graceful”,
or similar to what a human partner might mishear). The third kind of
partner is construed as a person in another room who is allegedly
doing the same task via a text-to-speech device intended for
non-speaking people; this addressee produces the same feedback
messages as the more able computer partner, with some noise in the
channel to account for misrecognitions. We have completed the first
two conditions and are analyzing the hyperarticulation data now; the
third condition will be completed soon.
2. Dialect convergence in a conversational
task.
We are currently conducting an experiment to test the
claim (made widely by sociolinguists) that speakers can, or at least sometimes do,
converge in their dialects when interacting with one another.
Subjects first name a set of picture cards designed to elicit words
that contain features of interest from both consonants and
vowels that contrast in Long Island English (LI) and General American
English (GA). Then they interact by using these cards to play a
“go fish” card game with a partner who speaks a
strong version of LI. If they are judged to speak LI, they are invited
back to play the game again with another partner who speaks GA. We
log, digitize, and label all tokens of the target words, which are
then coded blind by a Linguistics graduate student for such things as
r-dropping (subjectively, by listening) and vowel height (by making
formant measurements using the PRAAT speech analysis software). Next, pairs
of matching word tokens, each pair containing one token from the subject's
session with each of the 2 partners, will be presented to blind
raters, who will judge which partner the word was spoken to. The
pairs of tokens will be sampled from the beginning, middle, and end of
the dialogue to determine whether dialect convergence, if it indeed
occurs, happens immediately, or slowly over time. We will then examine
the degree of convergence (if any) that occurs for each word,
conditional on how many tokens of that word the subject heard the
partner utter. We expect that subjects who are bi-dialectal will
switch to (converge rapidly with) the partner's dialect, and
that those who are not may show smaller changes toward the
partner's dialect in some but not all of the features. Speakers
vary a great deal in their awareness of what features constitute a
dialect (e.g., they are very aware of r-dropping and certain vowel
differences, but not other vowel differences). As a secondary study,
we plan to take a smaller sample of the word tokens uttered by
speakers who were not judged to speak LI (and were not invited back to
interact with the GA speaker) and compare tokens from early in the
task to those late in the task, to see if there is any tendency to
move toward or away from the LI partner's dialect.
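To make the formant-measurement step above concrete: F1 and F2 are
measured in PRAAT at hand-labeled points in the vowel, and the same
measurement can be scripted, for example through the praat-parselmouth
Python interface to PRAAT. The sketch below is an illustration only;
the file names, timestamps, and function are placeholders rather than
our actual analysis scripts:

    import parselmouth  # praat-parselmouth: Python interface to the PRAAT engine

    def vowel_formants(wav_path, vowel_midpoint_s):
        """Return (F1, F2) in Hz at the temporal midpoint of a hand-labeled vowel."""
        sound = parselmouth.Sound(wav_path)
        formants = sound.to_formant_burg()   # Burg-method formant tracks, PRAAT defaults
        f1 = formants.get_value_at_time(1, vowel_midpoint_s)
        f2 = formants.get_value_at_time(2, vowel_midpoint_s)
        return f1, f2

    # e.g., compare vowel height (F1) for the same target word spoken to the
    # LI partner versus the GA partner (hypothetical file names and times):
    # f1_li, _ = vowel_formants("subj01_LI_coffee.wav", 0.32)
    # f1_ga, _ = vowel_formants("subj01_GA_coffee.wav", 0.35)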
3. Prosodic disambiguation and audience
design.
Evidence has been mixed on whether speakers spontaneously and
reliably produce prosodic cues that resolve syntactic ambiguities.
And when speakers do produce such cues, it is unclear whether they do
so 'for' their addressees (the audience design hypothesis) or 'for'
themselves, as a by-product of planning and articulating
utterances. Three experiments addressed these issues. In Experiments 1
and 3, speakers followed pictorial guides to spontaneously instruct
addressees to move objects. Critical instructions (e.g., 'Put the dog
in the basket on the star') were syntactically ambiguous, and the
referential situation supported either one or both interpretations. We
found that speakers reliably produced disambiguating cues to syntactic
ambiguity whether the situation was actually ambiguous or not.
However, Experiment 2 suggested that most speakers were not yet aware
of whether the situation was ambiguous by the time they began to
speak. Experiment 3 examined individual speakers' awareness of
situational ambiguity and the extent to which they signaled structure,
with or without addressees present. Speakers tended to produce
prosodic cues to syntactic boundaries regardless of their addressees'
needs in the situation. Such cues did prove helpful to addressees, who
correctly interpreted speakers' instructions virtually all the
time. In fact, even when utterances were syntactically ambiguous in
situations that supported both interpretations, eyetracking data
showed that 40% of the time addressees did not even consider the
non-intended objects. In the journal article that reports these
studies (Kraljic & Brennan, 2005), we discuss the standards needed
for a convincing test of the audience design hypothesis. Our results
are relevant not only to basic research in human language use, but
also to the design of human-computer spoken dialogue systems. The
potential for such systems to use prosody for disambiguation depends
heavily on how reliable speakers are in actually producing prosodic
cues. Since, as we have found, prosody appears to constitute a
reliable cue to syntactic structure, prosodic disambiguation is
likely to yield significant improvements in machine parsing.
4. Perception and representation of
phonological variation.
Spoken words exhibit considerable variation from their
hypothesized
canonical forms. Much of the variation is lawful, governed by
phonotactic principles. This work examines the immediate and
long-term processing consequences for rule-governed final-/t/
variation in English. Two semantic priming experiments demonstrate
that variation does not hinder short-term semantic processing, as long
as variation is regular. Two long-term priming experiments with
different tasks show that form processing over time is not as lenient
as immediate semantic processing: Strong priming is found only for the
canonical, unchanged form of /t/. Our results suggest that
immediately, regular variation does not affect semantic processing,
but over time, surface-/t/ variation is reduced to the basic [t], and
exemplar representations for lawful variants are not stored in
long-term memory. Sumner & Samuel (2005) reports this work in
detail.
5. Adjusting to
another's speech: Perceptual and cognitive effects.
Six experiments with over 640 participants were completed by
Tanya
Kraljic for her Ph.D. dissertation, under the direction of Arthur
Samuel and Susan Brennan (with Marie Huffman as committee member).
The following description is revised from the dissertation abstract:
Different speakers may pronounce the same sounds very differently
(with native or non-native accents, or atypical variants like lisps);
and yet listeners have little difficulty perceiving speech accurately,
especially after a bit of experience with a particular speaker. Recent
research on perceptual learning suggests that listeners adjust their
preexisting phonetic categories to accommodate speakers'
pronunciations of those phonemes. But the underlying processes that
enable such learning, and the implications for linguistic
representation, are poorly understood. A series of experiments used
variations on the perceptual learning paradigm to examine these
issues. Listeners were exposed to a particular person's speech and
later tested on their perception of critical aspects of that
speech. The first three experiments investigated the nature of
perceptual learning: Is learning specific to the speakers and phonemes
heard during exposure? Does it result in temporary adaptations or
enduring representational changes? Can listeners adjust to several
speakers simultaneously? Taken together, Experiments 1-3
suggest that perceptual experience leads to very different learning
for different types of phonemes. For phonemes that vary primarily on a
spectral dimension, perceptual learning is very robust (shifts of
approximately 15%), long-lasting (the shifts are just as large 25
minutes later, even with intervening speech input), and
speaker-specific. For such phonemes, the perceptual system is able to
maintain several different representations simultaneously. In
contrast, for sounds that vary on a timing dimension, perceptual
learning is less robust (shifts of 4-5%), generalizes to new speakers
and new phonemes, and results in representations that need to be
re-adjusted whenever a new pronunciation is encountered. Experiments 1
and 2 are to be published in Cognitive Psychology (Kraljic &
Samuel, in press) and Psychonomic Bulletin & Review (Kraljic &
Samuel, accepted). Experiments 3a and 3b are being written up for
publication. In Experiments 4 and 5, the study of perceptual learning
was extended to dialectal variations in pronunciation as well as to
production. The results suggest that the perceptual system handles
dialectal variation quite differently than idiosyncratic variation:
dialectal variation does not result in perceptual learning. Further,
while listeners' productions can change to reflect speech they have
just heard, their productions do not appear to change to reflect
speech they have previously adjusted to (perceptually). This suggests
that phonemic representations may not be shared by the perceptual and
production systems. Experiments 4 and 5 are being written up for
publication.
6. Effects of
knowing two languages on lexical activation.
When bilinguals listen to speech, are words from both
languages
activated if they match some of the input? For example, if a
Spanish-English bilingual is asked to select the 'leaf' from a display
of pictures that also includes a book, is the Spanish word 'libro'
also activated because it matches the /li/? Conversely, is the
second-language (L2) word 'leaf' activated when 'libro' is in a
Spanish context? Previous studies on whether a lexical cohort includes
candidate words from both a bilingual's languages have tested
Russian-English bilinguals in the U.S., and Dutch-English bilinguals
in the Netherlands. For the former, evidence was found for L2
activation; for the latter, no activation occurred. Using an
eye-tracking measure of lexical activation over a display of physical
objects evoking a lexical cohort from both languages (as in the prior
studies), we tested this question using bilinguals whose languages are
similar (Spanish-English) or different (Mandarin-English).
Critically, we are dividing our bilinguals based on whether they
acquired English by age seven, or later, testing the effect of L2
acquisition age on language automaticity. The results from this study
are being analyzed and interpreted, and will be written up for
publication this year (Samuel & Kraljic, in progress).
7. The impact of individual
differences in working memory on language use during conversation.
This project is part of the Ph.D. thesis of Psychology
Graduate Student Calion Lockridge. A 2-phase study involving pairs of
speakers pre-tested for their working memory span is examining how two
partners flexibly shift the responsibility for a conversational task
based on each individual partner's ability. In Phase I, individual
subjects are tested using a battery of tests: the OSPAN (Operation
Span), N-Back, CVLT (California Verbal Learning Task), and IRI
(Interpersonal Reactivity Index). Using mainly the OSPAN, each subject
is categorized as “hi-span” or “lo-span”, and
then invited back to the lab to be paired with another subject for a
matching task in Phase II. Subjects are paired based on their span as
hi-hi, hi-lo, lo-hi, or lo-lo pairs, where one is assigned to play the
role of Director in a matching task (this role involves taking most of
the responsibility), and the other, the Matcher. During Phase II, each
partner individually studies and learns labels for a set of ambiguous
figures (tangrams). For 6 tangrams they learn the same label
(“Shared”), for 6 they learn a different label
(“Different”; the 2 labels have been normed for goodness
of fit using another set of subjects) and for 6 more tangrams, they
learn no label (“New”). The Director and Matcher are
then brought together and seated such that they cannot see each other;
the Director receives a target arrangement of the 18 cards and must
help the Matcher match an identical set of cards (this is repeated 5
times). For the cards for which the two partners learned Different
perspectives, we are interested in how people distribute
responsibility for the task, as evidenced in their conversational
behavior. In particular, we are most interested in whether hi-span
partners will end up taking the perspectives of lo-span partners for
the Different cards. Eighty subjects will participate in both phases
of the experiment, and data collection is expected to be complete by
the end of summer, 2005. With just 15% of the data analyzed, the
results are extremely promising; we are finding differences in
behavior and distribution of effort based on span condition (e.g.,
when the Matcher is lo-span, partners entrain on the same perspective
much sooner with a hi-span Director than with a lo-span
Director). This study will provide the first evidence that an
individual's memory span matters in their language use with another
person.
8. Effects of common ground on spontaneous
co-speech gesture.
We are conducting a study of the effects of audience design on gesture
and referring in conversation. Are the gestures and referring expressions
a speaker produces adapted “for” a particular addressee,
or do these emerge simply from the speaker's own processing? Speakers
view cartoons and then re-tell them 3 times to 2 listeners (L1 and
L2), either in the order L1-L2-L1, where the first and second tellings
are to New partners and the third is to an Old partner (one who has
heard the story before), or in the order L1-L1-L2, where the second
telling is to the Old partner and the third is to a New partner. A
large body of previous work finds that repeated tokens of words are
shortened and less distinctly pronounced compared to the initial
token. There is currently a debate in the literature as to whether
this is an automatic effect based entirely on the speaker's processing
(given information is produced less distinctly than new information)
or whether it is affected by the addressee's needs or knowledge
state—that is, if the information is Old to the speaker but New
to the addressee, is it pronounced more distinctly than if it is Old
to the speaker and Old to the addressee? Previous measures were the
lengths of repeated word tokens; we are expanding these measures by adding (1)
whether a particular narrative element is realized from one retelling
to another (elements should be left out more often to Old listeners
than to New listeners) and (2) whether gestures are modified in
retelling to be less informative, spatially smaller, more rapid, or to
represent distinctively different perspectives. About half of the data
have been collected and analysis is underway for simple measures like
word counts. Initial results are providing promising support for the
hypothesis that speakers take their listeners' needs into account and
are more informative to New listeners than to Old listeners in the
retelling of stories. We have also begun the design of the gesture
coding system that we will use for these data.
9. Building surface realizers automatically
from corpora using general-purpose tools.
We evaluated the feasibility of automatically acquiring
surface
realizers from corpora using general-purpose parsing tools and
lexicons. We designed a basic architecture for acquiring a generation
grammar, described a surface realizer that uses grammars developed in
this way, and presented a set of experiments on different corpora that
highlight possible improvements in our approach. We then went on to
conduct a series of experiments on syntactic variation using this
approach to surface realization. This project is described in Zhong
& Stent (2005), Zhong, Stent & Swift (2006), and Zhong &
Stent (in submission).
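As a rough illustration of the general idea behind acquiring a
generation grammar from parsed text (a toy sketch, not the architecture
described in the papers above): from dependency-parsed sentences one
can record, for each head lemma, the observed linear ordering of its
dependents' grammatical relations, and a simple realizer can then order
a new input's dependents according to the most frequent pattern. The
data format and relation labels below are assumptions for the sketch:

    from collections import Counter, defaultdict

    def acquire_patterns(parsed_sentences):
        """parsed_sentences: lists of (position, lemma, head_position, relation)
        tuples; the sentence's main verb carries the relation "root"."""
        patterns = defaultdict(Counter)
        for sent in parsed_sentences:
            for pos, lemma, head, rel in sent:
                if rel == "root":
                    deps = [(p, r) for p, _, h, r in sent if h == pos]
                    order = tuple(r for _, r in sorted(deps + [(pos, "HEAD")]))
                    patterns[lemma][order] += 1   # count this linearization
        return patterns

    def realize(lemma, dependents, patterns):
        """dependents: dict mapping relation -> already-realized string."""
        best_order, _ = patterns[lemma].most_common(1)[0]
        return " ".join(lemma if slot == "HEAD" else dependents[slot]
                        for slot in best_order)

    # e.g., if the most frequent acquired pattern for "put" is
    # ("subj", "HEAD", "obj", "loc"), then
    # realize("put", {"subj": "you", "obj": "the dog", "loc": "in the basket"}, patterns)
    # returns "you put the dog in the basket".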
10. Automatic evaluation of referring
expression generation using corpora.
In dialog, participants frequently converge on the same
referring expressions even if those referring expressions are
inefficient. Existing rule-based algorithms for referring expression
generation do not adequately model this adaptation. We extended two
such algorithms for referring expression generation with simple models
of partner adaptation and evaluated these algorithms automatically
using corpora of spoken dialog. The report of this work is published
in Gupta & Stent (2005).
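The two base algorithms and the adaptation models we actually used are
described in Gupta & Stent (2005); purely as an illustration of the
general idea (not the published models), the sketch below adds a
partner-adaptation bias to a Dale & Reiter-style incremental algorithm:
attributes the partner has already used for the referent are tried
first and are kept even when they rule out no distractors, modeling
convergence on over-specified expressions:

    def incremental_re(target, distractors, preferred_attrs, partner_history=()):
        """Toy incremental algorithm with a partner-adaptation bias.
        target and each distractor are dicts mapping attribute -> value;
        partner_history lists attributes the partner has used for this referent."""
        # Try partner-used attributes first, then the default preference order.
        order = ([a for a in preferred_attrs if a in partner_history] +
                 [a for a in preferred_attrs if a not in partner_history])
        description = {}
        remaining = list(distractors)
        for attr in order:
            ruled_out = [d for d in remaining if d.get(attr) != target[attr]]
            if ruled_out or attr in partner_history:
                description[attr] = target[attr]
                remaining = [d for d in remaining if d not in ruled_out]
            if not remaining:
                break
        description.setdefault("type", target["type"])  # always name the object's type
        return description

    # e.g., with target {"type": "dog", "color": "brown", "size": "small"} and one
    # distractor {"type": "dog", "color": "black", "size": "small"}:
    #   incremental_re(target, [distractor], ("type", "color", "size"))
    #     -> {"color": "brown", "type": "dog"}                    ("the brown dog")
    #   incremental_re(target, [distractor], ("type", "color", "size"),
    #                  partner_history=("size",))     # partner said "the small dog"
    #     -> {"size": "small", "color": "brown", "type": "dog"}   ("the small brown dog")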
11. The Rate-a-Course survey dialog system.
Rate-a-Course is a survey dialog system that lets
students provide information about their courses over the telephone.
It is implemented in VoiceXML. We have used this system to study
adaptation in initiative and lexical choice, and automatic inference
of user preferences. This system is described in Stent et al. (2006).
12. The RavenCalendar dialog system.
RavenCalendar is a dialog system for maintaining
a personal calendar. It uses the Olympus framework from CMU, Google
Calendar and Google Maps, and automatic event-finding technology
developed at SBU. Users can provide input through text, speech, or
clicking on a map. The system provides output through speech, text,
the Google Calendar interface, and displays on the map. We will use
this system to study adaptation over extended periods of time and to
study adaptive response generation. This system is described in
Stenchikova et al. (2007).
13. Adaptive shifts in planning
strategies: monologue vs. dialog.
Research on the scope of planning in language production has
been carried out almost exclusively on monologue—that is, speech
without an addressee. However, a number of factors may influence
planning strategies in interactive conversation, including the natural
pressure to begin speaking more quickly, and the tendency to delay
utterances to wait for feedback. A major focus of a set of studies led
by Ben Swets will be to investigate how these adaptive factors impact
speech planning in real time. Preliminary results suggest that speakers
in dialog, who give rich descriptions for their addressees, are doing
so in a very incremental fashion. Specifically, when offering enriched
descriptions, speakers in dialog take less time to articulate early
sentence regions, but nonetheless give more detailed descriptions in
the end. The implication of such a set of results is that speakers in
dialog are adept at using the resources at their disposal in order to
accommodate their addressees.
14. Audience design and working memory
capacity.
When speakers engage in dialog, part of planning a sentence
involves adjusting the utterance to the needs of addressees. Prior
research has focused on circumstances in which speakers change the
syntactic structure of a phrase for a new addressee. Another ongoing
line of research led by Ben Swets will address a new question: Can
evidence from eyetracking and speech duration analyses reveal that
these audience design considerations come at some cost in the
production process? If integrating information about addressee needs
into utterance plans is costly, we will identify when such costs arise
in the course of sentence production. These ongoing studies will also
examine the impact of such differences in working memory capacity on
speakers’ ability to (a) plan high-level sentence information in
advance and (b) implement audience
design.
15. The impact of gesture on recovery from
conversational interruptions.
The aim of this project is to examine whether gestures serve a
functional role in conversation by preventing interruptions from being
too disruptive to speakers. In association with the Gesture Focus
Group, we are planning a study that follows up Ben Swets’ dissertation
research on recovery from conversational interruptions. Research on
recovery from task interruptions has shown that giving someone a
warning that an interruption is about to occur helps the interrupted
person rehearse his or her place in a task, which aids
post-interruption recovery processes. It is possible that, in
conversation, gestures by an interrupter function as warnings that an
interruption is about to occur. These warnings may prevent disruption
to speakers once the interruption is over by allowing speakers to
strengthen their “bookkeeping” representations before giving up the
floor.
16. A cross-linguistic comparison of beat
gestures.
Research on speakers’ gesture production distinguishes between at least
two kinds of gestures: (a) gestures that represent some aspect of the
content of speech (representational gestures) and (b) motorically
simple gestures that do not represent speech content (beat gestures).
The underlying idea is that some gestures can convey some sort of
conventionalized semantic information based on the speaker’s internal
representations, whereas others are just the result of rhythmical
‘pulses’ that are realized kinetically (Tuite, 1993). A project led by
Georgios Tsertnadelis hypothesizes that gestures that rhythmically
accompany speech (i.e. beat gestures) are also based on or at least
influenced by the speaker’s internal prosodic representations of timed
phonological prominences. For example, a speaker of a language with
stress may anchor his or her beat gestures around stressed
(rhythmically prominent) syllables (i.e. a beat gesture may align
better with stressed syllables than with unstressed syllables) whereas
a speaker of a language with no stress (pitch or tone language) may
show different patterns of alignment based on some other prosodic
constituent (long syllables?). Speakers of languages with
stress may also be more likely to use beat gestures in general. The ongoing
study has speakers of a stress language (e.g. English or Greek) and a
pitch language (Korean or Japanese) tell a short joke to another native
speaker of their language while being videotaped. We then analyze the
video and audio recording looking for differential alignment of
gestures with prosodic prominence. In another condition,
bilingual speakers of English and a non-stress language (Korean) tell an
English joke first to another native speaker of English and then to
another native speaker of Korean. We will look for differential
alignment of beat gestures in the two versions of the joke. A complex
coding system will help reveal whether beat gestures in English or
Greek will align better with stressed syllables. A beat gesture should
be more likely to initiate over a stressed syllable or word than over a
non-stressed one. Speakers of non-stress languages may show alignment
to a different prominent constituent or even an overall reduction in
the frequency of beat gestures. Finally, speakers of languages with
different stress patterns may show differential alignment of
beat gestures and spoken prominence, for example in French (phrase-final
stress) or Czech (fixed word-initial stress). Beat gestures
may be parasitic on prosodic prominences in spoken language, arising
from a cognitive alignment of vocal gestures with hand gestures in a
reinforcing motoric pattern of concordance and harmonization between
mouth and hand. The opposite could also be true: spoken prosodic
prominences may have evolved parasitically on earlier rhythmic hand
gestures, which functioned as the scaffolding for an evolutionary flip
from sign language to spoken languages.
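As a sketch of the kind of alignment measure the coding system is meant
to support (the timestamps and interval format below are placeholder
assumptions; the coding scheme itself is still being designed), the
proportion of beat-gesture onsets that fall within hand-labeled
stressed-syllable intervals can be computed as follows:

    def alignment_rate(gesture_onsets, stressed_intervals):
        """Proportion of beat-gesture onset times (seconds) that fall inside a
        hand-labeled stressed-syllable interval (start_s, end_s)."""
        hits = sum(any(start <= t <= end for start, end in stressed_intervals)
                   for t in gesture_onsets)
        return hits / len(gesture_onsets) if gesture_onsets else 0.0

    # e.g., alignment_rate([0.42, 1.10, 2.35], [(0.40, 0.55), (2.30, 2.50)])
    # returns 2/3, i.e., about 0.67: two of three beat onsets land inside
    # stressed syllables.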