1.1 Adapting speech after evidence
of misrecognition: Global and local hyperarticulation.
We conducted a wizard-of-oz simulation study in which 16
naive English speakers answered questions by speaking to what they
believed was a computer speech recognizer. All speakers received text
feedback about misrecognition at the same pre-planned points in the
dialogue which they then repaired by repeating the utterance. This
enabled us to examine local and global effects on hyperarticulation in
a controlled manner. We measured both 'clear speech' (phonetic
exaggeration) and slowed speaking rate; these two measures were found
to be reliably correlated with one another and with pausing.
Measurements were based on a corpus of over 1300 utterances
that included paired utterances (utterances before and after error
messages) that were compared in a repeated measures design. As
predicted, repair utterances were spoken more slowly and more clearly
than initial utterances; some phonetic features were exaggerated more
than others. Content words were spoken more clearly during repairs,
and function words were not. The phonetic features included /t/
tapping, word-final /t/ release, vowel reduction in the indefinite
article, and word-final /d/ release in “and”. We found that
clear speech was targeted locally to the misrecognized part of the
utterance, and that slowed speech took about 4-7 utterances to return
to normal. Monolingual speakers produced more clear speech than
bilingual (English dominant) speakers. We then processed this corpus
through 2 kinds of speech recognizers (with and without a language
model) to see which aspects of hyperarticulation are most problematic
for speech recognition. For the system without a language model, word
error rates increased with slowed speaking rate (as expected), but there
was no effect with clear speech (which may overlap with what this
system was trained on). Somewhat surprisingly, for the system with a
language model, slowed speech were associated with lower word error
rates (although this system did do more poorly on clear speech, as
expected). The report of this work has been submitted to a journal
(Stent, Huffman, & Brennan).
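For reference, the word error rates above were computed in the standard
way, as the word-level edit distance between the reference transcript
and the recognizer hypothesis, normalized by the length of the
reference. A minimal sketch of the computation (the function and the
example strings are illustrative only, not drawn from our corpus or
recognition systems):

    def word_error_rate(reference, hypothesis):
        """Standard WER: (substitutions + deletions + insertions) / reference length,
        computed as a dynamic-programming edit distance over word tokens."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                          # delete i reference words
        for j in range(len(hyp) + 1):
            d[0][j] = j                          # insert j hypothesis words
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + sub,   # match or substitution
                              d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1)         # insertion
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # e.g., word_error_rate("put the dog in the basket", "put a dog in basket")
    # returns 2/6, i.e., about 0.33 (one substitution plus one deletion
    # over 6 reference words).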
1.2 Effects of speakers’ mental models
of an addressee upon hyperarticulation.
A second study using a similar method is underway that looks
for effects of speakers' mental models of their addressees upon
hyperarticulation. The idea is that hyperarticulated speech to human
partners should differ from that to machine partners, based on
people's desire to be polite to other people and their
willingness to express their frustration to machines, as well as an
expectation that people are more intelligent conversational partners
than are machines. This study uses the same wizard-of-oz setup as the
previous one, except that the feedback messages consist of
text-to-speech rather than text on the subjects' screen. As
before, error messages are scripted. There are 3 partner conditions: 2
kinds of machine partners (a limited one whose misrecognitions
do not preserve syntactic and semantic constraints, and
a more able one whose misrecognitions are more “graceful”,
or similar to what a human partner might mishear). The third kind of
partner is construed as a person in another room who is allegedly
doing the same task via a text-to-speech device intended for
non-speaking people; this addressee produces the same feedback
messages as the more able computer partner, with some noise in the
channel to account for misrecognitions. We have completed the first
two conditions and are analyzing the hyperarticulation data now; the
third condition will be completed soon.
2. Dialect convergence in a conversational
task.
We are currently conducting an experiment to test the
claim (made widely by sociolinguists) that speakers can, or at least sometimes do,
converge in their dialects when interacting with one another.
Subjects first name a set of picture cards designed to elicit words
that contain features of interest from both consonants and
vowels that contrast in Long Island English (LI) and General American
English (GA). Then they interact by using these cards to play a
“go fish” card game with a partner who speaks a
strong version of LI. If they are judged to speak LI, they are invited
back to play the game again with another partner who speaks GA. We
log, digitize, and label all tokens of the target words, which are
then coded blind by a Linguistics graduate student for such things as
r-dropping (subjectively, by listening) and vowel height (by making
formant measurements using the PRAAT speech analysis software). Next, pairs
of matching word tokens, each pair containing one token from the subject's
session with each of the 2 partners, will be presented to blind
raters, who will judge which partner the word was spoken to. The
pairs of tokens will be sampled from the beginning, middle, and end of
the dialogue to determine whether dialect convergence, if it indeed
occurs, happens immediately, or slowly over time. We will then examine
the degree of convergence (if any) that occurs for each word,
conditional on how many tokens of that word the subject heard the
partner utter. We expect that subjects who are bi-dialectal will
switch to (converge rapidly with) the partner's dialect, and
that those who are not may show smaller changes toward the
partner's dialect in some but not all of the features. Speakers
vary a great deal in their awareness of what features constitute a
dialect (e.g., they are very aware of r-dropping and certain vowel
differences, but not other vowel differences). As a secondary study,
we plan to take a smaller sample of the word tokens uttered by
speakers who were not judged to speak LI (and were not invited back to
interact with the GA speaker) and compare tokens from early in the
task to those late in the task, to see if there is any tendency to
move toward or away from the LI partner's dialect.
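To make the formant-measurement step above concrete: F1 and F2 are
measured in PRAAT at hand-labeled points in the vowel, and the same
measurement can be scripted, for example through the praat-parselmouth
Python interface to PRAAT. The sketch below is an illustration only;
the file names, timestamps, and function are placeholders rather than
our actual analysis scripts:

    import parselmouth  # praat-parselmouth: Python interface to the PRAAT engine

    def vowel_formants(wav_path, vowel_midpoint_s):
        """Return (F1, F2) in Hz at the temporal midpoint of a hand-labeled vowel."""
        sound = parselmouth.Sound(wav_path)
        formants = sound.to_formant_burg()   # Burg-method formant tracks, PRAAT defaults
        f1 = formants.get_value_at_time(1, vowel_midpoint_s)
        f2 = formants.get_value_at_time(2, vowel_midpoint_s)
        return f1, f2

    # e.g., compare vowel height (F1) for the same target word spoken to the
    # LI partner versus the GA partner (hypothetical file names and times):
    # f1_li, _ = vowel_formants("subj01_LI_coffee.wav", 0.32)
    # f1_ga, _ = vowel_formants("subj01_GA_coffee.wav", 0.35)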
3. Prosodic disambiguation and audience
design.
Evidence has been mixed on whether speakers spontaneously and
reliably produce prosodic cues that resolve syntactic ambiguities.
And when speakers do produce such cues, it is unclear whether they do
so 'for' their addressees (the audience design hypothesis) or 'for'
themselves, as a by-product of planning and articulating
utterances. Three experiments addressed these issues. In Experiments 1
and 3, speakers followed pictorial guides to spontaneously instruct
addressees to move objects. Critical instructions (e.g., 'Put the dog
in the basket on the star') were syntactically ambiguous, and the
referential situation supported either one or both interpretations. We
found that speakers reliably produced disambiguating cues to syntactic
ambiguity whether the situation was actually ambiguous or not.
However, Experiment 2 suggested that most speakers were not yet aware
of whether the situation was ambiguous by the time they began to
speak. Experiment 3 examined individual speakers' awareness of
situational ambiguity and the extent to which they signaled structure,
with or without addressees present. Speakers tended to produce
prosodic cues to syntactic boundaries regardless of their addressees'
needs in the situation. Such cues did prove helpful to addressees, who
correctly interpreted speakers' instructions virtually all the
time. In fact, even when utterances were syntactically ambiguous in
situations that supported both interpretations, eyetracking data
showed that 40% of the time addressees did not even consider the
non-intended objects. In the journal article that reports these
studies (Kraljic & Brennan, 2005), we discuss the standards needed
for a convincing test of the audience design hypothesis. Our results
are relevant not only to basic research in human language use, but
also to the design of human-computer spoken dialogue systems. The
potential for such systems to use prosody for disambiguation depends
heavily on how reliable speakers are in actually producing prosodic
cues. Since, as we have found, prosody appears to constitute a
reliable cue to syntactic structure, prosodic disambiguation is
likely to yield significant improvements in machine parsing.
4. Perception and representation of
phonological variation.
Spoken words exhibit considerable variation from their
hypothesized
canonical forms. Much of the variation is lawful, governed by
phonotactic principles. This work examines the immediate and
long-term processing consequences for rule-governed final-/t/
variation in English. Two semantic priming experiments demonstrate
that variation does not hinder short-term semantic processing, as long
as variation is regular. Two long-term priming experiments with
different tasks show that form processing over time is not as lenient
as immediate semantic processing: Strong priming is found only for the
canonical, unchanged form of /t/. Our results suggest that
immediately, regular variation does not affect semantic processing,
but over time, surface-/t/ variation is reduced to the basic [t], and
exemplar representations for lawful variants are not stored in
long-term memory. Sumner & Samuel (2005) reports this work in
detail.
5. Adjusting to
another's speech: Perceptual and cognitive effects.
Six experiments with over 640 participants were completed by
Tanya
Kraljic for her Ph.D. dissertation, under the direction of Arthur
Samuel and Susan Brennan (with Marie Huffman as committee member).
The following description is revised from the dissertation abstract:
Different speakers may pronounce the same sounds very differently
(with native or non-native accents, or atypical variants like lisps);
and yet listeners have little difficulty perceiving speech accurately,
especially after a bit of experience with a particular speaker. Recent
research on perceptual learning suggests that listeners adjust their
preexisting phonetic categories to accommodate speakers'
pronunciations of those phonemes. But the underlying processes that
enable such learning, and the implications for linguistic
representation, are poorly understood. A series of experiments used
variations on the perceptual learning paradigm to examine these
issues. Listeners were exposed to a particular person's speech and
later tested on their perception of critical aspects of that
speech. The first three experiments investigated the nature of
perceptual learning: Is learning specific to the speakers and phonemes
heard during exposure? Does it result in temporary adaptations or
enduring representational changes? Can listeners adjust to several
speakers simultaneously? Taken together, Experiments 1-3
suggest that perceptual experience leads to very different learning
for different types of phonemes. For phonemes that vary primarily on a
spectral dimension, perceptual learning is very robust (shifts of
approximately 15%), long-lasting (the shifts are just as large 25
minutes later, even with intervening speech input), and
speaker-specific. For such phonemes, the perceptual system is able to
maintain several different representations simultaneously. In
contrast, for sounds that vary on a timing dimension, perceptual
learning is less robust (shifts of 4-5%), generalizes to new speakers
and new phonemes, and results in representations that need to be
re-adjusted whenever a new pronunciation is encountered. Experiments 1
and 2 are to be published in Cognitive Psychology (Kraljic &
Samuel, in press) and Psychonomic Bulletin & Review (Kraljic &
Samuel, accepted). Experiments 3a and 3b are being written up for
publication. In Experiments 4 and 5, the study of perceptual learning
was extended to dialectal variations in pronunciation as well as to
production. The results suggest that the perceptual system handles
dialectal variation quite differently than idiosyncratic variation:
dialectal variation does not result in perceptual learning. Further,
while listeners' productions can change to reflect speech they have
just heard, their productions do not appear to change to reflect
speech they have previously adjusted to (perceptually). This suggests
that phonemic representations may not be shared by the perceptual and
production systems. Experiments 4 and 5 are being written up for
publication.
6. Effects of
knowing two languages on lexical activation.
When bilinguals listen to speech, are words from both
languages
activated if they match some of the input? For example, if a
Spanish-English bilingual is asked to select the 'leaf' from a display
of pictures that also includes a book, is the Spanish word 'libro'
also activated because it matches the /li/? Conversely, is the
second-language (L2) word 'leaf' activated when 'libro' is in a
Spanish context? Previous studies on whether a lexical cohort includes
candidate words from both a bilingual's languages have tested
Russian-English bilinguals in the U.S., and Dutch-English bilinguals
in the Netherlands. For the former, evidence was found for L2
activation; for the latter, no activation occurred. Using an
eye-tracking measure of lexical activation over a display of physical
objects evoking a lexical cohort from both languages (as in the prior
studies), we tested this question using bilinguals whose languages are
similar (Spanish-English) or different (Mandarin-English).
Critically, we are dividing our bilinguals based on whether they
acquired English by age seven, or later, testing the effect of L2
acquisition age on language automaticity. The results from this study
are being analyzed and interpreted, and will be written up for
publication this year (Samuel & Kraljic, in progress).
7. The impact of individual
differences in working memory on language use during conversation.
This project is part of the Ph.D. thesis of Psychology
Graduate Student Calion Lockridge. A 2-phase study involving pairs of
speakers pre-tested for their working memory span is examining how two
partners flexibly shift the responsibility for a conversational task
based on each individual partner's ability. In Phase I, individual
subjects are tested using a battery of tests: the OSPAN (Operation
Span), N-Back, CVLT (California Verbal Learning Task), and IRI
(Interpersonal Reactivity Index). Using mainly the OSPAN, each subject
is categorized as “hi-span” or “lo-span”, and
then invited back to the lab to be paired with another subject for a
matching task in Phase II. Subjects are paired based on their span as
hi-hi, hi-lo, lo-hi, or lo-lo pairs, where one is assigned to play the
role of Director in a matching task (this role involves taking most of
the responsibility), and the other, the Matcher. During Phase II, each
partner individually studies and learns labels for a set of ambiguous
figures (tangrams). For 6 tangrams they learn the same label
(“Shared”), for 6 they learn a different label
(“Different”; the 2 labels have been normed for goodness
of fit using another set of subjects) and for 6 more tangrams, they
learn no label (“New”). The Director and Matcher are
then brought together and seated such that they cannot see each other;
the Director receives a target arrangement of the 18 cards and must
help the Matcher match an identical set of cards (this is repeated 5
times). For the cards for which the two partners learned Different
perspectives, we are interested in how people distribute
responsibility for the task, as evidenced in their conversational
behavior. In particular, we are most interested in whether hi-span
partners will end up taking the perspectives of lo-span partners for
the Different cards. Eighty subjects will participate in both phases
of the experiment, and data collection is expected to be complete by
the end of summer, 2005. With just 15% of the data analyzed, the
results are extremely promising; we are finding differences in
behavior and distribution of effort based on span condition (e.g.,
when the Matcher is lo-span, partners entrain on the same perspective
much sooner with a hi-span Director than with a lo-span
Director). This study will provide the first evidence that an
individual's memory span matters in their language use with another
person.
8. Effects of common ground on spontaneous
co-speech gesture.
We are conducting a study of the effects of audience design on gesture
and referring in conversation. Are the gestures and referring expressions
a speaker produces adapted “for” a particular addressee,
or do these emerge simply from the speaker's own processing? Speakers
view cartoons and then re-tell them 3 times to 2 listeners (L1 and
L2), either in the order L1-L2-L1, where the first and second tellings
are to New partners and the third is to an Old partner (one who has
heard the story before), or in the order L1-L1-L2, where the second
telling is to the Old partner and the third is to a New partner. A
large body of previous work finds that repeated tokens of words are
shortened and less distinctly pronounced compared to the initial
token. There is currently a debate in the literature as to whether
this is an automatic effect based entirely on the speaker's processing
(given information is produced less distinctly than new information)
or whether it is affected by the addressee's needs or knowledge
state—that is, if the information is Old to the speaker but New
to the addressee, is it pronounced more distinctly than if it is Old
to the speaker and Old to the addressee? Previous measures were the
lengths of repeated word tokens; we are expanding these measures by adding (1)
whether a particular narrative element is realized from one retelling
to another (elements should be left out more often to Old listeners
than to New listeners) and (2) whether gestures are modified in
retelling to be less informative, spatially smaller, more rapid, or to
represent distinctively different perspectives. About half of the data
have been collected and analysis is underway for simple measures like
word counts. Initial results are providing promising support for the
hypothesis that speakers take their listeners' needs into account and
are more informative to New listeners than to Old listeners in the
retelling of stories. We have also begun the design of the gesture
coding system that we will use for these data.
9. Building surface realizers automatically
from corpora using general-purpose tools.
We evaluated the feasibility of automatically acquiring
surface
realizers from corpora using general-purpose parsing tools and
lexicons. We designed a basic architecture for acquiring a generation
grammar, described a surface realizer that uses grammars developed in
this way, and presented a set of experiments on different corpora that
highlight possible improvements in our approach. We then went on to
conduct a series of experiments on syntactic variation using this
approach to surface realization. This project is described in Zhong
& Stent (2005), Zhong, Stent & Swift (2006), and Zhong &
Stent (in submission).
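As a rough illustration of the general idea behind acquiring a
generation grammar from parsed text (a toy sketch, not the architecture
described in the papers above): from dependency-parsed sentences one
can record, for each head lemma, the observed linear ordering of its
dependents' grammatical relations, and a simple realizer can then order
a new input's dependents according to the most frequent pattern. The
data format and relation labels below are assumptions for the sketch:

    from collections import Counter, defaultdict

    def acquire_patterns(parsed_sentences):
        """parsed_sentences: lists of (position, lemma, head_position, relation)
        tuples; the sentence's main verb carries the relation "root"."""
        patterns = defaultdict(Counter)
        for sent in parsed_sentences:
            for pos, lemma, head, rel in sent:
                if rel == "root":
                    deps = [(p, r) for p, _, h, r in sent if h == pos]
                    order = tuple(r for _, r in sorted(deps + [(pos, "HEAD")]))
                    patterns[lemma][order] += 1   # count this linearization
        return patterns

    def realize(lemma, dependents, patterns):
        """dependents: dict mapping relation -> already-realized string."""
        best_order, _ = patterns[lemma].most_common(1)[0]
        return " ".join(lemma if slot == "HEAD" else dependents[slot]
                        for slot in best_order)

    # e.g., if the most frequent acquired pattern for "put" is
    # ("subj", "HEAD", "obj", "loc"), then
    # realize("put", {"subj": "you", "obj": "the dog", "loc": "in the basket"}, patterns)
    # returns "you put the dog in the basket".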
10. Automatic evaluation of referring
expression generation using corpora.
In dialog, participants frequently converge on the same
referring expressions even if those referring expressions are
inefficient. Existing rule-based algorithms for referring expression
generation do not adequately model this adaptation. We extended two
such algorithms for referring expression generation with simple models
of partner adaptation and evaluated these algorithms automatically
using corpora of spoken dialog. The report of this work is published
in Gupta & Stent (2005).
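The two base algorithms and the adaptation models we actually used are
described in Gupta & Stent (2005); purely as an illustration of the
general idea (not the published models), the sketch below adds a
partner-adaptation bias to a Dale & Reiter-style incremental algorithm:
attributes the partner has already used for the referent are tried
first and are kept even when they rule out no distractors, modeling
convergence on over-specified expressions:

    def incremental_re(target, distractors, preferred_attrs, partner_history=()):
        """Toy incremental algorithm with a partner-adaptation bias.
        target and each distractor are dicts mapping attribute -> value;
        partner_history lists attributes the partner has used for this referent."""
        # Try partner-used attributes first, then the default preference order.
        order = ([a for a in preferred_attrs if a in partner_history] +
                 [a for a in preferred_attrs if a not in partner_history])
        description = {}
        remaining = list(distractors)
        for attr in order:
            ruled_out = [d for d in remaining if d.get(attr) != target[attr]]
            if ruled_out or attr in partner_history:
                description[attr] = target[attr]
                remaining = [d for d in remaining if d not in ruled_out]
            if not remaining:
                break
        description.setdefault("type", target["type"])  # always name the object's type
        return description

    # e.g., with target {"type": "dog", "color": "brown", "size": "small"} and one
    # distractor {"type": "dog", "color": "black", "size": "small"}:
    #   incremental_re(target, [distractor], ("type", "color", "size"))
    #     -> {"color": "brown", "type": "dog"}                    ("the brown dog")
    #   incremental_re(target, [distractor], ("type", "color", "size"),
    #                  partner_history=("size",))     # partner said "the small dog"
    #     -> {"size": "small", "color": "brown", "type": "dog"}   ("the small brown dog")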
11. The Rate-a-Course survey dialog system.
Rate-a-Course is a survey dialog system that lets
students provide information about their courses over the telephone.
It is implemented in VoiceXML. We have used this system to study
adaptation in initiative and lexical choice, and automatic inference
of user preferences. This system is described in Stent et al. (2006).
12. The RavenCalendar dialog system.
RavenCalendar is a dialog system for maintaining
a personal calendar. It uses the Olympus framework from CMU, Google
Calendar and Google Maps, and automatic event-finding technology
developed at SBU. Users can provide input through text, speech, or
clicking on a map. The system provides output through speech, text,
the Google Calendar interface, and displays on the map. We will use
this system to study adaptation over extended periods of time and to
study adaptive response generation. This system is described in
Stenchikova et al. (2007).
13. Adaptive shifts in planning
strategies: monologue vs. dialog.
Research on the scope of planning in language production has
been carried out almost exclusively on monologue—that is, speech
without an addressee. However, a number of factors may influence
planning strategies in interactive conversation, including the natural
pressure to begin speaking more quickly, and the tendency to delay
utterances to wait for feedback. A major focus of a set of studies led
by Ben Swets will be to investigate how these adaptive factors impact
speech planning in real time. Preliminary results suggest that speakers
in dialog, who give rich descriptions for their addressees, are doing
so in a very incremental fashion. Specifically, when offering enriched
descriptions, speakers in dialog take less time to articulate early
sentence regions, but nonetheless give more detailed descriptions in
the end. The implication of such a set of results is that speakers in
dialog are adept at using the resources at their disposal in order to
accommodate their addressees.
14. Audience design and working memory
capacity.
When speakers engage in dialog, part of planning a sentence
involves adjusting the utterance to the needs of addressees. Prior
research has focused on circumstances in which speakers change the
syntactic structure of a phrase for a new addressee. Another ongoing
line of research led by Ben Swets will address a new question: Can
evidence from eyetracking and speech duration analyses reveal that
these audience design considerations come at some cost in the
production process? If integrating information about addressee needs
into utterance plans is costly, we will identify when such costs arise
in the course of sentence production. These ongoing studies will also
examine the impact of such differences in working memory capacity on
speakers’ ability to (a) plan high-level sentence information in
advance and (b) implement audience
design.
15. The impact of gesture on recovery from
conversational interruptions.
The aim of this project is to examine whether gestures serve a
functional role in conversation by preventing interruptions from being
too disruptive to speakers. In association with the Gesture Focus
Group, we are planning a study that follows up Ben Swets’ dissertation
research on recovery from conversational interruptions. Research on
recovery from task interruptions has shown that giving someone a
warning that an interruption is about to occur helps the interrupted
person rehearse his or her place in a task, which aids
post-interruption recovery processes. It is possible that, in
conversation, gestures by an interrupter function as warnings that an
interruption is about to occur. These warnings may prevent disruption
to speakers once the interruption is over by allowing speakers to
strengthen their “bookkeeping” representations before giving up the
floor.
16. A cross-linguistic comparison of beat
gestures.
Research on speakers’ gesture production distinguishes between at least
two kinds of gestures: (a) gestures that represent some aspect of the
content of speech (representational gestures) and (b) motorically
simple gestures that do not represent speech content (beat gestures).
The underlying idea is that some gestures can convey some sort of
conventionalized semantic information based on the speaker’s internal
representations, whereas others are just the result of rhythmical
‘pulses’ that are realized kinetically (Tuite, 1993). A project led by
Georgios Tsertnadelis hypothesizes that gestures that rhythmically
accompany speech (i.e. beat gestures) are also based on or at least
influenced by the speaker’s internal prosodic representations of timed
phonological prominences. For example, a speaker of a language with
stress may anchor his or her beat gestures around stressed
(rhythmically prominent) syllables (i.e. a beat gesture may align
better with stressed syllables than with unstressed syllables) whereas
a speaker of a language with no stress (pitch or tone language) may
show different patterns of alignment based on some other prosodic
constituent (long syllables?). Speakers of languages with
stress may also be more likely to use beat gestures in general. The ongoing
study has speakers of a stress language (e.g. English or Greek) and a
pitch language (Korean or Japanese) tell a short joke to another native
speaker of their language while being videotaped. We then analyze the
video and audio recording looking for differential alignment of
gestures with prosodic prominence. In another condition,
bilingual speakers of English and a non-stress language (Korean) tell an
English joke first to another native speaker of English and then to
another native speaker of Korean. We will look for differential
alignment of beat gestures in the two versions of the joke. A complex
coding system will help reveal whether beat gestures in English or
Greek will align better with stressed syllables. A beat gesture should
be more likely to initiate over a stressed syllable or word than over a
non-stressed one. Speakers of non-stress languages may show alignment
to a different prominent constituent or even an overall reduction in
the frequency of beat gestures. Finally, speakers of languages with
different stress patterns may show differential alignment of
beat gestures and spoken prominence, for example in French (phrase-final
stress) or Czech (fixed word-initial stress). Beat gestures
may be parasitic on prosodic prominences in spoken language, arising
from a cognitive alignment of vocal gestures with hand gestures in a
reinforcing motoric pattern of concordance and harmonization between
mouth and hand. The opposite could also be true: spoken prosodic
prominences may have evolved parasitically on earlier rhythmic hand
gestures, which functioned as the scaffolding for an evolutionary flip
from sign language to spoken languages.
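As a sketch of the kind of alignment measure the coding system is meant
to support (the timestamps and interval format below are placeholder
assumptions; the coding scheme itself is still being designed), the
proportion of beat-gesture onsets that fall within hand-labeled
stressed-syllable intervals can be computed as follows:

    def alignment_rate(gesture_onsets, stressed_intervals):
        """Proportion of beat-gesture onset times (seconds) that fall inside a
        hand-labeled stressed-syllable interval (start_s, end_s)."""
        hits = sum(any(start <= t <= end for start, end in stressed_intervals)
                   for t in gesture_onsets)
        return hits / len(gesture_onsets) if gesture_onsets else 0.0

    # e.g., alignment_rate([0.42, 1.10, 2.35], [(0.40, 0.55), (2.30, 2.50)])
    # returns 2/3, i.e., about 0.67: two of three beat onsets land inside
    # stressed syllables.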