Kendall (2009) shows that speech rate correlates with region, ethnicity, gender, and age. Beyond average rates, acceleration and deceleration matter. Psychologists and musicologists link tempo not only to demographic categories but also to emotions and personality types.
An analysis of five Star Trek episodes shows how differences in tempo match differences in characters, from the highly variable tempos of a passionate Captain Kirk to the measured, burst-free speech of the emotionless Mr. Spock. The actors deploy tempo stylistically, creating emotions and personalities that audiences understand.
To calculate evenness and irregularity in tempo, I adapt measures of burstiness that network traffic engineering uses for packets traveling across the Internet: I compute the variance in the time between syllable nuclei in an utterance, then divide that variance by half the number of syllables. The bigger the ratio, the more the utterance is characterized by clusters (bursts).
Kirk’s burstiness differs significantly from that of his crew: it is at least four times greater than that of any of the others at their burstiest. Everyone’s bursts correlate with emotional “hot spots,” areas of increased involvement (Çetin and Shriberg 2006).
I demonstrate that meanings of tempo are structured by two themes: (i) arousal (“action readiness”) and (ii) ideologies about time. These emerge not just from the Star Trek data but also from building indexical fields for “fast talk” and “slow talk.” In the spirit of Eckert (2008), the fields begin with statistically significant correlations reported in the literature. I also develop a rapid survey methodology: results from 50 participants chart a constellation of ideological meanings that describe who talks fast or slow, and when.
This paper differs from most work on social meaning by focusing on a suprasegmental aspect of speech. It also draws upon psychology, anthropology, musicology, and computer science. Its use of performances separates stylistic tempo from reflexive, cognitive effects, offering insights into how tempo gets used in naturally occurring speech.
Variation in speech tempo: Capt. Kirk, Mr. Spock, and all of us in between
1. Variation in speech tempo:
Capt. Kirk, Mr. Spock, and all of us in between
Tyler Schnoebelen
Stanford University
2. Goals
• The social meanings of tempo
– Who uses it, how’s it understood?
• Indexical field construction
– New approaches to structuring indexical fields
• Between fast speech and slow speech:
burstiness
– A way to measure William Shatner (and everybody
else)
11. Social meaning
• Speech rates are not stable by demographic
category
• They vary all over the place
• Conveying and creating identities and
attitudes
20. Indexical fields
• Variables aren’t fixed but are located in a
“constellation of ideologically related
meanings” (Eckert 2008)
21. 3 steps to an indexical field
1. Statistically significant correlations with fast
speech rate in psychology and linguistics
literature
2. Corpora (“talk fast” and “are fast talkers”)
– Corpus of Contemporary American English
– Bing! search results
3. Survey
– 50 participants across the US via Mechanical Turk
24. The structure of indexical fields
• It should be possible to relate items within the field
• And this should allow us to understand constraints
• And how different meanings come to attach to
different variables
• My assumptions:
– Indexical fields expand and contract over time
– New meanings rely upon what’s already there
– “The no teleportation” hypothesis
– In principle, sadness and fast talk could come to be related
but the path is unlikely
25. Clustering
• Take the indexical field (41 items)
• Also add “fast-talkers” and “slow-talkers”
• Ask for pair-wise judgments of how much overlap
there is between each pair
– i.e., “New Yorkers” overlap with “Northerners”, but
“teachers” don’t overlap with “con-men/hustlers”
• 20 judgments for each pair (840 pairs)
• 245 Americans surveyed via Amazon Mechanical Turk
• Hierarchical clustering based on correlation patterns
(but non-hierarchical methods give similar results)
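A minimal sketch of the clustering idea on this slide, assuming average linkage and a distance of 1 minus the Pearson correlation between two items' overlap profiles. The slides do not say which linkage or software settings were used, so both choices are assumptions, and the four profiles below are invented for illustration rather than taken from the survey.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length rating profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def average_linkage(profiles):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(name,) for name in profiles]

    def distance(left, right):
        # Average of (1 - correlation) over all cross-cluster item pairs.
        pairs = [(a, b) for a in left for b in right]
        return sum(1.0 - pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=lambda pair: distance(*pair))
        merges.append((left, right, distance(left, right)))
        clusters = [c for c in clusters if c not in (left, right)] + [left + right]
    return merges


# Hypothetical z-scored overlap profiles (NOT the survey data).
profiles = {
    "New Yorkers": [2.0, 1.5, -1.0, -0.5],
    "Northerners": [1.8, 1.6, -0.8, -0.7],
    "teachers": [-0.5, -0.9, 1.9, 0.3],
    "con-men/hustlers": [-0.4, 0.2, -0.6, 2.0],
}

for left, right, dist in average_linkage(profiles):
    print(left, "+", right, f"at distance {dist:.2f}")
```

Items whose profiles rise and fall together merge early; items with opposed profiles join only near the root of the tree.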
26. Sample of the data
                    active    angry; in a rage    anxious
active               2.11         -0.373          -0.156
angry; in a rage    -0.433         2.08           -0.0549
anxious             -0.298        -0.0875          2.08
auctioneers          0.500        -1.331          -0.898
27. Predictions
• The main clusters that emerge will be
connected to two inter-related notions:
– Ideologies of time
– Emotional arousal
29. What’s this showing?
• Time and emotional arousal may well underlie
the field
• But a different axis is much more apparent:
– Speaker-orientation vs. listener-orientation
• Fast-speech is about time, but are you talking
fast for me or for yourself?
– There’s a parallel to IN’ vs. ING
• Do I take your IN’ as a sign of friendliness or as
evidence of laziness?
47. Burstiness
• Variance / (syllables * 0.5)
– Variance gets us dispersion of the data
– The denominator helps us see how spread out the
data is
– The bigger the ratio, the more it is characterized
by clusters (“bursts”)
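A minimal sketch of the ratio on this slide, assuming the input is a list of syllable-nucleus timestamps in seconds and that "variance" means the population variance of the gaps between consecutive nuclei. The slides do not say which variance estimator was used, and the two example utterances are invented.

```python
from statistics import pvariance


def burstiness(nucleus_times):
    """Variance of inter-nucleus intervals divided by half the syllable count."""
    if len(nucleus_times) < 3:
        raise ValueError("need at least three syllable nuclei")
    intervals = [later - earlier
                 for earlier, later in zip(nucleus_times, nucleus_times[1:])]
    # Assumption: population variance; the source does not name the estimator.
    return pvariance(intervals) / (0.5 * len(nucleus_times))


# Hypothetical six-syllable utterances lasting one second each.
even = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]      # steady pace
bursty = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]    # two quick clusters around a long gap

print("even:  ", burstiness(even))
print("bursty:", burstiness(bursty))
```

Both utterances have the same overall speech rate, so a rate measure cannot tell them apart; only the second scores noticeably above zero on this ratio.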
48. Burstiness and emotionality
• 48 Americans judged the emotional intensity of
228 utterances
– Utterances taken from 8 episodes, focusing on:
• Captain Kirk
• Mr. Spock
• Lt. Sulu
• Dr. (Bones) McCoy
– Each utterance judged by 3-5 people
– Scores were normalized per judge and then averaged
– Top 30, bottom 30 and 63 randomly chosen in
between were analyzed for speech rate and burstiness
– Restricted to utterances that were at least 5 syllables
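The per-judge normalization step can be sketched as follows, assuming "normalized" means converting each judge's scores to z-scores before averaging per utterance. That reading is an assumption (the slide only says "normalized"), and the judge and utterance labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_and_average(ratings):
    """ratings: list of (judge, utterance, score) -> {utterance: mean z-score}."""
    by_judge = defaultdict(list)
    for judge, _, score in ratings:
        by_judge[judge].append(score)
    stats = {judge: (mean(scores), pstdev(scores)) for judge, scores in by_judge.items()}

    z_by_utterance = defaultdict(list)
    for judge, utterance, score in ratings:
        mu, sd = stats[judge]
        # A judge who gave every utterance the same score carries no information.
        z_by_utterance[utterance].append((score - mu) / sd if sd > 0 else 0.0)
    return {utterance: mean(zs) for utterance, zs in z_by_utterance.items()}


# Hypothetical ratings: judge j2 uses a wider, higher scale than judge j1.
ratings = [
    ("j1", "u1", 1), ("j1", "u2", 3),
    ("j2", "u1", 4), ("j2", "u2", 8),
]
print(normalize_and_average(ratings))
```

Normalizing first keeps a judge who rates everything high from dominating the average.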
50. Better than speech rate
• Among factors tested:
– Burstiness
– Speech rate
– Syllable count
– Interactions among these
• Only burstiness is significant (in a simple linear
regression fit by ordinary least squares, p ≈ 0.0125)
– But note that the r-squared isn’t all that great:
0.05044
51. • A better approach is to use a mixed model,
where speaker is a random effect.
– This allows us to see that Kirk and Bones use
burstiness, while Sulu and Spock don’t.
• Kirk 0.4371045
• Bones 0.1710811
• Sulu -0.1518260
• Spock -0.4563595
Mixed model
53. Emotionality by Burstiness and Speaker
   AIC   BIC  logLik deviance REMLdev
 341.4 352.7  -166.7    336.7   333.4
Random effects:
 Groups   Name        Variance Std.Dev.
 Speaker  (Intercept) 0.21810  0.46701
 Residual             0.87055  0.93303
Number of obs: 123, groups: Speaker, 4
Fixed effects:
            Estimate Std. Error t value
(Intercept) -0.07646    0.27625 -0.2768
Burstiness   7.07226    3.07077  2.3031
> pvals.fnc(data.lmer)$fixed
            Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)  -0.0765  -0.0845    -0.8558     0.6649 0.8174   0.7824
Burstiness    7.0723   7.1217     1.0825    13.1220 0.0194   0.0230
54. Summary
• We can move beyond the “who” of variation and into
“how” and “why”
• Indexical fields are a useful conceptual tool and we can
use them to understand constraints on meaning
• It seems likely that many indexical fields are structured
by axes like self/other-orientation
– Which are made visible to listeners and appraised by them
• Rate is not the only thing that matters for emotion
– Burstiness also communicates the drama of the situation
– It is unlikely that people go to the extent that Shatner does
– But there’s reason to believe that tempo may be as
useful as simple rates, or more so
55. Thank you!
• Collins, S. (1989). Subjective and autonomic responses to Western classical music. Unpublished doctoral dissertation, University of Manchester, UK.
• Eckert, P. (2008). Variation and the indexical field. Journal of Sociolinguistics, 12(4), 453–476.
• Huson, D., D. Richter, C. Rausch, T. Dezulian, M. Franz, and R. Rupp. (2007). Dendroscope: An interactive viewer for large phylogenetic trees. BMC Bioinformatics, 8:460. Software freely available from www.dendroscope.org
• de Jong, N. H., and T. Wempe. (2009). Praat script to detect syllable nuclei and measure speech rate automatically. Behavior Research Methods, 41(2), 385.
• Juslin, P. N., and P. Laukka. (2003). Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin, 129(5), 770–814.
• Kendall, T. (2010). Language Variation and Sequential Temporal Patterns of Talk. Linguistics Department, Stanford University: Palo Alto, CA. February.
• Kendall, T. (2009). Speech Rate, Pause, and Linguistic Variation: An Examination Through the Sociolinguistic Archive and Analysis Project. Doctoral dissertation. Durham, NC: Duke University.
• Scherer, K. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40, 227–256.
• Scherer, K. R. (1981). Speech and emotional states. Speech evaluation in psychiatry, 189–220.
• Schnoebelen, T. (2009). The social meaning of tempo. http://www.stanford.edu/~tylers/notes/socioling/Social_meaning_tempo_Schnoebelen_3-23-09.pdf
• Scherer, K., and J. Oshinsky. (1977). Cue utilization in emotion attribution from auditory stimuli. Motivation and Emotion, 1, 331–346.
• Ververidis, D., and C. Kotropoulos. (2006). Emotional speech recognition: Resources, features, and methods. Speech Communication, 48(9), 1162–1181.
• Special thanks to Penny Eckert, John Rickford, Kate Geenberg, Kyuwon
Moon, Roey Gafter, and Mathew Lodge
• Also, fwiw, I’ve put together a lot of essays and reading notes about language and emotion here:
– http://www.stanford.edu/~tylers/emotions.shtml
57. Why look at TV/movies for style?
• Actors are a good source for studies of style since they make vivid the cues
that are more mixed and grey in real life. “The act that one does, the act
that one performs, is, in a sense, an act that has been going on before one
arrived on the scene” (Butler 1988: 526). Butler is talking about gender,
but this idea applies to acting as well. Actors don’t really create anything
out of whole cloth. They assemble bits and pieces. It would be difficult to
analyze the acoustic signal of wooden acting, since we’d be measuring
perceptions of lack, but even histrionic, scene-chewing samples offer us
speech cues associated with various social categories. Again, the
assumption is that actors use stylistic resources that their audiences can
be expected to understand. If audiences uniformly agree on what a
performance expresses, it doesn’t necessarily matter what the intention
was. We’re after that shared social meaning and the components that
comprise it, though we may be giving up the psychophysiological effects
on the voice that happen under natural conditions.
• Writers create scenes of dramatic interest, so that there is also a higher
proportion of arousal in a scene than in daily life.
A great place to begin thinking about speech rate—or what I’ll be calling tempo—is Tyler Kendall’s work.
Looking at 27,000+ tokens, Kendall finds significant effects for:
Region
Ethnicity
Gender
Age
Ethnicity*Gender interaction
Utterance length
On a by-speaker model:
Region
Gender
Region * Utterance length
Age
Median pause duration
Note that ethnicity drops out as a significant effect in the by-speaker model.
From Kendall’s discussion of his dissertation work at Stanford in 2010.
Moreover, speech rate is connected to how people attribute emotions and personality characteristics to their interlocutors.
Speech rate and pitch are among the most commonly studied cues to emotion in the psychology and phonetics literature. (But note that there is a tremendous amount of indeterminacy—fast speech rate is connected to anger, happiness and fear, for example.)
Top table from Scherer 2003, based on Johnstone and Scherer 2000
Bottom table from Ververidis and Kotropoulos (2006: 1171).
Musicologists are also interested in how music has the effects that it does.
Emotional states corresponding to fast tempo (Scherer and Oshinsky 1977: 340). Those in parentheses had other characteristics ranked higher than tempo. For example, "anger" is expressed mainly through many harmonics and then by fast tempo.
Scherer (1981: 206) adds confidence and indifference; Collins (1989: 45) would have us add excitement but probably wouldn’t include happiness. Fear is especially associated with a highly irregular tempo (Collins 1989: 45).
Juslin and Laukka (2003) review 104 studies of speech and 41 studies of music having to do with cues to emotion. This comes from pg 792.
Note that the vast majority of the studies of vocal emotion use ACTED speech. It is only in recent years that any naturalistic data has started to influence the emotionology literature.
One of the major points is that speakers themselves vary their speech tempo. Conveying emotion is one part of it, but speech tempo has interactional effects.
This gets us into the realm of style, where we look at the meaning a variable has in interaction, not just according to broad demographic groups.
One of the most useful concepts is that of the “indexical field”.
Here are the steps I took to come up with the indexical field of fast talk on the next slide (I did the same for slow talk, but don’t talk about that here—see my 2009 paper). Step 1 involved reviewing several dozen studies. The odder items in the indexical field (“not very powerful”) usually come from those.
The 40-odd items that you see in this figure make up the indexical field for fast talking. These attributes come from how people in several corpora talk about fast speech and from research by psychologists and others who have found statistically significant relationships between, say, fast speech and listeners’ perceptions of intelligence. The field is augmented with attributes from a pilot survey I ran with 50 people, asking what sorts of people talk fast.
Tempo doesn’t mean one thing; nor can we say that “fast tempo” means something particular. A fast tempo has multiple meanings. Depending upon what other variables it combines with, it can mean happy, New Yorker, angry, con artist, and a number of other things. Its meaning is indeterminate and requires other cues, some of which are linguistic and others of which are not. This doesn’t stop people from explicitly commenting on what “fast talkers” are like or what kinds of people talk fast.
If these characteristics were presented as a list, we might be distracted by all the contradictions: how can fast speech be about both rage and joy? How does it signal both an honest person and a con-man? Laying out the characteristics in a field, however, allows clusters to form. Some of these we sense intuitively. In America, for example, there is a cultural association between “New Yorker” and “Jewish”, which can be traced both to historical settlement patterns and current demographics, but which also works out to be an ideological judgment about shared outlook and behavioral patterns. The connections between New Yorkers and other elements are also relatively straightforward. As the stereotype goes, New Yorkers are in a hurry. Portrayals of New Yorkers often show them to be aggressive, an attribute that connects to “active” and “emphatic”, and which can be read as a version of “angry”, too.
The “no teleportation” hypothesis is that a new meaning for a variable relies on what’s already there: what a variable can come to mean is constrained by the meanings it already has. New meanings proceed from existing ones.
Actually I got judgments for both “a overlap b” and “b overlap a”, 10 of each, 1680 pairs—but in the end I collapsed these and took the averages. It’d be better to NOT do this, but I’m still trying to figure out a way to do clustering that is lopsided. Suggestions welcome.
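A minimal sketch of the collapsing step described here, assuming each direction ("a overlaps b" and "b overlaps a") has already been reduced to a single mean rating. The dictionary layout, function name, and numbers are hypothetical.

```python
def collapse_directions(directed):
    """directed: {(a, b): mean rating for 'a overlaps b'} -> unordered-pair averages."""
    grouped = {}
    for (a, b), score in directed.items():
        # frozenset makes (a, b) and (b, a) land on the same key.
        grouped.setdefault(frozenset((a, b)), []).append(score)
    return {pair: sum(scores) / len(scores) for pair, scores in grouped.items()}


# Hypothetical directed ratings; the two directions need not agree.
directed = {
    ("New Yorkers", "Northerners"): 1.2,
    ("Northerners", "New Yorkers"): 0.4,
}
collapsed = collapse_directions(directed)
print(collapsed[frozenset(("New Yorkers", "Northerners"))])
```

Averaging discards exactly the asymmetry that an asymmetric ("lopsided") clustering method would need, which is the trade-off flagged in the note above.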
Hierarchical clustering assumes that there is a true relationship underneath. This assumption is problematic and may ultimately lead us to prefer other clustering techniques. I am therefore committed not to the particular clustering technique but to clustering itself as a way of understanding structure, and ultimately as a way of understanding how meaning works in interaction and how variables change in meaning over time.
This is a sample of the table before turning it into correlations. These are z-scores. Because different participants answered slightly different questions, active-active, angry-angry, and anxious-anxious don’t have precisely the same score. But the “a overlaps with a” scores really are the highest, as we would expect and wish.
Basically, we are looking for clusters of relationships, so “active” and “auctioneers” are more alike than “anxious” and “auctioneers” in this minitable: active and auctioneers have similar positive/negative patterns across the columns (+, -, -), while “anxious” is -, -, +.
The true table is 43x43 (the 41 indexical field items plus “fast talkers” and “slow talkers”).
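To make the comparison concrete, here is a small sketch that computes Pearson correlations between rows of the four-row sample table from the slides. The z-scores are copied from that table; the choice of Pearson correlation as the similarity measure is an assumption, since the notes only speak of "correlation patterns".

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length rows of z-scores."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Rows from the sample table; columns are active, "angry; in a rage", anxious.
rows = {
    "active": [2.11, -0.373, -0.156],
    "angry; in a rage": [-0.433, 2.08, -0.0549],
    "anxious": [-0.298, -0.0875, 2.08],
    "auctioneers": [0.500, -1.331, -0.898],
}

r_active = pearson(rows["active"], rows["auctioneers"])
r_anxious = pearson(rows["anxious"], rows["auctioneers"])
print(f"active vs. auctioneers:  {r_active:.2f}")
print(f"anxious vs. auctioneers: {r_anxious:.2f}")
```

With only three columns the correlations are crude, but the ordering matches the reading above: "auctioneers" patterns with "active", not with "anxious".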
This notion of activity (also known as arousal) is crucial for us to understand. I believe it is a core part of how indexical fields for tempo get constructed. Activity is, of course, measurable in terms of heart-rate, respiration, glucose uptake, epinephrine release, etc. When we look at the indexical fields for different tempos, we see that the active emotions (rage, joy) are associated with fast talk while the passive emotions (sadness, boredom) are associated with slow talk.
This is created using R and a tool called “Dendroscope” that biologists use. I’m happy to give a brief tutorial on it—it isn’t hard to do, especially if you know R already.
The way clustering works is that ANYTHING you place in it will find a home, even if it doesn’t really belong. I could put in “people who keep whales for pets” and you’d see it show up somewhere. The important thing is where.
We expect most of these attributes to connect with fast speech tempo (since that’s the indexical field we’re building). People in this experiment had no idea what they were doing beyond saying how much overlap there was between two given terms. It would have been problematic if “slow talkers” had patterned inside any of the other clusters. It is meaningful (and reassuring) that they branch off.
My high-level labels are an attempt to describe what the clusters are about and how they seem to work in interactional evaluation. The upper left clusters seem to be very much “this fast talk is more about the speaker than me the listener.” Contrast this with the other clusters, which seem to have more of an interactional aspect foregrounded. (The overwhelming and overwhelmed clusters have significant interactional effects, too, but I am trying to tease apart how the speaker’s style gets assessed, and I think the “how much is this about me” part is relevant. Nonetheless, this is all rather tentative, as you can imagine. I think the crucial work is to build clusters for other indexical fields and see if similar types of clusters emerge.)
As I worked on fast and slow speech tempo, it occurred to me that not only do people vary speed across utterances, but within them, too.
William Shatner, of course, is known for his unusual phrasing.
A few examples. This one is among my favorites. Kirk beamed down to a planet where an old flame switched bodies with him (she was crazy from having been denied a starship captainship due to sexism; even in the future, sexism!). So Shatner is playing Janice Lester pretending to be Kirk. Mostly, this is just fast speech, but listen for some tempo adjustments toward the end.
The following clips are among the very most bursty—a term I’ll define in a moment.
Notice the wide range of emotions that use burstiness.
The second clip on this slide is actually Kevin Pollak doing an impersonation. His rates aren’t so terribly different from Shatner/Kirk at his burstiest, BUT Pollak does it in the captain’s log, which Shatner/Kirk doesn’t (since it’s not a particularly emotional genre for Kirk).
Here is what Shatner and his impersonators look like at their burstiest.
"William Shatner talks in fast bursts and moves his head a lot it's true he does", http://www.stanford.edu/~tylers/notes/socioling/Sounds/Kids_impersonation_of_Shatner_converted.wav
Versus Leonard Nimoy’s logical, no-emotions Mr. Spock
http://www.stanford.edu/~tylers/notes/socioling/Sounds/Spock_Our_involvement.wav
Packets traveling through the Internet sometimes don’t come at an even pace. They come with bursts and lags (which can mess up video/audio quality, for example). I take measurements from this domain and apply them to speech.
Do we use intonational phrases, breath units, sentences?
The results are largely the same, though I used sentences.
This obscures the part of burstiness that Kirk/Shatner gets from deleting pauses between phrases.
We can draw a parallel to network traffic engineering, which attempts to measure the burstiness of packets traveling across the Internet (such burstiness affects the quality of some applications, like voice and video). Of the several measurements I have tried, the one that seems to work best is variance/mean. That is, I input how much time there is between syllables and calculate the variance (the dispersion of the data). Then I divide the variance by the mean number of syllables in the utterance. This provides a measure of how spread out the data is; the bigger the ratio, the more it is characterized by clusters—what we’d call bursts.
Nivja de Jong and Ton Wempe’s Syllable Nuclei script was used for first pass, but basically every file had to be done by hand.
Burstiness and emotionality correlate
If we added in pitch, intensity, etc, we’d have a better r-squared, presumably.
Speech rate, syllables, interactions don’t play a role (speech rate model p=0.224)
> data.lm <- lm(Emotionality ~ Burstiness, data=data)
> summary(data.lm)
Call:
lm(formula = Emotionality ~ Burstiness, data = data)
Residuals:
    Min      1Q  Median      3Q     Max
-2.0913 -0.9090 -0.1158  1.0076  2.1043
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.13846    0.09392   1.474   0.1430
Burstiness   8.36138    3.29794   2.535   0.0125 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.007 on 121 degrees of freedom
  (102 observations deleted due to missingness)
Multiple R-squared: 0.05044, Adjusted R-squared: 0.0426
F-statistic: 6.428 on 1 and 121 DF, p-value: 0.01251
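For readers who don't use R, here is a rough sketch of what a one-predictor least-squares fit computes: the slope, the intercept, and R-squared. It is not a reimplementation of `lm` (it reports no standard errors or p-values), and the burstiness and emotionality values are invented, not the Star Trek measurements.

```python
def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical data: burstiness scores and normalized emotionality ratings.
burstiness_scores = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05]
emotionality = [-0.5, 0.4, -0.2, 0.9, 0.1, 0.6]

slope, intercept, r2 = simple_ols(burstiness_scores, emotionality)
print(f"slope={slope:.2f} intercept={intercept:.2f} R-squared={r2:.2f}")
```

A slope can be reliably different from zero while R-squared stays small: the predictor matters, but most of the variation in the ratings is left to other cues.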
> data.lmer=lmer(Emotionality ~ Burstiness + (1|Speaker), data)
> ranef(data.lmer)$Speaker
     (Intercept)
Bone   0.1710811
Kirk   0.4371045
Spoc  -0.4563595
Sulu  -0.1518260
> data.lmer=lmer(Emotionality ~ Burstiness + (1|Speaker), data)
> print(data.lmer, corr=FALSE)
Linear mixed model fit by REML
Formula: Emotionality ~ Burstiness + (1 | Speaker)
   Data: data
   AIC   BIC  logLik deviance REMLdev
 341.4 352.7  -166.7    336.7   333.4
Random effects:
 Groups   Name        Variance Std.Dev.
 Speaker  (Intercept) 0.21810  0.46701
 Residual             0.87055  0.93303
Number of obs: 123, groups: Speaker, 4
Fixed effects:
            Estimate Std. Error t value
(Intercept) -0.07646    0.27625 -0.2768
Burstiness   7.07226    3.07077  2.3031
Spock isn’t very bursty and isn’t *supposed* to be emotional, but the clips here are the most bursty and most emotional, respectively.
Note that there are episodes where Spock loses his logic (Amok Time), but I have set those to the side. It would be good to add them into the analysis.
We estimate the p-value for a mixed-effects model (in green) using MCMC.
MCMC (Markov chain Monte Carlo) sampling works this way:
Each sample contains one number for each parameter in the model. With lots of samples, we get a posterior distribution of the parameters.
From that distribution, we can estimate the p-values and confidence intervals.
This uses 10,000 samples.
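A sketch of how a set of posterior samples for one coefficient can be summarized. The two-sided tail proportion and the shortest-interval definition of the HPD interval used here are common conventions and are assumptions about what the R function reports; the samples are simulated from a normal distribution for illustration and are not the actual MCMC output.

```python
import math
import random


def two_sided_mcmc_p(samples):
    """Twice the smaller share of samples falling on one side of zero."""
    above = sum(s > 0 for s in samples)
    below = sum(s < 0 for s in samples)
    return 2 * min(above, below) / len(samples)


def hpd_interval(samples, mass=0.95):
    """Shortest interval that contains the requested share of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    k = math.ceil(mass * n)  # number of samples the interval must cover
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]


# Hypothetical stand-in for 10,000 posterior samples of a slope.
random.seed(0)
samples = [random.gauss(7.0, 3.0) for _ in range(10_000)]

print("pMCMC-style p:", two_sided_mcmc_p(samples))
print("95% HPD interval:", hpd_interval(samples))
```

If the interval excludes zero and the tail proportion is small, the samples support a non-zero effect.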
From a PCA of the indexical field pairwise overlap ratings.
A different way of visualizing the data.
Yet another visualization method from Dendroscope.
This uses SplitsTree’s NeighborNet algorithm for relating attributes. Unlike hierarchical clustering, this technique models the indeterminacy: the more webbing there is, the more different ways there are to cluster the data. Distance is also meaningful: the further apart two things are, the less they are related. This may ultimately be the best way to show the indexical fields.
I have a tutorial on how to use SplitsTree for language classification that should be relatively straightforward to adapt to this material: http://www.stanford.edu/~tylers/notes/qp/Linguistic_phylogenetics_4-23-09.pdf
Huson, D. and D. Bryant. (2006). Application of Phylogenetic Networks in Evolutionary Studies, Molecular Biology and Evolution, 23(2): 254-267. www.splitstree.org
Huson, D. SplitsTree: A program for analyzing and visualizing evolutionary data. Bioinformatics, 14(10): 68-73, 1998.