Underlying language

“He asked, “when articulation and sound wave go their separate ways, which way does perception go?” (p.121). His answer was that perception goes with articulation.” Galantucci et al. (2006). Epigraph.

Galantucci, Fowler and Turvey (2006) made an extremely thorough and concise review of the motor theory of speech perception (MT). Summarizing 50 years of research they have considered three main statements: speech perception is special, perceiving speech is perceiving gestures, and speech perception involves the motor system. The authors dismissed the first statement and in the third do not argue about the amount of motor system’s contribution to speech processing just saying that it is involved. The second claim, which the authors retain in pretty strong a formulation, seems to be central for the MT. I will try to argue that none of the evidences provided support the exclusive role of gestures at any stage of the speech perception.

Several types of evidences are given to support the claim: “perception tracks articulation” argument, non-auditory (visual and haptic) influence on speech perception, results from choice tasks (simple vs. choice), perceptual separation and acoustic ascription. Let us take a closer look at them one by one.

“Perception tracks articulation” evidence comprises two cases which seem to transcend all the papers and textbooks arguing for the MT:/di/-/du/ and /pi/-/ka/-/pu/. Perceptually both /di/ and /du/ syllables start with the same /d/ consonant, but acoustically the onset of the syllables is pretty different. Due to the difference in the height of the vowel second formant, formant transition is high and upward in /di/, and low and downward in /du/. Still, the formant transitions were the only and sufficient cues encoding the consonant in the two artificial stimuli used in the experiment(Liberman et al. (1954)). In the /pi/-/ka/-/pu/ set the noise burst is centered at exactly the same frequency (1440 Hz), resulting in two different percepts - /p/ before the high vowels and /k/ before the low /a/ (Liberman et al., 1952). In the first case acoustics of two stimuli is different, while perceptually consonant is the same and in the second case acoustics is the same while the percepts are different. Obviously, in these two cases perception tracks the place of articulation, i.e. the gesture, and not the acoustics.

It is important to note that these examples are based on one implicit assumption. That is, that the individual phonemes are necessarily extracted in the course of the speech perception. This assumption is crucial for the argument, since if the analysis happens at the syllable level, then there are no “separate ways” of the acoustical and perceptual representations. From syllabic perspective /di/ and /du/ are just different units, as well as /pi/ and /ka/ and different acoustics they have. Syllable level analysis does not imply that subsyllabic cues are not processed, it rather means that in the course of processing individual acoustic cues are not mapped to any categorical speech representations before the syllable level. Such way of processing would inherently and automatically account for all within-syllable context effects making “articulatory deciphering” unnecessary.

Can a syllable rather than a phoneme be a unit of speech perception? This question looks ill-posed, since there might be no single “basic unit of speech perception” at all. This is suggested, among other evidence, by long and indecisive search for such a unit (Goldinger & Azuma, 2003), variable information load of different segments in different contexts which renders uniform analysis meaningless (Greenberg, 2004), and different timeframes of speech stream analysis working in parallel (Hickok & Poeppel, 2007). At least there's no reason to exclude syllable from the list of suspects. Since perception and production of speech might be connected, and especially since this link is underscored by the proponents of the MT, it is worth also looking at the “basic unit of production”. One of the most widely recognized models of production is that suggested by Levelt. In the 2001 paper he suggested that the most frequent syllables can be stored as single entities in the motor cortex. Persuasive confirmations were obtained from the reaction times of production in both normal and clinical populations (Aichert et al., 2004). Thus the implicit assumption is far from being self-evident and without it the whole “separate ways argument” fails.

To summarize:
a) “Separate ways” account, which is one of the central pro-MT arguments, is based on the assumption that segmentation down to phonemes is necessary for successful speech perception.
b) It seems that this assumption can not be taken for granted and thus the “separate ways” account might be not valid for the speech perception per se.

Second argument is the visual and haptic influence on speech perception. Authors do not specify why exactly such an influence suggests that we perceive auditory speech as gestures. Whatever their reasoning might be it is important to note that visual influence on speech perception varies heavily between different phonemes & features. Being strong for the consonants, especially the place of articulation feature, it is much weaker for the vowels (Massaro, 1998). Since the motor theory in its current formulation claims universality and does not distinguish between different classes of sounds the arguments drawn to support it should also extend to all the sounds of human languages in more or less stable manner.

Third argument is the imitation speed. Authors draw evidence from the reaction time choice studies. In these studies people had to respond to the appearance of the target in the auditory stream with either always the same response (simple task) or the response that depended on the type of current target (choice task). For example Fowler and colleagues (2003; quoted from Galantucci et al.) presented their subjects with /a/ sounds interspersed with /pa/, /ta/ or /ka/. To every CV syllable subjects had to respond either producing one of the three syllables through the whole block (simple task, /pa/ for one third of the participants, /ta/ & /ka/ for the other thirds) or repeating the very syllable they just heard (choice task). Usually, choice tasks, where the response depends on the stimulus presented, are more difficult for the subjects and the reaction times exceed those of simple tasks by 100-150 ms (Luce, 1986; quoted from Galantucci et al.). Fowler and colleagues (2003) on the contrary found only 26 ms difference between the tasks, which obviously suggests that production (motor) programs for the specific syllable are activated during perception. It does not suggest, though, that such an activation is a necessary step in speech perception and not a top-down spread of activity occurring after the phoneme identification. At this point one would want to see supporting results from aphasia, showing that damages to motor structures lead to severe problems in speech perception (this was also pointed out by Greg Hickok on TalkingBrains, there was some more discussion on the paper in that blog, backlinks are in the post referred) but the authors give none. So, again, we do not have an unequivocal proof of motor program activation necessity for the speech perception.

Lastly the authors refer to the cases of acoustic ascription and perceptual separation. Perceptual separation is a context effect, where the listener takes into account the coarticulatory effect of a neighboring sound when categorizing the phoneme. Authors give as an example the study of Mann & Repp (1980; quoted from Galantucci et al.) who found that listeners give more /s/ judgments in the /s/ to /sh/ continuum, when the consonant preceded /u/ as compared to /a/. Acoustic ascription is manifested when listeners on the basis of coarticulatory information encoded in the preceding sound “guess” the following sound and thus perceive it faster. Whalen and colleagues (1984; quoted from Galantucci et al.) have used cross-spliced CV syllables, where the vowel from one syllable was attached to the consonant of another and vice versa. In such syllables coarticulatory information is obviously misleading and indeed subjects' RT to identify vowels within those syllables was larger than within the non-spliced ones.

Both effects are possible, according to Galantucci and colleagues, only if listeners are able to separate the overlapping auditory consequences of the gestures corresponding to two neighboring phonemes. Separation of the consequences, in its turn, is possible only if the gestures themselves enter at certain point in the analysis for both coarticulated gestures affect the same acoustic features. However in a series of studies Lotto and colleagues (e.g. Lotto (2004)) have obtained context effects mimicking perceptual separation with the CV syllables preceded by non-speech tones. Existence of the similar non-speech effect does not rule out the possibility that for certain combinations of syllables perceptual separation can happen only on the basis of gestural analysis, but it suggests that the mere presence of perceptual separation does not necessarily imply the analysis & perception of gestures.
Acoustic ascription as well does not necessarily require gestural analysis. One alternative explanation could be based on the important role of syllables in the speech perception. If we happen to store the most frequent syllables as single entities, then the violation of the syllable structure disable the use of such syllabary resulting in the increasing detection RTs.

Conclusions
So far we have considered each and every argument drawn by Galantucci and colleagues to defend their main claim, that perceiving speech is perceiving gestures. None of the arguments provided the unequivocal support for the strong version of their claim - “perceiving speech is always perceiving gestures”. Especially important is the absence of the supporting data from the brain lesions and the presence of the implicit assumption that in the course of speech perception listener necessarily breaks the auditory stream into phonemes.

Absence of the brain lesion data does not allow concluding if the activation of gestural representation is an obligatory step or just a by-product for speech perception. If the “phoneme” assumption does not stand then acoustic information can be in most cases directly mapped to linguistically-relevant categorical units and gestural representations might be not needed.

Speech perception is extremely complex and thus it seems utterly surprising that in most models (MT included) simple mechanisms are offered, that are supposed to underlie perception in each and every situation independently of phoneme type, available context and general linguistic knowledge of the perceiver. Weaker formulation of the very same claim “perceiving speech might sometimes involve perceiving gestures” could be much more viable and gain support from the given examples of the visual and haptic influences and auditory context effects.

For some activities phoneme-by-phoneme parsing could be beneficial, for example for reading/writing, and possibly addition of a new word to the mental lexicon, or perception of speech in the absence of context especially if the neighborhood of the word perceived is dense. Phoneme parsing, that could be of use in these cases, could indeed employ the gestural analysis as suggested by the MT. Indirect, but clear evidence for the gestural analysis need during the phoneme parsing, but not the regular perception, was provided by Burton and colleagues (2000). The authors asked their subjects to judge whether the onset phonemes were same or different in either rhyming (dip-tip) or non-rhyming (dip-ten) pairs. Presumably, in a rhyming pair the same-different judgment can be done on a basis of general acoustic picture – any difference leads to “different” choice – while in non-rhyming ones the segmentation is necessary. In rhyming pairs the activation was found only in the STG, while inferior parietal and inferior frontal cortices were additionally active for the non-rhyming pairs. Another indication of motor cortex contribution to speech perception comes from a TMS study of Meister and colleagues (2007). They have applied repeated TMS to premotor cortex areas which were seen active in fMRI during syllable perception and production. They have demonstrated reliable decrease of the detection rate in a task where the participants by pressing respective buttons had to discriminate between /pa/, /ta/ and /ka/ syllables. The decrease, though reliable, was about 8% only and that not from 100, but from 78% baseline (syllables where presented against the white noise background). This suggests that the contribution of the motor cortex is moderate and scarcely provides an exclusive pathway for speech perception.

Summarizing in one phrase: MT in its strict formulation is wrong, but in certain cases gestural analysis might contribute to the speech processing; to delineate those cases is the task for future research.

Aichert, I. and W. Ziegler (2004). "Syllable frequency and syllable structure in apraxia of speech." Brain and Language 88(1): 148-159.