Looks more important than Sounds in Speech

The McGurk effect, a link in brain processing between what a person sees and what a person hears, is not new. In fact it is named after Scottish cognitive psychologist Harry McGurk, who pioneered studies on the link between hearing and vision in speech perception in the 1970s.

In a nutshell, the McGurk effect is that the sounds you hear are altered by what you see. Normally this takes the form of an otherwise clearly spoken voice message being distorted or changed when the speaker's lips move in an unexpected way, which tricks the brain into thinking a different sound was produced instead.

This has obvious implications for any interaction, not least when you are creating avatars from scratch and need to sync the lip movements to the sounds produced. Get the synchronisation wrong and the McGurk effect will ensure any spoken message using that avatar's mouth will be slightly garbled because of the visual incongruity.

However, until recently it had been assumed that the McGurk effect was simply the result of sensory channels getting crossed in the brain, and nothing deeper than that. If so, it would simply be a case of finding out which channels were affected and making the appropriate corrections to any interface, and smooth hearing would be guaranteed.

Life is not so simple as that.

See what I say

University of Utah researchers, as an offshoot of work to control and communicate with prostheses, have conducted an in-depth study of the McGurk effect. They pinpointed where in the brain the effect crops up, and it is right in the middle of the temporal cortex. That is to say, the McGurk effect actually occurs in the area of the brain that processes sound. It is not a case of crossed wires in processing sensory data; rather, visual data is integral at a fundamental level to the way we process sound.

As is the norm with this kind of experiment, volunteers were chosen who were undergoing brain surgery anyway, so their skulls would be open and an ECoG (electrocorticography) array could be placed over the surface of their brains to monitor activity. An ECoG array is basically the same as an EEG, except that because it sits under the skull, the signals it picks up are several orders of magnitude stronger and more precise.

Four volunteers were selected, two male and two female, in order to account for possible gender-based differences. It's a small sample size, but this type of experiment always has small sample sizes because of the inherent surgical risks and costs incurred in a direct brain interface of this nature. It is not the sort of interface that makes scaling up the sample size easy. However, because of the high degree of similarity between human brains, and the easy repeatability of the McGurk effect in the general population, it suffices. The goal was to uncover the mechanism, after all, not to explore any slight differences in mechanism across populations.

These four volunteers were then asked to watch videos of an individual with an overlarge handlebar moustache obscuring their upper lip. On the video, the individual would utter a single syllable clearly. This syllable would be one of four: “ba,” “va,” “ga” and “tha.”

The trick was subtle. The audio was synced to the visual, but not embedded within it. The researchers altered which sound was heard relative to the syllable the mouth on the video uttered. This triggered the standard McGurk effect, where any difference between the sight and the sound ultimately altered the sound that was perceived.

The interesting part was of course recording the brain activity whilst the McGurk effect was triggered. As often as not, the video was mismatched to the sound, and for each individual, whether it was mismatched or not, the electrical activity in their temporal cortex was recorded.
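To make the design concrete, the trial structure described above can be sketched in a few lines of Python. This is purely an illustration and not the researchers' actual protocol: the function name `make_trials`, the 50/50 split (one reading of "as often as not"), and the random pairing are all assumptions introduced here.

```python
import random

# The four syllables named in the article.
SYLLABLES = ["ba", "va", "ga", "tha"]


def make_trials(n_trials, mismatch_fraction=0.5, seed=0):
    """Build a shuffled list of audio/video pairings.

    Hypothetical helper: mismatch_fraction=0.5 is an assumed reading of
    "as often as not", not a figure taken from the study.
    """
    rng = random.Random(seed)
    n_mismatch = round(n_trials * mismatch_fraction)
    flags = [True] * n_mismatch + [False] * (n_trials - n_mismatch)
    rng.shuffle(flags)

    trials = []
    for mismatched in flags:
        video = rng.choice(SYLLABLES)
        if mismatched:
            # Play a different syllable from the one being mouthed.
            audio = rng.choice([s for s in SYLLABLES if s != video])
        else:
            audio = video
        trials.append({"video": video, "audio": audio, "mismatched": mismatched})
    return trials


trials = make_trials(20)
```

Each entry records what the mouth on screen says, what the speaker plays, and whether the two disagree, which is the comparison the recordings were built around.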

A definite pattern emerged in the signals.

When the syllable being mouthed matched the sound, everything proceeded as normal, and the correct activity pattern for that sound was observed. However, when the mouthing did not match the sound, the activity was more interesting. The auditory processing centre of the brain chose to record the visual image of the sound rather than the auditory signal, and that is what the temporal cortex reflected.

In other words, when the McGurk effect is triggered, the brain's auditory centre disregards the sound if it conflicts with a visual signal. Even to our sound processing centre, the visual signal is more important than the auditory.

It would be difficult for the implications of this to be more profound. When hearing the environment around us, what we see is literally more important to our brains than what we hear.

So, when designing avatars to sync their facial expressions and lips to speech, the viseme, or visual indicator of the sound, is far and away the most important aspect. No matter how clear the audio, it is the mistakes or lags in facial and lip animation that will most affect what the recipient hears, not what is actually said. A lagging lip sync literally changes the sound the brain gets out of it. That makes it more important than ever for lip synchronisation to be done properly.
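For anyone building an avatar pipeline, a simple sanity check is to compare when each viseme starts against when its sound starts. The sketch below is a minimal example under stated assumptions: the `TimedEvent` type, the function names, and the 40 ms tolerance are hypothetical choices made for illustration, not values from the study or from any particular animation toolkit.

```python
from dataclasses import dataclass


@dataclass
class TimedEvent:
    label: str       # syllable or viseme name, e.g. "ba"
    start_ms: float  # onset time in milliseconds


def lip_sync_offsets(sounds, visemes):
    """Pair events in order and return (label, viseme onset - sound onset)."""
    if len(sounds) != len(visemes):
        raise ValueError("sound and viseme tracks must have the same length")
    return [(s.label, v.start_ms - s.start_ms) for s, v in zip(sounds, visemes)]


def out_of_sync(sounds, visemes, tolerance_ms=40.0):
    """Return the pairs whose offset exceeds the tolerance.

    tolerance_ms=40.0 is an assumed placeholder; tune it for your own content.
    """
    return [
        (label, offset)
        for label, offset in lip_sync_offsets(sounds, visemes)
        if abs(offset) > tolerance_ms
    ]


sounds = [TimedEvent("ba", 0.0), TimedEvent("va", 400.0), TimedEvent("ga", 800.0)]
visemes = [TimedEvent("ba", 10.0), TimedEvent("va", 520.0), TimedEvent("ga", 805.0)]
late = out_of_sync(sounds, visemes)  # only "va" lags beyond the tolerance
```

A positive offset means the lips trail the audio and a negative one means they lead it; either is a candidate for the kind of mismatch described above.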


Look at What I'm Saying

Seeing Is Believing: Neural Representations of Visual Stimuli in Human Auditory Cortex Correlate with Illusory Auditory Perceptions (Paper)
