Only A Few Basic Emotional Expressions are the Root of All Others
We have known for some time that different cultures read different emotional states into the same facial expressions, even though the basic brain structure of their members is similar. Likewise, we have seen that the way humans express themselves through gesture depends on their native culture, which affects the other side of the equation: the person expressing the concept rather than the person reading it. Taking all these possibilities for miscommunication into account, even something as simple as body language or a facial expression has a fair chance of sending a message very different from the one intended, even if it was expressed perfectly accurately.
This is especially a concern for online virtual environments, where the user base is often scattered right across the world and draws from perhaps dozens of different cultures. In addition, as things stand, avatar facial expressions are usually scripted rather than based on motion capture: a sequence file that moves the muscles for each expression has to be created, then called up when needed.
Often dozens upon dozens of facial expression sequence files, or seqs, are created, moving the facial rigging into any number of mostly similar expressions to try to counter cultural bias in how an expression is read.
There might be another way.
Back in 2008, a pair of researchers from the University of Toronto created a system to try and recognize what the basic visual emotional states of a human are. That's not the emotions they have running through their brains, but the emotional states on display through their facial muscles. Many emotional states look similar, and are easily confused in a still image, so if the ones that look similar can be said to have the same base expression, how many base expressions are there across the whole facial expression range?
The work at the time was based on showing a computerized neural network video after video of humans interacting normally with one another. The system took the audio and visual data together to determine what the base facial expression was.
Ultimately they came up with six basic emotional states: happiness, sadness, anger, fear, surprise, and disgust, with an 82% chance of a successful match to one of these six.
So, if we were to create these base seqs, it would in theory be possible to modify them with a second seq file to create culture-specific expressions, based, say, on where in the world a person was logging in from.
In other words, create the basic sequences to give the basic emotional states, then add in a location-based 'tweak' to adjust the expression to the correct meaning for the receiver, with a different tweak for each receiver. After all, nothing says everyone has to receive the same visual image of another person's avatar, so why do they have to receive the same visual image of that avatar's facial expression?
It would greatly simplify the choice of sequences for a person to use when trying to visually convey a message, and it would try to cross cultural barriers by giving each recipient a variation on the expression they would be most familiar with based on their own geographical area.
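As a rough illustration of the base-plus-tweak idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: a seq is modelled as a simple mapping of muscle name to activation level, and the muscle names, region names, and numbers are hypothetical placeholders rather than measured data or any real avatar format.

```python
from typing import Dict

# A seq is modelled here as muscle name -> activation level in [0.0, 1.0].
# This is an assumed simplification, not a real sequence file format.
Seq = Dict[str, float]

# Hypothetical base seqs for two of the basic states (illustrative values).
BASE_SEQS: Dict[str, Seq] = {
    "happiness": {"lip_corner_puller": 0.75, "cheek_raiser": 0.5},
    "surprise": {"brow_raiser": 0.75, "jaw_drop": 0.5},
}

# Hypothetical per-region tweaks, stored as deltas applied on top of a base seq.
REGION_TWEAKS: Dict[str, Dict[str, Seq]] = {
    "region_a": {"happiness": {"cheek_raiser": 0.25}},
    "region_b": {"happiness": {"lip_corner_puller": -0.25, "cheek_raiser": 0.75}},
}


def expression_for_viewer(emotion: str, viewer_region: str) -> Seq:
    """Return the base seq for `emotion`, adjusted for the viewer's region."""
    result = dict(BASE_SEQS[emotion])  # copy so the base seq is never mutated
    tweak = REGION_TWEAKS.get(viewer_region, {}).get(emotion, {})
    for muscle, delta in tweak.items():
        # Apply the delta and clamp to the valid activation range.
        result[muscle] = min(1.0, max(0.0, result.get(muscle, 0.0) + delta))
    return result
```

Each viewer's client would call `expression_for_viewer` with its own region, so two people watching the same avatar could be shown slightly different renderings of the same basic state. A viewer from a region with no tweak defined simply gets the unmodified base seq.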
With far fewer seqs to choose from, it would also be much easier to sync the choice of facial expression seq to a user's voice input (or virtual voice input for those who don't or can't use physical voices). That adds further interaction to the avatar's visual expression, as it 'naturally' changes to the correct one of a small set in response to changes in the speaking voice.
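To show why a small set of choices makes voice syncing simpler, here is a hedged sketch of one possible selector. The two voice features (pitch and energy, each normalized to the range 0 to 1) and the centroid values are invented for illustration; neither comes from the research described above, and a real system would need measured data. The technique used is a plain nearest-centroid lookup, chosen only because it is easy to follow.

```python
import math
from typing import Dict, Tuple

# Hypothetical voice features: (pitch, energy), each normalized to 0..1.
Features = Tuple[float, float]

# Invented centroids for the six basic states; placeholders, not measurements.
CENTROIDS: Dict[str, Features] = {
    "happiness": (0.7, 0.7),
    "sadness": (0.2, 0.2),
    "anger": (0.5, 0.9),
    "fear": (0.9, 0.6),
    "surprise": (0.9, 0.9),
    "disgust": (0.3, 0.5),
}


def pick_base_expression(voice: Features) -> str:
    """Return the basic state whose centroid is closest to the voice features."""
    return min(CENTROIDS, key=lambda name: math.dist(voice, CENTROIDS[name]))
```

With only six (or four) candidates, each update is a handful of distance calculations, which is the kind of cost that fits comfortably inside a real-time loop.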
If six basic facial expressions still seems like a lot to choose from, much newer work in 2014 by researchers at the University of Glasgow suggests that there may in fact be only four basic visual emotional states. Instead of studying videos of humans conversing with one another, venting rage or breaking down on camera, this newer study examined the physical muscle connections in the face, and the time delays involved when the brain sends a neural code to activate each muscle.
Whether there are ultimately four or six basic states, the concept remains clear. If we sequence just those basic states, then have the system modify that sequence based on the geolocation of the viewer (or their average geolocation, to cover someone logging in from a different area of the world than normal), it would go a long way towards minimizing, or even eliminating, perceived cultural differences in avatar-based communication.
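One simple way to read "average geolocation" is the region a viewer logs in from most often. The sketch below takes that interpretation, which is an assumption on my part, and the region labels and the `"global"` fallback are hypothetical names.

```python
from collections import Counter
from typing import Iterable


def usual_region(recent_login_regions: Iterable[str], default: str = "global") -> str:
    """Return the most frequent region in a viewer's recent login history.

    Falls back to `default` (a hypothetical untweaked setting) when there
    is no history to go on.
    """
    counts = Counter(recent_login_regions)
    if not counts:
        return default
    return counts.most_common(1)[0][0]
```

A viewer who normally logs in from one region but is travelling this week would still be shown the tweak for their usual region, since a single unusual login does not outweigh the rest of their history.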
This approach would also open up more realistic possibilities for an automated voice-syncing system to produce the right basic facial state to correspond to the voice stress levels received. With fewer basic choices to choose from, an expert system is far less likely to get it wrong, and can get it right using far less computing power, which, as we are all aware, is at a premium in real-time immersive environments of all kinds.