Wednesday, June 2, 2010

The assessment of emotional expression in dogs

“The assessment of emotional expression in dogs using a Free Choice Profiling methodology” (Walker et al., Animal Welfare).

Do different people tend to have overlapping or at least complementary ways of describing dog behaviors? And if they do, can a computer put together a behavior scale out of those descriptions, even without any understanding of what’s actually being described? Put a different way: Can we describe a group of observations of dog behavior using a fancy statistical technique which is hard for Dog Zombies to understand? Our intrepid investigators put a bunch of college women together with some dog videos and applied a lot of statistical processing to find out.

The mammals

The people: eighteen undergraduate women with varying levels of familiarity with dogs, but all currently studying animal behavior. (At a guess, they were all students in a class taught by one of the investigators.)

The dogs: ten Beagles trained as customs dogs.

The setup: The 18 women sit in a movie theater. They watch video clips of the beagles. They write down words that come to mind as useful in describing the dogs.

At the end of this first session, they hand in their terms. They are then seated again (it doesn’t say if they’re all together in the movie theater this time, but I imagine them at individual desks for some reason) with the list of terms that they had generated. Each person gets their own list; there’s no collation yet at this point. They watch some of the beagle videos again. (I sympathize. I have watched a lot of Dog TV lately.) They rate each dog against each term using a visual analog scale. That’s a line, from zero to lots, and you put a mark on the line to rate the dog.

So if I were to have the word “energetic,” and be asked to rate my dog Jack at this moment, I’d draw a line and put a mark on it to rate how far along the “energetic” scale he is. Jack is currently fully lateral on the floor and twitching, so I’d put the mark on the far left of the scale. If you then asked me to rate him for “cute,” I’d put the mark on the far right. For “red,” I’d put the mark somewhere in the middle, based on my personal assessment of how red he is on the scale of blond to brick. (Jack’s sort of a strawberry blond, so the marker would be a little left of center.) So these are very individual sorts of assessments.

The computers

The investigators then took all this data and turned it into numbers by measuring the distance of the marks in terms of millimeters. And they handed it to a computer, which performed Generalized Procrustes Analysis (GPA) on it. I love that this technique involves the name “Procrustes.”

So, this is where I sort of want a companion blogger who is an expert in statistics to take over for a few paragraphs. I am not an expert in statistics by any means, but I will give explaining what happens next a shot. Please take it all with a grain of salt; I may be completely inaccurate.

Basically, I imagine the computer spreading all these scores out and seeing which ones match. I think the idea is, if one dog gets a 3-4 mm score from lots of observers, then the computer guesses that those people are all measuring the same thing with those particular terms. The computer checks to see if other dogs also rank similarly with those terms. So if one dog scores 1-2 mm on Term 2 from Observer 5, and Term 3 from Observer 7, then Observer 5’s Term 2 might be describing the same thing as Observer 7’s Term 3. It would then be worth checking a second dog. Is its score on Observer 5’s Term 2 (say 4.4 mm) similar to its score on Term 3 from Observer 7 (say 4.6 mm)? If so, and if other dogs are also similar, those two terms might be describing the same thing (“nervous” vs “shy,” for example).

In addition to seeing which terms might be measuring the same thing, the computer is also trying to figure out which terms are related to each other in other ways. If this dog scores high on Term 1 from Observer 13, does he also score low on Term 3 from Observer 13? Maybe those terms are opposites. (“Outgoing” vs “shy,” for example.) If this dog scores very high on Term 2 from Observer 10, does he also score mid-range to high on Term 7 from Observer 12? Maybe those terms have some sort of relationship, but aren’t exactly the same thing (“outgoing” vs “friendly,” for example).
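The matching idea in the last two paragraphs can be sketched in a few lines of Python. This only illustrates the intuition described above; it is not the procedure from the paper, since GPA works on each observer's whole set of scores at once rather than comparing terms pair by pair. The scores, the term labels, and the `pearson` helper are all invented for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: +1 means two terms rise and fall together
    across dogs, -1 means one rises as the other falls."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented scores (mm along the line) for the same five dogs.
obs5_term2 = [12, 44, 80, 25, 60]    # hypothetical label: "nervous"
obs7_term3 = [15, 46, 75, 30, 58]    # hypothetical label: "shy"
obs13_term1 = [90, 50, 15, 70, 35]   # hypothetical label: "outgoing"

same_thing = pearson(obs5_term2, obs7_term3)   # candidates for "same trait"
opposites = pearson(obs5_term2, obs13_term1)   # candidates for "opposite ends"

print(f"nervous vs shy:      {same_thing:+.2f}")
print(f"nervous vs outgoing: {opposites:+.2f}")
```

A correlation near +1 is the "these two terms might be the same thing" case, and one near -1 is the "maybe those terms are opposites" case. Note that the computer only ever sees the columns of numbers, never the labels.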

From all this, the computer comes up with “dimensions.” Although the eighteen women had just rated each dog in terms of “how much of this term does it have?”, the dimensions are paired, so that a group of positive terms is at one end, and a group of negative (opposite) terms is at the other. In this case, they got three dimensions:
  • playful/happy/confident versus nervous/unsure/tense
  • alert/inquisitive/investigative versus attention-seeking/quiet/unsure
  • playful/nervous/boisterous versus calm/relaxed/confident
So each of the observers’ original terms was allocated to one (or more?) of these dimensions. “GPA thus transforms the 18 different dog-scoring configurations into one multidimensional consensus profile, entirely independently of any interpretation by the experimenter.” In other words, a computer has done all the assignments, and it does not understand what the terms mean. The assignments were done in the complete absence of semantics. There is a point in the process where a check for “satisfactory semantic convergence between observer word charts” is done, checking to see if the grouped terms are reasonable concepts to put together. It’s not clear if a computer or a human performs that check. (How would a computer do it? Using a digital thesaurus? I love the idea of a database with weights for how similar each English word is to every other English word.)
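For the curious, the "transform many configurations into one consensus" step can be sketched as a toy in Python. This is a sketch under loud assumptions, not the paper's analysis: every observer here uses exactly two terms (so each dog is a point on a flat chart), there are only three observers and four dogs, all scores are invented, and the function names are hypothetical. It applies the usual Procrustes moves to each observer's chart (shift, resize, then rotate or mirror) and repeats until the charts settle around an average.

```python
import math

def center_and_scale(points):
    """Shift a chart so its middle is at (0, 0) and resize it to total size 1."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    size = math.sqrt(sum(x * x + y * y for x, y in centered))
    return [(x / size, y / size) for x, y in centered]

def fit_to(points, target):
    """Rotate the chart, or mirror it if that fits better, to lie close to target."""
    a = sum(p[0] * t[0] for p, t in zip(points, target))
    b = sum(p[0] * t[1] for p, t in zip(points, target))
    c = sum(p[1] * t[0] for p, t in zip(points, target))
    d = sum(p[1] * t[1] for p, t in zip(points, target))
    if math.hypot(a + d, c - b) >= math.hypot(a - d, b + c):
        # A pure rotation gives the best match.
        angle = math.atan2(c - b, a + d)
        cos_t, sin_t = math.cos(angle), math.sin(angle)
        return [(x * cos_t + y * sin_t, -x * sin_t + y * cos_t) for x, y in points]
    # A mirror image gives the best match.
    angle = math.atan2(b + c, a - d)
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    return [(x * cos_t + y * sin_t, x * sin_t - y * cos_t) for x, y in points]

def average(charts):
    """Average position of each dog across all observers' charts."""
    n_obs = len(charts)
    return [(sum(ch[i][0] for ch in charts) / n_obs,
             sum(ch[i][1] for ch in charts) / n_obs)
            for i in range(len(charts[0]))]

def residual(chart, consensus):
    """How far one observer's aligned chart sits from the consensus."""
    return sum((x - cx) ** 2 + (y - cy) ** 2
               for (x, y), (cx, cy) in zip(chart, consensus))

def generalized_procrustes(charts, rounds=10):
    aligned = [center_and_scale(ch) for ch in charts]
    consensus = aligned[0]  # start from the first observer, then refine
    for _ in range(rounds):
        aligned = [fit_to(ch, consensus) for ch in aligned]
        consensus = center_and_scale(average(aligned))
    return consensus, aligned

# Invented scores for four dogs; each observer has their own two terms.
observer_a = [(10, 90), (80, 20), (40, 30), (70, 85)]  # e.g. "nervous", "playful"
observer_b = [(88, 12), (22, 78), (58, 72), (32, 13)]  # e.g. "confident", "calm"
observer_c = [(92, 8), (18, 82), (28, 42), (87, 68)]   # e.g. "boisterous", "shy"

consensus, aligned = generalized_procrustes([observer_a, observer_b, observer_c])
residuals = [residual(ch, consensus) for ch in aligned]
for name, r in zip("ABC", residuals):
    print(f"Observer {name}: distance from consensus = {r:.4f}")
```

In this made-up data, Observer B's terms are roughly the opposites of Observer A's and Observer C's are roughly A's in swapped order, so a half-turn or a mirror flip lines the charts up. The code never looks at the words themselves, which is the point the quote above is making.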

They assure us that the eighteen observers, when their terms are applied to this scale, score the dogs very similarly. I started to lose the thread of the statistics at this point, but they did helpfully provide an image of a bullseye. Thirteen of the observers were inside the bullseye (scored dogs similarly). Five were outside.

The meaning

So what have they actually done here? It looks like these observers tended to pick up on similar traits. So if you show a dog to a bunch of people, they will have similar ideas about it. They may use different words, but whether they say “nervous” or “shy,” they will have comparable amounts of that trait in mind. And that is really interesting.

Of course, I also want to say what this is not. These people may all agree about how shy a dog is, but that doesn’t mean that they are any good at telling if the dog actually is shy. The scale generated here, and the scores these observers made, have not been tested for their predictive power. I would love to see something like a test of the scale on dogs placed in new and strange surroundings. Can a dog’s score on this scale predict how it will respond to a friendly stranger, or how much it will explore versus hide in a strange room? I am not criticizing this technique; it doesn’t claim to answer that question. But I think it’s worth keeping in mind what its limits are.

It does claim to tell us whether different people have similar perceptions, and I would love to see it tested on people of different cultural backgrounds, especially people who speak different native languages. The terms that the observers chose didn’t have to be the same in order to be grouped together, but they seemed to all choose similar concepts. Would people who were the native speakers of a variety of languages have such a strong overlap of chosen concepts?

And, as the researchers point out, the observers were all women. Is it possible that women tend to have similar perceptions about dogs, versus the perceptions men tend to have? Do women assign different levels of importance to different behavioral traits, and are they therefore more likely to choose different traits as important enough to score?

Looking more closely at the dimensions that were constructed causes me to suspect that the dimensions are indeed pulling together different traits. For example, one dimension has “alert/inquisitive/investigative” vs “attention-seeking/quiet/unsure.” Outgoing dogs on one side, insecure dogs on the other — makes sense. But the insecure dogs are also “attention-seeking.” That makes sense logically, as insecure dogs may be more likely to seek reassurance from humans. However, it is a somewhat different trait than “quiet.” In fact, I can imagine that a quiet dog might tend to be less attention-seeking, by virtue of being, well, quiet. So it’s interesting that these traits were pulled together into one dimension, even when people who labeled dogs “quiet” probably didn’t necessarily think of those same dogs as “attention-seeking.” Different behaviors, but one interpretation that pulls them together.

On the other hand, you get the “playful/nervous/boisterous” dimension. That may well be different characterizations of the same trait — high energy. Some people think high energy is good (playfulness) and some think it’s less good (boisterousness, what we call “freshness” in this house). But it’s the same thing, whether it’s something you look for in a family pet or not. So it’s also interesting that these traits were pulled together.

I’m particularly impressed that the computer was able to correctly pull positive and negative ends of dimensions together. It was able to determine that “nervous” is the opposite of “confident,” even though no human ever explicitly told it that. Good job.

On the other hand, it did less well on other terms. Looking at the table which details which terms were pulled into which dimensions, I have to say: “aloof/disinterested” is on the same end of dimension 2 with “curious/explorative”? Really, computer?

And there is significant overlap between the dimensions. “Nervous” and “unsure” each show up in two different dimensions. Lots of other terms overlap, too. If there’s so much overlap, shouldn’t the dimensions be constructed differently, more cleanly, somehow?

The investigators suggest that the resulting scale does describe real behavior, because “the dogs are distributed reasonably evenly over the three dimensions, which suggests that these dimensions effectively characterise observed variances in behavioral expression.” I’m not sure about this argument. Wouldn’t you expect to see clumping of some behaviors? Some things that aren’t desirable for customs dogs, or are unusual for beagles, but show up anyways? Why should behaviors be naturally evenly distributed?

So what does it mean? Could it just mean that computers are able to find meaning in any data set? All that semantic overlap makes me feel “nervous” and “unsure.” What are they really describing?

The investigators address this (I think) in some beautiful but very dense prose: “descriptors are not meant to designate separate, sharply delineated, causal factors, but complementary, overlapping, mutually-enhancing aspects of the whole organism. Rather than be confused by the multitude of terms, the idea is to perceive the meaning expressed through them.” That’s actually really lovely, but I am not yet sure it really works.

The future directions

So what’s next? The investigators suggest using this FCP methodology in research into dog welfare. Sounds like a good idea to me. It’s notoriously hard to really tell if an animal’s welfare is good or not, so new tools might be helpful.

However, I note that the reason it’s hard to evaluate welfare is that our prejudices get in the way. We keep thinking what we would like in a particular situation, without having any way to understand what the animal would like. Good welfare science finds ways to ask the animal. FCP seems to me to be only about asking the human. But maybe it can be used to find new axes we hadn’t considered scoring dogs on, and then we can test what those axes correlate with. Good next steps would be comparing how a dog scores on this scale to physiologic parameters, like cortisol level, immune system function, sympathetic nervous system activation, and the like.

Also, as the researchers note, this method should be tested on a variety of breeds. Although they don’t specify this, I’d like to see the method tested on a group containing a variety of breeds. Just performing multiple tests on groups each containing a single breed doesn’t tell you if this method is any good at handling increased variability. Different breeds do have their different dialects; would that change this method’s effectiveness?

This article was a really fun read. Does it tell us more about dogs, or more about humans? I’m not yet sure how the approach will be used in dog welfare, but the article has some good ideas for things to try out. Good luck to them!

J Walker, A Dale, N Waran, N Clarke, M Farnworth, & F Wemelsfelder (2010). The assessment of emotional expression in dogs using a Free Choice Profiling methodology. Animal Welfare, 19, 75-84.


  1. Hi, read your post on this article. Great job! Do you happen to have access to this article because I can't seem to find it online.

  2. Thank you!

    I do have access to it; it is at IngentaConnect:

    If that doesn't work for you, give me an email address and I'll send the PDF.

  3. hey, thanks alot! I really appreciate it. Here's my email This article is crucial for my research. So thanks once again :)