Friday, August 3, 2012

Marking Positively: How to Score Natural Translations

This post is addressed particularly to researchers, but it's relevant too for teachers of translation. Note that Natural Translation (NT) is used here as a cover term for both Natural Translation and Native Translation.

At the Forli conference in May (enter forli in the Search box), I noticed that some people are still using the old subtractive scoring method to rate NT.

What is the subtractive method? It means starting from 100 points and knocking off a point, or several points, for each mistake of any kind; typically a point or two for minor errors of content or expression and up to five points for major ones. The 'pass mark' is usually expressed as a positive percentage, but it's really a 'failure score'. That's how students' written translations are marked, and likewise the examinations of the professional associations like the Canadian one to which I belong. It can also be used for interpretations, especially if they're transcribed.
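For readers who like to see the arithmetic, here is a minimal sketch of subtractive scoring in Python. The function name, the penalty values and the floor at zero are my own assumptions for illustration; they are not taken from any particular marking scheme.

```python
def subtractive_score(penalties, start=100):
    """Start from `start` points and knock off the penalty for each mistake."""
    # Flooring at zero is an assumption; nothing above says what happens
    # when the penalties add up to more than the starting total.
    return max(start - sum(penalties), 0)


# Hypothetical marked translation: four minor errors and two major ones.
penalties = [1, 2, 1, 2, 5, 5]
score = subtractive_score(penalties)
print(score)  # 84
```

Note that the score only records what went wrong: it says nothing about which parts of the text were translated successfully.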

Two objections can be raised. The first is a didactic one: that the approach is negative and therefore discouraging. True, mathematically speaking, -30% of mistakes is equivalent to +70% correct, but the psychological effect is different. Anyway, it's not so important as the second objection, which is that the approach reinforces 'nit-picking' by the markers, because small details are allowed to affect the score significantly. I still squirm at a sequence in an old film about an interpretation exercise for European Commission interpreters (see References) in which a student is berated in front of the other students for his translation of a single word.

When evaluating NT, we need to take the opposite approach. Although mistakes are of great interest insofar as they reveal the limitations and the 'pathology' of NT, in NT research our primary interest should be in what subjects can translate and not in what they can't. A score of only 40% because of numerous distortions and omissions would probably entail failure for an Expert or Professional translator or a translation school student; but for a Natural Translator it represents a non-negligible translating ability and we should focus on it and analyse what that 40% consists of.

How can we build a positive scoring method?

In the 1990s I became involved in the design of tests for candidates who wanted to work as community interpreters for public services in Ontario, Canada. These became known as the CILISAT tests and are still in use. The Government of Ontario funded the necessary research. The candidates were almost always Native Interpreters, because the pay was too low to attract Professional Experts and because the languages were not taught in Canada. We decided we needed a test instrument that would be better suited to Native, i.e. untrained, Interpreters than those used by the translation schools and in the profession. So we turned to a method called propositional analysis. It's used by psychologists among others, and in fact I'd been introduced to it by the late David Gerver, who was one of the pioneer researchers on interpreters and was also a clinical psychologist. The form of it we used can be described this way:
"To analyze the text, propositional analysis – a description of the text in terms of its semantic content – is used. The units of analysis are propositions, or units of meaning containing one verbal element plus one or more nouns. The corresponding units are then selected on the basis of meaning rather than structure."
In practice this meant that we broke down the scripts for the interpretation tests into simple, single-clause sentences representing propositions and then awarded points according to whether the meaning of each proposition as a whole was conveyed in translation: zero points for an omission or a meaning contrary to that of the proposition; 1 point for a meaning conveyed but not clearly or not completely; 2 points for a complete and true rendering. There was a weighting that distinguished between important and unimportant propositions. This scale was solely for meaning. Other factors, for example correct language, were scored separately and globally, not proposition by proposition.

For example, the statement, "At around 6 o'clock I saw a blue sports car waiting on the other side of the road," might be broken down into:
The time was approximately 6 pm.

I saw a car.

The car was blue.

The car was a sports car.

The car was waiting.

The car was on the other side of the road.
A paraphrase like, "I seed a sport car stopping at the kerb of our street before supper" would score 7 points for informational meaning before being weighted for importance. (Work it out! 1+2+0+2+1+1.)  The maximum possible points varied with each script. Small language mistakes like "seed" were relegated to a separate evaluation.
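The sum above can be reproduced with a short Python sketch. The class and function names are hypothetical, the points are assigned to the propositions in the order they are listed above (my reading of the sum), and the weights are placeholders, since the actual weighting used in the tests is not described here.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    proposition: str
    points: int          # 0 = omitted or contrary, 1 = partial, 2 = complete
    weight: float = 1.0  # placeholder; the real importance weights are not given


RATINGS = [
    Rating("The time was approximately 6 pm.", 1),
    Rating("I saw a car.", 2),
    Rating("The car was blue.", 0),
    Rating("The car was a sports car.", 2),
    Rating("The car was waiting.", 1),
    Rating("The car was on the other side of the road.", 1),
]


def raw_score(ratings):
    """Points for informational meaning, before any weighting."""
    return sum(r.points for r in ratings)


def max_score(ratings):
    """Two points per proposition, so the maximum varies with each script."""
    return 2 * len(ratings)


def weighted_score(ratings):
    """One possible weighting: multiply each rating by its importance."""
    return sum(r.points * r.weight for r in ratings)


print(raw_score(RATINGS), "out of", max_score(RATINGS))  # 7 out of 12
```

Language mistakes like "seed" do not appear in this tally at all, because they are evaluated separately.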

References

Guadalupe Barrera Valdes and Manuel Rosalinda Cardenas. Constructing matching tests in two languages: the application of propositional analysis. NABE: The Journal for the National Association for Bilingual Education, vol. 9 no. 1, pp. 3-19. 1984. There’s an abstract here.

Roda P. Roberts. Interpreter assessment tools for different settings. In R. P. Roberts et al. (eds.), The Critical Link 2: Interpreters in the Community, Amsterdam, Benjamins, 1999. Most of it is here.

David Gerver. A psychological approach to simultaneous interpretation. Meta, vol. 20, no. 2, pp. 119-128, 1975. "A slightly altered version of a paper presented at the 18th International Congress of Applied Psychology in Montreal in July 1974". The text is here.

André Delvaux (director). Les Interprètes. Brussels: Commission of the European Communities. c1975. 16 mm film. c15 mins.



  1. Thanks for this! I was just trying to figure out the best way to assess Wikipedia translations, so your post has been really helpful. I like the idea of positive as opposed to negative scoring and proposition analysis seems very well suited to the kind of study I want to do. You've given me some good points to mull over for the next little while.

  2. Interesting article. However, I have three questions:

    a) If you deduct points from a standard total, how would you cater for long vs short texts? (A long text would likely hold more mistakes, even if the global % was the same.)

    b) It would also have been nice if you had provided an indication of an "acceptable score". For example, I have noticed throughout 35 years as a professional translator that in some cases the accuracy of the translation is less important than in others. In certain cases, however, translation mistakes can kill people. If a translator scored consistently above a certain threshold, that would mean he was properly qualified for that particular subject.

    c) Would it be possible to automate these scorings? It would, for example, make it possible to score how well machine translations perform...

  3. Thanks for sharing such a useful piece of information. It's a really nice and informative blog.