Translating sentences with substitutions

by Luke Plant

— January 24, 2013 00:14

The problem

Many programs build up sentences from smaller pieces, often a template into which different things are substituted. However, the things you substitute into a sentence can change the sentence, and vice versa, in ways that the programmer did not anticipate.

For example, plurals. In English, you might try code like this:

if n == 1:
    return "I have 1 pig"
else:
    return "I have %s pigs" % n

Localising these strings gives problems, because the rules for how plural forms work are different in every language. This specific problem is generally considered 'solved' by the use of gettext, but many more exist.
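For reference, this is roughly what the gettext approach to plurals looks like from Python. It is only a minimal sketch: gettext.NullTranslations is used as a stand-in so that the snippet runs without any catalog, which means it falls back to the English rule. A real project would load a compiled catalog (for example with gettext.translation), and the plural rule would come from that catalog's Plural-Forms header. The function name i_have_n_pigs is made up for illustration.

```python
import gettext

# Stand-in for a real catalog: with no translations loaded,
# ngettext picks the singular for n == 1 and the plural otherwise.
translations = gettext.NullTranslations()


def i_have_n_pigs(n):
    # The translator supplies one string per plural form of the language;
    # the programmer only supplies the English singular and plural.
    template = translations.ngettext("I have %(n)d pig", "I have %(n)d pigs", n)
    return template % {"n": n}


print(i_have_n_pigs(1))  # I have 1 pig
print(i_have_n_pigs(3))  # I have 3 pigs
```

This covers the number of forms and which form to pick, but nothing else about the sentence.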

For example, we have another problem as soon as we start substituting nouns:

"Delete selected %s?" % object_name

Various attributes of the noun could affect the sentence. In French, the adjective "selected" needs to agree in gender with the noun being substituted in. So you cannot look up the translations for "Delete selected %s" and for object_name separately. (This is a real example taken from the Django source code.)
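To make the failure concrete, here is a small sketch of what happens when the sentence and the noun are translated in isolation. The lookup tables and the function naive_translate are hypothetical, invented for this illustration; they are not how Django or gettext store anything.

```python
# Hypothetical lookup tables: each string is translated on its own.
SENTENCES_FR = {"Delete selected %s?": "Supprimer les %s sélectionnés?"}
NOUNS_FR = {
    "pigs": "cochons",      # masculine
    "people": "personnes",  # feminine
}


def naive_translate(template, noun):
    # Looks up the sentence and the noun separately, then glues them together.
    return SENTENCES_FR[template] % NOUNS_FR[noun]


print(naive_translate("Delete selected %s?", "pigs"))
# Supprimer les cochons sélectionnés?    (correct)
print(naive_translate("Delete selected %s?", "people"))
# Supprimer les personnes sélectionnés?  (wrong: should be "sélectionnées")
```

The sentence template has no way of knowing the gender of whatever ends up in the %s slot.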

Further, depending on how the sentence uses the noun, the form of the noun might need to change. For example, the noun might appear in the accusative position for a given sentence and language, which requires a different form of the noun to be used compared to the nominative form.

Several other examples of this appeared in Django ticket 11688. One proposed solution on that ticket would require a huge amount of knowledge and effort on the part of Django programmers, and almost certainly would not work anyway.

This post is an attempt to come up with a better solution, or at least to kick-start discussion. I haven't been able to find any solutions to this problem online, and most people seem to be just using gettext, which is a 95% solution, and maybe that is good enough for most people.

[Update 2013-02-19 - ‘Richard’ pointed me to the Locale::Maketext article, which takes an approach essentially similar to what I've done here.]

[Update 2015-08-10/2018-03-29 - Mozilla’s l20n and fluent projects seem to cover most of the use cases here. The strategies are quite similar to the one described below. The language has iterated, and the one now used is essentially similar to what I envisaged below, but much more practical and having the substitution syntax that you would want. You should look there if you just want a working solution. It's not clear how active or widespread these projects are, though.]

Assumptions and simplifications

We will assume that a sentence is a composable unit of meaning, such that sentences can be translated independently. So, if in language A we have sentences 1 and 2, in that order, we can translate these into language B by translating sentence 1 and sentence 2 independently, and putting them together in the same order.

This is, no doubt, a simplification. In some languages, the two sentences might make more sense if re-ordered, or combined, or split in various ways. Indeed, some languages may not have a truly equivalent concept of 'sentence' at all.

However, we have to do something, and this is a reasonable approximation.

Requirements

We need a powerful way of defining sentences in a given human language. It must be powerful enough that the person doing the translation can do anything they need, without the programmer needing to be aware of all the things in the language that will cause difficulty.

So, we'll start with a full programming language, and chop out the things we shouldn't need.

We shouldn't need side effects - translation should be a pure function. So we'll use a purely functional programming language without side effects.

We need something fairly readable, because translators are going to have to use it. It should be as close as possible to declarative in style.

Pattern matching seems like a great fit for some of our needs.

Possible solution

Given the above requirements, let's start with a Haskell-like pure functional language, whose pattern matching will be extremely helpful. It will obviously have IO removed, and no type signatures (but that won't stop us inferring them and being able to statically type-check the code). Everything else will be borrowed directly from Haskell, so that I can avoid having to make up my own syntax and semantics.

If the concept works, we can argue about better or simpler syntax for some constructs, or helper functions that aren't part of the Haskell prelude.

Hopefully, we will find a relatively small subset of Haskell that gives us all the power we need to solve this problem - ideally a subset small enough that we could guarantee termination, to avoid problems with translations created by malicious agents.

This will be an example based exploration.

Let's assume that every sentence can be generated by a function. The function will take as parameters all the substitutions that are needed, and return the translated string.

So, suppose we have the English sentence "I have some pigs". For every different language we need, we would have a translation file which contains the function iHaveSomePigs, which in this case takes zero parameters. So for French:

iHaveSomePigs = "J'ai des cochons"

(The mapping between the English sentence "I have some pigs" and the function name iHaveSomePigs hasn't been defined, and we'll skate over that detail for now).

If we have a variable number of pigs, for French we might have:

iHaveNPigs 0 = "Je n'ai pas de cochon"
iHaveNPigs 1 = "J'ai un cochon"
iHaveNPigs n = "J'ai " ++ show n ++ " cochons"

For English we could do this:

iHaveNPigs 1 = "I have 1 pig"
iHaveNPigs n = "I have " ++ show n ++ " pigs"

(For those unfamiliar with Haskell, the way that pattern matching works is that the first definition that matches the arguments is used. Since n is not a literal, but a variable, it can match any argument.)

We can cope with more complicated rules, such as those used in Polish, perhaps something like this:

iHaveNFiles n = "Mam " ++ show n ++ " " ++ pluralize n "file"

plurals "file" = [ "plik"
                 , "pliki"
                 , "plików"
                 ]

pluralize n word = plurals word !! pluralForm n

pluralForm n
  | n == 1                                                                        = 0
  | n `mod` 10 >= 2 && n `mod` 10 <= 4 && (n `mod` 100 < 10 || n `mod` 100 >= 20) = 1
  | otherwise                                                                     = 2

Note that the complex logic in pluralForm and pluralize only has to be defined once. Adding more words simply requires additional plurals lines. It's not the nicest syntax, but could probably be improved, and it's pretty easy to copy.
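As a sanity check on the rule, here is a direct Python port of the Polish definitions above. It is only a sketch for trying out values: the names plural_form, pluralize and i_have_n_files are Python spellings of the Haskell functions, and the dictionary PLURALS stands in for the plurals lines.

```python
# One entry per word: [form for 1, form for 2-4 (except 12-14), everything else]
PLURALS = {"file": ["plik", "pliki", "plików"]}


def plural_form(n):
    # Same three-way rule as the Haskell pluralForm above.
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def pluralize(n, word):
    return PLURALS[word][plural_form(n)]


def i_have_n_files(n):
    return "Mam %d %s" % (n, pluralize(n, "file"))


for n in (1, 2, 5, 12, 22):
    print(i_have_n_files(n))
# Mam 1 plik
# Mam 2 pliki
# Mam 5 plików
# Mam 12 plików
# Mam 22 pliki
```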

Let's add in gender, using the sentences "Delete this %s?" (singular) and "Delete selected %s?" (plural). We can use guards:

deleteThisThing thing
    | isMasculine thing = "Supprimer ce " ++ singular thing ++ "?"
    | otherwise         = "Supprimer cette " ++ singular thing ++ "?"

    -- (Ignoring the problem with 'ce' followed by vowel for now...)

deleteSelectedThings thing
    | isMasculine thing = "Supprimer les " ++ plural thing ++ " sélectionnés?"
    | otherwise         = "Supprimer les " ++ plural thing ++ " sélectionnées?"

isMasculine thing = elem thing [ "pig"
                               , "man"
                               -- anything else masculine
                               ]

singular thing = pluralForm 1 thing
plural   thing = pluralForm 2 thing

pluralForm 1 "pig" = "cochon"
pluralForm 2 "pig" = "cochons"

pluralForm 1 "man" = "homme"
pluralForm 2 "man" = "hommes"

Note that the only thing required by this system is that the functions deleteThisThing and deleteSelectedThings exist. Everything else is at the freedom of the translator, and better ways of defining any of these functions are possible.

Of course, it isn't expected that a translator would be able to produce this by himself/herself. However, once the basic logic has been set up, this syntax is readable enough that a translator could easily add more of the same. Lines like:

pluralForm 1 "pig" = "cochon"

are actually pretty readable. The lack of parentheses in Haskell function calls is also a bonus (though, as I said earlier, exact syntax could be debated). This is not really that much harder than editing a .po file if you are just wanting to add more of the same.

Also, we've got flexibility. If we really don't care about getting the gender right, we can just do "sélectionné(e)s" and be done with it.

Let's make it harder - we'll add case. I'll use NT Greek as an example, because it has nouns that decline with case (and I don't know any similar modern languages well enough). I'm going to introduce an enum for the different cases, using data for now, and for the different genders. I could also do the same for number ("Singular" and "Plural"), but just using 1 and 2 seems easier.

Our sentence will be "You like the %s.". For this in Greek, we need to choose the accusative singular form of the thing we pass in. We also need to pick the word for "the" (the definite article) which matches the gender and number of the noun, and it has to match the accusative case too. So, if we pass in a masculine word, we need the singular accusative masculine definite article (having fun yet?):

data Case = Nominative | Accusative | Genitive | Dative
data Gender = Masculine | Feminine | Neuter

youLikeTheThing thing = "φιλεις "
                        ++ definiteArticle 1 Accusative (genderOf thing)
                        ++ " "
                        ++ accusativeSingular thing ++ "."

accusativeSingular thing = nounForm 1 Accusative thing

nounForm 1 Nominative "book" = "βιβλιον"
nounForm 1 Accusative "book" = "βιβλιον"
nounForm 1 Genitive   "book" = "βιβλιου"
nounForm 1 Dative     "book" = "βιβλιω"

nounForm 2 Nominative "book" = "βιβλια"
nounForm 2 Accusative "book" = "βιβλια"
nounForm 2 Genitive   "book" = "βιβλιων"
nounForm 2 Dative     "book" = "βιβλιοις"

genderOf "book" = Neuter
genderOf "man"  = Masculine
-- etc

definiteArticle 1 Nominative Masculine = "ο"
definiteArticle 1 Accusative Masculine = "τον"
definiteArticle 1 Genitive   Masculine = "του"
definiteArticle 1 Dative     Masculine = "τω"

definiteArticle 1 Nominative Neuter    = "το"
definiteArticle 1 Accusative Neuter    = "το"
definiteArticle 1 Genitive   Neuter    = "του"
definiteArticle 1 Dative     Neuter    = "τω"

-- feminine etc

-- definiteArticle 2 (plurals) etc.

Of course, you can easily define shorter aliases to avoid some typing here, and there may be better ways to generate the tables, though as written above they are pretty readable, and should be familiar to anyone who knows Greek.

The function youLikeTheThing here is no longer very readable, although it could be much worse. Some kind of substitution syntax/function could be used.

The code above actually works, BTW, and it ran the first time I tried it. The only correction I needed to make its output correct was to add a space after the definite article. You just need to put it in a file test.hs, add the following line:

main = putStrLn $ youLikeTheThing "book"

and do:

$ runhaskell test.hs

There is not a type signature in sight, but you have compile time guarantees. This is all a testimony to the clarity of Haskell's syntax.

The features of Haskell we've used are:

functions
simple pattern matching on numbers and strings
guards
data declarations, limited to union types of nullary constructors, i.e. effectively enumerated values. We could use a keyword enum for clarity.
string concatenation
lists
a few arithmetic and logical operators

We haven't used recursion. I can imagine circumstances where it might be useful, but if it is deemed too risky, you could add rules that disallow it (e.g. by requiring that a function must not call itself, and may only call functions that are defined before it in the source code, which also rules out mutual recursion). This would help to ensure termination.
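The ordering rule is simple enough to check mechanically. Below is a minimal sketch of such a check; it assumes that some parser (not shown, and hypothetical) has already reduced the translation file to a list of function names in source order, each paired with the names it calls. The function check_no_recursion is invented for this illustration.

```python
def check_no_recursion(definitions):
    """Check that every function only calls functions defined earlier.

    definitions: list of (function_name, set_of_called_names) in source order.
    Returns a list of error messages; an empty list means the rule holds.
    """
    seen = set()
    errors = []
    for name, calls in definitions:
        for callee in sorted(calls):
            # A self-call fails too, because `name` is only added afterwards.
            if callee not in seen:
                errors.append(
                    "%s calls %s, which is not defined earlier" % (name, callee)
                )
        seen.add(name)
    return errors


ok = [
    ("genderOf", set()),
    ("nounForm", set()),
    ("youLikeTheThing", {"genderOf", "nounForm"}),
]
print(check_no_recursion(ok))  # []

mutual = [("a", {"b"}), ("b", {"a"})]
print(check_no_recursion(mutual))
# ['a calls b, which is not defined earlier']
```

Note that under this rule helpers have to come before the sentences that use them, so the Greek example would need its definitions reordered (tables first, youLikeTheThing last).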

You might also want a module system, to be able to pull in some common definitions and functions for a given language, for consistency across different projects.

This whole approach has the advantage of being able to refine and special case as much as you want. Take the sentence "you like the %s": suppose that if the thing is a human being e.g. "man" or "woman", you need to use a completely different verb. Then you just add a special case first:

isAPerson "man"   = True
isAPerson "woman" = True
isAPerson _       = False

youLikeTheThing thing
    | isAPerson thing = ...
-- fall through to the normal case here

In the other direction, if you just don't have the time to care about any of this, you can just use a really simple (and often wrong) formula:

youLikeTheThing thing = "φιλεις τον " ++ greek thing

greek "book" = "βιβλιον"

Notice that the programmer of the main project does not need to know anything about plural forms, gender, case etc., or put any of that into the source code. The only thing he/she would do is call a function with all the things to be substituted. We could have some mapping from English strings to function names, or we could just use the function name as a string, e.g. from a Python project we might call the function like so:

prompt = translate("doYouWantToDelete", n, object_name)

This would call the translation function doYouWantToDelete with the parameters n and object_name.
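On the Python side, translate could be little more than a lookup by name. The sketch below is an assumption about how that might look, not a real implementation: the TRANSLATIONS registry is a hypothetical stand-in for whatever would load and evaluate the translation files, with the French iHaveNPigs from earlier written by hand as a Python function, and the language keyword argument is invented for the example.

```python
def _i_have_n_pigs_fr(n):
    # Hand-written equivalent of the French iHaveNPigs definitions.
    if n == 0:
        return "Je n'ai pas de cochon"
    if n == 1:
        return "J'ai un cochon"
    return "J'ai %d cochons" % n


# Hypothetical registry: language code -> function name -> callable.
TRANSLATIONS = {"fr": {"iHaveNPigs": _i_have_n_pigs_fr}}


def translate(function_name, *args, language="fr"):
    # Find the translation function by name and pass the substitutions through.
    return TRANSLATIONS[language][function_name](*args)


print(translate("iHaveNPigs", 0))  # Je n'ai pas de cochon
print(translate("iHaveNPigs", 3))  # J'ai 3 cochons
```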

As a refinement, we can provide a version which will work when the whole localisation machinery is turned off i.e. we allow the programmer to provide their own version of the translation function which returns the default language:

prompt = translate("doYouWantToDelete", n, object_name,
                   lambda n, object_name: "Do you want to delete these %s %s(s)" %
                                          (n, object_name))

As before, the provided function can be correct or simplistic as desired for English.
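A sketch of how the fallback might be handled follows. Treating a callable final argument as the fallback is my assumption, chosen to match the call shown above, and ACTIVE_TRANSLATIONS is a hypothetical registry that stays empty when the localisation machinery is turned off.

```python
# Hypothetical registry of loaded translation functions.
# Empty here, as it would be with localisation turned off.
ACTIVE_TRANSLATIONS = {}


def translate(function_name, *args):
    fallback = None
    # Assumption: a callable in last position is the default-language version.
    if args and callable(args[-1]):
        *args, fallback = args
    func = ACTIVE_TRANSLATIONS.get(function_name, fallback)
    if func is None:
        raise KeyError("no translation or fallback for %r" % function_name)
    return func(*args)


prompt = translate(
    "doYouWantToDelete", 3, "pig",
    lambda n, object_name: "Do you want to delete these %s %s(s)"
                           % (n, object_name),
)
print(prompt)  # Do you want to delete these 3 pig(s)
```

If a translation function is registered under the same name, it takes precedence and the fallback is ignored.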

Feedback

There are a few questions in my mind:

Would a solution like this work for the languages you know? What additional features would be needed to cope with other human languages?
Is this vaguely practical? Could you get translators to be able to edit code like this? If not, and only programmers would be able to do this, are there enough programmer-translators to make it a viable solution, at least for some big projects?

I'm aware that the string concatenation gets ugly fairly quickly, and some kind of interpolation might be needed (including the ability to call functions within that interpolation). With that in place, I think you could achieve a reasonable level of readability.

A translation tool could also have language-specific templates to quickly insert the code for common forms.
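To give a feel for the kind of interpolation mentioned above, here is a toy sketch in Python. The {name} and {function argument} syntax is invented for this example and is not a proposal for the real syntax; interpolate and the functions table are likewise hypothetical.

```python
import re


def interpolate(template, functions, **values):
    """Fill {name} with a value, or {func arg ...} with the result of a call."""
    def replace(match):
        parts = match.group(1).split()
        if len(parts) == 1:
            return str(values[parts[0]])
        func_name, *arg_names = parts
        # Arguments are looked up by name among the supplied values.
        return str(functions[func_name](*(values[a] for a in arg_names)))

    return re.sub(r"\{([^{}]+)\}", replace, template)


# Stand-in for the French plural table from earlier.
functions = {"plural": lambda thing: {"pig": "cochons", "man": "hommes"}[thing]}

print(interpolate("Supprimer les {plural thing} sélectionnés?", functions, thing="pig"))
# Supprimer les cochons sélectionnés?
print(interpolate("J'ai {n} cochons", functions, n=3))
# J'ai 3 cochons
```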
Is it possible to have a simpler language that would still be able to cope with the examples here?

The examples I've come up with suggest to me that you need a full programming language, and that attempting to start from the other direction (e.g. build up from the current gettext approach) will produce a monstrosity.

gettext already does a 95% job, and we are at the point of diminishing returns. So if we are going to try to tackle the final bit, we need to err on the side of enough power to get all of that 5%, rather than put a lot of effort in and discover we've only arrived at 96%.

You also cover the case of having a client who insists that the program should output "cet homme" and not "ce homme" - while it might make your translation file ugly, you've got the power to be able to do it if you want.

Comments?
