How to create parallel texts for language learning, part 2

I wrote previously about how to manually create a parallel text for language learning, which basically involved lining up paragraphs using a common spreadsheet program. Now I’m going to dive into my preferred method of parallelizing, which is by using special software to create a sentence-aligned text. This article is intended for a more computer-savvy audience, so if you’re confused by the tech terminology, then I recommend going back to the previous article.

The main feature of a parallel text is that it has aligned sections of text in at least two languages, enabling you to quickly understand the meaning in a new language using a language that you already understand. Having each section aligned means that you can totally eliminate annoying dictionary lookups, and you also get the benefit of having sentence-level translations that better represent the meaning of each word in context. This is an extremely valuable tool for language learning because it enables you to learn much faster, and to learn more in-depth features of the language quickly.

In order to facilitate quick understanding, I like to have sentence-aligned text. This means that each sentence in the target language is lined up with an equivalent sentence in a familiar language. In practice, the sentences are never 100% equivalent, so sometimes two L2 sentences match one L1 sentence or vice versa, but a lot of translated works are surprisingly close to each other on a sentence level. The problem is that it would be very laborious to manually create a sentence-aligned text.

Another issue is the knowledge of the L2 required in order to successfully create an aligned text. If you had to do it manually, it would be no good for languages where you know absolutely nothing so far. Fortunately, there’s some software that makes the process both fast and doable by those who are total beginners in that language. One such program, which is freely available and open source, is called “hunalign”. It was originally created by some linguists for producing parallel Hungarian/English texts, but will actually work for almost any language pair.

In producing sentence-aligned texts, there are three main steps. The first is separating the input properly into individual sentences. Next comes the processing step, where the software matches the sentences together. Finally, there’s an optional adjustment phase where some of the errors made by the software can be manually corrected.

Hunalign expects input in text format, with one sentence per line and every word separated by a space. Unfortunately this immediately rules out languages like Chinese and Japanese, which have no spaces between individual words; for those languages, you’re better off using the manually produced paragraph-aligned method that I outlined in part 1 of this article.

There are many ways to separate sentences, but the general idea is that you need to have some sort of tool that can match patterns and insert some line-breaks. In this article, I’ll be focusing on the “emacs” text editor with pattern matching done using “regular expressions”.

First, you need to convert your ebooks to text format. If your source is a PDF file, this may introduce the problem of extraneous line-breaks, by which I mean line-breaks in the middle of sentences. This happens because PDFs are rather explicit in their formatting. They don’t rely on the display software to wrap the lines of text, so they have explicitly included line-breaks in the middle of sentences, which causes us a bit of a problem. To fix this, we can use the emacs command query-replace-regexp, and tell it to search for this pattern:
\([^.!?"';«]\)^Q^J
and then replace it with: \1 (with a space at the end).

Just to briefly clarify, this pattern says “look for anything except one of these certain characters I’ve listed in the [] brackets, followed by a line-feed”, and then you’re replacing it with whatever character it found, followed by a space instead of that line-feed. The character was “saved” by using the \( and \), and then later re-inserted using the \1. The linefeed is specified by typing ^Q^J (aka ctrl-Q ctrl-J). The ctrl-Q indicates that the following character should not be interpreted normally, allowing us to type ctrl-J without it being interpreted as an [ENTER].

In short, what we did was find any character that did NOT end a sentence or phrase, but was nevertheless at the end of a line, and then we replaced the line-feed with a space. If you missed it, the part signifying NOT was the ^ character immediately after the [ bracket.
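
If you'd rather script this step than do it interactively in emacs, here's a rough Python sketch of the same "join" idea. The function name join_broken_lines is just something I made up for illustration, and there's one small difference from the emacs pattern: I've added the line-feed itself to the excluded characters, on the assumption that you want blank lines between paragraphs left alone.

```python
import re

# Same character set as the emacs pattern: anything that is NOT one of these
# sentence- or phrase-ending characters, followed by a line-feed.
# The extra \n inside the brackets is my own addition so that blank lines
# (paragraph breaks) are not touched.
JOIN_PATTERN = re.compile(r"""([^.!?"';«\n])\n""")


def join_broken_lines(text):
    """Replace mid-sentence line-feeds with a space, keeping the saved character."""
    return JOIN_PATTERN.sub(r"\1 ", text)


sample = "The two men appeared out of\nnowhere, a few yards apart.\nFor a second they stood\nquite still."
print(join_broken_lines(sample))
```

The \1 in the replacement plays the same role as in emacs: it puts back the character that was captured just before the line-feed.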

We can think of this previous operation as a "join" of smaller fragments. The other pattern we need is a "split" pattern: one to find the real end of sentences or phrases, so that we can insert a line-break there. For that, I used variations on this regular expression:
\([^.][^.]\)\([.!?;]\) (again, note the space at the end here)

This says basically "find any sentence-ending characters that are followed by a space, and which DON'T have periods in either of the two preceding positions". This extra detail helps to differentiate from things like "..." and "A.B." sort of patterns that may have periods, but are not necessarily sentence ends. We then replace this pattern with:
\1\2^Q^J

The \1 and \2 refer back to the portions that are surrounded by \( and \), replacing them literally. We then follow up with a line-feed, which will replace the space in the original pattern.
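
The same "split" can be sketched in Python if you want to run it outside of emacs. This is only an illustration of the pattern described above (the name split_sentences is my own), and it shares the same weakness: abbreviations followed by a space will still be treated as sentence ends.

```python
import re

# Two characters that are not periods, then a sentence-ending character,
# then a space. The space is what gets replaced by the line-feed.
SPLIT_PATTERN = re.compile(r"([^.][^.])([.!?;]) ")


def split_sentences(text):
    """Insert a line-feed after each sentence end that the pattern recognizes."""
    return SPLIT_PATTERN.sub(r"\1\2\n", text)


sample = "They stood still; then they walked. Wait... what? Yes."
print(split_sentences(sample))
```

Notice that the "..." is left alone, because the periods right before the final one disqualify the match.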

So, now that we've hopefully joined all the sentence fragments, and split the actual sentences, we are left with a document containing one sentence per line. Now we can begin the match-up process with hunalign. Although hunalign has many confusing options, what you basically need is something like this:
hunalign -text dictionaryfile L1.txt L2.txt > result.txt

For the dictionary file, you can just use an empty file to start with, or "/dev/null" on linux. Optionally, there are some parameters you can give hunalign that will tell it to attempt to create a dictionary file, but you'll need to manually trim it yourself to make it more useful. It can be helpful if you're going to parallelize multiple books in the same language pair, because it can lead to quicker and higher quality alignments if you use a good dictionary file (even if it's small), but it's quite possible to totally ignore the dictionary feature.
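
If you end up aligning several books, it can be handy to wrap that command in a small script. Here's a minimal Python sketch that assumes hunalign is installed and on your PATH. The function names and the file names (empty.dic, L1.txt, L2.txt, result.txt) are placeholders of mine, and the only hunalign option used is the -text one shown above.

```python
import subprocess
from pathlib import Path


def build_hunalign_command(dictionary, first_text, second_text, executable="hunalign"):
    """Build the argument list for: hunalign -text dictionary first second."""
    return [executable, "-text", str(dictionary), str(first_text), str(second_text)]


def run_hunalign(dictionary, first_text, second_text, output):
    """Run hunalign and write its standard output to the output file."""
    # Create an empty dictionary file if none exists yet (assumed to be fine
    # as a starting point, as described above).
    Path(dictionary).touch()
    command = build_hunalign_command(dictionary, first_text, second_text)
    with open(output, "w", encoding="utf-8") as out:
        # check=True raises an error if hunalign exits with a failure code.
        subprocess.run(command, stdout=out, check=True)


print(build_hunalign_command("empty.dic", "L1.txt", "L2.txt"))
```

Calling run_hunalign("empty.dic", "L1.txt", "L2.txt", "result.txt") would then do the same thing as the shell redirect.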

The output of hunalign is not exactly the best for reading, so I also use a little clean-up script that a friend wrote in python. It converts the hunalign output into a nice HTML table with color-alternating lines, which allows easy eye movement over to the matching section in the other column. Unfortunately WordPress.com will not allow me to upload text files, so here's the file content in PDF: hun2html.py. You'll also want the stylesheet to go along with it, which I'll just put here. You'll need to rename this stylesheet to lrstyle.css.

table.lrtext { width: 100%; }
th.lr2th { width: 50%; }
hr.lrhr { width: 100%; }
tr.lriffy { background-color: #EECCCC; }
tr.lriffyaltline { background-color: #EEC8FF; }
tr.lraltline { background-color: #CCE8FF; }

So, after all this converting, you may find that the aligner has incorrectly aligned some of the sentences. This is sort of to be expected, since the program really knows nothing about the languages it's aligning. It's actually just using a sophisticated matching process for the symbols it finds, and correlates them statistically. If you've converted the result to html using the script I've given above, then you may occasionally notice some lines that are highlighted in pink, which indicates that hunalign was uncertain about those matches. If you know how they should be lined up, then you can fix this by going back to the original text files and adding or subtracting line-breaks manually, and then re-running hunalign. It's not often that I have to do this, since usually the correct line is just displaced by one line up or down from where it should be, so I just keep reading.
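
In case the script isn't available to you, here's a rough stand-in sketch in Python. To be clear, this is not my friend's hun2html.py, just an illustration of the idea. It assumes that each line of the hunalign -text output has three tab-separated fields (the sentence from the first file, the sentence from the second file, and a confidence score) and that merged sentences are joined with " ~~~ ". The 0.3 cut-off for "uncertain" is a number I picked arbitrarily. The class names come from the stylesheet above.

```python
import html

# Assumed cut-off: scores below this get one of the pink "iffy" classes.
IFFY_THRESHOLD = 0.3

HEADER = (
    "<html><head>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
    '<link rel="stylesheet" type="text/css" href="lrstyle.css">\n'
    "</head><body>\n"
    '<table class="lrtext">\n'
)
FOOTER = "</table>\n</body></html>\n"


def row_class(row_number, score):
    """Pick a row class from lrstyle.css (empty string means a plain row)."""
    alternate = row_number % 2 == 1
    if score < IFFY_THRESHOLD:
        return "lriffyaltline" if alternate else "lriffy"
    return "lraltline" if alternate else ""


def hunalign_to_html(lines):
    """Turn hunalign -text output lines into an HTML page with a two-column table."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip anything that doesn't look like an alignment line
        left, right, score_text = fields
        try:
            score = float(score_text)
        except ValueError:
            continue  # skip lines whose third field is not a number
        css = row_class(len(rows), score)
        attr = f' class="{css}"' if css else ""
        cells = "".join(
            f"<td>{html.escape(cell.replace(' ~~~ ', ' '))}</td>"
            for cell in (left, right)
        )
        rows.append(f"<tr{attr}>{cells}</tr>\n")
    return HEADER + "".join(rows) + FOOTER
```

To use it, read result.txt as UTF-8, pass the lines to hunalign_to_html, and write the returned string to an .html file sitting next to lrstyle.css. The charset line in the header is there so that accented characters display properly.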

Finally, here's a sample of the output from one of the parallel texts I've made for myself.

Hoofdstuk 1 | Chapter 1
De heer van het duister regeert | The Dark Lord Ascending
De twee mannen verschenen uit het niets, slechts enkele meters van elkaar verwijderd, op een smal, maanverlicht weggetje. | The two men appeared out of nowhere, a few yards apart in the narrow, moonlit lane.
Een paar tellen bleven ze roerloos staan, met hun toverstok op de borst van de ander gericht; | For a second they stood quite still, wands directed at each other’s chests;
toen herkenden ze elkaar, stopten hun stok onder hun mantel en liepen haastig in dezelfde richting. | then, recognizing each other, they stowed their wands beneath their cloaks and started walking briskly in the same direction.

Ok, that's pretty much it for now. At some point in the future I hope to write a small program that will automate all these steps so that anyone can do it, but for now I'm too busy. Perhaps some time during the summer. Until then, try giving hunalign a shot. There are also some other programs around, I've heard, that use hunalign internally and are a bit more user-friendly. You might want to take a look for those if you are having problems.

Enjoy your parallel texts!


8 Responses to How to create parallel texts for language learning, part 2

  1. hpp says:

    Great article, thank you!
    What about “Mr.”, “Mrs.”, “Dr.”, “Vd.”, “Vds.” – these will be misinterpreted as sentence endings, but normally aren’t.

    • doviende says:

      Ya, if the text includes those, then I manually adjust. If I search for \(Mrs?\.\)^Q^J and replace with \1 , then it’ll remove the extra line-break.

  2. Claudie says:

    Thank you so much for this great post! It’s definitely going to be quite useful, will try it out soon (hoping that Arabic/English will work with it).

  3. Hector says:

    After compiling hunalign and installing emacs, it took about 5 minutes to produce a parallel version German-Spanish of the hitchhikers guide to the galaxy! And the alignment is nearly perfect. Thanks a lot!!!

  4. Hector says:

    I have been thinking about using hunalign for Japanese… there are programs that know more or less where the boundaries between Japanese words are (for example rikai web or the rikaichan plugin somehow know the word separation). Theoretically it could be possible to
    1) parse the Japanese text to produce a text with spaces after the words,
    2) feed that into hunalign (if it handles UTF-8 or whatever the required encoding is) and then
    3) remove the spaces from the aligned version.
    And the result wouldn’t be too bad, even if the separation in step 1 is approximate.

  5. Crush says:

    The python file isn’t compatible with the new changes in Python 3, so here’s an updated version which should hopefully work under the new version:
    http://www.mediafire.com/?g7e8aa4w45u5tlm

    Thanks for the article, the texts are certainly not perfect as there’s still a lot of “post-editing” to be done, but it saves a lot of effort :)

  6. Col says:

    Useful Python script, but accented characters are mangled (at least in Firefox on Linux), which is not so good for many European languages. To fix this the charset must be specified as UTF-8 in the html header, by changing the third line of the script to:

    meta = ”

    (all on one line).

    • Col says:

      Well, my new third line got obliterated by WordPress and I don’t know how to make it show what I typed, but basically it needs a space followed by

      charset=UTF-8

      immediately after

      text/html;
