How to create parallel texts for language learning – Part 1

I’d like to say a bit about ways to make parallel texts. I consider parallel texts to be a very valuable learning resource, as I’ve mentioned in the past. They enable you to learn a language much faster than from textbooks, because they make an enormous amount of content instantly comprehensible.

Unfortunately, it’s nearly impossible to find parallel texts. The most common commercially available ones seem to be books of poetry and “classic” works of literature. Call me uncultured, but I usually get easily bored by books from the 1800s. I want something with an *interesting* plot, and I’ve been known to read a lot of fantasy and sci-fi, for which there are basically zero parallel texts commercially available. Also, the commercial ones are not usually sentence-aligned or even paragraph-aligned…at best they’re page-aligned, if that. For easy learning, you want all the little translated bits right beside each other for easy comparison.

So, for that reason, it’s more realistic to assume that you’re going to have to either make your parallel texts yourself, or get someone else to make them for you. To this end, I’ll give you a bit of info about how I do it, so that you can perhaps give it a try.

OK, first the basics. What you’re going to start with is two ebooks. I don’t care where you get them, that’s not my problem. You might find public domain works at Project Gutenberg, or maybe you buy modern ebooks from online booksellers (for example, I found some Danish ebooks and mp3 audiobooks for sale here). Or maybe you borrow them from a friend. Ideally you want a place that doesn’t sell crippled files, like the bastards at I really really want to buy a lot of their audiobooks, but I just can’t play them on my operating system due to their crippling DRM. Some places sell ebooks with DRM as well, which makes them viewable only on certain devices, and prevents you from sharing them with your neighbour. This is bad…you should help your neighbour 🙂

Anyway, back to ebooks. So you need an ebook in your target language, and another one in a language that you understand really well (hopefully your native language, if such a translation exists). The next step is that you probably want a text format version of these ebooks, since that’s much easier to process than things like PDF and EPUB. There are some software programs that will convert between several different ebook formats, but I just use a document viewer called Okular, which is able to view a PDF or EPUB and then “export to text” to give me a clean file.

Next, you need a way to align these texts. What this means is that you’re going to create a file in which the equivalent paragraphs or sentences match up with each other. For example, the one I’m currently reading has individual Dutch sentences in the left-hand column, and each sentence is matched with its English translation in the right-hand column. There are two main ways to achieve this. One is more time-consuming but technically very simple, and the other involves a bit of computer know-how but is much more time-efficient.

In this article, I’ll be describing the “easy” way, and then my next article will be for the people who know what I mean when I say things like “emacs”, “regular expressions”, and “Makefile”. You know who you are. For those who don’t recognize the software terms, but are still keen to put your growing computer skills to the test, be sure to take a look at that article when I publish it in a few days. That method requires much less manual repetitive work. But for now, the less-technical way!

First you change all empty lines (i.e. [ENTER][ENTER]) in the book to something unique (a weird character like Ĉ that doesn’t exist in that language) in order to save the paragraph breaks. Then you remove all remaining [ENTER]s from the document so it’s all one line. Now you go back and restore the paragraph breaks by changing Ĉ to [ENTER], which means each paragraph is now on a separate line. Do this for both copies of the book. This step gets rid of a bunch of [ENTER]s that were just breaking individual sentences into pieces unnecessarily. In this method, you only want the paragraphs to be divided.
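If you’d rather not do those three search-and-replace passes by hand, here is a minimal Python sketch of the same idea. It’s only an illustration, not something the people who use this method necessarily do: the function name `unwrap_paragraphs` is made up, and I’m assuming that the leftover [ENTER]s should become spaces (so that words on neighbouring lines don’t get glued together) and that Windows-style line endings should be normalized first.

```python
import re

# Any character that never occurs in the book (an assumption; pick your own).
PLACEHOLDER = "Ĉ"


def unwrap_paragraphs(text: str) -> str:
    """Return the text with one paragraph per line."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the text; choose another")

    # Normalize Windows line endings so that only "\n" remains.
    text = text.replace("\r\n", "\n")

    # Step 1: turn every run of empty lines into the placeholder.
    text = re.sub(r"\n(?:[ \t]*\n)+", PLACEHOLDER, text)

    # Step 2: the remaining line breaks only split sentences, so join them.
    # (Assumption: a space is inserted instead of deleting the break outright.)
    text = text.replace("\n", " ")

    # Step 3: restore the paragraph breaks.
    text = text.replace(PLACEHOLDER, "\n")

    # Tidy up stray whitespace and drop lines that ended up empty.
    lines = (line.strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)


sample = "First line\nof paragraph one.\n\nSecond\nparagraph.\n"
print(unwrap_paragraphs(sample))
```

You would run this once on each copy of the book and save the results as two new text files.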

Now that you have a collection of separate paragraphs, you open up a spreadsheet program (such as Gnumeric, KOffice, or maybe that famous one from Microsoft, if you’re desperate), and you create a table with two columns and one row. Paste one language into the left-hand cell, and the other language into the right-hand cell. Now you just have to make sure that each paragraph lines up with its appropriate neighbour by adding extra [ENTER]s to make them even out. Sometimes you may have to bust one paragraph into smaller ones to do that.

At the end, once you know that they all line up, you remove excess lines by changing [ENTER][ENTER] into just [ENTER] (repeating as many times as necessary), so that you have one paragraph per line. Then you copy-paste as table cells with one line per row, so the whole text of each language is still in one column. Now that each paragraph has its own row, the matching paragraphs show up beside each other!
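The final copy-paste step can also be sketched in Python. This is just one possible way to do it, under the assumption that you already have two texts with one paragraph per line and that you’re happy with a CSV file, which any spreadsheet program can open. The function names are hypothetical, and note that the script does not do the aligning for you: it only puts line 1 beside line 1, line 2 beside line 2, and so on, so you still have to check by eye that the rows really match.

```python
import csv
import io
from itertools import zip_longest


def paragraphs_to_rows(left_text: str, right_text: str) -> list:
    """Pair up the lines of two texts, padding the shorter one with blanks."""
    left = left_text.split("\n")
    right = right_text.split("\n")
    return list(zip_longest(left, right, fillvalue=""))


def write_parallel_csv(rows, out_file) -> None:
    """Write the paired paragraphs as a two-column CSV table."""
    csv.writer(out_file).writerows(rows)


# Small demonstration, written to memory instead of to a real file.
rows = paragraphs_to_rows("Een.\nTwee.", "One.\nTwo.\nThree.")
buffer = io.StringIO()
write_parallel_csv(rows, buffer)
print(buffer.getvalue())
```

For a real book you would pass a file opened with `open("parallel.csv", "w", newline="", encoding="utf-8")` instead of the in-memory buffer. A blank cell in the middle of the table is a good hint that the two texts have drifted out of step somewhere above it.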

Now, just as a disclaimer, I’ve never actually done this method myself, so you might have to experiment a bit if you get stuck. I just wanted to mention a method that doesn’t require tons of in-depth computer hackery. I heard about this method from people who have used it successfully many times, and I’ve seen the result of their work (such as a paragraph-aligned Chinese / English Harry Potter, for example), so I know it can work well for some people.

Next time I’ll elaborate on my more automated process, but first I want to try and automate it a bit more. I think I can save a couple of steps in the sentence-dividing stage by using another little script, so then I might be able to automate the whole thing from start to finish. Hopefully this will also make it a bit more accessible to others.

Until then, keep reading!


14 Responses to How to create parallel texts for language learning – Part 1

  1. WC says:

    I haven’t tried to create my own parallel text before. Do you find they usually match up fairly well?

    I would think you could always rely on the chapters lining up, almost always rely on the paragraphs, and often rely on the individual sentences.

    If the grammar is a lot different, I think it would be harder to line up, too.

    • doviende says:

      The translations that I’ve seen for modern books have been very close. There are times when two sentences get merged into one, or vice-versa, but that doesn’t hinder me at all. I’ve been very satisfied with my sentence-aligned texts that I’ve made, and they’ve been really helpful.

      For something like Chinese with English, I would guess that it might be better paragraph-aligned, but honestly I’ve never tried yet. I’m sure some other people could comment about that here.

  2. Brian says:

    I am awaiting the second post as I want to be more familiar with your more advanced method. The way that I line up texts is by using a program called LF_aligner. Seems to do the job fairly well, but is far from perfect. I am looking forward to trying your method!

  3. Judith says:

    Just a quick note: you won’t want to change all these [ENTER]’s by hand. In Microsoft Word or other office programs, one [ENTER] is represented as ^p . So you can use the search & replace function to change ^p^p to ^p for example. In other tools, [ENTER] is the equivalent of \n or (if the file is Windows-encoded) \r\n .

    I’ll release a video of this on Youtube soon.

  4. […] to create parallel texts for language learning, the original of which can be found here. The narration is in the first person, from a male point of view […]

  5. durak says:

    There are tools to align texts.
    ABBYY Aligner, hunalign, etc.

    For Japanese and Chinese, you have to do it manually, but it is fast, if you know the two languages.

    The fastest way, though, is to get somebody else to do it. I have hundreds, if not thousands, of parallel novels, the majority of them aligned by somebody else.

    Aligning is not a problem, proofreading is a HUGE problem.

    I know nothing about computers, so I do everything the old fashioned way.

    If you want to have a job done really well, do it yourself.

    Anyway, the most important thing is how fast you read in your native language (between the lines included) and how fast you process what you hear.
    Parallel texts are very useful at the beginning, but after a while, they are not necessary, you just read the L1 text and listen to the L2 recording, even the L2 text is not necessary then.
    No scanning, no proofreading, no aligning! Just a book and a recording. What a relief!

    For Japanese and Chinese, parallel e-texts are a must, though, or else you’ll get stuck. And they are EXTREMELY useful to learn kanji and hanzi.

  6. Max says:

    @ author and commenters: You guys wouldn’t happen to remember where you came across those Chinese/English parallel texts? 😀

  7. Juzekk says:

    I just found a nice tool that does everything quickly and quite well: based on the hunalign mentioned above.

  8. Andrew says:

    Are there places where people share and trade their own parallel texts? You mentioned the English/Chinese Harry Potter one, I’m thinking there just must be a website or torrent tracker where people exchange these things.


  9. Judith says:

    has some, if you click on “Library”. I don’t think anyone will openly post Harry Potter parallel texts on such a website though (even though people commonly create these for their own use) because copyright enforcers are quick to warn people posting the text of the Harry Potter novels online. Best figure out who might have the parallel texts you’re looking for and then arrange for a private exchange.

  10. Aaron says:

    I love the idea of parallel texts, but this seems like a lot of work. Especially for someone like me who isn’t too technically adept. What about just buying the English and the target language copy of your favorite book and reading a chapter at a time? They are not side by side, but it’s easy enough to read a bit in one, and then skip back to read the other.

    I do like the idea of a community generated library of texts though that Andrew mentions. I’ll have to check out Judith’s site.

    • durak says:

      You won’t be able to tell the difference until you actually try vertically aligned parallel novels with good audio. Add to that a mouse-over pop-up dictionary (and line-by-line audio).

  11. McSalty says:

    Something worth looking into is using Amazon’s Mechanical Turk (mturk) for laborious time-intensive tasks like this. Basically how mechanical turk works is you post a task and how much you’re willing to pay for it ($0.01 to $10). The thousands of people who use the website to find work then scan through tasks, and decide if it’s worth the price to complete it. It’s a really cheap way to outsource simple tasks.

    I’d be willing to pay a couple bucks just to have the bulk of this work done for me.
