[This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers].



As part of rOpenSci’s multilingual publishing project, we have been developing the babeldown R package, for translating Markdown-based content using the DeepL API.

In a previous tech note we demonstrated the use of babeldown for translating a blog post in a workflow supported by Git.
Here we use babeldown for translating living documents, such as our developer’s guide.
In this case, translations not only need to be created at the time of first writing, but also updated as the document changes over time.

In this tech note, we’ll show how you can use babeldown to update a translation after you’ve edited a document.

Initial situation: an English document and its French translation

Let’s assume we have an English document called bla.md.

dir <- withr::local_tempdir()
file <- file.path(dir, "bla.md")
fs::file_create(file)
english_text <- c("# header", "", "this is some text", "", "## subtitle", "", "nice!")
brio::write_lines(english_text, file)

# header
this is some text
## subtitle
nice!

We have already translated it with babeldown, which provides us with an AI-based translation from DeepL, then edited the translation manually to provide the context the AI missed.

Sys.setenv("DEEPL_API_URL" = "https://api.deepl.com")
Sys.setenv(DEEPL_API_KEY = keyring::key_get("deepl"))

out_file <- file.path(dir, "bla.fr.md")
babeldown::deepl_translate(
 path = file,
 out_path = out_file,
 source_lang = "EN",
 target_lang = "FR",
 formality = "less",
 yaml_fields = NULL
)

Here’s the French text:

# titre
ceci est du texte
## sous-titre
chouette !

At this stage let’s set up the Git infrastructure for the folder containing the two documents.
In real life, we might already have it in place.
The important thing is to start tracking changes before we edit the English document again.

gert::git_init(dir)
gert::git_config_set("user.name", "Jane Doe", repo = dir)
gert::git_config_set("user.email", "jane@example.com", repo = dir)
gert::git_add(c(fs::path_file(file), fs::path_file(out_file)), repo = dir)

 file status staged
1 bla.fr.md new TRUE
2 bla.md new TRUE

gert::git_commit_all("First commit", repo = dir)

[1] "5b7ae61fb72bd89ee912889207efbce5e662c405"

gert::git_log(repo = dir)

                                    commit                      author
1 5b7ae61fb72bd89ee912889207efbce5e662c405 Jane Doe <jane@example.com>
                 time files merge        message
1 2024-01-16 15:59:49     2 FALSE First commit\n

Changing the English document

Now imagine we change the English document.

new_english_text <- c("# a title", "", "this is some text", "", "awesome", "", "## subtitle", "")
brio::write_lines(
 new_english_text,
 file
)
gert::git_add(fs::path_file(file), repo = dir)

 file status staged
1 bla.md modified TRUE

gert::git_commit("Second commit", repo = dir)

[1] "b398bf63c6c86cb3817d88e40f47afde72158e7a"

# a title
this is some text
awesome
## subtitle

Updating the translation

We don’t want to send the whole document to the DeepL API again!
Indeed, we do not want the text fragments that haven’t changed to be re-translated, as we would lose the improvements from careful work by human translators.
Furthermore, if we were to send all the text to the API again, we’d be spending unnecessary money (or free credits).

Fortunately we have two babeldown functions at our disposal:

  • babeldown::deepl_translate_markdown_string(), which sends an individual string for translation. We could copy-and-paste the changed text into this function. We won’t show this approach here.
  • babeldown::deepl_update(), which operates more automatically by sending the lines or blocks of text that have changed for translation. This may be more text than needed, as it will send a whole paragraph to the DeepL API if it changed, even if only a single sentence or less changed.

Sys.setenv("DEEPL_API_URL" = "https://api.deepl.com")
Sys.setenv(DEEPL_API_KEY = keyring::key_get("deepl"))
babeldown::deepl_update(
 path = file,
 out_path = out_file,
 source_lang = "EN",
 target_lang = "FR",
 formality = "less",
 yaml_fields = NULL
)

Let’s look at the new French document:

# un titre
ceci est du texte
génial
## sous-titre

One would then carefully look at the Git diff to ensure only what was needed was changed, then commit the automatic translation.
That translation should then be reviewed by a human. For our multilingual work at rOpenSci, a translator (native speaker) reviews all our patches for consistency, tone, and context.
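
A minimal sketch of that diff check with gert, under stated assumptions: the temporary repository stands in for a real project, and the hand-written "updated" French text stands in for what babeldown::deepl_update() would write. Only the gert::git_diff() call is the point here.

```r
# Throwaway repository, configured like the one earlier in this post
repo <- tempfile("babeldown-diff-")
dir.create(repo)
gert::git_init(repo)
gert::git_config_set("user.name", "Jane Doe", repo = repo)
gert::git_config_set("user.email", "jane@example.com", repo = repo)

# Commit an initial French translation
fr_file <- file.path(repo, "bla.fr.md")
writeLines(c("# titre", "", "ceci est du texte"), fr_file)
gert::git_add("bla.fr.md", repo = repo)
gert::git_commit("Add French translation", repo = repo)

# Assumption: stand-in for the file as rewritten by babeldown::deepl_update()
writeLines(c("# un titre", "", "ceci est du texte"), fr_file)

# With no ref, git_diff() lists the not-yet-staged changes in the working directory
changes <- gert::git_diff(repo = repo)
cat(changes$patch)
```

The patch should touch only the heading line. If it also touched the untouched paragraph, that would be a sign that more text than expected was re-translated.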

You can also find an example of babeldown::deepl_update() in a pull request: the first two commits update the English document, the third one uses the function to update the Spanish document.

How babeldown::deepl_update() works under the hood

Contrary to what one might guess, babeldown::deepl_update() doesn’t use the Git diff at all, although that definitely was the first idea we explored.

babeldown::deepl_update() does scour the Git log to find the snapshot of the main language version that was in sync with the translation.
This is the “old English document” that goes with the “old French document”.
The old English document is the English document as it was the last time the French document was featured in a Git commit.
The old French document is the French document as it was in that same commit snapshot.
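
The lookup described above can be sketched with gert. This is an illustration under assumptions, not babeldown's actual implementation: the helpers commit_files() and last_commit_touching() are hypothetical, and the three-commit history is invented so that the translation was last committed in the second commit.

```r
repo <- tempfile("babeldown-log-")
dir.create(repo)
gert::git_init(repo)
gert::git_config_set("user.name", "Jane Doe", repo = repo)
gert::git_config_set("user.email", "jane@example.com", repo = repo)

# Hypothetical helper: write the named files, stage them, commit, return the commit id
commit_files <- function(files, message) {
  for (name in names(files)) writeLines(files[[name]], file.path(repo, name))
  gert::git_add(names(files), repo = repo)
  gert::git_commit(message, repo = repo)
}

first <- commit_files(
  list("bla.md" = "# header", "bla.fr.md" = "# titre"),
  "Add document and translation"
)
second <- commit_files(
  list("bla.md" = "# a header", "bla.fr.md" = "# un titre"),
  "Update document and translation"
)
third <- commit_files(list("bla.md" = "# a title"), "Update English document only")

# Hypothetical helper: walk the log from newest to oldest and return the
# first commit whose diff includes `file`
last_commit_touching <- function(file, repo) {
  for (commit in gert::git_log(repo = repo)$commit) {
    if (file %in% gert::git_diff(commit, repo = repo)$new) {
      return(commit)
    }
  }
  NA_character_
}

synced_commit <- last_commit_touching("bla.fr.md", repo)
synced_commit == second
```

The versions of bla.md and bla.fr.md in that commit are the in-sync pair: the old English document and the old French document.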

We have the “new English document” and what’s missing is the “new French document”.
We want that new French document to use as much as possible of the old French document, only using an automatic translation for the parts that are new.

The function uses an XML representation of the documents, as created by tinkr.
A necessary condition for using babeldown::deepl_update() is that the old English document and the old French document have the same XML structure: say, one heading followed by two paragraphs then a list.
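
A rough way to check that condition yourself is sketched below. It uses commonmark::markdown_xml(), which tinkr builds on, together with xml2. The helper block_types() is hypothetical and compares only the top-level blocks, which is a simplification of whatever check babeldown performs.

```r
library(xml2)

# Hypothetical helper: tag names of the top-level blocks of a Markdown document
block_types <- function(lines) {
  xml <- commonmark::markdown_xml(paste(lines, collapse = "\n"))
  xml_name(xml_children(read_xml(xml)))
}

english <- c("# header", "", "this is some text", "", "## subtitle", "", "nice!")
french <- c("# titre", "", "ceci est du texte", "", "## sous-titre", "", "chouette !")
# A translation where the last paragraph was merged into the subtitle
french_merged <- c("# titre", "", "ceci est du texte", "", "## sous-titre : chouette !")

structure_ok <- identical(block_types(english), block_types(french))
structure_broken <- identical(block_types(english), block_types(french_merged))

block_types(english)
structure_ok
structure_broken
```

Here the first French version has the same heading, paragraph, heading, paragraph sequence as the English one, while the merged version lacks the final paragraph and so would not qualify.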

For each child of the body of the new English document (a paragraph, a list, a heading…), babeldown::deepl_update() tries to find the same tag in the old English document (identified by having the same xml2::xml_text() and children of the same type).
If it finds the same tag, it uses the tag located at the same position in the old French document.
If it does not find a match, it sends the new tag to the DeepL API for translation.
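
The matching step can be sketched on the documents from this post without calling DeepL. This is a simplified illustration, not babeldown's code: md_blocks(), block_keys() and the plan data frame are hypothetical, and the documents are parsed with commonmark::markdown_xml() (which tinkr builds on) rather than with tinkr itself.

```r
library(xml2)

# Top-level blocks of a Markdown document, via CommonMark's XML representation
md_blocks <- function(lines) {
  xml_children(read_xml(commonmark::markdown_xml(paste(lines, collapse = "\n"))))
}

# Hypothetical matching key: tag name, text content and types of the children
block_keys <- function(blocks) {
  child_types <- vapply(
    seq_along(blocks),
    function(i) paste(xml_name(xml_children(blocks[[i]])), collapse = ","),
    character(1)
  )
  paste(xml_name(blocks), xml_text(blocks), child_types, sep = "|")
}

old_en <- md_blocks(c("# header", "", "this is some text", "", "## subtitle", "", "nice!"))
old_fr <- md_blocks(c("# titre", "", "ceci est du texte", "", "## sous-titre", "", "chouette !"))
new_en <- md_blocks(c("# a title", "", "this is some text", "", "awesome", "", "## subtitle", ""))

# Position of each new English block in the old English document (NA if absent)
positions <- match(block_keys(new_en), block_keys(old_en))

# Reuse the old French block at the matched position, otherwise mark for translation
plan <- data.frame(
  english = xml_text(new_en),
  action = ifelse(is.na(positions), "translate", "reuse"),
  french = ifelse(is.na(positions), NA_character_, xml_text(old_fr)[positions]),
  stringsAsFactors = FALSE
)
print(plan)
```

With these inputs, the unchanged paragraph and subtitle are reused from the old French document, while "a title" and "awesome" are the only blocks marked for translation.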

A consequence of this approach is that we find the largest matching structural block between two documents. For instance, in a list where we changed one item, the whole list would be re-translated, as opposed to only the item. However, this also means we use logical blocks, rather than fragments of text as defined by words or line breaks.
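
The list case can be seen in the XML itself. The sketch below again assumes commonmark::markdown_xml() plus xml2 as a stand-in for tinkr, with a hypothetical helper top_blocks() and an invented three-item list.

```r
library(xml2)

# Top-level blocks of a Markdown string, via CommonMark's XML representation
top_blocks <- function(markdown) {
  xml_children(read_xml(commonmark::markdown_xml(markdown)))
}

old_list <- top_blocks("- apples\n- pears\n- plums")
new_list <- top_blocks("- apples\n- ripe pears\n- plums")

# The whole list is a single top-level block holding three items
n_blocks <- length(old_list)
block_type <- xml_name(old_list)
n_items <- length(xml_children(old_list[[1]]))

# Editing one item changes the text of the whole block, so the block no longer matches
list_unchanged <- identical(xml_text(old_list), xml_text(new_list))

c(n_blocks = n_blocks, n_items = n_items)
block_type
list_unchanged
```

Because matching happens at the level of top-level blocks, the two unchanged items travel to DeepL along with the edited one.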

Conclusion

In this post we explained how to use babeldown to update translations of living documents.
We at rOpenSci are ourselves users of babeldown for this scenario!
Maintaining translations is time-consuming but important work.
We’d be thrilled to hear your feedback if you use babeldown::deepl_update()!

Deep Dive into babeldown: Implications, Possible Developments and How to Leverage Its Full Potential

By demonstrating rOpenSci’s babeldown R package, the tech note equips users to create and maintain translations of living documents as their text changes over time. This section breaks down the implications of the tool, its possible future developments, and how to make full use of it.

Long-Term Implications

In the long run, babeldown’s ability to send only updated blocks of text for translation will save organizations time and translation costs. It also provides a robust way to manage living documents, because changed blocks are detected automatically rather than tracked by hand. For multinational organizations that need to keep documents coherent across different languages, babeldown is a practical tool.

Possible Future Developments

As development progresses, babeldown could expand beyond its current functionality. One possibility is support for more languages. Likewise, the mechanism that detects changes could be fine-tuned to recognize smaller units of change, such as updated list items. As artificial intelligence continues to advance, DeepL’s automatic translations are likely to become even more accurate, improving the usefulness of babeldown.

How to Maximize Babeldown:

Complete Set-up Before Editing

Before editing a document, it’s crucial to set up a Git infrastructure that will allow changes to be tracked effectively. Doing this ensures that all the changes made can be easily identified and translated where necessary.

Apply deepl_translate() and deepl_update() Efficiently

babeldown’s deepl_translate() and deepl_update() are its two key functions: deepl_translate() creates the initial translation of a whole document, and deepl_update() re-translates only the blocks that have changed. Used together, they limit automatic translation to the necessary parts, which saves time and resources and preserves the human edits already made to the translation.

Assessing Git Diff

Reviewing the Git diff after running deepl_update() shows exactly which blocks were modified, so you can confirm that only the parts that needed updating were changed and that earlier manual edits were kept. This prevents unnecessary re-translation and helps maintain the coherence and context of the document.

Human Review

While Babeldown does excellent work in automatically translating edited parts of documents, human oversight is still necessary. Trained translators provide the most accurate translations in line with the context and tone of the content.

Conclusion

In today’s interconnected world where global communication is often required, rOpenSci’s babeldown package offers a strategic tool for managing living documents across multiple languages. By understanding its potential, future implications, and how to apply it effectively, organizations can enhance their communication quality while saving on resources.

The best practices mentioned above should be applied consistently if babeldown is to be used effectively. The package may also become more sophisticated as the underlying technology advances.
