Musings of the Mad Wordsmith (and other things): Open Source Corpora from Meedan

Sunday, April 04, 2010

Open Source Corpora from Meedan

I wrote about the Meedan Project just a few weeks ago. The site has been launched, and although the translations are not brilliant, they are readable and make sense.

Meedan had originally planned to use the Worldwide Lexicon (WWL) project's open source system, but it is currently using IBM's Machine Translation engine and the IBM Transbrowser, a browser-based tool for creating a translation layer on the web.

Meedan's data, a 'translation memory' of over 3 million words, is available to other translators. George Weyman, Meedan's content and community manager, says: "the translations that are done with the Transbrowser are part of our agreement with IBM that makes sure all those translations are open source."

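The 'translation memory' is important because a corpus of the same texts in two languages (a parallel corpus) lets you apply statistical techniques to improve a translation engine: the more aligned sentence pairs the engine sees, the better it can estimate which words and phrases translate to which. The whole translation memory is downloadable from http://github.com/anastaw/Meedan-Memory.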
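To give a feel for what "statistical techniques" can mean here, below is a minimal sketch of IBM Model 1, a classic textbook method for learning word translation probabilities from aligned sentence pairs. It is only an illustration: nothing in Meedan's announcement says which method their engine actually uses, the three sentence pairs are made-up toy data in transliterated Arabic (not taken from the Meedan download), and the function names `train_ibm_model1` and `best_translation` are my own.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM (IBM Model 1, no NULL word).

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict mapping (target_word, source_word) -> probability.
    """
    # Uniform start: any constant works, because the E-step normalises it.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) alignments
        total = defaultdict(float)  # expected alignments per source word
        for src, tgt in pairs:
            for f in tgt:
                # E-step: share each target word among the source words
                # of the same sentence, in proportion to the current t.
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise so the probabilities for each source word sum to 1.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return dict(t)


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get)


# Toy parallel corpus (hypothetical, transliterated), English -> Arabic.
toy_pairs = [
    (["the", "house"], ["al", "bayt"]),
    (["the", "book"], ["al", "kitab"]),
    (["a", "book"], ["kitab"]),
]

model = train_ibm_model1(toy_pairs)
for word in ["the", "house", "book"]:
    print(word, "->", best_translation(model, word))
```

Even with three sentences the model has something to work with, because "the" and "book" each appear in two different contexts. A translation memory of a few million words gives the same kind of procedure far more evidence, which is why an open corpus like this one is useful to anyone building or tuning an engine.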

