Launched in February 2015, the TraMOOC project aims at providing reliable machine translation for Massive Open Online Courses (MOOCs) from English to eleven European and BRIC languages. As the Languages & The Media conference is fast approaching, we took the opportunity to catch up with speaker Yota Georgakopoulou to chat about the TraMOOC project she will be addressing at this year’s event.
Why is a special SMT system needed for these educational materials as opposed to one that is already available?
Yota Georgakopoulou: Commercial systems tend to be general purpose, and machine-translation (MT) solutions that are adapted to a specific domain generally perform better on that domain. The TraMOOC MT systems are adapted to the educational domain, made to cope with the multi-genre nature of Massive Open Online Courses (MOOCs), and the TraMOOC platform will provide easy ways to integrate the translation service into MOOCs. Also, TraMOOC targets language pairs for which the current MT infrastructure is considered weak or fragmentary, and the project will contribute new resources – obtained via crawling, bootstrapping, and crowdsourcing – for these language pairs.
What innovations will this project bring to MT?
Yota Georgakopoulou: The TraMOOC project is innovative on many levels. The TraMOOC MT systems are based on a new technology, neural machine translation (NMT), which brings big quality improvements over conventional phrase-based statistical machine translation (PBSMT) systems. At the shared translation task of WMT16, MT systems developed within the TraMOOC project performed best for language pairs such as English-German and English-Czech, outperforming both research and commercial systems. Other innovations include the bootstrapping of resources for low-resource language pairs, new evaluation schemata and metrics for translation quality, and the use of crowdsourcing for data collection and evaluation.
In particular, in-domain parallel and monolingual corpora for all of the project’s eleven language pairs are developed via extensive crawling, bootstrapping, and crowdsourcing. The corpora belong to the educational domain and cover a wide range of course subjects and text genres, e.g. lecture subtitles, slides, assignments, quizzes, and forum texts. Microtasking quality assurance methods and the intervention of translation experts will ensure a high level of data quality. According to the TraMOOC Data Management Plan, the developed infrastructure will be made publicly available after the end of the project.
How important is crowdsourcing in regard to TraMOOC, and how do you view the phenomenon in general?
Yota Georgakopoulou: TraMOOC is the first large-scale EU project that includes research in the scientific area of crowdsourcing. The EU has expressly shown its interest in crowdsourcing by including it in calls for tenders in the last couple of years, for instance as a means for increasing circulation of European audiovisual works in a cost-effective manner. The phenomenon of crowdsourcing in general has been attracting increasing attention, as the use of organised crowd labour has spread to include not only routine, but also content-creation tasks and those that require expertise. Crowdsourcing has been applied to transcription, translation, and subtitling as well, and it is a practice frequently used by computational linguists who need to produce large amounts of in-domain parallel translation corpora for the training and tuning of MT systems when such corpora are not otherwise available. In the TraMOOC project, we plan to carry out three distinct crowdsourcing activities: one on parallel translation corpora creation, another on MT evaluation, and a third on crowdsourced wikification activities. The results of the latter will be used in novel translation evaluation schemata for implicit MT evaluation via topic and entity annotation.
How do you ensure the quality and consistency of your finished translations?
Yota Georgakopoulou: In TraMOOC, the quality of the translations is evaluated via a series of assessments that are state of the art in the MT-evaluation field. Such evaluation is carried out during the development phase of MT systems to compare which systems are producing the best translations (in our case the comparison is between PBSMT and NMT) and to identify what changes can be made to the MT system to improve the translation output, as well as to assess the quality of the final translation produced by the MT system. The assessment involves the evaluation of translations by professional translators, who are able to identify the types of errors the translations contain and to rate their fluency and adequacy. Additionally, we calculate the post-editing effort that is required to bring the MT output up to ‘publishable’ quality. MT-quality evaluation is also carried out during the development phase by the general crowd in terms of adequacy and fluency, as well as error markup, with a view to assessing how the crowd perceives the quality of the translations. We are also considering a usability evaluation with MOOC end users, who will answer comprehensibility questions about the content of the courses they attend in order to identify how usable and understandable the translations are for them.
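Editor's illustration: the interview does not say which formula TraMOOC uses to calculate post-editing effort. One common approximation is to count the word-level edits that separate the raw MT output from its post-edited version and divide by the length of the post-edited text. The sketch below assumes whitespace tokenisation and equal cost for insertions, deletions, and substitutions; the function names `word_edit_distance` and `post_edit_rate` are hypothetical and are not taken from the project.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    # prev[j] holds the distance between the tokens of hyp seen so far and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a word from the MT output
                            curr[j - 1] + 1,      # insert a word into the MT output
                            prev[j - 1] + cost))  # keep or substitute a word
        prev = curr
    return prev[-1]


def post_edit_rate(mt_output, post_edited):
    """Edits per post-edited word; 0.0 means the MT output needed no changes.

    Assumption: plain whitespace tokenisation, no case folding, no shift edits.
    """
    hyp = mt_output.split()
    ref = post_edited.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return word_edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    # Made-up example sentences, not project data.
    print(post_edit_rate("the cat sat on mat", "the cat sat on the mat"))
```

A lower rate suggests that less editing was needed. Real evaluations typically also record editing time and keystrokes, which this sketch ignores.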
Furthermore, in the project we introduce novel translation-evaluation schemata in the form of implicit evaluation. This is done via topic detection of the translations provided by our MT systems. We make use of the fact that most Wikipedia pages have translations in many languages. We aim to detect and compare topical information elements (named entities, events, specific terms) in source and target documents and assess whether they match or not by linking them to and checking their corresponding Wikipedia pages. At a later stage of the project, after the system web service is launched, automatic sentiment analysis on forum text uploaded by MOOC end users will be offered to reveal their opinion about the translation quality and will be used as another indirect means of translation evaluation.
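Editor's illustration: the Wikipedia-based comparison described above can be pictured roughly as follows. This is a minimal sketch, not the project's implementation. It assumes that entity linking has already mapped the topical elements of the source text to English Wikipedia page titles and those of the translation to page titles in the target language, and that a table of interlanguage links is available. The function name `entity_match_rate`, the `interlanguage_links` dictionary, and the sample titles are all hypothetical.

```python
def entity_match_rate(source_titles, target_titles, interlanguage_links):
    """Share of source-side Wikipedia pages whose counterpart appears in the translation.

    source_titles: English page titles linked from the source text.
    target_titles: target-language page titles linked from the MT output.
    interlanguage_links: dict mapping an English title to its target-language title
        (an assumed, pre-built lookup table; no network access is performed here).
    """
    source = set(source_titles)
    if not source:
        return 1.0  # nothing to match, so nothing was lost
    target = set(target_titles)
    matched = {title for title in source
               if interlanguage_links.get(title) in target}
    return len(matched) / len(source)


if __name__ == "__main__":
    # Made-up English-to-German link table and entity lists for illustration.
    links = {
        "Machine translation": "Maschinelle Übersetzung",
        "Neural network": "Neuronales Netz",
    }
    source_entities = ["Machine translation", "Neural network"]
    target_entities = ["Maschinelle Übersetzung"]
    print(entity_match_rate(source_entities, target_entities, links))
```

A rate well below 1.0 would hint that topical elements were lost or mistranslated, without anyone having to judge the translation directly.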
Given the many resources available for the production of online learning tools, what makes this project necessary?
Yota Georgakopoulou: MOOCs have been growing rapidly in size and impact in recent years. Despite the fact that the majority of MOOC students are non-native English speakers, the vast majority of these courses are provided in English, rendering them inaccessible to those who do not speak the language. TraMOOC aims at removing the language barrier that hinders MOOCs from reaching out to all people and making education available to them by developing high-quality machine translation for all types of text genre included in MOOCs (e.g. assignments, tests, presentations, lecture subtitles, blog texts) from English into eleven European and BRIC languages (DE, IT, PT, EL, NL, CS, BG, HR, PL, RU, and ZH). These languages have been chosen because they constitute strong use cases for MOOCs, have proven difficult to translate into with previous MT solutions, and have weak or fragmentary MT support. The project introduces novel translation-evaluation schemata that add value to existing tools and resources in the linguistics, natural language processing, text analytics, data mining, and MT scientific communities. Finally, the core of the TraMOOC service will be open source, with some premium add-on services to be commercialised. Being open source, the platform will enable the integration of any MT solution in the educational domain for any language, thus ensuring scalability and sustainability in the Digital Single Market, and maximising the business opportunities for European educational companies at the same time.
What are the main languages in which MOOCs are offered, and in which languages are they relatively infrequent?
Yota Georgakopoulou: As already mentioned, the majority of MOOCs are delivered in English, though the majority of the students attending MOOCs are non-native speakers of the language. This makes the project a very strong case for the European Commission’s focus on open education, as well as for the uptake of the TraMOOC software by educational stakeholders. The most frequent languages among MOOC students are Spanish, French, Italian, German, and Portuguese, with an exceptional rise in Chinese, Arabic, and Russian. The TraMOOC project addresses several of these languages on the basis of criteria that have to do with the lack of available MT support; ensures the inclusion of a large number of under-resourced European languages; and tackles eleven language pairs in total, all for translation out of English.
Localisation is an important part of making translated educational materials engaging for the user. Will the translated online materials take account of this?
Yota Georgakopoulou: Achieving high-quality MT output is a primary goal for us in the TraMOOC project. As already outlined, extensive MT evaluation activities are planned in the project to assess the output of the MT systems we are working on, both explicitly (by automated means, as well as by an expert and general crowd) and implicitly through topic and sentiment analysis. Our aim is to make MOOC content comprehensible to end users of the eleven languages we are tackling in the project; the usability of the machine-translated text is the measure of our success.
Premium add-on services will also be added to the TraMOOC platform, such as a tailor-made MT service for MOOC providers, organisations, foundations, universities, governments, and companies that wish to offer training to their employees, as well as the possibility to add languages other than the ones covered in the project. Translation post-editing services for the text produced by the TraMOOC MT service will also be offered for customers who want to localise their content to a professional quality level.
Yota Georgakopoulou will be speaking on Crowdsourcing for the Creation of Parallel Translation Corpora: The Case of TraMOOC at the Languages & The Media conference taking place on November 2-4 in Berlin. See the full programme here.