Multilingual Web Retrieval

Multilingual web retrieval has attracted considerable attention for some time, especially since it became clear that the amount of non-English information on the World Wide Web will continue to grow. At the time of writing, more than 50 percent of the information on the web is in English. However, the proportion of English-language websites and documents accessible via the Internet has been declining steadily since the end of the 1990s, when about 80 percent of online content was in English, even though the percentage of web users speaking English as their first language was far lower. The rapid growth of both non-English online content and the number of web users now presents a major challenge for information retrieval.

Traditional Multilingual Information Retrieval (MLIR) Techniques Are Inappropriate

Accessing and retrieving information in multiple languages from the web requires a different approach from the one used in traditional multilingual information retrieval (MLIR). Many MLIR techniques have proven very useful for multilingual web retrieval and have therefore been incorporated into most systems that focus on web retrieval. However, they cannot be applied directly to the World Wide Web without adaptation because of the unique properties of web content.

First, web information appears not only in different languages but also in different formats, e.g. PDF, HTML, PHP-generated pages, etc. Multilingual web retrieval therefore also has to address the issue of document format. Second, web pages are not static. On the contrary, they tend to change, and often quite rapidly. Third, many websites favor browsing over asking questions or submitting a query. Last but not least, information on the World Wide Web is typically accessed via search engines. These, however, provide access only to web content that has been indexed. As a result, there is a risk that the user cannot access potentially very important information.
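The first two properties (mixed document formats and rapidly changing pages) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a description of how any particular retrieval system works: the FORMAT_HANDLERS table and the function names pick_handler, fingerprint and has_changed are hypothetical and introduced here purely for the example.

```python
import hashlib

# Hypothetical mapping from MIME type to a parser label. A real system
# would plug in actual HTML, PDF and plain-text extractors here.
FORMAT_HANDLERS = {
    "text/html": "html",
    "application/pdf": "pdf",
    "text/plain": "text",
}


def pick_handler(content_type: str) -> str:
    """Choose a parser label from a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return FORMAT_HANDLERS.get(mime, "unknown")


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 digest of the raw page body."""
    return hashlib.sha256(body).hexdigest()


def has_changed(previous_fingerprint, body: bytes) -> bool:
    """Report whether the page body differs from the stored fingerprint.

    A previous_fingerprint of None (page never seen) counts as changed.
    """
    return previous_fingerprint != fingerprint(body)
```

A crawler built along these lines would call pick_handler on each response to decide how to extract text, and would re-index a page only when has_changed reports a new fingerprint. Hashing the whole body is the simplest possible change test; it also fires on trivial edits such as a rotating banner or timestamp, so it is a starting point rather than a complete solution.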

Dramatic Progress in Multilingual Web Retrieval but Many Challenges Remain

The need to access and retrieve information in non-English languages has prompted increased interest in multilingual web retrieval within the MLIR research community. The result is dramatic progress in the techniques used to access, collect and store multilingual information from the web. It is now easier than ever to retrieve information in different languages. However, many challenges remain.