Commonly Used Cross-Language Information Retrieval (CLIR) Techniques

To let users retrieve information from documents written in a language other than that of the query, cross-language information retrieval (CLIR) combines several techniques. Most of the commonly used ones rely heavily on translation. According to the type of translation resource they use, the most common CLIR techniques can be roughly divided into the following:

Dictionary-based.

This is the simplest technique: it uses a bilingual dictionary, typically to translate the query terms, so that information can be retrieved in a language other than the one used for the query. Despite its simplicity, it remains one of the most effective strategies and is implemented in one form or another by the majority of CLIR systems. It has a few serious drawbacks, however, most notably that many words have more than one meaning and therefore more than one possible translation, which raises the question of accuracy.
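The dictionary-based approach and its ambiguity problem can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy English-to-Spanish dictionary, the three documents, and the function names `translate_query` and `search` are hypothetical, and ranking by simple term overlap stands in for whatever scoring a real system would use.

```python
# Toy English -> Spanish dictionary (hypothetical entries for illustration).
BILINGUAL_DICT = {
    "bank": ["banco", "orilla"],  # financial institution vs. river bank
    "river": ["río"],
    "loan": ["préstamo"],
}

# Tiny Spanish document collection (hypothetical).
DOCUMENTS = {
    "d1": "el banco aprobó el préstamo",
    "d2": "la orilla del río",
    "d3": "el tiempo en madrid",
}


def translate_query(query, dictionary):
    """Replace each query term with all of its dictionary translations."""
    translated = []
    for term in query.lower().split():
        # Terms missing from the dictionary are kept unchanged.
        translated.extend(dictionary.get(term, [term]))
    return translated


def search(query, documents, dictionary):
    """Rank documents by how many translated query terms they contain."""
    terms = set(translate_query(query, dictionary))
    scored = []
    for doc_id, text in documents.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first, document id as a tie-breaker.
    return [doc_id for overlap, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


print(translate_query("bank loan", BILINGUAL_DICT))
print(search("bank loan", DOCUMENTS, BILINGUAL_DICT))
```

In this sketch the query "bank loan" also matches the document about a river bank, because the dictionary cannot tell which sense of "bank" was intended. That is the accuracy problem described above.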

Parallel corpora.

This is a very effective and reliable technique, because the information is drawn from so-called parallel corpora: collections of the same text that has already been translated into two or more languages. Since the translation has already been done, the technique avoids the risk of mistranslation and other problems associated with the dictionary-based technique. However, the information retrievable this way is limited to what the parallel corpora cover.
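One simple way to picture retrieval over a parallel corpus is to match the query against the side written in the query language and return the aligned text in the other language. The sketch below assumes a sentence-aligned English and French corpus; the sample pairs and the function name `retrieve_aligned` are hypothetical, and term overlap is only a stand-in for a real ranking function.

```python
# Hypothetical sentence-aligned parallel corpus: (English, French) pairs.
PARALLEL_CORPUS = [
    ("the committee approved the budget", "le comité a approuvé le budget"),
    ("the river flooded the village", "la rivière a inondé le village"),
    ("the budget was rejected", "le budget a été rejeté"),
]


def retrieve_aligned(query, corpus):
    """Match the query on the source side and return the aligned target texts."""
    terms = set(query.lower().split())
    hits = []
    for source, target in corpus:
        overlap = len(terms & set(source.lower().split()))
        if overlap:
            hits.append((overlap, target))
    # Highest overlap first; sorted() is stable, so ties keep corpus order.
    return [target for overlap, target in sorted(hits, key=lambda h: -h[0])]


for text in retrieve_aligned("budget approved", PARALLEL_CORPUS):
    print(text)
```

No translation happens at query time, which is why mistranslation is not an issue here. The limitation is also visible: a query about anything outside the corpus returns nothing.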

Comparable corpora.

Another commonly used technique is very similar to the one based on parallel corpora; the only difference is that it relies on comparable corpora. These contain texts in multiple languages that are not translations of one another but deal with the same subject. As a result, their vocabulary corresponds more or less closely across languages.
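Because comparable texts share a subject rather than a sentence-by-sentence translation, one possible way to exploit them is to look for target-language terms that tend to appear in the documents paired with those containing a given query term. The sketch below is only one such approach, not a method described in this article: it scores candidates with the Dice coefficient over topic-aligned document pairs. The sample pairs and the function name `related_terms` are hypothetical.

```python
from collections import Counter

# Hypothetical topic-aligned (comparable) document pairs: English and Spanish
# texts about the same event, not translations of one another.
COMPARABLE_PAIRS = [
    ("earthquake damage in the city", "terremoto causa daños en la ciudad"),
    ("earthquake relief reaches city", "ayuda tras el terremoto"),
    ("city election results", "resultados de las elecciones en la ciudad"),
]


def related_terms(source_term, pairs, top_n=1):
    """Rank target-language terms by Dice association with source_term."""
    source_df = 0              # pairs whose source side contains source_term
    target_df = Counter()      # pairs whose target side contains each term
    co_occurrence = Counter()  # pairs containing both source_term and the term
    for source, target in pairs:
        target_terms = set(target.split())
        target_df.update(target_terms)
        if source_term in source.split():
            source_df += 1
            co_occurrence.update(target_terms)
    # Dice coefficient: 2 * |both| / (|source| + |target|).
    scores = {
        term: 2 * count / (source_df + target_df[term])
        for term, count in co_occurrence.items()
    }
    # Best score first, alphabetical order as a tie-breaker.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


print(related_terms("earthquake", COMPARABLE_PAIRS))
```

With so little data the association scores are noisy, and function words such as "en" or "la" score nearly as well as content words for some terms. A realistic setup would need far larger corpora and some filtering of such words.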

Machine translation.

Although it has its drawbacks, machine translation is very useful and quite reliable, provided that it is used properly. In recent years, machine translation programmes have become much more accurate than they were just a decade ago, but they are still not accurate enough to eliminate the need for human translation.