CMIR-2024

Code-Mixed Information Retrieval

Overview

Code-mixing, the blending of lexical items and grammatical features from multiple languages within a single sentence, is prevalent worldwide. With the rise of online social networking, many users write their native languages in non-native scripts. In India, for instance, people often write their native languages in Roman script on social media. This is especially true of migrants, who form online communities to share information and experiences relevant to their needs.

For example, Bengali speakers from West Bengal who migrate to cities like Delhi or Bangalore create groups like "Bengali in Delhi" on platforms such as Facebook and WhatsApp. Members seek advice on various local issues, and these groups became crucial during the COVID-19 pandemic for sharing experiences and navigating frequently changing government guidelines.

These conversations typically involve code-mixed text, with users employing informal, colloquial language often transliterated into Roman script. This lack of standardization makes it difficult to identify and highlight relevant answers within these discussions, particularly for those seeking similar information later.

Our task aims to develop a mechanism to pinpoint the most relevant answers in these code-mixed conversations. The focus is on Roman-transliterated Bengali mixed with English.

To address this, we have collected and annotated a dataset of queries and documents from Facebook, creating query relevance files (QRels).

Task

Track participants will develop retrieval systems that, given a query in Bengali-English code-mixed language, return documents in the same code-mixed language. Retrieval is performed at the document level, and queries are presented as natural language questions. A document is considered relevant if it contains an answer to the question. More details about the training and test sets can be found in the Dataset section.
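
As a rough illustration of document-level retrieval for this kind of task, the following sketch ranks documents against a query with BM25, a standard lexical baseline. It is not the track's official baseline or data format: the tokenizer, the function names (tokenize, bm25_rank), the parameter values, and the three toy documents are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on non-alphanumeric characters.
    # Roman-transliterated Bengali has no standard spelling, so a real
    # system would likely need spelling normalization on top of this.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return (doc_id, score) pairs sorted by descending BM25 score."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy documents and query, invented for this example only.
docs = {
    "d1": "Delhi te bhalo bangali restaurant CR Park e pabe",
    "d2": "Bangalore e covid test kothay hoy keu janen",
    "d3": "CR Park er market e macher dokan ache",
}
query = "Delhi te bangali restaurant kothay pabo"

ranked = bm25_rank(query, docs)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

A purely lexical match like this treats spelling variants of the same word (for example "kothay" and "kothai") as different terms, which is one reason non-standardized transliteration makes the task hard.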