Task To Classify

Competition link: https://www.kaggle.com/c/contradictory-my-dear-watson/overview

You are given multilingual premise and hypothesis pairs to train on. Each pair can be classified into one of three relations: entailment, contradiction, and neutral. The last category means that a pair is neither an entailment nor a contradiction.
The task is to detect these relations from the paired text using TPUs.
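For reference, here is a minimal sketch of the three relations with made-up example pairs; the integer labels assume the competition's usual encoding (0 = entailment, 1 = neutral, 2 = contradiction).

# Made-up examples of the three relations.
# Labels assume the encoding 0 = entailment, 1 = neutral, 2 = contradiction.
examples = [
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "Someone is performing music.", "label": 0},     # entailment
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The concert is sold out.", "label": 1},          # neutral
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "Nobody is playing an instrument.", "label": 2},  # contradiction
]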

A tutorial notebook is given here (https://www.kaggle.com/anasofiauzsoy/tutorial-notebook). I made some changes to it, and the final version scored 0.70933. (Therefore, only the additional code is shown below.)

Changing the Tutorial Code

First of all, I ran the tutorial code as-is, and the score was 0.64542.
It uses a pretrained BERT model and fine-tunes it for the classification task.

tutorial code model
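Roughly, the tutorial-style model tokenizes the premise and hypothesis as one sequence, takes the BERT encoder's [CLS] output, and classifies it with a softmax layer over the three relations. The sketch below is my paraphrase of that structure, not the tutorial's exact code; the checkpoint name and the maximum length of 50 are assumptions.

import tensorflow as tf
from transformers import TFBertModel

max_len = 50  # assumed maximum sequence length

def build_model(encoder):
    # Input: token ids of the premise/hypothesis pair packed into one sequence.
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    embeddings = encoder(input_ids)[0]    # last hidden states
    cls_output = embeddings[:, 0, :]      # [CLS] token representation
    output = tf.keras.layers.Dense(3, activation="softmax")(cls_output)
    model = tf.keras.Model(inputs=input_ids, outputs=output)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

bert = TFBertModel.from_pretrained("bert-base-multilingual-cased")  # assumed checkpoint
model = build_model(bert)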

I changed two things in this model.
The first change is the model type. I used the XLM-RoBERTa model (jplu/tf-xlm-roberta-base) and its pretrained parameters, like this.

from transformers import AutoTokenizer, TFXLMRobertaModel

model_name = "jplu/tf-xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFXLMRobertaModel.from_pretrained(model_name)
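The rest of the pipeline stays as in the tutorial; the premise/hypothesis pairs are simply encoded with this tokenizer as one sequence. Below is a hedged sketch of that encoding step with a made-up pair; the maximum length of 50 is an assumption.

# Encode a premise/hypothesis pair as a single sequence (made-up example).
# The tokenizer inserts XLM-RoBERTa's separator tokens between the two sentences.
encoded = tokenizer(
    "A man is playing a guitar on stage.",  # premise
    "Someone is performing music.",         # hypothesis
    max_length=50, padding="max_length", truncation=True,
    return_tensors="tf",
)
print(encoded["input_ids"].shape)  # (1, 50)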

The second revision is translating all languages into English. I used the googletrans library for this.

!pip install googletrans
from googletrans import Translator
from tqdm import tqdm
tqdm.pandas()

def translate2english(text):
    # Translate a single string into English with googletrans.
    translator = Translator()
    return translator.translate(text, dest="en").text

# Replace the non-English premises and hypotheses with their English translations.
original_train.loc[(original_train['lang_abv'] != 'en'), 'premise'] = original_train['premise'].progress_apply(lambda x: translate2english(x))
original_train.loc[(original_train['language'] != 'English'), 'hypothesis'] = original_train['hypothesis'].progress_apply(lambda x: translate2english(x))

test.loc[(test['lang_abv'] != 'en'), 'premise'] = test['premise'].progress_apply(lambda x: translate2english(x))
test.loc[(test['language'] != 'English'), 'hypothesis'] = test['hypothesis'].progress_apply(lambda x: translate2english(x))
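After translation, a quick sanity check (a sketch only; the column names follow the competition data) is to look at a few rows that were originally non-English:

# Spot-check rows that were originally non-English (sketch only).
print(original_train.loc[original_train['lang_abv'] != 'en', ['premise', 'hypothesis']].head())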

The final code is here (https://github.com/gojiteji/kaggle/blob/master/watson.ipynb).

I trained on a single language because English accounts for about 60% of both train.csv and test.csv this time. I think the score would increase if I split the data by language and trained on each language separately, because XLM-RoBERTa is already a scaled cross-lingual sentence encoder; a sketch of that idea follows.
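As a rough illustration (not something I ran for this submission), the untranslated training data could be grouped by language and a copy of the model fine-tuned on each subset; the helper functions below are hypothetical placeholders.

# Hedged sketch: fine-tune one model per language group (hypothetical helpers).
for lang, subset in original_train.groupby('language'):
    print(lang, len(subset))
    # ids, labels = encode_for_model(subset)               # hypothetical helper
    # model_for_lang = finetune_copy(model, ids, labels)   # hypothetical helper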

References