I have used Word2Vec before, but it does not produce context-sensitive representations. Consider these two sentences:
"The weather forecast for tomorrow is あめ (rain)." "I bought あめ (candy) at a candy store."
In Japanese, あめ (ame) written in hiragana can mean either rain (雨) or candy (飴). The meanings are different, but the surface form is the same word, so Word2Vec assigns both occurrences a single vector and treats them as the same thing.
BERT, on the other hand, is said to produce representations that take context into account, so I wanted to check whether the two occurrences of "あめ" above actually come out with different meanings.
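To make the Word2Vec side concrete, here is a minimal sketch with gensim on a toy corpus I made up (not part of the experiment below): whatever the context, the same surface form maps to one and the same vector.

from gensim.models import Word2Vec

# Toy corpus: "あめ" appears in both a rain context and a candy context.
sentences = [
    ["明日", "の", "天気", "は", "あめ", "です"],
    ["あめ", "を", "舐め", "ながら", "仕事", "を", "する"],
]
model = Word2Vec(sentences=sentences, vector_size=50, min_count=1, seed=0)

# Word2Vec has exactly one vector per surface form:
# both occurrences of "あめ" map to this same vector.
print(model.wv["あめ"][:5])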
Referring to the ELMo implementation in the article here, I run the check with the corpus below.
For BERT, I use huggingface/transformers to obtain the distributed representations.
The implementation proceeds in the following steps: (1) word splitting, (2) converting words to IDs, (3) converting to the model's input format (tensorization), (4) preparing the model and feeding it the input, and (5) computing similarities between the outputs.
import torch
import numpy as np
from transformers import BertJapaneseTokenizer, BertForMaskedLM

tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
def tokenize(text):
    # 1. Split the text into tokens with the Japanese BERT tokenizer
    return tokenizer.tokenize(text)

def word_to_index(tokens):
    # 2. Convert tokens to vocabulary IDs
    return tokenizer.convert_tokens_to_ids(tokens)

def to_tensor(tokens):
    # 3. Wrap the ID list in a (1, seq_len) tensor for the model
    return torch.tensor([tokens])

def cos_sim(vec1, vec2):
    # Cosine similarity between two torch vectors
    x = vec1.detach().numpy()
    y = vec2.detach().numpy()
    x_l2_norm = np.linalg.norm(x, ord=2)
    y_l2_norm = np.linalg.norm(y, ord=2)
    xy = np.dot(x, y)
    return xy / (x_l2_norm * y_l2_norm)
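As a quick sanity check of cos_sim (toy vectors, not part of the experiment): identical vectors should give 1.0 and orthogonal vectors 0.0.

v1 = torch.tensor([1.0, 0.0])
v2 = torch.tensor([0.0, 1.0])
print(cos_sim(v1, v1))  # -> 1.0
print(cos_sim(v1, v2))  # -> 0.0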
if __name__ == "__main__":
    # The ambiguous word あめ appears in every sentence:
    # rainy_* use it in the sense of rain, candy_* in the sense of candy.
    d_rainy_01 = "[CLS]明日の天気予報はあめです。[SEP]"  # The weather forecast for tomorrow is rain.
    d_rainy_02 = "[CLS]今朝はあめだったので犬の散歩に行かなかった。[SEP]"  # I didn't walk the dog this morning because it was raining.
    d_rainy_03 = "[CLS]梅雨なので毎日あめが降っている。[SEP]"  # It's the rainy season, so it rains every day.
    d_candy_01 = "[CLS]駄菓子屋であめを買った。[SEP]"  # I bought candy at a candy store.
    d_candy_02 = "[CLS]あめを舐めながら仕事をする。[SEP]"  # I work while sucking on candy.
    d_candy_03 = "[CLS]すっぱいあめは苦手だ。[SEP]"  # I don't like sour candy.
    # 1. Word splitting
    tokenize_rainy_01 = tokenize(d_rainy_01)
    tokenize_rainy_02 = tokenize(d_rainy_02)
    tokenize_rainy_03 = tokenize(d_rainy_03)
    tokenize_candy_01 = tokenize(d_candy_01)
    tokenize_candy_02 = tokenize(d_candy_02)
    tokenize_candy_03 = tokenize(d_candy_03)
    # 2. Convert words to IDs
    indexes_rainy_01 = word_to_index(tokenize_rainy_01)
    indexes_rainy_02 = word_to_index(tokenize_rainy_02)
    indexes_rainy_03 = word_to_index(tokenize_rainy_03)
    indexes_candy_01 = word_to_index(tokenize_candy_01)
    indexes_candy_02 = word_to_index(tokenize_candy_02)
    indexes_candy_03 = word_to_index(tokenize_candy_03)
    # 3. Convert to the model's input format (tensorization)
    tensor_rainy_01 = to_tensor(indexes_rainy_01)
    tensor_rainy_02 = to_tensor(indexes_rainy_02)
    tensor_rainy_03 = to_tensor(indexes_rainy_03)
    tensor_candy_01 = to_tensor(indexes_candy_01)
    tensor_candy_02 = to_tensor(indexes_candy_02)
    tensor_candy_03 = to_tensor(indexes_candy_03)
    # 4. Prepare the model and find the position of あめ in each sentence
    bert = BertForMaskedLM.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
    bert.eval()
    index_rainy_01 = tokenize_rainy_01.index('あめ')
    index_rainy_02 = tokenize_rainy_02.index('あめ')
    index_rainy_03 = tokenize_rainy_03.index('あめ')
    index_candy_01 = tokenize_candy_01.index('あめ')
    index_candy_02 = tokenize_candy_02.index('あめ')
    index_candy_03 = tokenize_candy_03.index('あめ')
    # Feed each sentence through the model and take the output vector
    # at the position of あめ (output[0] is the MLM head's scores)
    vec_rainy_01 = bert(tensor_rainy_01)[0][0][index_rainy_01]
    vec_rainy_02 = bert(tensor_rainy_02)[0][0][index_rainy_02]
    vec_rainy_03 = bert(tensor_rainy_03)[0][0][index_rainy_03]
    vec_candy_01 = bert(tensor_candy_01)[0][0][index_candy_01]
    vec_candy_02 = bert(tensor_candy_02)[0][0][index_candy_02]
    vec_candy_03 = bert(tensor_candy_03)[0][0][index_candy_03]
    # 5. Compute the similarities between the outputs
    #    (rain vs. rain, rain vs. candy, candy vs. candy)
    print("Cos similarity of あめ in rainy_01 and rainy_02: {:.2f}".format(cos_sim(vec_rainy_01, vec_rainy_02)))
    print("Cos similarity of あめ in rainy_01 and rainy_03: {:.2f}".format(cos_sim(vec_rainy_01, vec_rainy_03)))
    print("Cos similarity of あめ in rainy_02 and rainy_03: {:.2f}".format(cos_sim(vec_rainy_02, vec_rainy_03)))
    print("-" * 30)
    print("Cos similarity of あめ in rainy_01 and candy_01: {:.2f}".format(cos_sim(vec_rainy_01, vec_candy_01)))
    print("Cos similarity of あめ in rainy_01 and candy_02: {:.2f}".format(cos_sim(vec_rainy_01, vec_candy_02)))
    print("Cos similarity of あめ in rainy_01 and candy_03: {:.2f}".format(cos_sim(vec_rainy_01, vec_candy_03)))
    print("-" * 30)
    print("Cos similarity of あめ in rainy_02 and candy_01: {:.2f}".format(cos_sim(vec_rainy_02, vec_candy_01)))
    print("Cos similarity of あめ in rainy_02 and candy_02: {:.2f}".format(cos_sim(vec_rainy_02, vec_candy_02)))
    print("Cos similarity of あめ in rainy_02 and candy_03: {:.2f}".format(cos_sim(vec_rainy_02, vec_candy_03)))
    print("-" * 30)
    print("Cos similarity of あめ in rainy_03 and candy_01: {:.2f}".format(cos_sim(vec_rainy_03, vec_candy_01)))
    print("Cos similarity of あめ in rainy_03 and candy_02: {:.2f}".format(cos_sim(vec_rainy_03, vec_candy_02)))
    print("Cos similarity of あめ in rainy_03 and candy_03: {:.2f}".format(cos_sim(vec_rainy_03, vec_candy_03)))
    print("-" * 30)
    print("Cos similarity of あめ in candy_01 and candy_02: {:.2f}".format(cos_sim(vec_candy_01, vec_candy_02)))
    print("Cos similarity of あめ in candy_01 and candy_03: {:.2f}".format(cos_sim(vec_candy_01, vec_candy_03)))
    print("Cos similarity of あめ in candy_02 and candy_03: {:.2f}".format(cos_sim(vec_candy_02, vec_candy_03)))
To summarize the results:

| | rainy_01 | rainy_02 | rainy_03 | candy_01 | candy_02 | candy_03 |
|---|---|---|---|---|---|---|
| rainy_01 | * | 0.79 | 0.88 | 0.83 | 0.83 | 0.83 |
| rainy_02 | * | * | 0.79 | 0.77 | 0.75 | 0.77 |
| rainy_03 | * | * | * | 0.87 | 0.89 | 0.84 |
| candy_01 | * | * | * | * | 0.93 | 0.90 |
| candy_02 | * | * | * | * | * | 0.90 |
| candy_03 | * | * | * | * | * | * |
The "rain" and "candy" senses of あめ should be different, yet the values did not separate the way I expected: the same-sense pairs are barely more similar than the cross-sense pairs. Why?
NICT released a pretrained Japanese BERT model in March 2020, so I also compared it against bert-base-japanese-whole-word-masking. The similarities obtained with the NICT model by the same procedure are as follows.
| | rainy_01 | rainy_02 | rainy_03 | candy_01 | candy_02 | candy_03 |
|---|---|---|---|---|---|---|
| rainy_01 | * | 0.83 | 0.82 | 0.86 | 0.82 | 0.85 |
| rainy_02 | * | * | 0.88 | 0.87 | 0.79 | 0.84 |
| rainy_03 | * | * | * | 0.84 | 0.80 | 0.86 |
| candy_01 | * | * | * | * | 0.82 | 0.85 |
| candy_02 | * | * | * | * | * | 0.81 |
| candy_03 | * | * | * | * | * | * |
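The averages can be computed directly from the tables above: the same-sense group is the 6 rainy–rainy and candy–candy pairs, the different-sense group the 9 rainy–candy pairs. For example, for bert-base-japanese-whole-word-masking:

import numpy as np

# Values copied from the bert-base-japanese-whole-word-masking table above
same = [0.79, 0.88, 0.79, 0.93, 0.90, 0.90]  # rainy-rainy and candy-candy pairs
diff = [0.83, 0.83, 0.83, 0.77, 0.75, 0.77, 0.87, 0.89, 0.84]  # rainy-candy pairs
print("Same meaning: {:.3f}".format(np.mean(same)))        # -> 0.865
print("Different meanings: {:.3f}".format(np.mean(diff)))  # -> 0.820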
The averages for each model:

| | bert-base-japanese-whole-word-masking | NICT |
|---|---|---|
| Same meaning | 0.865 | 0.835 |
| Different meanings | 0.820 | 0.837 |
The result was not what I expected... With bert-base-japanese-whole-word-masking the same-sense pairs are only slightly more similar, and with NICT the different-sense pairs even come out marginally higher. I am not sure whether this is the right way to do it, so if you know, please tell me!
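One variant I suspect is worth trying (my own assumption, not something I have verified): BertForMaskedLM's first output is the MLM head's scores over the vocabulary, not the encoder's hidden states, so taking the word vector from BertModel's last hidden layer instead might behave differently. A minimal sketch:

from transformers import BertJapaneseTokenizer, BertModel
import torch

tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
model = BertModel.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
model.eval()

def word_vector(text, word):
    # Assumes `word` (e.g. あめ) survives tokenization as a single token
    tokens = tokenizer.tokenize(text)
    ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    with torch.no_grad():
        hidden = model(ids)[0]  # (1, seq_len, hidden_size) last hidden states
    return hidden[0][tokens.index(word)]

# e.g. cos_sim(word_vector(d_rainy_01, 'あめ'), word_vector(d_candy_01, 'あめ'))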