I'm writing this down as a note so that anyone who runs into the same error can spend less time searching for a fix.
Everything below was run on Google Colab.
Install MeCab and Hugging Face transformers on Colab, following the setup described here.
!apt install aptitude swig
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3
!pip install transformers
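By the way, if you want to check whether MeCab itself initializes before going through transformers, a quick check like the following works (this snippet is my own addition; if MeCab is misconfigured, the Tagger call raises the same RuntimeError that shows up below).
import MeCab

# Try building a Tagger directly; a misconfigured MeCab raises RuntimeError here,
# which is exactly what BertJapaneseTokenizer surfaces further down.
try:
    tagger = MeCab.Tagger("-Owakati")  # wakati-gaki (space-separated) output
    print(tagger.parse("自然言語処理はとても楽しい。"))
except RuntimeError as e:
    print("MeCab failed to initialize:", e)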
Try tokenizing (wakati-gaki, word segmentation) with the Japanese BERT tokenizer.
from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
# Declare the tokenizer for Japanese BERT
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
text = "Natural language processing is a lot of fun."
wakati_ids = tokenizer.encode(text, return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(wakati_ids[0].tolist()))
print(wakati_ids)
I got the following error.
----------------------------------------------------------
Failed initializing MeCab. Please see the README for possible solutions:
https://github.com/SamuraiT/mecab-python3#common-issues
If you are still having trouble, please file an issue here, and include the
ERROR DETAILS below:
https://github.com/SamuraiT/mecab-python3/issues
You don't have to write the issue in English.
------------------- ERROR DETAILS ------------------------
arguments:
error message: [ifs] no such file or directory: /usr/local/etc/mecabrc
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-3-f828f6470517> in <module>()
2
3 # Declare the tokenizer for Japanese BERT
----> 4 tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
5
6 text = "Natural language processing is a lot of fun."
4 frames
/usr/local/lib/python3.6/dist-packages/MeCab/__init__.py in __init__(self, rawargs)
122
123 try:
--> 124 super(Tagger, self).__init__(args)
125 except RuntimeError:
126 error_info(rawargs)
RuntimeError:
The error output kindly asks you to look here, so I reinstalled mecab-python3 following the instructions at that URL. If you also run
pip install unidic-lite
then MeCab no longer fails to initialize. This time, however, `encode` complained and raised
`ValueError: too many values to unpack (expected 2)`.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-f828f6470517> in <module>()
6 text = "Natural language processing is a lot of fun."
7
----> 8 wakati_ids = tokenizer.encode(text, return_tensors='pt')
9 print(tokenizer.convert_ids_to_tokens(wakati_ids[0].tolist()))
10 print(wakati_ids)
8 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_bert_japanese.py in tokenize(self, text, never_split, **kwargs)
205 break
206
--> 207 token, _ = line.split("\t")
208 token_start = text.index(token, cursor)
209 token_end = token_start + len(token)
ValueError: too many values to unpack (expected 2)
As for this error, it was solved by pinning mecab-python3 to version 0.996.5, as mentioned here by someone who appears to be the developer of mecab-python3.
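For reference, the failing line in tokenization_bert_japanese.py (line 207 in the traceback above) assumes every MeCab output line contains exactly one tab: the surface form on the left and the features on the right. With the newer mecab-python3 setup, the output line apparently carries extra tab-separated fields, so the two-value unpack fails. A tiny illustration with a hypothetical output line:
# Hypothetical MeCab output line with more than one tab-separated field
line = "自然\t名詞,一般\t*"
try:
    token, _ = line.split("\t")  # the tokenizer expects exactly two fields
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)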
In summary, I think you can avoid both errors by installing the packages with pip as follows.
!apt install aptitude swig
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.996.5
!pip install unidic-lite
!pip install transformers
If you had already installed the latest version of mecab-python3 with pip before running the commands above, don't forget to restart your Colab session once. You can terminate the session from the session manager, which opens when you click the ▼ next to the RAM/Disk indicator at the top right of the Colab screen.
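After restarting, you can double-check that the pinned version is the one actually installed:
!pip show mecab-python3
# Version: 0.996.5 should appear in the output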
from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
# Declare the tokenizer for Japanese BERT
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
text = "Natural language processing is a lot of fun."
wakati_ids = tokenizer.encode(text, return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(wakati_ids[0].tolist()))
print(wakati_ids)
#Downloading: 100%
#258k/258k [00:00<00:00, 1.58MB/s]
#
#['[CLS]', '自然', '言語', '処理', 'は', 'とても', '楽しい', '。', '[SEP]']
#tensor([[ 2, 1757, 1882, 2762, 9, 8567, 19835, 8, 3]])
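As a quick extra check, decoding the ids back to text should roughly reproduce the input sentence plus the special tokens:
# Round-trip check: decode the token ids back into a string
print(tokenizer.decode(wakati_ids[0].tolist()))
# Something like: [CLS] 自然 言語 処理 は とても 楽しい 。 [SEP]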
I was able to tokenize the text successfully with BertJapaneseTokenizer.
That's all.