When I try to run MeCab's wakati (word segmentation) over a list of sentences, I get "TypeError: in method 'Tagger_parse', argument 2 of type 'char const *'".
The error message says that argument 2 is wrong, so I suspected that either the CSV file or my code was written badly, but I could not solve it.
The site I used as a reference is linked below. Since I do not need to index labels, I deleted the label-related parts, and then got one error after another. I did not expect that many errors, because I thought I was only removing the label-dependent variables. Reference site: https://qiita.com/Qazma/items/0daf927e34d22617ddcd
Sorry for the trouble, but I would appreciate it if anyone could help.
Supplement: the CSV file has a single column, with one sentence per line.
2020-12-25 11:55:30.878680: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
C:\Users\Katuta\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.2) or chardet (4.0.0) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Traceback (most recent call last):
  File "ex.py", line 5, in <module>
    padded, one_hot_y, word_index, tokenizer, max_len, vocab_size = wakatigaki.create_tokenizer()
  File "C:\Users\Katuta\gotou\wakatigaki.py", line 21, in create_tokenizer
    text_wakati = wakati.parse(text)
  File "C:\Users\Katuta\AppData\Local\Programs\Python\Python38\lib\site-packages\MeCab.py", line 293, in parse
    return _MeCab.Tagger_parse(self, *args)
TypeError: in method 'Tagger_parse', argument 2 of type 'char const *'
Additional information:
Wrong number or type of arguments for overloaded function 'Tagger_parse'.
Possible C/C++ prototypes are:
MeCab::Tagger::parse(MeCab::Model const &,MeCab::Lattice *)
MeCab::Tagger::parse(MeCab::Lattice *) const
MeCab::Tagger::parse(char const *)
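For reference, a minimal sketch of how the types line up here (with a placeholder file name sample.csv, not the path from this post): csv.reader yields each row as a list of strings, while Tagger.parse() accepts a single str (the char const * in the prototype above), so a row would have to be joined or indexed into one string before parsing.

import csv
import MeCab

wakati = MeCab.Tagger("-Owakati")

with open("sample.csv", 'r', encoding="utf-8") as csvfile:
    for row in csv.reader(csvfile):
        # csv.reader yields each row as a list of strings, e.g. ['今日はいい天気です'].
        # Tagger.parse() only accepts a single str, so join the row (or take row[0]) first.
        sentence = "".join(row)
        print(wakati.parse(sentence))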
import MeCab
import csv
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
def create_tokenizer():
    text_list = []
    with open("C:/Users/Katuta/gotou/corpus_MEIDAI.csv", 'r', encoding="utf-8", errors='ignore') as csvfile:
        texts = csv.reader(csvfile)
        for text in texts:
            text_list.append(text)

    # Use MeCab to split the Japanese text into words.
    wakati_list = []
    for text in text_list:
        text = list(map(str.lower, text))
        wakati = MeCab.Tagger("-O wakati")
        text_wakati = wakati.parse(text)
        wakati.parse('')
        wakati_list.append(text_wakati)

    # Find the number of elements in the longest sentence.
    # Create a list of text data to use in the tokenizer.
    max_len = -1
    split_list = []
    sentences = []
    for text in wakati_list:
        text = text.split()
        split_list.extend(text)
        sentences.append(text)
        if len(text) > max_len:
            max_len = len(text)
    print("Max length of texts: ", max_len)
    vocab_size = len(set(split_list))
    print("Vocabulary size: ", vocab_size)

    # Use Tokenizer to assign numbers to words, starting from index 1.
    # Also create a dictionary.
    tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<oov>")
    tokenizer.fit_on_texts(split_list)
    word_index = tokenizer.word_index
    print("Dictionary size: ", len(word_index))
    sequences = tokenizer.texts_to_sequences(sentences)

    # to_categorical() creates the actual label data passed to the model as one-hot vectors.
    one_hot_y = tf.keras.utils.to_categorical(sentences)

    # To match the training data sizes, pad shorter texts with 0 up to the length of the longest text.
    padded = pad_sequences(sequences, maxlen=max_len, padding="post", truncating="post")
    print("padded sequences: ", padded)
    return padded, one_hot_y, word_index, tokenizer, max_len, vocab_size
The part where the error occurs:

wakati = MeCab.Tagger("-Owakati")
text_wakati = wakati.parse(text)
wakati.parse('')
wakati_list.append(text_wakati)
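As a side note on the to_categorical() comment in the code above, a minimal sketch with made-up integer labels (not data from this post) of the kind of input that tf.keras.utils.to_categorical turns into one-hot vectors:

import numpy as np
import tensorflow as tf

# Hypothetical integer class labels for three samples.
labels = np.array([0, 2, 1])

# to_categorical() converts a vector of integer class IDs into one-hot rows.
one_hot = tf.keras.utils.to_categorical(labels, num_classes=3)
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]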