This is a memo of what I found after happening to see an [article](https://qiita.com/Sak1361/items/2519f29af82ffe965652#mecab%E3%81%AE%E3%83%90%E3%82%B0%E5%AF%BE%E7%AD%96) that described this as a bug in MeCab.
The limit can be found by looking at a value defined in common.h, **MAX_INPUT_BUFFER_SIZE**.
```cpp
#define NBEST_MAX 512
#define NODE_FREELIST_SIZE 512
#define PATH_FREELIST_SIZE 2048
#define MIN_INPUT_BUFFER_SIZE 8192
#define MAX_INPUT_BUFFER_SIZE (8192*640)
#define BUF_SIZE 8192
```
It appears that at most 8192 x 640 = 5,242,880 bytes of data can be accepted. In other words, at 2 bytes per character, 8192 x 640 / 2 = **2,621,440 characters**. That is the conclusion.
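The arithmetic can be checked with a small sketch. The constant is copied from common.h; the bytes-per-character values are assumptions (2 bytes matches the division above, as in EUC-JP or Shift_JIS, while 3 bytes is typical for Japanese text in UTF-8), and `max_chars` is a hypothetical helper, not part of MeCab.

```cpp
#include <cstddef>

// Copied from common.h (MAX_INPUT_BUFFER_SIZE).
constexpr std::size_t kMaxInputBufferSize = 8192 * 640;

// Hypothetical helper: how many characters fit if every character
// takes `bytes_per_char` bytes (integer division, rounds down).
constexpr std::size_t max_chars(std::size_t bytes_per_char) {
  return kMaxInputBufferSize / bytes_per_char;
}

// Checked at compile time.
static_assert(kMaxInputBufferSize == 5242880, "8192 * 640 bytes");
static_assert(max_chars(2) == 2621440, "2 bytes per character");
static_assert(max_chars(3) == 1747626, "3 bytes per character");
```

Under the 3-byte assumption the ceiling drops to roughly 1.7 million characters, so the character count depends on the encoding of the input, while the byte limit does not.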
```cpp
  size_t ibufsize = std::min(MAX_INPUT_BUFFER_SIZE,
                             std::max(param.get<int>
                                      ("input-buffer-size"),
                                      MIN_INPUT_BUFFER_SIZE));

  const bool partial = param.get<bool>("partial");
  if (partial) {
    ibufsize *= 8;
  }

  MeCab::scoped_array<char> ibuf_data(new char[ibufsize]);
  char *ibuf = ibuf_data.get();

  MeCab::scoped_ptr<MeCab::Tagger> tagger(model->createTagger());

  if (!tagger.get()) {
    WHAT_ERROR("cannot create tagger");
  }

  for (size_t i = 0; i < rest.size(); ++i) {
    MeCab::istream_wrapper ifs(rest[i].c_str());
    if (!*ifs) {
      WHAT_ERROR("no such file or directory: " << rest[i]);
    }

    while (true) {
      if (!partial) {
        ifs->getline(ibuf, ibufsize);
      } else {
        std::string sentence;
        MeCab::scoped_fixed_array<char, BUF_SIZE> line;
        for (;;) {
          if (!ifs->getline(line.get(), line.size())) {
            ifs->clear(std::ios::eofbit|std::ios::badbit);
            break;
          }
          sentence += line.get();
          sentence += '\n';
          if (std::strcmp(line.get(), "EOS") == 0 || line[0] == '\0') {
            break;
          }
        }
        std::strncpy(ibuf, sentence.c_str(), ibufsize);
      }
      if (ifs->eof() && !ibuf[0]) {
        return false;
      }
      if (ifs->fail()) {
        std::cerr << "input-buffer overflow. "
                  << "The line is split. use -b #SIZE option." << std::endl;
        ifs->clear();
      }
      const char *r = (nbest >= 2) ? tagger->parseNBest(nbest, ibuf) :
          tagger->parse(ibuf);
      if (!r) {
        WHAT_ERROR(tagger->what());
      }
      *ofs << r << std::flush;
    }
  }

  return EXIT_SUCCESS;

#undef WHAT_ERROR
```
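The two mechanisms in the excerpt above, clamping the requested buffer size and splitting a line that does not fit, can be reproduced in isolation with only the standard library. This is a minimal sketch, not MeCab code: `clamp_buffer_size` and `read_line_in_pieces` are hypothetical names, and a `std::istringstream` stands in for the input file.

```cpp
#include <algorithm>
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>
#include <vector>

// Values copied from common.h.
constexpr int kMinInputBufferSize = 8192;
constexpr int kMaxInputBufferSize = 8192 * 640;

// Same clamping as the excerpt: the requested size is forced into
// the range [kMinInputBufferSize, kMaxInputBufferSize].
std::size_t clamp_buffer_size(int requested) {
  return std::min(kMaxInputBufferSize,
                  std::max(requested, kMinInputBufferSize));
}

// Hypothetical helper: reads `text` with a fixed-size buffer the way the
// non-partial branch does and returns the pieces that would each be parsed.
// bufsize must be at least 2 (one character plus the terminating '\0').
std::vector<std::string> read_line_in_pieces(const std::string &text,
                                             std::size_t bufsize) {
  std::vector<std::string> pieces;
  if (bufsize < 2) return pieces;
  std::istringstream in(text);
  std::vector<char> buf(bufsize);
  while (true) {
    buf[0] = '\0';
    in.getline(buf.data(), static_cast<std::streamsize>(buf.size()));
    if (in.eof() && !buf[0]) break;  // nothing left to read
    if (in.fail()) in.clear();       // buffer filled before '\n': line is split
    pieces.emplace_back(buf.data());
  }
  return pieces;
}
```

For example, `read_line_in_pieces("abcdefghij\n", 4)` yields the four pieces `abc`, `def`, `ghi` and `j`, because `getline` stores at most `bufsize - 1` characters per call and sets `failbit` when it stops before the newline. In the MeCab loop each such piece is parsed as if it were its own sentence, which is what the "The line is split" warning refers to.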
From this MeCab code, it can be seen that the input processed by MeCab.Tagger.parse() never exceeds **MAX_INPUT_BUFFER_SIZE** (multiplied by 8 when the partial option is set). Next, lattice analysis, looking at string_buffer.h and [tagger.cpp](https://github.com/taku910/mecab/blob/3a07c4eefaffb4e7a0690a7f4e5e0263d3ddb8a3/mecab/src/tagger.cpp). (string_buffer.h: excerpt from lines 15-37)
```cpp
bool StringBuffer::reserve(size_t length) {
  if (!is_delete_) {
    error_ = (size_ + length >= alloc_size_);
    return (!error_);
  }

  if (size_ + length >= alloc_size_) {
    if (alloc_size_ == 0) {
      alloc_size_ = DEFAULT_ALLOC_SIZE;
      ptr_ = new char[alloc_size_];
    }
    size_t len = size_ + length;
    do {
      alloc_size_ *= 2;
    } while (len >= alloc_size_);
    char *new_ptr = new char[alloc_size_];
    std::memcpy(new_ptr, ptr_, size_);
    delete [] ptr_;
    ptr_ = new_ptr;
  }

  return true;
}
```
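The growth rule in `reserve` above can be traced without allocating anything. The sketch below repeats only the size calculation for a buffer that owns its memory (`is_delete_` true). `grown_capacity` is a hypothetical helper, and the value of `DEFAULT_ALLOC_SIZE` is not shown in the excerpt, so 8192 is an assumption used purely for illustration.

```cpp
#include <cstddef>

// Assumption: DEFAULT_ALLOC_SIZE does not appear in the excerpt;
// 8192 is used here only for illustration.
constexpr std::size_t kDefaultAllocSize = 8192;

// Hypothetical helper: returns the capacity reserve() would end up with,
// given the current size, the extra length requested, and the current
// capacity. No memory is allocated.
std::size_t grown_capacity(std::size_t size, std::size_t length,
                           std::size_t alloc_size) {
  if (size + length >= alloc_size) {
    if (alloc_size == 0) alloc_size = kDefaultAllocSize;
    const std::size_t len = size + length;
    do {
      alloc_size *= 2;  // doubles at least once whenever growth is needed
    } while (len >= alloc_size);
  }
  return alloc_size;
}
```

With these assumptions, an empty buffer asked for 10 bytes ends up with 16384 bytes (the default is doubled once), and a request for 8192 x 640 bytes starting from 8192 ends up with 8,388,608 bytes, the first power-of-two multiple of 8192 that is strictly larger. The only ceiling in this rule is what `new` can provide.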
As far as I can tell, this reserve, which acquires the memory area, is called only in tagger.cpp. (tagger.cpp: excerpt from lines 733-741)
```cpp
LatticeImpl::LatticeImpl(const Writer *writer)
    : sentence_(0), size_(0), theta_(kDefaultTheta), Z_(0.0),
      request_type_(MECAB_ONE_BEST),
      writer_(writer),
      ostrs_(0),
      allocator_(new Allocator<Node, Path>) {
  begin_nodes_.reserve(MIN_INPUT_BUFFER_SIZE);
  end_nodes_.reserve(MIN_INPUT_BUFFER_SIZE);
}
```
And the LatticeImpl constructor is (I think) executed when a Lattice is instantiated. (tagger.cpp: excerpt from lines 227-239)
```cpp
class LatticeImpl : public Lattice {
 public:
  explicit LatticeImpl(const Writer *writer = 0);
  ~LatticeImpl();

  // clear internal lattice
  void clear();

  bool is_available() const {
    return (sentence_ &&
            !begin_nodes_.empty() &&
            !end_nodes_.empty());
  }
  // (excerpt ends here; the class definition continues)
```
From all this, it seems that lattice analysis itself has no fixed limit: when memory runs short, the allocated size is doubled and the existing contents are copied over with memcpy. (Of course it has to stop once memory is exhausted, but before that point the input apparently stops at **MAX_INPUT_BUFFER_SIZE**.)
mecab.h: definitions of structures such as the lattice.

libmecab.cpp: definitions of mutable_lattice() from [tagger.cpp](https://github.com/taku910/mecab/blob/3a07c4eefaffb4e7a0690a7f4e5e0263d3ddb8a3/mecab/src/tagger.cpp) and of model-related functions such as mecab_model_new_lattice().
END.