Generate a MeCab dictionary from Nico Nico Pedia data

Overview

Generate a MeCab dictionary from the "Nico Nico Encyclopedia Data" (provided by Future Search Brazil Co., Ltd.), which is distributed through the National Institute of Informatics website, and apply it. This method may be useful for text mining for research purposes.

Method

01. Obtain the Nico Nico Pedia data.

http://www.nii.ac.jp/cscenter/idr/nico/nicopedia-apply.html

02. Place the following Python code in the same directory as the unzipped head folder.

nc2mecab.py


# -*- encoding: utf-8 -*-

import os
import csv
import re

def main():
  #Input folder name
  pth = 'head'
  #Output file name
  wtnme = 'ncnc.csv'
  #Pattern of strings to strip when normalizing a word:
  #titles that are only a date (e.g. 1月1日) and a trailing parenthesized genre name
  rmvptn = re.compile(r'(^\d{1,2}月\d{1,2}日$)|((\(|（).+(\)|）)$)')

  with open(wtnme,'wb') as wtfh:
    wt = csv.writer(wtfh)
    fnmes = os.listdir(pth)
    for fnme in fnmes:
      with open(os.path.join(pth,fnme),'rb') as rdfh:
        rd = csv.reader(rdfh)
        for row in rd:
          if row[3]=='a':
            wrd = rmvptn.sub('',row[1]).lower()
            if(0 < len(wrd)):
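              #The cost decreases as the word gets longer (floor: -32768).
              #名詞,一般 are the IPADIC labels for "noun, general".
              wt.writerow(
                [wrd,'0','0',int(max(-32768.0, (6000 - 200 *(len(wrd)**1.3)))),'名詞','一般','*','*','*','*',wrd,row[2],row[2],'Nico Nico Pedia']
              )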

if __name__ == '__main__':
  main()
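The word-normalization step in the script can be tried in isolation. The following is a minimal Python 3 sketch, not the script itself. It assumes the date titles use the Japanese 月 and 日 characters and that genre names appear in either half-width or full-width parentheses. The function name `normalize_title` and the sample titles are hypothetical.

```python
import re

# Two alternatives: a title that consists only of a date such as "1月1日",
# or a trailing qualifier in half-width or full-width parentheses.
REMOVE_PATTERN = re.compile(r'(^\d{1,2}月\d{1,2}日$)|((\(|（).+(\)|）)$)')


def normalize_title(title):
    """Strip the unwanted part of a title and lowercase what is left."""
    return REMOVE_PATTERN.sub('', title).lower()


# Hypothetical titles, chosen only to exercise each branch of the pattern.
SAMPLES = ['1月1日', 'Python(プログラミング言語)', 'VOCALOID', 'ニコ厨']

for sample in SAMPLES:
    print(repr(sample), '->', repr(normalize_title(sample)))
```

A title that normalizes to the empty string is the case the script skips with its `0 < len(wrd)` check.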

03. Run the Python code. The script as written follows Python 2 conventions (it opens the CSV files in binary mode).

python nc2mecab.py
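Each row the script writes carries a cost computed from the word length, so longer words get a lower cost. The sketch below reproduces only that formula so the values can be inspected. The helper name `word_cost` is hypothetical. Note that `len` here counts characters (Python 3), whereas `len` on a Python 2 byte string counts bytes, so multibyte words would come out with lower costs under Python 2.

```python
def word_cost(word):
    """Cost formula used in nc2mecab.py: 6000 - 200 * len(word)**1.3, floored at -32768."""
    return int(max(-32768.0, 6000 - 200 * (len(word) ** 1.3)))


# Print the cost for a few word lengths to see how quickly it falls.
for length in (1, 5, 10, 50, 100):
    print(length, word_cost('x' * length))
```

The floor of -32768 is the smallest value of a signed 16-bit integer, and very long words all end up at that value.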

04. Generate and apply the MeCab dictionary.

Using the output CSV, follow the "Adding to the user dictionary" procedure on the MeCab documentation page "How to add words". The dictionary generation command, however, is as follows.

/usr/local/libexec/mecab/mecab-dict-index -d/usr/local/lib/mecab/dic/ipadic -u ncnc.dic -f utf-8 -t utf-8 ncnc.csv
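If this step is scripted, the command can be assembled as an argument list. The sketch below is one possible way to do that, under assumptions: the paths are copied from the command above and will differ between installations, the helper names are hypothetical, and the `-u` option of `mecab` is used to load the compiled user dictionary.

```python
import subprocess

# Paths copied from the command above; adjust them to the local installation.
DICT_INDEX = '/usr/local/libexec/mecab/mecab-dict-index'
SYSTEM_DIC = '/usr/local/lib/mecab/dic/ipadic'


def build_index_command(csv_path, dic_path, charset='utf-8'):
    """Return the mecab-dict-index argument list that compiles a user dictionary."""
    return [DICT_INDEX, '-d', SYSTEM_DIC, '-u', dic_path,
            '-f', charset, '-t', charset, csv_path]


def build_mecab_command(dic_path):
    """Return the mecab argument list that loads the compiled user dictionary."""
    return ['mecab', '-u', dic_path]


def compile_user_dictionary(csv_path, dic_path):
    """Run mecab-dict-index; raises CalledProcessError if the tool reports failure."""
    subprocess.run(build_index_command(csv_path, dic_path), check=True)


# Only print the commands here; call compile_user_dictionary() once MeCab is installed.
print(' '.join(build_index_command('ncnc.csv', 'ncnc.dic')))
print(' '.join(build_mecab_command('ncnc.dic')))
```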

Result

Input sentence:

vocaloid and lovelive! is the taste of Nico Kitchen.

Output (Japanese surface forms and tags shown in English translation):

vocaloid	noun,general,*,*,*,*,vocaloid,vocaloid,vocaloid,Nico Nico Pedia
and	filler,*,*,*,*,*,and,to,to
lovelive!	noun,general,*,*,*,*,love live!,Love Live,Love Live,Nico Nico Pedia
is	particle,particle,*,*,*,*,is,ha,wa
Nico Kitchen	noun,general,*,*,*,*,Nico Kitchen,Nico Chu,Nico Chu,Nico Nico Pedia
of	particle,adnominal form,*,*,*,*,of,no,no
taste	noun,general,*,*,*,*,taste,Tashinami,Tashinami
.	symbol,kuten,*,*,*,*,.,.,.
EOS
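For text mining, it can help to check which tokens came from the new dictionary. The sketch below parses MeCab's default output, assuming one token per line with a tab between the surface form and the comma-separated features. It relies on the script appending 'Nico Nico Pedia' as the last feature. The function names and the sample text are hypothetical, and the plain comma split would need replacing if a feature itself contained a comma.

```python
def parse_mecab_output(text):
    """Split MeCab output into (surface, feature-list) pairs, stopping at EOS."""
    tokens = []
    for line in text.splitlines():
        if line == 'EOS':
            break
        if not line:
            continue
        surface, _, features = line.partition('\t')
        tokens.append((surface, features.split(',')))
    return tokens


def from_user_dictionary(features, source_tag='Nico Nico Pedia'):
    """True when the last feature is the source tag that nc2mecab.py appends."""
    return bool(features) and features[-1] == source_tag


# Hypothetical output, shaped like the result above.
SAMPLE = (
    'vocaloid\tnoun,general,*,*,*,*,vocaloid,vocaloid,vocaloid,Nico Nico Pedia\n'
    'taste\tnoun,general,*,*,*,*,taste,Tashinami,Tashinami\n'
    'EOS\n'
)

for surface, features in parse_mecab_output(SAMPLE):
    print(surface, from_user_dictionary(features))
```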
