The Nico Nico Pedia Dataset, published on IDR, is a collection of Nico Nico Pedia articles from 2008-2014 together with the comments posted on those articles.
It is well suited to research on natural language systems such as knowledge extraction, but it is not a well-behaved dataset like Wikipedia; it is a rather quirky one.
For example, nearly half of the sentences in Nico Nico Pedia lack a subject, the writing style is not unified, and ASCII art (AA) is mixed in as well.
In this article I will introduce the contents of the data together with a simple preprocessing tool, in the hope of finding **interesting people** who want to analyze this dataset.
The data is provided as **slightly unusual CSV** that has to be converted into standard CSV with proper preprocessing. The HTML is also a bit of a pain to parse because of some awkward tags. This article walks through that preprocessing.
The critical resource for preprocessing is not memory but disk capacity. If you carelessly start with only about 50 GB of free space, the preprocessing will fail with an error.
Also, if you use Python, you will want plenty of CPU and memory. ~~Or rather, Pandas performance is not that great...~~
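If you want to check beforehand how much free space you actually have, here is a quick sketch (the path below is a placeholder for wherever you plan to extract the data):

```python
import shutil

# Placeholder path: point this at the drive where you plan to extract the dataset
total, used, free = shutil.disk_usage('/path/to/nico-dict')
print('free: {:.1f} GiB'.format(free / 2**30))
```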
https://www.nii.ac.jp/dsc/idr/nico/nicopedia-apply.html
Apply from here. Once you apply, you will receive the download URL within a few days at the earliest, so keep it safe.
Download the data from that URL and extract it so that the layout looks like this.
.
└── nico-dict
└── zips
├── download.txt
├── head
│ ├── head2008.csv
│ ├── ...
│ └── head2014.csv
├── head.zip
├── res
│ ├── res2008.csv
│ ├── ...
│ └── res2014.csv
├── res.zip
├── rev2008.zip
├── rev2009
│ ├── rev200901.csv
│ ├── rev200902.csv
│ ├── rev200903.csv
│ ├── ...
│ └── rev200912.csv
├── rev2009.zip
├──...
├── rev2013.zip
├── rev2014
│ ├── rev201401.csv
│ └── rev201402.csv
└── rev2014.zip
I originally used Clojure (Lisp) for the analysis because of its lazy evaluation and ease of preprocessing, but I also built an HTML -> JSON tool that does as little processing as possible so that the data can be analyzed with Python.
https://github.com/MokkeMeguru/niconico-parser
Clone it:
git clone https://github.com/MokkeMeguru/niconico-parser
https://github.com/MokkeMeguru/niconico-parser/blob/master/resources/preprocess.sh
Copy this script to zips/preprocess.sh and run:
sh preprocess.sh
This script converts the CSV escaping to the modern specification. (Back story: I have tested this processing quite a bit, but there may still be a bug lurking. If you find one, please leave a comment.)
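I have not dug into the exact escaping rules beyond what the script fixes, but if you want to peek at a raw file from Python first, here is a minimal sketch, assuming the "special" part is backslash-escaped quotes rather than RFC 4180 style doubled quotes (adjust the path to your layout):

```python
import csv

# Assumption: the raw CSV escapes quotes with backslashes instead of doubling them
with open('zips/head/head2008.csv', encoding='utf-8', newline='') as f:
    reader = csv.reader(f, escapechar='\\', doublequote=False)
    for i, row in enumerate(reader):
        print(row)
        if i >= 4:  # only preview the first few rows
            break
```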
The Nico Nico Pedia dataset can be broadly divided into:
1. article header data (head)
2. article bodies as HTML (rev)
3. comments on the articles (res)
Of these, 1. is small enough to turn into a database easily, so let's build one.
The required files are https://github.com/MokkeMeguru/niconico-parser/blob/master/resources/create-table.sql and https://github.com/MokkeMeguru/niconico-parser/blob/master/resources/import-dir.sh.
Place these so that they sit at zips/head/<file>, then run:
sh import-dir.sh
You will then get an sqlite3 database called headers.db.
Let's try accessing it.
sqlite3 headers.db
sqlite> select * from article_header limit 10
   ...> ;
1|Nico Nico Pedia|Nico Nico Daihakka|a|20080512173939
4|curry|curry|a|20080512182423
5|I asked Hatsune Miku to sing the original song "You have flowers and I sing".|\N|v|20080719234213
9|Go Go Curry|Go Go Curry|a|20080512183606
13|Authentic Gachimuchi Pants Wrestling|\N|v|20080513225239
27|The head is pan(P)┗(^o^ )┓3|\N|v|20080529215132
33|[Hatsune Miku] "A little fun time report" [Arranged song]|\N|v|20080810020937
37|【 SYNC.ART'S × U.N.Is Owen her? ] -Sweets Time-|\N|v|20080616003242
46|Nico Nico Douga Meteor Shower|\N|v|20080513210124
47|I made a high potion.|\N|v|20090102150209
The results definitely have that Nico Nico Pedia flavor, and it feels like you could pull knowledge out of this that Wikipedia does not have.
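If you would rather query it from Python than from the sqlite3 shell, here is a minimal sketch using the database and table names from above:

```python
import sqlite3

# Query the header database built by import-dir.sh
conn = sqlite3.connect('headers.db')
for row in conn.execute('SELECT * FROM article_header LIMIT 10'):
    print(row)
conn.close()
```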
HTML->JSON!
One of the big problems with Nico Nico Pedia articles is the sheer number of odd tags.
Unlike Wikipedia, the pages are full of <br> tags and <span> tags used purely for formatting, and speaking from personal experience I had a lot of trouble extracting the sentences from them.
(~~Also, the ASCII art is almost always broken. Please introduce a dedicated tag for AA...~~)
The easiest way to parse HTML is to use a DSL (domain-specific language). A well-known example is Kotlin's HTML parsing tool.
This time I processed it simply using Lisp. As for the detailed code... well, ()...
lein preprocess-corpus -r /path/to/nico-dict/zips
Run it like this. (See the repository for how to run the jar directly and to report bugs.) It takes about 10 to 15 minutes and eats up roughly 20 to 30 GB of disk.
Let's take a quick look at the contents.
head -n 1 rev2008-jsoned.csv
1,"{""type"":""element"",""attrs"":null,""tag"":""body"",""content"":[{""type"":""element"",""attrs"":null,""tag"":""h2"",""content"":[""Overview""]},{""type"":""element"",""attrs"":null,""tag"":""p"",""content"":[""What is Nico Nico Pedia?(abridgement)Is.""]}]}",200xxxxxxxx939,[],Nico Nico Pedia,Nico Nico Daihakka,a
To explain the items one at a time:
- article_id: the article ID
- article: the article body converted from HTML to JSON
- update-date: the date the article was last updated
- links: the links (<a> tags) contained within the page
- title: the article title
- title_yomi: the reading of the title
- category: the article category
I cannot really show the full effect of the JSON conversion + preprocessing here, but for example it becomes easier to handle things like <p>hoge<span/>hoge<br/>bar</p>, easier to turn an article into a graph, and easier to apply tools such as Snorkel.
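For example, once an article is in this JSON tree form, pulling out the plain text is just a small recursive walk over the nodes. A minimal sketch based on the node structure shown above:

```python
import json

def extract_text(node) -> str:
    """Recursively collect the text inside a JSON-converted article node."""
    if isinstance(node, str):
        return node
    if isinstance(node, dict):
        # 'content' may be null for empty tags such as <br/>
        return ''.join(extract_text(child) for child in node.get('content') or [])
    return ''

# Tiny example using the structure from the head output above
article = json.loads(
    '{"type":"element","attrs":null,"tag":"body","content":'
    '[{"type":"element","attrs":null,"tag":"h2","content":["Overview"]}]}')
print(extract_text(article))  # => Overview
```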
So, I made a preprocessing tool! That alone is not very interesting, though, so let's compute some statistics. When it comes to data wrangling the usual answer seems to be Python + Pandas, so I will investigate with Python + Pandas. (However, Pandas is quite heavy and slow, so please use another tool for any serious analysis.)
The steps below assume an environment like a Jupyter Notebook.
import pandas as pd
import json
from pathlib import Path
from pprint import pprint
from typing import List
Change these for your environment.
############################
# Global variables (change as appropriate) #
############################
# CSV header
header_name = ('article_id', 'article', 'update-date',
               'links', 'title', 'title_yomi', 'category')
dtypes = {'article_id': 'int64',
'article': 'object',
'update-date': 'object',
'links': 'object',
'title': 'object',
'title_yomi': 'object',
'category': 'object'
}
#Sample CSV
sample_filepath = "/home/meguru/Documents/nico-dict/zips/rev2014/rev201402-jsoned.csv"
sample_filepath = Path(sample_filepath)
#Sample CSVs
fileparent = Path("/home/meguru/Documents/nico-dict/zips")
filepaths = [
"rev2014/rev201401-jsoned.csv",
"rev2014/rev201402-jsoned.csv",
"rev2013/rev201301-jsoned.csv",
"rev2013/rev201302-jsoned.csv",
"rev2013/rev201303-jsoned.csv",
"rev2013/rev201304-jsoned.csv",
"rev2013/rev201305-jsoned.csv",
"rev2013/rev201306-jsoned.csv",
"rev2013/rev201307-jsoned.csv",
"rev2013/rev201308-jsoned.csv",
"rev2013/rev201309-jsoned.csv",
"rev2013/rev201310-jsoned.csv",
"rev2013/rev201311-jsoned.csv",
"rev2013/rev201312-jsoned.csv",
]
filepaths = filter(lambda path: path.exists(), map(
lambda fpath: fileparent / Path(fpath), filepaths))
##################
def read_df(csvfile: Path, with_info: bool = False):
    """Read a single jsoned.csv file.

    args:
    - csvfile: Path
        the file path you want to read
    - with_info: bool
        also print the data frame's info
    returns:
    - df
        the loaded data frame
    notes:
        calling this function prints a short log message
    """
    df = pd.read_csv(csvfile, names=header_name, dtype=dtypes)
    print('[Info] read a file {}'.format(csvfile))
    if with_info:
        df.info()
    return df
def read_dfs(fileparent: Path, csvfiles: List[Path]):
    """Read and concatenate several jsoned.csv files.

    args:
    - fileparent: Path
        the parent directory of the files you want to read
    - csvfiles: List[Path]
        the file paths you want to read
    returns:
    - dfl
        the concatenated data frame
    note:
        given
            fileparent = "/path/to"
            csvfiles[0] = "file"
        the file is read from "/path/to/file"
    """
    dfl = []
    for fpath in csvfiles:
        dfi = pd.read_csv(fileparent / fpath,
                          index_col=None, names=header_name, dtype=dtypes)
        dfl.append(dfi)
    dfl = pd.concat(dfl, axis=0, ignore_index=True)
    return dfl
Now, let's look at how the links (<a> tags) in the HTML are distributed across the different types of article.
df = read_df(sample_filepath, True)
# [Info] read a file /home/meguru/Documents/nico-dict/zips/rev2014/rev201402-jsoned.csv
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 6499 entries, 0 to 6498
# Data columns (total 7 columns):
# article_id 6499 non-null int64
# article 6499 non-null object
# update-date 6499 non-null int64
# links 6499 non-null object
# title 6491 non-null object
# title_yomi 6491 non-null object
# category 6491 non-null object
# dtypes: int64(2), object(5)
# memory usage: 355.5+ KB
For a start, we can confirm that this one file alone contains about 6.5k articles.
Next, parse the JSON link information to count the links per article.
# Check the raw data
df['links'][0]
# => '[{"type":"element","attrs":{"href":"http://wwwxxxxhtml"},"tag":"a","content":["Kochi xxxx site"]}]'
dfs = pd.DataFrame()
dfs['links'] = df['links'].map(lambda x: len(json.loads(x)))
dfs['links'][0]
# => 1
Let's compute some quick statistics.
dfs['category'] = df['category']
dfsg = dfs.groupby('category')
dfsg.describe()
# links
# count mean std min 25% 50% 75% max
# category
# a 5558.0 41.687298 209.005652 0.0 0.0 2.0 11.00 2064.0
# c 36.0 54.305556 109.339529 0.0 2.0 2.0 38.25 376.0
# i 4.0 7.500000 5.507571 2.0 3.5 7.0 11.00 14.0
# l 786.0 22.760814 106.608535 0.0 0.0 2.0 9.00 1309.0
# v 107.0 32.887850 46.052744 0.0 3.0 11.0 37.00 153.0
"a" = word "v" = video "i" = product "l" = live broadcast "c" = community article, so on average there are many ** community article links **. However, if you look at the median and maximum values, you can observe that it seems necessary to look at (classify) the word articles in more detail.
6k articles aren't enough, so let's increase the data.
dfl = read_dfs(fileparent, filepaths)
# >>> article_id article ... title_yomi category
# 0 8576 {"type":"element","attrs":null,"tag":"body","c... ...Kabekick a
# [223849 rows x 7 columns]
dfls = pd.DataFrame()
dfls['links'] = dfl['links'].map(lambda x: len(json.loads(x)))
dfls['category'] = dfl['category']
dflsg = dfls.groupby('category')
dflsg.describe()
# links
# count mean std min 25% 50% 75% max
# category
# a 193264.0 32.400566 153.923988 0.0 0.0 2.0 10.0 4986.0
# c 1019.0 34.667321 77.390967 0.0 1.0 2.0 34.0 449.0
# i 247.0 6.137652 6.675194 0.0 1.0 3.0 10.0 28.0
# l 24929.0 20.266477 100.640253 0.0 0.0 1.0 5.0 1309.0
# v 3414.0 14.620387 22.969974 0.0 1.0 6.0 16.0 176.0
Compared with the single-file sample, you can see that the mean link counts of live broadcast and video articles have swapped order as the video mean dropped. You can also confirm, just as with the single sample, that the spread in the number of links for word articles is very large. Another counter-intuitive point is that **for word articles even the third quartile is below the mean**.
From these results it is clear that, at the very least, the number of links varies considerably with the article type, so it seems better to observe the properties of each article type individually before studying them further. (How to turn this into results is left to the reader.)
The experiment above shows that the variance is large, especially for word articles. Based on **my experience and intuition from regularly reading Nico Nico Pedia**, my guess at the cause is a correlation between article size and the number of links. So let's treat the number of characters in the JSON-converted data as the article size and check the correlation.
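The code that builds dfts is not shown here; a minimal sketch, assuming article_size is simply the character count of the JSON-converted article column, would be:

```python
# Assumption: article_size = number of characters in the JSON-converted article
dfts = dfls.copy()                               # 'links' and 'category' computed above
dfts['article_size'] = dfl['article'].str.len()
```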
dfts.corr()
# links article_size
# links 1.000000 0.713465
# article_size 0.713465 1.000000
Well, at least there seems to be a strong positive correlation.
Breaking it down a little further by category looks like this.
#About word articles
dfts[dfts['category'] == "a"].loc[:, ["links", "article_size"]].corr()
# links article_size
# links 1.000000 0.724774
# article_size 0.724774 1.000000
#About community articles
dfts[dfts['category'] == "c"].loc[:, ["links", "article_size"]].corr()
# links article_size
# links 1.00000 0.63424
# article_size 0.63424 1.00000
#About product articles
dfts[dfts['category'] == "i"].loc[:, ["links", "article_size"]].corr()
# links article_size
# links 1.000000 0.254031
# article_size 0.254031 1.000000
#About live broadcast articles
dfts[dfts['category'] == "l"].loc[:, ["links", "article_size"]].corr()
# links article_size
# links 1.00000 0.58073
# article_size 0.58073 1.00000
#About video articles
dfts[dfts['category'] == "v"].loc[:, ["links", "article_size"]].corr()
# links article_size
# links 1.000000 0.428443
# article_size 0.428443 1.000000
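The per-category blocks above can also be written as a single loop over the categories, which computes the same correlations more compactly:

```python
# Same per-category correlation between link count and article size, as a loop
for cat in ['a', 'c', 'i', 'l', 'v']:
    sub = dfts[dfts['category'] == cat]
    print(cat, sub['links'].corr(sub['article_size']))
```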
News: I have also developed a CLI for parsing articles published on the Web.
lein parse-from-web -u https://dic.nicovideo.jp/a/<contents-title>
You can get JSON-converted article data like this. See the repository for an example of fetching an article.
However, this **puts load on the remote server**, so please only use it for things like briefly trying the tool out. Whatever you do, please do not imitate carpet-bombing scraping from your university's IP.