This is a personal memo. I have kept it as short as possible so that the file can be produced quickly.
Collect Wikipedia redirects and create a file like the one below.
{"src": "COVID-19", "dst": "New Coronavirus Infection_(2019)"}
{"src": "COVID-2019", "dst": "New Coronavirus Infection_(2019)"}
{"src": "Covid-19", "dst": "New Coronavirus Infection_(2019)"}
{"src": "Covid-2019", "dst": "New Coronavirus Infection_(2019)"}
{"src": "New Coronavirus Infection", "dst": "New Coronavirus Infection_(2019)"}
{"src": "Covid 19", "dst": "New Coronavirus Infection_(2019)"}
{"src": "COVID19", "dst": "New Coronavirus Infection_(2019)"}
{"src": "2019 New Coronavirus Infection", "dst": "New Coronavirus Infection_(2019)"}
For details on redirects, see [Wikipedia: Redirect](https://ja.wikipedia.org/wiki/Wikipedia:リダイレクト).
For example, when you access https://ja.wikipedia.org/wiki/COVID-19, you are automatically forwarded to https://ja.wikipedia.org/wiki/新型コロナウイルス感染症_(2019年), the article on New Coronavirus Infection (2019).
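Once such a file exists, it can be used as a simple lookup table. The following is a minimal sketch, not part of the original script: the sample records are illustrative, and the names `load_redirects` and `resolve` are made up for this example.

```python
import json

# Sample records in the same shape as the output file (illustrative data only)
SAMPLE = """\
{"src": "COVID-19", "dst": "New Coronavirus Infection_(2019)"}
{"src": "Covid-19", "dst": "New Coronavirus Infection_(2019)"}
"""


def load_redirects(lines):
    """Build a src -> dst dictionary from an iterable of JSON Lines strings."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        table[record["src"]] = record["dst"]
    return table


def resolve(table, title):
    """Return the redirect destination, or the title itself if it is not a redirect."""
    return table.get(title, title)


redirects = load_redirects(SAMPLE.splitlines())
print(resolve(redirects, "COVID-19"))
print(resolve(redirects, "Tokyo"))
```

To read the real output instead, passing an open file object to `load_redirects` should work the same way, since it only needs an iterable of lines.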
Install mysqlclient, the Python library used to connect to MySQL.
$ pip install mysqlclient
Download the following two dump files from the page below.
https://dumps.wikimedia.org/jawiki/
- jawiki-[dump acquisition date]-redirect.sql.gz
- jawiki-[dump acquisition date]-page.sql.gz
$ gunzip jawiki-[dump acquisition date]-redirect.sql.gz
$ gunzip jawiki-[dump acquisition date]-page.sql.gz
$ mysql -u [user name] -p [DB name] < jawiki-[dump acquisition date]-page.sql
$ mysql -u [user name] -p [DB name] < jawiki-[dump acquisition date]-redirect.sql
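If `gunzip` is not available, the decompression step can probably be done with the Python standard library instead. This is only a sketch: the function name `gunzip_file` is an assumption for this example, and unlike `gunzip` it leaves the original `.gz` file in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path):
    """Decompress `name.gz` next to itself as `name` and return the new path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # drop the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copy in chunks; the dumps can be large
    return out_path
```

It would be called once per dump file, for example `gunzip_file("jawiki-[dump acquisition date]-page.sql.gz")`, before running the `mysql` import commands.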
The following script, saved as extract_redirects.py, queries the database for redirects and writes them out as JSON, one record per line.
import json

import MySQLdb
import MySQLdb.cursors

USERNAME = "[MySQL user name]"
PASSWORD = "[password]"
DB_NAME = "[DB name]"
OUTPUT = "./redirects.json"


def save_jsonl(file_path, data):
    # Write one JSON object per line; ensure_ascii=False keeps non-ASCII titles readable
    dumps = (json.dumps(d, ensure_ascii=False) for d in data)
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("\n".join(dumps))


if __name__ == '__main__':
    # Connect to the database
    conn = MySQLdb.connect(
        user=USERNAME,
        passwd=PASSWORD,
        host='localhost',
        db=DB_NAME
    )
    # Create a cursor and execute the query
    cur = conn.cursor(MySQLdb.cursors.DictCursor)
    # Join each redirect to its source page through the page ID
    sql = "select page.page_title, redirect.rd_title from page, redirect where redirect.rd_from=page.page_id"
    cur.execute(sql)
    rows = cur.fetchall()
    # Organize the results; values returned as bytes are decoded to str
    redirects = []
    for row in rows:
        row = {key: cell.decode() if isinstance(cell, bytes) else cell for key, cell in row.items()}
        redirects.append({
            "src": row["page_title"],
            "dst": row["rd_title"]
        })
    # Save
    save_jsonl(OUTPUT, redirects)
    cur.close()
    conn.close()
$ python extract_redirects.py
That's all!
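The script collects every row in memory before writing the file. If that ever becomes a concern, a streaming variant could look like the sketch below. The function name `write_jsonl` and the sample rows are hypothetical. It accepts any iterable of row dictionaries, so a database cursor that yields rows one at a time could presumably be passed in place of the sample list.

```python
import io
import json


def write_jsonl(rows, fp):
    """Write each row as one JSON line to an open text file, one row at a time."""
    count = 0
    for row in rows:
        # page_title / rd_title may arrive as bytes; decode them to str
        src, dst = (v.decode() if isinstance(v, bytes) else v
                    for v in (row["page_title"], row["rd_title"]))
        fp.write(json.dumps({"src": src, "dst": dst}, ensure_ascii=False) + "\n")
        count += 1
    return count


# Stand-in for database rows (hypothetical sample data)
sample_rows = [
    {"page_title": b"COVID-19", "rd_title": b"New_Coronavirus_Infection"},
    {"page_title": "Covid-19", "rd_title": "New_Coronavirus_Infection"},
]
buffer = io.StringIO()
written = write_jsonl(sample_rows, buffer)
print(written)
print(buffer.getvalue(), end="")
```

In real use, `fp` would be a file opened with `open(path, "w", encoding="utf-8")` rather than an in-memory buffer.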
Each Wikipedia page has a page_id in addition to its title. In jawiki-[dump acquisition date]-redirect.sql.gz, each record links the page_id of a redirect source to the title of its redirect destination. In jawiki-[dump acquisition date]-page.sql.gz, each record links a page_id to its title. Joining the two dumps on page_id therefore links each redirect source title to its redirect destination title.
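The join can be tried on a toy scale without MySQL. The sketch below uses an in-memory SQLite database as a stand-in. The tables are reduced to the columns used by the query and the rows are invented, so it only illustrates the idea and does not reproduce the real dump schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Simplified stand-ins for the two dump tables (only the columns used here)
cur.execute("create table page (page_id integer primary key, page_title text)")
cur.execute("create table redirect (rd_from integer, rd_title text)")
cur.executemany("insert into page values (?, ?)", [
    (1, "COVID-19"),
    (2, "Covid-19"),
    (3, "New_Coronavirus_Infection"),
])
cur.executemany("insert into redirect values (?, ?)", [
    (1, "New_Coronavirus_Infection"),
    (2, "New_Coronavirus_Infection"),
])
# Same join condition as the extraction script: redirect.rd_from = page.page_id
cur.execute(
    "select page.page_title, redirect.rd_title "
    "from page, redirect where redirect.rd_from = page.page_id "
    "order by page.page_id"
)
pairs = cur.fetchall()
conn.close()
print(pairs)
```

Page 3 has no row in the redirect table, so it does not appear as a source. Only pages whose page_id matches an rd_from value come out of the join.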