When you work on machine learning and similar tasks, you need to prepare training data. Ideally you would use real data, but it is often hard to obtain, or there is simply not enough of it. In those cases, a common approach is to generate dummy data to make up the shortfall.
This time, I will generate various kinds of dummy data with Python's faker library.
The environment is Google Colaboratory. The Python version is shown below.
import platform
print("python " + platform.python_version())
# python 3.6.9
The faker library, which generates the dummy data, also has to be installed beforehand.
pip install faker
Now let's write the code.
First, import faker and create a generator. Passing 'ja_JP' as the locale makes it produce Japanese data.
from faker import Faker
fake = Faker('ja_JP')
First, let's create dummy address data. Here I display five values. (The outputs in the comments are shown in English translation; the actual values are Japanese text.)
[fake.address() for _ in range(5)]
# ['824 Palace Shibadaimon, 36-21-10 Koiri, Takatsu-ku, Kawasaki City, Miyagi Prefecture',
# '17-11-18 Nishikanda, Nakano-ku, Kagawa Prefecture Otagaya Crest 528',
# '6-10-4 Tokorono, Katsushika-ku, Hiroshima',
# '24-17-14 Chouka, Saiwai-ku, Kawasaki-shi, Kumamoto Heights Yugu 667',
# '34-12-7 Kuramae, Inba-mura, Inba-gun, Oita Prefecture Corp. Momura 228']
You can create address data with fake.address().
You can also generate the individual parts of an address.
#Prefectures
[fake.prefecture() for _ in range(5)]
# ['Okinawa Prefecture', 'Kyoto', 'Tochigi Prefecture', 'Saga Prefecture', 'Hiroshima Prefecture']
#Municipality
[fake.city() for _ in range(5)]
# ['Naka-ku, Yokohama', 'Hamura City', 'Toshima Village', 'Mitaka City', 'Miyakejima Miyake Village']
#Area name
[fake.town() for _ in range(5)]
# ['Satte', 'Tsurugaoka', 'Nishikawa', 'Iriya', 'Haneoricho']
#Building name
[fake.building_name() for _ in range(5)]
# ['Sharm', 'Court', 'Sharm', 'Park', 'Urban']
Next, let's create dummy name data. Names can be generated in kanji, katakana, and romaji.
First, let's create names in kanji.
#Name (Kanji)
[fake.name() for _ in range(5)]
# ['Chiyo Nakatsugawa', 'Yuta Wakamatsu', 'Kaori Kudo', 'Kana Uno', 'Yoko Hirokawa']
#Name (Kanji, male)
[fake.name_male() for _ in range(5)]
# ['Ryohei Sasaki', 'Atsushi Sato', 'Shota Sasaki', 'Kenichi Kato', 'Ryohei Aoyama']
#Name (Kanji, female)
[fake.name_female() for _ in range(5)]
# ['Akemi Inoue', 'Kaori Matsumoto', 'Tomomi Wakamatsu', 'Haruka Takahashi', 'Hanako Sugiyama']
#Surname (Kanji)
[fake.last_name() for _ in range(5)]
# ['Matsumoto', 'Kondo', 'Fujimoto', 'Murayama', 'Kato']
#First name (kanji)
[fake.first_name() for _ in range(5)]
# ['Minoru', 'Rei', 'Hanako', 'Ryosuke', 'Kaori']
#First name (Kanji, male)
[fake.first_name_male() for _ in range(5)]
# ['Hiroki', 'Naoto', 'Atsushi', 'Naoki', 'Akira']
#First name (kanji, female)
[fake.first_name_female() for _ in range(5)]
# ['Mai', 'Mikako', 'Tomomi', 'Akemi', 'Akemi']
Next, let's create names in katakana.
#Name (Katakana)
[fake.kana_name() for _ in range(5)]
# ['Yui Ogaki', 'Harada Takuma', 'Nakamura Tsubasa', 'Yamada Sayuri', 'Tsuchiya Sotaro']
#Surname (Katakana)
[fake.last_kana_name() for _ in range(5)]
# ['Miyake', 'Kanou', 'Kudo', 'Harada', 'Aota']
#First name (katakana)
[fake.first_kana_name() for _ in range(5)]
# ['Maaya', 'Naoko', 'Miki', 'Kenichi', 'Yasuhiro']
#First name (katakana, male)
[fake.first_kana_name_male() for _ in range(5)]
# ['Manabu', 'Manabu', 'Yasuhiro', 'Kenichi', 'Atsushi']
#First name (katakana, female)
[fake.first_kana_name_female() for _ in range(5)]
# ['Tomomi', 'Sayuri', 'Asuka', 'Tsubasa', 'Yui']
Finally for names, let's create the romanized versions.
#Name (in romaji)
[fake.romanized_name() for _ in range(5)]
# ['Akira Nakamura','Ryosuke Yamada','Yui Takahashi','Maaya Ogaki','Mituru Fujimoto']
#Surname (Roman alphabet)
[fake.last_romanized_name() for _ in range(5)]
# ['Tsuda', 'Tsuchiya', 'Yamada', 'Nakatsugawa', 'Nakamura']
#First name (romaji)
[fake.first_romanized_name() for _ in range(5)]
# ['Mai', 'Manabu', 'Nanaka', 'Kenichi', 'Taro']
#First name (Romaji, male)
[fake.first_romanized_name_male() for _ in range(5)]
# ['Tomoya', 'Hiroshi', 'Taichi', 'Mituru', 'Manabu']
#First name (romaji, female)
[fake.first_romanized_name_female() for _ in range(5)]
# ['Haruka', 'Maaya', 'Kaori', 'Kumiko', 'Yoko']
So far we have created address and name data, but faker can generate other kinds of data as well. Here are a few examples.
#company name
[fake.company() for _ in range(5)]
# ['Harada Gas Co., Ltd.', 'Sasada Mining Co., Ltd.', 'Miyake Gas Co., Ltd.', 'Kudo Construction Co., Ltd.', 'Kobayashi Fisheries Co., Ltd.']
#industry
[fake.company_category() for _ in range(5)]
# ['gas', 'printing', 'Bank', 'Food', 'insurance']
#Profession
[fake.job() for _ in range(5)]
# ['Bus guide', 'Esthetician', 'Wedding planner', 'fortune teller', 'pharmacist']
#word
[fake.word() for _ in range(5)]
# ['weave', 'College', 'Performer', 'today', 'To modernize']
This time, I used Python's faker library to create various kinds of dummy data.
When preparing data for machine learning and similar tasks, real data alone is often not enough. In those cases, dummy data can be useful.
Beyond what is introduced here, faker can generate many other kinds of dummy data. For details, please refer to the following page: https://faker.readthedocs.io/en/master/locales/ja_JP.html