Home page: https://program-board.com
This time I wanted to play with various kinds of sample data. There are APIs that make this easy, but they offer little freedom, so I decided to generate the data myself.
Install the modules. Here we use pandas, numpy, and a module called faker, which makes it easy to generate fake data.
pip install faker
This section covers the basic usage of faker. Faker('ja_JP') sets up a generator for Japanese fake data; after that, call whichever generator methods you like. Although it is in English, the following site lists the data that Faker can generate. https://www.nblog09.com/w/2019/01/24/python-faker/
Here, a name, prefecture, company, date of birth, and occupation are generated.
from faker import Faker
#Japanese fake data settings
fakegen = Faker('ja_JP')
print(fakegen.name()) #Name
print(fakegen.prefecture()) #Prefecture
print(fakegen.company()) #Company
print(fakegen.date_of_birth()) #Birthday
print(fakegen.job()) #Profession
First, generate fake data for name, date of birth, address (prefecture), occupation, and company.
import numpy as np
import pandas as pd
from numpy.random import *
from faker import Faker
#Japanese fake data settings
fakegen = Faker('ja_JP')
faker_list = []
for i in range(1000):
    name = fakegen.name() #Name
    pref = fakegen.prefecture() #Prefecture
    company = fakegen.company() #Company
    birth = fakegen.date_of_birth() #Birthday
    job = fakegen.job() #Profession
    faker_list.append([name,pref,company,birth,job])
#Convert to a data frame
df = pd.DataFrame(faker_list,columns=['name','Street address','Company','Birthday','Profession'])
The base data can be created this easily.
Next, generate fake age data. Here numpy draws random integers uniformly from 15 to 84 (the upper limit passed to randint is exclusive). To leave some missing values, only 800 values are generated for the 1000 rows. The rows are also shuffled afterward so that, when missing values are generated for the other items, they do not all fall on the same rows.
age_list = randint(15,85,800) #Uniform integers (lower limit, upper limit (exclusive), number of samples)
#Add to the data frame (rows without a value become NaN)
df['age'] = pd.DataFrame(age_list)
#Shuffle the rows, then renumber the index; pandas aligns new columns by index label,
#so without the reset the next column's missing values would land on the same rows
df = df.sample(frac=1).reset_index(drop=True)
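One thing to watch here: pandas lines up an assigned column by index label, not by row position, so shuffling only moves the next column's missing values if the index is renumbered afterward. Below is a minimal sketch of that behavior on a 10-row frame. The helper `add_partial_column`, the frame `df_demo`, and the seeds are hypothetical names chosen for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

def add_partial_column(df, name, values, seed):
    """Hypothetical helper: fill the first len(values) rows with `values`,
    leave the rest as NaN, then shuffle the rows and renumber the index."""
    out = df.copy()
    out[name] = pd.Series(values)          # aligned on index labels 0..len(values)-1
    out = out.sample(frac=1, random_state=seed)
    return out.reset_index(drop=True)      # without this, the old labels are kept

df_demo = pd.DataFrame({'id': range(10)})
df_demo = add_partial_column(df_demo, 'a', np.arange(8), seed=0)
df_demo = add_partial_column(df_demo, 'b', np.arange(6), seed=1)

# Original ids of the rows that are missing in each column
missing_a = set(df_demo.loc[df_demo['a'].isna(), 'id'].tolist())
missing_b = set(df_demo.loc[df_demo['b'].isna(), 'id'].tolist())
print(missing_a, missing_b)
print(df_demo.isna().sum())
```

Because the index is renumbered after each shuffle, the rows missing in `b` are whichever rows ended up in the last four positions, rather than always the rows with the highest original labels.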
Next, generate fake data for height and annual income. Here the mean and standard deviation are chosen arbitrarily, and numpy draws the data from a normal distribution.
#Height
height_list = normal(170,6,900) #Normal distribution (mean, standard deviation, number of samples)
height_list = np.round(height_list,decimals=0) #Round to whole numbers
#Add to the data frame (rows without a value become NaN)
df['height'] = pd.DataFrame(height_list)
#Shuffle the rows and renumber the index
df = df.sample(frac=1).reset_index(drop=True)
#Annual income
income_list = normal(400,8,850) #Normal distribution (mean, standard deviation, number of samples)
income_list = np.round(income_list,decimals=0) #Round to whole numbers
#Add to the data frame (rows without a value become NaN)
df['annual income'] = pd.DataFrame(income_list)
#Shuffle the rows and renumber the index
df = df.sample(frac=1).reset_index(drop=True)
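Note that the second argument of numpy's normal is the standard deviation, not the variance. The following minimal sketch checks this on the height parameters used above. It swaps in `np.random.default_rng` with a fixed seed so the result is repeatable; that generator, the seed, and the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, used here only for repeatability

# loc is the mean, scale is the standard deviation
height = rng.normal(loc=170, scale=6, size=900)
height = np.round(height)       # round to whole numbers

sample_mean = height.mean()
sample_std = height.std()
print(sample_mean, sample_std)

# If a variance is what you have in mind, pass its square root as scale
variance = 36
height_from_var = rng.normal(loc=170, scale=np.sqrt(variance), size=900)
```

The sample mean and standard deviation should come out close to 170 and 6, which is a quick way to confirm the parameters mean what you expect.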
For the fake marital status data, numpy's np.random.choice draws from a prepared list of options with fixed probabilities.
#Marital status
mariage = ['Unmarried','married','Divorce'] #Choices
Weight = [0.4,0.3,0.3] #Probability of each choice (must sum to 1)
mariage_list = np.random.choice(mariage,700,p=Weight) #np.random.choice(choices, number of samples, probabilities)
#Add to the data frame (rows without a value become NaN)
df['marriage'] = pd.DataFrame(mariage_list)
#Shuffle the rows and renumber the index
df = df.sample(frac=1).reset_index(drop=True)
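To confirm that the sampled categories roughly follow the requested probabilities, the observed proportions can be compared with the weights. This is a minimal sketch using the same choices and weights as above. The seeded `default_rng` generator and the names `choices`, `weights`, `sample`, and `observed` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, used here only for repeatability

choices = ['Unmarried', 'married', 'Divorce']
weights = [0.4, 0.3, 0.3]

# Draw 700 labels with the given probabilities
sample = rng.choice(choices, size=700, p=weights)

# Share of each label among the 700 draws
observed = pd.Series(sample).value_counts(normalize=True)
print(observed)
```

The observed shares will not match the weights exactly, but with 700 draws they should be within a few percentage points.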
This completes the fake data for practicing data analysis. When generating data for analysis practice, however, be sure to check the type and parameters of each distribution before running the generation code.
References:
https://www.nblog09.com/w/2019/01/24/python-faker/
https://qiita.com/ogamiki/items/4821173ca713a6b77510