Easy sample data creation procedure

Home page: https://program-board.com

Motivation

I wanted to experiment with a variety of sample data. There are APIs that generate such data easily, but they offer little freedom over the result, so I decided to generate the data myself.

Modules

Install the modules. Here we use pandas, numpy, and faker, a module that makes it easy to generate fake data.

pip install faker pandas numpy

About faker

This section covers the basic usage of faker. Faker('ja_JP') creates a generator that produces Japanese fake data; after that, call the method for whichever kind of data you want. Although it is in English, the following site lists the data that Faker can generate. https://www.nblog09.com/w/2019/01/24/python-faker/

Here, a name, prefecture, company, date of birth, and occupation are generated.

# Import Faker
from faker import Faker

# Japanese fake data settings
fakegen = Faker('ja_JP')

print(fakegen.name())           # Name
print(fakegen.prefecture())     # Prefecture
print(fakegen.company())        # Company
print(fakegen.date_of_birth())  # Date of birth
print(fakegen.job())            # Occupation

Fake data generation

First, generate fake data for the name, date of birth, address (prefecture), occupation, and company.

import numpy as np
import pandas as pd
from numpy.random import randint, normal  # used in the sections below
from faker import Faker

# Japanese fake data settings
fakegen = Faker('ja_JP')
faker_list = []
for _ in range(1000):
    name = fakegen.name()            # Name
    pref = fakegen.prefecture()      # Prefecture
    company = fakegen.company()      # Company
    birth = fakegen.date_of_birth()  # Date of birth
    job = fakegen.job()              # Occupation
    faker_list.append([name, pref, company, birth, job])

# Build the data frame ('Street address' holds the prefecture)
df = pd.DataFrame(faker_list, columns=['name', 'Street address', 'Company', 'Birthday', 'Profession'])

You can easily create it like this.

Age

Generate fake age data. Here, numpy draws uniformly distributed random integers from 15 to 84 (randint excludes the upper bound, 85). To leave some missing values, only 800 values are created for the 1,000 rows. The rows are then shuffled so that the missing values of the items added later do not fall on the same rows.

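age_list = randint(15, 85, 800)  # Uniform integers: randint(low, high, size); high is exclusive
# Add to the data frame; rows without a value become NaN
df['age'] = pd.Series(age_list)
# Shuffle the rows. reset_index is needed because pandas aligns on index
# labels; without it the next column would leave the same rows empty.
df = df.sample(frac=1).reset_index(drop=True)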
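pandas matches rows by index label when a Series is assigned to a column, which is why a shorter array leaves NaN in the remaining rows. The small sketch below uses toy data (10 rows, illustrative values) to show that effect, and why the index has to be reset after a shuffle if the gaps of the next column should land on different rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': list('abcdefghij')})  # 10 toy rows

# Only 8 values for 10 rows: index labels 8 and 9 get NaN.
df['age'] = pd.Series(np.arange(20, 28))

# sample() shuffles the rows but keeps their original labels,
# so reset the index to make labels follow the new row order.
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

# 7 values: the last three rows of the shuffled frame get NaN.
df['height'] = pd.Series(np.arange(160, 167))

print(df)
print(df.isna().sum())
```

Without reset_index(drop=True), the second assignment would again fill labels 0 to 6, so the rows left empty by the first column would be left empty by the second one too.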

Annual income / height

Generate fake data for annual income and height. Here, the mean and standard deviation are chosen arbitrarily, and numpy draws the data from a normal distribution. Note that the second argument of normal is the standard deviation, not the variance.

# Height
height_list = normal(170, 6, 900)  # Normal distribution: normal(mean, standard deviation, size)
height_list = np.round(height_list, decimals=0)  # Round to whole numbers
# Add to the data frame; rows without a value become NaN
df['height'] = pd.Series(height_list)
# Shuffle the rows and reset the index so the next gaps fall on other rows
df = df.sample(frac=1).reset_index(drop=True)

# Annual income
income_list = normal(400, 8, 850)  # Normal distribution: normal(mean, standard deviation, size)
income_list = np.round(income_list, decimals=0)  # Round to whole numbers
# Add to the data frame; rows without a value become NaN
df['annual income'] = pd.Series(income_list)
# Shuffle the rows and reset the index so the next gaps fall on other rows
df = df.sample(frac=1).reset_index(drop=True)
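A quick way to confirm what the parameters of normal mean is to generate a large sample and look at its statistics. The sketch below uses numpy's default_rng generator with an arbitrary seed instead of the global functions used in this article; that swap is only an assumption made here for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)         # seeded generator (seed is arbitrary)
heights = rng.normal(170, 6, 100_000)  # loc=170, scale=6 (standard deviation)

print(round(heights.mean(), 1))  # close to 170
print(round(heights.std(), 1))   # close to 6, so the variance is about 36
```

If a variance is what you have in mind, pass its square root as the second argument.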

Marriage

Fake marital status data is drawn with numpy's np.random.choice, which picks from the prepared options with the given probabilities.

# Marital status
mariage = ['Unmarried', 'married', 'Divorce']  # Choices
Weight = [0.4, 0.3, 0.3]  # Probability of each choice (must sum to 1)
mariage_list = np.random.choice(mariage, 700, p=Weight)  # np.random.choice(choices, size, p=probabilities)
# Add to the data frame; rows without a value become NaN
df['marriage'] = pd.Series(mariage_list)
# Shuffle the rows and reset the index
df = df.sample(frac=1).reset_index(drop=True)
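To see whether the generated categories follow the requested ratio, count them. The sketch below is one way to do that, assuming a seeded default_rng generator (the article itself calls np.random.choice directly) and a larger sample so that the proportions are stable; the names choices, weights, and ratio are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator (seed is arbitrary)
choices = ['Unmarried', 'married', 'Divorce']
weights = [0.4, 0.3, 0.3]       # probabilities, must sum to 1

sample = rng.choice(choices, size=10_000, p=weights)

# Share of each category in the generated sample
ratio = pd.Series(sample).value_counts(normalize=True)
print(ratio.round(2))
```

With only 700 values, as in the article, the observed shares will wander a few percentage points around the requested ones.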


In conclusion

I generated fake data for practicing data analysis. When generating data for analysis practice, however, check the type and parameters of each distribution before running the generation, so that the values come out as intended.

References

https://www.nblog09.com/w/2019/01/24/python-faker/
https://qiita.com/ogamiki/items/4821173ca713a6b77510