I took the immigration statistics of the Ministry of Justice and created Simple version of notebok to process.
I have a lot of comments, but I think it is convenient to modify because it is not classified.
Paste the jupyter code below.
#Make the minimum library settings.
import pandas as pd #Loading a data processing library called pandas
import re #For regular expressions
from IPython.display import display, HTML #Load library for display
#Get the excel data and put it in pandas. This alone is okay
#The original Excel is the premise of one sheet.
moto= pd.read_excel('http://www.e-stat.go.jp/SG1/estat/GL08020103.do?_xlsDownload_&fileId=000007747517&releaseCount=1')
#The date is 2016 8, so as additional information, year,Set the month. (It's good to scroll and get it automatically.)
year = 2016
month = 8
#First, check the total number
count_all = int(moto.iloc[2,3])
#Check the total number( count_Since all is an integer, display it as a string and display it as HTML. )
display(HTML('<h1>Total number= ' + str(count_all) + '<h1>'))
#moto.head()Then you can quickly see the first 5 lines of the data.
#The result of the last run row in the cell is displayed. print(moto.head())You don't have to do it.
#You can see that there are extra rows and columns when summarizing.
moto.head()
#columns Information The port name is taken from the 5th (Kushiro) in the first line.
columns = moto.iloc[1,4:]
#There are various difficulties such as line breaks being inserted just by fetching.
display(columns[12:18])
#index is Unnamed:Since it is 16, it is difficult to understand, so change it to a serial number. It's just a matter of appearance, so you don't have to do it.
columns.index = range(0,len(columns))
display(columns[12:18])
#Line breaks, blanks,Etc. are removed.
columns = columns.apply(lambda x: x.replace("\n",'').replace(' ',''))
display(columns[12:18])
# index (Country)Create a. (You don't have to create it, but you can add it because it is convenient for deleting line data.)
idx = moto.iloc[3:,1]
#Line breaks, blanks,Etc. are removed.
idx = idx.apply(lambda x: x.replace("\n",'').replace(' ',''))
display(idx[12:18])
#Country(country), Port),Number of immigrants(num)Creates the formatted data of.
#First, create empty data.
xdf = pd.DataFrame(columns=['country','port','num'])
#Vertically arranged country,port,num
for i in range(0,len(columns)):
x = pd.DataFrame(idx)
x.columns = ['country']
x['port'] = columns[i]
x['num'] = moto.iloc[4:,4 + i]
x.index = idx
x.index.name = 'country'
xdf = xdf.append(x)
display(xdf.head(210).tail(10))
#Remove regional subtotals The subtotals are NaN (no data).
area = ['Asia', 'Europe','Africa', 'North America', 'South America', 'Oceania']
#I am using index to delete it. It is easier than specifying in the search. inplace=I am rewriting xdf directly by specifying True.
xdf.drop(area,inplace=True)
display(xdf.head(210).tail(10))
#Add the year and month items.
xdf['year'] = year
xdf['month'] = month
display(xdf.head(210).tail(10))
#Calculate total number for confirmation
display(HTML('<h1>' + str(xdf['num'].sum()) + '</h1>'))
#Save the formatted file in an excel file.
xdf.to_excel('immigration.xlsx')
Recommended Posts