When it's time to file a tax return and you recheck the accounting data, you inevitably find some entries posted to the wrong account. So I thought it would be useful to train a machine learning model on the accounting data and have it predict the account.
Export the data from the accounting software in CSV format. This example uses data from the accounting software "JDL IBEX Treasurer".
http://www.jdl.co.jp/co/soft/ibex-ab/
Load the data with the following code.
python
import pandas as pd
filename = "JDL account book-xxxx-xxxx-Journal.csv"
df = pd.read_csv(filename, encoding="Shift-JIS", skiprows=3)
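To confirm the CSV was read correctly (the exact column names depend on your export), it helps to peek at the loaded frame first:
python
# quick sanity check: inspect the first rows and the column names
print(df.head())
print(df.columns.tolist())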
Narrow the loaded data down to the columns we actually use: here, the description plus the debit account's code and official name.
python
columns = ["Description", "Debit subject", "Debit subject正式名称"]
df_counts = df[columns].dropna()
The description text is turned into numeric vectors so that it can serve as input data. First, morphological analysis splits each description into words.
A library called Janome is used for morphological analysis.
http://mocobeta.github.io/janome/
If it is not installed, you need to install it with the following command.
shell
$ pip install janome
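As a quick check that Janome works, the following minimal sketch tokenizes a sample string (the sample text here is only an illustration):
python
from janome.tokenizer import Tokenizer

t = Tokenizer()
# print the surface form of each token
for token in t.tokenize("お土産代 BLUE SKY 羽田"):
    print(token.surface)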
The following code converts the description data into token data.
python
from janome.tokenizer import Tokenizer
t = Tokenizer()
notes = []
for ix in df_counts.index:
    note = df_counts.loc[ix, "Description"]
    # normalize full-width spaces (U+3000) to half-width before tokenizing
    tokens = t.tokenize(note.replace('\u3000', ' '))
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words.strip())
As a result, the following conversion is performed, and each description becomes a string of words separated by half-width spaces.
Original description: "Souvenir fee　BLUE SKY Haneda" (note the full-width space)
Converted description: "Souvenir fee BLUE SKY Haneda" (each word separated by a half-width space)
These strings are vectorized with the following code and used as the input data.
python
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
vect.fit(notes)
X = vect.transform(notes)
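To get a feel for the vectorized result, you can inspect the matrix shape and the extracted vocabulary (a sketch; `get_feature_names_out` assumes scikit-learn 1.0 or later):
python
print(X.shape)  # (number of journal entries, vocabulary size)
print(vect.get_feature_names_out()[:10])  # a few of the extracted terms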
The debit account code is used as the teacher (label) data.
python
y = df_counts["Debit subject"].values.astype("int").flatten()
The vectorized data is split into training data and validation data (a simple holdout split).
python
from sklearn.model_selection import train_test_split
test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
Train on the split data. Here we use LinearSVC as the model.
python
from sklearn.svm import LinearSVC
clf = LinearSVC(C=120.0, random_state=42)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
The score (mean accuracy on the validation data) came out to 0.89932885906040272, i.e. roughly 90%.
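A single accuracy number hides how individual accounts fare; one way to dig deeper (a sketch, not part of the original flow) is scikit-learn's classification_report:
python
from sklearn.metrics import classification_report

# per-account precision and recall on the validation split
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))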
Now feed some test descriptions into the trained model as follows and check which accounts it predicts.
python
tests = [
    "Expressway usage fee",
    "PC parts cost",
    "Stamp fee"
]
notes = []
for note in tests:
    tokens = t.tokenize(note)
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words)
X = vect.transform(notes)
result = clf.predict(X)
# build a mapping from debit account code to its official name
df_rs = df_counts[["Debit subject official name", "Debit subject"]]
df_rs.index = df_counts["Debit subject"].astype("int")
df_rs = df_rs[~df_rs.index.duplicated()]["Debit subject official name"]
for i in range(len(tests)):
    print(tests[i], "\t[", df_rs.loc[result[i]], "]")
The output result is ...
python
Expressway usage fee 	[ Travel expenses transportation ]
PC parts cost 	[ supplies expense ]
Stamp fee 	[ Communication costs ]
It feels pretty good (^-^)
By the way, transfer slips will need a little more ingenuity.
I also thought it would be better to use other information such as the month, the day of the week, and financial-statement figures, but I wasn't sure how to work it into the training data, so I'll look into that later; a rough sketch of one idea follows.
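For example, one conceivable direction (purely a sketch: the `Date` column is hypothetical, not from the actual export) is to append such features to the TF-IDF matrix with scipy.sparse.hstack:
python
import pandas as pd
from scipy.sparse import hstack, csr_matrix

# hypothetical: derive month and day-of-week features from a "Date" column
dates = pd.to_datetime(df_counts["Date"])
extra = pd.get_dummies(dates.dt.month, prefix="m").join(
    pd.get_dummies(dates.dt.dayofweek, prefix="dw"))
# stack the extra columns next to the TF-IDF features
X_all = hstack([X, csr_matrix(extra.values.astype(float))])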
Well, what should I do next?
If you use the above program as-is, it is inefficient: every run reloads the raw data, retrains, and only then predicts. Instead, save the trained model, and have the prediction step load that saved model and predict from it.
You can save the trained model by adding the following code to the end of the above program.
python
import os
import joblib
os.makedirs("data", exist_ok=True)  # make sure the output directory exists
joblib.dump(vect, 'data/vect.pkl')  # fitted vectorizer
joblib.dump(clf, 'data/clf.pkl')    # trained classifier
df_rs.to_csv("data/code.csv", header=False)  # code-to-name mapping
In the new program, load the saved model as follows.
python
import pandas as pd
filename = "data/code.csv"
# the CSV has no header: column 0 is the account code, column 1 the official name
df = pd.read_csv(filename, header=None)
df.index = df.pop(0)
df_rs = df.pop(1)

import joblib
clf = joblib.load('data/clf.pkl')
vect = joblib.load('data/vect.pkl')
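Before predicting, a quick sanity check (again just a sketch) confirms the artifacts loaded as expected:
python
print(type(vect).__name__)  # expect: TfidfVectorizer
print(type(clf).__name__)   # expect: LinearSVC
print(len(df_rs), "account names loaded")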
After loading the trained model, continue on to the prediction.
python
from janome.tokenizer import Tokenizer
t = Tokenizer()
tests = [
    "Expressway usage fee",
    "PC parts cost",
    "Stamp fee",
]
notes = []
for note in tests:
    tokens = t.tokenize(note)
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words)
X = vect.transform(notes)
result = clf.predict(X)
for i in range(len(tests)):
    print(tests[i], "\t[", df_rs.loc[result[i]], "]")
The execution result is ...
python
Expressway usage fee 	[ Travel expenses transportation ]
PC parts cost 	[ supplies expense ]
Stamp fee 	[ Communication costs ]
It worked!