When it's time to file a tax return and you recheck the accounting data, you inevitably find some entries posted to the wrong account. So I thought it would be useful to train a machine learning model on the accounting data and have it predict the account.
Export the data from the accounting software in CSV format. This example uses data from the accounting software "JDL IBEX Treasurer".
http://www.jdl.co.jp/co/soft/ibex-ab/
Load the data with the following code.
python
import pandas as pd
filename = "JDL account book-xxxx-xxxx-Journal.csv"
df = pd.read_csv(filename, encoding="Shift-JIS", skiprows=3)
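To confirm the CSV was read correctly (the exact column names depend on your export), it helps to peek at the loaded frame first:
python
# quick sanity check: inspect the first rows and the column names
print(df.head())
print(df.columns.tolist())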
Narrow the loaded data down to the columns we actually use: here, the description plus the debit account's code and official name.
python
columns = ["Description", "Debit subject", "Debit subject正式名称"]
df_counts = df[columns].dropna()
The description text is turned into numeric vectors so that it can serve as input data. First, morphological analysis splits each description into words.
A library called Janome is used for morphological analysis.
http://mocobeta.github.io/janome/
If it is not installed, you need to install it with the following command.
shell
$ pip install janome
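As a quick check that Janome works, the following minimal sketch tokenizes a sample string (the sample text here is only an illustration):
python
from janome.tokenizer import Tokenizer

t = Tokenizer()
# print the surface form of each token
for token in t.tokenize("お土産代 BLUE SKY 羽田"):
    print(token.surface)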
The following code converts the description data into token data.
python
from janome.tokenizer import Tokenizer
t = Tokenizer()
notes = []
for ix in df_counts.index:
    note = df_counts.loc[ix, "Description"]
    # normalize full-width spaces (U+3000) to half-width before tokenizing
    tokens = t.tokenize(note.replace('\u3000', ' '))
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words.strip())
As a result, the following conversion is performed, and each description becomes a string of words separated by half-width spaces.
Original description: "Souvenir fee　BLUE SKY Haneda" (note the full-width space)
Converted description: "Souvenir fee BLUE SKY Haneda" (each word separated by a half-width space)
These strings are vectorized with the following code and used as the input data.
python
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
vect.fit(notes)
X = vect.transform(notes)
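To get a feel for the vectorized result, you can inspect the matrix shape and the extracted vocabulary (a sketch; `get_feature_names_out` assumes scikit-learn 1.0 or later):
python
print(X.shape)  # (number of journal entries, vocabulary size)
print(vect.get_feature_names_out()[:10])  # a few of the extracted terms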
The debit account code is used as the teacher (label) data.
python
y = df_counts["Debit subject"].values.astype("int").flatten()
The vectorized data is split into training data and validation data (a simple holdout split).
python
from sklearn.model_selection import train_test_split
test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
Train on the split data. Here we use LinearSVC as the model.
python
from sklearn.svm import LinearSVC
clf = LinearSVC(C=120.0, random_state=42)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
The score (mean accuracy on the validation data) came out to 0.89932885906040272, i.e. roughly 90%.
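A single accuracy number hides how individual accounts fare; one way to dig deeper (a sketch, not part of the original flow) is scikit-learn's classification_report:
python
from sklearn.metrics import classification_report

# per-account precision and recall on the validation split
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))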
Now feed some test descriptions into the trained model as follows and check which accounts it predicts.
python
tests = [
    "Expressway usage fee",
    "PC parts cost",
    "Stamp fee"
]
notes = []
for note in tests:
    tokens = t.tokenize(note)
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words)
X = vect.transform(notes)
result = clf.predict(X)
# build a mapping from debit account code to its official name
df_rs = df_counts[["Debit subject official name", "Debit subject"]]
df_rs.index = df_counts["Debit subject"].astype("int")
df_rs = df_rs[~df_rs.index.duplicated()]["Debit subject official name"]
for i in range(len(tests)):
    print(tests[i], "\t[", df_rs.loc[result[i]], "]")
The output result is ...
python
Expressway usage fee 	[ Travel expenses transportation ]
PC parts cost 	[ supplies expense ]
Stamp fee 	[ Communication costs ]
It feels pretty good (^-^)
By the way, transfer slips will need a little more ingenuity.
I also thought it would be better to use other information such as the month, the day of the week, and financial-statement figures, but I wasn't sure how to work it into the training data, so I'll look into that later; a rough sketch of one idea follows.
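For example, one conceivable direction (purely a sketch: the `Date` column is hypothetical, not from the actual export) is to append such features to the TF-IDF matrix with scipy.sparse.hstack:
python
import pandas as pd
from scipy.sparse import hstack, csr_matrix

# hypothetical: derive month and day-of-week features from a "Date" column
dates = pd.to_datetime(df_counts["Date"])
extra = pd.get_dummies(dates.dt.month, prefix="m").join(
    pd.get_dummies(dates.dt.dayofweek, prefix="dw"))
# stack the extra columns next to the TF-IDF features
X_all = hstack([X, csr_matrix(extra.values.astype(float))])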
Well, what should I do next?
If you use the above program as-is, it is inefficient: every run reloads the raw data, retrains, and only then predicts. Instead, save the trained model, and have the prediction step load that saved model and predict from it.
You can save the trained model by adding the following code to the end of the above program.
python
import os
import joblib
os.makedirs("data", exist_ok=True)  # make sure the output directory exists
joblib.dump(vect, 'data/vect.pkl')  # fitted vectorizer
joblib.dump(clf, 'data/clf.pkl')    # trained classifier
df_rs.to_csv("data/code.csv", header=False)  # code-to-name mapping
In the new program, load the saved model as follows.
python
import pandas as pd
filename = "data/code.csv"
# the CSV has no header: column 0 is the account code, column 1 the official name
df = pd.read_csv(filename, header=None)
df.index = df.pop(0)
df_rs = df.pop(1)

import joblib
clf = joblib.load('data/clf.pkl')
vect = joblib.load('data/vect.pkl')
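Before predicting, a quick sanity check (again just a sketch) confirms the artifacts loaded as expected:
python
print(type(vect).__name__)  # expect: TfidfVectorizer
print(type(clf).__name__)   # expect: LinearSVC
print(len(df_rs), "account names loaded")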
After loading the trained model, continue on to the prediction.
python
from janome.tokenizer import Tokenizer
t = Tokenizer()
tests = [
    "Expressway usage fee",
    "PC parts cost",
    "Stamp fee",
]
notes = []
for note in tests:
    tokens = t.tokenize(note)
    words = ""
    for token in tokens:
        words += " " + token.surface
    notes.append(words)
X = vect.transform(notes)
result = clf.predict(X)
for i in range(len(tests)):
    print(tests[i], "\t[", df_rs.loc[result[i]], "]")
The execution result is ...
python
Expressway usage fee 	[ Travel expenses transportation ]
PC parts cost 	[ supplies expense ]
Stamp fee 	[ Communication costs ]
It worked!