I figured I'd try my hand at patent analysis, so here we go. In patent analysis you often see a plot like the one below (the figure is taken from the J-STORE manual): the vertical axis is the patent applicant, the horizontal axis is the year the patent was published, and the size of each circle represents the number of patents. This is called a bubble chart, and there are services out there that will produce one for you, for a fee.
This time, I'd like to create a plot like that by searching for patent documents with the free patent database J-PlatPat and processing the results in Python.
According to Wikipedia,

> The Japan Platform for Patent Information (J-PlatPat) is a database operated by the National Center for Industrial Property Information and Training (INPIT) that allows anyone to search and view, free of charge, industrial property gazettes relating to patents, utility models, designs, and trademarks.

...apparently. The important point here is that it costs nothing to use, but being free comes with many restrictions, and if you approach it carelessly you may end up fuming.
This article is half about how to work around the platform's quirks, and half about how to process the data you obtain.
I'm interested in batteries, especially so-called rechargeable (secondary) batteries, so that's what I'll search for.
However, a plain search for "battery" (電池) also returns things like solar cells and fuel cells, since in Japanese those are written with the same word. That much would happen with any ordinary search.
While nursing a grudge against whoever decided to render "solar cell" and "fuel cell" into Japanese using the word for battery, enter the following under J-PlatPat's Patent/Utility Model Search => Logical Formula Input.
Here `/TI` denotes the title field, so this searches for patents with "secondary battery" or "storage battery" in the title. Logical formulas let you specify many other things as well, which opens up a lot of possibilities; see J-PlatPat's Help for details.
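The formula itself was originally shown as a screenshot and isn't reproduced here. Based on the description (two title-field terms OR'ed together with `+`), it would look something like the sketch below; the exact tokens are my reconstruction, so check the syntax against J-PlatPat's Help before relying on it:

```
二次電池/TI+蓄電池/TI
```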
This query matches 60,459 patents. However, J-PlatPat displays at most 3,000 search results, so as-is this is a dead end. Instead, narrow the search by specifying the publication date.
The publication date is specified from the search options, not in the logical formula. In this case, patents published during the one-year period from January 1, 2019 to January 1, 2020 will be displayed.
This displays 1,408 patents, as shown below (as of 2020/2/4).
I'd love to scrape the contents of each patent one by one at this point, but that is prohibited (see the notes at the end). Never do it. Instead, let's make do with the "CSV output" button on the results list. Press it.
J-PlatPat replies: "To output CSV, please narrow the results down to 100 or fewer."
And we're dead. That's the end of this article. ...Just kidding. Don't panic; press "Print List" instead.
A window titled "Print list | J-PlatPat [JPP]" then pops up, as shown above. This screen contains all 1,408 patents. Instead of printing, right-click and save the page as an HTML file. That completes the acquisition of the 2019 patent data.
After that, shift the publication year and manually save the HTML of the patent lists for 2018, 2017, and so on. This time I collected 15 years of data, from 2005 to 2019.
It feels like mindless work, but it's actually not much of a hassle; it only takes about 10 minutes.
With Python's BeautifulSoup, converting the contents of an HTML table to CSV is easy.
Something like the snippet below: it reads the HTML files from `./html` and writes CSV files to `./csv`.
```python
from bs4 import BeautifulSoup
import re
import glob
import pandas as pd

path = glob.glob('./html/*.html')
for p in path:
    # Derive the output name from the file name (glob on Windows yields './html\name.html')
    name = p.replace('.html', '').replace('./html\\', '')
    with open(p, 'r', encoding='utf-8_sig') as html:
        soup = BeautifulSoup(html, 'html.parser')
    tr = soup.find_all('tr')
    # Header row: the <th> cells become the CSV columns (drop the leading checkbox column)
    columns = [i.text.replace('\n', '') for i in tr[0].find_all('th')]
    rows = []
    for l in tr[1:]:
        lines = [i.text for i in l.find_all('td')]
        # Column 6 (the FI classifications) keeps its entries comma-separated;
        # in every other column just strip the newlines
        lines = [i.replace('\n', '') if n != 6 else re.sub(r'[\n]+', ",", i)
                 for n, i in enumerate(lines)]
        rows.append(lines)
    # Build the frame in one go (df.append was removed in pandas 2.0)
    df = pd.DataFrame(rows, columns=columns[1:])
    df.to_csv('./csv/' + name + '.csv', encoding='utf_8_sig', index=False)
```
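The th/td extraction at the heart of the script can be seen in isolation with a tiny, made-up table standing in for the saved J-PlatPat list (the HTML string below is purely illustrative):

```python
from bs4 import BeautifulSoup

# A minimal two-row table: one header row, one data row
html = ("<table><tr><th>No.</th><th>Title</th></tr>"
        "<tr><td>1</td><td>Secondary battery</td></tr></table>")
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")
headers = [th.text for th in rows[0].find_all("th")]  # ['No.', 'Title']
cells = [td.text for td in rows[1].find_all("td")]    # ['1', 'Secondary battery']
```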
Since this only processes HTML files that were saved locally by hand, it doesn't access J-PlatPat at all. Safe.
Part of the resulting data frame is shown below (patent2005.csv). Fifteen CSVs like this are saved, covering 2005 to 2019. ~~None of this would be necessary if J-PlatPat supported CSV output for the full result set~~
| Document number | Application number | Filing date | Publication date | Title of invention | Applicant/Right holder | FI |
|---|---|---|---|---|---|---|
| Re-publication 2005/124920 | Japanese Patent Application No. 2006-514740 | 2005/06/14 | 2005/12/29 | Lead-acid battery | Panasonic Corporation | ,C22C11/00,C22C11/02,C22C11/06,other, |
| Re-publication 2005/124899 | Japanese Patent Application No. 2006-514659 | 2005/03/09 | 2005/12/29 | Secondary battery and its manufacturing method | Panasonic Corporation | ,H01M2/16@M,H01M4/02@B,H01M4/02@Z,other, |
| Re-publication 2005/124898 | Japanese Patent Application No. 2006-514732 | 2005/06/13 | 2005/12/29 | Positive electrode active material powder for lithium secondary batteries | AGC Seimi Chemical Co., Ltd. | ,H01M4/02@C,H01M4/36@E,H01M4/36,other, |
| Re-publication 2005/122318 | Japanese Patent Application No. 2006-514434 | 2005/05/18 | 2005/12/22 | Non-aqueous electrolyte and lithium secondary battery using it | Ube Industries, Ltd. | ,H01M4/02@C,H01M4/02@D,H01M4/36,other, |
| JP 2005-353584 | Japanese Patent Application No. 2005-140521 | 2005/05/13 | 2005/12/22 | Lithium-ion secondary battery and its manufacturing method | Panasonic Corporation, etc. | ,H01M2/16@P,H01M4/02,101,H01M4/02,108,other, |
When creating the bubble chart, there is the question of which applicants (companies) to include. Taking the union of the top 10 companies by application count in each year happens to give exactly 30 companies, so that's what I'll do.
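This "top N per year" selection leans on `collections.Counter`; a toy example with made-up company names shows how `most_common` behaves:

```python
from collections import Counter

# Made-up applicant list; Counter tallies occurrences
applicants = ["A Corp", "B Corp", "A Corp", "C Corp", "A Corp", "B Corp"]
top2 = Counter(applicants).most_common(2)
print(top2)  # [('A Corp', 3), ('B Corp', 2)]
```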
Organize your data as follows:
```python
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import glob
import pandas as pd
import collections
import numpy as np
import seaborn as sns

path = glob.glob('./csv/*.csv')
app_top10_dic = {}  # top-10 companies and application counts for each year
app_total_dic = {}  # all companies and application counts for each year
app_all = []        # companies to include in the bubble chart
for p in path:
    # Extract the year from a file name like './csv\patent2005.csv'
    name = p.replace('.csv', '').replace('./csv\\patent', '')
    df = pd.read_csv(p)
    app_list = df['applicant/Right holder']
    # Strip the trailing 'other' marker and normalize whitespace
    app_list = [i.rstrip('other').replace(' ', '').replace('\u3000', ' ') for i in app_list]
    app_set = collections.Counter(app_list)
    app_top10 = app_set.most_common()[:10]
    app_all.extend([i[0] for i in app_top10])
    app_top10_dic[name] = app_top10
    app_total = app_set.most_common()
    app_total_dic[name] = app_total
app_all = list(set(app_all))
```
Create a matrix with the applicants (companies) as rows and the application years as columns. First, create an empty matrix (df) as shown below, and then
```python
years = list(app_top10_dic.keys())
df = pd.DataFrame(index=app_all, columns=years).fillna(0)
```
fill in the application counts as follows.
```python
for i in app_total_dic:          # i is a year
    dic = app_total_dic[i]
    for d in dic:                # d is a (company, count) tuple
        if d[0] in app_all:
            # .loc avoids chained indexing; rows are companies, columns are years
            df.loc[d[0], i] += d[1]
```
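As an aside, the same year-by-company matrix can be built in one step by handing pandas a dict of `{year: {company: count}}` dicts; missing combinations come out as NaN and just need filling. This is a sketch with made-up numbers, not the article's actual data:

```python
import pandas as pd

# {year: {company: count}}, analogous to app_total_dic restricted to app_all
counts = {"2005": {"A Corp": 6, "B Corp": 46},
          "2006": {"A Corp": 14}}
mat = pd.DataFrame(counts).fillna(0).astype(int)
# mat.loc["A Corp", "2006"] == 14; "B Corp" filed nothing in 2006, so that cell is 0
```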
Finally, you get a matrix like this:
| | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sumitomo Metal Mining Co., Ltd. | 6 | 14 | 7 | 13 | 2 | 7 | 15 | 18 | 16 | 25 | 35 | 54 | 56 | 60 | 61 |
Furukawa Battery Co., Ltd. | 46 | 40 | 34 | 41 | 33 | 31 | 23 | 9 | 12 | 13 | 11 | 7 | 11 | 11 | 13 |
Dai Nippon Printing Co., Ltd. | 4 | 14 | 12 | 0 | 5 | 9 | 29 | 59 | 5 | 9 | 2 | 1 | 5 | 0 | 0 |
Sanyo Electric Co., Ltd. | 181 | 137 | 155 | 120 | 96 | 102 | 120 | 110 | 104 | 180 | 73 | 65 | 26 | 59 | 45 |
Samsung SDI Co., Ltd. | 83 | 138 | 24 | 32 | 38 | 52 | 108 | 57 | 41 | 49 | 69 | 40 | 16 | 14 | 22 |
: | : | : | : | : | : | : | : | : | : | : | : | : | : | : | : |
Semiconductor Energy Laboratory Co., Ltd. | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 7 | 15 | 16 | 22 | 54 | 24 | 12 | 12 |
GS Yuasa Corporation | 57 | 20 | 11 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
GS Yuasa Co., Ltd. | 43 | 16 | 38 | 40 | 39 | 44 | 56 | 76 | 79 | 75 | 99 | 72 | 83 | 41 | 31 |
Hitachi Maxell Co., Ltd. | 29 | 20 | 8 | 28 | 26 | 17 | 39 | 49 | 48 | 46 | 63 | 35 | 39 | 14 | 0 |
Zeon Corporation | 2 | 5 | 5 | 5 | 6 | 16 | 28 | 29 | 43 | 67 | 62 | 45 | 55 | 22 | 2 |
This is obtained as a pandas dataframe, so you can easily create a heatmap with seaborn.
```python
fig = plt.figure(figsize=(6, 10), dpi=150)
sns.heatmap(df, square=True, cmap='Reds')
```
This alone is probably enough, but let's make the bubble chart too. As far as I know there's no way to produce it in one shot, so brute force it is. I paid attention to the colors and added a size legend.
```python
fig = plt.figure(figsize=(6, 10), dpi=150)
# One scatter point per (year, company) cell; marker area = patent count
for y in years:
    for x in app_all[::-1]:
        plt.scatter(y, x, s=int(df[y][x]))
# Dummy points far outside the visible axes, used only to build the size legend
size = [100, 200, 300]
for s in size:
    plt.scatter(-100, -100, s=s, facecolor='white', edgecolor='black', label=str(s))
plt.legend(title='Patents', bbox_to_anchor=(1.03, 1),
           loc='upper left', labelspacing=1, borderpad=1)
plt.xlim(-1, 15)
plt.ylim(-1, 30)
plt.xticks(rotation=45)
plt.show()
```
Done. This is certainly easier to read than the heatmap.
A quick look at the whole chart shows that Toyota has been strong since 2013. It's impressive that Toshiba, which went through all sorts of trouble around 2017, has kept filing patents at a steady pace. Panasonic's patents appear to be decreasing, but that's probably not really the case once you account for filings by Panasonic IP Management and its subsidiary Sanyo Electric. There are also many Hitachi affiliates, and it's unclear what distinguishes "GS Yuasa Corporation" from "GS Yuasa Co., Ltd.", so group companies probably need to be consolidated.
So that was a quick foray into patent analysis with Python. I was tossed around by J-PlatPat's quirks and had forgotten how to use pandas, but I managed to make a bubble chart, so mission accomplished for now.
However, some patents may have been missed because the search criteria weren't specific enough. That part seems to require a fair amount of trial and error.
If you can get the full text of patents, all sorts of interesting things become possible with natural language processing. If full text is the goal, it's probably better to use [Google Patent Public Datasets](https://cloud.google.com/blog/products/gcp/google-patents-public-datasets-connecting-public-paid-and-private-patent-data).
I'm satisfied for now, so that's it!
"Restrictions on mass access, robot access, etc." in the J-PlatPat usage guide states as follows.
> J-PlatPat is provided for public use of industrial property rights information. We therefore prohibit actions that may interfere with general use, such as downloading large amounts of data purely for the purpose of data collection, and robot access (regular automatic data collection by a program).
So programmatic automatic collection is off-limits. Don't do it. Let's be careful.
Whether this article's data collection counts as "downloading a large amount of data for the purpose of simple data collection" is debatable, but since I only downloaded 15 files by hand, I believe it's fine.