Motivation

I heard from a university administrator that checking Kakenhi (Grant-in-Aid for Scientific Research) application forms is hard work. Apparently he currently marks corrections in red by hand, one by one.

- Is the form filled out according to the application guidelines?
- Are the figures and references properly included?
- Is the list of achievements in the proper format?

An automatic checking tool would be nice, and I would like to build one in Python. Building it right away is a heavy lift, though, so I first tried reading and writing Word files from Python. As a sample that resembles a Kakenhi application, I read the Japan Society for the Promotion of Science (JSPS) Research Fellowship ("Gakushin") application form that I submitted in June 2011. If you are not familiar with Gakushin, try asking a doctoral student you know; you may get an emotional response.

How to read a Word file from Python?

Read from a word file in python

The main options are python-docx and docx2txt, and both support only .docx files. As we will see later, reading a .doc file needs an extra conversion step, for which I used antiword (strictly speaking, antiword outputs plain text rather than a real .docx). Since docx2txt can also read text from headers, footers, and hyperlinks, I mainly used docx2txt this time.

Environment

macOS Mojave 10.14.5
Anaconda 2020.02
Python 3.7.6
Jupyter Notebook 6.0.3

Install python-docx

`bash`


pip install python-docx

python-docx appeared to list support only up to Python 3.4, but it works with Python 3.7. I could not set up Python 3.4 in Anaconda, so I stayed on 3.7.

Install docx2txt

`bash`


pip install docx2txt

antiword installation

As I will explain later, the Word file I wanted to read as a sample was in .doc format, not .docx, and python-docx cannot open .doc files. Opening it in Word and saving it as .docx felt like admitting defeat, so I tried converting it with antiword.

Install with apt-get: failed

In short, I could not install antiword with apt-get on the Mac. I assumed antiword had to be installed with apt-get, so I installed fink (yu-sa / items / 351969b281f3aea5e03d) to get it. During the fink installation I was told that there was no JDK and was redirected to a download page (installing Flash Player did not help, of course).

`bash`


sudo apt-get antiword

`Output result`


E: Invalid operation antiword

In hindsight, this message means that apt-get was given `antiword` as the operation name; the `install` subcommand is missing from the command above.

Install with brew: Success

`bash`


brew install antiword

I found the brew command in a reference article, ran it, and antiword installed successfully.

`bash`


(base) akpro:~ kageazusa$ antiword
	Name: antiword
	Purpose: Display MS-Word files
	Author: (C) 1998-2005 Adri van Os
	Version: 0.37  (21 Oct 2005)
	Status: GNU General Public License
	Usage: antiword [switches] wordfile1 [wordfile2 ...]
	Switches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]
		-f formatted text output
		-t text output (default)
		-a <paper size name> Adobe PDF output
		-p <paper size name> PostScript output
		   paper size like: a4, letter or legal
		-x <dtd> XML output
		   like: db (DocBook)
		-m <mapping> character mapping file
		-w <width> in characters of text output
		-i <level> image level (PostScript only)
		-L use landscape mode (PostScript only)
		-r Show removed text
		-s Show hidden (by Word) text

It's a 2005 tool!

Read / write test with python-docx

Copying and pasting the code from the latter half of a reference article worked, and I was able to create and read a Word file. Python 3.7 does not seem to cause any problems. On the other hand, when I ran the code from the comment section of a reference article, docx_simple_service could not be imported. I guessed it was due to the Python version, but the error only says that the module cannot be found, so it may simply be a module that is not installed.

`error`


ModuleNotFoundError: No module named 'docx_simple_service'

Convert .doc file to .docx file with antiword and read with docx2txt

Finally, let's read the sample.

Sample

The sample is an application form of mine. It dates from the era when ruled borders, which have recently disappeared from the form, were used heavily. Since it is a .doc file, Python cannot read it as it is. I could not find the final version of the Word file, so I use a slightly earlier draft that I had emailed to the office for checking.

Read .doc file

I could read it right away with the function from an answer I found online; I only changed how the path is specified. It converts the .doc file with antiword, saves the output under a .docx name, reads it, and then deletes that file immediately.

`python`


import os
import shlex

import docx2txt


def get_doc_text(filepath, file):
    """Return the text of a .docx or .doc file stored in directory filepath."""
    if file.endswith('.docx'):
        # docx2txt reads .docx directly; build the full path first
        return docx2txt.process(os.path.join(filepath, file))

    elif file.endswith('.doc'):
        # convert .doc with antiword; its text output is saved under a .docx name
        doc_file = os.path.join(filepath, file)
        docx_name = file + 'x'
        docx_file = os.path.join(filepath, docx_name)

        if not os.path.exists(docx_file):
            # quote both paths so spaces in file names do not break the command
            os.system('antiword ' + shlex.quote(doc_file) + ' > ' + shlex.quote(docx_file))

            with open(docx_file) as f:
                text = f.read()
            os.remove(docx_file)  # the converted file was only needed for reading

        else:
            # a .docx with the same name already exists; it is a different
            # file, so do not overwrite it or read it
            print('Info : a .docx file with the same name as the .doc already exists, so it cannot be read')
            text = ''

        return text
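
As an alternative sketch (not the function from the answer I borrowed), antiword's text output can be captured directly with `subprocess.run`, which avoids both the shell and the temporary file. The function name `doc_to_text` and its `command` parameter are my own assumptions; the only antiword behavior relied on is the `antiword wordfile` usage shown in its help text.

```python
import subprocess


def doc_to_text(doc_path, command=('antiword',)):
    """Run a converter on doc_path and return what it prints to stdout.

    `command` is a hypothetical parameter: the program (plus any switches)
    to run, with the document path appended as the last argument.
    """
    result = subprocess.run(
        [*command, doc_path],  # argument list, so no shell quoting is needed
        capture_output=True,   # collect stdout and stderr (Python 3.7+)
        text=True,             # decode the output to str
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

With antiword on the PATH, `doc_to_text('./sample/110624GakushinDraftKAGE2.1.doc')` should give the same kind of text as the function above. Depending on the locale, the character mapping (the `-m` switch in the help text) may need adjusting for Japanese text.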

I was able to read it!

Try to extract the contents of each subheading

I would like to use the text I read to extract the content under each subheading. In this sample, the subheadings are enclosed in 【】, for example 【Problem】.

Text formatting

Delete line breaks, etc.

`python`


gakushin = get_doc_text('./sample', '110624GakushinDraftKAGE2.1.doc')
gakushin = gakushin.replace('\n', '').replace('|', '').replace('\u3000', '')

Runs of consecutive spaces still remain, so I removed them as well, following a reference I found.

`python`


import re

# Collapse runs of half-width and full-width spaces into one half-width space
gakushin = re.sub('[ 　]+', ' ', gakushin)

There are places where I want a space to remain, such as between "et al." and the year, so for now I left one half-width space. Strictly speaking, it would be best to remove all spaces and then restore them only around "et al." or "&".
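
The stricter clean-up could be sketched as follows: drop every space that touches a non-ASCII character (Japanese text does not need them) and keep spaces that sit between two ASCII characters, as in "et al. 2010". The helper name `tidy_spaces` and the sample sentence are made up for illustration, and the rule is only an approximation that has not been checked against the real application form.

```python
import re

# printable ASCII runs from the half-width space (0x20) to the tilde (0x7E)
ASCII = r'[ -~]'


def tidy_spaces(text):
    """Collapse space runs, then drop spaces that touch non-ASCII text."""
    # runs of full-width (U+3000) and half-width spaces become one half-width space
    text = re.sub(r'[ \u3000]+', ' ', text)
    # remove a space unless it has an ASCII character on both sides
    return re.sub(rf'(?<!{ASCII}) | (?!{ASCII})', '', text)


sample = '先行研究\u3000(Smith et al. 2010)  では A & B を 比較した。'
print(tidy_spaces(sample))
```

Spaces inside "Smith et al. 2010" and "A & B" survive because both neighbors are ASCII, while the spaces next to Japanese characters are removed.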

Extraction of subheadings

Next, search for the parts enclosed in 【】. I am not good with regular expressions, so I looked it up, followed a reference, and it worked.

`python`


re.findall(r'【.+?】', gakushin)

`Output result`


['【Background】',
 '【Problem】',
 '【Solutions, research objectives, research methods, features and original points】',
 '【Research progress 1】',
 '【Research progress 2】',
 '【Background of future research plans】',
 '【Problems / Points to be solved】',
 '【How did you come up with the idea】',
 '【2-1】',
 '【2-2】',
 '【Refereed】',
 '【No oral presentation / peer review】',
 '【Poster presentation / peer review】',
 '【Motivation for aspiring to a research position】',
 '【Aiming researcher image】',
 '【Self-advantages, etc.】',
 '【Especially excellent academic performance and awards】',
 '【Characteristic extracurricular activities】']
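
The `?` after `.+` matters here. A small made-up string shows the difference: the lazy pattern stops at the first closing bracket, while the greedy one runs on to the last. The sample text is invented for illustration.

```python
import re

sample = '【Background】aaa【Problem】bbb'

lazy = re.findall(r'【.+?】', sample)   # shortest match for each heading
greedy = re.findall(r'【.+】', sample)  # longest match, swallows the body text

print(lazy)
print(greedy)
```

The lazy version returns the two headings separately; the greedy version returns a single string that runs from the first 【 to the last 】.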

The subheadings have been extracted!

Extraction of sentences under subheadings

Let's store the subheadings in a variable and use the subheadings themselves to split the text gakushin.

`python`


subhead = re.findall(r'【.+?】', gakushin)
text = gakushin
split_result = []

for head in subhead:
    # split only at the first occurrence so a repeated heading cannot drop text
    before, text = text.split(head, 1)
    split_result.append(before)

# whatever remains after the last subheading is the final element
split_result.append(text)
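
The same split can also be sketched with `re.split`: when the pattern contains a capture group, the matched subheadings are kept in the result, so headings and bodies alternate in one list. The function name `split_by_subheads` and the sample string are assumptions for illustration.

```python
import re


def split_by_subheads(text):
    """Return (preamble, [(subheading, body), ...]) for 【...】 headings."""
    # the capture group makes re.split keep each heading in the output:
    # [preamble, head1, body1, head2, body2, ...]
    parts = re.split(r'(【.+?】)', text)
    preamble = parts[0]
    pairs = list(zip(parts[1::2], parts[2::2]))
    return preamble, pairs


preamble, pairs = split_by_subheads('intro【Background】aaa【Problem】bbbb')
print(preamble)
print(pairs)
```

This keeps each subheading paired with its own text from the start, so no separate bookkeeping of list lengths is needed.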

I was able to split the text at the subheadings and store the pieces in a list. Let's check the number of elements.

`python`


print('Number of subheading elements', len(subhead))
print('Number of elements in the divided sentence', len(split_result))

`Output result`


Number of subheading elements 18
Number of elements in the divided sentence 19

The number of subheadings + 1 equals the number of text pieces, so the split looks right. Next, store everything in a pandas DataFrame so that each subheading lines up with the text below it. The first element of split_result (the text before the first subheading) is discarded.

`python`


import pandas as pd

# split_result[0] is the text before the first subheading, so skip it
df = pd.DataFrame({'subhead': subhead, 'text': split_result[1:]})

Subheadings and the text below them have been associated. Let's count the number of characters and put it in the data frame.

`python`


df['length'] = df.text.apply(len)
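
As a small step toward the checking tool, the length column could be used to flag sections that run long. The sketch below uses made-up rows and a made-up limit of 10 characters; real limits would have to come from the application guidelines.

```python
import pandas as pd

# made-up rows standing in for the real subheadings and text
df_demo = pd.DataFrame({
    'subhead': ['【2-1】', '【2-2】', '【Refereed】'],
    'text': ['short text', 'a much longer block of text', 'none'],
})
df_demo['length'] = df_demo['text'].str.len()

LIMIT = 10  # hypothetical character limit, not taken from the guidelines
too_long = df_demo[df_demo['length'] > LIMIT]
print(too_long[['subhead', 'length']])
```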

The item 【2-2】 seems to be particularly long. From this alone, it is not clear what 【2-2】 stands for. It looks like a research plan, but I do not know why there is no 【1】.

Summary

I was able to read the .doc file from Python and manipulate the text. I would like to try various things in the future.

Reference

- Research Fellow | Japan Society for the Promotion of Science

Read the old Gakushin DC application Word file (.doc) from Python and try to operate it.