I heard from a university administrator that checking KAKENHI (Grants-in-Aid for Scientific Research) application forms is hard work. Apparently they currently go through each form by hand, marking corrections in red one by one.
- Is the form filled out according to the application guidelines?
- Are the figures and references included properly?
- Is the list of achievements in the proper format?
An automatic checking tool would be nice, and I would like to build one in Python! That said, building it right away would be a big job, so I started by reading and writing Word files from Python. As a stand-in for a KAKENHI application, I read the Japan Society for the Promotion of Science (JSPS) Research Fellowship application, known as "Gakushin", that I submitted in June 2011. If you are not familiar with Gakushin, ask a doctoral student you know and you may get an emotional response.
Reading a Word file from Python
The main options are python-docx and docx2txt, and both support only .docx files. As described later, to read a .doc file you first have to convert it with a tool such as antiword, which extracts its contents as plain text. docx2txt can also read text from headers, footers, and hyperlinks, so I mainly used docx2txt this time.
bash
pip install python-docx
python-docx appears to list support only up to Python 3.4, but it worked for me on Python 3.7. I couldn't get Python 3.4 in Anaconda, so I stayed on 3.7.
bash
pip install docx2txt
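As background on why both libraries are limited to .docx: a .docx file is a ZIP archive whose main text lives in the XML part word/document.xml, whereas the older .doc is a binary format. The sketch below uses only the standard library to check whether a path really is a .docx and to pull out paragraph text. The names `is_docx` and `docx_paragraphs` are made up for illustration, and this is not a claim about how python-docx or docx2txt are implemented.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def is_docx(path):
    """Return True if path is a ZIP archive containing word/document.xml."""
    if not zipfile.is_zipfile(path):
        return False  # e.g. a binary .doc file or a plain text file
    with zipfile.ZipFile(path) as archive:
        return 'word/document.xml' in archive.namelist()


def docx_paragraphs(path):
    """Return the text of each paragraph in the main document part.

    Headers and footers live in separate parts and are not read here.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    paragraphs = []
    for p in root.iter(W_NS + 'p'):
        # a paragraph's text is spread over one or more <w:t> elements
        paragraphs.append(''.join(t.text or '' for t in p.iter(W_NS + 't')))
    return paragraphs
```

A check like `is_docx` is handy before handing a file to either library, since a file can carry a .docx name without being in that format.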
As explained later, the Word file I wanted to read as a sample was in .doc format rather than .docx, and python-docx cannot open .doc files. Opening it in Word and saving it as .docx felt like admitting defeat, so I tried converting it with antiword.
To state the conclusion first: I could not install antiword with apt-get on a Mac. Assuming antiword had to come from apt-get, I tried to install fink to get it, but the fink installer complained that there was no JDK and sent me off to a download page (installing Flash Player did not help, of course).
bash
sudo apt-get antiword
Output result
E: Invalid operation antiword
bash
brew install antiword
Following a guide I found, I ran the brew command above, and antiword installed successfully.
bash
(base) akpro:~ kageazusa$ antiword
Name: antiword
Purpose: Display MS-Word files
Author: (C) 1998-2005 Adri van Os
Version: 0.37 (21 Oct 2005)
Status: GNU General Public License
Usage: antiword [switches] wordfile1 [wordfile2 ...]
Switches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]
-f formatted text output
-t text output (default)
-a <paper size name> Adobe PDF output
-p <paper size name> PostScript output
paper size like: a4, letter or legal
-x <dtd> XML output
like: db (DocBook)
-m <mapping> character mapping file
-w <width> in characters of text output
-i <level> image level (PostScript only)
-L use landscape mode (PostScript only)
-r Show removed text
-s Show hidden (by Word) text
It's a 2005 tool!
I copied and pasted the code from the latter half of an article I referred to, and it worked: I was able to create and read a Word file.
There seems to be no problem with Python 3.7.
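In addition, when I copied and ran code posted in a comments section, the docx_simple_service module could not be imported. I guessed the Python version was to blame, though a ModuleNotFoundError only says that no module with that name was found on the import path, so a helper file that was never installed is at least as likely.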
error
ModuleNotFoundError: No module named 'docx_simple_service'
I will finally read the sample.
The sample is an application form like this one. It dates from the days when the form still had the ruled borders that have recently been dropped. Because it is a .doc file, the libraries above cannot read it as is. I couldn't find the final version of the Word file, so I used a slightly earlier draft that I had emailed to the office for checking.
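I was able to read it right away with the function from an answer I found, with minor changes, mainly to how the path is specified. The function runs antiword on the .doc file, writes the output to a temporary file with a .docx name, reads that file back, and deletes it immediately. Note that antiword emits plain text, so the temporary file is not a real .docx.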
python
import os
import subprocess

import docx2txt


def get_doc_text(filepath, file):
    """Return the text of a .docx or .doc file located in the directory filepath."""
    text = ''
    if file.endswith('.docx'):
        text = docx2txt.process(os.path.join(filepath, file))
    elif file.endswith('.doc'):
        # antiword writes plain text; store it in a temporary file named *.docx
        doc_file = os.path.join(filepath, file)
        docx_file = os.path.join(filepath, file + 'x')
        if not os.path.exists(docx_file):
            with open(docx_file, 'w') as f:
                # an argument list avoids shell quoting problems with spaces in paths
                subprocess.run(['antiword', doc_file], stdout=f, check=True)
            with open(docx_file) as f:
                text = f.read()
            os.remove(docx_file)  # the temporary file was only needed for reading
        else:
            # a file with the same name and a .docx extension already exists;
            # it is a different file, so do not overwrite or delete it
            print('Info: a .docx file with the same name already exists, so the .doc file was not read')
    return text
I was able to read it!
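Since antiword prints its result to standard output, the temporary file could be skipped altogether. The following is a minimal sketch, assuming antiword is on the PATH and that its output is UTF-8; `doc_to_text` and its `command` parameter are hypothetical names introduced here, not part of the answer the function above came from.

```python
import subprocess


def doc_to_text(doc_path, command=('antiword',)):
    """Run a converter on doc_path and return what it prints to stdout.

    command is the converter plus any switches, as a sequence of strings.
    The default assumes the antiword executable is on the PATH.
    """
    result = subprocess.run(
        [*command, doc_path],    # a list, so spaces in the path need no quoting
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,              # raise CalledProcessError if the converter fails
    )
    # assumption: the converter emits UTF-8; adjust the encoding for your setup
    return result.stdout.decode('utf-8')
```

With this, the call would become something like `gakushin = doc_to_text('./sample/110624GakushinDraftKAGE2.1.doc')`, and there is no leftover file to clean up if the conversion fails partway.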
Next, I would like to use the text I read to extract the content under each subheading. In this sample, the subheadings are enclosed in 【】, for example 【Problem】.
Delete line breaks, etc.
python
gakushin = get_doc_text('./sample', '110624GakushinDraftKAGE2.1.doc')
gakushin = gakushin.replace('\n', '').replace('|', '').replace('\u3000', '')
Runs of consecutive spaces still remain, so I collapsed them, following an article I found.
python
import re
# Collapse each run of half-width spaces into a single half-width space
gakushin = re.sub(r' +', ' ', gakushin)
There are places where I want a space to remain, such as between "et al." and the year, so for now I left a single half-width space. Strictly speaking, it would be better to remove all spaces and then restore them only around "et al." or "&".
Next, search for the parts enclosed in 【】. I am not good with regular expressions, so I looked it up, and the pattern I found worked.
python
re.findall(r'【.+?】', gakushin)
Output result
['【Background】',
 '【Problem】',
 '【Solutions, research objectives, research methods, features and original points】',
 '【Research progress 1】',
 '【Research progress 2】',
 '【Background of future research plans】',
 '【Problems / Points to be solved】',
 '【How did you come up with the idea】',
 '【2-1】',
 '【2-2】',
 '【Refereed】',
 '【Oral presentation / no peer review】',
 '【Poster presentation / peer review】',
 '【Motivation for aspiring to a research position】',
 '【Aiming researcher image】',
 '【Self-advantages, etc.】',
 '【Especially excellent academic performance and awards】',
 '【Characteristic extracurricular activities】']
The subheadings have been extracted!
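For readers who are likewise unsure about the pattern: `.+?` is the non-greedy form, which stops at the nearest closing bracket instead of the last one. A small illustration on a made-up string (not taken from the application):

```python
import re

sample = '【Background】text A【Problem】text B'

# non-greedy: each match ends at the nearest closing bracket
print(re.findall(r'【.+?】', sample))
# -> ['【Background】', '【Problem】']

# greedy: a single match running from the first 【 to the last 】
print(re.findall(r'【.+】', sample))
# -> ['【Background】text A【Problem】']
```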
Let's store the subheadings in a variable and use the subheadings themselves to split the text gakushin.
python
subhead = re.findall(r'【.+?】', gakushin)
text = gakushin
split_result = []
for head in subhead:
    # split only at the first occurrence so a repeated subheading does not drop text
    before, text = text.split(head, 1)
    split_result.append(before)
# whatever remains after the last subheading is the final section
split_result.append(text)
I was able to split the text at the subheadings and collect the pieces in a list. Let's check the number of elements.
python
print('Number of subheading elements', len(subhead))
print('Number of elements in the divided sentence', len(split_result))
Output result
Number of subheading elements 18
Number of elements in the divided sentence 19
The number of split pieces equals the number of subheadings plus one (18 + 1 = 19), so the counts add up.
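The subheading loop shown earlier could also be condensed with re.split: when the pattern is wrapped in a capturing group, the matched subheadings are kept in the result, alternating with the text between them. A sketch of that variant, with `split_sections` as a hypothetical helper name:

```python
import re


def split_sections(text, pattern=r'(【.+?】)'):
    """Split text at bracketed subheadings.

    Returns (preamble, sections), where sections is a list of
    (subheading, body) tuples in document order.
    """
    # the capturing group makes re.split keep the subheadings:
    # parts = [preamble, head1, body1, head2, body2, ...]
    parts = re.split(pattern, text)
    preamble = parts[0]
    sections = list(zip(parts[1::2], parts[2::2]))
    return preamble, sections


preamble, sections = split_sections('intro【A】one【B】two')
print(preamble)   # -> intro
print(sections)   # -> [('【A】', 'one'), ('【B】', 'two')]
```

Here the text before the first subheading comes back separately as `preamble`, so nothing has to be sliced off afterwards.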
Next, store the results in a pandas DataFrame so that each subheading lines up with the text below it. The first element of the list split_result, which is the text before the first subheading, is discarded.
python
import pandas as pd
# split_result[0] is the text before the first subheading, so skip it
df = pd.DataFrame({'subhead': subhead, 'text': split_result[1:]})
Each subheading is now paired with the text below it. Let's count the number of characters in each text and add it to the data frame.
python
df['length'] = df.text.apply(len)
The item 【2-2】 seems to be particularly long. From the subheading alone it is not clear what 【2-2】 stands for. It looks like the research plan, but why there is no 【1】 is a mystery.
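If pandas is not at hand, the same pairing and character count can be sketched with built-ins. The `sections` list below is toy data standing in for the subheading/text pairs, not the contents of the actual application.

```python
# toy (subheading, text) pairs standing in for the real split result
sections = [
    ('【Background】', 'short text'),
    ('【2-2】', 'a much longer piece of text ' * 3),
    ('【Problem】', 'medium length text'),
]

# count characters per section, like df['length'] = df.text.apply(len)
lengths = {head: len(body) for head, body in sections}

# list the sections from longest to shortest
for head, n in sorted(lengths.items(), key=lambda item: item[1], reverse=True):
    print(head, n)

# the subheading whose text is longest
longest = max(lengths, key=lengths.get)
```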
I was able to read the .doc file from Python and manipulate the text. I would like to try various things in the future.
-Research Fellow | Japan Society for the Promotion of Science