Offices handle many kinds of documents, such as contracts and reports, and most of them are probably Word documents.
When automating office work, you will often want to automate creating and reading Word documents as well. In fact, I use Python to automate creating the consignment contracts that I prepare every three months.
In this article, I will explain how to **read Word documents with Python** using a library called python-docx. (Next time, I will introduce how to create Word documents and replace text in them.)
python-docx is not part of the standard library. It is not included in Anaconda by default either, so let's install it first.
pip install python-docx
After installation, import the library. Note that the module name to import is **docx**, not python-docx.
python
import docx
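Because the name used with pip (python-docx) differs from the name used in the import statement (docx), it can help to check both when an import fails. The following is a minimal sketch using only the standard library. `describe_install` is a hypothetical helper name, and `importlib.metadata` assumes Python 3.8 or later.

```python
import importlib.util
from importlib import metadata


def describe_install(dist_name="python-docx", module_name="docx"):
    """Return (importable, version) for a package.

    dist_name   -- the name used with pip install
    module_name -- the name used with import
    """
    # True if "import <module_name>" would find a module
    importable = importlib.util.find_spec(module_name) is not None
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # pip does not know a distribution by this name
    return importable, version


print(describe_install())
```

If the first value is False, the library is not visible to the Python interpreter you are running, which often means pip installed it into a different environment.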
Next, load the Word document to create a document object. Here we load a file named "test.docx".
python
document = docx.Document("test.docx")
The document object has two lists: paragraphs and tables.
paragraphs holds the paragraphs of the body text, and tables holds the tables. Each table has a list of rows called rows, and each row has a list of cells called cells. To get the text of a paragraph or a cell, read its text attribute.
In other words, the structure is document → paragraphs → text for body text, and document → tables → rows → cells → text for tables.
python
# Print the text of every paragraph in the body
for paragraph in document.paragraphs:
    print(paragraph.text)
Execution result
This is the first paragraph.
This is the second paragraph.
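Printing is fine for a quick check, but in automation you usually want the text as data. The following is a minimal sketch, not a python-docx feature: `paragraph_texts` and `read_paragraphs` are hypothetical helper names. The first function only assumes that the object it receives has a paragraphs list whose items have a text attribute, as described above.

```python
def paragraph_texts(document, skip_empty=True):
    """Return the text of each paragraph as a list of strings.

    `document` is expected to have a `paragraphs` list, like the
    object returned by docx.Document(...).
    """
    texts = []
    for paragraph in document.paragraphs:
        text = paragraph.text
        if skip_empty and not text.strip():
            continue  # blank lines in Word are empty paragraphs
        texts.append(text)
    return texts


def read_paragraphs(path):
    """Open a .docx file with python-docx and return its paragraph texts."""
    import docx  # imported here so paragraph_texts has no hard dependency

    return paragraph_texts(docx.Document(path))
```

For example, `"\n".join(read_paragraphs("test.docx"))` would give the body text as a single string.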
python
for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)
Execution result
Table row 1 / column 1
Table row 1 / column 2
Table row 2 / column 1
Table row 2 / column 2
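The nested loop prints every cell on its own line, so the row boundaries are lost. If you want to keep the table shape, one option is to collect each table into a list of rows. This is a minimal sketch under the same assumptions as the structure described above (tables, rows, cells, text). `table_to_rows` and `all_tables` are hypothetical helper names, not python-docx functions.

```python
def table_to_rows(table):
    """Convert one table into a list of rows, each a list of cell texts."""
    return [[cell.text for cell in row.cells] for row in table.rows]


def all_tables(document):
    """Return every table in the document as a nested list.

    The result is indexed as result[table][row][column].
    """
    return [table_to_rows(table) for table in document.tables]
```

With a document object loaded as shown earlier, `all_tables(document)[0][1][0]` would be the text of the first table's second row, first column.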
Unfortunately, python-docx can't read **footnotes**.
If you need to process footnotes as well, you will have to consider another method.
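As one possible alternative, a .docx file is a ZIP archive of XML files, so the footnotes can be read with the standard library alone. The following is only a sketch under assumptions: it assumes the footnotes are stored in the usual location, word/footnotes.xml, and it extracts plain text only, ignoring formatting. `read_footnotes` is a hypothetical helper name, and this approach does not use python-docx.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements (w:footnote, w:t, ...)
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Assumed location of the footnotes part inside the archive
FOOTNOTES_PART = "word/footnotes.xml"


def read_footnotes(docx_file):
    """Return the text of each footnote as a list of strings.

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        if FOOTNOTES_PART not in archive.namelist():
            return []  # no footnotes part at the assumed location
        root = ET.fromstring(archive.read(FOOTNOTES_PART))

    footnotes = []
    for footnote in root.iter(W + "footnote"):
        # Separator lines are stored as special footnotes; skip them.
        if footnote.get(W + "type") in ("separator", "continuationSeparator"):
            continue
        # Join all text runs; paragraph breaks inside a footnote are dropped.
        text = "".join(node.text or "" for node in footnote.iter(W + "t"))
        footnotes.append(text)
    return footnotes
```

For example, `read_footnotes("test.docx")` returns an empty list when the file has no footnotes at that location.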