[Java] Text extraction from PowerPoint (ppt) using Apache POI

Introduction

I needed to extract text from a large number of PowerPoint files and originally planned to use python-pptx, but I gave up on it because the files were in the PowerPoint 2003 or earlier format (extension ppt), which python-pptx cannot read. I therefore decided to extract the text with Apache POI, an external Java library.

What is Apache POI

![apache poi.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/215216/93f84b23-ae06-5ae5-1904-1e35e8b675d8.png)

Apache POI is a 100% Java library that can read and write files in Microsoft Office formats. In addition to PowerPoint, which this article covers, it can also handle Excel and Word files.

Download Apache POI

Download Apache POI from the official Apache POI download page.

Program

The goal is to get the text on the first slide of the following ppt file.

![ppt2txt.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/215216/47d3df1d-0137-e1a2-506e-24b1328ba356.png)

The following program does this.

PPT2txt.java


import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;

import org.apache.poi.hslf.usermodel.HSLFSlide;
import org.apache.poi.hslf.usermodel.HSLFSlideShow;
import org.apache.poi.util.IOUtils;

public class PPT2txt {
	public static void main(String[] args) throws IOException {
		File file = new File("./data/test.ppt");

		// Increase the maximum value if the file to be read is large
//		IOUtils.setByteArrayMaxOverride(10000000);

		// try-with-resources closes the stream and the slide show even if an exception is thrown
		try (FileInputStream inputStream = new FileInputStream(file);
				HSLFSlideShow ppt = new HSLFSlideShow(inputStream)) {

			// Get a list of all slides in the presentation
			List<HSLFSlide> slides = ppt.getSlides();

			// Indexes are zero-based: 0 is the first slide
			int page = 0;
			// Index of the text block on that slide (1 is the second block)
			int paragraph = 1;

			System.out.println(slides.get(page).getTextParagraphs().get(paragraph));
		}
	}
}

Result

Running the program prints the text on the first slide.

[1st page text]
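
To go from a single text block to the whole file, the same calls can be placed in a loop. The following is a minimal sketch, not a definitive implementation: the class name PptAllText and the helper extractAllText are hypothetical names chosen for illustration, and the default path simply reuses the sample file from the program above. It assumes a POI version in which HSLFSlideShow can be closed, as in the original program.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hslf.usermodel.HSLFSlide;
import org.apache.poi.hslf.usermodel.HSLFSlideShow;
import org.apache.poi.hslf.usermodel.HSLFTextParagraph;

// Hypothetical class name, used only for this illustration
class PptAllText {

	// Collects the text of every text block on every slide, in slide order
	static List<String> extractAllText(HSLFSlideShow ppt) {
		List<String> texts = new ArrayList<>();
		for (HSLFSlide slide : ppt.getSlides()) {
			// Each inner list holds the paragraphs of one text block
			for (List<HSLFTextParagraph> block : slide.getTextParagraphs()) {
				// Join the paragraphs of the block into a single string
				texts.add(HSLFTextParagraph.getText(block));
			}
		}
		return texts;
	}

	public static void main(String[] args) throws IOException {
		// Assumption: falls back to the sample path used earlier in this article
		String path = args.length > 0 ? args[0] : "./data/test.ppt";
		try (InputStream in = new FileInputStream(path);
				HSLFSlideShow ppt = new HSLFSlideShow(in)) {
			for (String text : extractAllText(ppt)) {
				System.out.println(text);
			}
		}
	}
}
```

Returning a list instead of printing directly keeps the extraction step reusable, for example when many files are processed one after another.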

Conclusion

This article showed how to extract text from files in the PowerPoint 2003 or earlier format (extension ppt). The same is of course possible for the PowerPoint 2007 and later format (extension pptx); in that case, use org.apache.poi.xslf instead of org.apache.poi.hslf. Besides text extraction, Apache POI can also retrieve images and create slides; see the official documentation for details.
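
For pptx files, the switch from hslf to xslf could look roughly like the sketch below. It is an illustration under assumptions rather than a tested recipe for every file: the class name Pptx2txt, the helper extractAllText, and the default path ./data/test.pptx are hypothetical, and the xslf classes require the poi-ooxml artifact in addition to the core POI jar. Only top-level text shapes are visited; text inside grouped shapes or tables is not collected.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.xslf.usermodel.XMLSlideShow;
import org.apache.poi.xslf.usermodel.XSLFShape;
import org.apache.poi.xslf.usermodel.XSLFSlide;
import org.apache.poi.xslf.usermodel.XSLFTextShape;

// Hypothetical class name, used only for this illustration
class Pptx2txt {

	// Collects the text of every top-level text shape on every slide
	static List<String> extractAllText(XMLSlideShow pptx) {
		List<String> texts = new ArrayList<>();
		for (XSLFSlide slide : pptx.getSlides()) {
			for (XSLFShape shape : slide.getShapes()) {
				// Pictures, connectors, and similar shapes carry no text and are skipped
				if (shape instanceof XSLFTextShape) {
					texts.add(((XSLFTextShape) shape).getText());
				}
			}
		}
		return texts;
	}

	public static void main(String[] args) throws IOException {
		// Assumption: a sample pptx file stored next to the ppt sample
		String path = args.length > 0 ? args[0] : "./data/test.pptx";
		try (InputStream in = new FileInputStream(path);
				XMLSlideShow pptx = new XMLSlideShow(in)) {
			for (String text : extractAllText(pptx)) {
				System.out.println(text);
			}
		}
	}
}
```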

Thank you for reading to the end.
