[Java] Text extraction from PowerPoint (ppt) using Apache POI
## Introduction
I needed to extract text from a large number of PowerPoint files. I originally planned to use python-pptx, but gave up because the files were in the PowerPoint 2003 or earlier format (.ppt extension), which python-pptx does not support.
Instead, I decided to extract the text with Apache POI, a third-party Java library.
## What is Apache POI
![apache poi.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/215216/93f84b23-ae06-5ae5-1904-1e35e8b675d8.png)
Apache POI is a pure Java library that can read and write files in Microsoft Office formats.
In addition to PowerPoint, which this article covers, it can also handle Excel and Word files.
## Download Apache POI
Download Apache POI from the official Apache POI download page and add the JAR files to your classpath.
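If the project is built with Maven, the JARs can be declared as a dependency instead of being downloaded by hand. The fragment below is only a sketch: the version number is an assumption, so check the current release before copying it. The HSLF classes used in this article ship in the `poi-scratchpad` artifact, and Maven should normally resolve the core `poi` artifact transitively.

```xml
<!-- Sketch of a pom.xml fragment; the version shown is an assumption -->
<dependencies>
  <!-- poi-scratchpad contains org.apache.poi.hslf (the .ppt support) -->
  <dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>5.2.5</version>
  </dependency>
</dependencies>
```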
## Program
The goal is to get the text from the first slide of the following ppt file.
![ppt2txt.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/215216/47d3df1d-0137-e1a2-506e-24b1328ba356.png)
The following program does this:
PPT2txt.java

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;

import org.apache.poi.hslf.usermodel.HSLFSlide;
import org.apache.poi.hslf.usermodel.HSLFSlideShow;
import org.apache.poi.util.IOUtils;

public class PPT2txt {
    public static void main(String[] args) throws IOException {
        File file = new File("./data/test.ppt");

        // Raise the byte-array limit if the file to be read is large
        // IOUtils.setByteArrayMaxOverride(10000000);

        // try-with-resources closes both the stream and the slide show
        try (FileInputStream inputStream = new FileInputStream(file);
             HSLFSlideShow ppt = new HSLFSlideShow(inputStream)) {
            // Get a list of all slides in the presentation
            List<HSLFSlide> slides = ppt.getSlides();

            // List indexes are zero-based: 0 is the first slide, 0 is its first text box
            int page = 0;
            int paragraph = 0;
            System.out.println(slides.get(page).getTextParagraphs().get(paragraph));
        }
    }
}
```
## Result
Running the program prints the text from the first slide.
[1st page text]
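Since the original goal was to process a large number of files, the paths of the ppt files have to be collected before the extraction can run on each of them. Here is a minimal sketch of that step using only the Java standard library. `PptFileFinder` and `findPptFiles` are hypothetical names, and the sketch assumes the files sit somewhere under one directory such as `./data`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PptFileFinder {
    /** Hypothetical helper: returns every regular file under root whose name ends in .ppt, sorted. */
    static List<Path> findPptFiles(Path root) {
        // Files.walk visits the directory tree recursively; the stream must be closed
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    // Compare in lower case so that "B.PPT" is matched as well
                    .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".ppt"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed default directory; pass another one as the first argument
        Path root = Paths.get(args.length > 0 ? args[0] : "./data");
        if (!Files.isDirectory(root)) {
            System.out.println("Not a directory: " + root);
            return;
        }
        for (Path ppt : findPptFiles(root)) {
            System.out.println(ppt);
        }
    }
}
```

Each returned path could then be opened in the same way as `./data/test.ppt` in the program above. Files ending in `.pptx` are deliberately left out, because they need a different POI package.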
## Conclusion
In this article, I introduced how to extract text from files in the PowerPoint 2003 or earlier format (.ppt extension), but of course the same is possible for the PowerPoint 2007 and later format (.pptx extension).
In that case, use org.apache.poi.xslf instead of org.apache.poi.hslf.
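When a folder contains both formats, the choice between the two packages can be made from the file extension. The sketch below uses only the standard library and simply reports which POI slide show class applies: `HSLFSlideShow` for ppt and `XMLSlideShow` for pptx. `SlideFormatChooser`, `SlideFormat`, `formatOf`, and the sample file names other than `./data/test.ppt` are hypothetical names introduced for illustration.

```java
import java.util.Locale;
import java.util.Optional;

class SlideFormatChooser {
    /** The two formats discussed in the article, each with the POI class that opens it. */
    enum SlideFormat {
        PPT("org.apache.poi.hslf.usermodel.HSLFSlideShow"),
        PPTX("org.apache.poi.xslf.usermodel.XMLSlideShow");

        final String slideShowClass;

        SlideFormat(String slideShowClass) {
            this.slideShowClass = slideShowClass;
        }
    }

    /** Hypothetical helper: picks the format from the file extension, ignoring case. */
    static Optional<SlideFormat> formatOf(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        // Check .pptx first for readability; ".pptx" does not end with ".ppt" anyway
        if (lower.endsWith(".pptx")) {
            return Optional.of(SlideFormat.PPTX);
        }
        if (lower.endsWith(".ppt")) {
            return Optional.of(SlideFormat.PPT);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] names = {"./data/test.ppt", "./data/test.pptx", "notes.txt"};
        for (String name : names) {
            String target = formatOf(name).map(f -> f.slideShowClass).orElse("unsupported");
            System.out.println(name + " -> " + target);
        }
    }
}
```

The XSLF classes are not in the same JAR as the HSLF ones, so reading pptx files also requires the `poi-ooxml` artifact on the classpath.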
In addition to text extraction, you can also retrieve images and create slides. For those tasks, please refer to the official documentation.
Thank you for reading to the end.