I expected there to be plenty of libraries for reading and converting PDFs, but to my surprise I couldn't find one that was easy to use. Poppler, which provides a Qt binding, turned out to be quite good, so I tried it.
Poppler-qt4 (C++): http://people.freedesktop.org/~aacid/docs/qt4/
python-poppler-qt4: https://pypi.python.org/pypi/python-poppler-qt4/
Poppler's documentation is available at the Poppler-qt4 link above. Start with the Document class to see what the library can do.
There is also a Qt5 version, but this time I used the Qt4 version. The examples are written in C++ or Python; the Python ones use Python 3.
All the code in this post is also on GitHub.
As a demo, I made the following.
Load a document with `doc = Poppler.Document.load(path)`.
Note that if you don't call `doc.setRenderHint(Poppler.Document.TextAntialiasing)`, the rendered text looks very rough.
After loading, get a Page object with `page = doc.page(i)`. Page numbers start at zero.
Calling `image = page.renderToImage()` returns the page as a Qt [QImage](http://doc.qt.io/qt-4.8/qimage.html). The image can be saved with `image.save()`.
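The calls above can be combined into one function. This is a minimal sketch rather than the code from this post: `page_index` and `render_page` are hypothetical helper names, the 1-based `page_number` argument is my own assumption about what a caller would want, and I am assuming a failed load shows up as `None` in Python, mirroring the null pointer in the C++ API. The Poppler import is deferred so that the index helper works even without the bindings installed.

```python
def page_index(page_number, num_pages):
    """Convert a 1-based page number into Poppler's 0-based page index."""
    if not 1 <= page_number <= num_pages:
        raise ValueError(
            'page {0} is outside 1..{1}'.format(page_number, num_pages))
    return page_number - 1


def render_page(pdf_path, page_number, out_path, fmt='PNG'):
    """Render one page to an image file; returns True if the save succeeded."""
    # Imported here so page_index() can be used without python-poppler-qt4.
    from popplerqt4 import Poppler

    doc = Poppler.Document.load(pdf_path)
    if doc is None:  # assumption: a failed load is reported as None
        raise IOError('could not open {0}'.format(pdf_path))

    # Without this hint the rendered text comes out jagged.
    doc.setRenderHint(Poppler.Document.TextAntialiasing)

    page = doc.page(page_index(page_number, doc.numPages()))
    return page.renderToImage().save(out_path, fmt)
```

For example, `render_page('sample.pdf', 1, 'sample_1.png')` would write the first page of a (hypothetical) `sample.pdf` as a PNG.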
This example is written in Python. Porting it to C++ should be easy if you need to.
dump_image.py
import sys
import os.path
from PyQt4 import QtCore
from popplerqt4 import Poppler

FORMAT = 'PNG'
EXT = '.png'

def dump_image(path):
    doc = Poppler.Document.load(path)
    # Without this hint the rendered text comes out jagged.
    doc.setRenderHint(Poppler.Document.TextAntialiasing)

    # "foo.pdf" -> "foo_1.png", "foo_2.png", ...
    filename_fmt = os.path.splitext(path)[0] + '_{0}' + EXT

    # doc.page() is zero-based; the output files are numbered from 1.
    for n, page in ((i + 1, doc.page(i)) for i in range(doc.numPages())):
        page.renderToImage().save(filename_fmt.format(n), FORMAT)

if __name__ == '__main__':
    app = QtCore.QCoreApplication(sys.argv)

    if len(sys.argv) != 2:
        print('Usage: {0} pdf_path'.format(sys.argv[0]))
    else:
        dump_image(sys.argv[1])

    sys.exit(0)
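The output naming used by the script can be pulled out into a small function of its own. A sketch only: `page_image_path` is a hypothetical name, and the optional zero padding (`width`) is my addition so that files sort in page order; the script above does not pad.

```python
import os.path


def page_image_path(pdf_path, page_number, ext='.png', width=1):
    """Build the output path for one page, e.g. foo.pdf -> foo_3.png."""
    base = os.path.splitext(pdf_path)[0]
    # zfill pads with leading zeros up to `width` digits (no effect if width=1).
    return '{0}_{1}{2}'.format(base, str(page_number).zfill(width), ext)
```

With `width=3`, page 3 of `report.pdf` becomes `report_003.png`, which keeps page 10 from sorting before page 2 in a file listing.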
There are probably many libraries besides Poppler for text extraction, but Poppler can do it too. `page.textList()` returns a list of TextBox objects, roughly one per word, so the script loops over them and prints each one.
dump_text.py
import sys
from PyQt4 import QtCore
from popplerqt4 import Poppler

def dump_text(path):
    doc = Poppler.Document.load(path)

    # doc.page() is zero-based; the printed page numbers start at 1.
    for n, page in ((i + 1, doc.page(i)) for i in range(doc.numPages())):
        print('\n-------- Page {0} -------'.format(n))

        # Each TextBox holds roughly one word; print them separated by spaces.
        for txtbox in page.textList():
            print(txtbox.text(), end=' ')

if __name__ == '__main__':
    app = QtCore.QCoreApplication(sys.argv)

    if len(sys.argv) != 2:
        print('Usage: {0} pdf_path'.format(sys.argv[0]))
    else:
        dump_text(sys.argv[1])

    sys.exit(0)
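The script prints every word of a page on one long line. If line breaks matter, one option is to group words by their vertical position. The following is a rough heuristic of my own, not something Poppler provides: `group_into_lines` is a hypothetical helper, the 2-point tolerance is arbitrary, and it assumes the words arrive in reading order. With Poppler, the input pairs could be built as `[(t.text(), t.boundingBox().top()) for t in page.textList()]`.

```python
def group_into_lines(words, tolerance=2.0):
    """Group (text, top) pairs into lines of text.

    A new line starts whenever a word's top coordinate differs from the
    first word of the current line by more than `tolerance`.
    """
    lines = []
    current = []        # words collected for the line being built
    current_top = None  # top coordinate of that line's first word

    for text, top in words:
        if current and abs(top - current_top) > tolerance:
            lines.append(' '.join(current))
            current = []
        if not current:
            current_top = top
        current.append(text)

    if current:
        lines.append(' '.join(current))
    return lines
```

Multi-column layouts or rotated text would defeat this simple comparison, so treat it as a starting point only.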
As shown above, Poppler can convert a page to a QImage. It can also draw a page through a QPainter. Here, though, I take the QImage, convert it to a QPixmap, and display it on a QLabel widget (PdfWidget is a subclass of QLabel).
I didn't want to build anything elaborate, so there is no UI for choosing a file. The file path is passed to the widget's constructor, which loads the document. The page should probably be scaled to fit the widget size, but I haven't done that this time.
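For the fit-to-widget scaling that was left out, the main step is choosing a rendering resolution. Below is a sketch of that arithmetic in Python; `dpi_to_fit` is a hypothetical helper and is not part of the widget in this post. It relies on PDF page sizes being measured in points (72 per inch). The page size would come from `page.pageSizeF()`, and the result would be passed as both resolution arguments of `renderToImage()`.

```python
def dpi_to_fit(page_w_pt, page_h_pt, widget_w_px, widget_h_px):
    """Largest resolution (DPI) at which the whole page fits in the widget."""
    if page_w_pt <= 0 or page_h_pt <= 0:
        raise ValueError('page size must be positive')

    # pixels = points * dpi / 72, so scaling the page by `scale` needs a
    # resolution of 72 * scale. Use the smaller ratio to keep the aspect ratio.
    scale = min(widget_w_px / page_w_pt, widget_h_px / page_h_pt)
    return 72.0 * scale
```

A widget exactly as large as the page in points gives 72 DPI, and a widget twice as large in both directions gives 144 DPI.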
pdfwidget.h
#include <QLabel>
#include <QString>

namespace Poppler {
class Document;
}

class PdfWidget : public QLabel {
    Q_OBJECT
public:
    PdfWidget(QString path, QWidget *parent = 0);

public slots:
    void next_page();
    void prev_page();

private:
    void load_current_page();

    int n_pages;
    int current_page;
    QString path;
    Poppler::Document *doc;
};
pdfwidget.cpp
#include "pdfwidget.h"
#include <poppler-qt4.h>
#include <QAction>
#include <QPixmap>

PdfWidget::PdfWidget(QString path, QWidget *parent)
    : QLabel(parent), path(path)
{
    doc = Poppler::Document::load(path);
    doc->setRenderHint(Poppler::Document::TextAntialiasing);
    n_pages = doc->numPages();
    current_page = 0;
    load_current_page();

    // Right arrow key: go to the next page.
    QAction *next_action = new QAction("Next Page", this);
    next_action->setShortcut(Qt::Key_Right);
    connect(next_action, SIGNAL(triggered()), this, SLOT(next_page()));
    addAction(next_action);

    // Left arrow key: go to the previous page.
    QAction *prev_action = new QAction("Prev Page", this);
    prev_action->setShortcut(Qt::Key_Left);
    connect(prev_action, SIGNAL(triggered()), this, SLOT(prev_page()));
    addAction(prev_action);
}

void PdfWidget::next_page()
{
    // Wrap around to the first page after the last one.
    current_page++;
    if (current_page >= n_pages) current_page = 0;
    load_current_page();
}

void PdfWidget::prev_page()
{
    // Wrap around to the last page before the first one.
    if (current_page) current_page--;
    else current_page = n_pages - 1;
    load_current_page();
}

void PdfWidget::load_current_page()
{
    // Document::page() hands ownership of the Page to the caller,
    // so delete it once the image has been rendered.
    Poppler::Page *page = doc->page(current_page);
    setPixmap(QPixmap::fromImage(page->renderToImage()));
    delete page;

    setWindowTitle(QString("%1/%2 - %3").arg(current_page + 1)
                   .arg(n_pages).arg(path));
}
For now, the right and left arrow keys move to the next and previous page.
As far as I know, there aren't many libraries that can convert PDFs to images, in either C++ or Python. I introduced Poppler here as a library that is reasonably easy to use and works on multiple platforms.
It pairs well with Qt, so it is very convenient when you want to handle PDFs in a GUI.
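The page navigation wraps around at both ends. In Python the same behaviour can be written with the modulo operator, as in this small sketch (the function names are hypothetical and only mirror the widget's slots):

```python
def next_page(current, n_pages):
    """Index of the page after `current`, wrapping from the last to the first."""
    return (current + 1) % n_pages


def prev_page(current, n_pages):
    """Index of the page before `current`, wrapping from the first to the last."""
    # Python's % never returns a negative result for a positive divisor,
    # so -1 % n_pages is n_pages - 1. In C++ the result would be negative,
    # and (current + n_pages - 1) % n_pages would be needed instead.
    return (current - 1) % n_pages
```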