In Python, I scraped a web page written in Markdown notation, saved it as Markdown, converted the Markdown to HTML, and then converted the HTML to PDF. (Complicated, I know.)
I used Beautiful Soup for the scraping, but I will omit that part this time.
I couldn't find much information about Markdown → HTML → PDF, so I summarized it here.
Markdown → HTML uses Markdown, and HTML → PDF uses pdfkit.
The environment is as follows.
On Windows, wkhtmltopdf must be installed before pdfkit can be used.
Install the 64-bit version from this site.
You can either add it to the PATH environment variable or, as in this article, specify the path of the executable file directly in the Python code.
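The two options above (PATH or an explicit path) can be combined. The following is a minimal sketch, not part of the original script: `find_wkhtmltopdf` is a hypothetical helper name, and the fallback is the install location used later in this article, which may differ on your machine.

```python
import shutil
from typing import Optional

# Assumed install location (the one used later in this article); adjust as needed.
DEFAULT_WKHTMLTOPDF = r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"


def find_wkhtmltopdf(fallback: Optional[str] = DEFAULT_WKHTMLTOPDF) -> Optional[str]:
    """Return the wkhtmltopdf found on PATH, or the fallback path if none is found."""
    found = shutil.which("wkhtmltopdf")
    return found if found is not None else fallback


if __name__ == "__main__":
    # Prints whichever location would be used on this machine.
    print(find_wkhtmltopdf())
```

The returned string could then be passed to `pdfkit.configuration(wkhtmltopdf=...)`.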
Pygments is also used to highlight the code blocks.
The folder structure is as follows.
app
|- file
|   |- source
|   |   └─ source.md   ← Original Markdown
|   └─ pdf             ← PDF output destination
|- app.py
└─ requirements.txt
Install each library.
requirements.txt
Markdown==3.2.1
pdfkit==0.6.1
Pygments==2.6.1
> pip install -r requirements.txt
The Markdown used as the conversion source is as follows.
source.md
# title

## subtitle

Markdown text.

### list

1. Numbered list
    1. Nested numbered list
1. Numbered list

### text

This is *italic* text.
This is **emphasized** text.
This is a [link](https://qiita.com/).

### Source code

- Java

\```java
class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World");
    }
}
\```

- Python

\```python
class HelloWorld:
    def __init__(self, id, name):
        self.id = id
        self.name = name

if __name__=='__main__':
    hello = HelloWorld(1, 'python')
\```

### Ruled line

***

---

* * *

### table

| Table Header | Table Header | Table Header |
| :-- | :--: | --: |
| Body | Body | Body |
| Left | Center | Right |
Use the Markdown library to convert Markdown to HTML.
Enable the codehilite extension to highlight the source code with Pygments.
app.py
import markdown
from pygments.formatters import HtmlFormatter


def mark_to_html():
    # Read the Markdown file
    with open('file/source/source.md', mode='r', encoding='UTF-8') as f:
        text = f.read()
    # Create the stylesheet for highlighting with Pygments
    style = HtmlFormatter(style='solarized-dark').get_style_defs('.codehilite')
    # Markdown → HTML conversion
    md = markdown.Markdown(extensions=['extra', 'codehilite'])
    body = md.convert(text)
    # Wrap the result in an HTML document
    html = '<html lang="ja"><meta charset="utf-8"><body>'
    # Embed the stylesheet created with Pygments
    html += '<style>{}</style>'.format(style)
    # Add a style that draws borders on the table
    html += '''<style> table,th,td {
        border-collapse: collapse;
        border:1px solid #333;
    } </style>'''
    html += body + '</body></html>'
    return html
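The function above builds the HTML document by string concatenation. As an alternative, here is a rough sketch that keeps the page skeleton in one place with the standard library's `string.Template`. `PAGE` and `wrap_html` are names invented for this illustration, and placing the `<style>` element inside `<head>` is my own choice rather than something the original script does.

```python
from string import Template

# Page skeleton; $lang, $css and $body are filled in by wrap_html().
PAGE = Template("""<!DOCTYPE html>
<html lang="$lang">
<head>
<meta charset="utf-8">
<style>$css</style>
</head>
<body>
$body
</body>
</html>""")


def wrap_html(body: str, css: str = "", lang: str = "ja") -> str:
    """Wrap an HTML fragment and a stylesheet in a complete HTML document."""
    return PAGE.substitute(lang=lang, css=css, body=body)


if __name__ == "__main__":
    print(wrap_html("<p>Hello</p>", "p { color: #333; }"))
```

With this approach, the Pygments stylesheet and the table border rules could be joined into one string and passed as `css`.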
Markdown provides the following extensions, which you pass as a list when creating the object.
md = markdown.Markdown(extensions=["Extensions"])
Only a selection of useful ones is excerpted here. See the page below for all extensions.
https://python-markdown.github.io/extensions/#officially-supported-extensions
| Extension | Function |
|---|---|
| extra | A bundle of extensions that adds abbreviations, attribute lists, definition lists, fenced code blocks, footnotes, and tables |
| admonition | Outputs admonition (note) blocks |
| codehilite | Adds syntax highlighting defined in Pygments to code blocks. Requires Pygments. |
| meta | Reads the meta information of the file |
| nl2br | Converts a single line break into a `<br>` tag |
| sane_lists | Makes list handling less surprising, for example numbered lists that start from a specified number |
| smarty | Converts ASCII quotes, dashes, and ellipses into their HTML entity equivalents |
| toc | Automatically creates a table of contents from the heading structure |
| wikilinks | Supports the `[[ ]]` link notation |
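To see what an extension changes, it can help to convert the same text with and without it. The snippet below is a small sketch using `nl2br` from the table above; the input text and variable names are made up for illustration.

```python
import markdown

# Two lines separated by a single newline (no blank line between them).
text = "first line\nsecond line"

# Without nl2br, both lines end up in one paragraph with no explicit break.
plain = markdown.markdown(text)

# With nl2br, the single newline is turned into a <br> tag.
with_breaks = markdown.markdown(text, extensions=["nl2br"])

print(plain)
print(with_breaks)
```

The same pattern should work for the other extensions in the table by changing the name in the `extensions` list.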
You can also use third-party extensions.
https://github.com/Python-Markdown/markdown/wiki/Third-Party-Extensions
For the Pygments highlighting, specify the style and the class name to apply.
style = HtmlFormatter(style="Style name").get_style_defs("name of the class")
The available styles can be listed with the Pygments command line tool.
> pygmentize -L styles
When codehilite is enabled in Markdown, the highlighted code is wrapped in an element with class="codehilite", so specify .codehilite as the class name.
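The style names can also be listed from Python instead of the command line. This is a small sketch assuming Pygments is installed; it uses the `default` style only because it ships with every Pygments release, not because the article uses it.

```python
from pygments.formatters import HtmlFormatter
from pygments.styles import get_all_styles

# Names of all styles known to the installed Pygments version.
styles = sorted(get_all_styles())
print(styles)

# CSS rules for one style, scoped to elements with class "codehilite".
css = HtmlFormatter(style="default").get_style_defs(".codehilite")
print(css[:200])
```

Any name from the printed list can be passed as `style=`, as `solarized-dark` is in this article.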
HTML → PDF
A library called pdfkit is used to convert HTML to PDF.
Because this library uses wkhtmltopdf internally, you need to add it to the PATH environment variable or specify the path of the executable file at runtime.
app.py
import pdfkit


def html_to_pdf(html: str):
    """
    html : str  HTML string to convert
    """
    # Output file
    outputfile = 'file/pdf/output.pdf'
    # Path of the wkhtmltopdf executable file
    path_wkhtmltopdf = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
    config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
    # Perform the HTML → PDF conversion
    pdfkit.from_string(html, outputfile, configuration=config)
This time an HTML string is converted to a PDF file, but you can also convert a website to PDF by specifying its URL. In that case, use the from_url function. To convert an HTML file to PDF, use from_file.
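The choice between the three pdfkit functions could be sketched as a small dispatcher. `choose_pdfkit_function` and `to_pdf` are hypothetical helper names, and the rules used here (an http/https scheme means a URL, an existing `.html`/`.htm` path means a file, anything else is treated as an HTML string) are a simplifying assumption of mine.

```python
import os
from urllib.parse import urlparse


def choose_pdfkit_function(source: str) -> str:
    """Return the name of the pdfkit function that fits `source`."""
    if urlparse(source).scheme in ("http", "https"):
        return "from_url"
    if source.lower().endswith((".html", ".htm")) and os.path.isfile(source):
        return "from_file"
    return "from_string"


def to_pdf(source: str, output_path: str, configuration=None) -> None:
    """Convert a URL, an HTML file, or an HTML string to a PDF file."""
    # Imported here so that choose_pdfkit_function can be tried without pdfkit installed.
    import pdfkit

    convert = getattr(pdfkit, choose_pdfkit_function(source))
    convert(source, output_path, configuration=configuration)
```

For example, `to_pdf('https://qiita.com/', 'file/pdf/output.pdf', configuration=config)` would go through `from_url`, where `config` is the object created with `pdfkit.configuration` as in the code above.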
import markdown
import pdfkit
from pygments.formatters import HtmlFormatter


def mark_to_html():
    # Read the Markdown file
    with open('file/source/source.md', mode='r', encoding='UTF-8') as f:
        text = f.read()
    # Create the stylesheet for highlighting with Pygments
    style = HtmlFormatter(style='solarized-dark').get_style_defs('.codehilite')
    # Markdown → HTML conversion
    md = markdown.Markdown(extensions=['extra', 'codehilite'])
    body = md.convert(text)
    # Wrap the result in an HTML document
    html = '<html lang="ja"><meta charset="utf-8"><body>'
    # Embed the stylesheet created with Pygments
    html += '<style>{}</style>'.format(style)
    # Add a style that draws borders on the table
    html += '''<style> table,th,td {
        border-collapse: collapse;
        border:1px solid #333;
    } </style>'''
    html += body + '</body></html>'
    return html


def html_to_pdf(html: str):
    # Output file
    outputfile = 'file/pdf/output.pdf'
    # Path of the wkhtmltopdf executable file
    path_wkhtmltopdf = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
    config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
    # Perform the HTML → PDF conversion
    pdfkit.from_string(html, outputfile, configuration=config)


if __name__=='__main__':
    html = mark_to_html()
    html_to_pdf(html)
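The script writes to `file/pdf/output.pdf`, so it assumes the `file/pdf` folder already exists. If that is not guaranteed, a helper along these lines could create it first. This is only a sketch: `prepare_output_path` is a name made up for this example, and I have not checked how wkhtmltopdf behaves when the folder is missing.

```python
from pathlib import Path


def prepare_output_path(output_dir: str, filename: str = "output.pdf") -> str:
    """Create output_dir if needed and return the full path of the output file."""
    directory = Path(output_dir)
    # parents=True creates intermediate folders; exist_ok=True tolerates reruns.
    directory.mkdir(parents=True, exist_ok=True)
    return str(directory / filename)


if __name__ == "__main__":
    print(prepare_output_path("file/pdf"))
```

The returned path could be used in place of the hard-coded `outputfile` value.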
The result of executing the above function and converting it is as follows.
The Markdown output was converted to PDF, and the code blocks are highlighted. Images are not included in this example, but they can be output without any problem.
I used the Python libraries Markdown and pdfkit to convert Markdown to PDF.
If you create a template, you can convert meeting minutes to PDF right away, which is convenient.
I hope this helps you work more efficiently.
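As one way to picture the template idea for minutes, here is a minimal sketch that fills a Markdown template with the standard library's `string.Template`. The template layout, the field names, and `render_minutes` are all invented for this illustration.

```python
from string import Template
from typing import List

# Hypothetical Markdown template for meeting minutes.
MINUTES_TEMPLATE = Template("""# $title

- Date: $date
- Attendees: $attendees

## Agenda

$agenda

## Decisions

$decisions
""")


def render_minutes(title: str, date: str, attendees: List[str],
                   agenda: List[str], decisions: List[str]) -> str:
    """Fill the template and return the minutes as Markdown text."""
    return MINUTES_TEMPLATE.substitute(
        title=title,
        date=date,
        attendees=", ".join(attendees),
        agenda="\n".join("1. {}".format(item) for item in agenda),
        decisions="\n".join("- {}".format(item) for item in decisions),
    )


if __name__ == "__main__":
    print(render_minutes("Weekly sync", "2020-05-01", ["Sato", "Suzuki"],
                         ["Release schedule"], ["Ship on Friday"]))
```

The resulting text could be saved as `source.md` and run through the conversion described in this article.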