I use pandoc, a general-purpose document format converter. For more information on pandoc, see the Japanese User's Guide.
Sometimes you want to make small changes to a document during conversion. For example, when converting a Markdown document to HTML, you may want to rewrite every link URL at once. You could do this with a regular expression, but pandoc provides a filter feature for exactly this kind of job. Filters operate on the syntax tree of the parsed document. They can be written in Haskell, like pandoc itself, but because the mechanism is language-agnostic they can be written in any language, and Python is officially supported.
As shown below, pandoc converts the syntax tree of the parsed document to JSON and passes it to the filter on standard input; the filter writes the modified tree back on standard output (the figure is from the manual).
source format
↓
(pandoc)
↓
JSON-formatted AST
↓
(filter)
↓
JSON-formatted AST
↓
(pandoc)
↓
target format
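The middle step of the figure can be sketched with nothing but the standard library. This is a minimal illustration, not how pandocfilters is implemented: apply_filter and identity are hypothetical names, and the sample string is a hand-written one-paragraph AST in the same JSON layout used later in this article.

```python
import json


def apply_filter(json_text, transform):
    """Parse a JSON AST, apply `transform` to it, and serialize it again."""
    doc = json.loads(json_text)
    return json.dumps(transform(doc))


def identity(doc):
    """Hypothetical transform that returns the document unchanged."""
    return doc


# In a real filter script, the demo below would be replaced by:
#   import sys
#   sys.stdout.write(apply_filter(sys.stdin.read(), identity))
sample = '[{"unMeta":{}},[{"t":"Para","c":[{"t":"Str","c":"text"}]}]]'
result = apply_filter(sample, identity)
print(json.loads(result) == json.loads(sample))  # prints True
```

A filter that does real work only has to swap identity for a function that edits the parsed tree before it is serialized again.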
Because the filter receives the whole syntax tree, it can make structure-aware changes. First, let's install the officially provided pandocfilters package.
pip install pandocfilters
Let's use it right away to write a filter that rewrites the link URLs in a document.
convertlink.py
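from pandocfilters import toJSONFilter, Link

def myfilter(key, value, format_, meta):
    # Called for every node in the AST: key is the node type, value its contents.
    if key == 'Link':
        # In the AST shown in this article, value is [inlines, [url, title]].
        # Newer pandoc versions put an attribute list first, so the target moves to value[2].
        value[1][0] = "prefix/" + value[1][0]
        return Link(*value)
    # Returning None (implicitly) leaves every other node unchanged.

if __name__ == "__main__":
    toJSONFilter(myfilter)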
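One possible variation, sketched under assumptions rather than taken from the original script, is to prefix only relative URLs and leave absolute ones alone. The names add_prefix, relative_only and PREFIX are hypothetical. The action returns the plain dict form of a Link node, so the sketch needs only the standard library, and it assumes the [url, title] pair is the last element of the node's contents.

```python
from urllib.parse import urlparse

PREFIX = "prefix/"  # hypothetical prefix, same value as in convertlink.py


def add_prefix(url, prefix=PREFIX):
    """Prefix relative URLs; leave absolute URLs, root paths and anchors alone."""
    parsed = urlparse(url)
    if not url or parsed.scheme or parsed.netloc or url.startswith(("/", "#")):
        return url
    return prefix + url


def relative_only(key, value, format_, meta):
    """Filter action with the same signature as myfilter."""
    if key == "Link":
        target = value[-1]  # [url, title]; assumed to be the last element
        target[0] = add_prefix(target[0])
        return {"t": "Link", "c": value}
    # Any other node falls through and is left unchanged.


print(add_prefix("path/to/otherpage"))     # prints prefix/path/to/otherpage
print(add_prefix("http://example.com/a"))  # prints http://example.com/a
```

To run it through pandoc, relative_only would be passed to toJSONFilter in place of myfilter.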
To run it, pass the --filter option to pandoc. Note that a script in the current directory has to be written as "./convertlink.py".
sample.txt
## sample document
text text text
[link](path/to/otherpage)
$ pandoc --filter=./convertlink.py -t markdown sample.txt
sample document
---------------
text text text
[link](prefix/path/to/otherpage)
pandoc itself can print the syntax tree (pandoc AST) of a document. Specify json as the output format to see it as JSON, or native to see it in Haskell notation.
$ pandoc -t json sample.txt
[{"unMeta":{}},[{"t":"Header","c":[2,["sample-document",[],[]],[{"t":"Str","c":"sample"},{"t":"Space","c":[]},{"t":"Str","c":"document"}]]},{"t":"Para","c":[{"t":"Str","c":"text"},{"t":"Space","c":[]},{"t":"Str","c":"text"},{"t":"Space","c":[]},{"t":"Str","c":"text"}]},{"t":"Para","c":[{"t":"Link","c":[[{"t":"Str","c":"link"}],["path/to/otherpage",""]]}]}]]
$ pandoc -t native sample.txt
[Header 2 ("sample-document",[],[]) [Str "sample",Space,Str "document"]
,Para [Str "text",Space,Str "text",Space,Str "text"]
,Para [Link [Str "link"]("path/to/otherpage","")]]
Format details can be found in the Text.Pandoc.Definition documentation (http://hackage.haskell.org/package/pandoc-types).
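To get a feel for the JSON layout, the tree can also be walked by hand. The sketch below is a minimal illustration, not the way pandocfilters does it: iter_nodes is a hypothetical helper, SAMPLE is the JSON output shown above with line breaks added, and the link target is assumed to be the last element of a Link node's contents.

```python
import json

# JSON AST of sample.txt as printed by `pandoc -t json`, reflowed for readability.
SAMPLE = json.loads('''
[{"unMeta":{}},
 [{"t":"Header","c":[2,["sample-document",[],[]],
   [{"t":"Str","c":"sample"},{"t":"Space","c":[]},{"t":"Str","c":"document"}]]},
  {"t":"Para","c":[{"t":"Str","c":"text"},{"t":"Space","c":[]},{"t":"Str","c":"text"},
   {"t":"Space","c":[]},{"t":"Str","c":"text"}]},
  {"t":"Para","c":[{"t":"Link","c":[[{"t":"Str","c":"link"}],["path/to/otherpage",""]]}]}]]
''')


def iter_nodes(x):
    """Yield (type, contents) for every {"t": ..., "c": ...} node, depth first."""
    if isinstance(x, list):
        for item in x:
            yield from iter_nodes(item)
    elif isinstance(x, dict):
        if "t" in x:
            yield x["t"], x.get("c")
        for child in x.values():
            yield from iter_nodes(child)


# Collect the URL of every link in the document.
links = [c[-1][0] for t, c in iter_nodes(SAMPLE) if t == "Link"]
print(links)  # prints ['path/to/otherpage']
```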
Specifying the --filter option is equivalent to the following command pipeline, which is handy when debugging.
$ pandoc -t json sample.txt | python ./convertlink.py | pandoc -f json -t markdown
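When a filter runs in this pipeline, anything it prints to standard output ends up in the JSON stream that the second pandoc reads, so debug messages are better sent to standard error. A minimal sketch of such a debugging action follows; log_links is a hypothetical name, and the link target is again assumed to be the last element of the node's contents.

```python
import sys


def log_links(key, value, format_, meta):
    """Report each link target on stderr and leave the document untouched."""
    if key == "Link":
        # stderr keeps the message out of the JSON written to stdout.
        print("Link target:", value[-1][0], file=sys.stderr)
    # No return value: the node is left unchanged.


# Direct call on a hand-written Link node, as a quick check without pandoc.
log_links("Link", [[{"t": "Str", "c": "link"}], ["path/to/otherpage", ""]], "markdown", {})
```

Passed to toJSONFilter in place of myfilter, this would list the links on the terminal while the converted document still flows through the pipe.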