For that reason, I wanted to turn the healthy walking exercises on this page into a document. I happened to be studying **Sphinx** as a trial, so I will use it. It feels like I forced it into the job, so this may have been extremely inefficient, but the result came out relatively well, so I'm happy with it.
A few blank pages got mixed into the middle, but that's okay.
Sphinx
http://sphinx-users.jp/
BeautifulSoup
http://tdoc.info/beautifulsoup/
The standard library seems to have **HTMLParser**, but I didn't know how to use it, and BeautifulSoup was easier for me to understand, so I used that.
pip install BeautifulSoup
I will use it to extract the explanatory images and text from the HTML of the pages in question, and convert them into a document with Sphinx. More on that later.
First of all, the URL of a gymnastics page looks like this:
http://kaigoouen.net/program/gymnastics/gymnastics_{Gymnastics type number}_{page number}.html
So there are two places where numbers go (it's none of my business, but does this site really write separate HTML for each page?).
There are three levels of healthy walking exercises, hop, step, and jump, and the type numbers seem to be **2 for hop, 3 for step, and 4 for jump** (1 seems to be an introduction, so it isn't needed this time).
Hop, step, and jump each span multiple pages, so I also need the **number of pages** for each.
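As a rough sketch of how the two numbers fit into the URL: the type numbers and page counts below are the ones used in hopstepjump.py at the end of this post, while `URL_PATTERN`, `LEVELS`, and `page_urls` are names made up for this illustration. It is written for Python 3, not the Python 2 that the actual script runs on.

```python
BASE_URL = "http://kaigoouen.net"
URL_PATTERN = BASE_URL + "/program/gymnastics/gymnastics_{index}_{page}.html"

# Type number and last page for each level (same values as hopstepjump.py).
LEVELS = {
    "hop": {"index": 2, "last_page": 25},
    "step": {"index": 3, "last_page": 39},
    "jump": {"index": 4, "last_page": 46},
}


def page_urls(level):
    """Return every page URL for one level, from page 1 to the last page."""
    info = LEVELS[level]
    return [
        URL_PATTERN.format(index=info["index"], page=page)
        for page in range(1, info["last_page"] + 1)
    ]


if __name__ == "__main__":
    for url in page_urls("hop")[:3]:
        print(url)
```

Keeping the numbers in one dictionary means each level only differs by its key, which is also how the real script is organized.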
Thinking it through based on these, would the flow be something like this?
I plan to make the documents for hop, step, and jump separately, so this should be roughly fine.
Of the flow considered above, let's think about part 3.
Apparently, the part I want is in the **div elements of class box02**.
For example, hop looks like this.
<div class="box02">
<h2>
<img alt="item name" src="text.gif"></img>
</h2>
<div class="box02_txt_hop">
<div class="hop_box">
<div class="hop_txt">Commentary</div>
<div class="hop_image">
<img alt="Reference photo" src="image.jpg"></img>
</div>
</div>
<div class="hop_box">
<!--Item continuation-->
</div>
</div>
</div>
<div class="box02">
<!--Next item-->
</div>
So what I want from this is the item name (the alt of the img in the h2), the commentary text (hop_txt), and the reference photo (the src of the img in hop_image).
That's all. I use the BeautifulSoup module to narrow these down. It wasn't too difficult because you can **find** or **findAll** by element name or class name. After that, I write a Python script that rewrites the extracted pieces into .rst format for Sphinx. I just did my best while reading the documentation.
For the time being, I used a **heading** for the item name and a **table** for the rest, because I wanted to arrange the commentary and images side by side.
When extracting the commentary and images for each item, my first thought was **to trace down to the relevant part from the top by class name**. Using Firefox's **Inspect Element**, it looked like I should dig down like **box02 > box02_txt_hop > hop_box > hop_image or hop_txt**, so that's what I did. But then the last commentary of each item went missing. Looking closely, for some reason only the last explanation of each item, **both text and image, is not inside a hop_box but sits directly under box02_txt_hop** (reason unknown). So I decided to ignore hop_box altogether. In the first place, BeautifulSoup's find and findAll seem to **search all the elements below the target element for matches**, so there was no need to go down one layer at a time.
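To illustrate why matching by class name at any depth picks up that stray last entry, here is a small sketch. Note the swap: it uses the standard library's `html.parser` and Python 3 instead of BeautifulSoup, so it is not what my script does. `SAMPLE` and `HopCollector` are made up for this example, and it assumes hop_txt and hop_image contain no nested div elements.

```python
from html.parser import HTMLParser

# Made-up sample: the last text/image pair sits outside any hop_box.
SAMPLE = """
<div class="box02_txt_hop">
  <div class="hop_box">
    <div class="hop_txt">first</div>
    <div class="hop_image"><img alt="Reference photo" src="a.jpg"></div>
  </div>
  <div class="hop_txt">last</div>
  <div class="hop_image"><img alt="Reference photo" src="b.jpg"></div>
</div>
"""


class HopCollector(HTMLParser):
    """Collect hop_txt text and hop_image src values at any nesting depth."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.images = []
        self._in_txt = False
        self._in_image = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Only the class of the div matters, not where it is nested.
            self._in_txt = attrs.get("class") == "hop_txt"
            self._in_image = attrs.get("class") == "hop_image"
            if self._in_txt:
                self.texts.append("")
        elif tag == "img" and self._in_image:
            self.images.append(attrs.get("src"))

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_txt = False
            self._in_image = False

    def handle_data(self, data):
        if self._in_txt:
            self.texts[-1] += data


collector = HopCollector()
collector.feed(SAMPLE)
print(list(zip(collector.images, collector.texts)))
```

Because the collector never checks for hop_box, the pair sitting directly under box02_txt_hop is collected just like the others.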
If the cells contained only text, a single table would look nicer, but with images included it was hard to read and I didn't like it. I didn't know how to change the style (for PDF, as opposed to HTML), so I dealt with it by creating a new table for each row.
Something I used frequently when outputting .rst files from the Python script:
type(text)
I steadily fixed things while checking the type this way.
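The reason the type check matters is that byte strings and text get mixed up easily when HTML comes in and .rst goes out. A minimal sketch of normalizing before writing follows, with the caveat that it is written for Python 3, where the pair is bytes/str. In the Python 2 of my script the equivalent pair is str/unicode. `to_text` is a hypothetical helper, not something from the script.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


if __name__ == "__main__":
    raw = "hop".encode("utf-8")
    # Checking the type before and after, in the spirit of type(text).
    print(type(raw).__name__, type(to_text(raw)).__name__)
```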
There seem to be two ways to make the PDF, **rst2pdf** and **latex**. The former seemed easier, so I decided to use **rst2pdf** this time.
And then I got stuck on one thing after another (maybe I shouldn't have installed it with pip).
ImportError: reportlab requires Python 2.7+ or 3.3+; 3.0-3.2 are not supported.
So, following a page I referred to,
pip install -U reportlab==2.5
Then the error changes.
[ERROR] pdfbuilder.py:130 need more than 3 values to unpack
Traceback (most recent call last):
File "/home/vagrant/www/public/hopstepjump/venv/lib/python2.6/site-packages/rst2pdf/pdfbuilder.py", line 78, in write
docname, targetname, title, author = entry[:4]
ValueError: need more than 3 values to unpack
FAILED
build succeeded, 1048 warnings.
In short, the way I had written **conf.py** was wrong. At first, I listed the documents I wanted to convert to PDF in **pdf_documents** like this:
```py
('hop','step','jump'),
```
I thought the documents would be joined together this way, but that was wrong. The correct form is
('docName', u'file name', u'The title that appears on the cover of the PDF', u'Author'),
(in fact, it was written in the manual). This tuple goes inside a list, so if you want to convert multiple documents to PDF, you add one tuple per document.
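For my three documents, the entry would look roughly like the sketch below. The document names hop, step, and jump are the real ones, but the file names, titles, and author string are placeholders, and `check_pdf_documents` is a hypothetical helper I would have liked to have, not part of rst2pdf.

```python
# conf.py (excerpt): one 4-tuple per PDF to build.
# (docname, output file name, title on the cover, author)
pdf_documents = [
    ('hop', u'hop', u'Hop', u'Author name'),
    ('step', u'step', u'Step', u'Author name'),
    ('jump', u'jump', u'Jump', u'Author name'),
]


def check_pdf_documents(entries):
    """Return the entries that do not have the four required fields."""
    return [entry for entry in entries if len(entry) != 4]


if __name__ == "__main__":
    # My first attempt had three fields, which is what triggered the
    # "need more than 3 values to unpack" error.
    print(check_pdf_documents([('hop', 'step', 'jump')]))
    print(check_pdf_documents(pdf_documents))
```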
Anyway, I pulled myself together and ran `make pdf` again.
IOError: decoder jpeg not available identity=[ImageReader@0x3a3ce10 filename='/home/vagrant/www/public/hopstepjump/venv/lib/python2.6/site-packages/rst2pdf/images/image-missing.jpg'] FAILED build succeeded.
Another new problem. Does JPEG not work?
After investigating, it seems that PIL itself is not at fault and that libjpeg is required. http://kwmt27.net/index.php/2013/07/14/python-pil-error-decoder-jpeg-not-available/
So I need to install that and reinstall PIL. Can I get it with yum?
yum search libjpeg
============================= N/S Matched: libjpeg =============================
libjpeg-turbo-devel.i686 : Headers for the libjpeg-turbo library
libjpeg-turbo-devel.x86_64 : Headers for the libjpeg-turbo library
libjpeg-turbo-static.x86_64 : Static version of the libjpeg-turbo library
libjpeg-turbo.x86_64 : A MMX/SSE2 accelerated library for manipulating JPEG
                     : image files
libjpeg-turbo.i686 : A MMX/SSE2 accelerated library for manipulating JPEG image
                   : files
I don't know which one I need, so I tried `yum install libjpeg` (hoping it would pick the best one for me).
Updated:
  libjpeg-turbo.x86_64 0:1.2.1-3.el6_5
Complete!
That seems to have worked, so following the page I referred to:
pip install -I pillow
make pdf
IOError: decoder jpeg not available identity=[ImageReader@0x221e150 filename='/home/vagrant/www/public/hopstepjump/venv/lib/python2.6/site-packages/rst2pdf/images/image-missing.jpg'] FAILED build succeeded.
No progress ...
yum list | grep libjpeg
When I checked this way, it was installed properly. But Pillow and PIL are the same thing, aren't they?
So
pip install pil
But it doesn't change ...
Is this it? http://all-rounder-biz.blogspot.jp/2013/06/macioerror-decoder-jpeg-not-available.html Nope ...
Then this? http://d.hatena.ne.jp/rougeref/20130116 Nope, no good ...
Well, looking at the summary displayed after installing PIL,
*** TKINTER support not available
*** JPEG support not available
--- ZLIB (PNG/ZIP) support available
*** FREETYPE2 support not available
*** LITTLECMS support not available
--------------------------------------------------------------------
is what it says. Sure enough, JPEG is still not available. Error messages don't lie.
Maybe this one? http://dev-pao.blogspot.jp/2010/04/python-imaging-library-piljpeg.html No ...
Next http://d.hatena.ne.jp/rougeref/20130116
Oh, maybe ...
yum install libjpeg-devel
Then I uninstalled PIL and reinstalled it.
--- JPEG support available
**It seems the cause was that the devel package had not been installed.** Anyway, the JPEG problem is solved, so `make pdf`.
[ERROR] image.py:110 Missing image file: /home/vagrant/www/public/hopstepjump/http://kaigoouen.net/img/hop_pic_108.jpg
\u304b\u304b\u3068\u304b\u3089\u3064\u3044\u3066\u30fb\u30fb\u30fb\u3064\u307e\u5148\u3067\u3051\u308a\u51fa\u3059\u3001\u3092\u7e70\u308a\u8fd4\u3057\u306a\u304c\u3089\u3001\u6b69\u304d\u307e\u3057\u3087\u3046\u3002 line done build succeeded.
Okay, it still isn't working, but **the place it is trying to load the image from is obviously wrong**: the URL has been appended to a local directory. I don't know how to fix it, though. Is it not supposed to fetch images from outside?
It was hard to search for this by the error message, so I experimented in various ways and found that **it seemed to work as long as I didn't use replace**. Is it some kind of bug? Anyway, the solution is to not use it.
Next, the characters are garbled.
What is suspicious is
sh: fc-match: command not found
[ERROR] findfonts.py:208 Unknown font: DejaVu Sans Mono-Bold
this part (a dead giveaway). What is **fc-match**? When I googled it, it appears to be a command that comes with a library called **fontconfig**.
Then with yum
yum search fontconfig
=========================== N/S Matched: fontconfig ============================
fontconfig.i686 : Font configuration and customization library
fontconfig.x86_64 : Font configuration and customization library
fontconfig-devel.i686 : Font configuration and customization library
fontconfig-devel.x86_64 : Font configuration and customization library
There are regular and devel packages here too. I won't make the same mistake twice, so I installed devel. When I ran `make pdf` again, the earlier "command not found" was resolved. However, the line after it is still unresolved. In the first place, it says 'DejaVu Sans Mono-Bold', but I don't know any such font. What is it?
http://www.fontsquirrel.com/fonts/dejavu-sans-mono
Ah, it's clearly a Latin font. Does that mean the stylesheet isn't being read? **conf.py** didn't seem to be wrong, so I checked the stylesheet, **ja.json**. I thought I had written it according to the documentation. After agonizing for several hours, it turned out that I had failed to close a quotation mark in just one place. I also learned that single quotes are not allowed and that they have to be double quotes.
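In hindsight, running the stylesheet through a JSON parser would probably have found the problem in seconds. The sketch below assumes the stylesheet is strict JSON, as mine had to be, and is written for Python 3. The `fontsAlias` and `stdFont` keys follow the rst2pdf stylesheet layout as I understand it, while the font name and `check_stylesheet` are placeholders made up for this example.

```python
import json

# Double quotes: valid JSON.
GOOD = '{"fontsAlias": {"stdFont": "SomeJapaneseFont"}}'
# Single quotes: fine as Python, but not valid JSON.
BAD = "{'fontsAlias': {'stdFont': 'SomeJapaneseFont'}}"


def check_stylesheet(text):
    """Return None if text parses as JSON, otherwise the parser's message."""
    try:
        json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError
        return str(exc)
    return None


if __name__ == "__main__":
    for label, text in (("good", GOOD), ("bad", BAD)):
        print(label, check_stylesheet(text))
```

The parser's message includes the line and column of the first problem, which is exactly what I was missing while staring at the file.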
Anyway, it finally started looking for the Japanese font I had configured ... but **it can't find it**. In the first place, the **fc-list command returned nothing**.
When I moved the font file under /usr/share/fonts, that was solved and the garbled characters were fixed. But then I don't understand the point of **pdf_font_path in conf.py**. It is currently empty, yet Japanese displays without any problem, and even if I write a path to some other location, it is not picked up. What if I want to put the fonts somewhere else? (Rewrite the **Fontconfig** settings somewhere?)
hopstepjump.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys,re,urllib2
from BeautifulSoup import BeautifulSoup

#### reference page ####
base_url = 'http://kaigoouen.net'
# type number in the URL and last page number for each level
hopstepjump = {
    'hop':{
        'index':2,
        'last_page':25,
    },
    'step':{
        'index':3,
        'last_page':39,
    },
    'jump':{
        'index':4,
        'last_page':46,
    },
}

#### for make .rst files ####
# The indentation inside the reST strings below is reconstructed:
# the second table cell has to line up under the first one.
br = u'\n'
page_break = u".. raw:: pdf%s   PageBreak" % (br*2)

def soup_to_sphinx(pg):
    # fetch every page of the chosen level and print it as reST
    p = 1
    while p <= pg['last_page']:
        url = base_url + '/program/gymnastics/gymnastics_{index}_{page}.html'.format(index=pg['index'],page=p)
        htmldata = urllib2.urlopen(url)
        soup = BeautifulSoup( unicode(htmldata.read(),'utf-8') )
        for box in soup.findAll('div',{'class':'box02'}):
            lessons = box.find('div',{'class':'box02_txt_%s' % choice})
            if lessons is not None:
                # item name: alt of the img inside the h2
                title = box.contents[1].contents[0]['alt']
                print( sphinx_head(title) )
                # search by class at any depth, ignoring hop_box
                images = lessons.findAll('div',{'class':'%s_image' % choice})
                texts = lessons.findAll('div',{'class':'%s_txt' % choice})
                texts = iter(texts)
                for image in images:
                    src = base_url + image.contents[0]['src']
                    image = sphinx_image(src)
                    text = texts.next().renderContents()
                    text = sphinx_text(text)
                    print( sphinx_listtable(image,text) )
                # one page break per item (nesting level reconstructed)
                print(page_break)
        htmldata.close()
        p += 1

def sphinx_head(txt):
    # heading: title underlined with '='
    return br.join([br,txt,u"="*30+br]).encode('utf-8')

def sphinx_listtable(s_img,s_txt):
    # one list-table per row: image on the left, commentary on the right
    table = u".. list-table::" + br
    image = u"   * - %s" % s_img
    text = u"     - | %s" % s_txt
    return br.join([table,image,text,br]).encode('utf-8')

def sphinx_image(src):
    option = u":width: 150pt"
    return u".. image:: %s" % src + br + u"          %s" % option

def sphinx_text(txt):
    # turn <br /> into reST line-block lines
    text = txt.decode('utf-8').replace(u"<br />",br)
    if br in text:
        texts = text.splitlines()
        text = reduce(lambda x,y: x + br + u"       | " + y,texts)
    return text

if __name__ == '__main__':
    try:
        choice = sys.argv[1]
        page = hopstepjump[choice]
    except IndexError:
        print("[error]: Option required (one of 'hop', 'step', or 'jump').")
        sys.exit("  -Example:'python %s hop'" % sys.argv[0])
    except KeyError:
        print("[error]: Unknown option (must be one of 'hop', 'step', or 'jump').")
        sys.exit("  -Example:'python %s hop'" % sys.argv[0])
    soup_to_sphinx(page)
python hopstepjump.py hop > hop.rst
It feels great to write a file of more than 1000 lines with a single command.
Anyway, once the configuration is done, all that's left is `make html` or `make pdf`.
GitHub
https://github.com/juniskw/hopstepjump/tree/no_replace_listtable
By the way, someone from overseas whom I don't know is listed as the author. Is this some kind of prank?