I decided to try OCR in multiple languages, but I didn't have a dataset I could use freely, so I wrote a script to create my own.
It uses Pillow (PIL), Python's image processing library. http://pillow.readthedocs.org/en/3.0.x/index.html
It generates one image per character.
The core of the generation code is as follows.
```python
from PIL import Image
from PIL import ImageDraw
from PIL import ImageFont


def generate_char_img(char, fontname='Osaka', size=(64, 64)):
    img = Image.new('L', size, 'white')
    draw = ImageDraw.Draw(img)
    fontsize = int(size[0] * 0.8)
    font = ImageFont.truetype(fontname, fontsize)
    # Adjust the character position so it is centered in the image.
    char_displaysize = font.getsize(char)
    offset = tuple((si - sc) // 2 for si, sc in zip(size, char_displaysize))
    assert all(o >= 0 for o in offset)
    # Adjust the offset: half the value gives a better position on the vertical axis.
    draw.text((offset[0], offset[1] // 2), char, font=font, fill='#000')
    return img


def save_img(img, filepath):
    img.save(filepath, 'png')
```
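For example, the two functions can be combined like this (a minimal usage sketch; the font and output file name are just examples):

```python
# Generate a 64x64 grayscale image of the letter 'A' and save it as a PNG.
# 'Osaka' is a macOS font; substitute any font available on your system.
img = generate_char_img('A', fontname='Osaka')
save_img(img, 'A.png')
```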
I put the whole executable code in a gist: https://gist.github.com/lazykyama/dabe526246d60fa937d1 **(Addendum 2015/10/18 23:47: it seems that, whether because of the `Image.save()` specification or the file system, uppercase and lowercase letters in file names are not distinguished, so be careful.)**
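One way around that caveat is to name the output files by Unicode code point instead of by the character itself, so that 'A' and 'a' never map to the same file on a case-insensitive file system. A minimal sketch under that assumption (the output directory is hypothetical):

```python
import os
import string

out_dir = 'dataset/eng'  # hypothetical output directory
os.makedirs(out_dir, exist_ok=True)

for char in string.digits + string.ascii_letters:
    img = generate_char_img(char, fontname='Osaka')
    # Use the code point in the file name (e.g. u0041.png for 'A', u0061.png for 'a'),
    # so uppercase and lowercase letters cannot collide.
    filepath = os.path.join(out_dir, 'u{:04x}.png'.format(ord(char)))
    save_img(img, filepath)
```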
To generate a character list for a given language, you can do something like the following.

```python
import string

eng_char_list = list(string.digits + string.ascii_letters)
```

(Reference for the string module: http://docs.python.jp/3.3/library/string.html)
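For a language like Japanese, a similar list can be built directly from Unicode code-point ranges. A rough sketch covering only hiragana and katakana (which ranges you actually need is up to you):

```python
# Hiragana (U+3041..U+3096) and katakana (U+30A1..U+30FA) as a character list.
jpn_kana_list = ([chr(cp) for cp in range(0x3041, 0x3097)] +
                 [chr(cp) for cp in range(0x30A1, 0x30FB)])
```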
For languages other than English, I'll have to do my best and pull the character sets out of Wikipedia.
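As a rough idea of what that could look like, assuming the article text has already been dumped to a local file (the file name here is hypothetical), the set of characters that actually appear could be collected like this:

```python
# Collect the distinct non-whitespace characters appearing in a text dump.
with open('wikipedia_dump.txt', encoding='utf-8') as f:
    text = f.read()
char_list = sorted({c for c in text if not c.isspace()})
```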
As for which fonts to use for those languages... (゜⊿゜) I have no idea; it seems I need to track down the appropriate `*.ttf` file for each one. I got stuck on this right away because of my own lack of study.