Download images from a text file containing URLs

Introduction

- Essentially, this downloads images one after another, like $ wget -i urls.txt.
- However, with wget, when an image no longer exists, .html and .txt files get downloaded instead.
- This time, we check the Content-Type, check the image data, and convert everything to .jpeg.
- The previous article was Getting Image Links with Google Custom Search Engine.
- The complete source is here.

Library installation

- requests: download images from URLs
- pillow: check and convert image data

$ pip install pillow requests

Configuration file config.py

- As shown below, the URL files are located using CLASSES and LINK_PATH.
- The images are downloaded to DOWNLOAD_PATH.
- For details, see the previous article.

$ cat config.py

import os

CLASSES = [
    'Abe Oto',
    'Satomi Ishihara',
    'Yuno Ohara',
    'Fuka Koshiba',
    'Haruna Kawaguchi',
    'Nana Mori',
    'Minami Hamabe',
    'Kaya Kiyohara',
    'Haruka Fukuhara',
    'Kuroshima Yuina'
]


BASE_PATH = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_PATH = os.path.join(BASE_PATH, 'data')
LINK_PATH = os.path.join(DATA_PATH, 'link')
DOWNLOAD_PATH = os.path.join(DATA_PATH, 'download')
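
As a minimal sketch of how these settings are combined, the helper below derives the URL file and the download directory for one class name. The function name paths_for and the shortened DATA_PATH are hypothetical and exist only for this illustration; the join logic mirrors the download code later in this article.

```python
import os

# Hypothetical stand-ins for the values defined in config.py.
CLASSES = ['Satomi Ishihara', 'Minami Hamabe']
DATA_PATH = os.path.join('project', 'data')
LINK_PATH = os.path.join(DATA_PATH, 'link')
DOWNLOAD_PATH = os.path.join(DATA_PATH, 'download')


def paths_for(query):
    """Return (URL list file, download directory) for one class name."""
    linkfile = os.path.join(LINK_PATH, '{}.txt'.format(query))
    download_dir = os.path.join(DOWNLOAD_PATH, query)
    return linkfile, download_dir


for name in CLASSES:
    print(paths_for(name))
```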

Text file with URLs

- The files look like the following.

$ head 'Yuina Kuroshima.txt'
http://cm-watch.net/wp-content/uploads/2018/03/b22dc3193fd35ebb1bf7aa4e74c8cffb.jpg
https://www.crank-in.net/img/db/1165407_650.jpg
https://media.image.infoseek.co.jp/isnews/photos/hwchannel/hwchannel_20191107_7062003_0-small.jpg
https://i.pinimg.com/originals/3e/3c/61/3e3c61df2f426a8e4623b58d84d94b40.jpg
http://yukutaku.net/blog/wp-content/uploads/wordpress-popular-posts/253-100x100.jpg
http://gratitude8888.biz/wp-content/uploads/2017/03/cb1175590da467bef3600df48eabf770.jpg
https://www.cinemacafe.net/imgs/p/ATDRThl-6oWF9fpps9341csCOg8ODQwLCgkI/416673.jpg
https://s3-ap-northeast-1.amazonaws.com/moviche-uploads/wp-content/uploads/2019/10/IMG_2547.jpg
https://scontent-frx5-1.cdninstagram.com/vp/05d6926fed565f82247879638771ee46/5E259FCC/t51.2885-15/e35/67735702_2288175727962135_1310736136046930744_n.jpg?_nc_ht=scontent-frx5-1.cdninstagram.com&_nc_cat=103&se=7&ig_cache_key=MjEyMzM1MTc4NDkyMzQ4NzgxMg%3D%3D.2
http://moco-garden.com/wp-content/uploads/2016/05/kurosimayuina.jpg

Download, check and save images

Read the text file containing the URLs

- Read the file created in the previous article, which lists one URL per line.

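import io
import os

import requests
from PIL import Image, ImageFile

from config import DOWNLOAD_PATH, LINK_PATH  # assumed: config.py shown above


def download(query):
    """Download data, check data, save images."""

    linkfile = os.path.join(LINK_PATH, '{}.txt'.format(query))
    if not os.path.isfile(linkfile):
        print('no linkfile: {}'.format(linkfile))
        return

    with open(linkfile, 'r') as fin:
        # splitlines() keeps the last URL even without a trailing newline.
        link_list = fin.read().splitlines()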

Download image and check Content-Type

- Download the URLs one after another, based on the list read above.
- Make sure that the Content-Type starts with image/.
- What follows image/ may be jpeg, png, gif, or bmp.

    for num, link in enumerate(link_list, start=1):

        try:
            result = requests.get(link)
            content = result.content
            content_type = result.headers['Content-Type']
        except Exception as err:
            print('err: {}, link: {}'.format(err, link))
            continue

        if not content_type.startswith('image/'):
            print('err: {}, link: {}'.format(content_type, link))
            continue
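
The Content-Type check can also be factored into a small helper that tolerates a missing header and ignores parameters such as a charset. This is only a sketch: the name is_image_response is hypothetical and not part of the original source. In the code above, a response without the header raises a KeyError that is caught by the except clause, which appears to be where the 'content-type' entries in the error tally later in this article come from.

```python
def is_image_response(headers):
    """Return True when the Content-Type header names an image type.

    `headers` is any mapping of header names to values. A plain dict is
    case-sensitive, so both common spellings are tried here; the headers
    object returned by requests is already case-insensitive.
    """
    content_type = (headers.get('Content-Type')
                    or headers.get('content-type')
                    or '')
    # Drop parameters such as "; charset=UTF-8" and normalize case.
    media_type = content_type.split(';')[0].strip().lower()
    return media_type.startswith('image/')


print(is_image_response({'Content-Type': 'image/jpeg'}))
print(is_image_response({'Content-Type': 'text/html; charset=UTF-8'}))
```

Inside the loop, it would be called as is_image_response(result.headers) in place of the startswith test.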

Image loading settings with pillow

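- With the following setting, Pillow also loads images whose data is truncated (for example, large files that were cut off partway) instead of raising an error.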

ImageFile.LOAD_TRUNCATED_IMAGES = True

Check image data

- Read the image data with pillow.
- If it cannot be read, the image data is most likely corrupted.

        try:
            image = Image.open(io.BytesIO(content))
        except Exception as err:
            print('err: {}, link: {}'.format(err, link))
            continue
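
Note that Image.open is lazy: it identifies the file from its header and defers decoding the pixel data. If a stricter check is wanted, a sketch like the following could be used. The helper name open_verified is hypothetical; verify() and load() are existing Pillow methods.

```python
import io

from PIL import Image


def open_verified(content):
    """Return a Pillow image for `content`, or None if it looks broken."""
    try:
        # verify() checks the file for corruption without decoding pixels.
        # The image object cannot be used afterwards, so reopen the data.
        Image.open(io.BytesIO(content)).verify()
        image = Image.open(io.BytesIO(content))
        image.load()  # force full decoding of the pixel data
    except Exception:
        return None
    return image


print(open_verified(b'<html>not an image</html>'))
```

This trades a second parse of the data for an earlier failure, so that broken files are skipped before the conversion step.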

Convert image data to .jpeg

- For later processing, handling .png, .bmp, and other formats case by case would be a nuisance.
- Therefore, everything is converted to .jpeg.
- Because the mode may be RGBA and so on, convert it to RGB before saving as .jpeg.

        if image.mode != 'RGB':
            image = image.convert('RGB')
        data = io.BytesIO()
        image.save(data, 'jpeg', optimize=True, quality=95)
        content = data.getvalue()
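
One thing to be aware of: convert('RGB') simply drops the alpha channel, so transparent areas take whatever color is stored underneath, which is often black. If that matters for later processing, a sketch such as the following flattens transparency onto a white background first. The helper name to_jpeg_bytes and the white default are assumptions for illustration, not part of the original source.

```python
import io

from PIL import Image


def to_jpeg_bytes(image, background=(255, 255, 255)):
    """Encode a Pillow image as JPEG, flattening transparency onto a color."""
    if image.mode in ('RGBA', 'LA') or 'transparency' in image.info:
        rgba = image.convert('RGBA')
        # Opaque canvas of the background color, same size as the image.
        canvas = Image.new('RGBA', rgba.size, background + (255,))
        image = Image.alpha_composite(canvas, rgba)
    if image.mode != 'RGB':
        image = image.convert('RGB')
    data = io.BytesIO()
    image.save(data, 'jpeg', optimize=True, quality=95)
    return data.getvalue()


sample = Image.new('RGBA', (8, 8), (0, 0, 0, 0))  # fully transparent image
print(len(to_jpeg_bytes(sample)))
```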

Save image

- Save the images under the DOWNLOAD_PATH defined in the configuration file, with file names such as 0001.jpeg, 0002.jpeg.
- It is unlikely that you would want to build the file name from the end of the URL.
- Also, because the line number in the URL text file matches the number in the file name, the two are easy to cross-reference.

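        filename = os.path.join(DOWNLOAD_PATH, query, '{:04d}.jpeg'.format(num))
        # Create the per-query directory if it does not exist yet.
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, 'wb') as fout:
            fout.write(content)
        print('query: {}, filename: {}, link: {}'.format(query, os.path.basename(filename), link))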
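
Because the file number equals the line number in the URL file, looking up the source URL of a saved image is a simple index operation. The helper below is a hypothetical sketch (link_for is not in the original source), assuming link_list holds the lines of the URL file in order.

```python
import os


def link_for(filename, link_list):
    """Return the URL that produced a numbered file such as '0003.jpeg'."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    num = int(stem)  # files are numbered from 1, in URL-file order
    if not 1 <= num <= len(link_list):
        raise ValueError('no URL for file number {}'.format(num))
    return link_list[num - 1]


links = ['http://example.com/a.jpg', 'http://example.com/b.png']
print(link_for('0002.jpeg', links))
```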

Examples of errors during download and file processing

- There were about 6,000 URLs, of which about 180 resulted in errors.
- The errors look like the following.
- In some cases the response is HTML instead of image data.
- Responses with Content-Type application/octet-stream or binary/octet-stream could probably be saved as image data, but they are skipped this time because there are so few of them.

$ awk '{print $2}' err.txt | sort | uniq -c | sort -nr
  47 text/html;
  31 text/plain,
  30 ('Connection
  27 text/html,
  18 'content-type',
  10 cannot
   5 application/octet-stream,
   2 application/xml,
   1 images
   1 binary/octet-stream,
   1 UserWarning:
   1 HTTPSConnectionPool(host='jpnews24h.com',
   1 HTTPSConnectionPool(host='host-your-site.net',
   1 HTTPSConnectionPool(host='gamers.co.jp',
   1 HTTPConnectionPool(host='youtube.dojin.com',
   1 HTTPConnectionPool(host='nosh.media',
   1 HTTPConnectionPool(host='arukunews.jp',
   1 Exceeded
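
The same tally can be sketched in Python with collections.Counter. This assumes err.txt holds the err: lines printed by the download function, and the name tally_errors is hypothetical. Unlike the awk pipeline, this sketch skips lines that have fewer than two fields.

```python
from collections import Counter


def tally_errors(lines):
    """Count the second whitespace-separated field of each log line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    # Most frequent first, like `sort | uniq -c | sort -nr`.
    return counts.most_common()


sample = [
    'err: text/html; charset=UTF-8, link: http://example.com/a',
    'err: text/html; charset=UTF-8, link: http://example.com/b',
    'err: cannot identify image file, link: http://example.com/c',
]
for key, count in tally_errors(sample):
    print('{:4d} {}'.format(count, key))
```

To run it on the real log, pass an open file object instead of the sample list, for example tally_errors(open('err.txt')).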

In conclusion

- This covers the cases that $ wget -i urls.txt does not quite reach.
- Next time, we plan to perform face recognition on the images.
