This is my first post, and I've only just started learning programming, so both the writing and the code are rough, but I hope you'll read on.
When browsing images on Twitter, I kept getting annoyed by text-only tweets and by images outside the genre I was after, so I thought it would be nice to extract only the ones I actually want. (In short: I want erotic images.)
Get an API key for the Cloud Vision API. This article was helpful.
For the Twitter API, apply for access and obtain an API key and tokens. It takes a bit of time and effort because you have to describe your intended use in English. This article was helpful.
The following three third-party libraries are used: schedule, tweepy, and requests. All of them can be installed with pip (pip install schedule tweepy requests).
main.py
import base64
import json
import os
import pickle
import time
import schedule
import tweepy
import requests
Import the libraries.
main.py
API_KEY = 'Twitter API key'
API_SECRET_KEY = 'Twitter API secret key'
ACCESS_TOKEN = 'Twitter Access token'
ACCESS_TOKEN_SECRET = 'Twitter Access token secret'
CVA_API_KEY = "Cloud Vision API key"
Store each of the keys you obtained here.
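If you would rather not hard-code the credentials in the script, they could also be read from environment variables instead. This is only a sketch, and the variable names are just examples:
# Sketch: read the keys from environment variables instead of embedding them in the source
API_KEY = os.environ["TWITTER_API_KEY"]
API_SECRET_KEY = os.environ["TWITTER_API_SECRET_KEY"]
ACCESS_TOKEN = os.environ["TWITTER_ACCESS_TOKEN"]
ACCESS_TOKEN_SECRET = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]
CVA_API_KEY = os.environ["CVA_API_KEY"]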
First, get the timeline (TL) that the tweets will come from. This time I use list_timeline because I want to pull tweets from the accounts added to a list, but narrowing it down to a specific account with user_timeline or similar also works; a small sketch of that alternative follows a bit further down.
main.py
auth = tweepy.OAuthHandler(API_KEY, API_SECRET_KEY)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Get tweets from the timeline
def main():
    # Load the previously fetched timeline (empty on the very first run)
    try:
        with open('before_tl.pickle', 'rb') as f:
            before_tl = pickle.load(f)
    except FileNotFoundError:
        before_tl = []
    tl = api.list_timeline(owner_screen_name="List administrator's Twitter ID", slug="The name of the list you want to get")
    with open('before_tl.pickle', 'wb') as f:
        pickle.dump(tl, f)
    before_ids = {t.id for t in before_tl}  # collate by tweet ID
    for tweet in reversed(tl):  # reversed so tweets are processed in chronological order
        if tweet.id not in before_ids:  # only process tweets that were not in the previous TL
            media_getter(tweet)
The reason for saving the TL with pickle is to avoid hitting the pay-as-you-go GCP API more than necessary: each fetched tweet is checked against the previous TL, and processing runs only for tweets that are new.
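If you only want tweets from one specific account instead of a list, a user_timeline-based variant of the fetch might look like this (just a sketch; the screen name and count are placeholders):
# Sketch: pull the latest tweets of a single account instead of a list
tl = api.user_timeline(screen_name="target_account", count=20)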
main.py
# Get the user's screen name (ID) from the tweet
def username_getter(tweet):
    # For retweets, return the original tweeter's screen name
    if hasattr(tweet, 'retweeted_status'):
        return tweet.retweeted_status.user.screen_name
    return tweet.user.screen_name

# Get the list of image URLs from the tweet
def media_getter(tweet):
    try:
        medialist = [d.get('media_url') for d in tweet.extended_entities["media"]]
        name = username_getter(tweet)
        for media in medialist:
            img_save(media, name)
    except AttributeError:  # tweets without media have no extended_entities
        print('Text Only')
The user's screen name is used as part of the file name when the image is saved.
This completes the process of getting the image URL from Twitter.
From here, you can save the image and pass it to Cloud Vision for analysis.
main.py
# Save the image from the URL and change the save destination according to the judgment
def img_save(media, name):
    url_path = media.split("/")[-1]  # file name portion of the URL
    file_name = "adult/" + name + url_path
    response = requests.get(media)
    image = response.content
    with open(file_name, "wb") as f:
        f.write(image)
    identify = img_sort(file_name)
    if identify == "adult":
        print('---saved image---')
    else:
        os.remove(file_name)  # delete anything not judged adult
# Return a judgment according to the SafeSearch result
def img_sort(img_path):
    res_json = img_judge(img_path)
    judgement = res_json['responses'][0]['safeSearchAnnotation']['adult']
    if judgement == "POSSIBLE":
        print(judgement)
        return "possible"
    elif judgement == "LIKELY" or judgement == "VERY_LIKELY":
        print(judgement)
        return "adult"
    else:
        print(judgement)
        return None  # everything below POSSIBLE is treated as non-adult
# Send the image to the Cloud Vision API and receive the result
def img_judge(image_path):
    api_url = 'https://vision.googleapis.com/v1/images:annotate?key={}'.format(CVA_API_KEY)
    with open(image_path, "rb") as img:
        image_content = base64.b64encode(img.read())
    req_body = json.dumps({
        'requests': [{
            'image': {
                'content': image_content.decode('utf-8')
            },
            'features': [{
                'type': 'SAFE_SEARCH_DETECTION'
            }]
        }]
    })
    res = requests.post(api_url, data=req_body)
    return res.json()
The save path is built by splitting the media URL on "/", taking the last element (the file name), and prepending the screen name and the destination directory.
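To make that concrete, with a hypothetical media URL and screen name the variables end up like this:
media = "https://pbs.twimg.com/media/EXAMPLE.jpg"  # hypothetical media URL
name = "someuser"                                  # hypothetical screen name
url_path = media.split("/")[-1]                    # -> "EXAMPLE.jpg"
file_name = "adult/" + name + url_path             # -> "adult/someuserEXAMPLE.jpg"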
The saved image is passed to the API, and processing branches based on the returned result. See here for the values that can be returned: https://cloud.google.com/vision/docs/reference/rpc/google.cloud.vision.v1?hl=ja#google.cloud.vision.v1.SafeSearchAnnotation
The original design is to keep images judged LIKELY (highly likely) or above and delete everything else, but this time I changed the save destination according to the judgment so I could check how accurate Cloud Vision actually is.
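For reference, the parsed response that img_sort indexes into looks roughly like this (the likelihood values shown here are only an illustration):
res_json = {
    'responses': [{
        'safeSearchAnnotation': {
            'adult': 'LIKELY',  # the field this script branches on
            'spoof': 'UNLIKELY',
            'medical': 'UNLIKELY',
            'violence': 'POSSIBLE',
            'racy': 'VERY_LIKELY'
        }
    }]
}
# Possible values: UNKNOWN, VERY_UNLIKELY, UNLIKELY, POSSIBLE, LIKELY, VERY_LIKELY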
main.py
import shutil  # add this alongside the other imports at the top

    # These branches replace the else at the end of img_save
    elif identify == "possible":
        new_file_name = "possible/" + name + url_path
        shutil.move(file_name, new_file_name)
        print('---saved possible image---')
    else:
        new_file_name = "other/" + name + url_path
        shutil.move(file_name, new_file_name)
        print('---saved other image---')
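Note that the destination folders have to exist before anything can be saved there; a one-time setup along these lines (a sketch, assuming the script runs from the project root) avoids a FileNotFoundError on the first write or move:
# Sketch: create the destination folders once if they do not already exist
for folder in ("adult", "possible", "other"):
    os.makedirs(folder, exist_ok=True)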
Now let's run it. Using schedule, the processing is executed every 8 seconds.
main.py
if __name__ == "__main__":
    schedule.every(8).seconds.do(main)
    while True:
        schedule.run_pending()
        time.sleep(1)
(The result images are blurred here because they belong to third parties.)
I was able to extract and save only the erotic images without a problem. Watching the images pile up one after another is quite a sight. Comparing against the ones judged POSSIBLE confirmed that the accuracy is quite high: anything that is plainly explicit gets judged LIKELY or higher.
This time I used SAFE_SEARCH_DETECTION (a feature that determines whether an image contains explicit content), but the Cloud Vision API has many other features. Used well, they can be applied to all kinds of image collection and classification.
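For example, a label-detection variant of img_judge could look like this. This is only a sketch: the function name is made up, and everything except the feature type and the result key reuses the request format shown above:
# Sketch: ask Cloud Vision for content labels instead of a SafeSearch judgment
def img_label(image_path):
    api_url = 'https://vision.googleapis.com/v1/images:annotate?key={}'.format(CVA_API_KEY)
    with open(image_path, "rb") as img:
        image_content = base64.b64encode(img.read())
    req_body = json.dumps({
        'requests': [{
            'image': {'content': image_content.decode('utf-8')},
            'features': [{'type': 'LABEL_DETECTION', 'maxResults': 5}]
        }]
    })
    res = requests.post(api_url, data=req_body)
    # Labels come back under 'labelAnnotations' instead of 'safeSearchAnnotation'
    return res.json()['responses'][0].get('labelAnnotations', [])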
Try Google Cloud Vision API TEXT_DETECTION in Python
I tried using Google Cloud Vision API
How to use Tweepy ~ Part 1 ~ [Getting Tweets]