This post is based on @tyokuyoku's "I made a script that goes back through a specific user's tweets on Twitter and saves all the posted images at once". That script isn't split into functions and its nesting runs quite deep, so I thought it was a good candidate for refactoring. I originally posted the refactored result as a comment on that article, but I have since reviewed it further, so I will explain it here. I arrived at this refactoring through trial and error myself, so if you have other good ideas, I would appreciate a comment.
First, turn the deepest part of the nesting into a function. What does that innermost code do? It saves each piece of image data it finds to a file. Since this is a crawling process, let's name the function crawl.
def crawl():
    for image, filename in images():
        path = os.path.join(SAVE_DIRECTORY, filename)
        with open(path, 'wb') as localfile:
            localfile.write(image)
There is no images function yet, but if there is a function that yields image data and file names one after another, saving them to files is all that's left to do. Now let's create that images function. First, implement only the part that retrieves the image data.
def images():
    for url in image_url():
        image = urllib.urlopen(url)
        yield image.read()
        image.close()
There is no image_url function yet, but if there is a function that enumerates the URLs of the image data, this is just a matter of reading the image data at each URL and yielding it. If it used return, one result would come back and the processing would end; by making it a generator function with yield, results can be handed back one after another.
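As a toy illustration of that difference (not part of the script itself):

def squares(numbers):
    for n in numbers:
        yield n * n  # hand back one result, then resume the loop here

for sq in squares([1, 2, 3]):
    print sq  # prints 1, 4, 9, one value per iteration

Now let's add the file name to what images reports. It deviates a little from the single-responsibility principle of SOLID, but ...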
def images():
    for urls, timestamp in image_urls():
        date_time = timestamp.strftime("%Y%m%d_%H%M%S")
        for index, url in enumerate(urls):
            _, ext = os.path.splitext(url)
            filename = '%s_%s_%s%s' % (USER_NAME, date_time, index, ext)
            image = urllib.urlopen(url)
            yield image.read(), filename
            image.close()
The file name is built from the tweet time, but the image data itself does not include that time, so the image_urls function is made to report it. Since several images can be attached to a single tweet, one tweet time may correspond to several images. The tweet time is passed along as a standard Python datetime and converted to year-month-day_hour-minute-second form for the file name. The original program fixed the extension to ".jpg", but there are ".png" images and others as well, so the extension contained in the URL is extracted and used instead.
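For instance, with a made-up tweet time and image URL (hypothetical values, purely to show the resulting name):

import os.path
import dateutil.parser

timestamp = dateutil.parser.parse("Wed Oct 10 20:19:24 +0000 2018")
date_time = timestamp.strftime("%Y%m%d_%H%M%S")  # '20181010_201924'
_, ext = os.path.splitext("https://pbs.twimg.com/media/example.png")
print '%s_%s_%s%s' % ('someuser', date_time, 0, ext)
# someuser_20181010_201924_0.png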
Next is the image_urls function.
def image_urls():
    for media, timestamp in medias():
        urls = [content["media_url_https"]
                for content in media
                if content.get("type") == "photo"]
        yield urls, timestamp
A tweet carries media information such as images alongside its short text, so if there is a medias function that enumerates that media information, image_urls only has to pick out the URLs of the entries whose type is "photo" and yield them. Since the media information does not include the tweet time, the medias function is asked to report the tweet time as well.
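For reference, the part of a tweet that the script relies on looks roughly like this. It is an abbreviated, hypothetical sketch, not a complete Twitter API response:

tweet = {
    "created_at": "Wed Oct 10 20:19:24 +0000 2018",
    "extended_entities": {
        "media": [
            {"type": "photo", "media_url_https": "https://pbs.twimg.com/media/....jpg"},
            {"type": "video", "media_url_https": "https://pbs.twimg.com/media/....jpg"},
        ],
    },
}

Only the "photo" entry's URL survives the filter in image_urls. Next is the medias function.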
def medias():
    for tweet in tweets():
        created_at = tweet["created_at"]
        timestamp = dateutil.parser.parse(created_at)
        extended_entities = tweet.get("extended_entities", {})
        media = extended_entities.get("media", ())
        yield media, timestamp
If there is a tweets function that enumerates tweets in order, medias can pull out the tweet time and the media information and yield them. A tweet may carry no extended_entities or no media information at all, but the empty defaults passed to .get mean an empty tuple is yielded instead, so callers simply see that there are no images.
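A quick check (a toy example, separate from the script) with a tweet that has no media:

tweet = {"created_at": "Wed Oct 10 20:19:24 +0000 2018"}  # no media attached
media = tweet.get("extended_entities", {}).get("media", ())
print media  # () -- a for loop over this simply does nothing

Finally, the tweets function.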
USER_NAME = 'Username you want to get'
NUM_OF_TWEET_AT_ONCE = 200  # max: 200
NUM_OF_TIMES_TO_CRAWL = 16  # max: 3200 / NUM_OF_TWEET_AT_ONCE
SAVE_DIRECTORY = os.path.join('.', 'images')

def tweets():
    import twitkey
    twitter = OAuth1Session(twitkey.CONSUMER_KEY,
                            twitkey.CONSUMER_SECRET,
                            twitkey.ACCESS_TOKEN,
                            twitkey.ACCESS_TOKEN_SECRET)
    url = ("https://api.twitter.com/1.1/statuses/user_timeline.json"
           "?screen_name=%s&include_rts=false" % USER_NAME)
    params = {"count": NUM_OF_TWEET_AT_ONCE}
    for i in range(NUM_OF_TIMES_TO_CRAWL):
        req = twitter.get(url, params=params)
        if req.status_code != requests.codes.ok:
            return
        timeline = json.loads(req.text)
        for tweet in timeline:
            yield tweet
            # max_id is inclusive, so step just below the last id
            # to avoid fetching the same tweet again on the next page
            params["max_id"] = tweet["id"] - 1
Getting tweets requires the Twitter API, and to access it you must register with Twitter in advance and obtain a key and token. Write that information as variables in a file called twitkey.py, which keeps the program source separate from the confidential information. The original script held these in a dictionary, but plain variable assignments work just as well.
twitkey.py
#coding: UTF-8
CONSUMER_KEY = ""
CONSUMER_SECRET = ""
ACCESS_TOKEN = ""
ACCESS_TOKEN_SECRET = ""
In the end, the nesting that grew deeper with each for statement was made shallow by splitting the loops into generator functions built on yield. Separating the functions also means each function name doubles as a comment describing that step, which I think makes the processing easier to understand.
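To make that concrete, a toy contrast (nothing to do with this script): nested loops pile up indentation, while chained generators keep each stage flat:

def nested(rows):
    for row in rows:
        for cell in row:
            for char in cell:
                print char  # three levels deep and climbing

def cells(rows):
    for row in rows:
        for cell in row:
            yield cell  # hide one level of looping behind a name

def chained(rows):
    for cell in cells(rows):
        for char in cell:
            print char  # same output, shallower nesting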
Finally, here is the whole script, including the main processing, the imports, print statements showing progress, and creation of the save directory.
#!/usr/bin/env python2
# -*- coding:utf-8 -*-
import sys
import os.path
import dateutil.parser
import urllib
import requests
import json
from requests_oauthlib import OAuth1Session

USER_NAME = 'Username you want to get'
NUM_OF_TWEET_AT_ONCE = 200  # max: 200
NUM_OF_TIMES_TO_CRAWL = 16  # max: 3200 / NUM_OF_TWEET_AT_ONCE
SAVE_DIRECTORY = os.path.join('.', 'images')

def tweets():
    import twitkey
    twitter = OAuth1Session(twitkey.CONSUMER_KEY,
                            twitkey.CONSUMER_SECRET,
                            twitkey.ACCESS_TOKEN,
                            twitkey.ACCESS_TOKEN_SECRET)
    url = ("https://api.twitter.com/1.1/statuses/user_timeline.json"
           "?screen_name=%s&include_rts=false" % USER_NAME)
    params = {"count": NUM_OF_TWEET_AT_ONCE}
    for i in range(NUM_OF_TIMES_TO_CRAWL):
        req = twitter.get(url, params=params)
        if req.status_code != requests.codes.ok:
            print "ERROR:", req.status_code
            return
        timeline = json.loads(req.text)
        for tweet in timeline:
            print "TWEET:", tweet["text"]
            yield tweet
            # max_id is inclusive, so step just below the last id
            params["max_id"] = tweet["id"] - 1

def medias():
    for tweet in tweets():
        created_at = tweet["created_at"]
        timestamp = dateutil.parser.parse(created_at)
        extended_entities = tweet.get("extended_entities", {})
        media = extended_entities.get("media", ())
        print "CREATE:", created_at
        yield media, timestamp

def image_urls():
    for media, timestamp in medias():
        urls = [content["media_url_https"]
                for content in media
                if content.get("type") == "photo"]
        print "IMAGE:", len(urls)
        yield urls, timestamp

def images():
    for urls, timestamp in image_urls():
        date_time = timestamp.strftime("%Y%m%d_%H%M%S")
        for index, url in enumerate(urls):
            _, ext = os.path.splitext(url)
            filename = '%s_%s_%s%s' % (USER_NAME, date_time, index, ext)
            image = urllib.urlopen(url)
            print "URL:", url
            yield image.read(), filename
            image.close()

def crawl():
    if not os.path.isdir(SAVE_DIRECTORY):
        os.makedirs(SAVE_DIRECTORY)  # make sure the save directory exists
    for image, filename in images():
        path = os.path.join(SAVE_DIRECTORY, filename)
        print "SAVE:", path
        with open(path, 'wb') as localfile:
            localfile.write(image)

if __name__ == '__main__':
    if len(sys.argv) > 1:
        USER_NAME = sys.argv[1]
    crawl()
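If you save the script as, say, get_images.py (any file name will do), running python2 get_images.py some_screen_name downloads that user's images into ./images. Run without an argument, it falls back to the USER_NAME constant at the top.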