Get Twitter bookmarks on CentOS using Selenium

Twitter doesn't provide an API for bookmarks, so I used Selenium to collect all of my bookmarks.

Environment


CentOS Linux release 7.7.1908
Python 3.6.8

Preparation

Install what you need

google-chrome

Install it by referring to this article.

ChromeDriver

Be careful about which version you install. If it does not match the installed version of Chrome, it will not work properly. Check the ChromeDriver site and pip install with the version pinned.

Example


# google-chrome --version
Google Chrome 78.0.3904.108

# pip install chromedriver-binary==78.0.3904.105
# pip show chromedriver-binary
Name: chromedriver-binary
Version: 78.0.3904.105.0

# chromedriver-path
/usr/lib/python3.6/site-packages/chromedriver_binary (Needed later)
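
The version comparison above can also be scripted. The following is a minimal sketch, assuming that agreement on the first three components (major.minor.build) is what matters, as in the example output above. version_prefix and same_build are hypothetical helper names, not part of chromedriver-binary.

```python
import re


def version_prefix(text, parts=3):
    """Extract the first dotted version number in text and return its leading parts."""
    match = re.search(r'\d+(?:\.\d+)+', text)
    if match is None:
        raise ValueError('no version number found in: ' + repr(text))
    return tuple(int(p) for p in match.group(0).split('.')[:parts])


def same_build(chrome_version, driver_version):
    """Return True when both versions share major.minor.build (assumed rule)."""
    return version_prefix(chrome_version) == version_prefix(driver_version)


# Values taken from the example output above
print(same_build('Google Chrome 78.0.3904.108', '78.0.3904.105.0'))
```

If this prints False, pinning a different chromedriver-binary version is probably the first thing to try.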

Selenium

# pip install selenium

Operation check

I've added a lot of options, but --headless and --no-sandbox may be enough. In my environment I got an exception without --headless. executable_path is the directory printed by chromedriver-path above with /chromedriver appended. The script saves a screenshot so that you can confirm it worked.

test.py


import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-infobars')
options.add_argument('--disable-extensions')
options.add_argument('--disable-gpu')
options.add_argument('--headless')

driver = webdriver.Chrome(options=options, executable_path='/usr/lib/python3.6/site-packages/chromedriver_binary/chromedriver')

driver.get('https://www.google.com/')
time.sleep(3.0)
driver.save_screenshot('screenshot.png')

driver.close()
driver.quit()
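
Since executable_path has to point at the chromedriver executable itself and not at the directory printed by chromedriver-path, the path can be assembled and checked before starting the browser. This is a minimal sketch, assuming the executable is named chromedriver inside that directory, as in test.py. chromedriver_executable is a hypothetical helper name.

```python
import os


def chromedriver_executable(package_dir):
    """Join the directory reported by chromedriver-path with the executable name."""
    return os.path.join(package_dir, 'chromedriver')


# Directory printed by chromedriver-path in the example above
path = chromedriver_executable('/usr/lib/python3.6/site-packages/chromedriver_binary')
print(path)
# False here means the path is wrong or ChromeDriver is not installed there
print(os.path.isfile(path))
```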

Bookmark acquisition

Login process

options.add_argument('--user-data-dir=' + os.path.abspath('profile'))  # requires "import os"

If you point Chrome at a profile directory with the option above, cookies are saved there, so you do not have to log in every time the program runs. With that option set, run the Twitter login from this article just once. Even when the login seems to have succeeded, it may have stopped at the page that asks you to confirm your email address, so I recommend working in interactive mode and checking the screenshot and the current URL.
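
The URL check could be automated along these lines. This is a minimal sketch, assuming that a session which is not fully logged in ends up on some page other than https://twitter.com/i/bookmarks after requesting it. reached_bookmarks is a hypothetical helper, and the second sample URL is only an illustration of "some other page".

```python
from urllib.parse import urlparse


def reached_bookmarks(current_url):
    """Return True if the browser ended up on the bookmarks page."""
    parsed = urlparse(current_url)
    return parsed.netloc == 'twitter.com' and parsed.path.rstrip('/') == '/i/bookmarks'


# In the real program the argument would be driver.current_url
print(reached_bookmarks('https://twitter.com/i/bookmarks'))
print(reached_bookmarks('https://twitter.com/login'))  # illustrative non-bookmarks URL
```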

Acquisition process

In the Twitter timeline and bookmarks, tweet elements are dynamically added and removed as you scroll. The following program obtains the URLs of all tweets by repeating these steps: get the URLs of the loaded tweets → scroll so that the bottom tweet is at the top of the page → wait for the page to load more tweets.

def get_list():
    driver.get('https://twitter.com/i/bookmarks')
    time.sleep(10.0)

    status_urls = []
    container_xpath = '//*[@id="react-root"]/div/div/div/main/div/div/div/div[1]/div/div[2]/section/div/div/div'
    container = driver.find_element_by_xpath(container_xpath)  # vertical container element that holds multiple tweets
    end_count = 0
    while True:
        divs = container.find_elements_by_xpath('./div')
        for div in divs:
            if len(div.find_elements_by_tag_name('img')) == 0:
                end_count += 1
                break
            status_url = div.find_element_by_xpath('./div/article/div/div[2]/div[2]/div[1]/div[1]/a').get_attribute('href')
            status_urls.append(status_url)
        if end_count > 8 or not divs:  # also stop when no tweet elements were found
            break
        driver.execute_script('arguments[0].scrollIntoView();', divs[-1])  # bring the bottom tweet to the top of the page
        print(len(status_urls))
        time.sleep(15.0)

    return list(set(status_urls))  # the same tweet is collected on several passes, so deduplicate via a set

Once you scroll back to the oldest bookmark, the bottom element no longer contains a tweet, so div.find_elements_by_tag_name('img') tells you whether you have reached the end. I don't care how long it takes as long as I get every bookmark, so the code is deliberately redundant: it sleeps for a long time and requires the end condition to be seen several times.
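
The collect, scroll, wait loop can be separated from Selenium to make the stopping rule easier to reason about. The sketch below is not the code above: it swaps the img check for a different stopping rule (stop after several consecutive rounds that yield no new URL) and keeps first-seen order instead of going through a set. collect_all, read_visible, scroll_down, and FakePage are hypothetical names. In the real program the two callbacks would wrap the find_elements and scrollIntoView calls, while here a list stands in for the page.

```python
def collect_all(read_visible, scroll_down, max_idle_rounds=3):
    """Collect unique items until several consecutive rounds add nothing new."""
    seen = {}  # dict keys keep insertion order (guaranteed from Python 3.7)
    idle_rounds = 0
    while idle_rounds < max_idle_rounds:
        before = len(seen)
        for url in read_visible():
            seen.setdefault(url, None)
        # Reset the counter whenever a round found something new
        idle_rounds = idle_rounds + 1 if len(seen) == before else 0
        scroll_down()
    return list(seen)


class FakePage:
    """Stand-in for the browser: shows a sliding window of 3 items."""

    def __init__(self, urls):
        self.urls = urls
        self.top = 0

    def read_visible(self):
        return self.urls[self.top:self.top + 3]

    def scroll_down(self):
        # Move the bottom visible item to the top, clamped to the last item
        self.top = min(self.top + 2, max(len(self.urls) - 1, 0))


page = FakePage(['u1', 'u2', 'u3', 'u4', 'u5'])
print(collect_all(page.read_visible, page.scroll_down))
```

Counting idle rounds plays the same role as end_count in the original: a single round without progress is not trusted, because the page may simply be slow to load.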

Summary

- If you don't check the version when installing ChromeDriver, you will get stuck.
- Once a page is loaded with Selenium, you can enter values and click just as you would in a browser, which is very convenient.
- Note that the DOM structure of the HTML may change, after which you may no longer be able to access the elements.
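
On the last point, one way to depend less on the exact DOM layout might be to read every link inside a tweet element and keep only those that look like tweet URLs. This is a minimal sketch, assuming tweet URLs have the form https://twitter.com/<user>/status/<id>. is_status_url is a hypothetical helper and the sample hrefs are made up for illustration.

```python
import re

# Assumed shape of a tweet URL: https://twitter.com/<user>/status/<numeric id>
STATUS_URL = re.compile(r'^https://twitter\.com/[^/]+/status/\d+$')


def is_status_url(href):
    """Return True if href looks like a link to a single tweet."""
    return href is not None and STATUS_URL.match(href) is not None


hrefs = [
    'https://twitter.com/example_user',
    'https://twitter.com/example_user/status/1234567890',
    'https://twitter.com/example_user/status/1234567890/photo/1',
    None,  # get_attribute('href') returns None when the attribute is missing
]
print([h for h in hrefs if is_status_url(h)])
```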

If you find something wrong, please comment.

The site that I used as a reference

Until running Selenium + Python on CentOS7 - Qiita
If you want to keep your site logged in the next time you run Selenium
Bot to reply from twitter login with Python Selenium - Qiita
