[Python] Collecting Twitter account names from handles (like @123456) by combining BeautifulSoup with Excel input/output

I want to scrape Twitter account names with Python!

With that in mind, I wrote a little code to do it, and I am sharing it here. (This time, the handles are assumed to be known in advance.)

Precautions for scraping

Scraping is subject to various rules that you need to follow, so read up on them before running anything.

Please be very careful when running the script, since it can put load on the other party's server.

Information acquisition flow

Arrange the handle names in the first column of Excel
↓
Get the handles from Excel
↓
Search for each one on Google
↓
Extract the account name from the search result
↓
Write the account names to the third column of Excel

That is the flow I will follow.

Install the required modules

The four modules used:

requests: fetches the URL
openpyxl: reads and writes the Excel file
bs4: extracts data from the site's HTML
time: provides the sleep function so as not to load the server

I created a scraype directory in Documents in my local environment.

scraype
- handle_name_search.py
- handle.xlsx

Install the three third-party modules with pip right away. time is part of the Python standard library, so it does not need to be installed.

pip3 install requests
pip3 install openpyxl
pip3 install BeautifulSoup4

Then import them in the file created earlier.

`handle_name_search.py`


import requests
import openpyxl
from bs4 import BeautifulSoup as bs
import time
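
Before running the script, it can help to confirm that the modules are actually importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is my own and not part of the original code.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The four modules used in this article. "time" ships with Python itself,
# so it should never show up as missing.
required = ["requests", "openpyxl", "bs4", "time"]
print("Not installed:", missing_modules(required))
```

If the printed list is not empty, install the listed packages with pip3 before continuing.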

Arrange the handle names in the first column of Excel

(Screenshot 2020-03-18 23.51.40.png)

Fill in the sheet as shown above, name the file handle.xlsx, and save it in the scraype folder.

Get the values in column A from Excel

Use the openpyxl module to load the file from the local environment and work with Sheet1.

`handle_name_search.py`


wb = openpyxl.load_workbook('/Users/{Your local name}/Documents/scraype/handle.xlsx')
sheet1=wb['Sheet1']

Sheet1 is now ready, so let's read every value in column A.

`handle_name_search.py`


for i in range(0,sheet1.max_row):
    print(sheet1.cell(row=i+1,column=1).value)

Running this prints every value in column A of the Excel sheet. Next, search for each of them!
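
As an alternative to indexing cells by row number, openpyxl can iterate over a column directly. The sketch below is an illustration under assumptions: it builds a workbook in memory instead of opening handle.xlsx, the sample handles are made up, and the helper name `read_handles` is hypothetical.

```python
import openpyxl


def read_handles(sheet):
    """Return the non-empty values of column A as strings."""
    handles = []
    # values_only=True yields plain values instead of cell objects
    for (value,) in sheet.iter_rows(min_col=1, max_col=1, values_only=True):
        if value is not None:
            handles.append(str(value))
    return handles


# In-memory workbook standing in for handle.xlsx (sample data is invented)
wb = openpyxl.Workbook()
sheet1 = wb.active
sheet1.title = "Sheet1"
sheet1["A1"] = "@123456"
sheet1["A3"] = "@example_handle"  # A2 is left empty on purpose

print(read_handles(sheet1))
```

Skipping empty cells matters later on, because concatenating None to the search URL would raise a TypeError.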

Search for each extracted handle on Google

`handle_name_search.py`


req = requests.get("https://www.google.com/search?q=" + sheet1.cell(row = i+1, column=1).value)
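
Handles contain characters such as "@" that are safer to percent-encode before they go into a query string. Here is a minimal sketch using the standard library's `urlencode`; the function name `build_search_url` is my own, and the original code simply concatenates the strings.

```python
from urllib.parse import urlencode


def build_search_url(handle):
    """Build a Google search URL with the handle percent-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": handle})


print(build_search_url("@123456"))
```

The resulting string can be passed to requests.get in place of the concatenated URL.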

Search for the tags that contain the account name with Beautiful Soup

`handle_name_search.py`


req = req.text
soup = bs(req,"html.parser")
tags = soup.find_all("div",class_="ZINbbc xpd O9g5cc uUPGi")
if tags[0].find("div",class_="BNeawe vvjwJb AP7Wnd") is not None:
    title = tags[0].find("div",class_="BNeawe vvjwJb AP7Wnd").string

The part I struggled with here is that the class names shown in Chrome's inspector differ from the classes in the HTML that bs4 receives. Checking the output of soup.prettify() showed that the Google search result title sits in a div with the class BNeawe vvjwJb AP7Wnd.
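
The lookup can be tried offline against a hand-written snippet. This is only a sketch: `SAMPLE_HTML` is invented to mimic the structure described above, the real Google markup may differ or change at any time, and the helper name `first_result_title` is hypothetical.

```python
from bs4 import BeautifulSoup

# Invented HTML that mimics the structure described in this article
SAMPLE_HTML = """
<div class="ZINbbc xpd O9g5cc uUPGi">
  <div class="BNeawe vvjwJb AP7Wnd">Example Name (@123456) | Twitter</div>
</div>
"""


def first_result_title(html):
    """Return the title text of the first result block, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # A class_ string with spaces matches the exact class attribute value
    block = soup.find("div", class_="ZINbbc xpd O9g5cc uUPGi")
    if block is None:
        return None
    title_tag = block.find("div", class_="BNeawe vvjwJb AP7Wnd")
    if title_tag is None:
        return None
    return title_tag.get_text()


print(first_result_title(SAMPLE_HTML))
```

Returning None instead of indexing into an empty list keeps the script from crashing when a search returns no matching block. When the classes stop matching, printing soup.prettify() is the quickest way to see what the server actually sent.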

Strip the unneeded parts from the search result title

`handle_name_search.py`


if "(@" in title:
    title = title.split('(@')[0]
else:
    if "- Twitter" in title:
        title = title.split('-')[0]
    if "✓" in title:
        title = title.split('✓')[0]
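
The same cleanup can be wrapped in a function so it is easy to check against sample titles. This sketch makes two changes that are my own assumptions rather than the original behavior: it splits on "- Twitter" instead of any "-" so that hyphenated names survive, and it strips surrounding whitespace. The sample titles are made up.

```python
def clean_title(title):
    """Reduce a search result title to the account name."""
    if "(@" in title:
        # "Name (@handle) | Twitter" -> keep everything before "(@"
        title = title.split("(@")[0]
    else:
        if "- Twitter" in title:
            title = title.split("- Twitter")[0]
        if "✓" in title:
            title = title.split("✓")[0]
    return title.strip()


print(clean_title("Example Name (@123456) | Twitter"))
print(clean_title("Jean-Luc ✓ - Twitter"))
```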

Finally output to Excel and save

`handle_name_search.py`


sheet1.cell(row=i+1,column=3).value = title
wb.save('/Users/{Your local name}/Documents/scraype/handle.xlsx')
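
To check the write step without touching the real file, the workbook can be saved to an in-memory buffer and read back. This is a sketch under assumptions: the sample values are invented, and `write_account_name` is a hypothetical helper, not part of the original script.

```python
import io

import openpyxl


def write_account_name(sheet, row, name):
    """Write the account name into the third column of the given row."""
    sheet.cell(row=row, column=3).value = name


# In-memory workbook standing in for handle.xlsx
wb = openpyxl.Workbook()
sheet1 = wb.active
sheet1.title = "Sheet1"
sheet1["A1"] = "@123456"
write_account_name(sheet1, 1, "Example Name")

# Save to a buffer instead of a path, then load it again to verify
buffer = io.BytesIO()
wb.save(buffer)
buffer.seek(0)
reloaded = openpyxl.load_workbook(buffer)
print(reloaded["Sheet1"].cell(row=1, column=3).value)
```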

The complete code is below.

`handle_name_search.py`


import requests
import openpyxl
from bs4 import BeautifulSoup as bs
import time


## Open the local handle.xlsx (handle names go in the first column)
wb = openpyxl.load_workbook('/Users/{Your local name}/Documents/scraype/handle.xlsx')
sheet1 = wb['Sheet1']

## Read each handle from the first column and write the account name to the third column
for i in range(0, sheet1.max_row):
    handle = sheet1.cell(row=i+1, column=1).value
    if handle is None:
        continue  # skip empty cells
    time.sleep(1)  # wait between requests so as not to load the server
    print(handle)
    req = requests.get("https://www.google.com/search?q=" + str(handle))
    soup = bs(req.text, "html.parser")
    tags = soup.find_all("div", class_="ZINbbc xpd O9g5cc uUPGi")
    if not tags:
        continue  # no result block found for this handle
    title_tag = tags[0].find("div", class_="BNeawe vvjwJb AP7Wnd")
    if title_tag is not None and title_tag.string is not None:
        title = title_tag.string
        if "(@" in title:
            title = title.split('(@')[0]
        else:
            if "- Twitter" in title:
                title = title.split('-')[0]
            if "✓" in title:
                title = title.split('✓')[0]
        print(title)
        sheet1.cell(row=i+1, column=3).value = title
        # save after every row so that progress is kept if a later request fails
        wb.save('/Users/{Your local name}/Documents/scraype/handle.xlsx')

After that, run the script and check the result. Are the account names written to the third column?

I added a sleep call inside the loop to avoid overloading the server, but please be sure to follow the rules whenever you scrape.
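
The waiting logic can also be separated from the fetching so that the delay is easy to adjust and to test. The following is a rough sketch, not the code used above: `fetch_all` and its parameters are hypothetical names, and the fetch function is passed in so that no real request is made here.

```python
import time


def fetch_all(handles, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(handle) for each handle, pausing between consecutive calls."""
    results = {}
    for index, handle in enumerate(handles):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results[handle] = fetch(handle)
    return results


def fake_fetch(handle):
    """Stand-in for the real request; returns a made-up string."""
    return "result for " + handle


print(fetch_all(["@123456"], fake_fetch))
```

In the real script, fetch would wrap the requests.get call, and delay_seconds could be raised if the target server needs more breathing room.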

That was scraping.