[python] A story about collecting Twitter account names from handles (like @123456) by combining BeautifulSoup with Excel input/output

I want to scrape Twitter account names with Python!

That thought led me to write a small script, which I am sharing here. (This article assumes the handles are already known.)

Precautions for scraping

Scraping comes with rules that you need to follow, so read up on them before you start.

Be very careful when you run the script, because it puts load on the other party's server.

Information acquisition flow

  1. List the handles in the first column of the Excel sheet
  2. Read the handles from Excel
  3. Search for each handle on Google
  4. Extract the account name from the search result
  5. Write the account names to the third column of the Excel sheet

Those are the steps I will follow.

Install the required modules

Four modules are used:

  1. requests: fetches the page at a URL
  2. openpyxl: reads and writes Excel files
  3. bs4: extracts data from the site's HTML
  4. time: provides the sleep function, so that the server is not overloaded

I created a scraype directory under Documents in my local environment.

Install the three third-party modules with pip. time is part of the Python standard library, so it does not need to be installed.

pip3 install requests
pip3 install openpyxl
pip3 install beautifulsoup4

Then import the modules in a script file created in that directory.

handle_name_search.py


import requests
import openpyxl
from bs4 import BeautifulSoup as bs
import time

Arrange the handle names in the first column of Excel

(Screenshot: the handles listed in column A of the Excel sheet)

Fill in the sheet as shown above, name it handle.xlsx, and save it in the scraype folder.

Get the values in column A from Excel

Use the openpyxl module to load the local file and work with Sheet1.

handle_name_search.py


wb = openpyxl.load_workbook('/Users/{Your local name}/Documents/scraype/handle.xlsx')
sheet1 = wb['Sheet1']

Sheet1 is now ready, so let's read every value in column A.

handle_name_search.py


for i in range(0, sheet1.max_row):
    print(sheet1.cell(row=i+1, column=1).value)

Running this prints every value in column A of the Excel sheet. Next, search for each of them!
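Note that max_row covers every column of the sheet, so some cells in column A may be empty. As a minimal sketch (the read_handles helper and the in-memory workbook are hypothetical illustrations, not part of the original script), column A could also be read with iter_rows while skipping empty cells:

```python
import openpyxl


def read_handles(sheet):
    """Return the non-empty values in column A, top to bottom."""
    handles = []
    # values_only=True yields one tuple of plain cell values per row
    for (value,) in sheet.iter_rows(min_col=1, max_col=1, values_only=True):
        if value is not None:
            handles.append(str(value))
    return handles


# In-memory workbook standing in for handle.xlsx (an assumption for this demo)
wb = openpyxl.Workbook()
ws = wb.active
ws["A1"] = "@123456"
ws["A2"] = "@abcdef"
ws["C3"] = "unrelated"  # pushes max_row to 3 while A3 stays empty

print(read_handles(ws))
```

Skipping None here avoids sending an empty search later in the loop.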

Search Google for each extracted handle

handle_name_search.py


req = requests.get("https://www.google.com/search?q=" + sheet1.cell(row=i+1, column=1).value)
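A handle such as @123456 contains characters that should be percent-encoded in a query string. One possible way to build the search URL, sketched here with the standard library (build_search_url is a hypothetical helper name, not something from the original script):

```python
from urllib.parse import urlencode

GOOGLE_SEARCH = "https://www.google.com/search"


def build_search_url(handle):
    """Return the Google search URL for a handle, with the query percent-encoded."""
    return GOOGLE_SEARCH + "?" + urlencode({"q": handle})


print(build_search_url("@123456"))
# https://www.google.com/search?q=%40123456
```

requests can do the same encoding itself if the query is passed as requests.get(url, params={"q": handle}).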

Search for the tags that contain the account name with Beautiful Soup

handle_name_search.py


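req = req.text
soup = bs(req, "html.parser")
# Each search result sits in a div with this class string
tags = soup.find_all("div", class_="ZINbbc xpd O9g5cc uUPGi")
# The result title is in a nested div; take it from the first result
if tags[0].find("div", class_="BNeawe vvjwJb AP7Wnd") is not None:
    title = tags[0].find("div", class_="BNeawe vvjwJb AP7Wnd").string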

What tripped me up here is that the class names shown in Chrome's inspector differ from the class names in the HTML that bs4 receives. Checking the output of soup.prettify() showed that the Google result title is in a div with the class BNeawe vvjwJb AP7Wnd.
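To make this step concrete, here is a small sketch that runs against a hard-coded HTML fragment instead of a live Google page. The fragment, the sample name in it, the extract_title helper, and the use of select_one with a CSS selector are all assumptions for illustration; the class names are the ones from this article and may well change on Google's side.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the structure described above
SAMPLE_HTML = """
<div class="ZINbbc xpd O9g5cc uUPGi">
  <div class="BNeawe vvjwJb AP7Wnd">Taro Yamada (@123456) | Twitter</div>
</div>
"""


def extract_title(html):
    """Return the text of the first result title, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    # A CSS selector matches the classes regardless of their order
    node = soup.select_one("div.ZINbbc div.BNeawe.vvjwJb.AP7Wnd")
    return node.get_text(strip=True) if node is not None else None


print(extract_title(SAMPLE_HTML))
print(extract_title("<p>no results</p>"))
```

Returning None when nothing matches lets the caller skip a row instead of failing on an empty result list.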

Remove the unneeded parts from the search result title

handle_name_search.py


if "(@" in title:
    title = title.split('(@')[0]
else:
    if "- Twitter" in title:
        title = title.split('-')[0]
    if "✓" in title:
        title = title.split('✓')[0]
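The same cleanup can be wrapped in a function so that it is easy to try on sample titles. This is a sketch, not the original code: clean_title is a hypothetical name, the sample titles are invented, and two details differ from the snippet above (it splits on the whole "- Twitter" suffix so that hyphens inside a name are kept, and it strips surrounding whitespace).

```python
def clean_title(title):
    """Strip the handle, the "- Twitter" suffix, and the verified mark from a title."""
    if "(@" in title:
        title = title.split("(@")[0]
    else:
        if "- Twitter" in title:
            # Split on the whole suffix so hyphens inside a name survive
            title = title.split("- Twitter")[0]
        if "✓" in title:
            title = title.split("✓")[0]
    return title.strip()


for sample in ["Taro Yamada (@123456) | Twitter", "Jean-Luc - Twitter", "Taro ✓ - Twitter"]:
    print(clean_title(sample))
```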

Finally, write the result to Excel and save

handle_name_search.py


sheet1.cell(row=i+1,column=3).value = title
wb.save('/Users/{Your local name}/Documents/scraype/handle.xlsx')

The complete script is below.

handle_name_search.py


import requests
import openpyxl
from bs4 import BeautifulSoup as bs
import time


# Open the local handle.xlsx (the handles go in the first column)
wb = openpyxl.load_workbook('/Users/{Your local name}/Documents/scraype/handle.xlsx')
sheet1 = wb['Sheet1']

# Read each handle from the first column and write the account name to the third column
for i in range(0, sheet1.max_row):
    handle = sheet1.cell(row=i+1, column=1).value
    if handle is None:
        continue  # skip empty cells
    time.sleep(1)  # pause between requests so the server is not overloaded
    print(handle)
    req = requests.get("https://www.google.com/search?q=" + str(handle))
    soup = bs(req.text, "html.parser")
    tags = soup.find_all("div", class_="ZINbbc xpd O9g5cc uUPGi")
    if not tags:
        continue  # no search result block was found
    title_tag = tags[0].find("div", class_="BNeawe vvjwJb AP7Wnd")
    if title_tag is not None:
        title = title_tag.string
        if "(@" in title:
            title = title.split('(@')[0]
        else:
            if "- Twitter" in title:
                title = title.split('-')[0]
            if "✓" in title:
                title = title.split('✓')[0]
        print(title)
        sheet1.cell(row=i+1, column=3).value = title
        # Save after each row so progress survives an interruption
        wb.save('/Users/{Your local name}/Documents/scraype/handle.xlsx')
wb.save('/Users/{Your local name}/Documents/scraype/handle.xlsx')

After that, run the script and check the result. Do the account names appear in the third column?

I added a sleep call inside the loop to avoid hammering the server, but be sure to follow the rules whenever you scrape.
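The one-second pause can also be factored out of the loop. Below is a rough sketch with a hypothetical throttled generator (not part of the original script) that waits between items; the sleep function is a parameter only so that the pause can be swapped out when testing.

```python
import time


def throttled(items, delay_seconds=1.0, sleep=time.sleep):
    """Yield items one by one, pausing delay_seconds between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)  # no pause before the first item
        yield item


# Short delay for the demo; the article's script waits one second per request
for handle in throttled(["@123456", "@abcdef"], delay_seconds=0.1):
    print(handle)
```

With this in place, the main loop could iterate over throttled(handles) instead of calling time.sleep(1) directly.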

That's it for scraping.
