[Python] Explore the characteristics of the titles of the top sites in Google search results

What I want to do

I want to find out what characterizes the titles of the top 10 sites that Google returns for a given keyword.

Library

Premise

I will write the code in Jupyter Notebook this time, so it must be installed.

Implementation

The actual code follows. I have little experience with this, so parts of it may be inefficient. Please keep that in mind.

Library import

These are the only libraries used this time, so import them all at the beginning.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from math import log
from janome.tokenizer import Tokenizer
from janome.analyzer import Analyzer
from janome.tokenfilter import POSStopFilter
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

Get the title of the search result

Use BeautifulSoup to scrape Google search results.

#Request header
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
list_keywd = ['Keyword 1','Keyword 2']
input_num = 10
url = 'https://www.google.co.jp/search?num={}&q='.format(input_num) + ' '.join(list_keywd)

#Connect
response = requests.get(url, headers=headers)

#Check HTTP status code (raises an exception for 4xx/5xx responses)
response.raise_for_status()

#Parse the retrieved HTML
soup = bs(response.content, 'html.parser')

#Get search result links
#(this selector depends on Google's result-page markup and may need updating)
ret_link = soup.select('.r > a')

title_list = []
url_list = []
r_list = []
cols = ['title','url']

for link in ret_link:
    #Take the title from the h3 inside the link so that breadcrumb text is excluded
    heading = link.find('h3')
    if heading is None:
        #Skip links without a title instead of misaligning titles and URLs
        continue
    title_txt = heading.get_text()

    #Get only the link and remove the '/url?q=' redirect prefix
    url_txt = link.get('href').replace('/url?q=','')

    title_list.append(title_txt)
    url_list.append(url_txt)
    r_list.append([title_txt,url_txt])

#Display search results
df = pd.DataFrame(r_list,columns=cols)
df

The top 10 sites in Google search results for a certain keyword were as follows.

01.png
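
The `replace('/url?q=','')` step in the scraping code strips the redirect prefix but leaves any trailing tracking parameters attached to the URL. As a minimal sketch, assuming the links have the form `/url?q=<target>&sa=...`, the standard library can pull out only the target URL. The helper name `extract_target_url` and the sample hrefs are hypothetical and not part of the original code.

```python
from urllib.parse import urlparse, parse_qs

def extract_target_url(href):
    """Return the target of a '/url?q=...' redirect link, or href unchanged."""
    parsed = urlparse(href)
    if parsed.path == '/url':
        # parse_qs decodes percent-escapes and returns a list per parameter
        q = parse_qs(parsed.query).get('q')
        if q:
            return q[0]
    return href

# Hypothetical sample links for illustration only
sample = '/url?q=https://example.com/page%3Fid%3D1&sa=U&ved=abc'
print(extract_target_url(sample))
print(extract_target_url('https://example.com/direct'))
```

Links that are not redirect links are returned unchanged, so the helper could replace the `replace` call without special-casing direct URLs.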

Morphological analysis and word segmentation of the titles

Now that we have the titles of the top 10 sites, we run them through morphological analysis and split them into space-separated words. Janome is used for the morphological analysis.

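#Assumption: the original post does not show how a, BLOG and N are defined.
#The setup below is one plausible reconstruction, not the author's exact code.
#POSStopFilter drops the listed parts of speech (symbols, particles, auxiliary verbs).
a = Analyzer(tokenizer=Tokenizer(), token_filters=[POSStopFilter(['記号','助詞','助動詞'])])
BLOG = df.to_dict('index')   #{0: {'title': ..., 'url': ...}, 1: ...}
N = len(BLOG)                #Number of titles (documents)

#Tokenize each title and register its words
work = []
WAKATI = []
for i in BLOG.keys():
    texts_flat = "".join(BLOG[i]["title"])
    tokens = a.analyze(texts_flat)
    work.append(' '.join([t.surface for t in tokens]))
    WAKATI.append(work[i].lower().split())
#Verification
for i in BLOG.keys():
    print("■WAKATI[{}]: {}".format(i,WAKATI[i]))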

#Calculate word frequencies with scikit-learn
vectorizer = CountVectorizer()

#Bag-of-words (BoW) calculation
X = vectorizer.fit_transform(work)
#get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
WORDS = sorted(vectorizer.get_feature_names_out())
print('=========================================')
print('All words')
print('=========================================')
print(WORDS)

The titles were split into words as follows.

02.png
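
To make clear what the bag-of-words step produces, here is a minimal sketch in plain Python, assuming titles that are already space-separated. The sample titles and the names `docs`, `vocab`, and `bow` are made up for illustration, and `collections.Counter` stands in for `CountVectorizer`, which additionally drops one-character tokens by default.

```python
from collections import Counter

# Hypothetical, already space-separated titles
docs = ['cool mask summer', 'mask cool cool', 'summer sale']

# Lowercase and split, as the article does for WAKATI
tokenized = [d.lower().split() for d in docs]

# Sorted vocabulary over all titles (the role WORDS plays in the article)
vocab = sorted({w for doc in tokenized for w in doc})

# One row per title, one column per vocabulary word, holding raw counts
bow = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

print(vocab)
for row in bow:
    print(row)
```

Each row is one title and each column one word, which is the same layout as the tf and tf-idf tables later in the article.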

Function definition

This time, I wrote functions to compute the tf, idf, and tf-idf values.

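#Function definition
def tf(t, d):
    #Term frequency: share of the tokens in document d that are the word t
    return d.count(t)/len(d)

def idf(t):
    #Document frequency: number of titles in WAKATI that contain t
    df = 0
    for wak in WAKATI:
        df += t in wak

    #N is the total number of titles. Wrapping df in np.array makes a zero
    #document frequency yield inf (with a warning) instead of ZeroDivisionError.
    return log(N/np.array(df)) + 1

def tfidf(t,d):
    return tf(t,d) * idf(t)

def highlight_negative(val):
    #Despite its name, this styles positive values bold red and everything else black
    if val > 0:
        return 'color: {0}; font-weight: bold'.format('red')
    else:
        return 'color: {0}'.format('black')
#Function definition End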
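
The functions above read `WAKATI` and `N` from the notebook's global state. As a self-contained sketch of the same formulas, the version below passes the documents in explicitly. The toy documents are invented for illustration, and this `idf` assumes the word appears in at least one document (otherwise the division raises `ZeroDivisionError`).

```python
from math import log

def tf(term, doc):
    # Share of the tokens in doc that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Number of documents containing `term` (assumed to be at least 1)
    doc_freq = sum(term in doc for doc in docs)
    return log(len(docs) / doc_freq) + 1

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Hypothetical tokenized titles
docs = [
    ['cool', 'mask'],
    ['mask', 'sale'],
    ['mask', 'summer', 'cool', 'cool'],
]

print(tf('cool', docs[2]))           # 2 of the 4 tokens are 'cool'
print(idf('mask', docs))             # 'mask' is in every title
print(tfidf('cool', docs[2], docs))
```

A word that occurs in every title, like `mask` here, gets the minimum idf of 1, so its tf-idf is simply its tf.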

Let's look at the tf value

First, compute the tf value of every word for each title.

#tf calculation
print('■ TF value for each site')
print('Frequency of appearance in one document')
ret = []
for i in range(N):
    ret.append([])
    d = WAKATI[i]
    for j in range(len(WORDS)):
        t = WORDS[j]
        if len(d) > 0:
            ret[-1].append(tf(t,d))

tf_ = pd.DataFrame(ret, columns=WORDS)
tf_.style.applymap(highlight_negative)

The tf values were obtained as shown in the figure below. Nonzero values, meaning the word appears in that title, are shown in red. "Mask" and "cool" are used in many titles. You may be able to guess the search keyword from the tf values alone.

03.png

Let's look at the idf value

The higher the idf value, the fewer titles the word appears in, which makes it a rare word. Conversely, the smaller the value, the more titles use the word.

#idf calculation
ret = []
for i in range(len(WORDS)):
  t = WORDS[i]
  ret.append(idf(t))

idf_ = pd.DataFrame(ret, index=WORDS, columns=["IDF"])
idf_s = idf_.sort_values('IDF')
idf_s.style.applymap(highlight_negative)

04.png

"Mask" and "cool", which appeared often in the tf table, naturally have small idf values. In this result, a word that appears in 2 titles has an idf of 2.609438, and a word that appears in only 1 title has an idf of 3.302585.
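
The two idf values quoted above follow directly from the formula log(N/df) + 1 used in `idf`. The sketch below recomputes them, assuming N = 10 titles (the top-10 setup of this article). The helper name `idf_from_doc_freq` is made up for this check.

```python
from math import log

def idf_from_doc_freq(doc_freq, n_docs=10):
    # Same formula as the article's idf(): natural log of N/df, plus 1
    return log(n_docs / doc_freq) + 1

for doc_freq in (1, 2, 10):
    print(doc_freq, round(idf_from_doc_freq(doc_freq), 6))
# 1 3.302585
# 2 2.609438
# 10 1.0
```

A word that appears in all 10 titles reaches the minimum value of 1.0, so every idf in the table lies between 1.0 and 3.302585.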

Let's look at the tf-idf value

The higher the tf-idf value, the more important the role the word plays in the title.

ret = []
for i in range(N):
  ret.append([])
  d = WAKATI[i]
  for j in range(len(WORDS)):
    t = WORDS[j]
    if len(d) > 0:
        ret[-1].append(tfidf(t,d))

tfidf_ = pd.DataFrame(ret,columns=WORDS)
tfidf_.style.applymap(highlight_negative)

The result looks like this. Each row is a site and each column is a word. Words that do not appear in a site's title have the value 0. Looking at each site, a word with a large value can be said to play a big role in that site's title.

05.png
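
Reading the largest values off a wide table by eye gets tedious, so it can help to list the highest-scoring words per title. This is a minimal sketch in plain Python. The scores in `rows` are made-up numbers and `top_words` is a hypothetical helper, neither taken from the actual results.

```python
# Hypothetical tf-idf scores, one dict per title (word -> score)
rows = [
    {'cool': 0.47, 'mask': 0.33, 'summer': 0.0},
    {'cool': 0.0, 'mask': 0.25, 'summer': 0.83},
]

def top_words(row, k=1):
    """Return up to k words with the highest positive scores, best first."""
    ranked = sorted(row.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, score in ranked[:k] if score > 0]

for i, row in enumerate(rows):
    print(i, top_words(row, k=2))
```

With the pandas table from the article, `tfidf_.idxmax(axis=1)` should give the single top word per row in a similar way.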

Summary

This is how you can check the characteristics of the titles. It may help when you are deciding what to title a site and which words to include.
