I wrote a script to get a popular site in Japan

Summary

--Scraped with Python + PyQuery --I was able to get the top 525 of popular Japanese sites (most accessed) and output them to CSV. --This allows you to investigate the HTML / design features of popular sites (maybe)


I thought that if I analyzed the HTML and design of popular sites, it would be an interesting result, so I wrote a script to get the URL of popular sites for the time being.

You can see the ranking of each country on Alexa, so I will get it from there.

Alexa the Web Information is a long-established service that has been publishing statistics such as the number of access and usage status of websites since 1996 so that anyone can view them. Since 1999, a child of Amazon Company.

refs. http://freesoft.tvbok.com/cat94/site10/alexa.html

The script I wrote for the time being is on GitHub. https://github.com/saxsir/fjats

By the way, you can get more by using Official API (paid).

Caution

――Since it is a site that is open to the public, scraping itself is not illegal (probably), but if you overdo it too much, it will cause trouble to the other site, so please do it at your own risk.

refs. -List of precautions for web scraping -Let's talk about the law of web scraping!

Operating environment

Work procedure

  1. Install PyQuery
  2. Write a simple sample
  3. Rewrite to get up to 525th in Japan ranking
  4. Output to CSV
  5. Wait 1-3 seconds for each request

Install PyQuery

First, put in a library for scraping. A library called PyQuery seems to be popular, so let's use it.

All you have to do is write a script. For the time being, I will paste the completed form.

main.py


import csv
from pyquery import PyQuery as pq
from datetime import datetime as dt
from time import sleep
from random import randint

ranks = []
for i in range(21):
  # http://www.alexa.com/topsites/countries;0/JP
  url = 'http://www.alexa.com/topsites/countries;%s/JP' % i
  doc = pq(url, parser='html')
  ul = [doc(li) for li in doc('.site-listing')]
  ranks += [(li('.count').text(), li('.desc-paragraph')('a').text()) for li in ul]
  print('Fetch %s' % url)    # Check script is running
  sleep(randint(1,3))

with open('topsites-jp_%s.csv' % dt.now().strftime('%y-%m-%d-%H-%M'), 'w') as f:
  writer = csv.writer(f, lineterminator='\n')
  writer.writerow(('Ranking', 'URL'))
  writer.writerows(ranks)

From here on down is the explanation of the code, so if you want to move it, just paste the above source and execute it.

First write a simple sample

from pyquery import PyQuery as pq

#Try to get the top 25 sites in the world for the time being
url = 'http://www.alexa.com/topsites'
doc = pq(url, parser='html')

#The class is site from the obtained DOM-Get the elements of listing
#(Check the class name of the part you want to get in advance with Chrome developer tools etc.)
ul = [doc(li) for li in doc('.site-listing')]

#For the time being, the ranking and site name,Try to output by separating with
ranks = ['%s, %s' % (li('.count').text(), li('.desc-paragraph')('a').text()) for li in ul]

print(ranks)

Let's move this with an interpreter. (It's OK if you start and copy and paste)

$ python                                                
Python 3.4.3 (default, Mar 27 2015, 14:54:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyquery import PyQuery as pq
>>>
>>> #Try to get the top 25 sites in the world for the time being
... url = 'http://www.alexa.com/topsites'
>>> doc = pq(url, parser='html')
>>>
>>> #The class is site from the obtained DOM-Get the elements of listing
... #(Check the class name of the part you want to get in advance with Chrome developer tools etc.)
... ul = [doc(li) for li in doc('.site-listing')]
>>>
>>> #For the time being, the ranking and site name,Try to output by separating with
... ranks = ['%s, %s' % (li('.count').text(), li('.desc-paragraph')('a').text()) for li in ul]
>>>
>>> print(ranks)
['1, Google.com', '2, Facebook.com', '3, Youtube.com', '4, Yahoo.com', '5, Baidu.com', '6, Amazon.com', '7, Wikipedia.org', '8, Taobao.com', '9, Twitter.com', '10, Qq.com', '11, Google.co.in', '12, Live.com', '13, Sina.com.cn', '14, Weibo.com', '15, Linkedin.com', '16, Yahoo.co.jp', '17, Google.co.jp', '18, Ebay.com', '19, Tmall.com', '20, Yandex.ru', '21, Blogspot.com', '22, Vk.com', '23, Google.de', '24, Hao123.com', '25, T.co']

It seems that it has been acquired, so rewrite this to acquire the ranking of Japan.

Rewrite to get up to 525th in Japan ranking

Looking at it with a browser, http://www.alexa.com/topsites/countries;0/JP The 1st to 25th places are displayed with a URL like this. If you increase the 0 part, it seems that you can get up to 20, so use the for statement to loop 20 times and get the data.

from pyquery import PyQuery as pq

ranks = []
for i in range(21):
  # http://www.alexa.com/topsites/countries;0/JP
  url = 'http://www.alexa.com/topsites/countries;%s/JP' % i
  doc = pq(url, parser='html')
  ul = [doc(li) for li in doc('.site-listing')]
  ranks += [(li('.count').text(), li('.desc-paragraph')('a').text()) for li in ul]

this part

[(li('.count').text(), li('.desc-paragraph')('a').text()) for li in ul]

Is a little confusing,

#Tuple like this
('1', 'Site 1') # (li('.count').text(), li('.desc-paragraph')('a').text())

#, List comprehension([... for li in ul])Use to make a list like this and return it
[('1', 'Site 1'), ('2', 'Site 2') ...]

#Concatenation (ranks)+= ...) (Array in JavaScript or Ruby).concat)
[('1', 'Site 1'), ('2', 'Site 2') ... ('525', 'Site 525')]

The reason why I made it like this is because I want to output it to CSV later.

Output to CSV

import csv
from datetime import datetime as dt

with open('topsites-jp_%s.csv' % dt.now().strftime('%y-%m-%d-%H-%M'), 'w') as f:
  writer = csv.writer(f, lineterminator='\n')
  writer.writerow(('Ranking', 'URL'))
  writer.writerows(ranks)

The time is added to the CSV file name so that it is easy to understand when the data is.

Finally, leave a 1-3 second interval for each request

How to act as a crawler (but not ...).

from time import sleep
from random import randint

sleep(randint(1,3))

Randomly wait 1 to 3 seconds.

Referenced site

-Official repository -Scraping Alexa's web rank with pyQuery -Reading and writing CSV with Python -python current time acquisition -List of precautions for web scraping -Let's talk about the law of web scraping!

Recommended Posts

I wrote a script to get a popular site in Japan
I wrote a script to get you started with AtCoder fast!
I wrote a function to load a Git extension script in Python
I wrote a script to extract a web page link in Python
I wrote a script to upload a WordPress plugin
I made a script to put a snippet in README.md
I get a UnicodeDecodeError in mecab-python3
I get a KeyError in pyclustering.xmeans
I tried "How to get a method decorated in Python"
I wrote a script to combine the divided ts files
I wrote a script that splits the image in two
I just wrote a script to build Android on another machine
I wrote a script to help goodnotes5 and Anki work together
How to get a stacktrace in python
I made a script to display emoji
I wrote a code to convert quaternions to z-y-x Euler angles in Python
AtCoder writer I wrote a script to aggregate the contests for each writer
[To Twitter gentlemen] I wrote a script to convert .jpg-large to .jpg at once.
I made a script in python to convert .md files to Scrapbox format
When I get a chromedriver error in Selenium
A memo that I wrote a quicksort in Python
I want to create a window in Python
I wrote a class in Python3 and Java
I wrote "Introduction to Effect Verification" in Python
I wrote a design pattern in kotlin Prototype
I made a tool to get new articles
I wrote a Japanese parser in Japanese using pyparsing.
I wrote a script to revive the gulp watch that will die soon
I want to embed a variable in a Python string
I want to easily implement a timeout in python
I wrote a design pattern in kotlin Factory edition
I want to transition with a button in flask
I get a java.util.regex.PatternSyntaxException when splitting a string in PySpark
I wrote a design pattern in kotlin Builder edition
I want to write in Python! (2) Let's write a test
I wrote a design pattern in kotlin Singleton edition
I wrote a design pattern in kotlin Adapter edition
I tried to implement a pseudo pachislot in Python
I wrote a design pattern in kotlin, Iterator edition
A memorandum to run a python script in a bat file
I want to find a popular package on PyPi
I want to randomly sample a file in Python
I want to work with a robot in python.
I wrote a design pattern in kotlin Template edition
I wanted to modify Django's admin site a bit
I want to use a python data source in Re: Dash to get query results
I wrote a CLI tool in Go language to view Qiita's tag feed with CLI
I tried to implement a one-dimensional cellular automaton in Python
I made a script in Python to convert a text file for JSON (for vscode user snippet)
I wrote a program quickly to study DI with Python ①
[Python] I want to get a common set between numpy
I made a class to get the analysis result by MeCab in ndarray with python
Developed a library to get Kindle collection list in Python
I made a command to generate a table comment in Django
[OCI] Python script to get the IP address of a compute instance in Cloud Shell
[ECS] [Fargate] Script to get Task ID in TASK execution container
I tried to get started with Hy ・ Define a class
I wrote an empty directory automatic creation script in Python
How to get the last (last) value in a list in Python
How to get a list of built-in exceptions in python
I tried to make a stopwatch using tkinter in python