I recently learned about web scraping and put it into practice. This time, I searched for a keyword on " CiNii Articles - Search for Japanese Articles - National Institute of Informatics ", collected the title, authors, and publication media of every paper that matched the keyword, and saved them to a CSV file. It was a good exercise for learning scraping, so I wrote it up as an article. I hope it is useful for others who are learning scraping!
Below is the code I wrote. The explanation is included as comments alongside the code, so please take a look. I also think your understanding will deepen if you open the actual " CiNii Articles - Search for Japanese Articles - National Institute of Informatics " site and follow along while inspecting the HTML structure with Chrome's developer tools. I saved this code as "search.py".
import sys
import os
import re

import requests
import pandas as pd
from bs4 import BeautifulSoup


def main():
    url = 'https://ci.nii.ac.jp/search?q={}&count=200'.format(sys.argv[1])
    res = requests.get(url)
    soup = BeautifulSoup(res.content, "html.parser")

    # Check the number of search results.
    # The heading text contains data like '\n Search results\n\t\n\t0\n\t'.
    search_count_result = soup.find_all("h1", {"class": "heading"})[0].text

    # Extract the number of results with a regular expression.
    pattern = '[0-9]+'
    result = re.search(pattern, search_count_result)

    # If there are no search results, the function ends here.
    search_count = int(result.group())
    if search_count == 0:
        return print('There are no search results.')

    print('The number of search results is ' + str(search_count) + '.')

    # Create a directory to store the data.
    try:
        os.makedirs(sys.argv[1])
        print("A new directory has been created.")
    except FileExistsError:
        print("The directory already exists.")

    # Work out how many loop iterations are needed to get all the search results.
    # The page is set to display 200 results at a time.
    if search_count // 200 == 0:
        times = 1
    elif search_count % 200 == 0:
        times = search_count // 200
    else:
        times = search_count // 200 + 1

    # Collect the titles, authors, and publication media in one pass.
    title_list = []
    author_list = []
    media_list = []

    # Translation table used to delete whitespace characters.
    escape = str.maketrans({"\n": '', "\t": ''})

    for time in range(times):
        # Build the URL for this page.
        count = 1 + 200 * time
        # search?q={} takes the keyword you want to search for.
        # count=200&start={} pages through the results 200 at a time,
        # starting from the given position.
        url = 'https://ci.nii.ac.jp/search?q={}&count=200&start={}'.format(sys.argv[1], count)
        print(url)
        res = requests.get(url)
        soup = BeautifulSoup(res.content, "html.parser")

        # Loop over each paper on the page.
        for paper in soup.find_all("dl", {"class": "paper_class"}):
            # Get the title.
            title_list.append(paper.a.text.translate(escape))
            # Get the authors.
            author_list.append(paper.find('p', {'class': "item_subData item_authordata"}).text.translate(escape))
            # Get the publication media.
            media_list.append(paper.find('p', {'class': "item_extraData item_journaldata"}).text.translate(escape))

    # Put the results in a data frame and save it as CSV.
    journal = pd.DataFrame({"Title": title_list, "Author": author_list, "Media": media_list})

    # Specify the encoding to prevent garbled characters.
    journal.to_csv(sys.argv[1] + '/' + sys.argv[1] + '.csv', encoding='utf_8_sig')
    print('The file has been created.')
    print(journal.head())


if __name__ == '__main__':
    # Run only when the module is executed directly.
    main()
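As a side note, the selectors in the loop assume that each search result is wrapped in a dl element with class "paper_class", whose first a tag is the title, with the authors and publication media in p elements of class "item_subData item_authordata" and "item_extraData item_journaldata". The fragment below is a hand-written, simplified sketch of that structure, not the real CiNii markup; it is only meant to show how paper.a and the two find() calls map onto the page.

from bs4 import BeautifulSoup

# A simplified, hand-written fragment that mimics the structure the
# selectors in search.py expect. The real CiNii page is much richer.
sample_html = """
<dl class="paper_class">
  <dt class="item_mainTitle"><a href="#">An example paper title</a></dt>
  <dd>
    <p class="item_subData item_authordata">Taro Yamada, Hanako Suzuki</p>
    <p class="item_extraData item_journaldata">Some Journal Vol.1 pp.1-10 2020</p>
  </dd>
</dl>
"""

soup = BeautifulSoup(sample_html, "html.parser")
paper = soup.find("dl", {"class": "paper_class"})
print(paper.a.text)                                                        # title
print(paper.find('p', {'class': "item_subData item_authordata"}).text)     # authors
print(paper.find('p', {'class': "item_extraData item_journaldata"}).text)  # media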
Now let's actually run what I created. First, type the following into the terminal. This time I used machine learning as the search keyword; replace it with whatever keyword you want to search for.
python search.py machine learning
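One note: the script only reads sys.argv[1], so a keyword that contains spaces should be wrapped in quotes, otherwise only the first word is passed as the query:

python search.py "machine learning"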
If all goes well, the terminal shows the URL requested for each page of results, the number of hits, and the first few rows of the data frame.
The resulting CSV contains one row per paper, with Title, Author, and Media columns.
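If you want to check the saved file from Python rather than a spreadsheet, you can read it back with pandas. The path below is just an assumption based on passing "machine learning" (quoted) as the keyword; adjust it to whatever you searched for.

import pandas as pd

# Read the CSV produced by search.py back into a data frame.
# The directory and file name come from the keyword you passed;
# here I assume it was "machine learning".
df = pd.read_csv('machine learning/machine learning.csv', index_col=0, encoding='utf_8_sig')
print(df.shape)   # (number of papers, 3)
print(df.head())  # Title / Author / Media columns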
How was that? I only learned scraping about three days ago, so the code is still rough, but it was relatively easy to implement. I still have a lot to study, so I will keep at it.