Import the required libraries: pandas, BeautifulSoup, httplib2, re, time, and datetime.

```python
import pandas as pd
from bs4 import BeautifulSoup
import httplib2
import re
import time
from datetime import datetime
```

1. Fetch the page. The target site is paginated as http://~/page/1, http://~/page/2, ... and so on, so the page number is held in `num` (set to 1 here; it is assumed that any integer the site serves can be entered). The fetched content is parsed into `soup` with `soup = BeautifulSoup(content, 'lxml')`.

```python
num = 1
h = httplib2.Http('.cache')   # responses are cached in the .cache directory
base_url = "http://~/page"
url = base_url + str(num)
response, content = h.request(url)
content = content.decode('utf-8')
soup = BeautifulSoup(content, 'lxml')
```
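httplib2 returns the response metadata alongside the body, so the fetch can be sanity-checked before parsing. A minimal optional sketch, not part of the original flow (and since the URL above is a placeholder, illustrative only):

```python
# response is the dict-like httplib2.Response metadata for the request
if response.status != 200:
    raise RuntimeError('request failed with status ' + str(response.status))
```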
2. From the parsed page, extract the `div` tags whose `id` attribute is `"primary"` and store them in `data` (this pulls out only the block holding the individual news items from everything else on the homepage).

```python
data = soup.find_all('div', {"id": "primary"})
```
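Note that `find_all` returns a list-like `ResultSet` even when only one element matches, which is why the block is addressed as `data[0]` in the steps below. A quick self-contained check on a toy document:

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup('<div id="primary">news</div>', 'lxml')
matches = doc.find_all('div', {"id": "primary"})
print(type(matches))    # <class 'bs4.element.ResultSet'>
print(matches[0].text)  # news
```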
3. Extract the date information from the `data` obtained in 2. and store it in `dates` (as noted above, the further `find_all` has to be applied to `data[0]`). The `dates` elements contain the time as well as the date; only the date is needed here, so it is sliced out and stored in `temp`. Each entry of `temp` is then converted to a datetime type and collected in a list. Because the original data mixes `%d/%m/%Y`-style and `%Y-%m-%d`-style strings, `index` is used to classify the two cases before conversion.

```python
dates = data[0].find_all('span', class_='posted-on')
temp = []
for item in dates:
    date = item.text
    temp.append(date[1:11].split())      # keep only the date part of the string
dlist = []
for item in temp:
    index = item[0].find("/")            # classify the two date formats by "/"
    if index != -1:
        dlist.append(datetime.strptime(item[0], '%d/%m/%Y').date())
    else:
        dlist.append(datetime.strptime(item[0], '%Y-%m-%d').date())
```
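To see what the two-format branch does, here is a small self-contained sketch; the sample strings are made up, since the real ones come from the page:

```python
from datetime import datetime

samples = ["01/02/2020", "2020-02-01"]       # hypothetical mixed-format dates
for s in samples:
    fmt = '%d/%m/%Y' if "/" in s else '%Y-%m-%d'
    print(datetime.strptime(s, fmt).date())  # 2020-02-01 both times
```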
4. Extract the headline information from the `data` obtained in 2. and store it in `newdata`, then collect the article titles in `tlist` and the URLs in `ulist`.
* For the headlines, the escape sequences (`\n`, `\r`, `\t`) are stripped out here.

```python
newdata = data[0].find_all('h2', class_='entry-title')
tlist = []
ulist = []
for item in newdata:
    urls = re.search('href="(?P<URL>.+?)"', str(item)).group('URL')
    titles = item.get_text()
    ulist.append(urls)
    tlist.append(re.sub(r'\n|\r|\t', '', titles))
```
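As an aside, the same URL can usually be pulled out without a regex by asking BeautifulSoup for the `<a>` tag directly. A sketch on a toy fragment, assuming each `entry-title` wraps exactly one link (the class name matches the code above; the URL is a placeholder):

```python
from bs4 import BeautifulSoup

html = '<h2 class="entry-title"><a href="http://~/news/1">Example\nheadline</a></h2>'
item = BeautifulSoup(html, 'lxml').find('h2', class_='entry-title')
link = item.find('a')
print(link.get('href'))                   # http://~/news/1
print(' '.join(link.get_text().split()))  # Example headline
```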
5. Use pandas to build a data frame of the target headline list (date, article title, URL).

```python
list_headline = pd.DataFrame({'date': dlist,
                              'headline': tlist,
                              'url': ulist})
```
6. Turn steps 1. to 5. into a function. `num` is made a variable so that pages with the same structure can be fetched automatically according to its value (`url = base_url + str(num)` in 1. prepares for this). Declare the function name (here `headline`) and the variable (here `num`) with `def`, and write the function body indented (see "Actual code" below for the full version).

```python
def headline(num):
    h = httplib2.Http('.cache')
    base_url = "http://~/page"
    url = base_url + str(num)
    # ...omitted: see "Actual code" below...
    return list_headline
```
7. Run the code for `num` from 1 to 5. First call `headline(1)` and store the result in `headlines`; this is needed because an empty object cannot be built up from inside the loop. A pause is inserted between requests with `time.sleep(5)`. For `num` from 2 to 5, a `for` loop applies the function repeatedly and stacks each result onto `headlines` (picture new data frames being piled on top of the existing frame). `print(i)` is included for error checking.

```python
headlines = headline(1)
time.sleep(5)
for i in range(2, 6):    # i = 2, 3, 4, 5
    temp = headline(i)
    headlines = pd.concat([headlines, temp])
    time.sleep(5)
    print(i)
```
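The stacking that `pd.concat` performs can be seen on a toy example. Note that each page keeps its own 0-based index; passing `ignore_index=True` (optional, not in the original code) gives one continuous index instead. The values below are made up:

```python
import pandas as pd

page1 = pd.DataFrame({'date': ['2020-02-01'], 'headline': ['A'], 'url': ['http://~/a']})
page2 = pd.DataFrame({'date': ['2020-02-02'], 'headline': ['B'], 'url': ['http://~/b']})
stacked = pd.concat([page1, page2], ignore_index=True)
print(stacked)  # two rows, indexed 0 and 1
```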
8. Save the result under a date-stamped file name.

```python
#headlines.to_csv(datetime.today().strftime("%Y%m%d") + 'FILENAME.csv')  # .csv is generally easier to work with and recommended
headlines.to_excel('/Users/USERNAME/FOLDERNAME/' + datetime.today().strftime("%Y%m%d") + 'FILENAME.xlsx')  # use this if Excel format is preferred
```
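One practical note, as an assumption about the environment rather than something stated above: `to_excel` needs an Excel writer engine such as openpyxl installed (`pip install openpyxl`), whereas `to_csv` has no extra dependency; `index=False` keeps the row index out of the file:

```python
# variant of the save above; index=False drops the data frame's row index
headlines.to_csv(datetime.today().strftime("%Y%m%d") + 'FILENAME.csv', index=False)
```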
The URL (`base_url`) and the save destination (see 8. above) contain fictitious values, so this code will not return results if used as-is. Also, how pages are numbered (page1, page2, ...) differs with the structure of the site, and the tag structure varies as well, so check the source code of the page carefully before actually using it.

### Actual code

```python
import pandas as pd
from bs4 import BeautifulSoup
import httplib2
import re
import time
from datetime import datetime

def headline(num):
    # 1. fetch page num and parse it
    h = httplib2.Http('.cache')
    base_url = "http://~/page"
    url = base_url + str(num)
    response, content = h.request(url)
    content = content.decode('utf-8')
    soup = BeautifulSoup(content, 'lxml')
    # 2. narrow down to the block holding the news items
    data = soup.find_all('div', {"id": "primary"})
    # 3. collect the dates, handling the two mixed formats
    dates = data[0].find_all('span', class_='posted-on')
    temp = []
    for item in dates:
        date = item.text
        temp.append(date[1:11].split())
    dlist = []
    for item in temp:
        index = item[0].find("/")
        if index != -1:
            dlist.append(datetime.strptime(item[0], '%d/%m/%Y').date())
        else:
            dlist.append(datetime.strptime(item[0], '%Y-%m-%d').date())
    # 4. collect the titles and URLs
    newdata = data[0].find_all('h2', class_='entry-title')
    tlist = []
    ulist = []
    for item in newdata:
        urls = re.search('href="(?P<URL>.+?)"', str(item)).group('URL')
        titles = item.get_text()
        ulist.append(urls)
        tlist.append(re.sub(r'\n|\r|\t', '', titles))
    # 5. build the data frame for this page
    list_headline = pd.DataFrame({'date': dlist,
                                  'headline': tlist,
                                  'url': ulist})
    return list_headline

# 7. fetch pages 1 to 5, pausing between requests
headlines = headline(1)
time.sleep(5)
for i in range(2, 6):
    temp = headline(i)
    headlines = pd.concat([headlines, temp])
    time.sleep(5)
    print(i)

# 8. save the result
#headlines.to_csv(datetime.today().strftime("%Y%m%d") + 'FILENAME.csv')
headlines.to_excel('/Users/USERNAME/FOLDERNAME/' + datetime.today().strftime("%Y%m%d") + 'FILENAME.xlsx')  # use this if Excel format is preferred
```