Import the libraries used in this script: `pandas`, `BeautifulSoup` (from `bs4`), `httplib2`, `re`, `time`, and `datetime`.

```python
import pandas as pd
from bs4 import BeautifulSoup
import httplib2
import re
import time
from datetime import datetime
```
1. Fetch the page and parse it. The target site paginates its article list as `http://~/page/1`, `http://~/page/2`, and so on, so the page number is kept in the variable `num` (it is assumed that an arbitrary integer can be supplied). The fetched content is parsed into `soup` with `BeautifulSoup(content, 'lxml')`.

```python
num = 1
h = httplib2.Http('.cache')
base_url = "http://~/page"
url = base_url + str(num)
response, content = h.request(url)
content = content.decode('utf-8')
soup = BeautifulSoup(content, 'lxml')
```
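By default `httplib2` returns a response object even for error statuses rather than raising, so it can be worth checking the status before parsing. A minimal sketch (the status check and the error message are additions for illustration, not part of the original code):

```python
# Fail fast on non-200 responses before handing the body to BeautifulSoup.
if response.status != 200:
    raise RuntimeError('Request for ' + url + ' failed with HTTP status ' + str(response.status))
```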
2. From `soup`, extract the `div` elements whose `id` attribute is `"primary"` and store them in `data`. This pulls out only the information about the individual news items from everything else on the page.

```python
data = soup.find_all('div', {"id": "primary"})
```
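`find_all` returns an empty list when nothing matches, in which case the `data[0]` lookups in the following steps would fail with an `IndexError`. A small guard, added here as a suggestion rather than something from the original code:

```python
# Guard against pages that have no <div id="primary"> at all.
if not data:
    raise ValueError('No <div id="primary"> found; check the page structure.')
```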
3. Extract the posting dates from the `data` obtained in 2. and store them in `dates`. Since `data` was produced by `find_all`, note that a further `find_all` has to be applied to `data[0]`. Each element of `dates` contains the time as well as the date; only the date is needed here, so just that part is sliced out and stored in `temp`. Each entry is then converted to a `datetime` type and stored in `dlist`. Because the original data mixes `%d/%m/%Y`-style and `%Y-%m-%d`-style dates, `index` is used to distinguish the two cases before converting.

```python
dates = data[0].find_all('span', class_='posted-on')
temp = []
for item in dates:
    date = item.text
    temp.append(date[1:11].split())

dlist = []
for item in temp:
    index = item[0].find("/")
    if index != -1:
        dlist.append(datetime.strptime(item[0], '%d/%m/%Y').date())
    else:
        dlist.append(datetime.strptime(item[0], '%Y-%m-%d').date())
```
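To see what the branching does, here is a worked example on two hypothetical date strings of the kinds described above (the sample values are invented for illustration):

```python
# One slash-separated and one hyphen-separated sample date.
for sample in ['31/12/2020', '2020-12-31']:
    fmt = '%d/%m/%Y' if sample.find('/') != -1 else '%Y-%m-%d'
    print(datetime.strptime(sample, fmt).date())  # both print 2020-12-31
```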
4. Extract the headlines from the `data` obtained in 2. and store them in `newdata`, then build `tlist` (headline titles) and `ulist` (URLs). For the headlines, the escape sequences (`\n`, `\r`, `\t`) are stripped out.

```python
newdata = data[0].find_all('h2', class_='entry-title')
tlist = []
ulist = []
for item in newdata:
    urls = re.search('href="(?P<URL>.+?)"', str(item)).group('URL')
    titles = item.get_text()
    ulist.append(urls)
    tlist.append(re.sub(r'\n|\r|\t', '', titles))
```
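As a side note, the URL can also be read through BeautifulSoup's own API instead of a regex over the element's string form. A sketch of that alternative, replacing the loop above and assuming each `h2.entry-title` wraps a single `<a>` tag (which the original regex also implies):

```python
# Alternative: let bs4 resolve the link instead of re.search on str(item).
for item in newdata:
    link = item.find('a')  # the <a> inside the headline
    if link is not None:
        ulist.append(link.get('href'))
        tlist.append(re.sub(r'\n|\r|\t', '', item.get_text()))
```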
5. Use `pandas` to create a data frame of the target headline list (date, article title, URL).

```python
list_headline = pd.DataFrame({'date': dlist,
                              'headline': tlist,
                              'url': ulist})
```
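Note that `pd.DataFrame` requires the three lists to be the same length; if the page yields a different number of dates than headlines, construction fails. A quick sanity check (an addition for illustration, not in the original):

```python
# The DataFrame constructor raises if the columns differ in length,
# so verify the scrape produced exactly one date per headline.
assert len(dlist) == len(tlist) == len(ulist), (len(dlist), len(tlist), len(ulist))
```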
6. Wrap steps 1-5 in a function. `num` is kept as a parameter so that pages with the same structure can be fetched automatically according to its value (`url = base_url + str(num)` in 1. is what makes this possible). Declare the function name (here `headline`) and its parameter (here `num`) with `def`, and write the contents of the function indented (see "Actual Code" below for the full body).

```python
def headline(num):
    h = httplib2.Http('.cache')
    base_url = "http://~/page"
    url = base_url + str(num)
    # ...steps 1-5 above, omitted here...
    return list_headline
```
7. Run the code for `num` from 1 to 5. First apply the function with `num = 1` and store the result in `headlines`; this is necessary because a loop cannot append to an object that does not exist yet. Then, for `num` from 2 to 5, apply the function iteratively with `for` and add each result to `headlines` (picture stacking new data frames onto the existing frame). A five-second pause (`time.sleep(5)`) keeps the requests spaced out, and `print(i)` is used for error checking.

```python
headlines = headline(1)
time.sleep(5)
for i in range(2, 6):  # pages 2 through 5
    temp = headline(i)
    headlines = pd.concat([headlines, temp])
    time.sleep(5)
    print(i)
```
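Calling `pd.concat` inside the loop re-copies the accumulated frame on every iteration. An equivalent and slightly more idiomatic pattern is to collect the per-page frames in a list and concatenate once at the end, which also removes the need to seed `headlines` before the loop; a sketch under the same pages-1-to-5 assumption:

```python
# Alternative: build all page frames first, then concatenate once.
frames = []
for i in range(1, 6):  # pages 1 through 5
    frames.append(headline(i))
    time.sleep(5)      # keep the requests spaced out
    print(i)           # progress / error checking
headlines = pd.concat(frames, ignore_index=True)
```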
8. Save the result. A `.csv` file is generally easier to work with and is recommended; use the `.xlsx` line instead if Excel format is preferred.

```python
# headlines.to_csv(datetime.today().strftime("%Y%m%d") + 'FILENAME.csv')  # .csv is easier to use and recommended
headlines.to_excel('/Users/USERNAME/FOLDERNAME/' + datetime.today().strftime("%Y%m%d") + 'FILENAME.xlsx')  # use this if Excel format is preferred
```
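If the `.csv` will be opened in Excel and the headlines contain non-ASCII characters, writing with a BOM avoids garbled text. A variant of the save line (the `index=False` and `encoding` arguments are additions for illustration):

```python
# utf-8-sig adds a BOM so Excel detects the encoding; index=False drops the row numbers.
headlines.to_csv(datetime.today().strftime("%Y%m%d") + 'FILENAME.csv',
                 index=False, encoding='utf-8-sig')
```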
Note that the URL (`base_url`) and the save destination (see 8. above) contain fictitious placeholders, so this code will not return results if used as-is. The pagination scheme (`page1`, `page2`, ...) also differs depending on the structure of the site, as does the tag structure, so check the page's source code carefully before actually using it.

Actual Code:

```python
import pandas as pd
from bs4 import BeautifulSoup
import httplib2
import re
import time
from datetime import datetime
def headline(num):
    h = httplib2.Http('.cache')
    base_url = "http://~/page"
    url = base_url + str(num)
    response, content = h.request(url)
    content = content.decode('utf-8')
    soup = BeautifulSoup(content, 'lxml')
    # Isolate the block that holds the individual news items.
    data = soup.find_all('div', {"id": "primary"})
    # Dates: slice out the date part and convert to datetime,
    # branching on the two formats that appear in the source.
    dates = data[0].find_all('span', class_='posted-on')
    temp = []
    for item in dates:
        date = item.text
        temp.append(date[1:11].split())
    dlist = []
    for item in temp:
        index = item[0].find("/")
        if index != -1:
            dlist.append(datetime.strptime(item[0], '%d/%m/%Y').date())
        else:
            dlist.append(datetime.strptime(item[0], '%Y-%m-%d').date())
    # Headlines and URLs, with escape sequences stripped from the titles.
    newdata = data[0].find_all('h2', class_='entry-title')
    tlist = []
    ulist = []
    for item in newdata:
        urls = re.search('href="(?P<URL>.+?)"', str(item)).group('URL')
        titles = item.get_text()
        ulist.append(urls)
        tlist.append(re.sub(r'\n|\r|\t', '', titles))
    list_headline = pd.DataFrame({'date': dlist,
                                  'headline': tlist,
                                  'url': ulist})
    return list_headline

headlines = headline(1)
time.sleep(5)
for i in range(2, 6):  # pages 2 through 5
    temp = headline(i)
    headlines = pd.concat([headlines, temp])
    time.sleep(5)
    print(i)

# headlines.to_csv(datetime.today().strftime("%Y%m%d") + 'FILENAME.csv')
headlines.to_excel('/Users/USERNAME/FOLDERNAME/' + datetime.today().strftime("%Y%m%d") + 'FILENAME.xlsx')  # use this if Excel format is preferred
```