Import the required libraries: pandas, BeautifulSoup, httplib2, re, time, and datetime.

```python
import pandas as pd
from bs4 import BeautifulSoup
import httplib2
import re
import time
from datetime import datetime
```

1. Fetch the page. The target site is paginated as http://~/page/1, http://~/page/2, ... and so on, so the page number is held in `num` (set to 1 here; it is assumed that any integer the site serves can be entered). The fetched content is parsed into `soup` with `soup = BeautifulSoup(content, 'lxml')`.

```python
num = 1
h = httplib2.Http('.cache')   # responses are cached in the .cache directory
base_url = "http://~/page"
url = base_url + str(num)
response, content = h.request(url)
content = content.decode('utf-8')
soup = BeautifulSoup(content, 'lxml')
```
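httplib2 returns the response metadata alongside the body, so the fetch can be sanity-checked before parsing. A minimal optional sketch, not part of the original flow (and since the URL above is a placeholder, illustrative only):

```python
# response is the dict-like httplib2.Response metadata for the request
if response.status != 200:
    raise RuntimeError('request failed with status ' + str(response.status))
```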
2. From the parsed page, extract the `div` tags whose `id` attribute is `"primary"` and store them in `data` (this pulls out only the block holding the individual news items from everything else on the homepage).

```python
data = soup.find_all('div', {"id": "primary"})
```
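Note that `find_all` returns a list-like `ResultSet` even when only one element matches, which is why the block is addressed as `data[0]` in the steps below. A quick self-contained check on a toy document:

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup('<div id="primary">news</div>', 'lxml')
matches = doc.find_all('div', {"id": "primary"})
print(type(matches))    # <class 'bs4.element.ResultSet'>
print(matches[0].text)  # news
```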
3. Extract the date information from the `data` obtained in 2. and store it in `dates` (as noted above, the further `find_all` has to be applied to `data[0]`). The `dates` elements contain the time as well as the date; only the date is needed here, so it is sliced out and stored in `temp`. Each entry of `temp` is then converted to a datetime type and collected in a list. Because the original data mixes `%d/%m/%Y`-style and `%Y-%m-%d`-style strings, `index` is used to classify the two cases before conversion.

```python
dates = data[0].find_all('span', class_='posted-on')
temp = []
for item in dates:
    date = item.text
    temp.append(date[1:11].split())      # keep only the date part of the string
dlist = []
for item in temp:
    index = item[0].find("/")            # classify the two date formats by "/"
    if index != -1:
        dlist.append(datetime.strptime(item[0], '%d/%m/%Y').date())
    else:
        dlist.append(datetime.strptime(item[0], '%Y-%m-%d').date())
```
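To see what the two-format branch does, here is a small self-contained sketch; the sample strings are made up, since the real ones come from the page:

```python
from datetime import datetime

samples = ["01/02/2020", "2020-02-01"]       # hypothetical mixed-format dates
for s in samples:
    fmt = '%d/%m/%Y' if "/" in s else '%Y-%m-%d'
    print(datetime.strptime(s, fmt).date())  # 2020-02-01 both times
```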
4. Extract the headline information from the `data` obtained in 2. and store it in `newdata`, then collect the article titles in `tlist` and the URLs in `ulist`.
* For the headlines, the escape sequences (`\n`, `\r`, `\t`) are stripped out here.

```python
newdata = data[0].find_all('h2', class_='entry-title')
tlist = []
ulist = []
for item in newdata:
    urls = re.search('href="(?P<URL>.+?)"', str(item)).group('URL')
    titles = item.get_text()
    ulist.append(urls)
    tlist.append(re.sub(r'\n|\r|\t', '', titles))
```
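As an aside, the same URL can usually be pulled out without a regex by asking BeautifulSoup for the `<a>` tag directly. A sketch on a toy fragment, assuming each `entry-title` wraps exactly one link (the class name matches the code above; the URL is a placeholder):

```python
from bs4 import BeautifulSoup

html = '<h2 class="entry-title"><a href="http://~/news/1">Example\nheadline</a></h2>'
item = BeautifulSoup(html, 'lxml').find('h2', class_='entry-title')
link = item.find('a')
print(link.get('href'))                   # http://~/news/1
print(' '.join(link.get_text().split()))  # Example headline
```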
5. Use pandas to build a data frame of the target headline list (date, article title, URL).

```python
list_headline = pd.DataFrame({'date': dlist,
                              'headline': tlist,
                              'url': ulist})
```
6. Turn steps 1. to 5. into a function. `num` is made a variable so that pages with the same structure can be fetched automatically according to its value (`url = base_url + str(num)` in 1. prepares for this). Declare the function name (here `headline`) and the variable (here `num`) with `def`, and write the function body indented (see "Actual code" below for the full version).

```python
def headline(num):
    h = httplib2.Http('.cache')
    base_url = "http://~/page"
    url = base_url + str(num)
    # ...omitted: see "Actual code" below...
    return list_headline
```
7. Run the code for `num` from 1 to 5. First call `headline(1)` and store the result in `headlines`; this is needed because an empty object cannot be built up from inside the loop. A pause is inserted between requests with `time.sleep(5)`. For `num` from 2 to 5, a `for` loop applies the function repeatedly and stacks each result onto `headlines` (picture new data frames being piled on top of the existing frame). `print(i)` is included for error checking.

```python
headlines = headline(1)
time.sleep(5)
for i in range(2, 6):    # i = 2, 3, 4, 5
    temp = headline(i)
    headlines = pd.concat([headlines, temp])
    time.sleep(5)
    print(i)
```
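The stacking that `pd.concat` performs can be seen on a toy example. Note that each page keeps its own 0-based index; passing `ignore_index=True` (optional, not in the original code) gives one continuous index instead. The values below are made up:

```python
import pandas as pd

page1 = pd.DataFrame({'date': ['2020-02-01'], 'headline': ['A'], 'url': ['http://~/a']})
page2 = pd.DataFrame({'date': ['2020-02-02'], 'headline': ['B'], 'url': ['http://~/b']})
stacked = pd.concat([page1, page2], ignore_index=True)
print(stacked)  # two rows, indexed 0 and 1
```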
8. Save the result under a date-stamped file name.

```python
#headlines.to_csv(datetime.today().strftime("%Y%m%d") + 'FILENAME.csv')  # .csv is generally easier to work with and recommended
headlines.to_excel('/Users/USERNAME/FOLDERNAME/' + datetime.today().strftime("%Y%m%d") + 'FILENAME.xlsx')  # use this if Excel format is preferred
```
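One practical note, as an assumption about the environment rather than something stated above: `to_excel` needs an Excel writer engine such as openpyxl installed (`pip install openpyxl`), whereas `to_csv` has no extra dependency; `index=False` keeps the row index out of the file:

```python
# variant of the save above; index=False drops the data frame's row index
headlines.to_csv(datetime.today().strftime("%Y%m%d") + 'FILENAME.csv', index=False)
```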
The URL (`base_url`) and the save destination (see 8. above) contain fictitious values, so this code will not return results if used as-is. Also, how pages are numbered (page1, page2, ...) differs with the structure of the site, and the tag structure varies as well, so check the source code of the page carefully before actually using it.

### Actual code

```python
import pandas as pd
from bs4 import BeautifulSoup
import httplib2
import re
import time
from datetime import datetime

def headline(num):
    # 1. fetch page num and parse it
    h = httplib2.Http('.cache')
    base_url = "http://~/page"
    url = base_url + str(num)
    response, content = h.request(url)
    content = content.decode('utf-8')
    soup = BeautifulSoup(content, 'lxml')
    # 2. narrow down to the block holding the news items
    data = soup.find_all('div', {"id": "primary"})
    # 3. collect the dates, handling the two mixed formats
    dates = data[0].find_all('span', class_='posted-on')
    temp = []
    for item in dates:
        date = item.text
        temp.append(date[1:11].split())
    dlist = []
    for item in temp:
        index = item[0].find("/")
        if index != -1:
            dlist.append(datetime.strptime(item[0], '%d/%m/%Y').date())
        else:
            dlist.append(datetime.strptime(item[0], '%Y-%m-%d').date())
    # 4. collect the titles and URLs
    newdata = data[0].find_all('h2', class_='entry-title')
    tlist = []
    ulist = []
    for item in newdata:
        urls = re.search('href="(?P<URL>.+?)"', str(item)).group('URL')
        titles = item.get_text()
        ulist.append(urls)
        tlist.append(re.sub(r'\n|\r|\t', '', titles))
    # 5. build the data frame for this page
    list_headline = pd.DataFrame({'date': dlist,
                                  'headline': tlist,
                                  'url': ulist})
    return list_headline

# 7. fetch pages 1 to 5, pausing between requests
headlines = headline(1)
time.sleep(5)
for i in range(2, 6):
    temp = headline(i)
    headlines = pd.concat([headlines, temp])
    time.sleep(5)
    print(i)

# 8. save the result
#headlines.to_csv(datetime.today().strftime("%Y%m%d") + 'FILENAME.csv')
headlines.to_excel('/Users/USERNAME/FOLDERNAME/' + datetime.today().strftime("%Y%m%d") + 'FILENAME.xlsx')  # use this if Excel format is preferred
```