Scraping using Python

Overview

0. Packages

import pandas as pd
from bs4 import BeautifulSoup
import httplib2
import re
import time
from datetime import datetime

1. Web data scraping

num = 1
h = httplib2.Http('.cache')  # cache responses in the .cache directory
base_url = "http://~/page"
url = base_url + str(num)
response, content = h.request(url)
content = content.decode('utf-8')
soup = BeautifulSoup(content, 'lxml')

2. Narrow the scraped data down to the individual news items

data = soup.find_all('div', {"id": "primary"})
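`find_all` returns a list of every matching tag, so `data[0]` is the first (here, only) `div` with `id="primary"`. As a minimal illustration on a toy HTML snippet (using the built-in `html.parser` so no extra parser needs to be installed; the snippet itself is made up):

```python
from bs4 import BeautifulSoup

html = '<div id="primary"><h2 class="entry-title"><a href="http://example.com/p1">First post</a></h2></div>'
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list of matching tags
data = soup.find_all('div', {"id": "primary"})
print(len(data))                # 1
print(data[0].h2.get_text())    # First post
```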

3. Obtain the date and shape the timestamp

dates = data[0].find_all('span', class_='posted-on')

temp = []
for item in dates:
    date = item.text
    temp.append(date[1:11].split())  # keep only the 10-character date portion

dlist = []
for item in temp:
    if item[0].find("/") != -1:
        # slash-delimited dates, e.g. 25/12/2020
        dlist.append(datetime.strptime(item[0], '%d/%m/%Y').date())
    else:
        # ISO-style dates, e.g. 2020-12-25
        dlist.append(datetime.strptime(item[0], '%Y-%m-%d').date())
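The two-format branch above can be condensed into a small helper. This is a sketch under the same assumption as the article (dates arrive either as `DD/MM/YYYY` or `YYYY-MM-DD`); the function name is made up for illustration:

```python
from datetime import datetime

def parse_post_date(raw):
    """Parse a scraped date string in either DD/MM/YYYY or YYYY-MM-DD form."""
    fmt = '%d/%m/%Y' if '/' in raw else '%Y-%m-%d'
    return datetime.strptime(raw, fmt).date()

print(parse_post_date('25/12/2020'))  # 2020-12-25
print(parse_post_date('2020-12-25'))  # 2020-12-25
```

Both branches normalize to a `datetime.date`, so the resulting column sorts and compares correctly regardless of the source format.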

4. Get the headline and URL

newdata = data[0].find_all('h2', class_='entry-title')
tlist = []
ulist = []
for item in newdata:
    urls = re.search('href="(?P<URL>.+?)"', str(item)).group('URL')  # pull the link out of the <a> tag
    titles = item.get_text()
    ulist.append(urls)
    tlist.append(re.sub(r'\n|\r|\t', '', titles))  # strip stray whitespace characters
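The regex pulls out whatever sits between `href="` and the next `"`, and the `re.sub` call removes newlines, carriage returns, and tabs from the title text. A small self-contained demonstration on a made-up snippet:

```python
import re

# a toy <h2 class="entry-title"> element as a string (hypothetical URL)
snippet = '<h2 class="entry-title"><a href="http://example.com/post-1" rel="bookmark">\n\tSample headline\n</a></h2>'

# non-greedy .+? stops at the first closing quote after href="
url = re.search('href="(?P<URL>.+?)"', snippet).group('URL')

# drop newline/carriage-return/tab characters from the extracted text
title = re.sub(r'\n|\r|\t', '', '\n\tSample headline\n')

print(url)    # http://example.com/post-1
print(title)  # Sample headline
```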

5. Collect the acquired information into a data frame

list_headline = pd.DataFrame({'date': dlist,
                              'headline': tlist,
                              'url': ulist})

6. Wrap everything in a function

def headline(num):
    h = httplib2.Http('.cache')
    base_url = "http://~/page"
    url = base_url + str(num)
    # ...steps 1-5 above, omitted here; see the full code below...
    return list_headline

7. Repeat over pages

headlines = headline(1)
time.sleep(5)  # pause between requests so as not to hammer the server

for i in range(2, 5):
    temp = headline(i)
    headlines = pd.concat([headlines, temp])
    time.sleep(5)
    print(i)
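Note that `pd.concat` keeps each page's original row labels, so the combined frame ends up with duplicate index values. If a clean consecutive index is preferred, `ignore_index=True` renumbers the rows; a minimal sketch with made-up single-row frames:

```python
import pandas as pd

# two toy one-row frames standing in for two scraped pages
a = pd.DataFrame({'date': ['2020-01-01'], 'headline': ['A'], 'url': ['http://example.com/a']})
b = pd.DataFrame({'date': ['2020-01-02'], 'headline': ['B'], 'url': ['http://example.com/b']})

# ignore_index=True renumbers rows 0..n-1 instead of keeping [0, 0]
combined = pd.concat([a, b], ignore_index=True)
print(len(combined))           # 2
print(list(combined.index))    # [0, 1]
```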

8. Save

# headlines.to_csv(datetime.today().strftime("%Y%m%d") + 'FILENAME.csv')  # CSV is generally easier to work with and recommended
headlines.to_excel('/Users/USERNAME/FOLDERNAME/' + datetime.today().strftime("%Y%m%d") + 'FILENAME.xlsx')  # use this line if Excel format is preferred

Actual code

import pandas as pd
from bs4 import BeautifulSoup
import httplib2
import re
import time
from datetime import datetime

def headline(num):
    h = httplib2.Http('.cache')
    base_url = "http://~/page"
    url = base_url + str(num)
    response, content = h.request(url)
    content = content.decode('utf-8')
    soup = BeautifulSoup(content, 'lxml')
    data = soup.find_all('div', {"id": "primary"})

    # dates
    dates = data[0].find_all('span', class_='posted-on')
    temp = []
    for item in dates:
        date = item.text
        temp.append(date[1:11].split())
    dlist = []
    for item in temp:
        if item[0].find("/") != -1:
            dlist.append(datetime.strptime(item[0], '%d/%m/%Y').date())
        else:
            dlist.append(datetime.strptime(item[0], '%Y-%m-%d').date())

    # headlines and URLs
    newdata = data[0].find_all('h2', class_='entry-title')
    tlist = []
    ulist = []
    for item in newdata:
        urls = re.search('href="(?P<URL>.+?)"', str(item)).group('URL')
        titles = item.get_text()
        ulist.append(urls)
        tlist.append(re.sub(r'\n|\r|\t', '', titles))

    list_headline = pd.DataFrame({'date': dlist,
                                  'headline': tlist,
                                  'url': ulist})
    return list_headline

headlines = headline(1)
time.sleep(5)

for i in range(2, 5):
    temp = headline(i)
    headlines = pd.concat([headlines, temp])
    time.sleep(5)
    print(i)

# headlines.to_csv(datetime.today().strftime("%Y%m%d") + 'FILENAME.csv')
headlines.to_excel('/Users/USERNAME/FOLDERNAME/' + datetime.today().strftime("%Y%m%d") + 'FILENAME.xlsx')  # use this line if Excel format is preferred
