I'm thinking of eventually using Scrapy, but as a first step in web scraping with Python, I tried fetching information from the Web with "Requests" and "lxml".
- Get information from the Web using "Requests"
- Extract the necessary information from the retrieved HTML using "lxml"
pip install requests
pip install lxml
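To confirm that both libraries installed correctly, a quick import check like this one-liner (just a sanity check of my own, not from the original steps) works:

python -c "import requests, lxml.html; print('OK')"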
I placed the following test HTML on EC2 and tested against it over the Internet.
test.html
<html>
<body>
<div id="test1">test1
<ul id="test1_ul">test1 ul</ul>
</div>
</body>
</html>
- The URL is passed as a command-line argument, and the HTML it returns is processed
- The User-Agent is changed to a Mac browser, just in case
(Error handling for a missing argument and the like is not implemented; see the sketch right after the script.)
scraping.py
import sys
import requests
import lxml.html

# Set a dummy User-Agent (Safari on macOS) so the request looks like a normal browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8'}

# The target URL is given as the first command-line argument
url = ''
if len(sys.argv) > 1:
    url = sys.argv[1]

# Fetch the page and parse the HTML
response = requests.get(url, headers=headers)
html = lxml.html.fromstring(response.content)

# Select every element whose id is "test1_ul" and print its text
for elem in html.xpath('//*[@id="test1_ul"]'):
    print(elem.text)
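As noted above, error handling is omitted. A rough sketch of what it might look like (the usage message and the 10-second timeout are my own choices, not from the original) is:

import sys
import requests
import lxml.html

# Same dummy User-Agent as above
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8'}

# Exit with a usage message when no URL argument is given
if len(sys.argv) < 2:
    sys.exit('Usage: python scraping.py <URL>')
url = sys.argv[1]

try:
    # Fail fast on network problems or non-2xx responses
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
except requests.RequestException as e:
    sys.exit('Request failed: {}'.format(e))

html = lxml.html.fromstring(response.content)
for elem in html.xpath('//*[@id="test1_ul"]'):
    print(elem.text)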
Run scraping.py as follows; any URL can be passed as the argument.
python scraping.py http://ec2******
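Against the test.html placed on EC2 above, this should print the text of the element with id "test1_ul":

test1 ul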
Chrome's developer tools make it easy to grab an element's XPath or CSS selector, which is convenient.
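Incidentally, lxml can also filter by CSS selector directly through the separate cssselect package (installed with pip install cssselect); reusing the parsed html object from scraping.py, a minimal sketch:

# CSS-selector equivalent of the XPath above (requires: pip install cssselect)
for elem in html.cssselect('#test1_ul'):
    print(elem.text)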