Use zip.
Given a site's article list page, I use BeautifulSoup4 to get the URL and title of each article.
The goal is a list of strings in which each URL and title are joined by >>>, that is, ['url >>> title', 'url >>> title'].
If the title is the text of the a tag, as on the page below ...
import requests, bs4
res = requests.get('https://qiita.com/takuto_neko_like')
posts = bs4.BeautifulSoup(res.text, 'html.parser').select('.u-link-no-underline')
print(posts)
The variable posts is a list of <class 'bs4.element.Tag'> objects, and you can see the HTML of each a tag by accessing it individually, for example posts[0].
The title is the text content of each a tag:
[<a class="u-link-no-underline" href="/takuto_neko_like/items/52c6c52385386544aa62">Where I was worried about heroku</a>, <a class="u-link-no-underline" href="/takuto_neko_like/items/c5791f267e0964e09d03">I made a tool to get new articles</a>, <a class="u-link-no-underline" href="/takuto_neko_like/items/93b3751984e5e3fd3670">fish is moving too slowly git trouble</a>, <a class="u-link-no-underline" href="/takuto_neko_like/items/62aeb4271614f6f0347f">Use Plotly graphs with Django</a>, <a class="u-link-no-underline" href="/takuto_neko_like/items/c9c80ff453d0c4fad239">【Python】super()Reasons to override using</a>, <a class="u-link-no-underline" href="/takuto_neko_like/items/14e92797fa2b23a64adb">[Python] What is inherited by multiple inheritance?</a>, <a class="u-link-no-underline" href="/takuto_neko_like/items/6cf9bade3d9515a724c0">【Python】@What are classmethods and decorators?</a>, <a class="u-link-no-underline" href="/takuto_neko_like/items/aed9dd5619d8457d4894">【Python】*args **What is kwrgs</a>, <a class="u-link-no-underline" href="/takuto_neko_like/items/bb8d0957347636b5bf4f">[Bootstrap] How to fix and display navbar even if scrolling, points to keep in mind and solutions</a>]
Each item is a <class 'bs4.element.Tag'> (a Tag object), so once you convert it to a string with str(), you can call .find() on it to locate the tag name and attributes. Slicing the string with the indexes that .find() returns extracts just the URL part and the title part.
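To make the index arithmetic concrete, here is a minimal sketch on a single hard-coded string, copied from the first a tag in the output above. The variable names are only illustrative. The offset 6 skips the six characters of `href="`, and the offset 2 skips `">`.

```python
# One a tag as a plain string, taken from the printed output above.
tag_text = ('<a class="u-link-no-underline" '
            'href="/takuto_neko_like/items/52c6c52385386544aa62">'
            'Where I was worried about heroku</a>')

# URL: starts right after href=" and ends at the first ">
url_start = tag_text.find('href=') + 6
url_end = tag_text.find('">')
url = tag_text[url_start:url_end]

# Title: starts right after "> and ends at </a
title_start = tag_text.find('">') + 2
title_end = tag_text.find('</a')
title = tag_text[title_start:title_end]

print(url)    # /takuto_neko_like/items/52c6c52385386544aa62
print(title)  # Where I was worried about heroku
```

This only works while href is the last attribute of the tag, because the first `">` is taken as the end of the URL.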
Format url and title
url_title_set = []
for post in posts:
    post_text = str(post)
    # Extract URL (the text between href=" and ">)
    index_first = post_text.find('href=') + 6
    index_end = post_text.find('">')
    url = post_text[index_first:index_end]
    # Extract title (the text between "> and </a)
    index_first = post_text.find('">') + 2
    index_end = post_text.find('</a')
    # Replace full-width spaces with ordinary spaces
    title = post_text[index_first:index_end].replace('\u3000', ' ')
    url_title_set.append(f"{url}>>>{title}")
With a page like that, this is all you need.
However, there are many sites where the title is not the text of the a tag. For example, article information may be displayed as a card consisting of an image and a title, with the link wrapping the entire card.
Example
<div class='card'><a href='#' class='link'>
<div class='image'><img src='#'></div>
<div class='title'>title</div>
</a>
</div>
In such a case, passing the class card to bs4's .select returns the div tags that have the card class. From each of those I want the href of the a tag and the text of the div with the title class.
In the actual code there are more nested elements, so searching the parent element's string with .find for one particular substring gets a bit annoying.
bs4 also has ways to access child elements, but they looked a bit cumbersome when I read the docs, so I fetched the links and the titles separately as follows.
posts_links = bs4.BeautifulSoup(res.text, 'html.parser').select('.link')
posts_titles = bs4.BeautifulSoup(res.text, 'html.parser').select('.title')
In the earlier "Format url and title" code, a single for statement looped over the posts list and accessed each Tag object, but this time there are two lists. I want to loop over both lists at the same time, combine the URL and title taken from each, and store the result in a new list.
When you want to iterate over multiple lists at the same time in one for statement like this, use zip.
now_posts_link_title_set = []
for (posts_link, posts_title) in zip(posts_links, posts_titles):
    # Extract the URL from the link tag
    index_first = str(posts_link).find('href=') + 6
    index_end = str(posts_link).find('">')
    posts_link_set = str(posts_link)[index_first:index_end]
    # Extract the title: the text after 'h2>' up to '</h2'
    index_first = str(posts_title).find('h2') + 3
    index_end = str(posts_title).find('</h2')
    posts_title_set = str(posts_title)[index_first:index_end].replace('\u3000', ' ')  # Whitespace replacement
    now_posts_link_title_set.append(f"{posts_link_set}>>>{posts_title_set}")
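Stripped of the HTML parsing, the pairing step can be sketched with plain strings. The sample values and the names links, titles, and link_title_set are made up for illustration. A list comprehension over zip builds the same kind of list as the loop above.

```python
# Hypothetical, already-extracted values standing in for the scraped data.
links = ['/items/1', '/items/2', '/items/3']
titles = ['First post', 'Second post', 'Third post']

# zip yields one (link, title) pair per position in the two lists.
link_title_set = [f"{link}>>>{title}" for link, title in zip(links, titles)]

print(link_title_set)
# ['/items/1>>>First post', '/items/2>>>Second post', '/items/3>>>Third post']
```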
zip also works with more than two lists
for (a, b, c, d) in zip(a_list, b_list, c_list, d_list):
    print(a, b, c, d)
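As a small illustration with three lists, zip produces one tuple per position, which is why the loop variables can be unpacked in the for statement. The lists here (urls, titles, dates) are invented sample data.

```python
# Hypothetical sample data: three lists of equal length.
urls = ['/items/1', '/items/2']
titles = ['First post', 'Second post']
dates = ['2020-01-01', '2020-01-02']

# Materialize the zip object to see the tuples it yields.
rows = list(zip(urls, titles, dates))

print(rows)
# [('/items/1', 'First post', '2020-01-01'), ('/items/2', 'Second post', '2020-01-02')]
```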
If the lists have different numbers of elements, zip stops at the end of the shortest list, and the extra elements of the longer list are ignored
aa = [1,2,3,4,5]
bb = ['a', 'b', 'c']
for (a, b) in zip(aa, bb):
    print(f'{a} : {b}')
#result
1 : a
2 : b
3 : c
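If dropping the extra elements is not what you want, the standard library's itertools.zip_longest keeps going until the longest list ends and fills the gaps with a placeholder. The sketch below reuses aa and bb from above; the '-' fill value is an arbitrary choice. On Python 3.10 and later, zip(aa, bb, strict=True) is another option that raises ValueError when the lengths differ.

```python
from itertools import zip_longest

aa = [1, 2, 3, 4, 5]
bb = ['a', 'b', 'c']

# zip_longest pads the shorter list with fillvalue instead of stopping early.
rows = [f'{a} : {b}' for a, b in zip_longest(aa, bb, fillvalue='-')]

print(rows)
# ['1 : a', '2 : b', '3 : c', '4 : -', '5 : -']
```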
This is convenient