I was writing a program that gets a list of headlines and URLs from the Yahoo News site and displays each item on one line. I had some trouble aligning the URL column neatly, so I am writing this article for future reference.
The data comes from the following site.
The final output text is as follows.
I used Python 3.7. The development environment is Visual Studio Community 2019.
import requests
import unicodedata
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def main():
    base_url = 'https://news.yahoo.co.jp/'
    categories = {
        'Major': '',
        'Domestic': 'categories/domestic',
        #'Entertainment': 'categories/entertainment',
        #'international': 'categories/world',
        #'Economy': 'categories/business',
    }

    # Loop processing for each category
    for cat in categories:
        url = urljoin(base_url, categories[cat])
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')  # html.parser
        ul_tag = soup.find('div', class_='topicsList')\
                     .find('ul', class_='topicsList_main')
        print(f'==={cat}===')
        for item in ul_tag.find_all('li', class_='topicsListItem'):
            a = item.find('a')
            topic_url = a['href']
            topic_headline = a.text.strip()
            #print(f'{topic_headline:<18}[{topic_url}]')
            text = text_align(topic_headline, 30)
            print(f'{text}[{topic_url}]')
        print()


def get_han_count(text):
    '''
    Calculate the length of the character string with "2" for full-width characters and "1" for half-width characters.
    '''
    count = 0
    for char in text:
        if unicodedata.east_asian_width(char) in 'FWA':
            count += 2
        else:
            count += 1
    return count


def text_align(text, width, *, align=-1, fill_char=' '):
    '''
    Text with mixed full-width / half-width
    Fill in blanks so that it has the specified length (half-width conversion)

    width: Specify the number of characters in half-width conversion
    align: -1 -> left, 1 -> right
    fill_char: Specify the character to fill
    return: Text filled with whitespace ('abcde ')
    '''
    fill_count = width - get_han_count(text)
    if fill_count <= 0:
        return text

    if align < 0:
        return text + fill_char * fill_count
    else:
        return fill_char * fill_count + text


if __name__ == '__main__':
    main()
Initially, I formatted the output text as follows.
for item in ul_tag.find_all('li', class_='topicsListItem'):
    a = item.find('a')
    topic_url = a['href']
    topic_headline = a.text.strip()
    # This code will shift the URL column.
    print(f'{topic_headline:<18}[{topic_url}]')
In this case, the output will be as follows.
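A format specification such as print(f'{topic_headline:<18}[{topic_url}]')
pads by character count and treats full-width and half-width characters alike, even though a full-width character occupies two columns in a fixed-pitch display. Headlines containing more full-width characters therefore push the URL column further to the right.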
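To make the mismatch concrete, here is a minimal sketch that compares what the format specification counts with the width actually displayed. The sample strings are made up for illustration, and display_width is a hypothetical helper name, not something from the original program.

```python
import unicodedata


def display_width(text):
    """Count full-width (F, W) and ambiguous (A) characters as 2 columns, others as 1."""
    return sum(2 if unicodedata.east_asian_width(ch) in ('F', 'W', 'A') else 1
               for ch in text)


samples = ['News', 'ニュース', 'Yahoo!ニュース']  # made-up sample strings

for s in samples:
    padded = f'{s:<18}'
    # The format spec pads to 18 characters, so len(padded) is always 18,
    # but the displayed width grows with every full-width character.
    print(len(padded), display_width(padded), repr(padded))
```

Every padded string has the same len(), yet the display width differs from line to line, which is exactly the shift seen in the URL column.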
Therefore, I created a function to distinguish between full-width and half-width text and insert the required white space.
def get_han_count(text):
    '''
    Calculate the length of the character string with "2" for full-width characters and "1" for half-width characters.
    '''
    count = 0
    for char in text:
        if unicodedata.east_asian_width(char) in 'FWA':
            count += 2
        else:
            count += 1
    return count


def text_align(text, width, *, align=-1, fill_char=' '):
    '''
    Text with mixed full-width / half-width
    Fill in blanks so that it has the specified length (half-width conversion)

    width: Specify the number of characters in half-width conversion
    align: -1 -> left, 1 -> right
    fill_char: Specify the character to fill
    return: Text filled with whitespace ('abcde ')
    '''
    fill_count = width - get_han_count(text)
    if fill_count <= 0:
        return text

    if align < 0:
        return text + fill_char * fill_count
    else:
        return fill_char * fill_count + text
In the end, I formatted the output with code like this:
for item in ul_tag.find_all('li', class_='topicsListItem'):
    a = item.find('a')
    topic_url = a['href']
    topic_headline = a.text.strip()
    text = text_align(topic_headline, 30)
    print(f'{text}[{topic_url}]')
This solved the problem.
In addition, text_align() has an option to pad on the left side instead (right-aligning the text),
and a fill character other than a space can be specified.
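As a quick illustration of those two options, the sketch below repeats the two functions from this article so that it runs on its own, then calls text_align() with align and fill_char. The sample string is made up.

```python
import unicodedata


def get_han_count(text):
    # 2 for full-width (and ambiguous) characters, 1 for everything else
    count = 0
    for char in text:
        if unicodedata.east_asian_width(char) in 'FWA':
            count += 2
        else:
            count += 1
    return count


def text_align(text, width, *, align=-1, fill_char=' '):
    fill_count = width - get_han_count(text)
    if fill_count <= 0:
        return text
    if align < 0:
        return text + fill_char * fill_count
    return fill_char * fill_count + text


left = text_align('東京', 10)                  # default: text on the left, padding on the right
right = text_align('東京', 10, align=1)        # padding on the left, text on the right
dotted = text_align('東京', 10, fill_char='.') # pad with dots instead of spaces

print(repr(left))
print(repr(right))
print(repr(dotted))
```

Note that fill_count is computed in half-width units, so the result only reaches the requested width when fill_char is itself a half-width character.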
By the way, the output of a Python program usually goes to the command prompt, but in this case you may want to send it to a text editor or word processor and save it.
In that case, you can use the software Paster to paste the output directly at the caret position in an editor or similar application.
The data is then pasted directly, as shown below.
Since the theme this time was text formatting, I did not explain the scraping code, but the find()
method was sufficient for almost all of it.
You are free to use the above source code, but please do so at your own risk.
For how to use the function that distinguishes between full-width and half-width characters (`unicodedata.east_asian_width()`), I referred to the following site.
When doing web scraping, be sure to check robots.txt of the target site.
news.yahoo.co.jp/robots.txt
User-agent: *
Disallow: /comment/plugin/
Disallow: /comment/violation/
Disallow: /polls/widgets/
Disallow: /articles/*/comments
Sitemap: https://news.yahoo.co.jp/sitemaps.xml
Sitemap: https://news.yahoo.co.jp/sitemaps/article.xml
Sitemap: https://news.yahoo.co.jp/byline/sitemap.xml
Sitemap: https://news.yahoo.co.jp/polls/sitemap.xml
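The check can also be done in code. The following is a minimal sketch using the standard library's urllib.robotparser, fed with the rules quoted above as in-memory text so that it runs without network access. The path /comment/plugin/example is a made-up example, not a real page.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt excerpt above (Sitemap lines omitted).
ROBOTS_TXT = """\
User-agent: *
Disallow: /comment/plugin/
Disallow: /comment/violation/
Disallow: /polls/widgets/
Disallow: /articles/*/comments
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse in-memory text; no network access

base = 'https://news.yahoo.co.jp'
# '/comment/plugin/example' is a hypothetical path used only for illustration.
for path in ['/', '/categories/domestic', '/comment/plugin/example']:
    allowed = parser.can_fetch('*', base + path)
    print(f'{path}: {"allowed" if allowed else "disallowed"}')
```

In a real script you would probably call set_url() and read() to fetch the live file instead of embedding a copy. As far as I know, the standard-library parser matches paths by simple prefix and does not expand the * wildcard in /articles/*/comments, so that rule is better checked by hand.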