[Python] A function that aligns column widths by padding text that mixes full-width and half-width characters

Introduction

I was writing a program that fetches a list of headlines and URLs from the Yahoo News site and prints each item on one line, but I had some trouble aligning the URL column neatly, so I am writing this article for future reference.

The data comes from the Yahoo News site (https://news.yahoo.co.jp/, the base URL used in the code).

The text I ultimately want to produce is as follows.

Development environment

I used Python 3.7. The development environment is Visual Studio Community 2019.

Code

import requests
import unicodedata
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def main():
    base_url = 'https://news.yahoo.co.jp/'
    categories = {
        'Major': '',
        'Domestic': 'categories/domestic',
        # 'Entertainment': 'categories/entertainment',
        # 'International': 'categories/world',
        # 'Economy': 'categories/business',
        }

    # Process each category
    for cat in categories:
        url = urljoin(base_url, categories[cat])

        r = requests.get(url, timeout=10)
        r.raise_for_status()  # stop early on an HTTP error
        soup = BeautifulSoup(r.content, 'lxml')  # 'html.parser' can be used if lxml is not installed

        ul_tag = soup.find('div', class_='topicsList')\
                     .find('ul', class_='topicsList_main')

        print(f'＝＝＝{cat}＝＝＝')

        for item in ul_tag.find_all('li', class_='topicsListItem'):
            a = item.find('a')
            topic_url = a['href']
            topic_headline = a.text.strip()
            
            #print(f'{topic_headline:<18}[{topic_url}]')
            text = text_align(topic_headline, 30)
            print(f'{text}[{topic_url}]')

        print()

def get_han_count(text):
    '''
    Calculate the display width of a string, counting full-width
    characters as 2 and half-width characters as 1.
    '''
    count = 0

    for char in text:
        # 'F' (Fullwidth), 'W' (Wide) and 'A' (Ambiguous) are counted as 2 columns
        if unicodedata.east_asian_width(char) in ('F', 'W', 'A'):
            count += 2
        else:
            count += 1

    return count

def text_align(text, width, *, align=-1, fill_char=' '):
    '''
    Pad text that mixes full-width and half-width characters
    so that it reaches the specified width (counted in half-width columns).

    width: target width in half-width columns
    align: negative (e.g. -1) -> left-align, otherwise (e.g. 1) -> right-align
    fill_char: character used for padding (assumed to be half-width)

    return: the padded text (e.g. 'abcde     ')
    '''

    fill_count = width - get_han_count(text)
    # Text that is already at least `width` columns wide is returned unchanged
    if fill_count <= 0:
        return text

    if align < 0:
        return text + fill_char * fill_count
    else:
        return fill_char * fill_count + text

if __name__ == '__main__':
    main()

Initially, I formatted the output text as follows.

for item in ul_tag.find_all('li', class_='topicsListItem'):
    a = item.find('a')
    topic_url = a['href']
    topic_headline = a.text.strip()
            
    # This format spec misaligns the URL column.
    print(f'{topic_headline:<18}[{topic_url}]')

In this case, the output will be as follows.

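A format specification such as print(f'{topic_headline:<18}[{topic_url}]') counts every character as one unit of width, without distinguishing full-width from half-width characters. Because a full-width character occupies two columns on screen, headlines containing them push the URL column to the right.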
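To make the cause concrete, here is a minimal sketch. The sample strings and the helper display_width() are my own illustrations, not part of the program above. Both strings are padded to the same number of characters, yet the one made of full-width characters occupies more columns.

```python
import unicodedata

def display_width(text):
    """Hypothetical helper: count wide characters as 2 columns, others as 1."""
    return sum(2 if unicodedata.east_asian_width(c) in ('F', 'W', 'A') else 1
               for c in text)

ascii_headline = 'News'      # 4 half-width characters
wide_headline = 'ニュース'    # 4 full-width characters

# The format spec pads both strings to 6 characters, then '|' is appended.
padded_ascii = f'{ascii_headline:<6}|'
padded_wide = f'{wide_headline:<6}|'

# Same length in characters...
print(len(padded_ascii), len(padded_wide))
# ...but a different width in screen columns.
print(display_width(padded_ascii), display_width(padded_wide))
```

The format spec works on the character count, so the padding has to be computed from the display width instead.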

Therefore, I created functions that distinguish full-width from half-width characters and insert the required amount of padding.

def get_han_count(text):
    '''
    Calculate the display width of a string, counting full-width
    characters as 2 and half-width characters as 1.
    '''
    count = 0

    for char in text:
        # 'F' (Fullwidth), 'W' (Wide) and 'A' (Ambiguous) are counted as 2 columns
        if unicodedata.east_asian_width(char) in ('F', 'W', 'A'):
            count += 2
        else:
            count += 1

    return count

def text_align(text, width, *, align=-1, fill_char=' '):
    '''
    Pad text that mixes full-width and half-width characters
    so that it reaches the specified width (counted in half-width columns).

    width: target width in half-width columns
    align: negative (e.g. -1) -> left-align, otherwise (e.g. 1) -> right-align
    fill_char: character used for padding (assumed to be half-width)

    return: the padded text (e.g. 'abcde     ')
    '''

    fill_count = width - get_han_count(text)
    # Text that is already at least `width` columns wide is returned unchanged
    if fill_count <= 0:
        return text

    if align < 0:
        return text + fill_char * fill_count
    else:
        return fill_char * fill_count + text
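As a supplement, the following sketch shows what unicodedata.east_asian_width() returns for a few sample characters. The samples and their labels are my own choice for illustration. The function returns a short category code, and get_han_count() counts 'F', 'W' and 'A' as two columns.

```python
import unicodedata

# Sample characters chosen for illustration only.
samples = {
    'A': 'half-width Latin letter',
    'ｱ': 'half-width katakana',
    'あ': 'hiragana',
    'Ａ': 'full-width Latin letter',
    'α': 'Greek letter',
}

# Map each sample character to its East Asian Width category code.
categories = {char: unicodedata.east_asian_width(char) for char in samples}

for char, label in samples.items():
    print(f'{label}: {categories[char]}')
```

Note that 'A' stands for "Ambiguous": how wide such characters are drawn depends on the font and terminal. Counting them as 2 suits a Japanese environment, but this is an assumption that may not hold everywhere.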

In the end, I format the output with code like this:

for item in ul_tag.find_all('li', class_='topicsListItem'):
    a = item.find('a')
    topic_url = a['href']
    topic_headline = a.text.strip()
            
    text = text_align(topic_headline, 30)
    print(f'{text}[{topic_url}]')

This solved the problem. In addition, text_align() has an option to insert the padding on the left side (right-aligning the text), and a fill character other than a space can be specified.
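A short sketch of these options follows. The two functions are repeated in compact form so that the example runs on its own, and the sample strings are only illustrations.

```python
import unicodedata

def get_han_count(text):
    """Display width: 2 for full-width characters, 1 for half-width ones."""
    return sum(2 if unicodedata.east_asian_width(c) in ('F', 'W', 'A') else 1
               for c in text)

def text_align(text, width, *, align=-1, fill_char=' '):
    """Pad text to `width` half-width columns (negative align = left-align)."""
    fill_count = width - get_han_count(text)
    if fill_count <= 0:
        return text
    if align < 0:
        return text + fill_char * fill_count
    return fill_char * fill_count + text

left = text_align('日本', 8)                           # default: pad on the right
right = text_align('abc', 8, align=1, fill_char='.')   # pad on the left with dots
dashed = text_align('日本', 8, fill_char='-')          # custom fill character

for padded in (left, right, dashed):
    print(f'[{padded}]')
```

The fill character is assumed to be half-width here. A full-width fill character would be counted as one column per repetition and would overshoot the target width.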

How to output to a text editor etc.

By the way, the output of a Python program usually goes to the command prompt, but in a case like this you may want to send it to a text editor or word processor and save it.

In that case, you can use the software Paster to paste the output directly at the caret position in an editor or similar application.

Then, the data will be pasted directly as shown below.

Conclusion

Since the theme of this article is text formatting, I did not explain the scraping code, but the find() method was enough for almost all of it.

You are free to use the above source code, but please do so at your own risk.

Reference site

For how to use the function that distinguishes between full-width and half-width characters (`unicodedata.east_asian_width()`), I referred to the following site.

Count the number of characters (width) as 1 half-width character and 2 full-width characters in Python

When doing web scraping, be sure to check the robots.txt of the target site.

news.yahoo.co.jp/robots.txt

User-agent: *
Disallow: /comment/plugin/
Disallow: /comment/violation/
Disallow: /polls/widgets/
Disallow: /articles/*/comments
Sitemap: https://news.yahoo.co.jp/sitemaps.xml
Sitemap: https://news.yahoo.co.jp/sitemaps/article.xml
Sitemap: https://news.yahoo.co.jp/byline/sitemap.xml
Sitemap: https://news.yahoo.co.jp/polls/sitemap.xml
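
The check can also be done in code. The following is a minimal sketch using the standard library's urllib.robotparser, assuming the rules shown above. It parses a copy of the rules offline instead of downloading them, and I left out the wildcard rule (/articles/*/comments) because, as far as I know, this parser only does simple prefix matching.

```python
from urllib.robotparser import RobotFileParser

# Offline copy of the prefix-based rules listed above (wildcard rule omitted).
ROBOTS_TXT = """\
User-agent: *
Disallow: /comment/plugin/
Disallow: /comment/violation/
Disallow: /polls/widgets/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The category page used in this article is not covered by any Disallow rule.
allowed = parser.can_fetch('*', 'https://news.yahoo.co.jp/categories/domestic')
# A path under /comment/plugin/ is disallowed.
blocked = parser.can_fetch('*', 'https://news.yahoo.co.jp/comment/plugin/')

print(allowed, blocked)
```

To check the live file instead, set_url() followed by read() can be used to download the site's current robots.txt before calling can_fetch().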