[The Strongest English Word Book, Explosively Generated lol] Automatically Generate the English Word Book Engineers Need with Python - Part 1

Introduction

I think everyone has bought an English word book at least once. If you could automatically generate one in Python, would you ever need to buy one again? In this article, I test that hypothesis.

**If this post gets more than 10 LGTMs,** I will create a more practical version in the sequel (Part 2, and maybe a Part 3), so if you find it interesting, please LGTM!

About the author

Biography

After graduating from the electrical and information engineering department of a technical college at the age of 20, I worked at a technical trading company translating specifications, interpreting on visits to overseas manufacturers, and doing corporate sales, while self-studying English for 9 hours a day on weekends and 1 hour a day on weekdays. I kept that up for about a year, scored over 900 on the TOEIC, and obtained an IELTS 7.5. I currently work as a back-end/AI engineer as my main job and as an English teacher on the side. Please see here for details: https://hossyan-blog.com/profile/

Promotion

The day before yesterday, I launched an English-learning site for programmers. If you are about to start learning English, or are unsure how to go about it, please take a look at the blog: https://hossyan-blog.com/

The introduction has gotten long, so let's get into the main topic!

Table of contents

- What to make?
- How to make it?
- Whole code
- Code explanation
  - Set the parts of speech to collect for the word book
  - Delete HTML tags after getting the HTML data
  - Morphological analysis
  - Filter for the word book
  - Show the top 50 after sorting in descending order
- Execution result
- Improvement points and how to improve
- In closing

What to make?


As introduced here, the real purpose of a word book is

**"to memorize only the words you need for the goal you want to achieve."**

Since most Qiita readers are IT engineers, this time I will create an English word book **that helps engineers read technical books in English**.

How to make it?

The steps are:

1. Get the HTML data from Stack Overflow
2. Run morphological analysis
3. Count the frequency of each word
4. Display the most frequent words as candidates

Since this is Part 1, we will start with just a single page of the following listing, tagged with the keyword "Python": https://stackoverflow.com/questions/tagged/python

Now, let's get into the code.

Whole code

The whole code looks like this.

from enum import Enum, unique
from typing import List, Tuple, Set, Dict
import requests

from bs4 import BeautifulSoup as bs
from textblob import TextBlob

URL = 'https://stackoverflow.com/questions/tagged/python'
PARSER = "html.parser"
FILTER_BY_COUNT = 2


@unique
class PartOfSpeechToLearn(Enum):
    JJ = 'Adjective'
    VB = 'Verb'
    NN = 'Noun'
    RB = 'Adverb'


if __name__ == '__main__':
    # Get HTML data and remove html tags
    res = requests.get(URL)
    raw_html = bs(res.text, PARSER)
    texts_without_html: str = raw_html.text

    # morphological analysis
    morph = TextBlob(texts_without_html)
    word_and_tag: List[Tuple[str, str]] = morph.tags

    # Filter words to create a book for vocab
    part_of_speech_to_learn = tuple(pos.name for pos in PartOfSpeechToLearn)
    words_to_learn: Set[str] = {
        wt[0]
        for wt in word_and_tag
        if wt[1] in part_of_speech_to_learn
    }
    words_filtered_by_count: Dict[str, int] = {
        word: morph.words.count(word)
        for word in words_to_learn
        if morph.words.count(word) > FILTER_BY_COUNT
    }

    # Show 50 words that are most frequently used
    words_in_descending_order: List[Tuple[str, int]] = sorted(
        words_filtered_by_count.items(),
        key=lambda x: x[1],
        reverse=True
    )
    for i, word_and_count in enumerate(words_in_descending_order[:50]):
        print(f'rank:{i} word: {word_and_count}')
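A quick note if you want to run this yourself: you will need the requests, beautifulsoup4, and textblob packages installed (for example via pip install requests beautifulsoup4 textblob), plus TextBlob's corpora, which can be downloaded with python -m textblob.download_corpora so its tokenizer and POS tagger have the NLTK data they need.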

Set the parts of speech to collect for the word book

This time, we will collect the following **four parts of speech**:

  1. **Adjective**: modifies a noun, e.g. "big" and "funny"
  2. **Verb**: expresses an action, e.g. "run" and "write"
  3. **Noun**: names a thing, e.g. "apple"
  4. **Adverb**: modifies an adjective or a verb, e.g. "really" and "surely"

These are expressed as the following enum, whose member names are the corresponding Penn Treebank tags:
@unique
class PartOfSpeechToLearn(Enum):
    JJ = 'Adjective'
    VB = 'Verb'
    NN = 'Noun'
    RB = 'Adverb'
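To make the filtering behavior concrete, here is a minimal check (assuming the PartOfSpeechToLearn enum defined just above). The member names are the raw Penn Treebank tags, so only the base forms match; inflected tags such as NNS (plural noun) or VBD (past-tense verb) slip through the filter.

# The .name of each enum member is the Penn Treebank tag itself.
allowed_tags = tuple(pos.name for pos in PartOfSpeechToLearn)
print(allowed_tags)           # ('JJ', 'VB', 'NN', 'RB')
print('NN' in allowed_tags)   # True  -- singular nouns pass the filter
print('NNS' in allowed_tags)  # False -- plural nouns are skipped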

Delete HTML tags after getting the HTML data

Here we fetch the HTML from the URL and convert it to plain text with the HTML tags stripped.

# Get HTML data and remove html tags
res = requests.get(URL)
raw_html = bs(res.text, PARSER)
texts_without_html: str = raw_html.text
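To see what the .text property does, here is a tiny self-contained sketch with a made-up HTML snippet:

from bs4 import BeautifulSoup as bs

# .text concatenates the text nodes and drops the tags themselves.
snippet = bs("<div><p>How do I <b>sort</b> a list?</p></div>", "html.parser")
print(snippet.text)  # How do I sort a list?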

Morphological analysis

Perform morphological analysis so that we can limit the words to the parts of speech we want in the word book. TextBlob is used for the analysis; for TextBlob itself, I referred to here.

# morphological analysis
morph = TextBlob(texts_without_html)
word_and_tag: List[Tuple[str, str]] = morph.tags
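For reference, .tags returns a list of (word, Penn Treebank tag) pairs. A tiny sketch with a made-up sentence (the exact tags depend on TextBlob's default tagger):

from textblob import TextBlob

# Each word comes paired with a part-of-speech tag.
sample = TextBlob("Python makes scraping easy")
print(sample.tags)
# e.g. [('Python', 'NNP'), ('makes', 'VBZ'), ('scraping', 'VBG'), ('easy', 'JJ')]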

Filter for the word book

After narrowing the tagged words down to the parts of speech we want to learn, build a dictionary with each word as the key and its number of occurrences as the value, keeping only words that appear more than FILTER_BY_COUNT (= 2) times.

part_of_speech_to_learn = tuple(pos.name for pos in PartOfSpeechToLearn)
words_to_learn: Set[str] = {
    wt[0]
    for wt in word_and_tag
    if wt[1] in part_of_speech_to_learn 
}
words_filtered_by_count: Dict[str, int] = {
    word: morph.words.count(word)
    for word in words_to_learn
    if morph.words.count(word) > FILTER_BY_COUNT
}
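As an aside, morph.words.count() rescans the entire word list once per candidate word. An equivalent sketch using collections.Counter (assuming morph, words_to_learn, and FILTER_BY_COUNT from the code above) counts everything in one pass; note that WordList.count() is case-insensitive by default while Counter is case-sensitive, so the totals can differ slightly.

from collections import Counter

# Count every word once up front instead of scanning the list per word.
word_counts = Counter(morph.words)
words_filtered_by_count = {
    word: word_counts[word]
    for word in words_to_learn
    if word_counts[word] > FILTER_BY_COUNT
}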

Show the top 50 after sorting in descending order

words_in_descending_order: List[Tuple[str, int]] = sorted(
    words_filtered_by_count.items(),
    key=lambda x: x[1],
    reverse=True
)

# Show 50 words that are most frequently used
for i, word_and_count in enumerate(words_in_descending_order[:50]):
    print(f'rank:{i} word: {word_and_count}')
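Since we only need the top 50, heapq.nlargest would give the same result without sorting the whole dictionary (a minor point at this data size; assuming words_filtered_by_count from above):

import heapq

# Take the 50 highest counts directly, without a full sort.
top_50 = heapq.nlargest(50, words_filtered_by_count.items(), key=lambda x: x[1])
for i, word_and_count in enumerate(top_50):
    print(f'rank:{i} word: {word_and_count}')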

Execution result


Can this really work as an English word book...?

Here are the long-awaited results!

rank:0 word: ('i', 96)
rank:1 word: ('python', 86)
rank:2 word: ('ago', 50)
rank:3 word: ('bronze', 36)
rank:4 word: ('have', 29)
rank:5 word: ('×', 25)
rank:6 word: ('stack', 21)
rank:7 word: ('file', 17)
rank:8 word: ('List', 17)
rank:9 word: ('list', 17)
rank:10 word: ('data', 16)
rank:11 word: ('like', 14)
rank:12 word: ('be', 14)
rank:13 word: ('language', 13)
rank:14 word: ('pandas', 13)
rank:15 word: ('code', 12)
rank:16 word: ('create', 11)
rank:17 word: ('there', 10)
rank:18 word: ('dataframe', 10)
rank:19 word: ('not', 9)
rank:20 word: ('function', 9)
rank:21 word: ('silver', 9)
rank:22 word: ('work', 8)
rank:23 word: ('String', 8)
rank:24 word: ('string', 8)
rank:25 word: ('Get', 8)
rank:26 word: ('get', 8)
rank:27 word: ('r', 7)
rank:28 word: ('R', 7)
rank:29 word: ('tags', 7)
rank:30 word: ('following', 7)
rank:31 word: ('flask', 7)
rank:32 word: ('input', 7)
rank:33 word: ('do', 7)
rank:34 word: ('plot', 6)
rank:35 word: ('layout', 6)
rank:36 word: ('import', 6)
rank:37 word: ('array', 6)
rank:38 word: ('use', 6)
rank:39 word: ('below', 6)
rank:40 word: ('object', 6)
rank:41 word: ('format', 6)
rank:42 word: ('python-3.x', 6)
rank:43 word: ('app', 6)
rank:44 word: ('log', 5)
rank:45 word: ('add', 5)
rank:46 word: ('variable', 5)
rank:47 word: ('scrapy', 5)
rank:48 word: ('def', 5)
rank:49 word: ('c', 5)

Hmm, the results are a bit underwhelming. Still, they are roughly what I expected.

Improvement points and how to improve


There are three reasons the results feel underwhelming (corrected while writing the sequel; updated 4/23 11:57):

  1. Some words only ever appear inside code snippets
  2. Uppercase and lowercase are treated as different words, so the same word shows up twice (e.g. "List" and "list")
  3. The results overfit to this one Stack Overflow listing page (e.g. "bronze" and "silver")

One way to improve point 1 is to **make the preprocessing a bit more sophisticated**. As for point 3, we need to **gather more data**, mixing in individual Stack Overflow question pages, other technical blogs, tech news, and so on.
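For example, something along these lines could handle points 1 and 2 (just a rough sketch of the idea, assuming the raw_html object from the main code; the real fix is left for the sequel): drop code, script, and style blocks before extracting the text, and lowercase it so that "List" and "list" are counted as one word.

# Remove elements that contain code rather than natural English,
# then lowercase the remaining text before the morphological analysis.
for tag in raw_html(['code', 'pre', 'script', 'style']):
    tag.decompose()
texts_without_html = raw_html.text.lower()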

In closing


**If this post gets more than 10 LGTMs,** I will fix the improvement points above and write the next article (Part 2, and maybe a Part 3). Once the word list is complete, I will share the data and turn it into something you can actually use as a word book!

So, if you find this series interesting, please **LGTM!**

Sequel

(**Over 10 LGTMs!! Thank you!** I am currently writing the next article. The more LGTMs it gets, the more care I can put into it, so if you like this one too, please keep the LGTMs coming! Updated 4/23 9:00)

I have published Part 2! Please see it here: [The Strongest English Word Book, Explosively Generated lol] Automatically Generate the English Word Book Engineers Need with Python - Part 2
