I think everyone has bought one at some point: an English word book. So what happens if you automatically generate one with Python?
After graduating from the electrical and information engineering department of a technical college at age 20, I worked at a technical trading company translating specifications, interpreting on visits to overseas manufacturers, doing corporate sales, and so on, while self-studying English for 9 hours a day on weekends and 1 hour a day on weekdays. I kept that up for about a year, scored over 900 on TOEIC, and obtained IELTS 7.5. Currently I work as a back-end and AI engineer as my main job and as an English teacher on the side. Please see here for details: https://hossyan-blog.com/profile/
The introduction has gotten long, but let's get right into the main topic!
- What are we going to make?
- How do we make it?
- Whole code
- Code explanation
  - Removing HTML tags after getting the HTML data
As introduced here, the real purpose of a vocabulary book is
**"to remember only what you need for the goal you want to achieve."**
Since Qiita readers are IT engineers, this time I will create an English word book **so that engineers can read technical books**.
Since this is the first part, we will start with just the first page of results from the following HTML page, tagged with the keyword "Python": https://stackoverflow.com/questions/tagged/python
Now, let's get into the code.
The whole code looks like this.
```python
from enum import Enum, unique
from typing import List, Tuple, Set, Dict

import requests
from bs4 import BeautifulSoup as bs
from textblob import TextBlob

URL = 'https://stackoverflow.com/questions/tagged/python'
PARSER = "html.parser"
FILTER_BY_COUNT = 2


@unique
class PartOfSpeechToLearn(Enum):
    JJ = 'Adjective'
    VB = 'Verb'
    NN = 'Noun'
    RB = 'Adverb'


if __name__ == '__main__':
    # Get HTML data and remove HTML tags
    res = requests.get(URL)
    raw_html = bs(res.text, PARSER)
    texts_without_html: str = raw_html.text

    # Morphological analysis
    morph = TextBlob(texts_without_html)
    word_and_tag: List[Tuple[str, str]] = morph.tags

    # Filter words to include in the vocab book
    part_of_speech_to_learn = tuple(pos.name for pos in PartOfSpeechToLearn)
    words_to_learn: Set[str] = {
        wt[0]
        for wt in word_and_tag
        if wt[1] in part_of_speech_to_learn
    }
    words_filtered_by_count: Dict[str, int] = {
        word: morph.words.count(word)
        for word in words_to_learn
        if morph.words.count(word) > FILTER_BY_COUNT
    }

    # Show the 50 most frequently used words
    words_in_descending_order: List[Tuple[str, int]] = sorted(
        words_filtered_by_count.items(),
        key=lambda x: x[1],
        reverse=True
    )
    for i, word_and_count in enumerate(words_in_descending_order[:50]):
        print(f'rank:{i} word: {word_and_count}')
```
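By the way, if you want to run this yourself: the script needs `requests`, `beautifulsoup4`, and `textblob` installed (for example via `pip install requests beautifulsoup4 textblob`), and TextBlob's POS tagger needs its NLTK corpora downloaded once with `python -m textblob.download_corpora`.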
This time, we will extract the following four **parts of speech**.
```python
@unique
class PartOfSpeechToLearn(Enum):
    JJ = 'Adjective'
    VB = 'Verb'
    NN = 'Noun'
    RB = 'Adverb'
```
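A quick note on these member names: JJ, VB, NN, and RB are Penn Treebank POS tags, which is the tag set TextBlob's tagger returns. Since the script later matches tags exactly, inflected variants such as NNS (plural noun) or VBD (past-tense verb) fall through the filter. Here is a minimal sketch of a prefix-based check that would also catch those variants; this is my own refinement, not part of the original code:

```python
# Hypothetical refinement (not in the original script):
# Penn Treebank tags share prefixes, e.g. 'NNS' and 'NNP' both start
# with 'NN', so a prefix match also captures inflected forms.
target_prefixes = ('JJ', 'VB', 'NN', 'RB')

def is_target_pos(tag: str) -> bool:
    return tag.startswith(target_prefixes)  # str.startswith accepts a tuple

print(is_target_pos('NNS'))  # True  (plural noun)
print(is_target_pos('VBD'))  # True  (past-tense verb)
print(is_target_pos('PRP'))  # False (pronoun)
```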
Next, we get the HTML from the URL and convert it to plain text with the HTML tags removed.
```python
# Get HTML data and remove HTML tags
res = requests.get(URL)
raw_html = bs(res.text, PARSER)
texts_without_html: str = raw_html.text
```
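One caveat here: `raw_html.text` also keeps whatever was inside `<script>` and `<style>` tags, so JavaScript and CSS fragments end up in the corpus. A minimal sketch of slightly cleaner extraction, as my own addition rather than what the article uses:

```python
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://stackoverflow.com/questions/tagged/python'
PARSER = 'html.parser'

res = requests.get(URL)
raw_html = bs(res.text, PARSER)

# Remove script/style elements so JS and CSS don't pollute the word list
for tag in raw_html(['script', 'style']):
    tag.decompose()

# separator=' ' keeps adjacent text nodes from being glued into fake words
texts_without_html = raw_html.get_text(separator=' ')
```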
We then perform morphological analysis so that we can limit which parts of speech go into the vocabulary book. TextBlob is used for the morphological analysis; I referred to here for how to use it.
```python
# Morphological analysis
morph = TextBlob(texts_without_html)
word_and_tag: List[Tuple[str, str]] = morph.tags
```
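To make the output concrete, `morph.tags` returns a list of (word, Penn Treebank tag) pairs. On a toy sentence it looks roughly like this (the exact tags depend on TextBlob's underlying tagger):

```python
from textblob import TextBlob

demo = TextBlob('I love writing Python code')
print(demo.tags)
# Roughly: [('I', 'PRP'), ('love', 'VBP'), ('writing', 'VBG'),
#           ('Python', 'NNP'), ('code', 'NN')]
```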
After keeping only the parts of speech we want in the vocabulary book, we build a dictionary with each word as the key and its occurrence count as the value, filtering out words that do not appear more than FILTER_BY_COUNT (= 2) times.
```python
part_of_speech_to_learn = tuple(pos.name for pos in PartOfSpeechToLearn)
words_to_learn: Set[str] = {
    wt[0]
    for wt in word_and_tag
    if wt[1] in part_of_speech_to_learn
}
words_filtered_by_count: Dict[str, int] = {
    word: morph.words.count(word)
    for word in words_to_learn
    if morph.words.count(word) > FILTER_BY_COUNT
}
```
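A performance note on this comprehension: `morph.words.count(word)` rescans the whole word list, and it runs twice per candidate word. Also, TextBlob's `WordList.count` is case-insensitive by default, which is why pairs like 'List'/'list' show up with identical counts in the results below. If this step becomes slow, a single-pass drop-in alternative using `collections.Counter` could look like this sketch, which reuses `morph`, `words_to_learn`, and `FILTER_BY_COUNT` from the script above (note that a plain Counter is case-sensitive, so the counts would differ slightly):

```python
from collections import Counter

# Count every word once instead of rescanning the corpus per word.
# Note: unlike WordList.count, Counter is case-sensitive as-is.
word_counts = Counter(morph.words)

words_filtered_by_count = {
    word: word_counts[word]
    for word in words_to_learn
    if word_counts[word] > FILTER_BY_COUNT
}
```

Finally, the script sorts the dictionary by count in descending order and prints the top 50: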
```python
words_in_descending_order: List[Tuple[str, int]] = sorted(
    words_filtered_by_count.items(),
    key=lambda x: x[1],
    reverse=True
)

# Show the 50 most frequently used words
for i, word_and_count in enumerate(words_in_descending_order[:50]):
    print(f'rank:{i} word: {word_and_count}')
```
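Printing to the console is fine for checking, but since the goal is a word book, it might be handier to write the ranking to a CSV file that flashcard apps can import. A minimal sketch, my own addition reusing `words_in_descending_order` from above:

```python
import csv

# Save the top 50 words as a simple CSV word book
with open('word_book.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['rank', 'word', 'count'])
    for i, (word, count) in enumerate(words_in_descending_order[:50]):
        writer.writerow([i, word, count])
```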
So, can this actually serve as an English word book......?
Here are the long-awaited results......!
```
rank:0 word: ('i', 96)
rank:1 word: ('python', 86)
rank:2 word: ('ago', 50)
rank:3 word: ('bronze', 36)
rank:4 word: ('have', 29)
rank:5 word: ('×', 25)
rank:6 word: ('stack', 21)
rank:7 word: ('file', 17)
rank:8 word: ('List', 17)
rank:9 word: ('list', 17)
rank:10 word: ('data', 16)
rank:11 word: ('like', 14)
rank:12 word: ('be', 14)
rank:13 word: ('language', 13)
rank:14 word: ('pandas', 13)
rank:15 word: ('code', 12)
rank:16 word: ('create', 11)
rank:17 word: ('there', 10)
rank:18 word: ('dataframe', 10)
rank:19 word: ('not', 9)
rank:20 word: ('function', 9)
rank:21 word: ('silver', 9)
rank:22 word: ('work', 8)
rank:23 word: ('String', 8)
rank:24 word: ('string', 8)
rank:25 word: ('Get', 8)
rank:26 word: ('get', 8)
rank:27 word: ('r', 7)
rank:28 word: ('R', 7)
rank:29 word: ('tags', 7)
rank:30 word: ('following', 7)
rank:31 word: ('flask', 7)
rank:32 word: ('input', 7)
rank:33 word: ('do', 7)
rank:34 word: ('plot', 6)
rank:35 word: ('layout', 6)
rank:36 word: ('import', 6)
rank:37 word: ('array', 6)
rank:38 word: ('use', 6)
rank:39 word: ('below', 6)
rank:40 word: ('object', 6)
rank:41 word: ('format', 6)
rank:42 word: ('python-3.x', 6)
rank:43 word: ('app', 6)
rank:44 word: ('log', 5)
rank:45 word: ('add', 5)
rank:46 word: ('variable', 5)
rank:47 word: ('scrapy', 5)
rank:48 word: ('def', 5)
rank:49 word: ('c', 5)
```
Hmm, the results are a bit underwhelming. Still, they are mostly what I expected.
There are three reasons the results feel underwhelming. (Corrected while writing the sequel; updated 4/23 11:57.)
One way to improve the first point is to **make the preprocessing a little more sophisticated**. As for the third, it seems we need to **increase the amount of input data** by mixing in individual Stack Overflow question pages, other technical blogs, technical news, and so on.
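As a rough illustration of that last point, the scraper could loop over several listing pages instead of just the first. A minimal sketch; the `tab` and `page` query parameters are my assumption about how Stack Overflow paginates its tag listings, so verify the actual URLs before relying on this:

```python
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://stackoverflow.com/questions/tagged/python'
PARSER = 'html.parser'

corpus_parts = []
for page in range(1, 4):  # first three listing pages (assumed pagination scheme)
    res = requests.get(URL, params={'tab': 'newest', 'page': page})
    res.raise_for_status()
    corpus_parts.append(bs(res.text, PARSER).text)

texts_without_html = ' '.join(corpus_parts)
```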
**If this article gets more than 10 LGTMs**, I will fix these improvement points and write the next article (the second part). And once the word book is complete, I will share the data with you.
So if you find this series interesting, please **LGTM!**
(**We passed 10 LGTMs!! Thank you!** I am now writing the next article. The more LGTMs there are, the more care I can put into it, so if you like this, please keep the LGTMs coming! Updated 4/23 9:00)
I wrote the second part! Check it out here: [Mass-generating the strongest English word book lol] Automatically generate the English word book engineers need with Python - Part 2