When applying regular expressions to an HTML file in Python, I got stuck on encoding and a few other points, so I'm leaving this as a memo.
## I. Read the HTML file
Use the codecs library. It is part of the Python standard library (Appx. 1), so it can be used with a plain import and no installation.
```python
import codecs

f = codecs.open("hoge.html", "r", encoding="utf-8")
```
### Caution
- Be sure to pass the `encoding` argument to the `codecs.open` function.
- On Windows, it seems that without it only files encoded as Shift-JIS (CP932) can be read. When I looked into it, someone had already run into the same problem (Appx. 2).
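
As a rough sketch of the caution above, the following writes a small UTF-8 file into a temporary directory and reads it back with the encoding named explicitly. The sample contents and the temporary path are made up for illustration. In Python 3 the built-in `open` accepts the same `encoding` argument, so it is shown alongside `codecs.open`.

```python
import codecs
import os
import tempfile

# Hypothetical sample page; the contents are for illustration only.
html = "<html><body><p>こんにちは abc</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hoge.html")

    # Write the sample as UTF-8 so the encoding on disk is known.
    with codecs.open(path, "w", encoding="utf-8") as out:
        out.write(html)

    # Read it back, naming the encoding instead of relying on the
    # platform default (which may be CP932 on a Japanese Windows setup).
    with codecs.open(path, "r", encoding="utf-8") as f:
        text = f.read()

    # The built-in open() takes the same encoding argument in Python 3.
    with open(path, "r", encoding="utf-8") as f:
        text_builtin = f.read()

print(text == html and text_builtin == html)
```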
## II. Apply regular expressions
Let's use the re library. This is also part of the standard library (Appx. 1), so it can be used with a plain import.
```python
import re

# f is the file object opened with codecs.open above
text = f.read()  # named text rather than str, to avoid shadowing the built-in
regex = '[abc]'
sample = re.findall(regex, text)
```
### Caution
- Be sure to call `read()` on the file object opened in the previous chapter (`f.read()`) before passing the HTML to any `re` function. The `re` functions expect a string, not a file object.
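
To illustrate the caution above, here is a small sketch in which `io.StringIO` stands in for the opened HTML file (an assumption made so the example runs without a real file). Passing the file object itself raises a `TypeError`, while passing the result of `read()` works.

```python
import io
import re

# io.StringIO is a stand-in for the opened HTML file in this sketch.
f = io.StringIO("<p>abc</p>")
regex = "[abc]"

try:
    re.findall(regex, f)           # passing the file object itself
except TypeError as err:
    error_message = str(err)       # re expects a string or bytes-like object

text = f.read()                    # read the contents first
matches = re.findall(regex, text)  # ["a", "b", "c"]
```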
### Typical regular expression
The following are the `re` functions from Appx. 3 that seem the most versatile. In each case, `regex` is assumed to hold a regular expression.
- Search for the first match: `re.search(regex, string)`
  - Scans string for the regex pattern and, if it is found, returns a match object for the first match (call `.group()` on it to get the matched text)
  - If the search fails, None is returned, so it is easy to use in a conditional such as `if not re.search(regex, string):`.
- Search all: `re.findall(regex, string)`
  - Scans string for the regex pattern and returns a list of all the matches
  - The result is a list, not a string, so it cannot be passed directly to another `re` function. Convert it to a string with `str()` first (Appx. 4), or loop over the list and process each element.
- Replacement: `re.sub(regex, replace, string, count=0)`
  - Finds the parts of string that match the regex pattern and replaces them with replace
  - If count is a positive integer, at most that many matches are replaced, counting from the beginning of the string. With the default value of 0, all matches are replaced.
  - I was wondering how the global flag (`/g`) used in JavaScript and other regular expressions is expressed in Python. Apparently the `count=0` default plays that role (Appx. 5).
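
The three functions above can be tried with a short sketch like the following. The sample string and the pattern are made up for illustration and stand in for the result of `f.read()`.

```python
import re

# Hypothetical sample text standing in for the contents of the HTML file.
text = "<p>abc</p><p>cab</p>"
regex = "[abc]+"

# re.search returns a match object for the first match, or None.
m = re.search(regex, text)
first = m.group() if m else None                     # "abc"

# re.findall returns every non-overlapping match as a list of strings.
found = re.findall(regex, text)                      # ["abc", "cab"]

# Each element of the list is a string, so it can go to another re call.
upper = [re.sub("a", "A", s) for s in found]         # ["Abc", "cAb"]

# re.sub with the default count=0 replaces every match,
all_replaced = re.sub(regex, "X", text)              # "<p>X</p><p>X</p>"
# while count=1 replaces only the first one.
first_replaced = re.sub(regex, "X", text, count=1)   # "<p>X</p><p>cab</p>"
```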
Regular expression syntax itself is covered in detail in Appx. 6.
(that's all)
## Reference (Appendix / Appx.)
- 1. [Python standard library](https://docs.python.org/ja/3/library/index.html)
- 2. [A story that I had a hard time opening a file other than CP932 (Shift-JIS) encoded on Windows](https://qiita.com/Yuu94/items/9ffdfcb2c26d6b33792e)
- 3. [re --- Regular expression operation](https://docs.python.org/ja/3/library/re.html)
- 4. [re.sub erroring with “Expected string or bytes-like object”](https://stackoverflow.com/questions/43727583/re-sub-erroring-with-expected-string-or-bytes-like-object)
- 5. [Python RegExp global flag](https://stackoverflow.com/questions/11686516/python-regexp-global-flag)
- 6. [List of basic regular expressions](https://murashun.jp/blog/20190215-01.html)
  - I always use it to look up regular expressions.