Python Crawling & Scraping-Practical Development Guide for Data Collection and Analysis- https://www.amazon.co.jp/dp/B01NGWKE0P/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1
What I learned from section 1.4.1, "Get the total number of e-books," of the book above.
The section covers how to pull only the wanted character string out of HTML that has already been filtered with grep, using regular expressions and related commands. Four methods are introduced:
1.Extract the part that matches the regular expression with the sed command
2.Remove the matched part with the sed command and keep the remaining part
3.Use the cut command to extract the nth string from a string separated by a specific character
4.Extract the nth string from a space-aligned string using the awk command
I didn't know these commands in the first place... Fortunately, the previous page had an explanation of sed and cut.
sed When to use: replace or delete text on lines that match a specific condition. Usage: sed 's/regular expression to search for/replacement string/options'
【Example of use】
# Replace every . with a space and output the result. The g option replaces all matches on a line, even if the pattern appears multiple times. The dot is escaped because an unescaped . matches any character.
XX | sed 's/\./ /g'
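To make the effect of the g option and of escaping the dot concrete, here is a minimal sketch. The sample string `1.2.3` and the variable names are assumptions made up for illustration, not taken from the book.

```shell
# Hypothetical sample input; any text containing several dots would do.
line='1.2.3'

# Unescaped dot: . matches ANY character, so every character becomes a space.
unescaped=$(printf '%s\n' "$line" | sed 's/./ /g')

# Escaped dot with g: only literal dots are replaced, all of them.
escaped_all=$(printf '%s\n' "$line" | sed 's/\./ /g')

# Escaped dot without g: only the first literal dot on the line is replaced.
escaped_first=$(printf '%s\n' "$line" | sed 's/\./ /')

echo "[$unescaped]"      # [     ]  (five spaces)
echo "[$escaped_all]"    # [1 2 3]
echo "[$escaped_first]"  # [1 2.3]
```

The brackets in the echo lines are only there to make the whitespace visible.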
cut When to use: extract some columns from text separated by a specific character 【Example of use】
# Output only the first and second comma-separated columns. Specify the delimiter with -d and the column numbers with -f.
XX | cut -d , -f 1,2
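As a small runnable sketch of the -d and -f options, the following uses two made-up CSV rows in place of XX. The rows and variable names are assumptions for illustration only.

```shell
# Hypothetical two-line CSV input standing in for the output of XX.
rows='1,baseball,Hanshin
2,soccer,Gamba'

# Fields 1 and 2: the selected fields are joined by the same delimiter.
first_two=$(printf '%s\n' "$rows" | cut -d , -f 1,2)

# An open-ended range: field 2 through the last field.
from_second=$(printf '%s\n' "$rows" | cut -d , -f 2-)

echo "$first_two"
echo "$from_second"
```

Note that cut works line by line, so each input row produces one output row.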
Now let's go through methods 1 to 4 one by one.
Method 1 Usage: sed -E 's/.*(regular expression matching the part you want to extract).*/\1/' Explanation: . matches any single character, * repeats the preceding element zero or more times, and \1 stands for the text captured by the first ( ) group.
【Example of use】
echo hello_world | sed -E 's/.*(hello).*/\1/'
#Output result
hello
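The same capture-group idea can be tried on an HTML fragment. This is only a sketch under assumptions: the sample line and the variable names are hypothetical, and the pattern is one possible way to pick out the number after the slash, not necessarily the one used in the book.

```shell
# Hypothetical sample line, similar to what grep might return.
html='<li class="pagingnumber">130/2098</li>'

# Anchor the group on the surrounding text ("/" before, "</li>" after).
total=$(printf '%s\n' "$html" | sed -E 's/.*\/([0-9]+)<\/li>.*/\1/')

# Pitfall: the leading .* is greedy, so an unanchored group keeps only
# the last digit.
greedy=$(printf '%s\n' "$html" | sed -E 's/.*([0-9]+).*/\1/')

# Pitfall: a line that does not match is printed unchanged. With -n and
# the p flag, only lines where a substitution happened are printed.
unmatched=$(printf '%s\n' 'no numbers here' | sed -n -E 's/.*\/([0-9]+)<\/li>.*/\1/p')

echo "$total"        # 2098
echo "$greedy"       # 8
echo "[$unmatched]"  # []
```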
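Method 2 Explanation: ^ at the start of [] indicates negation, so [^>]* matches a run of characters other than >, and <[^>]*> matches one whole tag.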
【Example of use】
echo '<li class="pagingnumber">130/2098</li>' | sed -E 's/<[^>]*>//g'
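The sketch below shows what the tag-stripping command outputs and why the negated class [^>] matters. Chaining cut afterwards is an assumption about how the remaining text might be narrowed down further, and the variable names are hypothetical.

```shell
html='<li class="pagingnumber">130/2098</li>'

# Remove each tag individually: [^>]* stops at the first > it meets.
text=$(printf '%s\n' "$html" | sed -E 's/<[^>]*>//g')

# With a greedy .* the match runs from the first < to the LAST >,
# so the whole line is deleted.
greedy=$(printf '%s\n' "$html" | sed -E 's/<.*>//g')

# Assumed follow-up step: split the remaining text on / and take field 2.
total=$(printf '%s\n' "$text" | cut -d / -f 2)

echo "$text"      # 130/2098
echo "[$greedy]"  # []
echo "$total"     # 2098
```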
Method 3 When to use: extracting a field from CSV. Explanation: in '-d , -f 2', -d , sets the delimiter to a comma and -f 2 selects the second delimited field.
echo '1,baseball,Hanshin' | cut -d , -f 2
#Output result
baseball
Method 4 When to use: when columns are aligned with spaces, so that delimiters appear consecutively. (cut is not suitable in that case, because it treats every single delimiter as a field boundary.) Passing the program '{print $n}' to awk extracts the nth field.
echo 'A B C D E' | awk '{print $4}'
#Output result
D
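To see why cut struggles with consecutive delimiters while awk does not, here is a minimal comparison. The sample line with uneven spacing and the variable names are assumptions for illustration.

```shell
# Hypothetical space-aligned line: two spaces after A, three after B.
line='A  B   C D'

# awk splits on runs of whitespace by default, so field 2 is B.
with_awk=$(printf '%s\n' "$line" | awk '{print $2}')

# cut treats every single space as a boundary, so field 2 is the empty
# string between the first two spaces.
with_cut=$(printf '%s\n' "$line" | cut -d ' ' -f 2)

# $NF is the last field, whatever the number of fields.
last=$(printf '%s\n' "$line" | awk '{print $NF}')

# awk can also split on a specific delimiter with -F.
csv_field=$(printf '%s\n' '1,baseball,Hanshin' | awk -F , '{print $2}')

echo "$with_awk"    # B
echo "[$with_cut]"  # []
echo "$last"        # D
echo "$csv_field"   # baseball
```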