[Memorandum] I've started working, so I want to study regular expressions well [Regular expressions]

Reference article

https://qiita.com/jnchito/items/893c887fbf19e17d3ff9
https://qiita.com/jnchito/items/64c3fdc53766ac6f2008

To be honest, I owe a lot to the author of these articles. This post is my own write-up of what I learned from them, so you may be better off reading the originals than reading this post.

What is a regular expression?

"A mini language for efficiently searching and replacing character strings by specifying a pattern"

That is how the article above puts it. Hmm. It is still a bit vague, but I'll remember three keywords: "patterns", searching, and replacing.

I will try it for the time being. Let's see.

I'll play around with https://rubular.com/ to learn regular expressions.

The test string looks like this.

Name: Onikan
Phone: 03-1234-5678
Address: 1 Chuo-ku, Tokyo-2-3

When I typed \d, only the digits were highlighted in blue. (By the way, on a Mac with a Japanese keyboard you can type a backslash with option + ¥.)

In other words, \d represents one half-width digit (0 to 9). \d is called a **metacharacter**, and because it represents a set of characters, it is also called a **character class**. Hmm.

Is this one of the "patterns" in the **"mini language for efficiently searching and replacing character strings"** mentioned earlier?

Anyway, remember that \d represents one half-width digit.
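As a small check of the point above, here is a minimal Ruby sketch. The sample string is just one line of the test string, and the variable names are my own choice for illustration. It shows that \d on its own matches one digit at a time.

```ruby
# Assumed sample: the phone line from the test string.
text = "Phone: 03-1234-5678"

# \d matches exactly one half-width digit, so scan returns each digit separately.
digits = text.scan(/\d/)

p digits
# => ["0", "3", "1", "2", "3", "4", "5", "6", "7", "8"]
```

The hyphens and letters are skipped because \d only covers the digits 0 to 9.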

Try connecting metacharacters.

Since \d represents one half-width digit, let's learn how to represent two or three of them.

I typed \d\d-\d\d\d\d-\d\d\d\d. This time, the whole phone number, including the hyphens, is highlighted.

\d\d represents two consecutive half-width digits. (Like 12 or 34.)

Try running it with Ruby

text = <<-TEXT
Name: Onikan
Phone: 03-1234-5678
Address: 1 Chuo-ku, Tokyo-2-3
TEXT

# scan returns every substring that matches the pattern
text.scan(/\d\d-\d\d\d\d-\d\d\d\d/)
# => ["03-1234-5678"]

By the way, if you are not familiar with the text = <<-TEXT part, you may want to search for "Ruby heredoc".

Try to run it with JavaScript

const text = "Name: Onikan\nPhone: 03-1234-5678\nAddress: 1 Chuo-ku, Tokyo-2-3";
// match with the g option returns every matching substring
text.match(/\d\d-\d\d\d\d-\d\d\d\d/g);
// => ["03-1234-5678"]

\n is a newline character. g is called the global option. Results differ with and without it as follows.

- With g: every matching string is extracted.
- Without g: the search ends as soon as the first match is found.
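Ruby has no g option, but a rough analogue of the same difference can be sketched as follows. This is only an illustration under my own assumptions: the sample string with two numbers and the variable names are made up for this example.

```ruby
# Hypothetical sample containing two phone-like numbers.
text = "Tel: 03-1234-5678 / Fax: 06-9876-5432"
pattern = /\d\d-\d\d\d\d-\d\d\d\d/

# String#[] with a Regexp stops at the first match (similar to match without g).
first_only = text[pattern]

# String#scan collects every match (similar to match with g).
all_matches = text.scan(pattern)

p first_only   # => "03-1234-5678"
p all_matches  # => ["03-1234-5678", "06-9876-5432"]
```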

Supporting area codes of different lengths

/\d\d-\d\d\d\d-\d\d\d\d/ does not match every phone number. For example, it misses numbers whose area code is not exactly two digits, or whose middle block is not exactly four digits.

Let's learn a regular expression that can handle these. The important thing here is to find the **pattern** to search for. In this case, it is as follows.

- 2 to 5 half-width digits
- A hyphen
- 1 to 4 half-width digits
- A hyphen
- 4 half-width digits

These appear in this order. The new knowledge here is the metacharacters {n,m} and {n}. Because they specify how many times something repeats, they are called quantifiers.

{n,m} means "the immediately preceding character repeats at least n and at most m times". For example, \d{1,4} represents 1 to 4 **half-width digits**. {n} means exactly n times.

So, if you apply it to the previous pattern, it will look like this.

It should be \d{2,5}-\d{1,4}-\d{4}.
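To see the quantifiers at work, here is a minimal Ruby sketch. The sample numbers are hypothetical values I chose to cover different block lengths, and the last one is included on purpose as a non-match.

```ruby
pattern = /\d{2,5}-\d{1,4}-\d{4}/

# Hypothetical samples; the last one has a one-digit first block.
samples = ["03-1234-5678", "090-1234-5678", "0795-12-3456", "04992-1-2345", "0-1234-5678"]

# Map each sample to true/false depending on whether the pattern matches.
results = samples.map { |number| [number, number.match?(pattern)] }.to_h

results.each { |number, matched| puts "#{number}: #{matched}" }
```

Only the last sample fails, because \d{2,5} needs at least two digits before the first hyphen.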

I also want to support parentheses!

The pattern above cannot handle numbers such as "03(1234)5678". So, **I want to change it so that it can handle either hyphens or parentheses**.

The new patterns are "a hyphen or (" and "a hyphen or )". New knowledge comes up here.

"Any one character, either A or B" is written [AB]. (Since it represents a set of characters, it is a kind of character class.) By the way, there is no limit to the number of characters inside []. [ABC] represents any one of A, B, or C.

Therefore, "a hyphen or (" is written [-(], and "a hyphen or )" is written [-)]. Putting the whole thing together gives \d{2,5}[-(]\d{1,4}[-)]\d{4}.
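Here is a minimal Ruby sketch of the combined pattern, assuming the quantifier pattern from the previous section with each hyphen swapped for [-(] and [-)]. The sample strings are hypothetical.

```ruby
pattern = /\d{2,5}[-(]\d{1,4}[-)]\d{4}/

# Hypothetical sample with one hyphenated and one parenthesized number.
text = "03-1234-5678, 03(1234)5678"

matches = text.scan(pattern)
p matches
# => ["03-1234-5678", "03(1234)5678"]

# Caveat: the two classes are independent, so a mismatched pair also matches.
mismatched = "03-1234)5678".match?(pattern)
p mismatched
# => true
```

The last check is a limitation of this simple approach rather than a bug in the sketch: each character class only looks at its own position.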


Hyphens can have a special meaning inside []. For example, [A-Z] stands for "A or B or C or ... or Z". In other words, it represents one half-width uppercase letter. That is, a hyphen inside [] can represent a **range of characters**.

If a hyphen is placed at the beginning or end of [], as in [-AB] or [AB-], it is treated as a literal hyphen.
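The difference between a range hyphen and a literal hyphen can be sketched in Ruby like this. The sample string is a hypothetical one chosen so that both letters and hyphens appear.

```ruby
# Hypothetical sample containing uppercase letters and hyphens.
text = "A-Z and B-52"

# Hyphen between two characters: a range, so only uppercase letters match.
range_matches = text.scan(/[A-Z]/)

# Hyphen at the start or end of []: a literal hyphen, plus A and B.
leading_hyphen = text.scan(/[-AB]/)
trailing_hyphen = text.scan(/[AB-]/)

p range_matches    # => ["A", "Z", "B"]
p leading_hyphen   # => ["A", "-", "B", "-"]
p trailing_hyphen  # => ["A", "-", "B", "-"]
```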


- \d represents one half-width digit
- {n,m} means the immediately preceding character repeats at least n and at most m times
- {n} means exactly n times
- [ab] represents one character, either a or b
- [a-z] represents one character from a, b, c, ... z
- [-az] represents one character: a hyphen, a, or z


As I said at the beginning, this post is a write-up of the reference articles, so I recommend reading them as well.

