[Ruby] [Memorandum] Now that I'm doing real-world work, I want to study regular expressions [Regular expressions]


Reference articles

https://qiita.com/jnchito/items/893c887fbf19e17d3ff9
https://qiita.com/jnchito/items/64c3fdc53766ac6f2008

To be honest, I owe a great deal to the author of the articles above. This post is just my own write-up of what I learned from them, so you may be better off reading the originals.

What is a regular expression

“Mini-language for efficiently searching and replacing character strings by specifying patterns”

That is how the article above puts it. Hmm. It is still a little fuzzy to me, but I will remember the keywords: "patterns", searching, and replacing.

Try it out first

Play around with https://rubular.com/ to learn regular expressions.

The test string looks like this.

Name: Rice
Phone: 03-1234-5678
Address: 1-2-3 Chuo-ku, Tokyo

When I typed \d, only the digits were highlighted in blue. (By the way, on a Mac you can type a backslash with option+\.)

In other words, \d represents a single half-width digit (0123456789). \d is called a metacharacter, and because it stands for a set of characters, it is also called a character class. Hmm.

So this is one of the "patterns" from the "mini-language for efficiently searching and replacing character strings" mentioned earlier?

Anyway, remember that \d represents a single half-width digit.
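To see the same thing outside Rubular, here is a minimal Ruby sketch. The variable names and the full-width sample string are my own assumptions for illustration; the phone line comes from the test string above.

```ruby
# Sample line taken from the test string above
line = "Phone: 03-1234-5678"

# scan returns every substring that matches the pattern,
# so each digit comes back as its own one-character string
digits = line.scan(/\d/)
p digits
# => ["0", "3", "1", "2", "3", "4", "5", "6", "7", "8"]

# Full-width digits (assumed sample) are not matched by \d by default
full_width = "０３".scan(/\d/)
p full_width
# => []
```

The hyphens and the letters are skipped because \d only stands for the ten half-width digits.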

Try combining metacharacters

Since \d represents one half-width digit, let's learn how to represent two or three of them.

I typed \d\d-\d\d\d\d-\d\d\d\d. This time the whole phone number was highlighted, hyphens included.

\d\d represents two consecutive half-width digits (like 12 or 34).

Try using Ruby

text = <<-TEXT
Name: Rice
Phone: 03-1234-5678
Address: 1-2-3 Chuo-ku, Tokyo
TEXT
text.scan(/\d\d-\d\d\d\d-\d\d\d\d/)
# => ["03-1234-5678"]

By the way, if you are not familiar with the text = <<-TEXT part, try searching for "Ruby here document" (heredoc).

Try using JavaScript

const text = "Name: rice \n Telephone: 03-1234-5678\n Address: 1-2-3 Chuo-ku, Tokyo";
text.match(/\d\d-\d\d\d\d-\d\d\d\d/g);
// => ["03-1234-5678"]

\n is a newline character. The g after the closing slash is called the global option (flag). It makes the following difference:

  • With g: every matching string is extracted.
  • Without g: the search ends as soon as the first match is found.
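Ruby regular expressions have no g option. As far as I can tell, a similar "first match only" versus "all matches" distinction shows up in the choice of method instead: String#match versus String#scan. The sketch below uses a made-up string with two phone numbers (my own assumption, not from the reference article) to show the difference.

```ruby
# Hypothetical text containing two phone numbers
text = "Home: 03-1234-5678, Office: 06-9999-0000"
pattern = /\d\d-\d\d\d\d-\d\d\d\d/

# String#match stops at the first hit and returns a MatchData object (or nil)
first = text.match(pattern)
p first[0]
# => "03-1234-5678"

# String#scan keeps going and collects every match into an array
all = text.scan(pattern)
p all
# => ["03-1234-5678", "06-9999-0000"]
```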

Supporting other area codes

/\d\d-\d\d\d\d-\d\d\d\d/ does not match every phone number. For example, it misses

  • 090-1234-5678
  • 0120-1234-5678

and so on. Let's learn a regular expression that can handle these too. The important thing is to work out the pattern you want to search for. In this case, it is as follows.

  • 2 to 5 half-width digits
  • Hyphen
  • 1 to 4 half-width digits
  • Hyphen
  • Exactly 4 half-width digits

These appear in this order. The new pieces needed here are the metacharacters {n,m} and {n}. They are called quantifiers because they specify how many times something appears.

{n,m} means "the preceding character (or character class) repeated at least n and at most m times". For example, \d{1,4} means 1 to 4 half-width digits in a row.
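A small Ruby sketch of how a quantifier behaves on its own. The input "12345" is just an assumed sample; the point is that {1,4} takes as many digits as it is allowed to, while {2} takes exactly two.

```ruby
# {1,4} is greedy: it first grabs four digits, then the leftover "5"
chunks = "12345".scan(/\d{1,4}/)
p chunks
# => ["1234", "5"]

# {2} matches exactly two digits at a time; the trailing "5" is left over
pairs = "12345".scan(/\d{2}/)
p pairs
# => ["12", "34"]
```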

Applying this to the pattern above gives:

\d{2,5}-\d{1,4}-\d{4}
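To check the pattern in Ruby, here is a minimal sketch. The list of sample strings is my own; "1-2-3" is borrowed from the address line of the test string to confirm that it is not mistaken for a phone number.

```ruby
pattern = /\d{2,5}-\d{1,4}-\d{4}/

# Sample inputs: three phone numbers and the start of the address line
numbers = ["03-1234-5678", "090-1234-5678", "0120-1234-5678", "1-2-3"]

# match? returns true or false without building a MatchData object
matched = numbers.select { |n| n.match?(pattern) }
p matched
# => ["03-1234-5678", "090-1234-5678", "0120-1234-5678"]
```

"1-2-3" is rejected because the first part needs at least two digits and the last part needs exactly four.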

I also want to support parentheses!

The pattern above cannot handle numbers written like "03(1234)5678". So I want to change it so that both hyphens and parentheses are accepted.

The new pieces of the pattern are "a hyphen or (" and "a hyphen or )". This is where some new knowledge comes in.

"One character, either A or B" is written as [AB]. (Since it represents a set of characters, it is also a kind of character class.) By the way, there is no limit to the number of characters inside []. [ABC] represents any one of A, B, or C.

Therefore, "a hyphen or (" is written as [-(], and "a hyphen or )" is written as [-)]. Let's write the whole thing.

\d{2,5}[-(]\d{1,4}[-)]\d{4}
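Here is a quick Ruby check of this pattern. The sample strings are my own assumptions, chosen to include one mixed form and one form with spaces.

```ruby
pattern = /\d{2,5}[-(]\d{1,4}[-)]\d{4}/

# Sample inputs: hyphens, parentheses, a mixed form, and spaces
samples = ["03-1234-5678", "03(1234)5678", "03-1234)5678", "03 1234 5678"]

# Keep only the samples that the pattern accepts
accepted = samples.select { |s| s.match?(pattern) }
p accepted
# => ["03-1234-5678", "03(1234)5678", "03-1234)5678"]
```

Note that the mixed form "03-1234)5678" is accepted as well, because each [] is checked independently of the other. The form with spaces is rejected, since a space is in neither character class.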

Inside [], the hyphen can have a special meaning. For example, [A-Z] represents "A or B or C or ... Z", that is, one half-width uppercase letter. In other words, a hyphen inside [] can denote a range of characters.

If you put the hyphen at the beginning or end of [], as in [-AB] or [AB-], it is treated as a literal hyphen.
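The difference between a range hyphen and a literal hyphen can be seen with a short Ruby sketch. The input strings here are assumed samples of my own.

```ruby
# In the middle of [], the hyphen forms a range: [A-Z] is any uppercase letter
upper = "Ruby-3".scan(/[A-Z]/)
p upper
# => ["R"]

# [a-b] is the range from a to b, so the "-" in the text is NOT matched
range = "a-b".scan(/[a-b]/)
p range
# => ["a", "b"]

# At the beginning or end of [], the hyphen is just a hyphen
leading = "a-b".scan(/[-ab]/)
trailing = "a-b".scan(/[ab-]/)
p leading
# => ["a", "-", "b"]
p trailing
# => ["a", "-", "b"]
```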

Summary

  • \d represents one half-width digit.
  • {n,m} means the preceding character is repeated at least n and at most m times.
  • {n} means the preceding character is repeated exactly n times.
  • [ab] represents one character, either a or b.
  • [a-z] represents one character from a, b, c, ... z.
  • [-az] represents one character: -, a, or z.

Finally

As I said at the beginning, this post is a write-up of what I learned from the article below, so I recommend reading the original as well.

https://qiita.com/jnchito/items/893c887fbf19e17d3ff9