[Ruby] Convert URLs in a string to a tags with Rails


I wanted to convert the URLs contained in a string into a tags as part of mail processing in Rails.

A little research showed that URI.extract easily pulls the URLs out of a string, so I expected this to be fairly simple to write. When I actually wrote it, though, I fell into a trap, so I am writing it up here as a review.

TL;DR

This is the code I ended up with. How I got there, and why it should be written this way, is explained below.

def convert_url_to_a_element(text)
  uri_reg = URI.regexp(%w[http https])
  text.gsub(uri_reg) {%{<a href='#{$&}' target='_blank'>#{$&}</a>}}
end

text ='url1: http://hogehoge.com/hoge url2: http://hogehoge.com/fuga'
convert_url_to_a_element(text)
=> "url1: <a href='http://hogehoge.com/hoge' target='_blank'>http://hogehoge.com/hoge</a> url2: <a href='http://hogehoge.com/fuga' target='_blank'>http://hogehoge.com/fuga</a>"

Anti-pattern

First, the wrong way to write the process. Even this version handles text like the following without problems, which is why I did not notice the trap right away.

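def convert_url_to_a_element(text)
  URI.extract(text, %w[http https]).uniq.each do |url|
    sub_text = "<a href='#{url}' target='_blank'>#{url}</a>"
    # gsub returns a new string, so the result has to be assigned back
    text = text.gsub(url, sub_text)
  end
  text
end

text = 'url1: http://hogehoge.com url2: http://fugafuga.com'
convert_url_to_a_element(text)
=> "url1: <a href='http://hogehoge.com' target='_blank'>http://hogehoge.com</a> url2: <a href='http://fugafuga.com' target='_blank'>http://fugafuga.com</a>"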

URI.extract returns every URL-formatted string in the text as an array, as shown below.

text ='url1: http://hogehoge.com url2: http://fugafuga.com'
URI.extract(text, %w[http https])
=> ["http://hogehoge.com", "http://fugafuga.com"]

The anti-pattern loops over this array with each and replaces the URLs one at a time. However, when the text contains two URLs with the same domain name, as below, things go wrong.

text ='url1: http://hogehoge.com/hoge url2: http://hogehoge.com'
convert_url_to_a_element(text)
=> "url1: <a href='<a href='http://hogehoge.com' target='_blank'>http://hogehoge.com</a>/hoge' target='_blank'><a href='http://hogehoge.com' target='_blank'>http://hogehoge.com</a>/hoge</a> url2: <a href='http://hogehoge.com' target='_blank'>http://hogehoge.com</a>"

The output is badly mangled.

Cause

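The cause is that the second replacement runs on text that already contains the a tag produced by the first one. Here the second URL, http://hogehoge.com, also appears inside the first URL, http://hogehoge.com/hoge, so it is replaced again inside the tag that was just generated. In other words, this approach breaks whenever one URL in the text is contained in another, which typically happens with two or more URLs on the same host.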
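To make the double substitution visible, here is a minimal sketch that records the text after each pass. The helper name trace_replacements is my own and does not appear in the original code; it assumes the same per-URL replacement as the anti-pattern, with each result assigned back to text.

```ruby
require 'uri'

# Hypothetical helper for illustration only: performs the same per-URL
# replacement as the anti-pattern and keeps a snapshot after every pass.
def trace_replacements(text)
  snapshots = []
  URI.extract(text, %w[http https]).uniq.each do |url|
    anchor = "<a href='#{url}' target='_blank'>#{url}</a>"
    text = text.gsub(url, anchor) # replaces every literal occurrence of url
    snapshots << text
  end
  snapshots
end

steps = trace_replacements('url1: http://hogehoge.com/hoge url2: http://hogehoge.com')
steps.each_with_index { |s, i| puts "pass #{i + 1}: #{s}" }
```

After the first pass only the longer URL has been wrapped. The second pass then finds http://hogehoge.com three times: in the href, in the link text, and in the still-unconverted url2.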

Countermeasure

Double substitution can be prevented by passing the regular expression itself to gsub as the pattern, instead of looping with each over the strings returned by URI.extract. gsub then scans the original text once and never re-scans the tags it has inserted.

def convert_url_to_a_element(text)
  uri_reg = URI.regexp(%w[http https])
  text.gsub(uri_reg) {%{<a href='#{$&}' target='_blank'>#{$&}</a>}}
end
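
As a quick check, this sketch runs the same single-pass approach on the input that broke the each-based version. The function is repeated here only so the snippet runs on its own.

```ruby
require 'uri'

# Same approach as the function above: one gsub pass driven by the URL regexp.
def convert_url_to_a_element(text)
  uri_reg = URI.regexp(%w[http https])
  text.gsub(uri_reg) { %(<a href='#{$&}' target='_blank'>#{$&}</a>) }
end

text = 'url1: http://hogehoge.com/hoge url2: http://hogehoge.com'
result = convert_url_to_a_element(text)
puts result
```

Each URL is matched as a whole, so the shorter URL is no longer replaced inside the longer one.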

Supplementary note

About URI.regexp

URI.regexp is a method that returns a regular expression matching URL strings of the specified schemes. You could write such a pattern yourself, but this method makes it easy.

Looking at the return value, I certainly would not want to write this from scratch.

URI.regexp(%w[http https])
=> /(?=(?-mix:http|https):)
        ([a-zA-Z][\-+.a-zA-Z\d]*): (?# 1: scheme)
        (?:
           ((?:[\-_.!~*'()a-zA-Z\d;?:@&=+$,]|%[a-fA-F\d]{2})(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*) (?# 2: opaque)
        |
           (?:(?:
             \/\/(?:
                 (?:(?:((?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*)@)? (?# 3: userinfo)
                   (?:((?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\]))(?::(\d*))?))? (?# 4: host, 5: port)
               |
                 ((?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+) (?# 6: registry)
               )
             |
             (?!\/\/)) (?# XXX: '\/\/' is the mark for hostport)
             (\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)? (?# 7: path)
           )(?:\?((?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 8: query)
        )
        (?:\#((?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 9: fragment)
      /x
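
The numbered comments in the pattern (1: scheme, 4: host, 7: path, and so on) correspond to capture groups. As a rough illustration, the sketch below matches a sample sentence and reads a few of those groups. The sample text is my own, not from the original code.

```ruby
require 'uri'

uri_reg = URI.regexp(%w[http https])

# Sample text made up for this illustration.
m = uri_reg.match('see http://hogehoge.com/hoge?page=2 for details')

puts m[0] # whole match
puts m[1] # group 1: scheme
puts m[4] # group 4: host
puts m[7] # group 7: path
puts m[8] # group 8: query
```

For the conversion to a tags only the whole match is needed, which is what $& refers to inside the gsub block.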

About gsub

gsub also accepts a plain string as the pattern instead of a regular expression. The first version simply passed each extracted URL string to gsub in turn, so when one URL was contained in another (for example, two URLs on the same domain), the replacement also ran on text that had already been turned into an a tag, producing a broken string.

It is obvious in hindsight, but I struggled for a surprisingly long time before this fix occurred to me. With the regular expression as the pattern, a single gsub call is enough:

text.gsub!(uri_reg) {%{<a href="#{$&}">#{$&}</a>}}
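
To compare the two kinds of pattern side by side, here is a small sketch. The bracket markers are placeholders I chose so the difference is easy to see; they are not part of the original code. It also shows that the block can take the match as an argument instead of reading $&.

```ruby
require 'uri'

text = 'url1: http://hogehoge.com/hoge url2: http://hogehoge.com'
uri_reg = URI.regexp(%w[http https])

# String pattern: every literal occurrence is replaced, including the one
# inside the longer URL.
by_string = text.gsub('http://hogehoge.com', '[URL]')

# Regexp pattern with a block: each match is a whole URL, and the block
# receives it as its argument (the same string that $& holds).
by_regexp = text.gsub(uri_reg) { |url| "[#{url}]" }

puts by_string
puts by_regexp
```

Note that gsub returns a new string, while gsub! modifies the receiver in place.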

About URI.extract

URI.extract, which I tried first, returns only the URL strings in a text that match the specified schemes. I did not end up using it this time, but it is handy when all you need is the list of URLs.

text ='aaaaa http://xxx.com/hoge bbbbb http://xxx.com'
URI.extract(text, %w[http https])
=> ["http://xxx.com/hoge", "http://xxx.com"]
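
As one possible use, the sketch below lists the distinct hosts mentioned in a text. The ftp URL in the sample is my own addition, included to show that schemes outside the given list are skipped.

```ruby
require 'uri'

# Sample text made up for this illustration; the ftp URL is there to show
# that only the listed schemes are extracted.
text = 'aaaaa http://xxx.com/hoge bbbbb http://xxx.com ccccc ftp://xxx.com/file'

urls  = URI.extract(text, %w[http https])
hosts = urls.map { |u| URI.parse(u).host }.uniq

p urls
p hosts
```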

Summary

  • To convert URLs into a tags, pass a regular expression to gsub and replace every match in a single pass.
  • The regular expression itself is easy to obtain with URI.regexp.

There were some twists and turns, but I think the code turned out well. If you know a better way to write it, please let me know.
