Ruby: Nokogiri automatically detects the character encoding of HTML read in binary mode

Introduction

I was scraping pages by reading HTML with open-uri and parsing it with Nokogiri. When HTML is read with open-uri it comes in as binary, so I looked into what happens when HTML read in binary mode is parsed by Nokogiri with the encoding set to nil.

Try setting the HTML.parse encoding to "nil"

Let's set the encoding to nil when parsing HTML with Nokogiri by passing nil as the third argument of HTML.parse. This time we load the following HTML. The file is saved in Shift_JIS.

hello.html


<html>
  <head>
    <title>Hello</title>
    <meta charset="Shift_JIS">
  </head>
  <body>
  </body>
</html>

Read the HTML in binary mode. You can read a file as binary by passing the 'rb' mode to the open method. For verification, let's print the encoding of the string right after the binary read and the encoding of the document after parsing with Nokogiri.

sample.rb


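require 'nokogiri'

# Read the file in binary mode; the block form closes the file afterwards.
html = File.open('hello.html', 'rb') { |f| f.read }

# Encoding of the raw string, then the encoding Nokogiri reports.
p html.encoding
# nil as the third argument (encoding) leaves the choice to Nokogiri.
p Nokogiri::HTML.parse(html, nil, nil).encoding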

Execution result

sample.rb result


$ ruby sample.rb
#<Encoding:ASCII-8BIT>
"Shift_JIS"

The result confirms that the string that was read is tagged ASCII-8BIT, while the document parsed by Nokogiri reports Shift_JIS, matching the original file. By the way, you get the same result if you omit the arguments and simply call HTML.parse(html).

Where does Nokogiri get the character encoding from?

Judging from the result above, Nokogiri is looking up the character encoding somewhere on its own. Where does it get it from?

It turns out that Nokogiri refers to the meta element of the original HTML file, specifically the charset in <meta charset="Shift_JIS">.

Let's change the charset to UTF-8 and print the encodings in the same way as before.

hello.html


<html>
  <head>
    <title>Hello</title>
    <meta charset="UTF-8">
  </head>
  <body>
  </body>
</html>

Execution result

sample.rb result


$ ruby sample.rb
#<Encoding:ASCII-8BIT>
"UTF-8"

You can see that the character code after parsing has changed to UTF-8.
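To make the idea of "looking at the meta element" more concrete, here is a simplified illustration in plain Ruby. It is not Nokogiri's actual implementation: sniff_charset is a hypothetical helper written for this example, and it only handles the simple <meta charset="..."> form. The sample markup with a Japanese title is also an assumption added here to show the effect on non-ASCII text.

```ruby
# Simplified illustration only; Nokogiri's real detection is more thorough.
# sniff_charset is a hypothetical helper, not part of Nokogiri.
def sniff_charset(binary_html)
  # An ASCII-only pattern can be matched against a binary string.
  binary_html[/<meta[^>]*charset\s*=\s*["']?\s*([\w-]+)/i, 1]
end

# Build Shift_JIS bytes and drop the encoding tag, as a binary read would.
source = '<html><head><meta charset="Shift_JIS"><title>こんにちは</title></head></html>'
html = source.encode('Shift_JIS').b

charset = sniff_charset(html)
p html.encoding # binary, like a file opened with 'rb'
p charset       # the name found in the meta element

# Re-tag the bytes with the sniffed encoding, then convert to UTF-8.
decoded = html.dup.force_encoding(charset).encode('UTF-8')
puts decoded[/<title>(.*?)<\/title>/, 1]
```

The point of the sketch is that the bytes themselves carry no encoding label once read as binary, so the declaration inside the markup is the only hint available.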

By the way, here is what happens when the charset is removed.

hello.html


<html>
  <head>
    <title>Hello</title>
    <meta>
  </head>
  <body>
  </body>
</html>

sample.rb result


$ ruby sample.rb
#<Encoding:ASCII-8BIT>
nil

The encoding after parsing is now nil. In this state, non-ASCII text such as a Japanese title can come out garbled when you display it.

Summary

- If you read HTML in binary mode and set Nokogiri's encoding to nil, Nokogiri looks up the character encoding by itself.
- Nokogiri refers to the charset of the meta element in the HTML that was read in binary mode.
- If no charset is written, Nokogiri's encoding will be nil.
