Ruby: I tried to find out where Nokogiri looks to determine the encoding on its own


In an earlier article, I concluded that when the encoding argument is nil, Nokogiri looks at the charset of the meta element in the HTML being parsed. This time, I followed the official documentation to check whether that conclusion actually holds.

Follow the official documentation

This time I will follow the official Nokogiri documentation. It is in English, of course. I usually avoid official English documentation, but this time I decided to read it. Even if you can't read English, you can read the code. Probably.

Nokogiri::HTML::Document class

Normally, when parsing with Nokogiri, you write Nokogiri::HTML.parse(html), but officially this seems to go through the Nokogiri::HTML::Document class. Open the Document class entry, look for the .parse method, and view its source with "view source".

The source is below.


def parse string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML

  options = Nokogiri::XML::ParseOptions.new(options) if Integer === options
  # Give the options to the user
  yield options if block_given?

  if string_or_io.respond_to?(:encoding)
    unless string_or_io.encoding.name == "ASCII-8BIT"
      encoding ||= string_or_io.encoding.name
    end
  end

  if string_or_io.respond_to?(:read)
    url ||= string_or_io.respond_to?(:path) ? string_or_io.path : nil
    unless encoding
      # Libxml2's parser has poor support for encoding
      # detection.  First, it does not recognize the HTML5
      # style meta charset declaration.  Secondly, even if it
      # successfully detects an encoding hint, it does not
      # re-decode or re-parse the preceding part which may be
      # garbled.
      # EncodingReader aims to perform advanced encoding
      # detection beyond what Libxml2 does, and to emulate
      # rewinding of a stream and make Libxml2 redo parsing
      # from the start when an encoding hint is found.
      string_or_io = EncodingReader.new(string_or_io)
      begin
        return read_io(string_or_io, url, encoding, options.to_i)
      rescue EncodingFound => e
        encoding = e.found_encoding
      end
    end
    return read_io(string_or_io, url, encoding, options.to_i)
  end

  # read_memory pukes on empty docs
  if string_or_io.nil? or string_or_io.empty?
    return encoding ? new.tap { |i| i.encoding = encoding } : new
  end

  encoding ||= EncodingReader.detect_encoding(string_or_io)

  read_memory(string_or_io, url, encoding, options.to_i)
end

Let's look at this part first.


  if string_or_io.respond_to?(:encoding)
    unless string_or_io.encoding.name == "ASCII-8BIT"
      encoding ||= string_or_io.encoding.name
    end
  end

string_or_io is the variable you normally pass the HTML in. My reading: if string_or_io has an `encoding` method, its encoding name is not `ASCII-8BIT`, and the `encoding` argument was not given, then `encoding` becomes the encoding name of string_or_io.

I see! So unless you open the HTML in binary mode, the result depends on how you opened it (with which encoding), which is why garbled characters can appear after parsing!
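To convince myself of this reading, here is a minimal sketch in plain Ruby (no Nokogiri required). `encoding_hint` is a hypothetical helper that copies only the string-encoding check as I read it above; the sample HTML and the temp file are made up for illustration. It is not Nokogiri's implementation, just a way to see what the string itself reports.

```ruby
require "tempfile"

# Hypothetical helper (not part of Nokogiri): mirrors only the
# string-encoding check described above.
def encoding_hint(string_or_io, encoding = nil)
  if string_or_io.respond_to?(:encoding)
    unless string_or_io.encoding.name == "ASCII-8BIT"
      encoding ||= string_or_io.encoding.name
    end
  end
  encoding
end

# Made-up sample page, saved as Shift_JIS bytes.
html = "<html><head><meta charset=\"Shift_JIS\"></head><body>\u3042</body></html>"

reads = Tempfile.create(["sample", ".html"]) do |f|
  File.binwrite(f.path, html.encode("Shift_JIS"))
  {
    utf8:   File.read(f.path, encoding: "UTF-8"),     # wrong guess for this file
    sjis:   File.read(f.path, encoding: "Shift_JIS"), # correct guess
    binary: File.binread(f.path),                     # no guess at all
  }
end

reads.each do |how, string|
  puts "#{how}: string encoding=#{string.encoding.name}, hint=#{encoding_hint(string).inspect}"
end
```

In this sketch the string read in binary mode yields no hint (nil), while the other two simply pass along whatever encoding the file was opened with, whether or not that matches the actual bytes.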

So what happens when you open the file in binary mode and the `encoding` argument is nil? Let's look at this part next.


encoding ||= EncodingReader.detect_encoding(string_or_io)

If `encoding` is still nil at this point, the `EncodingReader.detect_encoding` method is used. Let's move on to the `EncodingReader.detect_encoding` method in the documentation.

View the source as before. The source is below.


def self.detect_encoding(chunk)
  if Nokogiri.jruby? && EncodingReader.is_jruby_without_fix?
    return EncodingReader.detect_encoding_for_jruby_without_fix(chunk)
  end
  m = chunk.match(/\A(<\?xml[ \t\r\n]+[^>]*>)/) and
    return Nokogiri.XML(m[1]).encoding

  if Nokogiri.jruby?
    m = chunk.match(/(<meta\s)(.*)(charset\s*=\s*([\w-]+))(.*)/i) and
      return m[4]
    catch(:encoding_found) {
      Nokogiri::HTML::SAX::Parser.new(JumpSAXHandler.new(:encoding_found)).parse(chunk)
      nil
    }
  else
    handler = SAXHandler.new
    parser = Nokogiri::HTML::SAX::PushParser.new(handler)
    parser << chunk rescue Nokogiri::SyntaxError
    handler.encoding
  end
end

The method argument chunk receives string_or_io here, that is, what you normally pass as the HTML.

There are many unfamiliar methods, so I can't work out the exact meaning, but doesn't the second if block contain something that refers to the meta charset? It returns a value with return, so this part looks very suspicious. That block is guarded by `Nokogiri.jruby?`, though, so it may only apply when running on JRuby.
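To see what those two regular expressions actually capture, here is a small sketch in plain Ruby. The patterns are copied from the excerpt; `meta_charset` and `xml_declaration` are hypothetical helper names, and the sample tags are made up. It only exercises the regexes and says nothing about the rest of the detection logic.

```ruby
# Patterns copied from the detect_encoding excerpt.
XML_DECL_RE     = /\A(<\?xml[ \t\r\n]+[^>]*>)/
META_CHARSET_RE = /(<meta\s)(.*)(charset\s*=\s*([\w-]+))(.*)/i

# Hypothetical helper: charset name captured by the meta regex, or nil.
def meta_charset(chunk)
  m = chunk.match(META_CHARSET_RE)
  m && m[4]
end

# Hypothetical helper: leading XML declaration, or nil.
def xml_declaration(chunk)
  m = chunk.match(XML_DECL_RE)
  m && m[1]
end

legacy   = '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
unquoted = '<meta charset=utf-8>'
quoted   = '<meta charset="utf-8">'

p meta_charset(legacy)    # => "Shift_JIS"
p meta_charset(unquoted)  # => "utf-8"
p meta_charset(quoted)    # => nil (the quote is not matched by [\w-]+)
p xml_declaration(%(<?xml version="1.0" encoding="EUC-JP"?><html></html>))
```

So the meta regex does pull the charset name out of the tag, at least for the unquoted forms. The quoted HTML5 form does not match this particular pattern, which is one more reason not to read too much into this branch alone.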

In closing

I haven't figured out the details of the source yet, but I feel I've come closer to the answer I was looking for. Once I understand the details of the source, I will summarize them in another article.

