Ruby: I tried to find out for myself where Nokogiri looks for the encoding

Introduction

In a previous article, I concluded that when the encoding argument is nil, Nokogiri looks at the charset of the meta element in the HTML being parsed. This time, I followed the official documentation to check whether that conclusion really holds.

Follow the official documentation

This time I will follow the official Nokogiri documentation. Of course, it is in English. I usually avoid official documents written in English, but this time I decided to read them. Even if you can't read English, you can read the code. Probably.

Nokogiri::HTML::Document class

Normally, parsing with Nokogiri is written as Nokogiri::HTML.parse(html), but officially the method seems to belong to the Nokogiri::HTML::Document class. Open the Document class page, look for the .parse method, and click "view source" to see its source.

Source below

lib/nokogiri/html/document.rb


def parse string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML

  options = Nokogiri::XML::ParseOptions.new(options) if Integer === options
  # Give the options to the user
  yield options if block_given?

  if string_or_io.respond_to?(:encoding)
    unless string_or_io.encoding.name == "ASCII-8BIT"
      encoding ||= string_or_io.encoding.name
    end
  end

  if string_or_io.respond_to?(:read)
    url ||= string_or_io.respond_to?(:path) ? string_or_io.path : nil
    unless encoding
      # Libxml2's parser has poor support for encoding
      # detection.  First, it does not recognize the HTML5
      # style meta charset declaration.  Secondly, even if it
      # successfully detects an encoding hint, it does not
      # re-decode or re-parse the preceding part which may be
      # garbled.
      #
      # EncodingReader aims to perform advanced encoding
      # detection beyond what Libxml2 does, and to emulate
      # rewinding of a stream and make Libxml2 redo parsing
      # from the start when an encoding hint is found.
      string_or_io = EncodingReader.new(string_or_io)
      begin
        return read_io(string_or_io, url, encoding, options.to_i)
      rescue EncodingFound => e
        encoding = e.found_encoding
      end
    end
    return read_io(string_or_io, url, encoding, options.to_i)
  end

  # read_memory pukes on empty docs
  if string_or_io.nil? or string_or_io.empty?
    return encoding ? new.tap { |i| i.encoding = encoding } : new
  end

  encoding ||= EncodingReader.detect_encoding(string_or_io)

  read_memory(string_or_io, url, encoding, options.to_i)
end

Let's look at this part first.

lib/nokogiri/html/document.rb


  if string_or_io.respond_to?(:encoding)
    unless string_or_io.encoding.name == "ASCII-8BIT"
      encoding ||= string_or_io.encoding.name
    end
  end

string_or_io is the variable to which you normally pass the HTML. My reading: if string_or_io has an `encoding` method, its encoding name is not `ASCII-8BIT`, and the `encoding` argument was not given (it is nil), then `encoding` is set to the encoding name of string_or_io.
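To check this reading, the guard can be reproduced in plain Ruby without Nokogiri. This is only a sketch: `initial_encoding` is a name made up for illustration, and it compares against the `Encoding::BINARY` constant (an alias of `Encoding::ASCII_8BIT`) instead of the name string used in the original source.

```ruby
# Hypothetical helper (not part of Nokogiri) that mimics only the
# encoding guard at the top of Document.parse.
def initial_encoding(string_or_io, encoding = nil)
  if string_or_io.respond_to?(:encoding)
    # Binary strings carry no usable hint, so they are skipped.
    unless string_or_io.encoding == Encoding::BINARY
      # An explicitly passed encoding wins; otherwise use the string's own.
      encoding ||= string_or_io.encoding.name
    end
  end
  encoding
end

utf8_html   = "<p>hello</p>".encode("UTF-8")
binary_html = "<p>hello</p>".b

p initial_encoding(utf8_html)              # encoding name of the string
p initial_encoding(binary_html)            # nil, so detection would run later
p initial_encoding(utf8_html, "Shift_JIS") # the explicit argument is kept
```

In this sketch a nil result corresponds to the case where the real method would go on to the detection step.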

I see! So unless the HTML is read in binary mode, the encoding Nokogiri uses is whatever encoding Ruby attached to the string when it was read. If that differs from the real encoding of the document, garbled characters may appear after parsing!

So what happens when you open the file in binary mode and the `encoding` argument is nil? Now let's focus on this part.
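The effect of the open mode can be seen with the standard library alone. The sketch below writes Shift_JIS bytes to a temporary file and reads them back three ways; `encodings_by_open_mode` and the sample text are assumptions for illustration, and it assumes `Encoding.default_internal` is unset (the usual default).

```ruby
require "tempfile"

# Hypothetical helper: reads the same bytes in three ways and reports
# which Encoding Ruby attaches to each resulting String.
def encodings_by_open_mode(bytes)
  Tempfile.create("sample") do |f|
    f.binmode
    f.write(bytes)
    f.close
    {
      binary:   File.binread(f.path).encoding,                      # no hint
      declared: File.read(f.path, encoding: "Shift_JIS").encoding,  # correct label
      wrong:    File.read(f.path, encoding: "UTF-8").encoding       # wrong label
    }
  end
end

# Three kanji written with \u escapes, converted to Shift_JIS bytes.
sjis_html = "<p>\u65E5\u672C\u8A9E</p>".encode("Shift_JIS")

result = encodings_by_open_mode(sjis_html)
result.each { |mode, enc| puts "#{mode}: #{enc}" }
```

Only the binary read leaves the string without an encoding hint; in the `wrong` case the bytes are labeled UTF-8 even though they are Shift_JIS, which is the situation where the guard above would hand a misleading encoding name to the parser.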

lib/nokogiri/html/document.rb


encoding ||= EncodingReader.detect_encoding(string_or_io)

If the `encoding` argument is still nil at this point, the encoding is determined by the `EncodingReader.detect_encoding` method. Let's head over to the `EncodingReader.detect_encoding` method in the documentation.

View the source the same way as before. The source is below.

lib/nokogiri/html/document.rb


def self.detect_encoding(chunk)
  if Nokogiri.jruby? && EncodingReader.is_jruby_without_fix?
    return EncodingReader.detect_encoding_for_jruby_without_fix(chunk)
  end
  m = chunk.match(/\A(<\?xml[ \t\r\n]+[^>]*>)/) and
    return Nokogiri.XML(m[1]).encoding

  if Nokogiri.jruby?
    m = chunk.match(/(<meta\s)(.*)(charset\s*=\s*([\w-]+))(.*)/i) and
      return m[4]
    catch(:encoding_found) {
      Nokogiri::HTML::SAX::Parser.new(JumpSAXHandler.new(:encoding_found)).parse(chunk)
      nil
    }
  else
    handler = SAXHandler.new
    parser = Nokogiri::HTML::SAX::PushParser.new(handler)
    parser << chunk rescue Nokogiri::SyntaxError
    handler.encoding
  end
end

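This time the method argument chunk receives string_or_io, that is, the HTML you normally pass in.

There are many unfamiliar methods, so I can't grasp the exact meaning, but the second if block (the `if Nokogiri.jruby?` branch) seems to contain a regular expression that refers to the meta charset. Its captured value is returned with return, so this part looks very suspicious. Note that this branch only runs on JRuby; otherwise the else branch pushes chunk into a SAX push parser and returns handler.encoding, so the meta charset is presumably picked up there by SAXHandler, which I have not read yet.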
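To see what that regular expression actually captures, it can be tried on its own in plain Ruby. The regex is copied from the source above; the constant name `META_CHARSET`, the helper `charset_from_meta`, and the sample tags are my own assumptions for illustration.

```ruby
# Regex copied from the JRuby branch of detect_encoding; the constant
# name META_CHARSET is a label made up for this sketch.
META_CHARSET = /(<meta\s)(.*)(charset\s*=\s*([\w-]+))(.*)/i

# Hypothetical helper: returns the captured charset name, or nil.
def charset_from_meta(chunk)
  m = chunk.match(META_CHARSET)
  m && m[4]
end

legacy   = '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
unquoted = '<meta charset=utf-8>'
quoted   = '<meta charset="utf-8">'

p charset_from_meta(legacy)   # => "Shift_JIS"
p charset_from_meta(unquoted) # => "utf-8"
p charset_from_meta(quoted)   # => nil (the quote is not matched by [\w-])
```

The fourth capture group is the charset name, which matches the `return m[4]` in the source. The quoted HTML5 form does not match this regex at all, which may be why the branch then falls through to the SAX-based handler.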

In closing

I haven't figured out the details of the source yet, but I feel I have come closer to the answer I am looking for. Once I understand the details of the source, I will summarize them in another article.
