In an earlier article, I concluded that when the encoding argument is `nil`, Nokogiri looks at the charset of the meta element in the HTML it is parsing.
This time, I followed the official documentation to check whether that conclusion actually holds.
This time I will follow the official Nokogiri documentation. It is in English, of course. I usually avoid official English documentation, but this time I decided to read it. Even if you can't read English, you can read the code. Probably.
Normally, parsing with Nokogiri is written as `Nokogiri::HTML.parse(html)`, but officially the method seems to belong to the `Nokogiri::HTML::Document` class.
Open the Document class page, look for the `.parse` method, and view its source with "view source".
The source is shown below.
lib/nokogiri/html/document.rb
    def parse string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML
      options = Nokogiri::XML::ParseOptions.new(options) if Integer === options
      # Give the options to the user
      yield options if block_given?

      if string_or_io.respond_to?(:encoding)
        unless string_or_io.encoding.name == "ASCII-8BIT"
          encoding ||= string_or_io.encoding.name
        end
      end

      if string_or_io.respond_to?(:read)
        url ||= string_or_io.respond_to?(:path) ? string_or_io.path : nil
        unless encoding
          # Libxml2's parser has poor support for encoding
          # detection. First, it does not recognize the HTML5
          # style meta charset declaration. Secondly, even if it
          # successfully detects an encoding hint, it does not
          # re-decode or re-parse the preceding part which may be
          # garbled.
          #
          # EncodingReader aims to perform advanced encoding
          # detection beyond what Libxml2 does, and to emulate
          # rewinding of a stream and make Libxml2 redo parsing
          # from the start when an encoding hint is found.
          string_or_io = EncodingReader.new(string_or_io)
          begin
            return read_io(string_or_io, url, encoding, options.to_i)
          rescue EncodingFound => e
            encoding = e.found_encoding
          end
        end
        return read_io(string_or_io, url, encoding, options.to_i)
      end

      # read_memory pukes on empty docs
      if string_or_io.nil? or string_or_io.empty?
        return encoding ? new.tap { |i| i.encoding = encoding } : new
      end

      encoding ||= EncodingReader.detect_encoding(string_or_io)

      read_memory(string_or_io, url, encoding, options.to_i)
    end
Let's look at this part first.
lib/nokogiri/html/document.rb
    if string_or_io.respond_to?(:encoding)
      unless string_or_io.encoding.name == "ASCII-8BIT"
        encoding ||= string_or_io.encoding.name
      end
    end
`string_or_io` is the variable that receives what you normally pass as the HTML.
My reading: if `string_or_io` has an `encoding` method, its encoding name is not `ASCII-8BIT`, and the `encoding` argument was not given, then `encoding` is set to the encoding name of `string_or_io`.
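To check this reading, the guard can be tried in isolation. The following is a minimal sketch in plain Ruby; `pick_encoding` is a hypothetical helper written for this article, not a Nokogiri method. It only copies the branch logic quoted above and returns whatever `encoding` ends up as.

```ruby
# Hypothetical helper that mirrors the encoding guard quoted above.
# It is not part of Nokogiri; it only reproduces the branch logic.
def pick_encoding(string_or_io, encoding = nil)
  if string_or_io.respond_to?(:encoding)
    unless string_or_io.encoding.name == "ASCII-8BIT"
      encoding ||= string_or_io.encoding.name
    end
  end
  encoding
end

utf8   = "<html></html>".encode("UTF-8")
binary = "<html></html>".b # String#b returns a copy tagged as ASCII-8BIT

p pick_encoding(utf8)               # the string's own encoding name is used
p pick_encoding(binary)             # stays nil, so detection is left for later
p pick_encoding(utf8, "Shift_JIS")  # an explicitly passed encoding is kept
```

So in this sketch, only a binary string with no explicit argument leaves the guard with `encoding` still `nil`.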
I see! So unless you open the HTML in binary mode, the result depends on the encoding the HTML was opened with, which is why garbled characters can appear after parsing!
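Here is a small sketch of how the way a file is read changes the encoding tag on the resulting string. It uses only the Ruby standard library; the temporary file and its contents are made up for illustration.

```ruby
require "tempfile"

# Build a small HTML fragment and save it as Shift_JIS bytes.
# The "\u..." escapes spell a short Japanese word, so the bytes are not plain ASCII.
html = "<p>\u65E5\u672C\u8A9E</p>"
file = Tempfile.new(["sample", ".html"])
file.binmode
file.write(html.encode("Shift_JIS"))
file.close

# Text mode: the string is tagged with the encoding given at read time,
# whatever the bytes really are.
as_text = File.read(file.path, encoding: "UTF-8")

# Binary mode: the string is tagged as ASCII-8BIT.
as_binary = File.binread(file.path)

puts as_text.encoding.name    # tag chosen when reading
puts as_text.valid_encoding?  # the Shift_JIS bytes are not valid UTF-8
puts as_binary.encoding.name  # binary tag

file.unlink
```

The bytes are identical in both cases; only the tag differs. With the guard above, the text-mode string would hand its (wrong) tag to the parser, while the binary string would not.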
So what happens when you open the file in binary mode and the `encoding` argument is `nil`?
Now let's focus on this line.
lib/nokogiri/html/document.rb
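    encoding ||= EncodingReader.detect_encoding(string_or_io)

If `encoding` is still `nil` at this point, it is set to the return value of the `EncodingReader.detect_encoding` method. (Judging from the source above, this line is only reached when `string_or_io` is a string; anything that responds to `read` has already returned through `read_io`.)
Next, go to the `EncodingReader.detect_encoding` method in the documentation.
View the source as before. The source is shown below.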
lib/nokogiri/html/document.rb
    def self.detect_encoding(chunk)
      if Nokogiri.jruby? && EncodingReader.is_jruby_without_fix?
        return EncodingReader.detect_encoding_for_jruby_without_fix(chunk)
      end
      m = chunk.match(/\A(<\?xml[ \t\r\n]+[^>]*>)/) and
        return Nokogiri.XML(m[1]).encoding

      if Nokogiri.jruby?
        m = chunk.match(/(<meta\s)(.*)(charset\s*=\s*([\w-]+))(.*)/i) and
          return m[4]
        catch(:encoding_found) {
          Nokogiri::HTML::SAX::Parser.new(JumpSAXHandler.new(:encoding_found)).parse(chunk)
          nil
        }
      else
        handler = SAXHandler.new
        parser = Nokogiri::HTML::SAX::PushParser.new(handler)
        parser << chunk rescue Nokogiri::SyntaxError
        handler.encoding
      end
    end
This time the method argument `chunk` receives `string_or_io`, that is, what you normally pass as the HTML.
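There are many unfamiliar methods, so I can't follow the exact meaning, but the second `if` block contains a regular expression that refers to the meta charset, and the matched value is returned with `return`. This part looks very suspicious. That said, this block is guarded by `Nokogiri.jruby?`, so it only runs on JRuby; otherwise the `else` branch feeds the chunk to a SAX push parser and returns `handler.encoding`, which I have not traced yet.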
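The two regular expressions in the quoted source can be tried on their own in plain Ruby, without Nokogiri. This is only a sketch: `charset_from_meta` is a hypothetical helper name, and the sample strings are made up.

```ruby
# Hypothetical helper: applies the meta regex copied from the quoted source
# and returns the captured charset name, or nil when nothing matches.
def charset_from_meta(chunk)
  m = chunk.match(/(<meta\s)(.*)(charset\s*=\s*([\w-]+))(.*)/i)
  m && m[4]
end

html4        = '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
html5_bare   = '<meta charset=utf-8>'
html5_quoted = '<meta charset="utf-8">'
xml_doc      = '<?xml version="1.0" encoding="EUC-JP"?><html></html>'

p charset_from_meta(html4)         # charset inside the content attribute
p charset_from_meta(html5_bare)    # unquoted HTML5 style
p charset_from_meta(html5_quoted)  # the quote right after "=" is not in [\w-]

# The first regex in detect_encoding only captures a leading XML declaration.
m = xml_doc.match(/\A(<\?xml[ \t\r\n]+[^>]*>)/)
puts m[1]
```

In this sketch the quoted HTML5 form is not matched by the regex alone, so in the quoted source such input would presumably fall through to the SAX-based step that follows.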
I haven't figured out the details of the source yet, but I feel I've come closer to the answer I was looking for. Once I understand the details of the source, I will summarize them in another article.