This is part of the contents. The original is encoded in utf-8
and 8-bit (attached to this mail). Here "\343\200\n \202" is the
encoded form of "。", i.e., "\343\200\202", but broken in the
middle of the byte sequence. It seems that some mail software
broke it when wrapping a long encoded line.
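To make the breakage concrete, here is a minimal Python sketch (not part of the original report) that compares the intact byte sequence with the broken one quoted above. The helper name `strict_decode_fails` is made up for this illustration.

```python
# The three UTF-8 bytes of "。" (U+3002), written in octal as in the report.
intact = b"\343\200\202"
# The same bytes with a newline and a space inserted after the second byte.
broken = b"\343\200\n \202"


def strict_decode_fails(data: bytes) -> bool:
    """Return True if a strict UTF-8 decode of `data` raises an error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# A lenient decode substitutes U+FFFD for the undecodable bytes and
# keeps the inserted newline and space.
lenient = broken.decode("utf-8", errors="replace")
```

A strict decoder rejects the broken sequence, while a lenient one keeps going and marks the damaged bytes, which is roughly the difference between the two behaviours discussed in this report.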
When I read the mail using Gnus + shr, the text after the broken
point is all cut off. That is what libxml-parse-html-region does,
whereas xml-parse-region doesn't cut it. Moreover a web browser,
to which I send the html data using the `K H' command, shows all
the text (the broken character is shown as is, though).
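As a rough illustration of the lenient behaviour described here, the following sketch uses Python's standard `html.parser` (not libxml, and not anything Emacs uses) to show that text after the broken character can survive. The sample markup and the names `TextCollector` and `visible_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect all character data seen while parsing."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(raw: bytes) -> str:
    # Decode leniently: undecodable bytes become U+FFFD instead of
    # aborting, so everything after the broken point is still parsed.
    text = raw.decode("utf-8", errors="replace")
    parser = TextCollector()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical message body containing the broken sequence from the report.
sample = b"<html><body><p>before\343\200\n \202after</p></body></html>"
result = visible_text(sample)
```

Here `result` still contains the text on both sides of the damaged character, with replacement characters standing in for the broken bytes.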
This is not necessarily a libxml bug anyway, but I hope it works
like xml-parse.
In GNU Emacs 26.0.91 (build 1, x86_64-unknown-cygwin, GTK+ Version 3.22.28)
of 2018-03-12 built on localhost
Windowing system distributor 'The Cygwin/X Project', version 11.0.11906000
> When I read the mail using Gnus + shr, the text after the broken
> point is all cut off. That is what libxml-parse-html-region does,
> whereas xml-parse-region doesn't cut it. Moreover a web browser,
> to which I send the html data using the `K H' command, shows all
> the text (the broken character is shown as is, though).
> This is not necessarily a libxml bug anyway, but I hope it works
> like xml-parse.
libxml is more strict about correctness of the input than most other
HTML parsers. I don't think there's anything we can do about this
problematic input other than ponder whether Emacs should use a different
HTML parser, which I think sounds unlikely. :-)
On Tue, 13 Mar 2018 01:44:22 +0100, Lars Ingebrigtsen wrote:
> libxml is more strict about correctness of the input than most other
> HTML parsers. I don't think there's anything we can do about this
> problematic input other than ponder whether Emacs should use a different
> HTML parser, which I think sounds unlikely. :-)
I see. I agree not to modify libxml. Jidanni, how about trying
the following patch personally if you often get such broken mails?
Though I'm not sure it won't cause other problems, it at least
fixes the mail in question.
Thank you for the patch but the real answer is to do what all other
browsers do... show as much as possible.
There is no browser out there that would dream of dying on the slightest
Anyway if you guys are really going to use XML::LibXML::Parser (?) then
maybe loosen up some of
/parser, html, reader/
recover from errors; possible values are 0, 1, and 2
A true value turns on recovery mode which allows one to parse broken XML or HTML data. The recovery mode allows the parser to return the successfully parsed
portion of the input document. This is useful for almost well-formed documents, where for example a closing tag is missing somewhere. Still, XML::LibXML will
only parse until the first fatal (non-recoverable) error occurs, reporting recoverable parsing errors as warnings. To suppress even these warnings, use
Note that validation is switched off automatically in recovery mode.
validate with the DTD; possible values are 0 and 1
XML::LibXML throws exceptions during parsing, validation or XPath processing (and some other occasions). These errors can be caught by using eval blocks. The error
is stored in $@. There are two implementations: the old one throws $@ which is just a message string, in the new one $@ is an object from the class
XML::LibXML::Error; this class overrides the operator "" so that when printed, the object flattens to the usual error message.
XML::LibXML throws errors as they occur. This is a very common misunderstanding in the use of XML::LibXML. If the eval is omitted, XML::LibXML will always halt your
script by "croaking" (see Carp man page for details).
Also note that an increasing number of functions throw errors if bad data is passed as arguments. If you cannot assure valid data passed to XML::LibXML you should
eval these functions.
Note: since version 1.59, get_last_error() is no longer available in XML::LibXML for thread-safety reasons.
> LI> I agree, and you should report these problems to the libxml2
> LI> maintainers.
> I would not want to ruin my reputation by letting them know I was
> inputting unvalidated XML and expecting whatever results.