bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't


Katsumi Yamaoka
Hi,

Jidanni mailed me an example html mail that contains broken
encoded text, as follows:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    .......公告辦理現金救助及低利貸款\343\200
 \202因2月
 低溫危害農作物為延遲性損害,.......
  </body>
</html>

This is part of the contents.  The original uses utf-8 with 8-bit
transfer encoding (attached to this mail).  Here "\343\200\n \202"
is the encoded form of "。", i.e., "\343\200\202", but broken in the
middle of the byte sequence.  It seems that some stupid mail software
did this when folding a long encoded line.
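
To make the breakage concrete, here is a minimal Python sketch (Python is
used only for illustration; the thread itself is about Emacs Lisp).  It
assumes the bytes are exactly the ones quoted above, and the helper name
`strict_decodes` is made up for this example.  It shows that the split
sequence is no longer well-formed utf-8, and that a lenient decoder can
still keep the text that follows it.

```python
# The three UTF-8 bytes of "。" (U+3002); the report writes them in octal
# as \343\200\202.
intact = b"\xe3\x80\x82"

# The same bytes with a newline and a space inserted after the second byte.
broken = b"\xe3\x80\n \x82"


def strict_decodes(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A slightly longer sample: text before and after the split character.
sample = "貸款".encode("utf-8") + broken + "因2月".encode("utf-8")

# A lenient decoder substitutes U+FFFD for the damaged bytes and keeps
# going, so the text after the broken point is not lost.
lenient = sample.decode("utf-8", errors="replace")

print(strict_decodes(intact))      # the intact sequence
print(strict_decodes(broken))      # the split sequence
print(lenient.endswith("因2月"))   # text after the damage survives
```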

When I read the mail using Gnus + shr, the text after the broken
point is all cut off.  That is what libxml-parse-html-region does,
whereas xml-parse-region doesn't cut it.  Moreover a web browser,
to which I send the html data using the `K H' command, shows all
the text (the broken character is shown as is, though).

This is not necessarily a libxml bug anyway, but I hope it works
like xml-parse.

Thanks.

In GNU Emacs 26.0.91 (build 1, x86_64-unknown-cygwin, GTK+ Version 3.22.28)
 of 2018-03-12 built on localhost
Windowing system distributor 'The Cygwin/X Project', version 11.0.11906000

Attachment: example-html-mail.gz (328 bytes)

Lars Ingebrigtsen
Katsumi Yamaoka <[hidden email]> writes:

> When I read the mail using Gnus + shr, the text after the broken
> point is all cut off.  That is what libxml-parse-html-region does,
> whereas xml-parse-region doesn't cut it.  Moreover a web browser,
> to which I send the html data using the `K H' command, shows all
> the text (the broken character is shown as is, though).
>
> This is not necessarily a libxml bug anyway, but I hope it works
> like xml-parse.

libxml is stricter about correctness of the input than most other
HTML parsers.  I don't think there's anything we can do about this
problematic input other than ponder whether Emacs should use a different
HTML parser, which I think sounds unlikely.  :-)

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Katsumi Yamaoka
In reply to this post by Katsumi Yamaoka
On Tue, 13 Mar 2018 01:44:22 +0100, Lars Ingebrigtsen wrote:
> libxml is stricter about correctness of the input than most other
> HTML parsers.  I don't think there's anything we can do about this
> problematic input other than ponder whether Emacs should use a different
> HTML parser, which I think sounds unlikely.  :-)

I see.  I agree not to modify libxml.  Jidanni, how about trying
the following patch locally if you often get such broken mails?
Though I'm not quite sure that it doesn't cause other problems,
it fixes at least the mail in question.


--- mm-decode.el~ 2018-02-28 02:01:37.897607000 +0000
+++ mm-decode.el 2018-03-13 02:23:04.321753900 +0000
@@ -1810,6 +1810,11 @@
       (when (and (or coding
      (setq coding (mm-charset-to-coding-system charset nil t)))
  (not (eq coding 'ascii)))
+ ;; Remove extra bytes in utf-8 encoded data.
+ (when (eq coding 'utf-8)
+  (goto-char (point-min))
+  (while (re-search-forward "[\x00-\x7f]+\\([\x80-\xbf]\\)" nil t)
+    (replace-match "\\1")))
  (insert (prog1
     (decode-coding-string (buffer-string) coding)
   (erase-buffer)

積丹尼 Dan Jacobson
In reply to this post by Katsumi Yamaoka
Expecting perfect input is OK for compilers, but not for browsers
https://blog.codinghorror.com/its-a-malformed-world/




積丹尼 Dan Jacobson
In reply to this post by Katsumi Yamaoka
Thank you for the patch but the real answer is to do what all other
browsers do... show as much as possible.

There is no browser out there that would dream of dying on the slightest
mistake.

Anyway if you guys are really going to use XML::LibXML::Parser (?) then
maybe loosen up some of

       recover
           /parser, html, reader/

           recover from errors; possible values are 0, 1, and 2

           A true value turns on recovery mode which allows one to parse broken XML or HTML data. The recovery mode allows the parser to return the successfully parsed
           portion of the input document. This is useful for almost well-formed documents, where for example a closing tag is missing somewhere. Still, XML::LibXML will
           only parse until the first fatal (non-recoverable) error occurs, reporting recoverable parsing errors as warnings. To suppress even these warnings, use
           recover=>2.

           Note that validation is switched off automatically in recovery mode.

       validation
           /parser, reader/

           validate with the DTD; possible values are 0 and 1


      ERROR REPORTING
       XML::LibXML throws exceptions during parsing, validation or XPath processing (and some other occasions). These errors can be caught by using eval blocks. The error
       is stored in $@. There are two implementations: the old one throws $@ which is just a message string, in the new one $@ is an object from the class
       XML::LibXML::Error; this class overrides the operator "" so that when printed, the object flattens to the usual error message.

       XML::LibXML throws errors as they occur. This is a very common misunderstanding in the use of XML::LibXML. If the eval is omitted, XML::LibXML will always halt your
       script by "croaking" (see Carp man page for details).

       Also note that an increasing number of functions throw errors if bad data is passed as arguments. If you cannot assure valid data passed to XML::LibXML you should
       eval these functions.

       Note: since version 1.59, get_last_error() is no longer available in XML::LibXML for thread-safety reasons.




Katsumi Yamaoka
In reply to this post by Katsumi Yamaoka
On Tue, 13 Mar 2018 11:28:45 +0900, Katsumi Yamaoka wrote:
> + ;; Remove extra bytes in utf-8 encoded data.
> + (when (eq coding 'utf-8)
> +  (goto-char (point-min))
> +  (while (re-search-forward "[\x00-\x7f]+\\([\x80-\xbf]\\)" nil t)
> +    (replace-match "\\1")))

Corrected:

--- mm-decode.el~ 2018-02-28 02:01:37.897607000 +0000
+++ mm-decode.el 2018-03-13 03:27:56.885844100 +0000
@@ -1810,6 +1810,13 @@
       (when (and (or coding
      (setq coding (mm-charset-to-coding-system charset nil t)))
  (not (eq coding 'ascii)))
+ ;; Remove extra bytes in utf-8 encoded data.
+ (when (eq coding 'utf-8)
+  (goto-char (point-min))
+  (while (re-search-forward
+  "\\([\xc2-\xf7][\x80-\xbf]?\\)[\x00-\x7f]+\\([\x80-\xbf]\\)"
+  nil t)
+    (replace-match "\\1\\2")))
  (insert (prog1
     (decode-coding-string (buffer-string) coding)
   (erase-buffer)
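
For readers who want to experiment outside Emacs, the corrected pattern
can be transcribed into Python roughly as follows.  This is only a
sketch: the function name `rejoin_split_utf8` is invented here, Python's
`re` module is not Emacs's regexp engine, and it has been checked only
against the split "。" from the report, not against the wider range of
mail the patch might encounter.

```python
import re

# Byte-level transcription of the pattern in the corrected patch:
#   group 1: a UTF-8 lead byte, optionally followed by one continuation byte
#   middle : one or more ASCII bytes that interrupt the sequence
#   group 2: the continuation byte that should have followed directly
_SPLIT_SEQUENCE = re.compile(
    rb"([\xc2-\xf7][\x80-\xbf]?)[\x00-\x7f]+([\x80-\xbf])"
)


def rejoin_split_utf8(data: bytes) -> bytes:
    """Drop ASCII bytes wedged inside a multi-byte sequence (hypothetical helper)."""
    return _SPLIT_SEQUENCE.sub(rb"\1\2", data)


# The split character from the report: "。" broken by "\n " after two bytes.
broken = b"\xe3\x80\n \x82"
print(rejoin_split_utf8(broken).decode("utf-8"))

# Well-formed text is left alone, because in valid UTF-8 a continuation
# byte never directly follows an ASCII byte.
well_formed = "公告 abc 低溫\n".encode("utf-8")
print(rejoin_split_utf8(well_formed) == well_formed)
```

Like the patch, this works on the raw bytes before decoding, so it has
to run before the utf-8 decoder sees the data.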

Lars Ingebrigtsen
In reply to this post by 積丹尼 Dan Jacobson
積丹尼 Dan Jacobson <[hidden email]> writes:

> There is no browser out there that would dream of dying on the slightest
> mistake.

I agree, and you should report these problems to the libxml2
maintainers.

> Anyway if you guys are really going to use XML::LibXML::Parser (?) then
> maybe loosen up some of

Our calls are as loose as they get, if I recall correctly.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




積丹尼 Dan Jacobson
In reply to this post by Katsumi Yamaoka
>>>>> "LI" == Lars Ingebrigtsen <[hidden email]> writes:

LI> I agree, and you should report these problems to the libxml2
LI> maintainers.

I would not want to ruin my reputation by letting them know I was
inputting unvalidated XML and expecting whatever results.




Lars Ingebrigtsen
積丹尼 Dan Jacobson <[hidden email]> writes:

> LI> I agree, and you should report these problems to the libxml2
> LI> maintainers.
>
> I would not want to ruin my reputation by letting them know I was
> inputting unvalidated XML and expecting whatever results.

:-)

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no