bug#40794: 26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region

Tim Landscheidt
(Prologue: This bug showed up in the "ALT" attribute of an
"IMG" element of an HTML mail in Gnus.  I am reasonably
certain that this stems from libxml-parse-html-region and
should be fixed there, but there may be more prudent
solutions.)

With GNU Emacs 26.3 on Fedora:

| ELISP> (with-temp-buffer
|          (insert "<!DOCTYPE html>
| <html lang=\"en\">
| <head><title>Title</title></head>
| <body>
|   <p>Hello world</p>
|   <p>&auml;</p>
|   <p>&star;</p>
|   <p>&starf;</p>
| </body>
| </html>")
|          (libxml-parse-html-region (point-min) (point-max)))
| (html
|  ((lang . "en"))
|  (head nil
|        (title nil "Title"))
|  (body nil "\n  "
|        (p nil "Hello world")
|        "\n  "
|        (p nil "ä")
|        "\n  "
|        (p nil "&star;")
|        "\n  "
|        (p nil "&starf;")
|        "\n"))

| ELISP>

These should instead yield "ä" (228), "☆" (9734) and
"★" (9733).

lisp/leim/quail/sgml-input.el seems to contain the necessary
data for &star; and &starf; that could probably be fed to
libxml.
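
Until this is settled, one possible workaround is to post-process the
parse tree and expand the references libxml2 left untouched.  What
follows is only a minimal sketch under stated assumptions:
`my-html-extra-entities' and both functions are hypothetical names,
not part of Emacs, and the table holds just the two entities from
this report (a fuller table could perhaps be derived from
sgml-input.el).  Note that it cannot tell an unresolved "&star;" from
text that was written as "&amp;star;" in the source, since both look
the same after parsing.

```elisp
;;; -*- lexical-binding: t -*-

;; Hypothetical table: entity name -> character code.
;; Only the two entities from this report are listed.
(defvar my-html-extra-entities
  '(("star"  . #x2606)    ; WHITE STAR, decimal 9734
    ("starf" . #x2605))   ; BLACK STAR, decimal 9733
  "Alist of HTML entity names that the parser left unresolved.")

(defun my-html-expand-entities-in-string (string)
  "Replace known `&name;' references in STRING with their characters.
References not listed in `my-html-extra-entities' are left as they are."
  (replace-regexp-in-string
   "&\\([A-Za-z][A-Za-z0-9]*\\);"
   (lambda (match)
     ;; The match data refers to MATCH here, so group 1 is the name.
     (let ((char (cdr (assoc (match-string 1 match)
                             my-html-extra-entities))))
       (if char (char-to-string char) match)))
   string t t))

(defun my-html-expand-entities-in-dom (node)
  "Return a copy of NODE with known entities expanded.
NODE is a tree as returned by `libxml-parse-html-region', that is
either a string or a list of the form (TAG ATTRIBUTES . CHILDREN).
Both text nodes and attribute values are processed."
  (cond
   ((stringp node)
    (my-html-expand-entities-in-string node))
   ((consp node)
    (append
     (list (car node)
           (mapcar (lambda (attr)
                     (cons (car attr)
                           (if (stringp (cdr attr))
                               (my-html-expand-entities-in-string (cdr attr))
                             (cdr attr))))
                   (cadr node)))
     (mapcar #'my-html-expand-entities-in-dom (cddr node))))
   (t node)))

;; Possible usage:
;; (with-temp-buffer
;;   (insert "<p>&star;</p>")
;;   (my-html-expand-entities-in-dom
;;    (libxml-parse-html-region (point-min) (point-max))))
```

Working on the parsed tree rather than on the buffer text keeps the
markup itself untouched, at the price of the ambiguity mentioned
above.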



Lars Ingebrigtsen
Tim Landscheidt writes:

> (Prologue: This bug showed up in the "ALT" attribute of an
> "IMG" element of an HTML mail in Gnus.  I am reasonably
> certain that this stems from libxml-parse-html-region and
> should be fixed there, but there may be more prudent
> solutions.)

[...]

> These should instead yield "ä" (228), "☆" (9734) and
> "★" (9733).
>
> lisp/leim/quail/sgml-input.el seems to contain the necessary
> data for &star; and &starf; that could probably be fed to
> libxml.

As far as I can tell, libxml2 doesn't take a list of entities as an
input when parsing HTML?  I may have missed something...

Hm, a bit of googling shows http://xmlsoft.org/html/libxml-entities.html
and there is apparently a way to tell libxml2 about further entities?

But I think this all sounds more like a libxml2 than an Emacs bug,
really?
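
To narrow down where the gap sits, one could probe from Emacs which
entity names the linked libxml2 actually resolves.  This is a rough
sketch only; the two function names are hypothetical and not part of
Emacs.  Going by the report above, one would expect "star" and
"starf" to come back as unresolved and "auml" not, but other libxml2
versions may behave differently.

```elisp
;;; -*- lexical-binding: t -*-

(defun my-libxml-entity-resolved-p (name)
  "Return non-nil if libxml2 replaces the reference &NAME; in HTML.
NAME is an entity name without the leading `&' and trailing `;'."
  (unless (fboundp 'libxml-parse-html-region)
    (error "This Emacs was built without libxml2 support"))
  (with-temp-buffer
    (insert "<html><body><p>&" name ";</p></body></html>")
    (let ((tree (libxml-parse-html-region (point-min) (point-max))))
      (unless tree
        (error "libxml2 returned no parse tree"))
      ;; An unresolved reference survives verbatim in the printed tree.
      (not (string-match-p (regexp-quote (concat "&" name ";"))
                           (prin1-to-string tree))))))

(defun my-libxml-unresolved-entities (names)
  "Return the elements of NAMES that libxml2 leaves unresolved."
  (delq nil
        (mapcar (lambda (name)
                  (unless (my-libxml-entity-resolved-p name)
                    name))
                names)))

;; Possible usage:
;; (my-libxml-unresolved-entities '("auml" "star" "starf"))
```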

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no