bug#20623: XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save


Simon Ledergerber
Hi

When I was editing XHTML and HTML files, I wanted to make sure the BOM
was written out to the file in order to make it easier for the browser
to detect the UTF-8 encoding. Therefore I changed the coding system for
the file buffer to utf-8-with-signature-dos (since I am working on a
Windows System) before saving the file.

After some time I was surprised to find that the browser (IE11) didn't
report UTF-8 as the file's encoding. Having checked the hexdump of my
(X)HTML file, I saw the BOM was definitely missing.

Obviously, when a "UTF-8" string appears in the <meta charset="utf-8">
(even if commented out, see later below) or <?xml version="1.0"
encoding="utf-8"?> declaration, Emacs switches the file coding system to
utf-8, when it saves the file, even if utf-8-with-signature was
specified explicitly before. This appears to me as a bug, because there
is no way anymore to restore the BOM using Emacs.

I was not sure whether my bug is related to bug #8282, so I decided to
report it (again).

My Emacs version is: 24.5.1 (x86_64-unknown-cygwin) of 2015-04-10 on
Windows 8.1 x64.

I am running Emacs in text-mode only inside a Cygwin console.

This is my .emacs.d/init.el:
(line-number-mode)
(column-number-mode)
(setq-default fill-column 80)
(setq-default buffer-file-coding-system 'utf-8-dos)
(setq-default indent-tabs-mode nil)

With XML, the problem can be reproduced in the most basic way with the
following steps:

- Create a new file with C-x C-f in the current directory. Name it
test.txt for example.

- Switch to fundamental mode with M-x fundamental-mode.

- Type the text '<?xml version="1.0"' (without the surrounding single
quotes).

- Switch the encoding system to include the BOM: C-x RET f
utf-8-with-signature-dos.

- Verify the current encoding system with C-h Shift-c RET: Yes, the
encoding system for the file buffer is as specified before.

- Type C-x k to kill the help buffer if necessary and save the file with
C-x C-s.

- Check the file with a hex editor. Under the Cygwin Bash shell, 'od -Ax
-t xCaz test.txt' will also do it: The UTF-8 BOM 'EF BB BF' was written
at the beginning of the file.

- Complete the rest of the XML declaration as follows: ' encoding="utf-8"?>'

- Now save the file and check again: The encoding system for the buffer
has changed to utf-8-dos and the BOM has disappeared from the file!
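
Where no hex editor or od is at hand, the BOM check in the steps above can be sketched in a few lines of Python. This is only an illustration and not part of the original recipe: the helper name has_utf8_bom and the temporary file names are made up here, and Python's "utf-8-sig" codec merely stands in for a coding system that writes the signature.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 BOM bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def demo():
    """Write one XML declaration with and without a BOM, then check both files."""
    text = '<?xml version="1.0" encoding="utf-8"?>\r\n'
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        # "utf-8-sig" writes a BOM first; plain "utf-8" does not.
        for codec in ("utf-8-sig", "utf-8"):
            path = os.path.join(tmp, codec + ".txt")
            with open(path, "w", encoding=codec, newline="") as f:
                f.write(text)
            results[codec] = has_utf8_bom(path)
    return results


if __name__ == "__main__":
    print(demo())
```

Running has_utf8_bom on test.txt after each save should show the same transition the od output shows, assuming the file is in the current directory.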

Now the steps for HTML:

- Create a new file test1.txt in the current directory.

- Fill it with the following simple and yet incomplete HTML5 document:
<!doctype html>
<html>
     <head>
         <title>Test</title>
     </head>
     <body>
     </body>
</html>

- Change the coding system to utf-8-with-signature-dos and save the file.

- Verify that the coding system for the buffer is correct and the BOM is
really written: Yes, it is.

- Insert the following *comment* between <head> and <title>: <!-- <meta
charset="utf-8"> -->

- Save the file and verify: The coding system has changed to utf-8-dos
and the BOM has vanished, even if it is just a comment and has no effect!
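
One possible explanation for the comment still having an effect, assuming the charset is found by a plain text search rather than by parsing the HTML, can be sketched as follows. The pattern below is hypothetical and written only for this illustration; it is not the regexp Emacs uses.

```python
import re

# Hypothetical pattern for illustration only; not taken from Emacs.
META_CHARSET = re.compile(
    rb'<meta\s[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)


def declared_charset(head):
    """Return the charset named in the first <meta ... charset=...>, or None.

    The search is purely textual, so it cannot tell whether the tag sits
    inside an HTML comment.
    """
    match = META_CHARSET.search(head)
    return match.group(1).decode("ascii").lower() if match else None


commented = b'<head>\n<!-- <meta charset="utf-8"> -->\n<title>Test</title>'
plain = b"<head>\n<title>Test</title>"

if __name__ == "__main__":
    print(declared_charset(commented))
    print(declared_charset(plain))
```

A textual search of this kind treats the commented-out tag exactly like a live one, which would match the behaviour described in the last step.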

Regards

Simon

P. S. Information as reported by M-x report-emacs-bug:
In GNU Emacs 24.5.1 (x86_64-unknown-cygwin)
  of 2015-04-10 on desktop-new
Configured using:
  `configure
  --srcdir=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5
  --prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc
  --docdir=/usr/share/doc/emacs --htmldir=/usr/share/doc/emacs/html -C
  --with-x=no 'CFLAGS=-ggdb -O2 -pipe -Wimplicit-function-declaration
  -fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/build=/usr/src/debug/emacs-24.5-1
  -fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5=/usr/src/debug/emacs-24.5-1'
  CPPFLAGS= LDFLAGS='

Important settings:
   value of $LANG: en_US.UTF-8
   locale-coding-system: utf-8-unix

Major mode: Help

Minor modes in effect:
   tooltip-mode: t
   electric-indent-mode: t
   menu-bar-mode: t
   file-name-shadow-mode: t
   global-font-lock-mode: t
   font-lock-mode: t
   auto-composition-mode: t
   auto-encryption-mode: t
   auto-compression-mode: t
   buffer-read-only: t
   column-number-mode: t
   line-number-mode: t
   transient-mark-mode: t

Recent messages:
Beginning of buffer [3 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
Mark set [2 times]
Auto-saving...done
Mark set [2 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
No docstring slot for help-mode-setup
No docstring slot for help-mode-finish

Load-path shadows:
None found.

Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
help-fns mail-prsvr mail-utils misearch multi-isearch mule-diag
help-mode easymenu regexp-opt sgml-mode xterm time-date tooltip electric
uniquify ediff-hook vc-hooks lisp-float-type tabulated-list newcomment
lisp-mode prog-mode register page menu-bar rfn-eshadow timer select
mouse jit-lock font-lock syntax facemenu font-core frame cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev
minibuffer nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote make-network-process
dbusbind gfilenotify multi-tty emacs)

Memory information:
((conses 16 81797 4691)
  (symbols 48 17091 0)
  (miscs 40 73 387)
  (strings 32 11233 4887)
  (string-bytes 1 291872)
  (vectors 16 7587)
  (vector-slots 8 342125 27930)
  (floats 8 57 393)
  (intervals 56 834 26)
  (buffers 960 21))





Eli Zaretskii
> Date: Thu, 21 May 2015 20:50:58 +0200
> From: Simon Ledergerber <[hidden email]>
>
> When I was editing XHTML and HTML files, I wanted to make sure the BOM
> was written out to the file in order to make it easier for the browser
> to detect the UTF-8 encoding. Therefore I changed the coding system for
> the file buffer to utf-8-with-signature-dos (since I am working on a
> Windows System) before saving the file.
>
> After some time I got surprised because the browser (IE11), didn't
> report UTF-8 as the file's encoding. Having checked the hexdump of my
> (X)HTML file, I saw the BOM was definitely missing.
>
> Obviously, when a "UTF-8" string appears in the <meta charset="utf-8">
> (even if commented out, see later below) or <?xml version="1.0"
> encoding="utf-8"?> declaration, Emacs switches the file coding system to
> utf-8, when it saves the file, even if utf-8-with-signature was
> specified explicitly before. This appears to me as a bug, because there
> is no way anymore to restore the BOM using Emacs.

What would you expect Emacs to do instead?  It just obeys the stated
encoding, which says nothing about the BOM.  How can Emacs know when
to use utf-8 and when utf-8-with-signature?




Eli Zaretskii
[Please don't remove the bug address from the CC list, so that this
discussion is recorded in the bug data base.]

> Date: Thu, 21 May 2015 22:49:47 +0200
> From: Simon Ledergerber <[hidden email]>
>
>  From the documentation I understand that utf-8 is without BOM and
> utf-8-with-signature is with BOM. Maybe I am wrong and should rather
> understand that utf-8 is auto-detect. But then there is something like
> utf-8-without-signature missing to specify explicitly that no BOM is
> desired.
>
> In my opinion, it is correct when Emacs prefers utf-8 over
> utf-8-with-signature when it opens a file without BOM that can still be
> recognized as UTF-8.
>
> However when a file is opened with a BOM already present, it should
> stick to the utf-8-with-signature coding system, because the BOM "EF BB
> BF" unambiguously marks the file as UTF-8. (For UTF-16 for example,
> there is a different BOM byte pattern. There are other coding systems
> which do not have a BOM at all.)

What do you mean by "stick to"?  When I try visiting an XML file that
is encoded with BOM, Emacs decodes the file correctly, and the value
of buffer-file-coding-system is utf-8-with-signature.  Isn't that what
you want?  If that's what you want, but it doesn't happen for you,
please try in "emacs -Q".  It's possible that the default you set:

  (setq-default buffer-file-coding-system 'utf-8-dos)

is the reason for what you see.  (I don't understand why you need such
a default, and it sounds like a bad idea to me.)

> By doing C-x <RET> f and then saving it with C-x C-s, I expect to be
> able to change the coding system.  For example, if I specify utf-8-dos,
> the BOM should be removed, if one was present, and CR LF should be
> inserted for EOL. On the other side, if I choose
> utf-8-with-signature-unix, a BOM should be written and LF be taken for
> EOL. (The conversion between DOS and Unix works, just the BOM is the
> problem.)
>
> I have found a link, where this topic was already discussed, but it
> didn't help me further:
> http://superuser.com/questions/41254/make-emacs-not-remove-the-bom-from-xml-files
>
> In that post Vebjorn Ljosa asked exactly the question I have. Richard
> Hoskins replies with the answer to change the coding system with C-x
> <RET> r utf-8-with-signature. Unfortunately, it didn't work for me -
> after doing a change in the file and saving, it got back to utf-8
> automatically - that's why I have filed the bug.

That's not how you force a file to be saved in a specific encoding.
You should do this instead:

  C-x RET c utf-8-with-signature RET C-x C-s

The "C-x RET c" prefix forces the next Emacs operation to use the
specified encoding.  In this case, Emacs will ask for confirmation,
because the encoding you specified is different from what the XML
comment says.




Simon Ledergerber
Hello Eli

I have done some more research to answer your questions. You will find
the details of my statement at the end of this mail.

On 22.05.2015 09:11, Eli Zaretskii wrote:

> [Please don't remove the bug address from the CC list, so that this
> discussion is recorded in the bug data base.]
>
>> Date: Thu, 21 May 2015 22:49:47 +0200
>> From: Simon Ledergerber <[hidden email]>
>>
>>   From the documentation I understand that utf-8 is without BOM and
>> utf-8-with-signature is with BOM. Maybe I am wrong and should rather
>> understand that utf-8 is auto-detect. But then there is something like
>> utf-8-without-signature missing to specify explicitly that no BOM is
>> desired.
>>
>> In my opinion, it is correct when Emacs prefers utf-8 over
>> utf-8-with-signature when it opens a file without BOM that can still be
>> recognized as UTF-8.
>>
>> However when a file is opened with a BOM already present, it should
>> stick to the utf-8-with-signature coding system, because the BOM "EF BB
>> BF" unambiguously marks the file as UTF-8. (For UTF-16 for example,
>> there is a different BOM byte pattern. There are other coding systems
>> which do not have a BOM at all.)
> What do you mean by "stick to"?  When I try visiting an XML file that
> is encoded with BOM, Emacs decodes the file correctly, and the value
> of buffer-file-coding-system is utf-8-with-signature.  Isn't that what
> you want?  If that's what you want, but it doesn't happen for you,
> please try in "emacs -Q".  It's possible that the default you set:
>
>    (setq-default buffer-file-coding-system 'utf-8-dos)
>
> is the reason for what you see.  (I don't understand why you need such
> a default, and it sounds like a bad idea to me.)
You're right. When I open a file that was really saved with BOM, Emacs
detects its encoding correctly, i. e. utf-8-with-signature-dos. But when
I change the content and save with C-x C-s, the encoding changes to
utf-8-dos and the BOM gets lost. Even when I start Emacs with -Q. This
is the actual bug.

>
>> By doing C-x <RET> f and then saving it with C-x C-s, I expect to be
>> able to change the coding system.  For example, if I specify utf-8-dos,
>> the BOM should be removed, if one was present, and CR LF should be
>> inserted for EOL. On the other side, if I choose
>> utf-8-with-signature-unix, a BOM should be written and LF be taken for
>> EOL. (The conversion between DOS and Unix works, just the BOM is the
>> problem.)
>>
>> I have found a link, where this topic was already discussed, but it
>> didn't help me further:
>> http://superuser.com/questions/41254/make-emacs-not-remove-the-bom-from-xml-files
>>
>> In that post Vebjorn Ljosa asked exactly the question I have. Richard
>> Hoskins replies with the answer to change the coding system with C-x
>> <RET> r utf-8-with-signature. Unfortunately, it didn't work for me -
>> after doing a change in the file and saving, it got back to utf-8
>> automatically - that's why I have filed the bug.
> That's not how you force a file to be saved in a specific encoding.
> You should do this instead:
>
>    C-x RET c utf-8-with-signature RET C-x C-s
>
> The "C-x RET c" prefix forces the next Emacs operation to use the
> specified encoding.  In this case, Emacs will ask for confirmation,
> because the encoding you specified is different from what the XML
> comment says.
>
This is true and it worked for me. Please see below for further
explanations.

Summary:
- C-x RET c utf-8-with-signature RET C-x C-s is a good workaround,
because it really forces the file to be written with a BOM. For this to
have an effect, however, the file must be dirty, i.e. there must be a
pending change. In that case, before the command completes, the
prompt "Selected encoding utf-8-with-signature-dos disagrees with
utf-8-dos specified by file contents.  Really save (else edit coding
cookies and try again)? (yes or no)" appears. I think this is what you
meant by your sentence: "In this case, Emacs will ask for confirmation,
because the encoding you specified is different from what the XML
comment says."

- But consider the following: The encoding in the XML declaration or in
the HTML <meta charset="utf-8"> just specifies UTF-8 (or another
encoding). It doesn't say anything about the presence or absence of the
BOM. Therefore an editor detecting and deciding about the file's
encoding should not rely on this information only.

- When such a file, which was saved successfully with BOM, is closed and
reopened again, Emacs detects its encoding correctly, say
utf-8-with-signature-dos.

- However, when I change the file content and save it again just with
C-x C-s (without C-x RET c ... first!), then it changes back to
utf-8-dos. Yes, even if I start emacs with -Q! (That's the point.)

- I do not fully understand the criterion for and the magic behind how
Emacs chooses the file encoding when I do C-x C-s. But I was able to
reproduce it several times by applying the procedures given in the bug
report, even when -Q is on. As we already have stated above, this could
be because Emacs favors (and forces) utf-8 whenever it sees something
like XML or HTML that might be UTF-8-encoded.

-> Conclusion: C-x RET c utf-8-with-signature RET C-x C-s is a good way
to force the file to be written as I want. But what I still do not
understand: When I open a file with a BOM and Emacs recognizes that, why
does it silently change the encoding and drop the BOM when I save
normally with C-x C-s, without even giving me a notice or warning?
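
The point that an encoding="utf-8" declaration says nothing about the BOM can be illustrated outside Emacs. The following Python sketch is an illustration only; the sample document and the helper name decode are made up here. It shows that the same declaration is satisfied by a byte stream with and without the signature.

```python
import codecs

BODY = '<?xml version="1.0" encoding="utf-8"?>\n<root/>\n'

without_bom = BODY.encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom


def decode(data):
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")


if __name__ == "__main__":
    # Both byte streams carry the same declaration and decode to the same text.
    print(decode(with_bom) == decode(without_bom) == BODY)
    # Only the first three bytes differ.
    print(with_bom[:3].hex(" "))
```

Since both variants are valid UTF-8 and carry the identical declaration, the declaration alone cannot tell an editor which of the two to write.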





Stefan Monnier
In reply to this post by Eli Zaretskii
> What would you expect Emacs to do instead?  It just obeys the stated
> encoding, which says nothing about the BOM.  How can Emacs know when
> to use utf-8 and when utf-8-with-signature?

To the extent that Emacs has seen the BOM when opening the file, it
would make sense for Emacs to try and preserve this detail.  IOW the
utf-8 annotation in the XML metadata shouldn't mean "use the utf-8
coding system" but "use a coding system compatible with utf-8".  So if
the coding system is already compatible with utf-8
(e.g. utf-8-with-signature), we should simply keep using that rather
than switch to the utf-8 coding-system.


        Stefan





Eli Zaretskii
> From: Stefan Monnier <[hidden email]>
> Cc: Simon Ledergerber <[hidden email]>,  [hidden email]
> Date: Fri, 22 May 2015 11:22:27 -0400
>
> > What would you expect Emacs to do instead?  It just obeys the stated
> > encoding, which says nothing about the BOM.  How can Emacs know when
> > to use utf-8 and when utf-8-with-signature?
>
> To the extent that Emacs has seen the BOM when opening the file, it
> would make sense for Emacs to try and preserve this detail.

It does.




Stefan Monnier
>> > What would you expect Emacs to do instead?  It just obeys the stated
>> > encoding, which says nothing about the BOM.  How can Emacs know when
>> > to use utf-8 and when utf-8-with-signature?
>> To the extent that Emacs has seen the BOM when opening the file, it
>> would make sense for Emacs to try and preserve this detail.
> It does.

While there are cases where it does, this bug report is about a case
where it doesn't, IIUC.


        Stefan




Eli Zaretskii
> From: Stefan Monnier <[hidden email]>
> Cc: [hidden email],  [hidden email]
> Date: Fri, 22 May 2015 17:51:07 -0400
>
> >> > What would you expect Emacs to do instead?  It just obeys the stated
> >> > encoding, which says nothing about the BOM.  How can Emacs know when
> >> > to use utf-8 and when utf-8-with-signature?
> >> To the extent that Emacs has seen the BOM when opening the file, it
> >> would make sense for Emacs to try and preserve this detail.
> > It does.
>
> While there are cases where it does, this bug report is about a case
> where it doesn't, IIUC.

AFAIU, that happened because the user has this in ~/.emacs:

  (setq-default buffer-file-coding-system 'utf-8-dos)

IMO, this bad customization should be removed, and then the problem
will go away.




Simon Ledergerber
As already mentioned in my last post, even when I started Emacs with the option -Q, which should opt out my customizations, it made no difference. So naturally, the source of the problem will be somewhere else.

From: [hidden email]
Sent: 23.05.2015 08:44
To: [hidden email]
Cc: [hidden email]; [hidden email]
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save

> From: Stefan Monnier <[hidden email]>
> Cc: [hidden email],  [hidden email]
> Date: Fri, 22 May 2015 17:51:07 -0400
>
> >> > What would you expect Emacs to do instead?  It just obeys the stated
> >> > encoding, which says nothing about the BOM.  How can Emacs know when
> >> > to use utf-8 and when utf-8-with-signature?
> >> To the extent that Emacs has seen the BOM when opening the file, it
> >> would make sense for Emacs to try and preserve this detail.
> > It does.
>
> While there are cases where it does, this bug report is about a case
> where it doesn't, IIUC.

AFAIU, that happened because the user has this in ~/.emacs:

  (setq-default buffer-file-coding-system 'utf-8-dos)

IMO, this bad customization should be removed, and then the problem
will go away.

Eli Zaretskii
> Cc: <[hidden email]>
> From: Simon Ledergerber <[hidden email]>
> Date: Sat, 23 May 2015 19:11:15 +0200
>
> As already mentioned in my last post, even when I started Emacs with the option
> -Q, which should opt out my customizations, it made no difference. So
> naturally, the source of the problem will be somewhere else.

Doesn't happen to me.  So please post the file you used and the exact
sequence of steps, starting from "emacs -Q", to reproduce the problem.

Thanks.




Alain Schneble
In reply to this post by Simon Ledergerber
I'm joining this discussion and would like to report a recipe to
reproduce this issue on Windows:

- emacs -Q
- C-x C-f utf-8-bom-test.xml
- Enter the following text in the new buffer:
<?xml version="1.0" encoding="utf-8"?>
<root></root>
- C-x RET c utf-8-with-signature-dos C-x C-s yes RET
- C-x k RET
- C-x C-f utf-8-bom-test.xml
- M-: buffer-file-coding-system
  => utf-8-with-signature-dos
- Change buffer content, e.g. add some text to the root element:
<?xml version="1.0" encoding="utf-8"?>
<root>test</root>
- C-x C-s
- M-: buffer-file-coding-system
  => utf-8-dos
  (expected coding system: utf-8-with-signature-dos)

As it was already mentioned in this thread, just by visiting the file,
then changing and saving the buffer, the BOM gets lost.  This is due to
select-safe-coding-system (called by choose_write_coding_system) fully
trusting the coding system identified by find-auto-coding.  So far so
good.  The latter eventually calls auto-coding-functions which in turn
calls the built-in sgml-xml-auto-coding-function which I think should
take into account some context to enrich the derived coding system with
a signature if needed.  Similar to what select-safe-coding-system does
to enrich the coding with the proper eol-type.

Does that make sense to you?  If so, I'll try to come up with a patch
that enhances sgml-xml-auto-coding-function to take into account
buffer-file-coding-system (buffer + default value) in case it carries
the same text-conversion but different signature.  The proposed "auto
coding" shall inherit the signature in this case.
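
The inheritance rule proposed above can be modelled in a few lines of Python. This is only a sketch of the idea under simplifying assumptions, not Emacs code: coding systems are treated as plain strings of the form TEXT[-with-signature][-EOL], and the names Coding, parse, unparse and enrich are hypothetical.

```python
from typing import NamedTuple


class Coding(NamedTuple):
    text: str        # text conversion, e.g. "utf-8"
    signature: bool  # True if a BOM is written
    eol: str         # "unix", "dos", "mac", or "" when unspecified


def parse(name):
    """Split a coding-system name into text conversion, signature and EOL."""
    eol = ""
    for candidate in ("unix", "dos", "mac"):
        if name.endswith("-" + candidate):
            name, eol = name[: -len(candidate) - 1], candidate
            break
    signature = name.endswith("-with-signature")
    if signature:
        name = name[: -len("-with-signature")]
    return Coding(name, signature, eol)


def unparse(coding):
    """Rebuild the coding-system name from its parts."""
    name = coding.text
    if coding.signature:
        name += "-with-signature"
    if coding.eol:
        name += "-" + coding.eol
    return name


def enrich(auto_name, buffer_name):
    """Let the auto-detected coding inherit EOL type and signature.

    The signature is inherited only when both names share the same text
    conversion; the EOL type is inherited whenever the detected name has none.
    """
    auto, buf = parse(auto_name), parse(buffer_name)
    eol = auto.eol or buf.eol
    signature = auto.signature or (buf.signature and buf.text == auto.text)
    return unparse(Coding(auto.text, signature, eol))


if __name__ == "__main__":
    print(enrich("utf-8", "utf-8-with-signature-dos"))
```

In this model the recipe's failing case, a declared utf-8 in a buffer visited as utf-8-with-signature-dos, keeps its signature, while a declaration naming a different text conversion still replaces the buffer's coding and inherits only the EOL type.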

Thanks for any help.
Alain





Glenn Morris-3

Now reported with "fix this or get removed from the distribution"
severity at <https://bugs.debian.org/883434>.




Stefan Monnier
> Now reported with "fix this or get removed from the distribution"
> severity at <https://bugs.debian.org/883434>.

I'm curious to see if the OP's "grave" severity settings will stick.
"Grave" is defined in https://www.debian.org/Bugs/Developer#severities as:

    makes the package in question unusable or mostly so, or causes data
    loss, or introduces a security hole allowing access to the accounts
    of users who use the package.

The only part that could arguably apply is "causes data loss", but even
that is stretching the meaning of those words, I think.

This said, we should indeed fix this bug.
Not sure how to Do It Right but at least this specific problem should be
fixable with a patch along the lines of the one below (guaranteed 100%
untested).


        Stefan


diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 019e65b2c6..5c0675aa2f 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -1885,6 +1885,12 @@ auto-coding-alist-lookup
  (setq alist (cdr alist))))
     coding-system))
 
+(defun mule--coding-system-compatible-p (cs new-cs)
+  "Return non-nil if CS is one of the coding-systems described by NEW-CS."
+  (let ((base (coding-system-base cs)))
+    (or (eq base new-cs)
+        (eq base (intern (concat new-cs "-with-signature"))))))
+
 (put 'enable-character-translation 'permanent-local t)
 (put 'enable-character-translation 'safe-local-variable 'booleanp)
 
@@ -2038,8 +2044,12 @@ find-auto-coding
  (save-excursion
   (goto-char (point-min))
   (funcall (pop funcs) size)))))
- (if coding-system
-    (cons coding-system 'auto-coding-functions)))))
+ (and coding-system
+             ;; Don't override utf-8-with-signature with utf-8
+             ;; or latin-1-mac with latin-1 (bug#20623).
+             (not (mule--coding-system-compatible-p
+                   buffer-file-coding-system coding-system))
+     (cons coding-system 'auto-coding-functions)))))
 
 (defun set-auto-coding (filename size)
   "Return coding system for a file FILENAME of which SIZE bytes follow point.




Eli Zaretskii
> From: Stefan Monnier <[hidden email]>
> Cc: Alain Schneble <[hidden email]>,  Simon Ledergerber <[hidden email]>,  [hidden email],  Eli Zaretskii <[hidden email]>
> Date: Mon, 04 Dec 2017 12:38:57 -0500
>
> This said, we should indeed fix this bug.

Agreed.

> Not sure how to Do It Right but at least this specific problem should be
> fixable with a patch along the lines of the one below (guaranteed 100%
> untested).

Isn't it better to fix this in sgml-xml-auto-coding-function?  That's
where the root cause is, AFAIU.

And I don't understand the comment about latin-1-mac: I don't think we
have such problems in Emacs.  The -with-signature variety is
different, because it is not about EOL format.

Thanks.




Stefan Monnier
> Isn't it better to fix this in sgml-xml-auto-coding-function?  That's
> where the root cause is, AFAIU.

I'd expect the same problem would affect all other uses.

> And I don't understand the comment about latin-1-mac: I don't think we
> have such problems in Emacs.  The -with-signature variety is
> different, because it is not about EOL format.

You might be right, but I don't know where/how this is handled.


        Stefan




Eli Zaretskii
> From: Stefan Monnier <[hidden email]>
> Cc: [hidden email],  [hidden email],  [hidden email],  [hidden email]
> Date: Mon, 04 Dec 2017 16:08:14 -0500
>
> > Isn't it better to fix this in sgml-xml-auto-coding-function?  That's
> > where the root cause is, AFAIU.
>
> I'd expect the same problem would affect all other uses.

Not sure what you meant by "all other uses".  Could you please
elaborate?

> > And I don't understand the comment about latin-1-mac: I don't think we
> > have such problems in Emacs.  The -with-signature variety is
> > different, because it is not about EOL format.
>
> You might be right, but I don't know where/how this is handled.

I would like to propose the following alternative patch, which accepts
utf-8-with-signature and utf-8-hfs as variants of utf-8 for the
purposes of encoding of XML files.  Comments?  Do we want a similar
treatment for UTF-16?  (That doesn't seem to be required by the bug
report, and UTF-16 in XML files is non-standard anyway.  But what
about HTML?)

diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 857fa80..5ff1acf 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -2493,7 +2493,17 @@ sgml-xml-auto-coding-function
     (let* ((match (match-string 1))
    (sym (intern (downcase match))))
       (if (coding-system-p sym)
-  sym
+                  ;; If the encoding tag is UTF-8 and the buffer's
+                  ;; encoding is one of the variants of UTF-8, use the
+                  ;; buffer's encoding.  This allows, e.g., saving an
+                  ;; XML file as UTF-8 with BOM when the tag says UTF-8.
+                  (if (and (coding-system-equal 'utf-8
+                                                (coding-system-type sym))
+                           (coding-system-equal sym
+                                                (coding-system-type
+                                                 buffer-file-coding-system)))
+                      buffer-file-coding-system
+    sym)
  (message "Warning: unknown coding system \"%s\"" match)
  nil))
           ;; Files without an encoding tag should be UTF-8. But users
@@ -2506,7 +2516,8 @@ sgml-xml-auto-coding-function
                    (coding-system-base
                     (detect-coding-region (point-min) size t)))))
             ;; Pure ASCII always comes back as undecided.
-            (if (memq detected '(utf-8 undecided))
+            (if (memq detected
+                      '(utf-8 'utf-8-with-signature 'utf-8-hfs undecided))
                 'utf-8
               (warn "File contents detected as %s.
   Consider adding an encoding attribute to the xml declaration,




Eli Zaretskii
> Date: Sun, 10 Dec 2017 21:17:00 +0200
> From: Eli Zaretskii <[hidden email]>
> Cc: [hidden email], [hidden email], [hidden email]
>
> I would like to propose the following alternative patch, which accepts
> utf-8-with-signature and utf-8-hfs as variants of utf-8 for the
> purposes of encoding of XML files.  Comments?  Do we want a similar
> treatment for UTF-16?  (That doesn't seem to be required by the bug
> report, and UTF-16 in XML files is non-standard anyway.  But what
> about HTML?)

No further comments, so I've pushed the change and I'm marking this
bug done.




Glenn Morris-3
In reply to this post by Eli Zaretskii

The HTML (not XML) case specified in the original report
("Now the steps for HTML" in https://debbugs.gnu.org/20623#5) and in
https://bugs.debian.org/883434 seems unfixed.




Eli Zaretskii
> From: Glenn Morris <[hidden email]>
> Cc: Stefan Monnier <[hidden email]>,  [hidden email],  [hidden email],  [hidden email]
> Date: Wed, 01 Aug 2018 14:07:28 -0400
>
> The HTML (not XML) case specified in the original report
> ("Now the steps for HTML" in https://debbugs.gnu.org/20623#5) and in
> https://bugs.debian.org/883434 seems unfixed.

Should it be?




Glenn Morris-3
Eli Zaretskii wrote:

>> The HTML (not XML) case specified in the original report
>> ("Now the steps for HTML" in https://debbugs.gnu.org/20623#5) and in
>> https://bugs.debian.org/883434 seems unfixed.
>
> Should it be?

I think this is a bug that should be fixed, yes (if that is the question).


