bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?

Glenn Morris-3
Package: emacs
Version: 24.3

Split from http://debbugs.gnu.org/15260

Eli Zaretskii wrote:

> mule-cmds.el calls reset-language-environment, and language/english.el
> calls set-language-info-alist; both have the effect of resetting
> default-file-name-coding-system to latin-1 (!? an interesting
> "default" for a Unicode-era Emacs, perhaps Handa-san could comment why
> we still do that).

I know nothing about this, but eg glib defaults to utf-8, which seems
like a better default to me these days:

https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#file-name-encodings



Glenn Morris-3
Glenn Morris wrote:

>> mule-cmds.el calls reset-language-environment, and language/english.el
>> calls set-language-info-alist; both have the effect of resetting
>> default-file-name-coding-system to latin-1 (!? an interesting
>> "default" for a Unicode-era Emacs, perhaps Handa-san could comment why
>> we still do that).
>
> I know nothing about this, but eg glib defaults to utf-8, which seems
> like a better default to me these days:
>
> https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#file-name-encodings

... 4 years pass and latin-1 fails to make a comeback.

For some reason, I thought it was difficult to change the default to
utf-8 due to bootstrap ordering issues. This was probably prompted by
this comment in reset-language-environment:

  ;; On Darwin systems, this should be utf-8-unix, but when this file is loaded
  ;; that is not yet defined, so we set it in set-locale-environment instead.
  (setq default-file-name-coding-system 'iso-latin-1-unix)

But looking at it now, I cannot see what this comment is referring to.

If I change reset-language-environment so that it sets
default-file-name-coding-system (and default-sendmail-coding-system)
to 'utf-8, then a bootstrap works fine.

It looks like this stuff was all rewritten in Emacs 23.
Before that, there used to be international/utf-8.el,
which was indeed loaded after mule-cmds.
But since Emacs 23, mule-conf seems to define everything.
(But that rewrite seems to predate the above comment about Darwin...?)

So should the default finally be changed to utf-8?



Eli Zaretskii
> From: Glenn Morris <[hidden email]>
> Date: Thu, 30 Nov 2017 20:52:17 -0500
>
> So should the default finally be changed to utf-8?

Perhaps on Posix systems, but not elsewhere.  And if we make the
change, we should make sure building Emacs in a non-ASCII directory
still works.

Btw, why does the default matter so much?  Once Emacs starts up,
default-file-name-coding-system on GNU/Linux is set to UTF-8 if the
locale says so.  Is this just an aesthetic issue?
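
For reference, the value in question can be inspected from Lisp.  The
following is a minimal sketch, assuming only the documented rule that
`file-name-coding-system' takes precedence when non-nil and
`default-file-name-coding-system' applies otherwise.  The helper name
is hypothetical and not part of Emacs.

```elisp
(defun my/effective-file-name-coding-system ()
  "Return the coding system that applies to file names right now.
Hypothetical helper: `file-name-coding-system' wins when it is non-nil,
otherwise `default-file-name-coding-system' is used."
  (or file-name-coding-system default-file-name-coding-system))

;; Evaluate in a running session to see what the locale produced.
(my/effective-file-name-coding-system)
```

Running this under "emacs -Q" in different locales is one way to see
whether the built-in default is ever the value that is actually used.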



Glenn Morris-3
Eli Zaretskii wrote:

> Perhaps on Posix systems, but not elsewhere.

I assume non-POSIX is newspeak for MS-Windows (native and DOS).

> And if we make the change, we should make sure building Emacs in a
> non-ASCII directory still works.

It works fine for me on G/L to have source, build, and install
directories be distinct non-ASCII directories. (Emacs works, that is,
but makeinfo 5.1 fails to find @include files in non-ASCII directories,
so I wonder how common such setups are.)


BTW, it feels very dated to me to have discussion of Windows 9X in the
Emacs manual section on file-name-coding.


diff --git i/doc/emacs/mule.texi w/doc/emacs/mule.texi
index 78f77cb..5fc44a6 100644
--- i/doc/emacs/mule.texi
+++ w/doc/emacs/mule.texi
@@ -1214,11 +1214,8 @@ system can encode.
 
   If @code{file-name-coding-system} is @code{nil}, Emacs uses a
 default coding system determined by the selected language environment,
-and stored in the @code{default-file-name-coding-system} variable.
-@c FIXME?  Is this correct?  What is the "default language environment"?
-In the default language environment, non-@acronym{ASCII} characters in
-file names are not encoded specially; they appear in the file system
-using the internal Emacs representation.
+and stored in the @code{default-file-name-coding-system} variable
+(normally UTF-8).
 
 @cindex file-name encoding, MS-Windows
 @vindex w32-unicode-filenames
diff --git i/lisp/international/mule-cmds.el w/lisp/international/mule-cmds.el
index 9d22d6e..192f0e9 100644
--- i/lisp/international/mule-cmds.el
+++ w/lisp/international/mule-cmds.el
@@ -1797,10 +1797,11 @@ The default status is as follows:
    'raw-text)
 
   (set-default-coding-systems nil)
-  (setq default-sendmail-coding-system 'iso-latin-1)
-  ;; On Darwin systems, this should be utf-8-unix, but when this file is loaded
-  ;; that is not yet defined, so we set it in set-locale-environment instead.
-  (setq default-file-name-coding-system 'iso-latin-1-unix)
+  (setq default-sendmail-coding-system 'utf-8)
+  (setq default-file-name-coding-system (if (memq system-type
+                                                  '(windows-nt ms-dos))
+                                            'iso-latin-1-unix
+                                          'utf-8-unix))
   ;; Preserve eol-type from existing default-process-coding-systems.
   ;; On non-unix-like systems in particular, these may have been set
   ;; carefully by the user, or by the startup code, to deal with the
@@ -1816,8 +1817,10 @@ The default status is as follows:
  (input-coding
  (condition-case nil
      (coding-system-change-text-conversion
-      (cdr default-process-coding-system) 'iso-latin-1)
-   (coding-system-error 'iso-latin-1))))
+      (cdr default-process-coding-system)
+      (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8))
+   (coding-system-error
+    (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8)))))
     (setq default-process-coding-system
   (cons output-coding input-coding)))
 
diff --git i/lisp/mail/sendmail.el w/lisp/mail/sendmail.el
index cd80211..36fbb7d 100644
--- i/lisp/mail/sendmail.el
+++ w/lisp/mail/sendmail.el
@@ -993,7 +993,7 @@ but lower priority than the local value of `buffer-file-coding-system'.
 See also the function `select-message-coding-system'.")
 
 ;;;###autoload
-(defvar default-sendmail-coding-system 'iso-latin-1
+(defvar default-sendmail-coding-system 'utf-8
   "Default coding system for encoding the outgoing mail.
 This variable is used only when `sendmail-coding-system' is nil.
 
diff --git i/lisp/mh-e/mh-comp.el w/lisp/mh-e/mh-comp.el
index 98067ce..25118cd 100644
--- i/lisp/mh-e/mh-comp.el
+++ w/lisp/mh-e/mh-comp.el
@@ -304,6 +304,7 @@ message and scan line."
   (let ((draft-buffer (current-buffer))
         (file-name buffer-file-name)
         (config mh-previous-window-config)
+        ;; FIXME this is subtly different to select-message-coding-system.
         (coding-system-for-write
          (if (and (local-variable-p 'buffer-file-coding-system
                                     (current-buffer)) ;XEmacs needs two args
@@ -315,7 +316,7 @@ message and scan line."
            (or (and (boundp 'sendmail-coding-system) sendmail-coding-system)
                (and (default-boundp 'buffer-file-coding-system)
                     (default-value 'buffer-file-coding-system))
-               'iso-latin-1))))
+               'utf-8))))
     ;; Older versions of spost do not support -msgid and -mime.
     (unless mh-send-uses-spost-flag
       ;; Adding a Message-ID field looks good, makes it easier to search for
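
The platform split in the mule-cmds.el hunk can be tried in isolation.
This is only a sketch: the helper name is hypothetical, and it takes
the system type as an argument so that both branches can be exercised
on any machine.  Note that `system-type' holds the symbol `windows-nt'
on native MS-Windows builds.

```elisp
(defun my/default-file-name-coding-for (type)
  "Return the proposed default file-name coding system for system TYPE.
TYPE is a symbol such as `system-type' would hold.  Hypothetical helper,
written only to make the platform split easy to check."
  (if (memq type '(windows-nt ms-dos))
      'iso-latin-1-unix
    'utf-8-unix))

;; What the proposal would pick on the machine evaluating this.
(my/default-file-name-coding-for system-type)
```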



Eli Zaretskii
> From: Glenn Morris <[hidden email]>
> Cc: [hidden email]
> Date: Mon, 04 Dec 2017 19:35:05 -0500
>
> Eli Zaretskii wrote:
>
> > Perhaps on Posix systems, but not elsewhere.
>
> I assume non-POSIX is newspeak for MS-Windows (native and DOS).

I didn't say "non-Posix"; you did.

MS-Windows is definitely not a Posix system, but whether it is the
only one, I don't know.  Are we sure all macOS/Darwin systems are
sufficiently Posix in this aspect?  AFAIR they use quite different
encoding methods for file names (canonical normalization etc.).

> > And if we make the change, we should make sure building Emacs in a
> > non-ASCII directory still works.
>
> It works fine for me on G/L to have source, build, and install
> directories be distinct non-ASCII directories.

Was it in a UTF-8 locale or in a non-UTF-8 locale?  The latter is the
potentially problematic case, AFAIR.

> (Emacs works, that is,
> but makeinfo 5.1 fails to find @include files in non-ASCII directories,
> so I wonder how common such setups are.)

Building a release tarball doesn't require makeinfo.

> BTW, it feels very dated to me to have discussion of Windows 9X in the
> Emacs manual section on file-name-coding.

We still try to support it, and the aspects of file-name encoding
related to it are definitely non-trivial.  Everything described there
is in the code.

> diff --git i/doc/emacs/mule.texi w/doc/emacs/mule.texi
> index 78f77cb..5fc44a6 100644
> --- i/doc/emacs/mule.texi
> +++ w/doc/emacs/mule.texi
> @@ -1214,11 +1214,8 @@ system can encode.
>  
>    If @code{file-name-coding-system} is @code{nil}, Emacs uses a
>  default coding system determined by the selected language environment,
> -and stored in the @code{default-file-name-coding-system} variable.
> -@c FIXME?  Is this correct?  What is the "default language environment"?
> -In the default language environment, non-@acronym{ASCII} characters in
> -file names are not encoded specially; they appear in the file system
> -using the internal Emacs representation.
> +and stored in the @code{default-file-name-coding-system} variable
> +(normally UTF-8).

Not sure why you removed the sentence which had the FIXME comment.  Is
it in any way related to the issue at hand?

>  @cindex file-name encoding, MS-Windows
>  @vindex w32-unicode-filenames
> diff --git i/lisp/international/mule-cmds.el w/lisp/international/mule-cmds.el
> index 9d22d6e..192f0e9 100644
> --- i/lisp/international/mule-cmds.el
> +++ w/lisp/international/mule-cmds.el
> @@ -1797,10 +1797,11 @@ The default status is as follows:
>     'raw-text)
>  
>    (set-default-coding-systems nil)
> -  (setq default-sendmail-coding-system 'iso-latin-1)
> -  ;; On Darwin systems, this should be utf-8-unix, but when this file is loaded
> -  ;; that is not yet defined, so we set it in set-locale-environment instead.
> -  (setq default-file-name-coding-system 'iso-latin-1-unix)
> +  (setq default-sendmail-coding-system 'utf-8)
> +  (setq default-file-name-coding-system (if (memq system-type
> +                                                  '(windows-nt ms-dos))
> +                                            'iso-latin-1-unix
> +                                          'utf-8-unix))

Why are we changing sendmail-coding-system?  It has nothing to do with
file names, AFAIK.

>    ;; Preserve eol-type from existing default-process-coding-systems.
>    ;; On non-unix-like systems in particular, these may have been set
>    ;; carefully by the user, or by the startup code, to deal with the
> @@ -1816,8 +1817,10 @@ The default status is as follows:
>   (input-coding
>   (condition-case nil
>       (coding-system-change-text-conversion
> -      (cdr default-process-coding-system) 'iso-latin-1)
> -   (coding-system-error 'iso-latin-1))))
> +      (cdr default-process-coding-system)
> +      (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8))
> +   (coding-system-error
> +    (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8)))))
>      (setq default-process-coding-system
>    (cons output-coding input-coding)))

And this changes the default encoding used to communicate with
sub-processes.  Why?  We never talked about a wholesale change of all
the defaults to UTF-8; that is a much broader issue than just
encoding of file names.

> diff --git i/lisp/mh-e/mh-comp.el w/lisp/mh-e/mh-comp.el
> index 98067ce..25118cd 100644
> --- i/lisp/mh-e/mh-comp.el
> +++ w/lisp/mh-e/mh-comp.el
> @@ -304,6 +304,7 @@ message and scan line."
>    (let ((draft-buffer (current-buffer))
>          (file-name buffer-file-name)
>          (config mh-previous-window-config)
> +        ;; FIXME this is subtly different to select-message-coding-system.
>          (coding-system-for-write
>           (if (and (local-variable-p 'buffer-file-coding-system
>                                      (current-buffer)) ;XEmacs needs two args
> @@ -315,7 +316,7 @@ message and scan line."
>             (or (and (boundp 'sendmail-coding-system) sendmail-coding-system)
>                 (and (default-boundp 'buffer-file-coding-system)
>                      (default-value 'buffer-file-coding-system))
> -               'iso-latin-1))))
> +               'utf-8))))

Changes like that in MH-E should be communicated to the MH-E
developer; I'm not sure he is reading this list.

And you never answered my question about the rationale:

> Btw, why does the default matter so much?  Once Emacs starts up,
> default-file-name-coding-system on GNU/Linux is set to UTF-8 if the
> locale says so.  Is this just an aesthetic issue?



Glenn Morris-3
Eli Zaretskii wrote:

> Are we sure all macOS/Darwin systems are sufficiently Posix in this
> aspect?

Emacs on Darwin has been unconditionally using utf-8 for over a decade.
It's special-cased in mule-cmds, as visible in the diff I sent.

>> It works fine for me on G/L to have source, build, and install
>> directories be distinct non-ASCII directories.
>
> Was it in a UTF-8 locale or in a non-UTF-8 locale?  The latter is the
> potentially problematic case, AFAIR.

I had LANG=en_US.UTF-8. I've repeated with LANG=en_US. Still works.

>>    If @code{file-name-coding-system} is @code{nil}, Emacs uses a
>>  default coding system determined by the selected language environment,
>> -and stored in the @code{default-file-name-coding-system} variable.
>> -@c FIXME?  Is this correct?  What is the "default language environment"?
>> -In the default language environment, non-@acronym{ASCII} characters in
>> -file names are not encoded specially; they appear in the file system
>> -using the internal Emacs representation.
>> +and stored in the @code{default-file-name-coding-system} variable
>> +(normally UTF-8).
>
> Not sure why you removed the sentence which had the FIXME comment.  Is
> it in any way related to the issue at hand?

I wrote the FIXME comment. In 5 years, no-one has addressed it.
Defaulting to UTF-8 makes it no longer relevant, so it seems better to
remove it.

> Why are we changing sendmail-coding-system?  It has nothing to do with
> file names, AFAIK.

I'm changing all (3) things that currently default to latin-1 to default to
utf-8.

>> Btw, why does the default matter so much?  Once Emacs starts up,
>> default-file-name-coding-system on GNU/Linux is set to UTF-8 if the
>> locale says so.  Is this just an aesthetic issue?

utf-8 is the sensible, "modern" (ie, non-ancient) default.
If there is no reason to use latin-1, Emacs should use utf-8.
I'm not claiming it's critical.

Take it or leave it, as you wish.



Lars Ingebrigtsen
Glenn Morris <[hidden email]> writes:

> utf-8 is the sensible, "modern" (ie, non-ancient) default.
> If there is no reason to use latin-1, Emacs should use utf-8.
> I'm not claiming it's critical.
>
> Take it or leave it, as you wish.

That was the final message in the thread.  Glenn's patch from almost
three years ago no longer applied, so I've respun it for Emacs 28 now
(included below).

Glenn's arguments make sense to me, but I'm not a domain expert here.
Does anybody object to applying this patch to Emacs 28?

diff --git a/doc/emacs/mule.texi b/doc/emacs/mule.texi
index 6eff0ca0d2..b78019020a 100644
--- a/doc/emacs/mule.texi
+++ b/doc/emacs/mule.texi
@@ -1215,11 +1215,8 @@ File Name Coding
 
   If @code{file-name-coding-system} is @code{nil}, Emacs uses a
 default coding system determined by the selected language environment,
-and stored in the @code{default-file-name-coding-system} variable.
-@c FIXME?  Is this correct?  What is the "default language environment"?
-In the default language environment, non-@acronym{ASCII} characters in
-file names are not encoded specially; they appear in the file system
-using the internal Emacs representation.
+and stored in the @code{default-file-name-coding-system} variable
+(normally UTF-8).
 
 @cindex file-name encoding, MS-Windows
 @vindex w32-unicode-filenames
diff --git a/lisp/international/mule-cmds.el b/lisp/international/mule-cmds.el
index ccc8ac9f9e..e3155dfc52 100644
--- a/lisp/international/mule-cmds.el
+++ b/lisp/international/mule-cmds.el
@@ -1799,13 +1799,11 @@ reset-language-environment
    'raw-text)
 
   (set-default-coding-systems nil)
-  (setq default-sendmail-coding-system 'iso-latin-1)
-  ;; On Darwin systems, this should be utf-8-unix, but when this file is loaded
-  ;; that is not yet defined, so we set it in set-locale-environment instead.
-  ;; [Actually, it seems to work fine to use utf-8-unix here, and not just
-  ;; on Darwin.  The previous comment seems to be outdated?
-  ;; See patch at https://debbugs.gnu.org/15803 ]
-  (setq default-file-name-coding-system 'iso-latin-1-unix)
+  (setq default-sendmail-coding-system 'utf-8)
+  (setq default-file-name-coding-system (if (memq system-type
+                                                  '(windows-nt ms-dos))
+                                            'iso-latin-1-unix
+                                          'utf-8-unix))
   ;; Preserve eol-type from existing default-process-coding-systems.
   ;; On non-unix-like systems in particular, these may have been set
   ;; carefully by the user, or by the startup code, to deal with the
@@ -1821,8 +1819,10 @@ reset-language-environment
  (input-coding
  (condition-case nil
      (coding-system-change-text-conversion
-      (cdr default-process-coding-system) 'iso-latin-1)
-   (coding-system-error 'iso-latin-1))))
+      (cdr default-process-coding-system)
+      (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8))
+   (coding-system-error
+    (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8)))))
     (setq default-process-coding-system
   (cons output-coding input-coding)))
 
diff --git a/lisp/mail/sendmail.el b/lisp/mail/sendmail.el
index dd6eecbfd0..7610939e57 100644
--- a/lisp/mail/sendmail.el
+++ b/lisp/mail/sendmail.el
@@ -975,7 +975,7 @@ sendmail-coding-system
 See also the function `select-message-coding-system'.")
 
 ;;;###autoload
-(defvar default-sendmail-coding-system 'iso-latin-1
+(defvar default-sendmail-coding-system 'utf-8
   "Default coding system for encoding the outgoing mail.
 This variable is used only when `sendmail-coding-system' is nil.
 
diff --git a/lisp/mh-e/mh-comp.el b/lisp/mh-e/mh-comp.el
index f7e30bfbb3..8a69adbb75 100644
--- a/lisp/mh-e/mh-comp.el
+++ b/lisp/mh-e/mh-comp.el
@@ -305,6 +305,7 @@ mh-send-letter
   (let ((draft-buffer (current-buffer))
         (file-name buffer-file-name)
         (config mh-previous-window-config)
+        ;; FIXME this is subtly different to select-message-coding-system.
         (coding-system-for-write
          (if (fboundp 'select-message-coding-system)
              (select-message-coding-system) ; Emacs has this since at least 21.1
@@ -318,7 +319,7 @@ mh-send-letter
              (or (and (boundp 'sendmail-coding-system) sendmail-coding-system)
                  (and (default-boundp 'buffer-file-coding-system)
                       (default-value 'buffer-file-coding-system))
-                 'iso-latin-1)))))
+                 'utf-8)))))
     ;; Older versions of spost do not support -msgid and -mime.
     (unless mh-send-uses-spost-flag
       ;; Adding a Message-ID field looks good, makes it easier to search for

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: Eli Zaretskii <[hidden email]>,  [hidden email]
> Date: Wed, 09 Sep 2020 15:15:09 +0200
>
> Glenn's arguments make sense to me, but I'm not a domain expert here.
> Does anybody object to applying this patch to Emacs 28?

Please try building Emacs from a pristine tarball or a clean
repository in a directory with non-ASCII characters, under a
non-UTF-8, non-C locale.  If that works, I think this is good to go.

Thanks.



Eli Zaretskii
> From: Stefan Kangas <[hidden email]>
> Date: Wed, 9 Sep 2020 06:33:11 -0700
> Cc: Eli Zaretskii <[hidden email]>, [hidden email]
>
> Glenn Morris <[hidden email]> writes:
>
> > BTW, it feels very dated to me to have discussion of Windows 9X in the
> > Emacs manual section on file-name-coding.
>
> Agreed.  Could we move this discussion to the MS Windows FAQ instead?

I don't think the FAQ is the right place for this information.  So no,
please don't move it to the FAQ.

But we could move this to the MS-Windows appendix, leaving a
cross-reference where the text is now.



Lars Ingebrigtsen
Eli Zaretskii <[hidden email]> writes:

>> All the tools under Linux are so utf-8-focused these days...  let's
>> see...  I first, under a utf-8 locale created the directory "émacs",
>> then converted it to 8859-1:
>
> No, please create the directory with non-ASCII name _after_ switching
> the locale to Latin-1.

Shouldn't the result be the same?  I.e., a directory with an iso-8859-1
name?  The reason I did it this convoluted way was just that I couldn't
convince my system to make an 8859 name even after changing the locale.
That is, when I typed Alt-gr ' e, my terminal still sent over two bytes
(i.e., in utf-8) instead of a single-byte é.

But I think I know why "make check" was failing:

[larsi@stories ~/src/emacs/trunk]$ echo $LANG
sv_SE.ISO-8859-1
[larsi@stories ~/src/emacs/trunk]$ echo $LANG
en_US.UTF-8

The tests that were failing all talked about "chmod" and stuff, so I'm
guessing they were from a sub shell, and my system is apparently forcing
all new shells to use UTF-8...  And that was because I set the variables
in .bashrc.  I've now made them be 8859 also in sub-shells, but
unfortunately that doesn't help (it was a long shot, anyway -- these
aren't interactive shells, so .bashrc shouldn't be consulted).

make check:

>>Error occurred processing lisp/eshell/eshell-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/f\303\263o/test/lisp/eshell/eshell-tests.elcgtybBC"))

This time over, the directory is "fóo" (in latin-1), and that looks like
Emacs is trying to find the utf-8 version of the file name.

So it looks like the patch set has problems, and needs further fixes.
(Or "make check" has some problems here, since Emacs otherwise seems to
work fine.)

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: [hidden email],  [hidden email]
> Date: Fri, 11 Sep 2020 12:55:55 +0200
>
> Eli Zaretskii <[hidden email]> writes:
>
> >> All the tools under Linux are so utf-8-focused these days...  let's
> >> see...  I first, under a utf-8 locale created the directory "émacs",
> >> then converted it to 8859-1:
> >
> > No, please create the directory with non-ASCII name _after_ switching
> > the locale to Latin-1.
>
> Shouldn't the result be the same?  I.e., a directory with an iso-8859-1
> name?

No, because the Linux file I/O APIs are encoding-agnostic, they will
(AFAIK) create the directory with a name that is the exact byte stream
that you type at the mkdir command (or at the Emacs make-directory).
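
To make the difference concrete, here is a small sketch of how the same
bytes on disk read back under the two encodings.  The variable name is
hypothetical, and the byte values are simply the UTF-8 encoding of
"fóo" that a UTF-8 terminal would send, whatever the locale says.

```elisp
;; Bytes a UTF-8 terminal sends for "fóo", independent of the locale.
(defconst my/typed-bytes (unibyte-string ?f #xC3 #xB3 ?o)
  "Hypothetical directory name, as the raw bytes stored on disk.")

;; Read back as UTF-8, the name is the intended three characters.
(decode-coding-string my/typed-bytes 'utf-8)

;; Read back as Latin-1, every byte becomes its own character,
;; so the name has four characters: f, Ã, ³, o.
(decode-coding-string my/typed-bytes 'iso-latin-1)
```

So a directory created by typing in a UTF-8 terminal is not a Latin-1
test case, even if LANG names a Latin-1 locale at that moment.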

> The reason I did it this convoluted way was just that I couldn't
> convince my system to make an 8859 name even after changing the locale.
> That is, when I typed Alt-gr ' e, my terminal still sent over two bytes
> (i.e., in utf-8) instead of a single-byte é.

Try doing this in Emacs, and use one of the Latin input methods if the
keyboard doesn't cooperate.

> But I think I know why "make check" was failing:
>
> [larsi@stories ~/src/emacs/trunk]$ echo $LANG
> sv_SE.ISO-8859-1
> [larsi@stories ~/src/emacs/trunk]$ echo $LANG
> en_US.UTF-8

I don't understand this: 2 identical commands one after the other
yield different results?

> The tests that were failing all talked about "chmod" and stuff, so I'm
> guessing they were from a sub shell, and my system is apparently forcing
> all new shells to use UTF-8...

Really?  So there's no way to change the locale to something
non UTF-8?

> make check:
>
> >>Error occurred processing lisp/eshell/eshell-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/f\303\263o/test/lisp/eshell/eshell-tests.elcgtybBC"))
>
> This time over, the directory is "fóo" (in latin-1), and that looks like
> Emacs is trying to find the utf-8 version of the file name.

If that's the case, then we lack ENCODE_FILE (or more generally don't
encode a file name) somewhere.

> So it looks like the patch set has problems, and needs further fixes.
> (Or "make check" has some problems here, since Emacs otherwise seems to
> work fine.)

We could also just install the changes and wait for bug reports, on
the assumption that the problems you see aren't real.  Your call.



Lars Ingebrigtsen
Eli Zaretskii <[hidden email]> writes:

>> But I think I know why "make check" was failing:
>>
>> [larsi@stories ~/src/emacs/trunk]$ echo $LANG
>> sv_SE.ISO-8859-1
>> [larsi@stories ~/src/emacs/trunk]$ echo $LANG
>> en_US.UTF-8
>
> I don't understand this: 2 identical commands one after the other
> yield different results?

Sorry, there was a "bash" started in between there.

>> This time over, the directory is "fóo" (in latin-1), and that looks like
>> Emacs is trying to find the utf-8 version of the file name.
>
> If that's the case, then we lack ENCODE_FILE (or more generally don't
> encode a file name) somewhere.

After instrumenting bytecomp (i.e., adding a bunch of messages), I see
what function is actually failing.  With this in byte-compile-file:

                  (message "foo2: %S" (prin1-to-string tempfile))
                  (unless (= temp-modes desired-modes)
                    (set-file-modes tempfile desired-modes 'nofollow))
                  (message "foo1: %S" (prin1-to-string tempfile))

I get this output:

make[1]: Entering directory '/home/larsi/src/emacs/f�o/test'
  ELC      lisp/eshell/eshell-tests.elc
foo2: "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"
>>Error occurred processing lisp/eshell/eshell-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/f\303\263o/test/lisp/eshell/eshell-tests.elcnjDFYY"))
make[1]: *** [Makefile:165: lisp/eshell/eshell-tests.elc] Error 1

So it's created a tempfile, tagged with the correct charset (I had no
idea that that's how it worked), but decoded, and then set-file-modes
interprets that as an UTF-8 file name.

So...  it's a bug in set-file-modes?  Hm, nope, write-region has the
same problem.

That weird file name (decoded and tagged with a charset text parameter)
comes from make-temp-file -- everything seems to be OK before that.
target-file is:

foo: "\"/home/larsi/src/emacs/f\\363o/test/lisp/eshell/eshell-tests.elc\""

which seems to be correct, but

                       (tempfile
                        (make-temp-file (expand-file-name target-file)))

is

"#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"

and then things fail.  Which makes me wonder why building Emacs at all
works if it's such a fundamental problem...  Just to check whether my
system is switching the LANG back to utf-8:

          (message "foo: %S" (getenv "LC_ALL"))

in byte-compile-file says

foo: "sv_SE.ISO-8859-1"

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: [hidden email],  [hidden email]
> Date: Fri, 11 Sep 2020 13:27:28 +0200
>
> make[1]: Entering directory '/home/larsi/src/emacs/f�o/test'
>   ELC      lisp/eshell/eshell-tests.elc
> foo2: "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"
> >>Error occurred processing lisp/eshell/eshell-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/f\303\263o/test/lisp/eshell/eshell-tests.elcnjDFYY"))
> make[1]: *** [Makefile:165: lisp/eshell/eshell-tests.elc] Error 1
>
> So it's created a tempfile, tagged with the correct charset (I had no
> idea that that's how it worked), but decoded, and then set-file-modes
> interprets that as an UTF-8 file name.
>
> So...  it's a bug in set-file-modes?  Hm, nope, write-region has the
> same problem.

There be dragons ;-)

The problematic aspect of debugging these problems is that what you
see is not always what's there, due to display and decoding/encoding
operations by both Emacs and the display software you have on your
system (which drives the terminal).

In particular, strings inside Emacs are always in UTF-8-compatible
encoding, so the fact you get UTF-8 in *Messages* doesn't prove
anything.  What we need is to find 2 types of possible problems:

  . raw bytes from Latin-1 encoding inside Emacs buffers or strings
    that are supposed to be decoded
  . UTF-8 encoded (instead of Latin-1 encoded) characters passed to
    libc functions

So if you found that the problem reveals itself in set-file-modes,
let's see what happens there.  The relevant code is this:

  char *fname = SSDATA (ENCODE_FILE (absname));
  mode_t imode = XFIXNUM (mode) & 07777;
  if (fchmodat (AT_FDCWD, fname, imode, nofollow) != 0)
    report_file_error ("Doing chmod", absname);

Please either run this under GDB, or add printf's, to show the byte
sequences of 'absname' and of 'fname'.  The former should be in UTF-8
(so you should see 0xC3 and 0xB3 for the ó character), the latter
should be in Latin-1 (so you should see 0xF3 for the same letter).
This should give us some hints wrt where to look for the cause of the
problem.
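
The expected byte sequences can also be cross-checked from Lisp before
reaching for GDB.  This is a sketch, not the code path Emacs takes:
the helper name is hypothetical, and `encode-coding-string' is used
here as a stand-in for what ENCODE_FILE does in C, on the assumption
that the file-name coding system is the one passed in.

```elisp
(defun my/file-name-bytes (name coding)
  "Return the bytes NAME encodes to under CODING, as a list of integers.
Hypothetical helper; it approximates what ENCODE_FILE hands to libc."
  (append (encode-coding-string name coding) nil))

;; Internal representation of the decoded name (UTF-8 compatible):
(my/file-name-bytes "f\u00f3o" 'utf-8)        ; => (102 195 179 111)
;; What libc should receive when file names are Latin-1:
(my/file-name-bytes "f\u00f3o" 'iso-latin-1)  ; => (102 243 111)
```

Here 195 and 179 are 0xC3 and 0xB3, and 243 is 0xF3, matching the
octal escapes \303\263 and \363 seen in the error messages.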

> That weird file name (decoded and tagged with a charset text parameter)
> comes from make-temp-file -- everything seems to be OK before that.
> target-file is:
>
> foo: "\"/home/larsi/src/emacs/f\\363o/test/lisp/eshell/eshell-tests.elc\""
>
> which seems to be correct,

Where does the "foo:" printout come from?  I wouldn't expect to see
Latin-1 encoded strings inside Emacs, not normally anyway.

> but
>
>       (tempfile
> (make-temp-file (expand-file-name target-file)))
>
> is
>
> "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"

I see nothing wrong here: this is how decoding works in Emacs.  And
again, how did you produce this string?  As I explained above, the
details of how you display these strings matter in this case.



Lars Ingebrigtsen
Eli Zaretskii <[hidden email]> writes:

> So if you found that the problem reveals itself in set-file-modes,
> let's see what happens there.  The relevant code is this:

Yeah, I don't think that function is the problem in itself, but I don't
know where the problem originates either.

>> foo: "\"/home/larsi/src/emacs/f\\363o/test/lisp/eshell/eshell-tests.elc\""
>>
>> which seems to be correct,
>
> Where does the "foo:" printout come from?  I wouldn't expect to see
> Latin-1 encoded strings inside Emacs, not normally anyway.

I just added a bunch of

          (message "foo: %S" variable)

here and there in byte-compile-file to watch how the passed-in string is
transformed.

>>       (tempfile
>> (make-temp-file (expand-file-name target-file)))
>>
>> is
>>
>> "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"
>
> I see nothing wrong here: this is how decoding works in Emacs.  And
> again, how did you produce this string?  As I explained above, the
> details of how you display these strings matter in this case.

Same way as above.

The file name is on the "f\\363o/test" form until make-temp-name, and
then it turns into a different string with a text property.  But I don't
know how much this is an artefact of how Emacs prints these things and
how much it's actually, er...  actual.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Lars Ingebrigtsen
In reply to this post by Eli Zaretskii
Another confusing data point.  If I say "make" in the test directory, I
get:

foo 1: "\"/home/larsi/src/emacs/f\\363o/test/lisp/eshell/eshell-tests.elc\""
foo 2: "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcGvbK3T\" 0 65 (charset iso-8859-1))"

If I just say "make" in the main directory, I get this:

foo 1: "\"/home/larsi/src/emacs/f�o/lisp/dos-w32.elc\""
foo 2: "\"/home/larsi/src/emacs/fóo/lisp/dos-w32.elcXgukAl\""

Or, if that doesn't survive emailing, here's an image:



Note -- no text properties, and not represented as "f\363o".

*scratches head*

So is this a problem with how ert calls the byte compiler after all?

This is with

diff --git a/lisp/emacs-lisp/bytecomp.el b/lisp/emacs-lisp/bytecomp.el
index 966990bac9..07448033ac 100644
--- a/lisp/emacs-lisp/bytecomp.el
+++ b/lisp/emacs-lisp/bytecomp.el
@@ -1990,6 +1990,7 @@ byte-compile-file
  (with-current-buffer output-buffer
   (goto-char (point-max))
   (insert "\n") ; aaah, unix.
+          (message "foo 1: %S" (prin1-to-string (expand-file-name target-file)))
   (if (file-writable-p target-file)
       ;; We must disable any code conversion here.
       (progn
@@ -2007,6 +2008,7 @@ byte-compile-file
  (cons (lambda () (ignore-errors
    (delete-file tempfile)))
       kill-emacs-hook)))
+  (message "foo 2: %S" (prin1-to-string tempfile))
   (unless (= temp-modes desired-modes)
     (set-file-modes tempfile desired-modes 'nofollow))
   (write-region (point-min) (point-max) tempfile nil 1)


--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no


Eli Zaretskii
In reply to this post by Lars Ingebrigtsen
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: [hidden email],  [hidden email]
> Date: Fri, 11 Sep 2020 14:33:08 +0200
>
> The file name is on the "f\\363o/test" form until make-temp-name

That shouldn't happen.  It probably means we lack a DECODE_FILE
somewhere.  File names inside Emacs should always be decoded into
UTF-8.
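
The convention described here can be sketched outside Emacs as follows. This is a loose Python analogy, not Emacs code: `decode_file` and `encode_file` are hypothetical stand-ins for DECODE_FILE and ENCODE_FILE, and the coding system is passed explicitly instead of being taken from the file-name coding system.

```python
def decode_file(raw: bytes, coding: str) -> str:
    """Bytes received from the OS -> internal text (done once, on the way in)."""
    return raw.decode(coding)


def encode_file(name: str, coding: str) -> bytes:
    """Internal text -> bytes handed to the OS (done once, on the way out)."""
    return name.encode(coding)


on_disk = b"f\xf3o"                         # Latin-1 directory name as stored on disk
internal = decode_file(on_disk, "latin-1")  # text used everywhere inside the program

# Using the same coding system in both directions restores the original bytes.
same = encode_file(internal, "latin-1")
# Encoding with a different coding system yields bytes that name no existing file.
mismatch = encode_file(internal, "utf-8")

print(same == on_disk, mismatch)
```

In this analogy, a name that skips the decoding step stays a raw byte string (the counterpart of the "f\363o" form), which is the situation a missing DECODE_FILE would produce.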

> and
> then it turns into a different string with a text property.  But I don't
> know how much this is an artefact of how Emacs prints these things and
> how much it's actually, er...  actual.

The only way to know is to add printf's or look in GDB.



Eli Zaretskii
In reply to this post by Lars Ingebrigtsen
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: [hidden email],  [hidden email]
> Date: Fri, 11 Sep 2020 14:39:07 +0200
>
> So is this a problem with how ert calls the byte compiler after all?

I don't think so, but I'm not sure.  It could be some shenanigans of
expand-file-name, for example: it has its own ideas for when to
produce a unibyte string and when a multibyte string.

Again, the fact that "foo 1" displays a unibyte undecoded file name
sounds wrong to me.  Is target-file also a unibyte Latin-1 string?



Lars Ingebrigtsen
In reply to this post by Eli Zaretskii
I'm just poking around to see what's different between the way the files
are compiled in the test directory and the lisp directory, because they
should either both fail or not.

So here's how "make" in test does it:

EMACSLOADPATH= LC_ALL=C EMACS_TEST_DIRECTORY=/home/larsi/src/emacs/fóo/test  "../src/emacs" --module-assertions --no-init-file --no-site-file --no-site-lisp -L ":."  --batch -f batch-byte-compile lisp/eshell/eshell-tests.el

Here's how "make" in Lisp does it:

EMACSLOADPATH= '../src/emacs' -batch --no-site-file --no-site-lisp --eval '(setq load-prefer-newer t)'  -f batch-byte-compile emacs-lisp/bytecomp.el

And, indeed, if I remove "LC_ALL=C" from the line, then this compiles
successfully.

*phew*

Hm...  in fact, everything compiles successfully without LC_ALL?

However, when the tests run (in the latin-1 environment), 11 test files have unexpected results:

SUMMARY OF TEST RESULTS
-----------------------
Files examined: 305
Ran 4200 tests, 4097 results as expected, 29 unexpected, 74 skipped
1 files did not contain any tests:
  src/emacs-module-tests.log
11 files contained unexpected results:
  src/regex-emacs-tests.log
  lisp/vc/vc-bzr-tests.log
  lisp/vc/diff-mode-tests.log
  lisp/time-stamp-tests.log
  lisp/net/shr-tests.log
  lisp/gnus/mml-sec-tests.log
  lisp/epg-tests.log
  lisp/emacs-lisp/package-tests.log
  lisp/emacs-lisp/faceup-tests/faceup-test-files.log
  lisp/cedet/semantic-utest-ia.log
  lib-src/emacsclient-tests.log

As a comparison, removing the LC_ALL in a UTF-8 environment (with a
pure-ascii path) gives me:

SUMMARY OF TEST RESULTS
-----------------------
Files examined: 305
Ran 4231 tests, 4150 results as expected, 6 unexpected, 75 skipped
6 files contained unexpected results:
  src/emacs-module-tests.log
  src/callint-tests.log
  lisp/vc/vc-bzr-tests.log
  lisp/subr-tests.log
  lisp/files-tests.log
  lisp/emacs-lisp/gv-tests.log

The bzr test fails because of the brz/bzr thing, but the LC_ALL is
apparently needed for the other five things.

So: In conclusion, I think Glenn's patch needs more work before
applying.  :-)  But at least we now know that it breaks, and why (well,
for some of it).

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Lars Ingebrigtsen
Lars Ingebrigtsen <[hidden email]> writes:

> And, indeed, if I remove "LC_ALL=C" from the line, then this compiles
> successfully.

Oh, wow.  Apparently nobody is using non-ASCII in their Emacs paths?  I
just did a "mv trunk góo" on my laptop (UTF-8 environment), nothing
altered from out-of-the-box on Debian bullseye, and make check:

>>Error occurred processing lisp/emacs-lisp/regexp-opt-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/g\303\203\302\263o/test/lisp/emacs-lisp/regexp-opt-tests.elc15Rc5M"))
make[3]: *** [Makefile:165: lisp/emacs-lisp/regexp-opt-tests.elc] Error 1

for all the files.

So the LC_ALL=C thing in the compilation phase is just...  wrong?
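
One way to read the escaped bytes in the error above: `\303\203\302\263` is what results when the UTF-8 bytes of "ó" are decoded as Latin-1 and the result is then encoded as UTF-8 again. The Python sketch below only checks that arithmetic. It is an illustration, and it does not establish where in Emacs (or in the printing of the message) such a double conversion takes place.

```python
original = "g\u00f3o"                     # "góo", the renamed directory

utf8_bytes = original.encode("utf-8")     # b"g\xc3\xb3o"
misread = utf8_bytes.decode("latin-1")    # each byte taken as one Latin-1 character
double_encoded = misread.encode("utf-8")  # b"g\xc3\x83\xc2\xb3o"

# Show non-ASCII bytes as octal escapes, as in the quoted error message.
octal = "".join(chr(b) if b < 0x80 else "\\%03o" % b for b in double_encoded)
print(octal)  # g\303\203\302\263o
```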

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: [hidden email],  [hidden email]
> Date: Fri, 11 Sep 2020 16:27:30 +0200
>
> >>Error occurred processing lisp/emacs-lisp/regexp-opt-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/g\303\203\302\263o/test/lisp/emacs-lisp/regexp-opt-tests.elc15Rc5M"))
> make[3]: *** [Makefile:165: lisp/emacs-lisp/regexp-opt-tests.elc] Error 1
>
> for all the files.
>
> So the LC_ALL=C thing in the compilation phase is just...  wrong?

It's probably not TRT when the directory is non-ASCII.  But note that
you can say

   make check TEST_LOCALE=<whatever>

Does it help to use the locale you have set?

"git log -L" indicates that the default setting of TEST_LOCALE=C was
introduced in commit 4874f0b.  It would be interesting to see what the
tests mentioned in the log message of that commit yield if the locale
is not C.


