Understanding how to specify UTF-8

Understanding how to specify UTF-8

Will Parsons-3
I want to always use Unicode/UTF-8 unless otherwise specified.  I've noticed
that I've attempted to do this in my .emacs file in two separate ways on two
separate platforms:

1)  (setq-default buffer-file-coding-system 'utf-8-unix)

2)  (set-language-environment "UTF-8")

Both seem to work, but I'm wondering if there are subtle differences between
the two that I should be aware of.

--
Will

Re: Understanding how to specify UTF-8

Eli Zaretskii
> From: Will Parsons <[hidden email]>
> Date: 7 Apr 2017 23:43:55 GMT
>
> I want to always use Unicode/UTF-8 unless otherwise specified.

This doesn't say exactly what you want to happen.  The above
basically says "I want to use UTF-8 except when I don't", and doesn't
say a word about those "I don't" cases.  So please elaborate, so that
the responses can be more accurate.

For example, what about files you edit that were previously encoded in
something other than UTF-8?  What about responding to email encoded in
something other than UTF-8?  And so on.

> I've noticed that I've attempted to do this in my .emacs file in two
> separate ways on two separate platforms:
>
> 1)  (setq-default buffer-file-coding-system 'utf-8-unix)
>
> 2)  (set-language-environment "UTF-8")
>
> Both seem to work, but I'm wondering if there are subtle differences between
> the two that I should be aware of.

The second one is better, as it leaves Emacs more leeway where UTF-8
might not be appropriate.  But it's difficult to know what to advise
without the additional information.

Re: Understanding how to specify UTF-8

B. T. Raven-4
In reply to this post by Will Parsons-3
Hi Will. I decided to respond because of this observation in the latest
posting:
"They used to say emacs and vi are religions; these days they are
starting to seem like latin."

On 4/7/2017 18:43, Will Parsons wrote:

> I want to always use Unicode/UTF-8 unless otherwise specified.  I've noticed
> that I've attempted to do this in my .emacs file in two separate ways on two
> separate platforms:
>
> 1)  (setq-default buffer-file-coding-system 'utf-8-unix)
>
> 2)  (set-language-environment "UTF-8")
>
> Both seem to work, but I'm wondering if there are subtle differences between
> the two that I should be aware of.


I can't help with any subtleties, but can only recommend that you add this
cookie to the beginning of the buffer:

  ;; -*- coding: utf-8 -*-


I think it may be enough to save and reload the file into a new buffer
before adding exotic characters.
I also have these lines in my .emacs:

   (set-locale-environment "utf-8")
   (set-language-environment 'utf-8)
   (set-default-coding-systems 'utf-8)
   (setq file-name-coding-system 'utf-8)
   (setq buffer-file-coding-system 'utf-8)
   (setq coding-system-for-write 'utf-8)
   (set-keyboard-coding-system 'utf-8)
   (set-terminal-coding-system 'utf-8)
   (prefer-coding-system 'utf-8)
   ;; (set-buffer-process-coding-system 'utf-8 'utf-8)
   (modify-coding-system-alist 'process
                               "[cC][mM][dD][pP][rR][oO][xX][yY]" 'utf-8-dos)


The commented-out line caused a problem, but I don't remember what it
was. My OS is 64-bit Windows 7.

Ed

Re: Understanding how to specify UTF-8

Eli Zaretskii
> From: "B. T. Raven" <[hidden email]>
> Date: Thu, 13 Apr 2017 00:09:51 -0500
>
> I also have these lines in my .emacs:
>
>    (set-locale-environment   "utf-8")
>          (set-language-environment               'utf-8)
>          (set-default-coding-systems             'utf-8)
>          (setq file-name-coding-system           'utf-8)
>          (setq buffer-file-coding-system 'utf-8)
>          (setq coding-system-for-write           'utf-8)
>          (set-keyboard-coding-system             'utf-8)
>          (set-terminal-coding-system          'utf-8)
>          (prefer-coding-system                   'utf-8)
>          ;; (set-buffer-process-coding-system 'utf-8 'utf-8)
>          (modify-coding-system-alist 'process
> "[cC][mM][dD][pP][rR][oO][xX][yY]" 'utf-8-dos)
>
>
> The line commented out caused a problem but I don't remember what it
> was. My os w64 vers. 7

Some of the above are not recommended, and some are downright
dangerous (a.k.a. "shooting yourself in the foot").  Especially on
MS-Windows, UTF-8 should be used with extra care, because Windows only
partially supports this encoding in its APIs.

Specifically:

>    (set-locale-environment   "utf-8")

Don't do this on Windows, as Windows locales cannot use UTF-8 as their
encoding.

>          (set-language-environment               'utf-8)
>          (set-default-coding-systems             'utf-8)

Redundant as long as you have the prefer-coding-system call below.

>          (setq file-name-coding-system           'utf-8)

This is a no-op: Emacs on Windows ignores the value of this variable,
except if you are on Windows 9X, and file names cannot be encoded in
UTF-8 on Windows anyway.  Starting with Emacs 24.4, Emacs on Windows
uses Unicode APIs to deal with file names, so it supports non-ASCII
file names with all Unicode characters, and you don't need to do
anything to get this support.

>          (setq buffer-file-coding-system 'utf-8)

Dangerous.  Also redundant with prefer-coding-system below.

>          (setq coding-system-for-write           'utf-8)

This is dangerous: it will produce subtle issues with some commands,
notably when invoking subprocesses with non-ASCII strings in
command-line arguments.  This variable exists so that Lisp programs
could force specific encoding where appropriate, so leave it to that
and don't globally set it.
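
As an illustration of the "leave it to Lisp programs" point, here is a minimal sketch of binding the variable locally instead of setting it globally.  The function name `my-write-utf8-file' is hypothetical and not from this thread; only `coding-system-for-write' and `write-region' are standard Emacs.

```elisp
;; Hypothetical helper (the name is an assumption, not from the thread).
;; It forces UTF-8 for one write operation only; the global value of
;; `coding-system-for-write' stays nil.
(defun my-write-utf8-file (string file)
  "Write STRING to FILE encoded as UTF-8 with Unix line endings."
  (let ((coding-system-for-write 'utf-8-unix)) ; binding ends with the `let'
    ;; When START is a string, `write-region' writes that string
    ;; rather than buffer contents; END is then ignored.
    (write-region string nil file)))
```

Because the binding is dynamic and scoped to the `let', subprocess calls made elsewhere should be unaffected by it.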

>          (set-keyboard-coding-system             'utf-8)
>          (set-terminal-coding-system          'utf-8)

These are wrong, and will get in the way when you work in -nw
sessions.  Emacs on MS-Windows doesn't fully support UTF-8 encoding of
keyboard input and console output, even if you tweak your system's
codepage to be 65001 (did you?).

>          (prefer-coding-system                   'utf-8)

This is the only setting that you should have if you want to use UTF-8
wherever possible and reasonable.

>          ;; (set-buffer-process-coding-system 'utf-8 'utf-8)
>          (modify-coding-system-alist 'process
> "[cC][mM][dD][pP][rR][oO][xX][yY]" 'utf-8-dos)

This is wrong: Emacs on MS-Windows doesn't support UTF-8 encoding of
program command-line arguments for subprocesses, and most Windows
programs will NOT talk UTF-8 in their standard streams.
prefer-coding-system should take care of those situations where this
is possible/actually happens; the rest should be left alone, or you
will have subtle problems with non-ASCII I/O vis-a-vis subprocesses.
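
Taken together, the advice above reduces to a very small init file.  A minimal sketch, assuming you want UTF-8 preferred wherever Emacs has a choice and everything else left at its platform defaults:

```elisp
;; The one setting recommended above: prefer UTF-8 when detecting and
;; choosing coding systems, without forcing it anywhere.
(prefer-coding-system 'utf-8)

;; Optional: evaluate this to inspect the resulting priority order.
(coding-system-priority-list)
```

All the other forms from the quoted .emacs are deliberately absent from this sketch.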

HTH


Re: Understanding how to specify UTF-8

hector
@Eli: Thank you. Everything works better when you know what you're doing.

Re: Understanding how to specify UTF-8

Will Parsons-3
In reply to this post by B. T. Raven-4
B. T. Raven wrote:
> Hi Will. I decided to respond because of this observation in the latest
> posting:
> "They used to say emacs and vi are religions; these days they are
> starting to seem like latin."

Not completely - "Emacs" should be spelt "Emax" first ;)
(And the plural, I suppose, should be "emaces" rather than "emacsen".)

> On 4/7/2017 18:43, Will Parsons wrote:
>> I want to always use Unicode/UTF-8 unless otherwise specified.  I've noticed
>> that I've attempted to do this in my .emacs file in two separate ways on two
>> separate platforms:
>>
>> 1)  (setq-default buffer-file-coding-system 'utf-8-unix)
>>
>> 2)  (set-language-environment "UTF-8")
>>
>> Both seem to work, but I'm wondering if there are subtle differences between
>> the two that I should be aware of.
>
> I can't help with any subtleties, but can only recommend that you add this
> cookie to the beginning of the buffer:
>
>   ;; -*- coding: utf-8 -*-

Yes, I've employed that too.  (Incidentally, I've been programming a lot in
Ruby for some years now, and I was surprised to find that, after I inserted a
copyright symbol (©) into one of my Ruby source files, Emacs ruby-mode
inserted a line containing '# coding: utf-8' at the top when the file was
saved.)
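
That behaviour appears to come from ruby-mode's user option `ruby-insert-encoding-magic-comment', which is enabled by default in recent versions of ruby-mode.  A minimal sketch, assuming you would rather ruby-mode left the top of the file alone:

```elisp
;; Stop ruby-mode from adding a "# coding: ..." line on save when the
;; buffer contains non-ASCII characters.
(setq ruby-insert-encoding-magic-comment nil)
```

Ruby 2.0 and later assume UTF-8 source by default, so the magic comment mainly matters for older interpreters.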

> I think it may be enough to save and reload the file into a new buffer
> before adding exotic characters.
> I also have these lines in my .emacs:
>
>    (set-locale-environment   "utf-8")
>          (set-language-environment               'utf-8)
>          (set-default-coding-systems             'utf-8)
>          (setq file-name-coding-system           'utf-8)
>          (setq buffer-file-coding-system 'utf-8)
>          (setq coding-system-for-write           'utf-8)
>          (set-keyboard-coding-system             'utf-8)
>          (set-terminal-coding-system          'utf-8)
>          (prefer-coding-system                   'utf-8)
>          ;; (set-buffer-process-coding-system 'utf-8 'utf-8)
>          (modify-coding-system-alist 'process
> "[cC][mM][dD][pP][rR][oO][xX][yY]" 'utf-8-dos)
>
> The line commented out caused a problem but I don't remember what it
> was. My os w64 vers. 7

Wow.  I should think that should cover all possibilities.  I prefer to be a
bit more minimalist than that though...

Anyway, thanks - Vale Edwarde!

--
Will

Re: Understanding how to specify UTF-8

Jason Rumney-5
In reply to this post by Will Parsons-3
On Saturday, 8 April 2017 07:43:58 UTC+8, Will Parsons  wrote:

> I want to always use Unicode/UTF-8 unless otherwise specified.  I've noticed
> that I've attempted to do this in my .emacs file in two separate ways on two
> separate platforms:
>
> 1)  (setq-default buffer-file-coding-system 'utf-8-unix)
>
> 2)  (set-language-environment "UTF-8")
>
> Both seem to work, but I'm wondering if there are subtle differences between
> the two that I should be aware of.

The first only sets the default coding system for files.

The second sets it for everything, including the system clipboard, file names, process I/O ...

On modern GNU/Linux, Mac, or other POSIX-based OSes, you probably want everything in UTF-8, so the latter is correct.

On Windows, the system itself does not support UTF-8 fully, so the former is safer. For the clipboard and file names on Windows, the latest versions of Emacs will use Unicode regardless of what you specify for the coding system. It is really only process I/O that is the problem: Cygwin and MinGW apps may support UTF-8 I/O, but native Windows apps (including the cmd.exe shell) can have severe difficulties with it.
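
The platform split described in this message could be written as a single conditional in a shared init file.  This is only a sketch of this message's suggestion; note that Eli's earlier reply in this thread recommends `prefer-coding-system' instead, so treat the choice of forms as an assumption.

```elisp
(if (eq system-type 'windows-nt)
    ;; Native MS-Windows build: only set the default for files and
    ;; leave clipboard, file names and process I/O to Emacs.
    (setq-default buffer-file-coding-system 'utf-8-unix)
  ;; GNU/Linux, macOS and other POSIX systems: UTF-8 throughout.
  (set-language-environment "UTF-8"))
```

A Cygwin build of Emacs reports `cygwin' rather than `windows-nt' as its `system-type', so it would take the second branch here.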

Re: Understanding how to specify UTF-8

Eli Zaretskii
> Date: Fri, 21 Apr 2017 02:28:45 -0700 (PDT)
> From: Jason Rumney <[hidden email]>
>
> On Windows, the system itself does not support UTF-8 fully, so the former is safer. For clipboard and file names on Windows, the latest versions of Emacs will use Unicode regardless of what you specify for the coding system, it is really only process I/O that is the problem - Cygwin and Mingw apps may support UTF-8 I/O, but native Windows apps (including the cmd.exe shell) can have severe difficulties with it.

MinGW apps are native apps, so they don't support UTF-8.  I think you
meant MSYS, not MinGW (and then only MSYS2 apps support UTF-8).

Re: Understanding how to specify UTF-8

Will Parsons-3
In reply to this post by Jason Rumney-5
Jason Rumney wrote:

> On Saturday, 8 April 2017 07:43:58 UTC+8, Will Parsons  wrote:
>> I want to always use Unicode/UTF-8 unless otherwise specified.  I've noticed
>> that I've attempted to do this in my .emacs file in two separate ways on two
>> separate platforms:
>>
>> 1)  (setq-default buffer-file-coding-system 'utf-8-unix)
>>
>> 2)  (set-language-environment "UTF-8")
>>
>> Both seem to work, but I'm wondering if there are subtle differences between
>> the two that I should be aware of.
>
> The first only sets the default coding system for Files.
>
> The second sets it for everything, including system clipboard, file names, process I/O ...
>
> On modern GNU/Linux, Mac or other Posix based OS's, you probably want everything in UTF-8, so the latter is correct.
>
> On Windows, the system itself does not support UTF-8 fully, so the former is safer. For clipboard and file names on Windows, the latest versions of Emacs will use Unicode regardless of what you specify for the coding system, it is really only process I/O that is the problem - Cygwin and Mingw apps may support UTF-8 I/O, but native Windows apps (including the cmd.exe shell) can have severe difficulties with it.

Thank you for this detailed answer.  Interestingly enough, I have them
reversed in my Unix vs Windows configurations.

--
Will

Re: Understanding how to specify UTF-8

Stefan Monnier
In reply to this post by Will Parsons-3
> I want to always use Unicode/UTF-8 unless otherwise specified.

If your locale is using utf-8 (which it should nowadays in most cases
under GNU/Linux, especially if you "want to always use Unicode/UTF-8"),
then Emacs should already do that automatically.
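
One way to check whether Emacs has already picked up UTF-8 from the locale is to inspect a few standard variables before adding anything to the init file.  A small sketch; evaluate it with M-: or in the *scratch* buffer, ideally in a session started with `emacs -Q' so that no init-file settings interfere:

```elisp
;; Returns a list of three values: the coding system derived from the
;; locale, the current language environment, and the default coding
;; system for newly created file buffers.
(list locale-coding-system
      current-language-environment
      (default-value 'buffer-file-coding-system))
```

If all three already mention UTF-8, no further configuration should be needed for the common cases.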


        Stefan

