`write-region' writes different bytes than passed to it?


Philipp Stephani
Hi,

usually `write-region' uses the coding system bound to
`coding-system-for-write'. However, I've found a case where this
doesn't seem to be the case:

$ emacs -Q -batch -eval '(let ((coding-system-for-write (quote
utf-8-unix))) (write-region "\xC1\xB2" nil "/tmp/test.txt"))' && hd
/tmp/test.txt
00000000  f2                                                |.|
00000001

That is, instead of the byte sequence C1 B2 it writes the single byte
F2, which is an invalid UTF-8 sequence. Is that expected?

Thanks,
Philipp


Re: `write-region' writes different bytes than passed to it?

Philipp Stephani
On Tue, Dec 11, 2018 at 1:30 PM, Philipp Stephani
<[hidden email]> wrote:

>
> Hi,
>
> usually `write-region' uses the coding system bound to
> `coding-system-for-write'. However, I've found a case where this
> doesn't seem to be the case:
>
> $ emacs -Q -batch -eval '(let ((coding-system-for-write (quote
> utf-8-unix))) (write-region "\xC1\xB2" nil "/tmp/test.txt"))' && hd
> /tmp/test.txt
> 00000000  f2                                                |.|
> 00000001
>
> That is, instead of the byte sequence C1 B2 it writes the single byte
> F2, which is an invalid UTF-8 sequence. Is that expected?

I've realized that I can use either string-to-multibyte or
string-as-multibyte to force writing the expected bytes. Still, it
seems weird that, when confronted with an invalid UTF-8 sequence,
`write-region' occasionally writes a *different* invalid UTF-8
sequence.
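
For reference, the workaround mentioned above can be sketched like this
(assuming the same batch setup as the original example):

```elisp
;; Sketch of the workaround: converting the unibyte string to
;; multibyte first makes `write-region' emit the original bytes
;; C1 B2 rather than the single raw byte F2.
(let ((coding-system-for-write 'utf-8-unix))
  ;; `string-to-multibyte' maps each byte >= #x80 to the
  ;; corresponding raw-byte (eight-bit) character, and raw bytes
  ;; are written out as themselves.
  (write-region (string-to-multibyte "\xC1\xB2") nil "/tmp/test.txt"))
```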


Re: `write-region' writes different bytes than passed to it?

Eli Zaretskii
In reply to this post by Philipp Stephani
> From: Philipp Stephani <[hidden email]>
> Date: Tue, 11 Dec 2018 13:30:07 +0100
>
> usually `write-region' uses the coding system bound to
> `coding-system-for-write'. However, I've found a case where this
> doesn't seem to be the case:
>
> $ emacs -Q -batch -eval '(let ((coding-system-for-write (quote
> utf-8-unix))) (write-region "\xC1\xB2" nil "/tmp/test.txt"))' && hd
> /tmp/test.txt
> 00000000  f2                                                |.|
> 00000001
>
> That is, instead of the byte sequence C1 B2 it writes the single byte
> F2, which is an invalid UTF-8 sequence. Is that expected?

Yes, because "\xC1\xB2" just happens to be the internal multibyte
representation of a raw-byte F2.  Raw bytes are always converted to
their single-byte values on output, regardless of the encoding you
request.
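
The mapping can be checked by hand: internally, a raw byte B in the
range #x80..#xFF is stored as a lead byte #xC0 + ((B >> 6) - 2)
followed by a continuation byte #x80 + (B & #x3F). Inverting that for
the pair C1 B2 (a sketch of the arithmetic only, not of any internal
API):

```elisp
;; Decode the internal two-byte pair C1 B2 back to the raw byte it
;; represents: recover the top two bits from the lead byte and the
;; low six bits from the continuation byte.
(let ((lead #xC1) (cont #xB2))
  (logior (ash (+ (- lead #xC0) 2) 6)   ; top bits: (1 + 2) << 6 = #xC0
          (logand cont #x3F)))          ; low bits: #x32
;; => #xF2, the single byte that ends up in the file
```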


Re: `write-region' writes different bytes than passed to it?

Eli Zaretskii
In reply to this post by Philipp Stephani
> From: Philipp Stephani <[hidden email]>
> Date: Tue, 11 Dec 2018 13:42:59 +0100
>
> Still it seems weird that when confronted with an invalid UTF-8
> sequence `write-region' occasionally writes a *different* invalid
> UTF-8 sequence.

The internal representation of characters is not UTF-8, it is a
superset of UTF-8.  So some sequences that are invalid UTF-8 are valid
for the internal representation.


Re: `write-region' writes different bytes than passed to it?

Stefan Monnier
In reply to this post by Eli Zaretskii
> Yes, because "\xC1\xB2" just happens to be the internal multibyte
> representation of a raw-byte F2.  Raw bytes are always converted to
> their single-byte values on output, regardless of the encoding you
> request.

Maybe we shouldn't encode unibyte strings (under the assumption
that a unibyte string is already encoded: it's a sequence of bytes
rather than a sequence of chars).


        Stefan



Re: `write-region' writes different bytes than passed to it?

Eli Zaretskii
> From: Stefan Monnier <[hidden email]>
> Date: Tue, 11 Dec 2018 11:36:13 -0500
>
> > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > representation of a raw-byte F2.  Raw bytes are always converted to
> > their single-byte values on output, regardless of the encoding you
> > request.
>
> Maybe we shouldn't encode unibyte strings (under the assumption
> that a unibyte string is already encoded: it's a sequence of bytes
> rather than a sequence of chars).

I'm not sure that single use case is important enough to change
something that was working like that since Emacs 23.  Who knows how
many more important use cases this will break?

This whole area is crawling with heuristics, whose only justification
is that it does TRT in the vast majority of use cases.


Re: `write-region' writes different bytes than passed to it?

Stefan Monnier
> I'm not sure that single use case is important enough to change
> something that was working like that since Emacs 23.  Who knows how
> many more important use cases this will break?

Oh, indeed, especially since it sounds to me like the problem is in the
original code (if you don't want to change the bytes, then use a `binary`
encoding rather than utf-8).

> This whole area is crawling with heuristics, whose only justification
> is that it does TRT in the vast majority of use cases.

Exactly: I think we should try and get rid of those heuristics
(progressively).  Actually, it's already what we've been doing since
Emacs-20, tho "lately" the progression in this respect has slowed
down I think.


        Stefan



Re: `write-region' writes different bytes than passed to it?

Philipp Stephani
In reply to this post by Eli Zaretskii
On Tue, Dec 11, 2018 at 4:52 PM, Eli Zaretskii <[hidden email]> wrote:

>
> > From: Philipp Stephani <[hidden email]>
> > Date: Tue, 11 Dec 2018 13:30:07 +0100
> >
> > usually `write-region' uses the coding system bound to
> > `coding-system-for-write'. However, I've found a case where this
> > doesn't seem to be the case:
> >
> > $ emacs -Q -batch -eval '(let ((coding-system-for-write (quote
> > utf-8-unix))) (write-region "\xC1\xB2" nil "/tmp/test.txt"))' && hd
> > /tmp/test.txt
> > 00000000  f2                                                |.|
> > 00000001
> >
> > That is, instead of the byte sequence C1 B2 it writes the single byte
> > F2, which is an invalid UTF-8 sequence. Is that expected?
>
> Yes, because "\xC1\xB2" just happens to be the internal multibyte
> representation of a raw-byte F2.  Raw bytes are always converted to
> their single-byte values on output, regardless of the encoding you
> request.
>

Is that documented somewhere?
Or, in other words, what are the semantics of

(let ((coding-system-for-write 'utf-8-unix)) (write-region STRING ...))

?
There are two easy cases:
1. STRING is a unibyte string containing only bytes within the ASCII range
2. STRING is a multibyte string containing only Unicode scalar values
In those cases the answer is simple: The form writes the UTF-8
representation of STRING.
However, the interesting cases are as follows:
3. STRING is a unibyte string with at least one byte outside the ASCII range
4. STRING is a multibyte string with at least one element that is not
a Unicode scalar value
My example is an instance of (3). I admit I haven't read the entire
Emacs Lisp reference manual, but quite some parts of it, and I
couldn't find a description of the cases (3) and (4). Naively there
are a couple options:
- Signal an error. That would seem appropriate as such strings can't
be encoded as UTF-8. However, evidently Emacs doesn't do this.
- For case 3, write the bytes in STRING, ignoring the coding system. I
had expected this to happen, but apparently it isn't the case either.
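
The four cases can be probed directly with `encode-coding-string',
which performs the same conversion that `write-region' applies (a
sketch; the results for cases 3 and 4 reflect the behavior reported in
this thread):

```elisp
;; Probing the four cases with `encode-coding-string'.
(encode-coding-string "abc" 'utf-8-unix)        ; case 1: "abc"
(encode-coding-string "ä" 'utf-8-unix)          ; case 2: "\303\244"
(encode-coding-string "\xC1\xB2" 'utf-8-unix)   ; case 3: "\362" (!)
(encode-coding-string (string #x3FFF72)         ; case 4: raw-byte char
                      'utf-8-unix)              ; also yields "\362"
```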


Re: `write-region' writes different bytes than passed to it?

Philipp Stephani
In reply to this post by Stefan Monnier
On Tue, Dec 11, 2018 at 5:50 PM, Stefan Monnier
<[hidden email]> wrote:

>
> > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > representation of a raw-byte F2.  Raw bytes are always converted to
> > their single-byte values on output, regardless of the encoding you
> > request.
>
> Maybe we shouldn't encode unibyte strings (under the assumption
> that a unibyte string is already encoded: it's a sequence of bytes
> rather than a sequence of chars).
>


That's what I'd expect (either this, or a signal).


Re: `write-region' writes different bytes than passed to it?

Philipp Stephani
In reply to this post by Eli Zaretskii
On Tue, Dec 11, 2018 at 7:41 PM, Eli Zaretskii <[hidden email]> wrote:

>
> > From: Stefan Monnier <[hidden email]>
> > Date: Tue, 11 Dec 2018 11:36:13 -0500
> >
> > > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > > representation of a raw-byte F2.  Raw bytes are always converted to
> > > their single-byte values on output, regardless of the encoding you
> > > request.
> >
> > Maybe we shouldn't encode unibyte strings (under the assumption
> > that a unibyte string is already encoded: it's a sequence of bytes
> > rather than a sequence of chars).
>
> I'm not sure that single use case is important enough to change
> something that was working like that since Emacs 23.  Who knows how
> many more important use cases this will break?

It's important for correctness and for actually describing what "encoding" does.

>
> This whole area is crawling with heuristics, whose only justification
> is that it does TRT in the vast majority of use cases.
>

Why should this be the right thing, what use case should it cover? Do
we expect users to explicitly put the byte sequences for the
(undocumented) internal encoding into unibyte strings? Shouldn't we
rather expect that users want to write unibyte strings as is, in all
cases?


Re: `write-region' writes different bytes than passed to it?

Philipp Stephani
In reply to this post by Stefan Monnier
On Tue, Dec 11, 2018 at 8:53 PM, Stefan Monnier
<[hidden email]> wrote:
>
> > I'm not sure that single use case is important enough to change
> > something that was working like that since Emacs 23.  Who knows how
> > many more important use cases this will break?
>
> Oh, indeed, especially since it sounds to me like the problem is in the
> original code (if you don't want to change the bytes, then use a `binary`
> encoding rather than utf-8).

That wouldn't work with multibyte strings, right? Because they need to
be encoded.

>
> > This whole area is crawling with heuristics, whose only justification
> > is that it does TRT in the vast majority of use cases.
>
> Exactly: I think we should try and get rid of those heuristics
> (progressively).  Actually, it's already what we've been doing since
> Emacs-20, tho "lately" the progression in this respect has slowed
> down I think.
>

I'd definitely welcome any simplification in this area. There seems to
be a lot of incidental complexity and undocumented corner cases here.


Re: `write-region' writes different bytes than passed to it?

Eli Zaretskii
In reply to this post by Philipp Stephani
> From: Philipp Stephani <[hidden email]>
> Date: Sat, 22 Dec 2018 23:58:07 +0100
> Cc: help-gnu-emacs <[hidden email]>
>
> > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > representation of a raw-byte F2.  Raw bytes are always converted to
> > their single-byte values on output, regardless of the encoding you
> > request.
> >
>
> Is that documented somewhere?

Which part(s)?

> Or, in other words, what are the semantics of
>
> (let ((coding-system-for-write 'utf-8-unix)) (write-region STRING ...))
>
> ?
>
> There are two easy cases:
> 1. STRING is a unibyte string containing only bytes within the ASCII range
> 2. STRING is a multibyte string containing only Unicode scalar values
> In those cases the answer is simple: The form writes the UTF-8
> representation of STRING.
> However, the interesting cases are as follows:
> 3. STRING is a unibyte string with at least one byte outside the ASCII range
> 4. STRING is a multibyte string with at least one element that is not
> a Unicode scalar value

You are actually asking what code conversion does in these cases, so
let's limit the discussion to that part.  write-region is not really
relevant here.

One technicality before I answer the question: there are no "Unicode
scalar values" in Emacs strings and buffers.  The internal
representation is a multibyte one, so any non-ASCII character, be it a
valid Unicode character or a raw byte, is always stored as a multibyte
sequence.  So let's please use a less confusing wording, like
"strictly valid UTF-8 sequence" or something to that effect.

> My example is an instance of (3). I admit I haven't read the entire
> Emacs Lisp reference manual, but quite some parts of it, and I
> couldn't find a description of the cases (3) and (4). Naively there
> are a couple options:
> - Signal an error. That would seem appropriate as such strings can't
> be encoded as UTF-8. However, evidently Emacs doesn't do this.
> - For case 3, write the bytes in STRING, ignoring the coding system. I
> had expected this to happen, but apparently it isn't the case either.

IMO, doing encoding on unibyte strings invokes undefined behavior,
since encoding is only defined for multibyte strings.  Admittedly, we
don't say that explicitly (we could if that's deemed important), but
the entire description in "Coding System Basics" makes no sense
without this assumption, and even hints at that indirectly:

     The coding system ‘raw-text’ is special in that it prevents character
  code conversion, and causes the buffer visited with this coding system
  to be a unibyte buffer.  For historical reasons, you can save both
  unibyte and multibyte text with this coding system.

The last sentence implicitly tells you that coding systems other than
raw-text (with the exception of no-conversion, described in the very
next paragraph) can only be meaningfully used when writing multibyte
text.

Since this is undefined behavior, Emacs can do anything that best
suits the relevant use cases.  What it actually does is convert raw
bytes from their internal two-byte representation to a single byte.
Emacs jumps through many hoops to avoid exposing the internal
multibyte representation of raw bytes outside of buffers and strings,
and this is one of those hoops.  That's because exposing that internal
representation is considered to be corruption of the original byte
stream, and is not generally useful.

Signaling an error in this situation is also not useful, because it
turns out many Lisp programs did this kind of thing in the past (Gnus
is a notable example), and undoubtedly quite a few still do.

Emacs handles this case like it does because many years of bitter
experience have taught us that this suits best the use cases we want
to support.  In particular, signaling errors when encountering invalid
UTF-8 sequences is a bad idea in a text-editing application, where
users expect an arbitrary byte stream to pass unscathed from input to
output.  This is why Emacs is decades ahead of other similar systems,
such as Guile, which still throw exceptions in such cases (and claim
that they are "correct").

> > I'm not sure that single use case is important enough to change
> > something that was working like that since Emacs 23.  Who knows how
> > many more important use cases this will break?
>
> It's important for correctness and for actually describing what "encoding" does.

So does labeling this as undefined behavior, which is what it is.  We
don't really need to describe undefined behavior in detail, because
Lisp programs shouldn't do that.

> Do we expect users to explicitly put the byte sequences for the
> (undocumented) internal encoding into unibyte strings? Shouldn't we
> rather expect that users want to write unibyte strings as is, in all
> cases?

To avoid the undefined behavior, a Lisp program should never try to
encode a unibyte string with anything other than no-conversion or
raw-text (the latter also allows the application to convert EOL
format, if that is desired).  IOW, you should have used either
raw-text-unix or no-conversion in your example, not utf-8.
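
Applied to the original example, that advice looks as follows (a
sketch; with raw-text the bytes should pass through unchanged):

```elisp
;; The original snippet rewritten per the advice above: use
;; raw-text-unix for a unibyte string so the two bytes are written
;; out verbatim.
(let ((coding-system-for-write 'raw-text-unix))
  (write-region "\xC1\xB2" nil "/tmp/test.txt"))
;; hd /tmp/test.txt should now show: c1 b2
```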

> > Oh, indeed, especially since it sounds to me like the problem is in the
> > original code (if you don't want to change the bytes, then use a `binary`
> > encoding rather than utf-8).
>
> That wouldn't work with multibyte strings, right? Because they need to
> be encoded.

You can detect when a string is a unibyte string with
multibyte-string-p, if your application needs to handle both unibyte
and multibyte strings.  For unibyte strings, use only raw-text or
no-conversion.
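
A minimal sketch of that dispatch (the function name is hypothetical):

```elisp
;; Choose the coding system from the string type, as described above:
;; encode real characters, pass bytes through untouched.
(defun my-write-string (string file)
  "Write STRING to FILE, encoding only if STRING is multibyte."
  (let ((coding-system-for-write
         (if (multibyte-string-p string)
             'utf-8-unix       ; characters: encode as UTF-8
           'raw-text-unix)))   ; bytes: write them as-is
    (write-region string nil file)))
```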

> > Exactly: I think we should try and get rid of those heuristics
> > (progressively).  Actually, it's already what we've been doing since
> > Emacs-20, tho "lately" the progression in this respect has slowed
> > down I think.
>
> I'd definitely welcome any simplification in this area. There seems to
> be a lot of incidental complexity and undocumented corner cases here.

AFAIK, all of those heuristics are in the undefined behavior
department.  Lisp programs are well advised to stay away from that.
If Lisp programs do stay away, they will never need to deal with the
complexity and the undocumented corner cases.

We keep the current behavior for backward compatibility, and for this
reason I would suggest to avoid changes in this area unless we have a
very good reason for a change.  What was the reason you needed to
write something like the original snippet?


Re: `write-region' writes different bytes than passed to it?

Stefan Monnier
In reply to this post by Philipp Stephani
> There are two easy cases:
> 1. STRING is a unibyte string containing only bytes within the ASCII range
> 2. STRING is a multibyte string containing only Unicode scalar values
> In those cases the answer is simple: The form writes the UTF-8
> representation of STRING.

Not sure what you mean by "unicode scalar values", but a multibyte
string is a sequence of chars, i.e. a sequence of char codes (integers).
And utf-8 is a way to encode a sequence of integer char codes into
a sequence of bytes.

So your sample code will pretty much always write the utf-8
representation of the multibyte string.

[ The only exception is when the multibyte string contains chars in the
  eight-bit charset, because those are supposed to stand for raw bytes.
  This exception is used to make sure that if you read a file using
  the utf-8 coding-system and the file's content is not valid utf-8,
  writing the buffer will still generate the exact same byte sequence.  ]
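
That round trip can be sketched as follows (#xFF is never valid in
UTF-8, so decoding turns it into an eight-bit character that encodes
back to the same byte):

```elisp
;; Decode invalid UTF-8, then re-encode: the eight-bit chars ensure
;; the original byte sequence comes back unchanged.
(let* ((bytes "\xFF\x41")                        ; FF is invalid UTF-8
       (decoded (decode-coding-string bytes 'utf-8)))
  (equal (encode-coding-string decoded 'utf-8) bytes))  ; should be t
```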

> However, the interesting cases are as follows:
> 3. STRING is a unibyte string with at least one byte outside the ASCII range

I don't think this case is clearly documented, indeed.

I believe what happens currently is that Emacs looks at the byte
sequence in the unibyte string as if it was the internal representation
of a multibyte string.  Changing behavior (e.g. by simply outputting the
bytes unchanged like I suggested) will likely affect some code out there
somewhere.  I think it'd be a good change, tho, because I think that any
code thus affected is likely buggy and needs to be fixed anyway (and
actually that change might be the fix the code needs).

What makes this question a bit more tricky is that when a string is all
ASCII, Emacs tends to choose rather arbitrarily between unibyte
and multibyte.  But if we decide that coding-system doesn't affect
unibyte strings, then we get into trouble with

    (let ((coding-system-for-write 'ebcdic-int)) (write-region STRING ...))

since for a purely ASCII string, we still need to do a conversion,
so we'd need to be more careful about the distinction between unibyte and
multibyte ASCII strings.

Maybe we should just drop support for coding systems that aren't
supersets of ASCII and be done with it, but I'm not sure we're ready to
do that.


        Stefan



Re: `write-region' writes different bytes than passed to it?

Philipp Stephani
In reply to this post by Eli Zaretskii
On Sun, Dec 23, 2018 at 4:21 PM, Eli Zaretskii <[hidden email]> wrote:

>
> > From: Philipp Stephani <[hidden email]>
> > Date: Sat, 22 Dec 2018 23:58:07 +0100
> > Cc: help-gnu-emacs <[hidden email]>
> >
> > > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > > representation of a raw-byte F2.  Raw bytes are always converted to
> > > their single-byte values on output, regardless of the encoding you
> > > request.
> > >
> >
> > Is that documented somewhere?
>
> Which part(s)?

All of it? ;)
Basically, "what is the behavior of write-region".

>
> > Or, in other words, what are the semantics of
> >
> > (let ((coding-system-for-write 'utf-8-unix)) (write-region STRING ...))
> >
> > ?
> >
> > There are two easy cases:
> > 1. STRING is a unibyte string containing only bytes within the ASCII range
> > 2. STRING is a multibyte string containing only Unicode scalar values
> > In those cases the answer is simple: The form writes the UTF-8
> > representation of STRING.
> > However, the interesting cases are as follows:
> > 3. STRING is a unibyte string with at least one byte outside the ASCII range
> > 4. STRING is a multibyte string with at least one element that is not
> > a Unicode scalar value
>
> You are actually asking what code conversion does in these cases, so
> let's limit the discussion to that part.  write-region is not really
> relevant here.
>
> One technicality before I answer the question: there are no "Unicode
> scalar values" in Emacs strings and buffers.  The internal
> representation is a multibyte one, so any non-ASCII character, be it a
> valid Unicode character or a raw byte, is always stored as a multibyte
> sequence.  So let's please use a less confusing wording, like
> "strictly valid UTF-8 sequence" or something to that effect.

I don't think we should change the terminology. Emacs multibyte
strings are sequences of integers (in most cases, scalar values), not
UTF-8 strings. They are internally represented as byte arrays, but
that's a different story.

>
> > My example is an instance of (3). I admit I haven't read the entire
> > Emacs Lisp reference manual, but quite some parts of it, and I
> > couldn't find a description of the cases (3) and (4). Naively there
> > are a couple options:
> > - Signal an error. That would seem appropriate as such strings can't
> > be encoded as UTF-8. However, evidently Emacs doesn't do this.
> > - For case 3, write the bytes in STRING, ignoring the coding system. I
> > had expected this to happen, but apparently it isn't the case either.
>
> IMO, doing encoding on unibyte strings invokes undefined behavior,
> since encoding is only defined for multibyte strings.

That is very unfortunate. Is there any hope we can get out of that situation?

> Admittedly, we
> don't say that explicitly (we could if that's deemed important), but
> the entire description in "Coding System Basics" makes no sense
> without this assumption, and even hints at that indirectly:
>
>      The coding system ‘raw-text’ is special in that it prevents character
>   code conversion, and causes the buffer visited with this coding system
>   to be a unibyte buffer.  For historical reasons, you can save both
>   unibyte and multibyte text with this coding system.
>
> The last sentence implicitly tells you that coding systems other than
> raw-text (with the exception of no-conversion, described in the very
> next paragraph) can only be meaningfully used when writing multibyte
> text.

That's true, but very subtle. You first have to read the description
of a certain encoding to figure out how other encodings behave.

>
> Since this is undefined behavior, Emacs can do anything that best
> suits the relevant use cases.  What it actually does is convert raw
> bytes from their internal two-byte representation to a single byte.
> Emacs jumps through many hoops to avoid exposing the internal
> multibyte representation of raw bytes outside of buffers and strings,
> and this is one of those hoops.  That's because exposing that internal
> representation is considered to be corruption of the original byte
> stream, and is not generally useful.

But in this question there is never any internal representation, just
a byte array that happens to match the internal representation of
something else.

>
> Signaling an error in this situation is also not useful, because it
> turns out many Lisp programs did this kind of thing in the past (Gnus
> is a notable example), and undoubtedly quite a few still do.

Well, if the behavior is unspecified, then signaling an error would
absolutely be a legal (and even expected) behavior.

>
> Emacs handles this case like it does because many years of bitter
> experience have taught us that this suits best the use cases we want
> to support.  In particular, signaling errors when encountering invalid
> UTF-8 sequences is a bad idea in a text-editing application, where
> users expect an arbitrary byte stream to pass unscathed from input to
> output.  This is why Emacs is decades ahead of other similar systems,
> such as Guile, which still throw exceptions in such cases (and claim
> that they are "correct").

I'm not saying that Emacs should necessarily start signaling errors when
visiting files with invalid UTF-8 sequences. That it degrades
gracefully in this case is very welcome and user-friendly.
But visiting a file can't result in a call to write-region with a
unibyte string, right?

>
> > > I'm not sure that single use case is important enough to change
> > > something that was working like that since Emacs 23.  Who knows how
> > > many more important use cases this will break?
> >
> > It's important for correctness and for actually describing what "encoding" does.
>
> So does labeling this as undefined behavior, which is what it is.  We
> don't really need to describe undefined behavior in detail, because
> Lisp programs shouldn't do that.

Rather than describing it in detail, it should be removed. Unspecified
behavior makes a programming system hard to use and reason about.

>
> > Do we expect users to explicitly put the byte sequences for the
> > (undocumented) internal encoding into unibyte strings? Shouldn't we
> > rather expect that users want to write unibyte strings as is, in all
> > cases?
>
> To avoid the undefined behavior, a Lisp program should never try to
> encode a unibyte string with anything other than no-conversion or
> raw-text (the latter also allows the application to convert EOL
> format, if that is desired).  IOW, you should have used either
> raw-text-unix or no-conversion in your example, not utf-8.

If Lisp code shouldn't try that, then the encoding functions should
signal an error on such cases.

>
> > > Oh, indeed, especially since it sounds to me like the problem is in the
> > > original code (if you don't want to change the bytes, then use a `binary`
> > > encoding rather than utf-8).
> >
> > That wouldn't work with multibyte strings, right? Because they need to
> > be encoded.
>
> You can detect when a string is a unibyte string with
> multibyte-string-p, if your application needs to handle both unibyte
> and multibyte strings.  For unibyte strings, use only raw-text or
> no-conversion.

I get that, but this is too subtle and nontrivial.

>
> > > Exactly: I think we should try and get rid of those heuristics
> > > (progressively).  Actually, it's already what we've been doing since
> > > Emacs-20, tho "lately" the progression in this respect has slowed
> > > down I think.
> >
> > I'd definitely welcome any simplification in this area. There seems to
> > be a lot of incidental complexity and undocumented corner cases here.
>
> AFAIK, all of those heuristics are in the undefined behavior
> department.  Lisp programs are well advised to stay away from that.
> If Lisp programs do stay away, they will never need to deal with the
> complexity and the undocumented corner cases.

You can't tell programmers to stay away from something. Either it
should work as documented or signal an error. Silently doing the wrong
thing is the worst choice.

>
> We keep the current behavior for backward compatibility, and for this
> reason I would suggest to avoid changes in this area unless we have a
> very good reason for a change.  What was the reason you needed to
> write something like the original snippet?
>

I'm writing a function to write an arbitrary string to a file. This
should be trivial, but as you can see, it isn't.


Re: `write-region' writes different bytes than passed to it?

Philipp Stephani
In reply to this post by Stefan Monnier
On Mon, Dec 24, 2018 at 5:28 AM, Stefan Monnier
<[hidden email]> wrote:
>
> > There are two easy cases:
> > 1. STRING is a unibyte string containing only bytes within the ASCII range
> > 2. STRING is a multibyte string containing only Unicode scalar values
> > In those cases the answer is simple: The form writes the UTF-8
> > representation of STRING.
>
> Not sure what you mean by "unicode scalar values"

What the Unicode standard says :)

> but a multibyte
> string is a sequence of chars, i.e. a sequence of char codes (integers).
> And utf-8 is a way to encode a sequence of integer char codes into
> a sequence of bytes.

"Character" is an underspecified term, therefore I generally try to avoid it.
To recap: An Emacs Lisp multibyte string is a sequence of integers of
a certain range. The range is a superset of the set of Unicode scalar
values.

>
> So your sample code will pretty much always write the utf-8
> representation of the multibyte string.
>
> [ The only exception is when the multibyte string contains chars in the
>   eight-bit charset, because those are supposed to stand for raw bytes.
>   This exception is used to make sure that if you read a file using
>   the utf-8 coding-system and the file's content is not valid utf-8,
>   writing the buffer will still generate the exact same byte sequence.  ]
>
> > However, the interesting cases are as follows:
> > 3. STRING is a unibyte string with at least one byte outside the ASCII range
>
> I don't think this case is clearly documented, indeed.
>
> I believe what happens currently is that Emacs looks at the byte
> sequence in the unibyte string as if it was the internal representation
> of a multibyte string.  Changing behavior (e.g. by simply outputting the
> bytes unchanged like I suggested) will likely affect some code out there
> somewhere.  I think it'd be a good change, tho, because I think that any
> code thus affected is likely buggy and needs to be fixed anyway (and
> actually that change might be the fix the code needs).
>
> What makes this question a bit more tricky is that when a string is all
> ASCII, Emacs tends to choose rather arbitrarily between unibyte
> and multibyte.  But if we decide that coding-system doesn't affect
> unibyte strings, then we get into trouble with
>
>     (let ((coding-system-for-write 'ebcdic-int)) (write-region STRING ...))
>
> since for a purely ASCII string, we still need to do a conversion,
> so we'd need to be more careful about the distinction between unibyte and
> multibyte ASCII strings.
>
> Maybe we should just drop support for coding systems that aren't
> supersets of ASCII and be done with it, but I'm not sure we're ready to
> do that.
>


That might be one option. Others might be:
1. Signal an error whenever Emacs attempts to encode a unibyte string
and the encoding isn't "raw-text" or "no-conversion"
2. Like (1), but only signal an error if the encoding isn't ASCII-compatible


Re: `write-region' writes different bytes than passed to it?

Eli Zaretskii
In reply to this post by Philipp Stephani
> From: Philipp Stephani <[hidden email]>
> Date: Sun, 10 Feb 2019 20:06:57 +0100
> Cc: help-gnu-emacs <[hidden email]>
>
> > > > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > > > representation of a raw-byte F2.  Raw bytes are always converted to
> > > > their single-byte values on output, regardless of the encoding you
> > > > request.
> > > >
> > >
> > > Is that documented somewhere?
> >
> > Which part(s)?
>
> All of it? ;)
> Basically, "what is the behavior of write-region".

Like I said, write-region is not relevant here, encoding is.

> > One technicality before I answer the question: there are no "Unicode
> > scalar values" in Emacs strings and buffers.  The internal
> > representation is a multibyte one, so any non-ASCII character, be it a
> > valid Unicode character or a raw byte, is always stored as a multibyte
> > sequence.  So let's please use a less confusing wording, like
> > "strictly valid UTF-8 sequence" or something to that effect.
>
> I don't think we should change the terminology. Emacs multibyte
> strings are sequences of integers

No, they are not.  They are sequences of bytes (as evidenced by the
"multibyte" part) which represent sequences of Unicode codepoints.
The latter are scalar integers.  But these scalars are not explicitly
present in the multibyte representation.

> > IMO, doing encoding on unibyte strings invokes undefined behavior,
> > since encoding is only defined for multibyte strings.
>
> That is very unfortunate. Is there any hope we can get out of that situation?

Unlikely.

> But in this question there is never any internal representation

Yes, there is: you have succeeded in using one of the few loopholes to
create such a byte sequence.
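
[Editorial note: the loophole Eli describes can be seen directly in a
scratch buffer; the results below reflect Emacs's documented internal
representation, where raw bytes #x80..#xFF are stored as two-byte
sequences in multibyte text.]

```elisp
;; "\xC1\xB2" is a unibyte string whose two bytes happen to be the
;; internal multibyte representation of the single raw byte #xF2.
(multibyte-string-p "\xC1\xB2")            ; => nil (unibyte)
(length "\xC1\xB2")                        ; => 2 (two bytes)
(length (string-as-multibyte "\xC1\xB2"))  ; => 1 (one raw-byte char)
(multibyte-char-to-unibyte
 (aref (string-as-multibyte "\xC1\xB2") 0)) ; => 242, i.e. #xF2
```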

> > Signaling an error in this situation is also not useful, because it
> > turns out many Lisp programs did this kind of thing in the past (Gnus
> > is a notable example), and undoubtedly quite a few still do.
>
> Well, if the behavior is unspecified, then signaling an error would
> absolutely be a legal (and even expected) behavior.

It's possible, but not useful, so we don't do that.

> I'm not saying that Emacs should necessarily start signaling errors when
> visiting files with invalid UTF-8 sequences. That it degrades
> gracefully in this case is very welcome and user-friendly.
> But visiting a file can't result in a call to write-region with a
> unibyte string, right?

Why not?  Of course it can: imagine that you modify some part of the
file's text that doesn't include raw undecoded bytes, then write the
result to a file.  You will expect that portions of text you didn't
modify remain intact, right?
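
[Editorial note: Eli's scenario in miniature, as a sketch; the file
name is made up.  A raw byte read from an invalid-UTF-8 file survives
the decode/edit/encode round trip intact.]

```elisp
;; Create a file whose byte C1 is not valid UTF-8.
(with-temp-file "/tmp/mixed.txt"
  (set-buffer-multibyte nil)
  (insert "abc" #xC1 "def"))
;; Visit it as UTF-8, modify only the ASCII part, and save.
(with-temp-buffer
  (let ((coding-system-for-read 'utf-8-unix)
        (coding-system-for-write 'utf-8-unix))
    (insert-file-contents "/tmp/mixed.txt")
    (goto-char (point-min))
    (insert "X")                        ; edit only ASCII text
    (write-region nil nil "/tmp/mixed.txt")))
;; The undecodable byte C1 is written back unchanged:
;; /tmp/mixed.txt now holds 58 61 62 63 C1 64 65 66.
```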

> > > It's important for correctness and for actually describing what "encoding" does.
> >
> > So does labeling this as undefined behavior, which is what it is.  We
> > don't really need to describe undefined behavior in detail, because
> > Lisp programs shouldn't do that.
>
> Rather than describing it in detail, it should be removed. Unspecified
> behavior makes a programming system hard to use and reason about.

It cannot be removed.  Raw bytes that cannot be decoded are a fact of
life, removing them will make Emacs a lame duck.

> > To avoid the undefined behavior, a Lisp program should never try to
> > encode a unibyte string with anything other than no-conversion or
> > raw-text (the latter also allows the application to convert EOL
> > format, if that is desired).  IOW, you should have used either
> > raw-text-unix or no-conversion in your example, not utf-8.
>
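
[Editorial note: applying the quoted advice to the recipe from the
start of this thread gives the bytes the original poster expected.]

```elisp
;; With raw-text-unix instead of utf-8-unix, the unibyte string's
;; bytes are written out unchanged: the file contains C1 B2.
(let ((coding-system-for-write 'raw-text-unix))
  (write-region "\xC1\xB2" nil "/tmp/test.txt"))
```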
> If Lisp code shouldn't try that, then the encoding functions should
> signal an error on such cases.

Signaling an error is not useful, so Emacs should not do that.

> > You can detect when a string is a unibyte string with
> > multibyte-string-p, if your application needs to handle both unibyte
> > and multibyte strings.  For unibyte strings, use only raw-text or
> > no-conversion.
>
> I get that, but this is too subtle and nontrivial.

Then try not to write code that could bump into these subtleties.  You
shouldn't need that.

> > AFAIK, all of that heuristics are in the undefined behavior
> > department.  Lisp programs are well advised to stay away from that.
> > If Lisp programs do stay away, they will never need to deal with the
> > complexity and the undocumented corner cases.
>
> You can't tell programmers to stay away from something.

No, but I can advise them.

> Either it should work as documented or signal an error. Silently
> doing the wrong thing is the worst choice.

It doesn't do the wrong thing, it does the right thing: it stays out
of the hair of programmers who might need to write such stuff
(assuming they know what they are doing).

> > What was the reason you needed to write something like the
> > original snippet?
>
> I'm writing a function to write an arbitrary string to a file. This
> should be trivial, but as you can see, it isn't.

It wasn't a string, it was a sequence of bytes that cannot be
interpreted as a text string.
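
[Editorial note: a sketch of the function Philipp describes, with a
made-up name, following the advice given earlier in the thread: pick
the coding system from the string's type.]

```elisp
(defun my-write-string-to-file (string file)
  "Write STRING to FILE: raw bytes if unibyte, UTF-8 if multibyte."
  (let ((coding-system-for-write
         (if (multibyte-string-p string) 'utf-8-unix 'raw-text-unix)))
    (write-region string nil file)))
```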


Re: `write-region' writes different bytes than passed to it?

Eli Zaretskii
In reply to this post by Philipp Stephani
> From: Philipp Stephani <[hidden email]>
> Date: Sun, 10 Feb 2019 20:15:57 +0100
> Cc: help-gnu-emacs <[hidden email]>
>
> > > There are two easy cases:
> > > 1. STRING is a unibyte string containing only bytes within the ASCII range
> > > 2. STRING is a multibyte string containing only Unicode scalar values
> > > In those cases the answer is simple: The form writes the UTF-8
> > > representation of STRING.
> >
> > Not sure what you mean by "unicode scalar values"
>
> What the Unicode standard says :)

A multibyte Unicode string doesn't contain Unicode scalar values, it
contains their UTF-8 encoding.

> To recap: An Emacs Lisp multibyte string is a sequence of integers of
> a certain range.

No, it's a sequence of bytes that can be interpreted as representing a
sequence of integers.
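
[Editorial note: the distinction being drawn here is observable with
`length' versus `string-bytes': the former counts the characters a
multibyte string represents, the latter the bytes of its internal
representation.]

```elisp
(length "naïve")        ; => 5 characters
(string-bytes "naïve")  ; => 6 bytes (ï occupies two bytes internally)
```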

> > Maybe we should just drop support for coding systems that aren't
> > supersets of ASCII and be done with it, but I'm not sure we're ready to
> > do that.
>
> That might be one option.

Just a month or two ago someone asked about one variation of EBCDIC
that we didn't support directly.  So no, it's too early to drop them.

> 1. Signal an error whenever Emacs attempts to encode a unibyte string
> and the encoding isn't "raw-text" or "no-conversion"
> 2. Like (1), but only signal an error if the encoding isn't ASCII-compatible

Signaling an error in these cases is a non-starter.  If you don't like
what Emacs does in these cases, just don't write such code.  Emacs is
not a tool whose primary goal is educating novice programmers, it is
also an industrial-strength system that allows doing low-level stuff
when needed.  If we signal errors in those cases, we will throw out
valid use cases for no good reason.


Re: `write-region' writes different bytes than passed to it?

Stefan Monnier
In reply to this post by Philipp Stephani
> 1. Signal an error whenever Emacs attempts to encode a unibyte string
> and the encoding isn't "raw-text" or "no-conversion"

Sounds good to me.  I have similar extra checks in my local Emacs hacks
(as well as signaling errors when trying to decode a multibyte string).
They've helped me track down encoding problems in Gnus.

They rarely trigger nowadays, but it's likely because I've fixed most
occurrences in the Elisp code I happen to use.  I'd be surprised if
there aren't any such problems lurking in many other places.

Often the corresponding code "works" in practice (i.e. circumstances
make it do the right thing even though in general it may break), and
fixing it so it doesn't trigger the check requires non-trivial changes.
IOW the tradeoff is not very good when it comes to motivating the
package's maintainer to fix his code (non-trivial rework which will
likely introduce new bugs in order to fix mostly hypothetical old bugs).


        Stefan



Re: `write-region' writes different bytes than passed to it?

Eli Zaretskii
> From: Stefan Monnier <[hidden email]>
> Date: Sun, 10 Feb 2019 17:25:04 -0500
>
> > 1. Signal an error whenever Emacs attempts to encode a unibyte string
> > and the encoding isn't "raw-text" or "no-conversion"
>
> Sounds good to me.

I object to such a change.


Re: `write-region' writes different bytes than passed to it?

Stefan Monnier
>> > 1. Signal an error whenever Emacs attempts to encode a unibyte string
>> > and the encoding isn't "raw-text" or "no-conversion"
>> Sounds good to me.
> I object to such a change.

I would too because of the breakage it can/will introduce.
But I still think it's a good idea ;-)
[ Lots of good ideas can't be applied, sadly.  ]

Maybe in this specific case we could introduce a "strict encoding mode"
controlled by a config var.
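
[Editorial note: one possible shape for such a "strict encoding mode",
as a sketch; the variable and advice names are made up, and advice on
a primitive is only seen by Lisp callers, not by calls from C.]

```elisp
(defvar my-strict-encoding-mode nil
  "When non-nil, reject encoding unibyte strings with real codings.")

(define-advice encode-coding-string (:before (string coding &rest _) strict)
  "Signal an error on unibyte STRING unless CODING is raw-text-like."
  (when (and my-strict-encoding-mode
             (not (multibyte-string-p string))
             (not (memq (coding-system-base coding)
                        '(raw-text no-conversion))))
    (error "Attempt to encode unibyte string with %s" coding)))
```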


        Stefan "not volunteering to write the patch"

