bug#43632: Raw bytes printed as latin-1 in echo area and *Messages*

Eli Zaretskii
> From: Mattias Engdegård <[hidden email]>
> Date: Sat, 26 Sep 2020 14:51:22 +0200
>
>  M-: "\377"
>  => "ÿ"
>
> in both the echo area and in *Messages*. The expected message is "\377".
>
> The same thing happens with
>
>  (prin1 "\377" t)
>
> This anomaly was first observed by Lars Ingebrigtsen.

It is not an anomaly.  If you want to see escapes, set
print-escape-nonascii non-nil.

Also note that what you see is the result of 'eval' printing the
result, the real result (as returned by prin1) is a unibyte string:

  (multibyte-string-p (prin1 "\377")) => nil

(Yes, this is very confusing.)
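For illustration, here is a minimal sketch of the two points above, assuming a reasonably recent Emacs. The helper name `my/escaped-repr` is hypothetical and not part of Emacs; it only binds `print-escape-nonascii` around `prin1-to-string`.

```elisp
;; A string literal whose only non-ASCII content is an octal escape
;; such as \377 is read as a unibyte string.
(multibyte-string-p "\377")             ; => nil

;; Hypothetical helper: return the printed representation of OBJ with
;; non-ASCII bytes in unibyte strings shown as octal escapes,
;; whatever the user's global setting is.
(defun my/escaped-repr (obj)
  "Return OBJ printed with `print-escape-nonascii' bound to t."
  (let ((print-escape-nonascii t))
    (prin1-to-string obj)))

;; The result is six ASCII characters: quote, backslash, 3, 7, 7, quote.
(my/escaped-repr "\377")                ; => "\"\\377\""
```

Binding the variable with `let` keeps the change local to one call, which leaves the global user preference alone.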



Lars Ingebrigtsen
Eli Zaretskii <[hidden email]> writes:

> It is not an anomaly.  If you want to see escapes, set
> print-escape-nonascii non-nil.
>
> Also note that what you see is the result of 'eval' printing the
> result, the real result (as returned by prin1) is a unibyte string:
>
>   (multibyte-string-p (prin1 "\377")) => nil
>
> (Yes, this is very confusing.)

Could we do something about this confusion?

This came up because I just couldn't make sense of what I was seeing
when trying to work with raw bytes -- it seemed to be that Emacs was
auto-promoting unibyte strings (with bytes >127) to multibyte strings...
until I started calling multibyte-string-p on everything instead.

I'm guessing I'm not the only person confused by this stuff.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: Mattias Engdegård <[hidden email]>,
>   [hidden email]
> Date: Sat, 26 Sep 2020 16:22:10 +0200
>
> Eli Zaretskii <[hidden email]> writes:
>
> > It is not an anomaly.  If you want to see escapes, set
> > print-escape-nonascii non-nil.
> >
> > Also note that what you see is the result of 'eval' printing the
> > result, the real result (as returned by prin1) is a unibyte string:
> >
> >   (multibyte-string-p (prin1 "\377")) => nil
> >
> > (Yes, this is very confusing.)
>
> Could we do something about this confusion?

I don't want to say no, but I'm sure it is at least very hard.

> This came up because I just couldn't make sense of what I was seeing
> when trying to work with raw bytes -- it seemed to be that Emacs was
> auto-promoting unibyte strings (with bytes >127) to multibyte strings...
> until I started calling multibyte-string-p on everything instead.

We indeed convert to multibyte when we insert text into a multibyte
buffer, and that's a feature.

> I'm guessing I'm not the only person confused by this stuff.

You are not.  I learned to use multibyte-string-p when in doubt.



Mattias Engdegård-2
In reply to this post by Eli Zaretskii
26 sep. 2020 kl. 16.14 skrev Eli Zaretskii <[hidden email]>:

First, thank you for your quick answer.

> If you want to see escapes, set
> print-escape-nonascii non-nil.

Certainly, but that isn't needed when printing to a buffer, as in (prin1 "\377").

The print-escape-nonascii docs say

  When the output goes in a multibyte buffer, this feature is
  enabled regardless of the value of the variable.

Why, then, is the echo area not treated as a multibyte buffer in this regard? Is there a practical reason or is it an artefact of history that cannot be changed?

(Note that the echo area buffers, the minibuffer, and *Messages* are all multibyte.)

If the behaviour of (prin1 x t) cannot be changed, then what about eval-expression (M-|)? Being an interactive command, surely compatibility isn't an obstacle to making it more useful?

I'm assuming that it would be more useful to see raw bytes shown octal-escaped or otherwise visually distinct from their interpretation as Latin-1. If nothing else, it would make sense to have the same behaviour as when evaluating something in *scratch*.

It would also be interesting to know why, when print-escape-nonascii is nil, unibyte strings are decoded specifically as Latin-1 (and not, say, UTF-8). I presume it is an artefact of history.




Eli Zaretskii
> From: Mattias Engdegård <[hidden email]>
> Date: Sat, 26 Sep 2020 18:00:55 +0200
> Cc: [hidden email]
>
> Why, then, is the echo area not treated as a multibyte buffer in
> this regard?

It is.  That's not the reason.

> If the behaviour of (prin1 x t) cannot be changed, then what about eval-expression (M-|)? Being an interactive command, surely compatibility isn't an obstacle to making it more useful?

We were talking about M-:.  And that does call prin1.  And prin1 does
produce a unibyte string, the reason for the display that confused you
is elsewhere.

> It would also be interesting to know why, when print-escape-nonascii is nil, unibyte strings are decoded specifically as Latin-1 (and not, say, UTF-8). I presume it is an artefact of history.

They are not decoded as Latin-1, they are simply inserted into a
buffer as a single byte.  And we do it regardless of
print-escape-nonascii, because \377 is the Unicode (and Latin-1)
codepoint of ÿ.

So it isn't a historical artifact.  I think it's simply a consequence
of how print functions work.
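The difference between the byte 255 and the character ÿ can be checked directly. This is only a sketch, assuming Emacs 23 or later, where raw bytes in multibyte text have their own character codes.

```elisp
;; In a unibyte string, "\377" is a single byte with value 255.
(aref "\377" 0)                                  ; => 255

;; In a multibyte string, the character ÿ (U+00FF) also has code 255.
(string-to-char "\u00ff")                        ; => 255

;; Converting the unibyte string with `string-to-multibyte' maps byte
;; 255 to a distinct "raw byte" character, not to ÿ.
(aref (string-to-multibyte "\377") 0)            ; => 4194303 (#x3fffff)
(string= (string-to-multibyte "\377") "\u00ff")  ; => nil
```

Because the byte and the character share the number 255, printing the byte unchanged into multibyte text shows ÿ unless it is first mapped to the raw-byte range.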



Mattias Engdegård-2
26 sep. 2020 kl. 18.10 skrev Eli Zaretskii <[hidden email]>:

> We were talking about M-:.

Sorry, typo of mine. Thanks for the correction.

>  And that does call prin1.  And prin1 does
> produce a unibyte string, the reason for the display that confused you
> is elsewhere.

We probably disagree about the details here, but let's leave prin1 aside; it doesn't need to be changed in order to improve the M-: user experience.
What about this little tweak?

--- a/lisp/simple.el
+++ b/lisp/simple.el
@@ -1797,6 +1797,7 @@ eval-expression
   (let ((print-length (unless no-truncate eval-expression-print-length))
         (print-level  (unless no-truncate eval-expression-print-level))
         (eval-expression-print-maximum-character char-print-limit)
+        (print-escape-nonascii t)
         (deactivate-mark))
     (let ((out (if insert-value (current-buffer) t)))
       (prog1

That way, it works more like the other interactive Lisp evaluation commands in this particular respect.
Do you think a NEWS entry is called for?




Eli Zaretskii
> From: Mattias Engdegård <[hidden email]>
> Date: Sat, 26 Sep 2020 18:53:39 +0200
> Cc: [hidden email]
>
> --- a/lisp/simple.el
> +++ b/lisp/simple.el
> @@ -1797,6 +1797,7 @@ eval-expression
>    (let ((print-length (unless no-truncate eval-expression-print-length))
>          (print-level  (unless no-truncate eval-expression-print-level))
>          (eval-expression-print-maximum-character char-print-limit)
> +        (print-escape-nonascii t)
>          (deactivate-mark))
>      (let ((out (if insert-value (current-buffer) t)))
>        (prog1

We are going to disregard user preferences regarding escapes?  For an
obscure use case?  That doesn't make too much sense to me.




Eli Zaretskii
> Date: Sat, 26 Sep 2020 19:57:21 +0300
> From: Eli Zaretskii <[hidden email]>
> Cc: [hidden email]
>
> > From: Mattias Engdegård <[hidden email]>
> > Date: Sat, 26 Sep 2020 18:53:39 +0200
> > Cc: [hidden email]
> >
> > --- a/lisp/simple.el
> > +++ b/lisp/simple.el
> > @@ -1797,6 +1797,7 @@ eval-expression
> >    (let ((print-length (unless no-truncate eval-expression-print-length))
> >          (print-level  (unless no-truncate eval-expression-print-level))
> >          (eval-expression-print-maximum-character char-print-limit)
> > +        (print-escape-nonascii t)
> >          (deactivate-mark))
> >      (let ((out (if insert-value (current-buffer) t)))
> >        (prog1
>
> We are going to disregard user preferences regarding escapes?  For an
> obscure use case?  That doesn't make too much sense to me.

I'd rather risk something like the below instead:

diff --git a/src/print.c b/src/print.c
index 0ecc98f..5d878c9 100644
--- a/src/print.c
+++ b/src/print.c
@@ -1993,6 +1993,10 @@ print_object (Lisp_Object obj, Lisp_Object printcharfun, bool escapeflag)
                     }
                   else if (print_escape_control_characters && c_iscntrl (c))
                     octalout (c, SDATA (obj), i_byte, size_byte, printcharfun);
+                  else if (!multibyte
+                           && SINGLE_BYTE_CHAR_P (c)
+                           && !ASCII_CHAR_P (c))
+                    printchar (BYTE8_TO_CHAR (c), printcharfun);
                   else
                     printchar (c, printcharfun);
                   need_nonhex = false;



Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: [hidden email],  [hidden email]
> Date: Sun, 27 Sep 2020 12:58:50 +0200
>
> In the *scratch* buffer, C-x C-e:
>
> "\377"
> => "\377"
>
> But the echo area says "\xff".
>
> because I have (setq display-raw-bytes-as-hex t).

So you've explicitly asked for that, no?

> (string-to-multibyte "\377")
> => "\377"
>
> And the echo area says "\377", as expected.

That's unrelated: Emacs behaved the same before my changes.



Lars Ingebrigtsen
Eli Zaretskii <[hidden email]> writes:

>> In the *scratch* buffer, C-x C-e:
>>
>> "\377"
>> => "\377"
>>
>> But the echo area says "\xff".
>>
>> because I have (setq display-raw-bytes-as-hex t).
>
> So you've explicitly asked for that, no?

I have, so I guess the only confusing thing is that it's still "\377" in
the scratch buffer?

>> (string-to-multibyte "\377")
>> => "\377"
>>
>> And the echo area says "\377", as expected.
>
> That's unrelated: Emacs behaved the same before my changes.

Yup.  But don't we call these "raw bytes", too?  I was just idly wondering
whether the display of these should also be affected by
`display-raw-bytes-as-hex'...

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: [hidden email],  [hidden email]
> Date: Sun, 27 Sep 2020 13:13:31 +0200
>
> Eli Zaretskii <[hidden email]> writes:
>
> >> In the *scratch* buffer, C-x C-e:
> >>
> >> "\377"
> >> => "\377"
> >>
> >> But the echo area says "\xff".
> >>
> >> because I have (setq display-raw-bytes-as-hex t).
> >
> > So you've explicitly asked for that, no?
>
> I have, so I guess the only confusing thing is that it's still "\377" in
> the scratch buffer?

In the scratch buffer you have a string of 4 ASCII characters, not of
1 raw byte, right?

> >> (string-to-multibyte "\377")
> >> => "\377"
> >>
> >> And the echo area says "\377", as expected.
> >
> > That's unrelated: Emacs behaved the same before my changes.
>
> Yup.  But don't we call these "raw bytes", too?  I was just idly wondering
> whether the display of these should also be affected by
> `display-raw-bytes-as-hex'...

No, because what you have in *scratch* is again a string of 4 ASCII
characters.
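The "4 ASCII characters" point can be seen by comparing lengths. This is a small sketch that uses only standard functions.

```elisp
;; One raw byte: the octal escape is interpreted by the Lisp reader.
(length "\377")                 ; => 1

;; Four ASCII characters: backslash, 3, 7, 7. This is the text that
;; an escaped printed representation actually contains.
(length "\\377")                ; => 4
(multibyte-string-p "\\377")    ; => nil (pure ASCII, no raw byte)
```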



Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: [hidden email],  [hidden email]
> Date: Sun, 27 Sep 2020 13:36:41 +0200
>
> Eli Zaretskii <[hidden email]> writes:
>
> >> >> (string-to-multibyte "\377")
> >> >> => "\377"
> >> >>
> >> >> And the echo area says "\377", as expected.
> >> >
> >> > That's unrelated: Emacs behaved the same before my changes.
> >>
> >> Yup.  But don't we call these "raw bytes", too?  I was just idly wondering
> >> whether the display of these should also be affected by
> >> `display-raw-bytes-as-hex'...
> >
> > No, because what you have in *scratch* is again a string of 4 ASCII
> > characters.
>
> Here I was talking about what's displayed in the echo area.  :-)

That, too, is a string of 4 ASCII characters.