Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

classic Classic list List threaded Threaded
19 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Stefan Monnier
Hi Robert,

>     The strings contained in gpg keys can contain UTF-8 data, but can also
>     use percent-escapes to encode non-ASCII chars.  When converting those
>     escapes, use 'raw-text' coding system rather than 'string-to-unibyte',
>     since the latter signals an error for non-ASCII characters.

I don't quite understand: "can contain UTF-8 data" seems odd here since
you're calling `encode-coding-string` whose input argument is a sequence
of characters whereas "UTF-8 data" can only be found in sequences of bytes.

Did you mean "can contain non-ASCII characters"?

The other problem with the above description is the "raw-text" since
it's far from clear what it means (personally I really have no idea
what is "raw text" and the way Emacs understands "raw text" is more or
less "EOL-separated lines of bytes" which does not seem to match your
description since string-to-unibyte doesn't signal errors when
encountering bytes).

Looking at the code, I see that the only caller of
`epg--decode-percent-escape` seems to be
`epg--decode-percent-escape-utf-8` which decodes the bytes returned by
`epg--decode-percent-escape` using `utf-8` so I think it makes more
sense to encode using `utf-8` than `raw-text`, WDYT?


        Stefan


diff --git a/lisp/epg.el b/lisp/epg.el
index 5466716e34..e2ce68e161 100644
--- a/lisp/epg.el
+++ b/lisp/epg.el
@@ -2032,7 +2032,7 @@ epg-edit-key
     (epg-reset context)))
 
 (defun epg--decode-percent-escape (string)
-  (setq string (encode-coding-string string 'raw-text))
+  ;; `string' is assumed to be a sequence of *bytes*.
   (let ((index 0))
     (while (string-match "%\\(\\(%\\)\\|\\([[:xdigit:]][[:xdigit:]]\\)\\)"
  string index)
@@ -2047,7 +2047,10 @@ epg--decode-percent-escape
     string))
 
 (defun epg--decode-percent-escape-as-utf-8 (string)
-  (decode-coding-string (epg--decode-percent-escape string) 'utf-8))
+  (decode-coding-string
+   (epg--decode-percent-escape
+    (encode-coding-string string 'utf-8))
+   'utf-8))
 
 (defun epg--decode-hexstring (string)
   (let ((index 0))


Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Robert Pluim
>>>>> On Thu, 12 Dec 2019 08:58:33 -0500, Stefan Monnier <[hidden email]> said:

    Stefan> Hi Robert,
    >> The strings contained in gpg keys can contain UTF-8 data, but can also
    >> use percent-escapes to encode non-ASCII chars.  When converting those
    >> escapes, use 'raw-text' coding system rather than 'string-to-unibyte',
    >> since the latter signals an error for non-ASCII characters.

    Stefan> I don't quite understand: "can contain UTF-8 data" seems odd here since
    Stefan> you're calling `encode-coding-string` whose input argument is a sequence
    Stefan> of characters whereas "UTF-8 data" can only be found in sequences of bytes.

    Stefan> Did you mean "can contain non-ASCII characters"?

"can contain non-ASCII characters encoded using UTF-8", which means
they end up in a multi-byte string in emacs.

    Stefan> The other problem with the above description is the "raw-text" since
    Stefan> it's far from clear what it means (personally I really have no idea
    Stefan> what is "raw text" and the way Emacs understands "raw text" is more or
    Stefan> less "EOL-separated lines of bytes" which does not seem to match your
    Stefan> description since string-to-unibyte doesn't signal errors when
    Stefan> encountering bytes).

Itʼs replacing the use of string-to-unibyte on a multibyte string
containing non-ASCII characters, which signals an error, with
encode-coding-string using 'raw-text, which produces a bunch of
bytes. My other choices were 'binary or 'no-conversion, which do the
same, but have even less meaningful names.

    Stefan> Looking at the code, I see that the only caller of
    Stefan> `epg--decode-percent-escape` seems to be
    Stefan> `epg--decode-percent-escape-utf-8` which decodes the bytes returned by
    Stefan> `epg--decode-percent-escape` using `utf-8` so I think it makes more
    Stefan> sense to encode using `utf-8` than `raw-text`, WDYT?

No. The string that is passed to epg--decode-percent-escape can
contain non-ASCII characters encoded as UTF-8, plus percent-escaped
representations of non-ASCII characters. In order to convert those
percent-escaped characters correctly, the string has to be treated as
a unibyte array of bytes, then re-converted to multibyte by encoding
with utf-8 afterwards.

Robert

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Eli Zaretskii
> From: Robert Pluim <[hidden email]>
> Date: Thu, 12 Dec 2019 16:19:46 +0100
> Cc: [hidden email]
>
> In order to convert those percent-escaped characters correctly, the
> string has to be treated as a unibyte array of bytes, then
> re-converted to multibyte by encoding with utf-8 afterwards.
                               ^^^^^^^^
You mean "decoding", I believe (that's what the code does).

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Robert Pluim
In reply to this post by Robert Pluim
>>>>> On Thu, 12 Dec 2019 16:19:46 +0100, Robert Pluim <[hidden email]> said:
    Robert> No. The string that is passed to epg--decode-percent-escape can
    Robert> contain non-ASCII characters encoded as UTF-8, plus percent-escaped
    Robert> representations of non-ASCII characters. In order to convert those
    Robert> percent-escaped characters correctly, the string has to be treated as
    Robert> a unibyte array of bytes, then re-converted to multibyte by encoding
    Robert> with utf-8 afterwards.

Actually, Iʼm wrong here. (encode-coding-string string 'utf-8) will
work as well, Iʼd wrongly assumed it would produce a multibyte string.

Robert

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Robert Pluim
In reply to this post by Eli Zaretskii
>>>>> On Thu, 12 Dec 2019 17:28:14 +0200, Eli Zaretskii <[hidden email]> said:

    >> From: Robert Pluim <[hidden email]>
    >> Date: Thu, 12 Dec 2019 16:19:46 +0100
    >> Cc: [hidden email]
    >>
    >> In order to convert those percent-escaped characters correctly, the
    >> string has to be treated as a unibyte array of bytes, then
    >> re-converted to multibyte by encoding with utf-8 afterwards.
    Eli>                                ^^^^^^^^
    Eli> You mean "decoding", I believe (that's what the code does).

Yes. All this stuff is very easy to get lost in.

Robert

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Eli Zaretskii
> From: Robert Pluim <[hidden email]>
> Date: Thu, 12 Dec 2019 16:33:50 +0100
> Cc: [hidden email], [hidden email]
>
>     Eli> You mean "decoding", I believe (that's what the code does).
>
> Yes. All this stuff is very easy to get lost in.

Welcome to the club ;-)

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Andreas Schwab
In reply to this post by Robert Pluim
On Dez 12 2019, Robert Pluim wrote:

> "can contain non-ASCII characters encoded using UTF-8", which means
> they end up in a multi-byte string in emacs.

Only encoded data can contain UTF-8 encoded characters.  A decoded
string just contains (Unicode) characters.

> No. The string that is passed to epg--decode-percent-escape can
> contain non-ASCII characters encoded as UTF-8,

If the argument of epg--decode-percent-escape is a multi-byte string,
then it isn't encoded.

> In order to convert those percent-escaped characters correctly, the
> string has to be treated as a unibyte array of bytes

Why?

Andreas.

--
Andreas Schwab, SUSE Labs, [hidden email]
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Eli Zaretskii
In reply to this post by Robert Pluim
> From: Robert Pluim <[hidden email]>
> Date: Thu, 12 Dec 2019 16:30:12 +0100
> Cc: [hidden email]
>
> Actually, Iʼm wrong here. (encode-coding-string string 'utf-8) will
> work as well, Iʼd wrongly assumed it would produce a multibyte string.

The question is not what will or won't work, the question is what is
better, I think.  And I don't think Stefan explained his rationale for
preferring UTF-8.

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Robert Pluim
In reply to this post by Andreas Schwab
>>>>> On Thu, 12 Dec 2019 16:42:04 +0100, Andreas Schwab <[hidden email]> said:

    Andreas> On Dez 12 2019, Robert Pluim wrote:
    >> "can contain non-ASCII characters encoded using UTF-8", which means
    >> they end up in a multi-byte string in emacs.

    Andreas> Only encoded data can contain UTF-8 encoded characters.  A decoded
    Andreas> string just contains (Unicode) characters.

The data inside the gpg key is encoded, and is decoded by emacs into a
multi-byte string.

    >> No. The string that is passed to epg--decode-percent-escape can
    >> contain non-ASCII characters encoded as UTF-8,

    Andreas> If the argument of epg--decode-percent-escape is a multi-byte string,
    Andreas> then it isn't encoded.

Yes, my terminology is confused.

    >> In order to convert those percent-escaped characters correctly, the
    >> string has to be treated as a unibyte array of bytes

    Andreas> Why?

It starts as a multibyte string containing non-ascii chars, plus a
representation of non-ascii chars using percent escapes, eg

"ü%C3%A9"

epg--decode-percent-escape replace the escapes with the corresponding
characters, so now we have

"üé"

when what we really want is

"üé"

If we convert the initial string to 'bunch of bytes' before escaping,
then convert back to multibyte after, we get the correct
answer.

Robert

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Robert Pluim
In reply to this post by Eli Zaretskii
>>>>> On Thu, 12 Dec 2019 17:45:00 +0200, Eli Zaretskii <[hidden email]> said:

    >> From: Robert Pluim <[hidden email]>
    >> Date: Thu, 12 Dec 2019 16:30:12 +0100
    >> Cc: [hidden email]
    >>
    >> Actually, Iʼm wrong here. (encode-coding-string string 'utf-8) will
    >> work as well, Iʼd wrongly assumed it would produce a multibyte string.

    Eli> The question is not what will or won't work, the question is what is
    Eli> better, I think.  And I don't think Stefan explained his rationale for
    Eli> preferring UTF-8.

Stefan's patch has a certain symmetry to it, but Iʼd likely be going
"how come encoding as UTF-8 means we treat it as a bunch of bytes" at
some point in the future, and have to revisit the whole thing.

Robert

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Andreas Schwab
In reply to this post by Robert Pluim
On Dez 12 2019, Robert Pluim wrote:

> epg--decode-percent-escape replace the escapes with the corresponding
> characters, so now we have
>
> "üé"

That is were it went wrong.  You need to decode the sequence before
splicing it into the string.  Mixing encoded and decoded data is not
going to do the right thing.

Andreas.

--
Andreas Schwab, SUSE Labs, [hidden email]
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Robert Pluim
>>>>> On Thu, 12 Dec 2019 17:33:36 +0100, Andreas Schwab <[hidden email]> said:

    Andreas> On Dez 12 2019, Robert Pluim wrote:
    >> epg--decode-percent-escape replace the escapes with the corresponding
    >> characters, so now we have
    >>
    >> "üé"

    Andreas> That is were it went wrong.  You need to decode the sequence before
    Andreas> splicing it into the string.  Mixing encoded and decoded data is not
    Andreas> going to do the right thing.

I didnʼt write the code in question. Whoever did understandably didnʼt
want to write code for converting variable length percent-escapes into
the corresponding characters when emacs has decode-coding-string
already.

Robert

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Andreas Schwab
On Dez 12 2019, Robert Pluim wrote:

> I didnʼt write the code in question. Whoever did understandably didnʼt
> want to write code for converting variable length percent-escapes into
> the corresponding characters when emacs has decode-coding-string
> already.

You know how long the string to be decoded is.  Every run of percent
escapes is a complete sequence on its own.

Andreas.

--
Andreas Schwab, SUSE Labs, [hidden email]
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Eli Zaretskii
In reply to this post by Robert Pluim
> From: Robert Pluim <[hidden email]>
> Date: Thu, 12 Dec 2019 17:11:07 +0100
> Cc: Stefan Monnier <[hidden email]>, [hidden email]
>
> It starts as a multibyte string containing non-ascii chars, plus a
> representation of non-ascii chars using percent escapes, eg
>
> "ü%C3%A9"

Do the percent escapes necessarily represent UTF-8 byte sequences?  Or
can they be in any other encoding, like , say, Latin-1?

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Robert Pluim
>>>>> On Thu, 12 Dec 2019 19:57:20 +0200, Eli Zaretskii <[hidden email]> said:

    >> From: Robert Pluim <[hidden email]>
    >> Date: Thu, 12 Dec 2019 17:11:07 +0100
    >> Cc: Stefan Monnier <[hidden email]>, [hidden email]
    >>
    >> It starts as a multibyte string containing non-ascii chars, plus a
    >> representation of non-ascii chars using percent escapes, eg
    >>
    >> "ü%C3%A9"

    Eli> Do the percent escapes necessarily represent UTF-8 byte sequences?  Or
    Eli> can they be in any other encoding, like , say, Latin-1?

GnuPG only outputs percent escapes representing UTF-8 byte sequences,
as far as I can tell.

BTW, epg has been assuming that for the last 12 years, so I think itʼs
safe to continue.

Robert

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Eli Zaretskii
> From: Robert Pluim <[hidden email]>
> Date: Fri, 13 Dec 2019 10:40:03 +0100
> Cc: [hidden email], [hidden email], [hidden email]
>
>     >> "ü%C3%A9"
>
>     Eli> Do the percent escapes necessarily represent UTF-8 byte sequences?  Or
>     Eli> can they be in any other encoding, like , say, Latin-1?
>
> GnuPG only outputs percent escapes representing UTF-8 byte sequences,
> as far as I can tell.
>
> BTW, epg has been assuming that for the last 12 years, so I think itʼs
> safe to continue.

Then using UTF-8 here is indeed better, I think.

Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Stefan Monnier
In reply to this post by Eli Zaretskii
> The question is not what will or won't work, the question is what is
> better, I think.  And I don't think Stefan explained his rationale for
> preferring UTF-8.

What the code does is:
take a multibyte string with % escapes of bytes which are to be
interpreted as utf-8, so the Elisp does: encode with utf-8, process the %
escapes bytes, and decode back using utf-8.  The encoding+decoding with
the same coding-system should ensure that the non-escapes part of the
string is preserved and the use of utf-8 is chosen because that's what
the %-escapes are defined to use.

So yes, now I'm sure the patch is right.


        Stefan


Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Stefan Monnier
In reply to this post by Robert Pluim
> No. The string that is passed to epg--decode-percent-escape can
> contain non-ASCII characters encoded as UTF-8,

I have the impression that you're confused: a multibyte string that
contains say a λ or a é character does not contain "non-ASCII characters
encoded as UTF-8", unless you're referring to the way the multibyte
string is internally represented at a low level (but this internal
representation is largely irrelevant to Elisp).


        Stefan


Reply | Threaded
Open this post in threaded view
|

Re: master d57bb0c: Treat passed strings as raw-text when percent-escaping in epg

Robert Pluim
>>>>> On Mon, 16 Dec 2019 11:21:18 -0500, Stefan Monnier <[hidden email]> said:

    >> No. The string that is passed to epg--decode-percent-escape can
    >> contain non-ASCII characters encoded as UTF-8,

    Stefan> I have the impression that you're confused:

Didnʼt I already say that last week? :-)

    Stefan> a multibyte string that
    Stefan> contains say a λ or a é character does not contain "non-ASCII characters
    Stefan> encoded as UTF-8", unless you're referring to the way the multibyte
    Stefan> string is internally represented at a low level (but this internal
    Stefan> representation is largely irrelevant to Elisp).

That is what I was referring to, but it was the wrong thing to say in
this context.

Robert