Fixing Gnus, and string encoding question

Fixing Gnus, and string encoding question

Eric Abrahamsen-2
So I've made a hash of this change (ha), and am trying to figure out the
best solution.

The problem is that non-ASCII group names are now strings, and are
coming into the system in two different ways: written into .newsrc.eld
with `print-escape-nonascii' set to t, and read off the filesystem using
a buffer with multibyte disabled. The two methods don't match up -- the
strings are different.

Katsumi Yamaoka's example is the group whose decoded name is
"nnml:テスト". This is written to .newsrc.eld as the string:

"nnml:\343\203\206\343\202\271\343\203\210"

Those aren't actual escapes, just backslashes and numbers.

The group name is read from file with `set-buffer-multibyte' nil, using
`read' to pick the group name up as a symbol, then using `symbol-name'
to turn it into a string. The symbol looks like:

nnml:\343\203\206\343\202\271\343\203\210

And the resulting string is:

"nnml:ã\203\206ã\202¹ã\203\210"

Where the escapes are real escapes, I've typed them out here. The two
strings aren't `equal', obviously.

I don't know how to turn either of these strings into the other --
either direction would work, but I don't know how.

Another option is to give up messing with strings, and back the changes
halfway out: still use hash tables, but leave the group names as
symbols, with their current funky encoding. That's probably how I should
have sliced these changes to begin with. Then a later step would be to
go straight from symbols to fully decoded strings.

Hoping for some guidance,
Eric


Re: Fixing Gnus, and string encoding question

Noam Postavsky
On Fri, 5 Apr 2019 at 16:50, Eric Abrahamsen <[hidden email]> wrote:

> Katsumi Yamaoka's example is the group whose decoded name is "nnml:テス
> ト". This is written to .newsrc.eld as the string:
>
> "nnml:\343\203\206\343\202\271\343\203\210"

> nnml:\343\203\206\343\202\271\343\203\210

> "nnml:ã\203\206ã\202¹ã\203\210"

> I don't know how to turn either of these strings into the other --
> either direction would work, but I don't know how.

Are you maybe looking for decode-coding-string?

(decode-coding-string
 "nnml:\343\203\206\343\202\271\343\203\210" 'utf-8) ;=> "nnml:テスト"

(decode-coding-string
 (symbol-name (read "nnml:\343\203\206\343\202\271\343\203\210"))
 'utf-8) ;=> "nnml:テスト"

Re: Fixing Gnus, and string encoding question

Eric Abrahamsen-2
Noam Postavsky <[hidden email]> writes:

> On Fri, 5 Apr 2019 at 16:50, Eric Abrahamsen <[hidden email]> wrote:
>
>> Katsumi Yamaoka's example is the group whose decoded name is "nnml:テス
>> ト". This is written to .newsrc.eld as the string:
>>
>> "nnml:\343\203\206\343\202\271\343\203\210"
>
>> nnml:\343\203\206\343\202\271\343\203\210
>
>> "nnml:ã\203\206ã\202¹ã\203\210"
>
>> I don't know how to turn either of these strings into the other --
>> either direction would work, but I don't know how.
>
> Are you maybe looking for decode-coding-string?
>
> (decode-coding-string
>  "nnml:\343\203\206\343\202\271\343\203\210" 'utf-8) ;=> "nnml:テスト"
>
> (decode-coding-string
>  (symbol-name (read "nnml:\343\203\206\343\202\271\343\203\210"))
>  'utf-8) ;=> "nnml:テスト"

No, unfortunately -- that would make everything much easier. Eventually
the idea will be to decode the strings into plain utf-8-emacs, but for
now I'm stuck keeping them in this weird half-state. I literally need a
conversion between the two versions above.

If this turns out to be too ridiculous, I'll re-slice things as I
mentioned earlier, and leave these group names as symbols.

Eric

Re: Fixing Gnus, and string encoding question

Noam Postavsky
On Fri, 5 Apr 2019 at 22:22, Eric Abrahamsen <[hidden email]> wrote:

> >> "nnml:\343\203\206\343\202\271\343\203\210"

> >> "nnml:ã\203\206ã\202¹ã\203\210"

> > Are you maybe looking for decode-coding-string?

> No, unfortunately -- that would make everything much easier. Eventually
> the idea will be to decode the strings into plain utf-8-emacs, but for
> now I'm stuck keeping them in this weird half-state. I literally need a
> conversion between the two versions above.

Oh, I missed which two strings you meant. It seems that evaluating the
1st string with C-x C-e prints the second string in the *Messages*
buffer (I initially thought they were the same string), but
printing/inserting it doesn't work the same. The message code prints
one character at a time, and indeed, inserting one character at a time
in lisp works too:

(let ((s "nnml:\343\203\206\343\202\271\343\203\210"))
  (with-temp-buffer
    (mapc #'insert s)
    (buffer-string)))

The following shorter expression also seems to work:

(apply #'string (string-to-list "nnml:\343\203\206\343\202\271\343\203\210"))

And apply #'unibyte-string goes back again:

(let* ((s1 "nnml:\343\203\206\343\202\271\343\203\210")
       (s2 (apply #'string (string-to-list s1))))
  (apply #'unibyte-string (string-to-list s2)))

I can't say I completely understand why all this works though.
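
One way to probe what is going on, offered as a sketch rather than a
definitive account: a typed-in literal whose only non-ASCII content is
octal escapes is read as a unibyte string, whereas `string' builds a
multibyte string whose characters have the same numeric values. The
helper name `demo-string-facts' is made up for this illustration.

```elisp
;; `demo-string-facts' is a hypothetical helper, not part of Emacs or Gnus.
(defun demo-string-facts (s)
  "Return (MULTIBYTE-P LENGTH BYTES) for string S."
  (list (multibyte-string-p s) (length s) (string-bytes s)))

(let* ((s1 "nnml:\343\203\206\343\202\271\343\203\210") ; octal escapes: read as unibyte
       (s2 (apply #'string (string-to-list s1))))       ; same values, now characters
  (list (demo-string-facts s1)                           ; => (nil 14 14)
        (demo-string-facts s2)                           ; => (t 14 23)
        (equal (string-to-list s1) (string-to-list s2))  ; => t
        (equal s1 s2)))                                  ; => nil
```

The element values are identical, but one string stores them as single
bytes and the other as two-byte characters, which would explain why the
two are not `equal'.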

Re: Fixing Gnus, and string encoding question

Eli Zaretskii
In reply to this post by Eric Abrahamsen-2
> From: Eric Abrahamsen <[hidden email]>
> Date: Fri, 05 Apr 2019 19:22:18 -0700
> Cc: Emacs developers <[hidden email]>
>
> > (decode-coding-string
> >  "nnml:\343\203\206\343\202\271\343\203\210" 'utf-8) ;=> "nnml:テスト"
> >
> > (decode-coding-string
> >  (symbol-name (read "nnml:\343\203\206\343\202\271\343\203\210"))
> >  'utf-8) ;=> "nnml:テスト"
>
> No, unfortunately -- that would make everything much easier. Eventually
> the idea will be to decode the strings into plain utf-8-emacs, but for
> now I'm stuck keeping them in this weird half-state. I literally need a
> conversion between the two versions above.

Why do you need to keep these strings undecoded?

Re: Fixing Gnus, and string encoding question

Andreas Schwab-2
In reply to this post by Eric Abrahamsen-2
On Apr 05 2019, Eric Abrahamsen <[hidden email]> wrote:

> The problem is that non-ASCII group names are now strings, and are
> coming into the system in two different ways: written into .newsrc.eld
> with `print-escape-nonascii' set to t,

Why do you need to use the escaped representation?

> The group name is read from file with `set-buffer-multibyte' nil,

Why can't you decode the file contents first, before passing it to read?

Andreas.

--
Andreas Schwab, [hidden email]
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Re: Fixing Gnus, and string encoding question

Eric Abrahamsen-2
In reply to this post by Eli Zaretskii
Eli Zaretskii <[hidden email]> writes:

>> From: Eric Abrahamsen <[hidden email]>
>> Date: Fri, 05 Apr 2019 19:22:18 -0700
>> Cc: Emacs developers <[hidden email]>
>>
>> > (decode-coding-string
>> >  "nnml:\343\203\206\343\202\271\343\203\210" 'utf-8) ;=> "nnml:テスト"
>> >
>> > (decode-coding-string
>> >  (symbol-name (read "nnml:\343\203\206\343\202\271\343\203\210"))
>> >  'utf-8) ;=> "nnml:テスト"
>>
>> No, unfortunately -- that would make everything much easier. Eventually
>> the idea will be to decode the strings into plain utf-8-emacs, but for
>> now I'm stuck keeping them in this weird half-state. I literally need a
>> conversion between the two versions above.
>
> Why do you need to keep these strings undecoded?

Andreas Schwab <[hidden email]> writes:

> Why do you need to use the escaped representation?
>
>> The group name is read from file with `set-buffer-multibyte' nil,
>
> Why can't you decode the file contents first, before passing it to read?

That's the eventual plan. Gnus had the names encoded because they were
kept as symbols. I didn't want to go in one fell swoop from encoded
strings interned in obarrays to completely decoded strings kept in hash
tables, because I assumed I would screw something up and break Gnus and
annoy everyone. So that worked out well... But I should have done
encoded symbols kept in hash tables as the first intermediate step.

Eric


Re: Fixing Gnus, and string encoding question

Eric Abrahamsen-2
In reply to this post by Noam Postavsky
Noam Postavsky <[hidden email]> writes:

> On Fri, 5 Apr 2019 at 22:22, Eric Abrahamsen <[hidden email]> wrote:
>
>> >> "nnml:\343\203\206\343\202\271\343\203\210"
>
>> >> "nnml:ã\203\206ã\202¹ã\203\210"
>
>> > Are you maybe looking for decode-coding-string?
>
>> No, unfortunately -- that would make everything much easier. Eventually
>> the idea will be to decode the strings into plain utf-8-emacs, but for
>> now I'm stuck keeping them in this weird half-state. I literally need a
>> conversion between the two versions above.
>
> Oh, I missed which two string you meant. It seems that evaluating the
> 1st string with C-x C-e prints the second string in the *Messages*
> buffer (I initially thought they were the same string), but
> printing/inserting it doesn't work the same. The message code prints
> one character at a time, and indeed, inserting one character at a time
> in lisp works too:
>
> (let ((s "nnml:\343\203\206\343\202\271\343\203\210"))
>   (with-temp-buffer
>     (mapc #'insert s)
>     (buffer-string)))
>
> The following shorter expression also seem to work:
>
> (apply #'string (string-to-list "nnml:\343\203\206\343\202\271\343\203\210"))
>
> And apply #'unibyte-string goes back again:
>
> (let* ((s1 "nnml:\343\203\206\343\202\271\343\203\210")
>        (s2 (apply #'string (string-to-list s1))))
>   (apply #'unibyte-string (string-to-list s2)))
>
> I can't say I completely understand why all this works though.

Well that is weird and I would never have discovered it on my own --
thank you! I'm going to try to put together a patch using this now.

Thanks again,
Eric


Re: Fixing Gnus, and string encoding question

Eric Abrahamsen-2
In reply to this post by Noam Postavsky
Noam Postavsky <[hidden email]> writes:

> On Fri, 5 Apr 2019 at 22:22, Eric Abrahamsen <[hidden email]> wrote:
>
>> >> "nnml:\343\203\206\343\202\271\343\203\210"
>
>> >> "nnml:ã\203\206ã\202¹ã\203\210"
>
>> > Are you maybe looking for decode-coding-string?
>
>> No, unfortunately -- that would make everything much easier. Eventually
>> the idea will be to decode the strings into plain utf-8-emacs, but for
>> now I'm stuck keeping them in this weird half-state. I literally need a
>> conversion between the two versions above.
>
> Oh, I missed which two string you meant. It seems that evaluating the
> 1st string with C-x C-e prints the second string in the *Messages*
> buffer (I initially thought they were the same string), but
> printing/inserting it doesn't work the same. The message code prints
> one character at a time, and indeed, inserting one character at a time
> in lisp works too:
>
> (let ((s "nnml:\343\203\206\343\202\271\343\203\210"))
>   (with-temp-buffer
>     (mapc #'insert s)
>     (buffer-string)))
>
> The following shorter expression also seem to work:
>
> (apply #'string (string-to-list "nnml:\343\203\206\343\202\271\343\203\210"))
>
> And apply #'unibyte-string goes back again:
>
> (let* ((s1 "nnml:\343\203\206\343\202\271\343\203\210")
>        (s2 (apply #'string (string-to-list s1))))
>   (apply #'unibyte-string (string-to-list s2)))
>
> I can't say I completely understand why all this works though.

No, I spoke too soon. It must be another case of a string that doesn't
quite look like what it actually is. The string that looks like
"nnml:\343\203" etc must be something different: when I run your example
using a typed-in version of the string it behaves correctly, but when I
run it with the actual string I'm working with, the apply #'string
doesn't change it.

You can get the string I'm fighting with by saving the attached file and
running:

(with-temp-buffer
  (set-buffer-multibyte t)
  (let ((coding-system-for-read 'raw-text))
    (insert-file-contents "active")
    (goto-char (point-min))
    (symbol-name (read (current-buffer)))))
   
I'm trying to turn that into something that looks like
"nnml:ã\203\206ã\202¹ã\203\210"

Thanks,
Eric


Attachment: active (24 bytes)

Re: Fixing Gnus, and string encoding question

Andreas Schwab-2
Symbol names can be unibyte and multibyte.  Make sure to get that right.
If you see ã instead of \343 then the symbol has a unibyte name.

Andreas.

Re: Fixing Gnus, and string encoding question

Noam Postavsky
In reply to this post by Eric Abrahamsen-2
On Sun, 7 Apr 2019 at 00:10, Eric Abrahamsen <[hidden email]> wrote:

> You can get the string I'm fighting with by saving the attached file and
> running:
>
> (with-temp-buffer
>   (set-buffer-multibyte t)
>   (let ((coding-system-for-read 'raw-text))
>     (insert-file-contents "active")
>     (goto-char (point-min))
>     (symbol-name (read (current-buffer)))))
>
> I'm trying to turn that into something that looks like
> "nnml:ã\203\206ã\202¹ã\203\210"

Ah, needs multibyte-char-to-unibyte:

(apply #'string
       (mapcar #'multibyte-char-to-unibyte
               (with-temp-buffer
                 (set-buffer-multibyte t)
                 (let ((coding-system-for-read 'raw-text))
                   (insert-file-contents "active")
                   (goto-char (point-min))
                   (symbol-name (read (current-buffer)))))))
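
The form above needs the attached file. As a rough stand-in, and
assuming that reading raw-text into a multibyte buffer yields the same
raw-byte characters that `string-to-multibyte' produces from a unibyte
string, the same conversion can be tried on a typed-in literal. The
name `demo-raw-bytes-to-chars' is made up for this sketch.

```elisp
;; Hypothetical helper: map raw-byte characters (and ASCII) to the
;; characters with the same numeric value, via `multibyte-char-to-unibyte'.
(defun demo-raw-bytes-to-chars (s)
  "Return a string whose characters are the byte values of S."
  (apply #'string (mapcar #'multibyte-char-to-unibyte s)))

(let* ((bytes "nnml:\343\203\206\343\202\271\343\203\210") ; unibyte literal
       ;; Assumed stand-in for the name read from the attached file.
       (raw (string-to-multibyte bytes)))
  ;; Compare against building the string directly from the byte values.
  (equal (demo-raw-bytes-to-chars raw)
         (apply #'string (string-to-list bytes))))
```

If the assumption holds, the comparison returns t, that is, both routes
end at the same multibyte string.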

Re: Fixing Gnus, and string encoding question

Andreas Schwab-2
On Apr 07 2019, Noam Postavsky <[hidden email]> wrote:

> On Sun, 7 Apr 2019 at 00:10, Eric Abrahamsen <[hidden email]> wrote:
>
>> You can get the string I'm fighting with by saving the attached file and
>> running:
>>
>> (with-temp-buffer
>>   (set-buffer-multibyte t)
>>   (let ((coding-system-for-read 'raw-text))
>>     (insert-file-contents "active")
>>     (goto-char (point-min))
>>     (symbol-name (read (current-buffer)))))
>>
>> I'm trying to turn that into something that looks like
>> "nnml:ã\203\206ã\202¹ã\203\210"
>
> Ah, needs multibyte-char-to-unibyte:
>
> (apply #'string
>        (mapcar #'multibyte-char-to-unibyte
>                (with-temp-buffer
>                  (set-buffer-multibyte t)
>                  (let ((coding-system-for-read 'raw-text))
>                    (insert-file-contents "active")
>                    (goto-char (point-min))
>                    (symbol-name (read (current-buffer)))))))

(encode-coding-string (symbol-name (read (current-buffer))) 'raw-text)

Andreas.

Re: Fixing Gnus, and string encoding question

Andreas Schwab-2
In reply to this post by Eric Abrahamsen-2
On Apr 06 2019, Eric Abrahamsen <[hidden email]> wrote:

> (with-temp-buffer
>   (set-buffer-multibyte t)
>   (let ((coding-system-for-read 'raw-text))
>     (insert-file-contents "active")
>     (goto-char (point-min))
>     (symbol-name (read (current-buffer)))))
>    
> I'm trying to turn that into something that looks like
> "nnml:ã\203\206ã\202¹ã\203\210"

(decode-coding-string (symbol-name (read "nnml:\343\203\206\343\202\271\343\203\210")) 'latin-1)

Andreas.

Re: Fixing Gnus, and string encoding question

Eric Abrahamsen-2
In reply to this post by Andreas Schwab-2
Andreas Schwab <[hidden email]> writes:

> Symbol names can be unibyte and multibyte.  Make sure to get that right.
> If you see ã instead of \343 then the symbol has a unibyte name.

I will meditate on this for a bit.

Andreas Schwab <[hidden email]> writes:

> On Apr 06 2019, Eric Abrahamsen <[hidden email]> wrote:
>
>> (with-temp-buffer
>>   (set-buffer-multibyte t)
>>   (let ((coding-system-for-read 'raw-text))
>>     (insert-file-contents "active")
>>     (goto-char (point-min))
>>     (symbol-name (read (current-buffer)))))
>>    
>> I'm trying to turn that into something that looks like
>> "nnml:ã\203\206ã\202¹ã\203\210"
>
> (decode-coding-string (symbol-name (read "nnml:\343\203\206\343\202\271\343\203\210")) 'latin-1)

That did it! What a relief. I'm not sure why 'latin-1 in particular, but
that aligns all the strings correctly. I'll try to figure that out.
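
A tentative sketch of why latin-1 fits: that coding system maps each
byte value to the character with the same code point, so decoding the
raw bytes as latin-1 should match building a string from the byte values
one at a time, while decoding the same bytes as utf-8 gives the readable
group name. `demo-byte-chars' is an illustrative name only.

```elisp
;; `demo-byte-chars' is a hypothetical helper used only for this illustration.
(defun demo-byte-chars (bytes)
  "Return a string with one character per byte of unibyte BYTES."
  (apply #'string (string-to-list bytes)))

(let ((bytes "nnml:\343\203\206\343\202\271\343\203\210")) ; unibyte literal
  (list
   ;; latin-1: one character per byte, code point equal to the byte value.
   (equal (decode-coding-string bytes 'latin-1) (demo-byte-chars bytes))
   ;; utf-8: each three-byte sequence becomes one character of the name.
   (equal (decode-coding-string bytes 'utf-8) "nnml:テスト")))
;; => (t t)
```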

Huge thanks to you and Noam!

Eric