bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

J.P.
Tags: patch

Hi,

The behavior in question only affects SOCKS version 4 [1] and is shown
in the second attachment. The first revises some helpers and adds
additional tests I was too lazy or thoughtless to include in recent
patches.

The third contains a naive attempt at a fix but may suffer from my
inadequate grasp of Emacs coding systems [2]. Please advise if that's so
and/or I need to hit the books (or get my head examined).

Thanks,
J.P.


[1] The function socks--open-network-stream assumes all v5 addresses to
be of type 3, which means a domain name of variable length, presumably
of the LDH character set: https://tools.ietf.org/html/rfc1928#section-5

[2] Re coding-systems and proposed fix: I'm wondering if I should have
used encode-coding-string (or something else) instead. Also: while
socks--open-network-stream is perhaps the real culprit because it
creates the offending string, I figured it's more resilient to have the
function doing the sending to massage any IP address it encounters.


In GNU Emacs 28.0.50 (build 1, x86_64-redhat-linux-gnu, GTK+ Version 3.24.24, cairo version 1.16.0)
 of 2021-01-31 built on localhost
Repository revision: 431b098a206d27a2dff6a88312c28c36926f90e9
Repository branch: master
Windowing system distributor 'Fedora Project', version 11.0.12010000
System Description: Fedora 33 (Workstation Edition)

Configured using:
 'configure --enable-check-lisp-object-type --enable-checking=yes,glyphs
 --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu
 --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin
 --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share
 --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec
 --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man
 --infodir=/usr/share/info --with-dbus --with-gif --with-jpeg --with-png
 --with-rsvg --with-tiff --with-xft --with-xpm --with-x-toolkit=gtk3
 --with-gpm=no --with-xwidgets --with-modules --with-harfbuzz
 --with-cairo --with-json build_alias=x86_64-redhat-linux-gnu
 host_alias=x86_64-redhat-linux-gnu CC=gcc 'CFLAGS=-DMAIL_USE_LOCKF -O0
 -flto=auto -ffat-lto-objects -fexceptions -g3 -grecord-gcc-switches
 -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
 -Wp,-D_GLIBCXX_ASSERTIONS
 -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong
 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic
 -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection'
 LDFLAGS=-Wl,-z,relro
 PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig'

Configured features:
ACL CAIRO DBUS FREETYPE GIF GLIB GMP GNUTLS GSETTINGS HARFBUZZ JPEG JSON
LCMS2 LIBOTF LIBSELINUX LIBSYSTEMD LIBXML2 M17N_FLT MODULES NOTIFY
INOTIFY PDUMPER PNG RSVG SOUND THREADS TIFF TOOLKIT_SCROLL_BARS X11 XDBE
XIM XPM XWIDGETS GTK3 ZLIB

Important settings:
  value of $LANG: en_US.UTF-8
  value of $XMODIFIERS: @im=ibus
  locale-coding-system: utf-8-unix

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  global-eldoc-mode: t
  eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Load-path shadows:
None found.

Features:
(shadow sort mail-extr emacsbug message rmc puny dired dired-loaddefs
rfc822 mml easymenu mml-sec epa derived epg epg-config gnus-util rmail
rmail-loaddefs auth-source cl-seq eieio eieio-core cl-macs
eieio-loaddefs password-cache json map text-property-search time-date
subr-x seq byte-opt gv bytecomp byte-compile cconv mm-decode mm-bodies
mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader cl-loaddefs
cl-lib sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr mail-utils
iso-transl tooltip eldoc electric uniquify ediff-hook vc-hooks
lisp-float-type mwheel term/x-win x-win term/common-win x-dnd tool-bar
dnd fontset image regexp-opt fringe tabulated-list replace newcomment
text-mode elisp-mode lisp-mode prog-mode register page tab-bar menu-bar
rfn-eshadow isearch timer select scroll-bar mouse jit-lock font-lock
syntax facemenu font-core term/tty-colors frame minibuffer cl-generic
cham georgian utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao
korean japanese eucjp-ms cp51932 hebrew greek romanian slovak czech
european ethiopic indian cyrillic chinese composite charscript charprop
case-table epa-hook jka-cmpr-hook help simple abbrev obarray
cl-preloaded nadvice button loaddefs faces cus-face macroexp files
window text-properties overlay sha1 md5 base64 format env code-pages
mule custom widget hashtable-print-readable backquote threads
xwidget-internal dbusbind inotify lcms2 dynamic-setting
system-font-setting font-render-setting cairo move-toolbar gtk x-toolkit
x multi-tty make-network-process emacs)

Memory information:
((conses 16 51852 8897)
 (symbols 48 6598 1)
 (strings 32 18255 2081)
 (string-bytes 1 611217)
 (vectors 16 12543)
 (vector-slots 8 172252 11638)
 (floats 8 25 39)
 (intervals 56 275 0)
 (buffers 984 10))

0001-Add-more-auth-related-tests-for-socks.el.patch (15K) Download Attachment
0002-DROP-ME-demo-UTF-8-encoding-of-numeric-addresses.patch (1K) Download Attachment
0002-Preserve-numeric-addresses-in-socks-send-command.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

Eli Zaretskii
> From: "J.P." <[hidden email]>
> Date: Sat, 06 Feb 2021 03:46:26 -0800
>
> [2] Re coding-systems and proposed fix: I'm wondering if I should have
> used encode-coding-string (or something else) instead. Also: while
> socks--open-network-stream is perhaps the real culprit because it
> creates the offending string, I figured it's more resilient to have the
> function doing the sending to massage any IP address it encounters.

Emacs holds all non-ASCII characters internally in (a superset of)
UTF-8 encoding.  When passing those strings to external programs or
network connections, they should be encoded as appropriate.

What I don't understand is what is the "appropriate" encoding in this
case.  Can you explain why you use literal bytes in the test?  What
are those bytes supposed to stand for, and which program is supposed
to receive this sequence of bytes on the other end of the connect
command?

Thanks.



Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

Eli Zaretskii
> From: "J.P." <[hidden email]>
> Cc: [hidden email]
> Date: Sat, 06 Feb 2021 06:19:55 -0800
>
> Re appropriate encoding: correct me if I'm wrong (internet), but among
> the Emacs coding systems, it'd be latin-1.

That depends on what the other end expects.  Does it expect latin-1 in
this case?

> In the proposed patch for socks-send-command, swapping out the call
> to unibyte-string with (encode-coding-string address 'latin-1) has
> the same effect of preserving, say, char 216 as the single byte
> "\330" and not "\303\230".

Does emitting the single byte \330 produce the correct result in this
case?  Then by all means please use

   (encode-coding-string address 'latin-1)

> Re program on the other end: this would be any program offering a proxy
> service that speaks the same protocol. Popular ones include tor and ssh.
> There's also a "bind" command that allows your client (Emacs) to act as
> a server and listen/accept connections on the remote end as if they were
> present on your local network.

And those expect Latin-1 encoding in this case?



Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

J.P.
Eli Zaretskii <[hidden email]> writes:

>> Re appropriate encoding: correct me if I'm wrong (internet), but among
>> the Emacs coding systems, it'd be latin-1.
>
> That depends on what the other end expects.  Does it expect latin-1 in
> this case?

From the point of view of Emacs, I'd say yes: the other end, meaning the
proxy service, expects latin-1. From the service's point of view, it
only speaks byte sequences and doesn't interpret any fields as text [1].
This continues after proxying has commenced; incoming byte sequences are
forwarded verbatim as opaque payloads.

> Does emitting the single byte \330 produce the correct result in this
> case?  Then by all means please use
>
>    (encode-coding-string address 'latin-1)

It does indeed produce the correct result [2], and I've updated the
patch to reflect this. I wasn't sure whether you wanted me to replace
all the vectors in the tests with strings and/or annotate them with
comments explaining the protocol, so I just left them as is for now.

My main concern (based on sheer ignorance) was any possible side effects
that may occur from encode-coding-string setting the variable
last-coding-system-used to latin-1. I investigated a little by stepping
through the subsequent send_process() call and found that the variable's
value as latin-1 appears short lived because it's quickly reassigned to
binary. I tried to demonstrate this in the attached log of my debug
session (and also show that no conversion is performed). Please pardon
my sad debugging skills.

>> Re program on the other end: this would be any program offering a proxy
>> service that speaks the same protocol. Popular ones include tor and ssh.
>> [...]
>
> And those expect Latin-1 encoding in this case?

I'd say yes, insofar as these programs are examples of a proxy service
of the sort mentioned in the first answer above.

Thanks again


[1] Although, in the case of SOCKS 4A/5, non-numeric addresses, i.e.,
domain names, should probably be expressed in characters a resolver can
understand, like the Punycode ASCII subset.

[2] there is one tiny difference in behavior from the previous iteration
of this patch, but it's not worth anyone's time, so I'll just note it
here for the record: when called in the manner shown in the patch,
encode-coding-string silently replaces multibyte characters with spaces.

The only edge case I could think of in which accidentally passing a
multibyte might be harder to debug than a normal typo would be when
hitting an address like ec2-13-56-13-123.us-west-1.compute.amazonaws.com
and accidentally passing 13.256.13.123 (as "\15\u0100\15\173"), which
would be routed to 13.32.13.123 (flickr/cloudflare).

One way to avoid this would be with validation like that performed by
unibyte-string or, alternatively, by purposefully violating the protocol
and sending say, "\15\15{" instead of "\15 \15{" (and thereby triggering
an error response from the service). All in all, this seems unlikely
enough not to warrant special attention.

0001-Add-more-auth-related-tests-for-socks.el.patch (15K) Download Attachment
0002-Preserve-numeric-addresses-in-socks-send-command.patch (2K) Download Attachment
debug_session.log (12K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

J.P.
In reply to this post by Eli Zaretskii
Eli,

Apologies for my previous rambling reply. Your time is supremely
valuable, and I should've been more mindful of that. I'm hoping you'll
allow me this mulligan (ahem):

> That depends on what the other end expects.  Does it expect latin-1 in
> this case?

Yes.

> Does emitting the single byte \330 produce the correct result in this
> case?

Yes.

> Then by all means please use
>
>    (encode-coding-string address 'latin-1)

I have updated my changes to reflect this.

> And those [tor/ssh] expect Latin-1 encoding in this case?

Yes.

When you've got a sec, please instruct me on how best to proceed (even
if that means buggering off 'til I've learned more about Emacs!). :)

Appreciate your patience, as always.
J.P.

0001-Add-more-auth-related-tests-for-socks.el.patch (15K) Download Attachment
0002-Preserve-numeric-addresses-in-socks-send-command.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

Eli Zaretskii
> From: "J.P." <[hidden email]>
> Cc: [hidden email]
> Date: Tue, 09 Feb 2021 07:17:29 -0800
>
>
> [1:text/plain Hide]
>
> Eli,
>
> Apologies for my previous rambling reply. Your time is supremely
> valuable, and I should've been more mindful of that. I'm hoping you'll
> allow me this mulligan (ahem):
>
> > That depends on what the other end expects.  Does it expect latin-1 in
> > this case?
>
> Yes.
>
> > Does emitting the single byte \330 produce the correct result in this
> > case?
>
> Yes.
>
> > Then by all means please use
> >
> >    (encode-coding-string address 'latin-1)
>
> I have updated my changes to reflect this.
>
> > And those [tor/ssh] expect Latin-1 encoding in this case?
>
> Yes.
>
> When you've got a sec, please instruct me on how best to proceed (even
> if that means buggering off 'til I've learned more about Emacs!). :)

(I'd really prefer that someone who knows more than I do about SOCKS
review this.)

If I must do that, I have a question: what kind of string can this
ADDRESS be?  My reading of RFC 1928 is that it normally is an IP
address, in which case encoding is not relevant, as it's an ASCII
string.  But it can also be a domain, right?  If so, what form can
this domain take?  If the domain has non-ASCII characters, shouldn't
it be hex-encoded, or run through IDNA?  I mean, are non-ASCII
characters in that place at all allowed?



Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

Eli Zaretskii
> From: "J.P." <[hidden email]>
> Cc: [hidden email]
> Date: Wed, 10 Feb 2021 05:16:58 -0800
>
> Eli Zaretskii <[hidden email]> writes:
>
> > what kind of string can this ADDRESS be? My reading of RFC 1928 is
> > that it normally is an IP address, in which case encoding is not
> > relevant, as it's an ASCII string. But it can also be a domain, right?
>
> This patch only affects IP addresses, but I'm happy to look into the
> domain name form as well.

Then I don't understand why we need to worry about encoding.  IP
addresses are pure ASCII strings, so they need no encoding whatsoever.

I guess I will have to ask you to back up and describe what problems
you saw with the original code, and show me the details of the strings
involved in that.

Thanks for the rest of your message, but I suggest that we limit
ourselves to IP addresses for the time being, and only go to non-ASCII
domains when we have the IP address use case figured out.



Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

J.P.
Eli Zaretskii <[hidden email]> writes:

> Then I don't understand why we need to worry about encoding.  IP
> addresses are pure ASCII strings, so they need no encoding whatsoever.

Clearly, I'm failing you here. Between my dearth of communications
skills and lack of Emacs know how, I've obviously managed to deceive you
into thinking that SOCKS IP address fields ought to contain ASCII text
characters such as the following:

  1  9  2  .  1  6  8  .  1  .  1
  31 39 32 2e 31 36 38 2e 31 2e 31

However, this is not the case [a]. In version 4, all addresses are
four-byte sequences, one byte for each component of an IPv4 address, and
the ordering is left-to-right. For example:

  192 168 1  1
  c0  a8  01 01

In version 5, the one covered by RFC 1928, this is extended to include
16-byte IPv6 addresses as well as ASCII domain names. All three are
exclusive to one another but occupy the same field in a union of sorts.
The first byte of that field, the "ATYP" flag, denotes which of the
three to expect, and it appears as the "atype" argument to
socks-send-command.

> I guess I will have to ask you to back up and describe what problems
> you saw with the original code, and show me the details of the strings
> involved in that.

The Elisp manual distinguishes between multibyte and unibyte "sources"
of strings [1]. For these (SOCKS 4) IP address strings, the function
socks--open-network-stream is that source (it creates them). When such
a string includes characters with code points between 128 and 255 (the
latin-1/iso-8859-1 range), single characters are sent as two utf-8
encoded bytes, which the SOCKS service rejects as violating protocol.

Specifically, when a user passes "example.com" to the entry-point
function socks-open-network-stream, its internal helper
socks--open-network-stream resolves the host name into an IP in list
form and then converts this to a string by calling

  (apply #'format "%c%c%c%c" '(93 184 216 34))

This produces a multibyte string of the same character length:

  "]¸Ø\""

However, when socks-send-command passes this to process-send-string,
whose coding system is (binary . binary), the underlying six-byte
sequence is emitted verbatim:

  "]\302\270\303\230\""

My initial idea was to leverage the function unibyte-string to ensure
every character can be encoded in 8 bits before transmission. Regardless,
performing some combination of validating and converting before sending
may be worthwhile since it'll only run once per connection.

Sorry for the extended play-by-play. I certainly hope none of it came
off as insulting or pedantic. I'm quite certain your grasp of such
concepts long ago outpaced any understanding I could ever hope to
attain.

J.P.


[a] My versions of tor and ssh definitely honor requests like

  curl --proxy socks5h://localhost:1080 http://93.184.216.34

passing the IP address as a domain name. Although this defies RFC 1928,
which specifies FQDNs only [1], I'm getting the sense that influential
projects treat the latter more as a living standard. (Note: in its unit
tests, tor only includes this form for its extension commands [2].)

[1] (elisp) Non-ASCII in Strings, second paragraph
[1] https://tools.ietf.org/html/rfc1928#section-5
[2] https://gitweb.torproject.org/tor.git/tree/src/test/test_socks.c#n335



Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

J.P.
Eli Zaretskii <[hidden email]> writes:

> Then they are what we call "raw bytes", and encoding them with
> raw-text-unix should suffice.

Thanks. Unfortunately, this produces the same utf-8 encoded bytes.

  (encode-coding-char 192 'raw-text-unix)
  ⇒ "\303\200"
 
It looks like raw-text-unix is an alias for binary [1], the coding
system already used by the network process sending the erroneous
request. I suppose it's always possible to strong arm it like

  (encode-coding-char (or (decode-char 'eight-bit c) c) 'raw-text-unix)
  ⇒ "^@" ... "\377"

But what about your original latin-1 suggestion? Is that no longer in
contention?

  (encode-coding-char 192 'latin-1)
  ⇒ "\300"

> How does the code which calls socks.el create these raw bytes?

This library has an entry-point function that's part of the url-gateway
dispatch mechanism. I can't say for certain, but it looks like url-http
is the only library directly using this facility. Regardless, the
function gets called with a (possibly multibyte) host name, which in
rare cases may be an ASCII IP address created by url-gateway.

With SOCKS4, that's kind of moot, since all names are looked up through
socks-nslookup-host, which returns an IPv4 address as a list of fixnums.
Its caller is an internal helper that converts this list into a
multibyte string for socks-send-command to emit onto the wire (where
it's then rejected by the service).

Currently, IP addresses aren't used at all for v5 connect-command
requests. And raw-byte IP addresses do not yet appear anywhere [2]. This
patch would introduce them, either as an argument to socks-send-command
or as something ephemeral produced by it (the current idea).


[1] (elisp) Coding System Basics
[2] Of course, these are generalities that don't apply to users who wire
everything up manually.





Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

J.P.
Eli Zaretskii <[hidden email]> writes:

> So what is the problem with using unibyte-string for producing a
> unibyte string from a list of bytes?  It sounds like it's exactly
> what is needed here, and is actually used in some places in socks.el.

Great, thanks. That's what was needed indeed. I've updated things
accordingly and have added a doc string to socks-send-command explaining
the address parameter.


0001-Add-more-auth-related-tests-for-socks.el.patch (16K) Download Attachment
0002-Use-raw-bytes-for-SOCKS-4-IP-addresses.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

Eli Zaretskii
> From: "J.P." <[hidden email]>
> Cc: [hidden email]
> Date: Wed, 17 Feb 2021 06:59:07 -0800
>
> Eli Zaretskii <[hidden email]> writes:
>
> > So what is the problem with using unibyte-string for producing a
> > unibyte string from a list of bytes?  It sounds like it's exactly
> > what is needed here, and is actually used in some places in socks.el.
>
> Great, thanks. That's what was needed indeed. I've updated things
> accordingly and have added a doc string to socks-send-command explaining
> the address parameter.

Thanks.  I wanted to install this, but it failed to apply to the
master branch.  Could you please rebase on the current master branch
and resubmit?



Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

J.P.
Eli Zaretskii <[hidden email]> writes:

> Thanks.  I wanted to install this, but it failed to apply to the
> master branch.  Could you please rebase on the current master branch
> and resubmit?

Hi Eli, I've gone ahead and rebased atop

  commit c85c8e7d42ae2a5fc95fa7b14257389d8383b34d
  Author: Stefan Kangas <[hidden email]>
 
      Add toolbar for help-mode
 
   lisp/help-mode.el | 28 +++++++++++++++++++++++-----
   1 file changed, 23 insertions(+), 5 deletions(-)

However the src/dest blob hashes (i.e., git-hash-object(1) SHAs in the
diffs' index dead..beef 0644 lines) remain unchanged. Which leads me to
think maybe the failure is related to these being two individual patches
(a "change set"). I've had trouble in the past applying similar sets
using debbugs-gnu-apply-patch, but that might've just been incompetence.

Regardless, I'd be happy to squash these down into a single commit if you
think that'd help. Thanks.

0001-Add-more-auth-related-tests-for-socks.el.patch (16K) Download Attachment
0002-Use-raw-bytes-for-SOCKS-4-IP-addresses.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

J.P.
In reply to this post by Eli Zaretskii

I went ahead and squashed them down for good measure, in case you prefer
the single commit. BTW, I meant before/after, not src/dest re the git
diff index lines. Thanks.

0001-Use-raw-bytes-for-SOCKS-4-IP-addresses.patch (18K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

J.P.
In reply to this post by J.P.
Eli Zaretskii <[hidden email]> writes:

> Running the tests, I see a lot of stuff on the console.  Do we really
> need that? can some of it be shut up?

Sorry about that. I changed it so it's only noisy during interactive
sessions. Is that enough or should I go all the way? Thanks.

0001-Mute-noisy-test-fixture-for-socks.el.patch (925 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

bug#46342: 28.0.50; socks-send-command munges IP address bytes to UTF-8

Eli Zaretskii
> From: "J.P." <[hidden email]>
> Cc: [hidden email]
> Date: Sat, 20 Feb 2021 07:08:55 -0800
>
> > Running the tests, I see a lot of stuff on the console.  Do we really
> > need that? can some of it be shut up?
>
> Sorry about that. I changed it so it's only noisy during interactive
> sessions. Is that enough or should I go all the way? Thanks.

Thanks, installed.

And with that, I think we can close this bug?