bug#27403: 26.0.50; Indentation misalignment with Unicode code points >65535

Adam Niederer
Hello, I believe I've found an indentation issue. To reproduce, start
emacs, create a buffer in js-mode, paste in this code, and press C-x h
TAB to indent the buffer:

let x = /* 👍 */ { foo: 0
                   bar: 0 }

let x = /* ☺ */ { foo: 0
                  bar: 0 }

Both 25.2 and 26.0.50 add one extra space before "bar" in the first
snippet with U+1F44D THUMBS UP SIGN in the comment, whereas the
second snippet with U+263A WHITE SMILING FACE properly aligns "bar" with
"foo". This appears to happen whenever the character in the comment
needs a surrogate pair.

This issue also happens in python-mode:

"👍", {"a": 2,
       "b": 3}

"☺", {"a":2,
      "b":3}

Interestingly, pressing TAB with one's point on the second line of each
snippet to dedent the line yields a correct result for both symbols:

"👍", {"a": 2,
    "b": 3}

"☺", {"a":2,
    "b":3}

Just in case those Emoji don't make it through the mail properly, the
first snippet in each example contains U+1F44D THUMBS UP SIGN before
the map, and the second snippet contains U+263A WHITE SMILING FACE.

-Adam


In GNU Emacs 26.0.50 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.22.15)
of 2017-06-17 built on AdamsPC
Repository revision: 49c0ff29c2e0243ba35ec17e3e3af49369be43db
Windowing system distributor 'The X.Org Foundation', version 11.0.11903000
System Description: Arch Linux

Recent messages:
Auto-saving...
20 (#o24, #x14, ?\C-t)
21 (#o25, #x15, ?\C-u)
20 (#o24, #x14, ?\C-t) [2 times]
Undo! [3 times]
20 (#o24, #x14, ?\C-t)
Auto-saving...
mwheel-scroll: Beginning of buffer
Mark set
Auto-saving...done

Configured features:
XPM JPEG TIFF GIF PNG RSVG IMAGEMAGICK SOUND GPM DBUS GCONF GSETTINGS
NOTIFY ACL GNUTLS LIBXML2 FREETYPE M17N_FLT LIBOTF XFT ZLIB
TOOLKIT_SCROLL_BARS GTK3 X11 LIBSYSTEMD

Important settings:
value of $LC_COLLATE: en_US.UTF-8
value of $LANG: en_US.UTF-8
locale-coding-system: utf-8-unix

Major mode: JavaScript

Minor modes in effect:
tooltip-mode: t
global-eldoc-mode: t
electric-indent-mode: t
mouse-wheel-mode: t
tool-bar-mode: t
menu-bar-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
blink-cursor-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
line-number-mode: t
transient-mark-mode: t

Load-path shadows:
None found.

Features:
(shadow sort mail-extr cl-extra help-fns radix-tree cl-seq help-mode
debug emacsbug message subr-x puny dired dired-loaddefs format-spec
rfc822 mml mml-sec password-cache epa derived epg epg-config gnus-util
rmail rmail-loaddefs mm-decode mm-bodies mm-encode mail-parse rfc2231
mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums
mm-util mail-prsvr mail-utils js advice sgml-mode dom json map seq
byte-opt bytecomp byte-compile cconv imenu thingatpt cc-mode cc-fonts
easymenu cc-guess cc-menus cc-cmds cc-styles cc-align cc-engine cc-vars
cc-defs cl gv cl-loaddefs cl-lib time-date mule-util tooltip eldoc
electric uniquify ediff-hook vc-hooks lisp-float-type mwheel term/x-win
x-win term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list replace newcomment text-mode elisp-mode lisp-mode
prog-mode register page menu-bar rfn-eshadow isearch timer select
scroll-bar mouse jit-lock font-lock syntax facemenu font-core
term/tty-colors frame cl-generic cham georgian utf-8-lang misc-lang
vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms cp51932
hebrew greek romanian slovak czech european ethiopic indian cyrillic
chinese composite charscript charprop case-table epa-hook jka-cmpr-hook
help simple abbrev obarray minibuffer cl-preloaded nadvice loaddefs
button faces cus-face macroexp files text-properties overlay sha1 md5
base64 format env code-pages mule custom widget hashtable-print-readable
backquote dbusbind inotify dynamic-setting system-font-setting
font-render-setting move-toolbar gtk x-toolkit x multi-tty
make-network-process emacs)

Memory information:
((conses 16 133992 34676)
(symbols 48 23645 1)
(miscs 40 95 584)
(strings 32 30245 2336)
(string-bytes 1 974865)
(vectors 16 20847)
(vector-slots 8 721119 47152)
(floats 8 53 405)
(intervals 56 801 49)
(buffers 976 14))




Andreas Schwab
On Jun 17 2017, Adam Niederer <[hidden email]> wrote:

> let x = /* 👍 */ { foo: 0
>                    bar: 0 }

(char-width ?👍) => 2

> let x = /* ☺ */ { foo: 0
>                   bar: 0 }

(char-width ?☺) => 1

Andreas.

--
Andreas Schwab, [hidden email]
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



Eli Zaretskii
In reply to this post by Adam Niederer
> From: Adam Niederer <[hidden email]>
> Date: Sat, 17 Jun 2017 02:28:41 -0400
>
> Hello, I believe I've found an indentation issue. To reproduce, start
> emacs, create a buffer in js-mode, paste in this code, and press C-x h
> TAB to indent the buffer:
>
> let x = /* 👍 */ { foo: 0
>                    bar: 0 }
>
> let x = /* ☺ */ { foo: 0
>                   bar: 0 }
>
> Both 25.2 and 26.0.50 add one extra space before "bar" in the first
> first snippet with U+1F44D THUMBS UP SIGN in the comment, whereas the
> second snippet with U+263A WHITE SMILING FACE properly aligns "bar" with
> "foo".

That's because U+1F44D is a double-width character:

  (char-width ?👍) => 2

while U+263A is not double-width.

So as long as indentation works in columns and not in pixels, this is
a "feature".
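
To make the column arithmetic concrete, here is a rough sketch in Python
(not Emacs code). It approximates char-width with the East Asian Width
property from Python's unicodedata module; display_columns is a
hypothetical helper name, and the 2-column rule for Wide/Fullwidth
characters is an assumption that only mirrors the idea behind Emacs's
own width table.

```python
import unicodedata

def display_columns(text):
    """Approximate the display width of text in columns.

    Assumption: characters whose East Asian Width is Wide (W) or
    Fullwidth (F) take 2 columns and everything else takes 1.  Emacs
    consults its own char-width-table, so this is only an illustration.
    """
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )

# Text preceding "foo" on the first line of each JavaScript snippet.
thumbs_prefix = "let x = /* \U0001F44D */ { "   # U+1F44D THUMBS UP SIGN
smiley_prefix = "let x = /* \u263A */ { "       # U+263A WHITE SMILING FACE

print(display_columns(thumbs_prefix))
print(display_columns(smiley_prefix))
```

With a Unicode 9 or later character database, the first prefix should
come out one column wider than the second, which matches the one extra
space before "bar" in the report.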

> This appears to happen whenever the character in the comment needs a
> surrogate pair.

I don't believe surrogates have anything to do with this, since Emacs
works with Unicode codepoints, not their UTF-16 encodings.
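
A small Python sketch (again, not Emacs code) suggests that needing a
surrogate pair and being double-width are independent properties. The
helper utf16_units and the choice of sample characters are mine, picked
only for illustration.

```python
import unicodedata

def utf16_units(ch):
    """Return how many UTF-16 code units are needed to encode ch."""
    return len(ch.encode("utf-16-le")) // 2

samples = [
    "\u4E2D",       # U+4E2D, a CJK ideograph inside the BMP
    "\U0001D400",   # U+1D400 MATHEMATICAL BOLD CAPITAL A, outside the BMP
    "\U0001F44D",   # U+1F44D THUMBS UP SIGN, outside the BMP
    "\u263A",       # U+263A WHITE SMILING FACE, inside the BMP
]

# Print code point, UTF-16 unit count, and East Asian Width class.
for ch in samples:
    print(f"U+{ord(ch):04X}", utf16_units(ch), unicodedata.east_asian_width(ch))
```

U+4E2D is wide yet needs no surrogate pair, while U+1D400 needs one and
is not classified as wide, so the width class, not the UTF-16 encoding,
is what the column count follows.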

> Interestingly, pressing TAB with one's point on the second line of each
> snippet to dedent the line yields a correct result for both symbols:
>
> "👍", {"a": 2,
>     "b": 3}
>
> "☺", {"a":2,
>     "b":3}

Which is probably a subtle bug: this should behave like the first
snippet.

Thanks.



Andreas Schwab
On Jun 17 2017, Eli Zaretskii <[hidden email]> wrote:

> That's because U+1F44D is a double-width character:
>
>   (char-width ?👍) => 2

The list in international/character.el is outdated.

> So as long as indentation works in columns and not in pixels, this is
> a "feature".

You surely don't want indentation to depend on font selection.

Andreas.



Eli Zaretskii
> From: Andreas Schwab <[hidden email]>
> Cc: Adam Niederer <[hidden email]>,  [hidden email]
> Date: Sat, 17 Jun 2017 10:24:41 +0200
>
> On Jun 17 2017, Eli Zaretskii <[hidden email]> wrote:
>
> > That's because U+1F44D is a double-width character:
> >
> >   (char-width ?👍) => 2
>
> The list in international/character.el is outdated.

I think the intent was to produce it from the Unicode data
(EastAsianWidth.txt).  I don't recall why this didn't happen; patches
are welcome.  Alternatively, synching the data with the latest Unicode
manually would be good as a stopgap.

> > So as long as indentation works in columns and not in pixels, this is
> > a "feature".
>
> You surely don't want indentation to depend on font selection.

Patches for doing indentation in pixels are welcome, of course.



Andreas Schwab
On Jun 17 2017, Eli Zaretskii <[hidden email]> wrote:

> I think the intent was to produce it from the Unicode data
> (EastAsianWidth.txt).  I don't recall why this didn't happen; patches
> are welcome.  Alternatively, synching the data with the latest Unicode
> manually would be good as a stopgap.

Actually, even Unicode 10 lists it as double width.

Andreas.



Eli Zaretskii
> From: Andreas Schwab <[hidden email]>
> Cc: [hidden email],  [hidden email]
> Date: Sat, 17 Jun 2017 14:09:44 +0200
>
> On Jun 17 2017, Eli Zaretskii <[hidden email]> wrote:
>
> > I think the intent was to produce it from the Unicode data
> > (EastAsianWidth.txt).  I don't recall why this didn't happen; patches
> > are welcome.  Alternatively, synching the data with the latest Unicode
> > manually would be good as a stopgap.
>
> Actually, even Unicode 10 lists it as double width.

OK, then why did you say the data was outdated?



Andreas Schwab
On Jun 17 2017, Eli Zaretskii <[hidden email]> wrote:

> OK, then why did you say the data was outdated?

Because it was.

Andreas.



Eli Zaretskii
> From: Andreas Schwab <[hidden email]>
> Cc: [hidden email],  [hidden email]
> Date: Sat, 17 Jun 2017 20:07:56 +0200
>
> On Jun 17 2017, Eli Zaretskii <[hidden email]> wrote:
>
> > OK, then why did you say the data was outdated?
>
> Because it was.

Where's the up-to-date data we could use?