bug#23086: 25.1.50; Emacs ignores Unicode line and paragraph separator characters

Philipp Stephani

Type some characters
C-x 8 RET LINE SEPARATOR (or PARAGRAPH SEPARATOR)
Type some more characters
M-q

Expected behavior: Emacs treats these characters as line and paragraph
separators: they are displayed as line breaks, M-q doesn't remove them,
and forward-paragraph etc. treat the paragraph separator as paragraph
end.

Actual behavior: These characters are displayed as one-pixel horizontal
whitespace and are otherwise ignored.

Also discussed in
https://lists.gnu.org/archive/html/emacs-devel/2015-08/msg01043.html.
https://www.emacswiki.org/emacs/unicode-whitespace.el supposedly adds
support for these characters, but I think proper treatment of Unicode
separators should be part of Emacs.
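
For reference, here is a small editorial sketch in Python (not Emacs Lisp,
and not part of the original report). It looks up the Unicode general
categories of the two characters and shows how a runtime that does honor
them splits text. It uses only the standard `unicodedata` module and
`str.splitlines`; the helper name `describe` is made up for illustration.

```python
import unicodedata

LINE_SEP = "\u2028"   # LINE SEPARATOR
PARA_SEP = "\u2029"   # PARAGRAPH SEPARATOR


def describe(ch):
    """Return (Unicode name, general category) for one character."""
    return unicodedata.name(ch), unicodedata.category(ch)


# One string containing both separators and no ASCII newline at all.
text = "first" + LINE_SEP + "second" + PARA_SEP + "third"

# str.splitlines treats both separators as line boundaries.
lines = text.splitlines()

print(describe(LINE_SEP))
print(describe(PARA_SEP))
print(lines)
```

The categories Zl and Zp are each assigned to exactly one character, which
is part of why these two are candidates for special treatment.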



In GNU Emacs 25.1.50.1 (x86_64-unknown-linux-gnu, GTK+ Version 3.10.8)
Repository revision: 780a605e1d2de4b975e6f1f29b491c9af419dcff
Windowing system distributor 'The X.Org Foundation', version 11.0.11501000
System Description: Ubuntu 14.04 LTS

Configured using:
 'configure --with-modules --disable-build-details 'CFLAGS=-g -O0''

Configured features:
XPM JPEG TIFF GIF PNG RSVG SOUND GPM DBUS GCONF GSETTINGS NOTIFY ACL
LIBSELINUX GNUTLS LIBXML2 FREETYPE M17N_FLT LIBOTF XFT ZLIB
TOOLKIT_SCROLL_BARS GTK3 X11 MODULES

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Text

Minor modes in effect:
  tooltip-mode: t
  global-eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent messages:
For information about GNU Emacs and the GNU system, type C-h C-a.
Quit
Fill column set to 10 (was 70)
Quit
Making completion list...

Load-path shadows:
None found.

Features:
(shadow sort mail-extr emacsbug message dired dired-loaddefs format-spec
rfc822 mml easymenu mml-sec password-cache epa derived epg epg-config
gnus-util rmail rmail-loaddefs mm-decode mm-bodies mm-encode mail-parse
rfc2231 mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045
ietf-drums mm-util mail-prsvr mail-utils iso-transl time-date mule-util
tooltip eldoc electric uniquify ediff-hook vc-hooks lisp-float-type
mwheel term/x-win x-win term/common-win x-dnd tool-bar dnd fontset image
regexp-opt fringe tabulated-list newcomment elisp-mode lisp-mode
prog-mode register page menu-bar rfn-eshadow timer select scroll-bar
mouse jit-lock font-lock syntax facemenu font-core term/tty-colors frame
cl-generic cham georgian utf-8-lang misc-lang vietnamese tibetan thai
tai-viet lao korean japanese eucjp-ms cp51932 hebrew greek romanian
slovak czech european ethiopic indian cyrillic chinese charscript
case-table epa-hook jka-cmpr-hook help simple abbrev obarray minibuffer
cl-preloaded nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote dbusbind inotify
dynamic-setting system-font-setting font-render-setting move-toolbar gtk
x-toolkit x multi-tty make-network-process emacs)

Memory information:
((conses 16 174467 8982)
 (symbols 48 30106 0)
 (miscs 40 468 148)
 (strings 32 66519 6641)
 (string-bytes 1 1505951)
 (vectors 16 13333)
 (vector-slots 8 488346 23035)
 (floats 8 167 91)
 (intervals 56 233 2)
 (buffers 976 13)
 (heap 1024 43667 1138))

--
Google Germany GmbH
Erika-Mann-Straße 33
80636 München




Eli Zaretskii
> From: Philipp Stephani <[hidden email]>
> Date: Tue, 22 Mar 2016 11:42:46 +0100
>
> Type some characters
> C-x 8 RET LINE SEPARATOR (or PARAGRAPH SEPARATOR)
> Type some more characters
> M-q
>
> Expected behavior: Emacs treats these characters as line and paragraph
> separators: they are displayed as line breaks, M-q doesn't remove them,
> and forward-paragraph etc. treat the paragraph separator as paragraph
> end.
>
> Actual behavior: These characters are displayed as one-pixel horizontal
> whitespace and otherwise ignore.
>
> Also discussed in
> https://lists.gnu.org/archive/html/emacs-devel/2015-08/msg01043.html.
> https://www.emacswiki.org/emacs/unicode-whitespace.el supposedly adds
> support for these characters, but I think proper treatment of Unicode
> separators should be part of Emacs.

It is not clear to me what exactly is the requested feature.  Can you
propose a detailed list of requirements?

I'm asking because these characters come in Unicode with non-trivial
baggage that is a far cry from just breaking the line; see

  http://unicode.org/reports/tr14/
  http://unicode.org/reports/tr29/

There are also implications on the bidirectional display (it is
sensitive to where the line and the paragraph begin and end).
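
To make the bidi point concrete, here is an editorial sketch in Python
(not Emacs code). It reads the bidirectional class that the Unicode
Character Database assigns to each candidate character. Class "B" marks a
paragraph separator for the bidirectional algorithm, while "WS" is
ordinary whitespace. The short labels in the dictionary are made up for
illustration.

```python
import unicodedata


def bidi_class(ch):
    """Return the Unicode bidirectional class of one character."""
    return unicodedata.bidirectional(ch)


# Hypothetical short labels mapped to the characters under discussion.
candidates = [
    ("LF", "\n"),        # U+000A LINE FEED
    ("NEL", "\u0085"),   # U+0085 NEXT LINE
    ("LS", "\u2028"),    # U+2028 LINE SEPARATOR
    ("PS", "\u2029"),    # U+2029 PARAGRAPH SEPARATOR
]

classes = {label: bidi_class(ch) for label, ch in candidates}
print(classes)
```

So U+2029 ends a bidi paragraph just as a newline does, whereas U+2028
does not, which is one reason the two cannot simply be treated alike.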

If we want to support these two characters, we should think about
which parts of the relevant functionality we want to see in Emacs,
because users will expect that.  In addition, there are other
white-space characters defined by Unicode, and it would make sense to
treat them all alike.  I'm not sure it makes sense to support just the
line-breaking and paragraph-separator parts of only these two
characters.

Then there are Emacs-specific issues, for example:

 . do we treat u+2028 and u+2029 as literal characters, or as a form
   of EOL encoding?
 . if the former, how do we distinguish them from newlines on display?
 . should Isearch find these when looking for "\n"? how about regexp
   search for "$"?
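
As an analogy only, the editorial sketch below shows how Python's `re`
module answers the same search questions today. It says nothing about how
Emacs search behaves or should behave: a literal "\n" finds only real
newlines, and "$" in multiline mode stops only before "\n" or at the end
of the string.

```python
import re

# U+2028 sits at index 5, the real newline at index 10.
text = "alpha\u2028beta\ngamma"

# A literal search for newline does not see U+2028.
newline_hits = [m.start() for m in re.finditer("\n", text)]

# In multiline mode "$" matches before "\n" and at end of string,
# but not before U+2028.
dollar_hits = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]

print(newline_hits)
print(dollar_hits)
```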

There are probably more implications; these are just the ones that
popped into my mind in 5 sec.  IOW, I think Someone™ should think this
over and present a detailed proposal.

Thanks.



John Wiegley
>>>>> Eli Zaretskii <[hidden email]> writes:

> There are probably more implications, these just the ones that popped in my
> mind in 5 sec. IOW, I think Someone™ should think this over and present a
> detailed proposal.

Very much agreed. Reading this bug description gives me that "There be
dragons" feeling. :)

--
John Wiegley                  GPG fingerprint = 4710 CF98 AF9B 327B B80F
http://newartisans.com                          60E1 46C4 BD1A 7AC1 4BA2



Eli Zaretskii
In reply to this post by Eli Zaretskii
> Date: Tue, 22 Mar 2016 18:13:15 +0200
> From: Eli Zaretskii <[hidden email]>
> Cc: [hidden email]
>
> > From: Philipp Stephani <[hidden email]>
> > Date: Tue, 22 Mar 2016 11:42:46 +0100
> >
> > Type some characters
> > C-x 8 RET LINE SEPARATOR (or PARAGRAPH SEPARATOR)
> > Type some more characters
> > M-q
> >
> > Expected behavior: Emacs treats these characters as line and paragraph
> > separators: they are displayed as line breaks, M-q doesn't remove them,
> > and forward-paragraph etc. treat the paragraph separator as paragraph
> > end.
> >
> > Actual behavior: These characters are displayed as one-pixel horizontal
> > whitespace and otherwise ignore.
> >
> > Also discussed in
> > https://lists.gnu.org/archive/html/emacs-devel/2015-08/msg01043.html.
> > https://www.emacswiki.org/emacs/unicode-whitespace.el supposedly adds
> > support for these characters, but I think proper treatment of Unicode
> > separators should be part of Emacs.
>
> It is not clear to me what exactly is the requested feature.  Can you
> propose a detailed list of requirements?
>
> I'm asking because these characters come in Unicode with a non-trivial
> baggage, that is a far cry from just breaking the line; see
>
>   http://unicode.org/reports/tr14/
>   http://unicode.org/reports/tr29/
>
> There are also implications on the bidirectional display (it is
> sensitive to where the line and the paragraph begin and end).
>
> If we want to support these two characters, we should think about
> which parts of the relevant functionality we want to see in Emacs,
> because users will expect that.  In addition, there are other
> white-space characters defined by Unicode, and it would make sense to
> treat them all alike.  I'm not sure it makes sense to support just the
> line-breaking and paragraph-separator parts of only these two
> characters.
>
> Then there are Emacs-specific issues, for example:
>
>  . do we treat u+2028 and u+2029 as literal characters, or as a form
>    of EOL encoding?
>  . if the former, how do we distinguish them from newlines on display?
>  . should Isearch find these when looking for "\n"? how about regexp
>    search for "$"?
>
> There are probably more implications, these just the ones that popped
> in my mind in 5 sec.  IOW, I think Someone™ should think this over and
> present a detailed proposal.

So I've dusted off this year-old bug report and decided to improve
Emacs in this area.  Here's what I propose:

 . u+2028 and u+2029 (and also perhaps u+0085) will be treated as a form
   of EOL encoding, which means they will not appear on display, and
   will cause the next character to be displayed on the next screen line
 . M-q will remove u+2028, as it removes newlines, and put newlines
   at all EOLs as part of filling
 . M-q will NOT remove u+2029, unless the user wants to refill several
   paragraphs as a single paragraph, and there happens to be a u+2029
   between some of the paragraphs
 . forward-paragraph etc. will treat u+2029 as paragraph end
 . bidi reordering will treat u+2029 as paragraph end
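
The filling rules above can be modeled roughly as follows. This is an
editorial sketch in Python, not the proposed Emacs implementation: the
function name `refill` and the use of `textwrap` are assumptions made for
illustration. It treats u+2028 and newline as soft line ends that filling
may remove, and u+2029 as a hard paragraph boundary that survives.

```python
import textwrap

LINE_SEP = "\u2028"   # soft line end, removed by filling
PARA_SEP = "\u2029"   # paragraph end, preserved by filling


def refill(text, width=70):
    """Refill each u+2029-delimited paragraph to the given width."""
    filled = []
    for para in text.split(PARA_SEP):
        # Join the paragraph into one logical line: u+2028 and newline
        # both count as line ends that filling is allowed to move.
        joined = para.replace(LINE_SEP, " ").replace("\n", " ")
        # Re-break with ordinary newlines, as filling would.
        filled.append(textwrap.fill(joined, width=width))
    # Paragraph separators are put back unchanged.
    return PARA_SEP.join(filled)


result = refill("aa\u2028bb cc\u2029dd\nee", width=5)
print(repr(result))
```

The sketch omits the case where the user deliberately refills several
paragraphs as one, in which a u+2029 in between would be removed as well.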

There are some compromises in these decisions, but they make the job
much easier and less intrusive, and I think they will advance the
level of our Unicode support quite a bit.

Comments?

I think we should also make $ match before these two characters, in
addition to the newline, but that could be more difficult.  Would someone
who knows their way around regex.c want to work on this part?
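
To show the intended matching behavior, though not how regex.c would
implement it, here is an editorial sketch in Python. It emulates an
end-of-line anchor that also stops before u+2028 and u+2029 by using a
lookahead. The name `EOL` is made up for illustration.

```python
import re

# Zero-width "end of line": before newline, u+2028 or u+2029, or at the
# very end of the string.
EOL = r"(?=[\n\u2028\u2029]|\Z)"

# u+2028 at index 5, newline at index 10, end of string at index 16.
text = "alpha\u2028beta\ngamma"

eol_hits = [m.start() for m in re.finditer(EOL, text)]
print(eol_hits)
```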


