bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator

Itai Berli
According to the Emacs manual (section 37.26 Bidirectional Display)

>  Emacs provides a “Full Bidirectionality” class implementation of the
>  UBA, consistent with the requirements of the Unicode Standard v8.0.

And again (section 22.19 Bidirectional Editing)

> Emacs implements the Unicode Bidirectional Algorithm described in the Unicode Standard Annex #9, for reordering of bidirectional text for display.

However, these statements are false. Emacs does not implement the Unicode
Bidirectional Algorithm correctly, and therefore does not even provide
'Implicit bidirectionality', which is the minimal level of conformance
listed in section 4.2 'Explicit Formatting Characters' of the Unicode
8.0.0 Bidirectional Algorithm specification
(www.unicode.org/reports/tr9/tr9-33.html), let alone 'Full bidirectionality'.

The reason has to do with the way the Emacs bidi implementation
recognizes separate paragraphs, which is inconsistent with the Unicode
specifications.

The Unicode Bidirectional Algorithm specifies (section 3, 'Basic
Display Algorithm'):

> The algorithm reorders text only within a paragraph; characters in one
> paragraph have no effect on characters in a different
> paragraph. Paragraphs are divided by the Paragraph Separator or
> appropriate Newline Function (for guidelines on the handling of CR,
> LF, and CRLF, see Section 4.4, Directionality, and Section 5.8,
> Newline Guidelines of [Unicode]).

However, Emacs, by its own admission (section 22.19 Bidirectional
Editing), takes the following approach:

> Paragraph boundaries are empty lines, i.e., lines consisting entirely of whitespace characters.

I'll repeat: according to Unicode a paragraph ends with a paragraph
separator. What constitutes a paragraph separator is specified precisely
in section 5.8 'Newline Guidelines' of The Unicode Standard version
8.0.0. For instance, on a MacOS X system, it is `LF` (line feed,
Unicode 000A). The formatting effects of the bidi algorithm must not
cross the paragraph separator boundary.

And yet in Emacs the formatting extends beyond the paragraph separator,
and this is the case on all operating systems. Consider, for instance,
the following example.

ILLUSTRATION: An English paragraph directly following a Hebrew paragraph
is formatted like Hebrew text.
http://imgur.com/3eyrUfA

The first, Hebrew paragraph is formatted correctly. The second, English
paragraph, however, is formatted wrongly, as though it were a Hebrew
paragraph: it is right-justified, the question mark appears on the left,
and so does the cursor. Once an empty paragraph is inserted between the two
paragraphs, the English paragraph is formatted correctly.

ILLUSTRATION: When paragraphs are separated by an empty paragraph, they
are formatted correctly.
http://imgur.com/ZsHGkwf
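
The two paragraph conventions at issue can be compared with a minimal Python sketch. It is an illustration only, not Emacs's display code: `base_direction` applies a simplified form of UBA rules P2 and P3 (first strong character, defaulting to LTR, without the special handling of isolates), and the helper names `split_on_newlines` and `split_on_blank_lines` are hypothetical.

```python
import re
import unicodedata


def base_direction(paragraph):
    """Simplified P2/P3: the first strong character decides the direction."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return "LTR"  # P3 default when no strong character is found


def split_on_newlines(text):
    """Treat every newline as a paragraph separator."""
    return [p for p in text.split("\n") if p.strip()]


def split_on_blank_lines(text):
    """Treat only whitespace-only lines as paragraph boundaries."""
    return [p for p in re.split(r"\n[ \t]*\n", text) if p.strip()]


# A Hebrew line directly followed by an English line, as in the first screenshot.
text = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd\nHello, world?"

print([base_direction(p) for p in split_on_newlines(text)])     # ['RTL', 'LTR']
print([base_direction(p) for p in split_on_blank_lines(text)])  # ['RTL']
```

With newline-delimited paragraphs the English line gets its own LTR base direction. With blank-line-delimited paragraphs it belongs to the same paragraph as the Hebrew line above it and so takes that paragraph's RTL direction.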

This is not just a theoretical question of conformance to standards;
this problem has practical consequences.

Consider, for instance, a LaTeX document for typesetting Hebrew
text. Normally, in order to eliminate the usual leading indentation of
the first line of a paragraph, a `\noindent` command is placed at the
beginning of the paragraph. However, because the Unicode bidi algorithm
determines the directionality of a paragraph based on its first word, the
Hebrew text is formatted like English text. This is not a problem; it is
to be expected.

ILLUSTRATION: A LaTeX document for typesetting a Hebrew paragraph with
no indentation of the first line.
http://imgur.com/xYUkZKr

One way to resolve this is to explicitly change the directionality of the
paragraph. However, setting aside the fact that this is not currently
possible due to a separate Emacs bug, even if it were possible, it would
affect the placement of the backslash at the beginning of the
`\noindent` command, which would no longer look like a LaTeX command.

ILLUSTRATION: Explicitly changing the directionality of the
paragraph.
http://imgur.com/sPcVReA

(Note: This is a screenshot of Microsoft Word, since, due to a bug,
Emacs currently does not allow changing the automatically determined
directionality of a paragraph.)

So the best way to resolve this problem would be to place the `\noindent`
command in a separate paragraph. Unfortunately, here Emacs' faulty
implementation of the Unicode bidi algorithm rears its ugly
head. Since Emacs doesn't recognize the paragraph separator for what it
is, it formats the Hebrew text wrongly, as though it were English text.

ILLUSTRATION: Putting the `\noindent` on a separate paragraph results in
the Hebrew text being formatted like English text
http://imgur.com/44ds6rK

Placing an empty paragraph between the `\noindent` command and the
Hebrew text will resolve the formatting problem inside the Emacs editor, but
now the `\noindent` command, which only affects the current LaTeX
paragraph (LaTeX paragraphs are ended by an empty line), no longer
eliminates the indentation of the first line of the Hebrew paragraph in
the typeset file.



In GNU Emacs 25.1.1 (x86_64-apple-darwin13.4.0, NS appkit-1265.21
Version 10.9.5 (Build 13F1911))
 of 2016-09-21 built on builder10-9.porkrind.org
Windowing system distributor 'Apple', version 10.3.1504
Configured using:
 'configure --with-ns '--enable-locallisppath=/Library/Application
 Support/Emacs/${version}/site-lisp:/Library/Application
 Support/Emacs/site-lisp' --with-modules'

Configured features:
NOTIFY ACL GNUTLS LIBXML2 ZLIB TOOLKIT_SCROLL_BARS NS MODULES

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Fundamental

Minor modes in effect:
  ivy-mode: t
  shell-dirtrack-mode: t
  projectile-mode: t
  helm-descbinds-mode: t
  async-bytecomp-package-mode: t
  tooltip-mode: t
  global-eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  buffer-read-only: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent messages:
ad-handle-definition: ‘ibuffer’ got redefined
Turn on helm-projectile key bindings
For information about GNU Emacs and the GNU system, type C-h C-a.

Load-path shadows:
/Users/itaiberli/.emacs.d/elpa/seq-2.20/seq hides
/Applications/Emacs.app/Contents/Resources/lisp/emacs-lisp/seq

Features:
(shadow sort mail-extr emacsbug message rfc822 mml mml-sec epg mm-decode
mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader
sendmail rfc2047 rfc2045 ietf-drums mail-utils colir color counsel
jka-compr esh-util etags xref project swiper reftex reftex-vars
two-column ivy delsel ivy-overlay helm-projectile helm-files rx
image-dired tramp tramp-compat tramp-loaddefs trampver shell pcomplete
format-spec dired-x dired-aux ffap helm-tags helm-bookmark helm-adaptive
helm-info bookmark pp helm-external helm-net browse-url xml url
url-proxy url-privacy url-expand url-methods url-history url-cookie
url-domsuf url-util url-parse auth-source gnus-util mm-util help-fns
mail-prsvr password-cache url-vars mailcap helm-buffers helm-grep
helm-regexp helm-utils helm-locate helm-help helm-types projectile grep
compile comint ansi-color ring ibuf-ext ibuffer thingatpt helm-descbinds
helm easy-mmode helm-source cl-seq eieio-compat eieio eieio-core
helm-multi-match helm-lib dired helm-config helm-easymenu cl-macs
async-bytecomp async advice edmacro kmacro finder-inf tex-site info
package epg-config seq byte-opt gv bytecomp byte-compile cl-extra
help-mode easymenu cconv cl-loaddefs pcase cl-lib time-date mule-util
tooltip eldoc electric uniquify ediff-hook vc-hooks lisp-float-type
mwheel ns-win ucs-normalize term/common-win tool-bar dnd fontset image
regexp-opt fringe tabulated-list newcomment elisp-mode lisp-mode
prog-mode register page menu-bar rfn-eshadow timer select scroll-bar
mouse jit-lock font-lock syntax facemenu font-core frame cl-generic cham
georgian utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao
korean japanese eucjp-ms cp51932 hebrew greek romanian slovak czech
european ethiopic indian cyrillic chinese charscript case-table epa-hook
jka-cmpr-hook help simple abbrev minibuffer cl-preloaded nadvice
loaddefs button faces cus-face macroexp files text-properties overlay
sha1 md5 base64 format env code-pages mule custom widget
hashtable-print-readable backquote kqueue cocoa ns multi-tty
make-network-process emacs)

Memory information:
((conses 16 312045 13704)
 (symbols 48 30403 0)
 (miscs 40 88 192)
 (strings 32 51754 11765)
 (string-bytes 1 1669992)
 (vectors 16 50218)
 (vector-slots 8 844617 7052)
 (floats 8 564 218)
 (intervals 56 242 111)
 (buffers 976 18))




bug#27526: Explicit directionality marks CAN be inserted!

Itai Berli
I'd like to retract the statement I made in the LaTeX example that
inserting explicit directionality marks doesn't work in Emacs. It
does.




bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator

Eli Zaretskii
In reply to this post by Itai Berli
> From: Itai Berli <[hidden email]>
> Date: Thu, 29 Jun 2017 12:16:00 +0300
>
> I'll repeat: according to Unicode a paragraph ends with a paragraph
> separator. What constitutes a paragraph separator is specified precisely
> in section 5.8 'Newline Guidelines' of The Unicode Standard version
> 8.0.0. For instance, on a MacOS X system, it is `LF` (line feed,
> Unicode 000A). The formatting effects of the bidi algorithm must not
> cross the paragraph separator boundary.
>
> And yet in Emacs the formatting extends beyond the paragraph separator,
> and this is the case on all operating systems. Consider, for instance,
> the following example.

The UBA allows applications to employ "higher-level protocols" when
deciding on base paragraph direction.  See section 4.3 in UAX#9 and
specifically clause HL1 there.

This is what Emacs does: it applies its own heuristics for this
decision.  The reason for that is that Emacs's implementation of the
UBA must work reasonably well in plain-text buffers, where typically
long paragraphs are broken into lines by newline characters (which are
paragraph separators according to the UBA), and many times the
partition into lines is done by auto-fill or similar features, thus
making the first character of the next line fairly arbitrary.  Using
the UBA paragraph-direction determination would then produce
unacceptable results, whereby the direction of a part of a paragraph
could change in unpredictable ways when text is refilled.

> Consider, for instance, a LaTeX document for typesetting Hebrew
> text. Normally, in order to eliminate the usual leading indentation of
> the first line of a paragraph, a `\noindent` command is placed at the
> beginning of the paragraph. However, because the Unicode bidi algorithm
> determines the directionality of a paragraph based on its first word, the
> Hebrew text is formatted like English text. This is not a problem; it is
> to be expected.

The Emacs bidirectional display doesn't have special facilities for
marked-up text, such as TeX and HTML/XML.  Because those markups use
punctuation characters for their markup, doing so in RTL context can
produce unpleasant results in the default display, as you point out.

You can alleviate this to some extent by (in your case) starting the
paragraph with an RLM control character before \noindent, optionally
followed by an LRM or enclosing \noindent in LRE..PDF (so that the
backslash displays to the left of "noindent").  This is admittedly a
bit awkward, but I think the results are still acceptable.
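
The effect of the suggested control characters on the first-strong-character rule can be checked with a short Python sketch. It only inspects Unicode bidi classes through the standard `unicodedata` module and says nothing about how Emacs renders the line; the helper `first_strong_class` and the sample Hebrew word are assumptions made for illustration.

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK, bidi class R
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def first_strong_class(text):
    """Return the bidi class of the first strong character, or None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("L", "R", "AL"):
            return bidi_class
    return None


hebrew = "\u05e9\u05dc\u05d5\u05dd"  # sample Hebrew word (assumption)

# Without marks, the "n" of \noindent is the first strong character.
plain = "\\noindent " + hebrew

# With a leading RLM, and \noindent wrapped in LRE..PDF, the RLM comes first.
marked = RLM + LRE + "\\noindent" + PDF + " " + hebrew

print(first_strong_class(plain))   # L
print(first_strong_class(marked))  # R
```

The backslash itself is a neutral character, which is why its placement depends on the surrounding direction and why the LRE..PDF pair is suggested to keep it to the left of "noindent".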

I will gladly work with anyone who'd volunteer to introduce features
required to better support markup languages.  This will require
low-level display changes and some support from the relevant major
modes to use those features.  For now, the demand was sufficiently low
(I think you are about the second person to raise the issue since
bidirectional display debuted in Emacs 24.1) to keep this issue low on
our TODO.

> One way to resolve this is to explicitly change the directionality of the
> paragraph. However, setting aside the fact that this is not currently
> possible due to a separate Emacs bug, even if it were possible, it would
> affect the placement of the backslash at the beginning of the
> `\noindent` command, which would no longer look like a LaTeX command.

I think my suggestion above fixes this latter issue as well.

Thanks.




bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator

Itai Berli
In reply to this post by Itai Berli
> The UBA allows applications to employ "higher-level protocols" when
> deciding on base paragraph direction.  See section 4.3 in UAX#9 and specifically clause HL1 there.

> This is what Emacs does: it applies its own heuristics for this
> decision.  The reason for that is that Emacs's implementation of the
> UBA must work reasonably well in plain-text buffers, where typically
> long paragraphs are broken into lines by newline characters (which are
> paragraph separators according to the UBA), and many times the
> partition into lines is done by auto-fill or similar features, thus
> making the first character of the next line fairly arbitrary.  Using
> the UBA paragraph-direction determination would then produce
> unacceptable results, whereby the direction of a part of a paragraph
> could change in unpredictable ways when text is refilled.

As I understand it, the "higher-level protocols" provision is intended
to allow for such things as table cells, elements of structured markup
languages, and word processors that use an idiosyncratic
implementation of a paragraph separator *under the hood*. It is not
intended for plain running text; for this the standard specifies
explicitly what the paragraph separators for every operating system
are.

> typically long paragraphs are broken into lines by newline characters

I see no evidence of the validity of this statement on my system (Emacs
25.1.1). But even if this were so, it would still not merit
*hard-coding* the paragraph separator as a blank line, as there are
situations (such as the one I presented in my bug report) that require
a different configuration.

> You can alleviate this to some extent by ...(in your case) starting
> the paragraph with an RLM control character before \noindent,
> optionally followed by an LRM or enclosing \noindent in LRE..PDF (so
> that the backslash displays to the left of "noindent").  This is
> admittedly a bit awkward, but I think the results are still acceptable.

As you mentioned, the solution is cumbersome. It might have been
acceptable if this were the sole issue, but this example illustrates just one of
several problems that arise due to the current paragraph separator
convention.

In conclusion, and on a personal note, I implore you to change this
behavior, and to do so as soon as possible, and not only for specialized
markup documents, but for every document.

I am currently working on my thesis. Emacs is useless to me as a text
editor of Hebrew texts without this feature. This is no
exaggeration.

The original reason I chose Emacs over other editors was the
combination of AUCTeX and the promise of full Unicode
compatibility. AUCTeX has delivered on its promise, but in the area of
Unicode, as far as my needs are concerned, it is as if there were no Unicode
support at all, and I will sadly be forced to look for a different editor.




bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator

Itai Berli
I'd like to add another reason why this behavior is problematic: it breaks interoperability with other plain text editors, since the text will not be displayed the same way. Consider, for instance, the very same plain text file
in GEdit: http://imgur.com/Iw4yrdQ
in Emacs: http://imgur.com/7kfWseE

Finally, the question of whether Emacs's behavior is consistent with the UBA specification is debatable. UBA section 3 states "Paragraphs may also be determined by higher-level protocols", and the question is what exactly the "also" means: is it that the higher-level protocols (HLP) can decide that a newline character is not a paragraph boundary, as Emacs does, or is it that the HLP can only declare paragraph boundaries in addition to paragraph separator characters?


bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator

Eli Zaretskii
> From: Itai Berli <[hidden email]>
> Date: Tue, 4 Jul 2017 13:42:19 +0300
>
> I'd like to add another reason why this behavior is problematic: it breaks interoperability with other plain text
> editors, since the text will not be displayed the same way. Consider, for instance, the very same plain text file
> in GEdit: http://imgur.com/Iw4yrdQ
> in Emacs: http://imgur.com/7kfWseE

As I already explained, the behavior of GEdit is unacceptable for
Emacs, because most modes derived from Text mode tend to deal with
buffers where lines are broken by newlines, so potentially switching
paragraph direction just because a newline happens to be there would
have a devastating effect on the text as displayed.  This is perhaps in
contrast with other editors and word-processors which mostly deal with
long lines without hard newlines.  That's why the notion of paragraph
in Emacs's UBA implementation was chosen to fit the traditional Emacs
definition of paragraph in text-mode and its derivatives.

> Finally, the question of whether Emacs behavior is consistent with the UBA specifications is debatable, since
> when UBA section 3 states "Paragraphs may also be determined by higher-level protocols" the question is
> what exactly the "also" means: is it that the higher-level protocols (HLP) can decide that a newline character is
> not a paragraph boundary, as Emacs does, or is it that the HLP can only declare paragraph boundaries in
> addition to paragraph separator characters?

It is clear from the context and the example following the above
sentence that "also" doesn't mean "in addition".

However, the main issue is not the paragraph boundary, the main issue
is how the base direction of the paragraph is determined.  Because no
matter where the paragraph boundary is, if the base direction is not
recalculated there, then the fact that the boundary is there doesn't
matter.

From Section 4.3 Higher-Level Protocols of the UAX#9:

  HL1. Override P3, and set the paragraph embedding level
       explicitly. This does not apply when deciding how to treat FSI
       in rule X5c.

       . A higher-level protocol may set any paragraph level. This can
         be done on the basis of the context, such as on a table cell,
         paragraph, document, or system level. (P2 may be skipped if
         P3 is overridden). [...]
       . A higher-level protocol may apply rules equivalent to P2 and
         P3 but default to level 1 (RTL) rather than 0 (LTR) to match
         overall RTL context.
       . A higher-level protocol may use an entirely different
         algorithm that heuristically auto-detects the paragraph
         embedding level based on the paragraph text and its
         context. For example, it could base it on whether there are
         more RTL characters in the text than LTR. As another example,
         when the paragraph contains no strong characters, its
         direction could be determined by the levels of the paragraphs
         before and after.

And Section 3.3.1, which describes the P1, P2, and P3 paragraph-level
rules, says:

  Whenever a higher-level protocol specifies the paragraph level,
  rules P2 and P3 may be overridden: see HL1.

So an application is allowed to override _all_ of the paragraph-level
rules, and do what suits it best.  And based on some non-negligible
experience with bidi-aware applications, I submit that an application
that does _not_ employ some higher-level protocol for base paragraph
direction will violate user expectations when working with plain text.
E.g., try reading in MS Outlook an unformatted text message which has
a lot of RTL text mixed with LTR.  It's unreadable; I always
copy/paste it into Emacs, and only then am I able to read it.
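
One of the heuristics that HL1 permits, comparing the number of RTL and LTR characters, can be sketched in Python as follows. This is not presented as the heuristic Emacs uses; the function name `majority_direction` and the sample text are hypothetical, and only strong characters (bidi classes L, R and AL) are counted.

```python
import unicodedata


def majority_direction(paragraph, default="LTR"):
    """Pick the base direction by counting strong RTL vs. LTR characters."""
    rtl = 0
    ltr = 0
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            ltr += 1
        elif bidi_class in ("R", "AL"):
            rtl += 1
    if rtl > ltr:
        return "RTL"
    if ltr > rtl:
        return "LTR"
    return default  # tie, or no strong characters at all


# \noindent (8 Latin letters) followed by 15 Hebrew letters.
sample = (
    "\\noindent "
    "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd "
    "\u05d5\u05d1\u05e8\u05d5\u05db\u05d9\u05dd"
)

print(majority_direction(sample))   # RTL
print(majority_direction("Hello"))  # LTR
```

Under this rule a Hebrew paragraph that begins with `\noindent` would still resolve to RTL, because the Latin letters of the command are outnumbered by the Hebrew letters that follow.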




bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator

Itai Berli
> As I already explained, the behavior of GEdit is unacceptable for
> Emacs, because most modes derived from Text mode tend to deal with
> buffers where lines are broken by newlines, so potentially switching
> paragraph direction just because a newline happens to be there would
> have a devastating effect on the text as displayed.

How about letting the user decide what's best for them? Would it be possible to add an option to Emacs that a user can set, say, in their .emacs file, which will determine whether the bidi implementation will consider the newline character or an empty line as the paragraph separator?


bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator

Eli Zaretskii
> From: Itai Berli <[hidden email]>
> Date: Tue, 4 Jul 2017 18:57:33 +0300
>
> How about letting the user decide what's best for them? Would it be possible to add an option to Emacs that a
> user can set, say, in their .emacs file, which will determine whether the bidi implementation will consider the
> newline character as the paragraph separator or an empty line?

Could be.  I'd need to carefully review the code to say for sure.
Originally, the regexp which defines where a paragraph begins was
customizable, but it led to grave bugs, so I removed that.  Maybe a
more restricted facility could avoid such pitfalls.




bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator

Itai Berli
If you can do it, that'll be fantastic. And while you're perusing the code, perhaps you can see if it is also possible to allow the user to decide whether they want the bidi control characters to be visible or not.


bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator

Eli Zaretskii
> From: Itai Berli <[hidden email]>
> Date: Tue, 4 Jul 2017 19:37:04 +0300
>
> And while you're perusing the code, perhaps you can see if it is also
> possible to allow the user to decide whether they want the bidi control characters to be visible or not

You can do that already: just customize glyphless-char-display-control
to be 'zero-width' for the 'format-control' class, and these
characters will become invisible.  Didn't I mention that up-thread?




bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator

Itai Berli
You did, but it would be much nicer for a noob like me to be able to simply type in my .emacs file something like: (bidi.markers.visible false), or maybe even

(bidi.markers.ALM null)
(bidi.markers.RLM ⊲)
(bidi.markers.LRM ⊳)
...

Isn't the Bidi feature important and complicated enough to merit its own tailored set of customizable parameters?



Eli Zaretskii
> From: Itai Berli <[hidden email]>
> Date: Tue, 4 Jul 2017 20:01:25 +0300
>
> You did, but it would be much nicer for a noob like me to be able to simply type in my .emacs file something
> like: (bidi.markers.visible false), or maybe even
>
> (bidi.markers.ALM null)
> (bidi.markers.RLM ⊲)
> (bidi.markers.LRM ⊳)

Sorry, I don't see why the exact way to customize this is so
important.  glyphless-char-display-control is a user-level
customizable variable, not some obscure feature that requires Lisp
programming to tailor it to your needs.

> Isn't the Bidi feature important and complicated enough to merit its own tailored set of customizable
> parameters?

It does have its private customizations, but this one isn't one of
them, and I don't see why it should be.  There are quite a few
characters in the Cf general category, and Emacs handles them all the
same way, because they all have the same nature.



Itai Berli
Is there any progress with allowing the user to customize the end-of-paragraph mark to be the OS paragraph separator character?


Eli Zaretskii
> From: Itai Berli <[hidden email]>
> Date: Wed, 12 Jul 2017 18:10:19 +0300
>
> Is there any progress with allowing the user to customize the end-of-paragraph mark to be the OS paragraph
> separator character?

No, I didn't yet have time to work on that.  (And I think you were
talking about the newline character, not the paragraph separator
character.)



Itai Berli
> I think you were talking about the newline character, not the paragraph separator character.

On UNIX and contemporary macOS it's U+000A (LF), on Windows it's the sequence U+000D U+000A (CR LF).


Eli Zaretskii
> From: Itai Berli <[hidden email]>
> Date: Wed, 12 Jul 2017 18:52:10 +0300
>
> > I think you were talking about the newline character, not the paragraph separator character.
>
> On UNIX and contemporary macOS it's U+000A (LF), on Windows it's the sequence U+000D U+000A (CR
> LF).

Not in the Emacs buffer: there we have only the newline (a.k.a. "LF")
characters.



Eli Zaretskii
In reply to this post by Eli Zaretskii
> Date: Tue, 04 Jul 2017 19:18:39 +0300
> From: Eli Zaretskii <[hidden email]>
> Cc: [hidden email]
>
> > From: Itai Berli <[hidden email]>
> > Date: Tue, 4 Jul 2017 18:57:33 +0300
> >
> > How about letting the user decide what's best for them? Would it be possible to add an option to Emacs that a
> > user can set, say, in their .emacs file, which will determine whether the bidi implementation will consider the
> > newline character as the paragraph separator or an empty line?
>
> Could be.  I'd need to carefully review the code to say for sure.
> Originally, the regexp which defines where paragraph begins was
> customizable, but it led to grave bugs, so I removed that.  Maybe a
> more restricted facility could avoid such pitfalls.

It turned out to be relatively easy, so I implemented this on the
master branch of the Emacs Git repository.  There are two new
variables that you should set to "^" to get the behavior you wanted.
I hope you can build the master branch and see whether the new
facilities solve your case.

Thanks.
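
A minimal sketch of what this could look like in an init file, for anyone trying the master branch. The message above does not name the two variables; the names used here, `bidi-paragraph-start-re` and `bidi-paragraph-separate-re`, are the ones I understand to have been added for this purpose, so verify them against the NEWS file of your build before relying on this.

```elisp
;; Assumed variable names (not stated in the thread); requires a build
;; that includes the change described above.
;; "^" matches at the start of every line, so each line should be
;; treated as its own paragraph for bidi reordering.
(setq bidi-paragraph-start-re "^")
(setq bidi-paragraph-separate-re "^")
```

With both set this way, every newline should end a bidi paragraph, which is the behavior requested earlier in the thread.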



Itai Berli
Thanks. I've never built Emacs from source. I think it might be easier for me to wait till this patch makes it to the official release.


Jean-Christophe Helary

On Jul 18, 2017, at 0:16, Itai Berli <[hidden email]> wrote:

> Thanks. I've never built Emacs from source. I think it might be easier for me to wait till this patch makes it to the official release.

It's actually pretty easy to build from source. The easiest way (which depends on your platform) is to install a prebuilt version that corresponds to HEAD. The slightly less trivial way is to get the code from Savannah: clone the repository and follow the build instructions.
I got used to doing that a few weeks ago, and it is fascinating to see all the new features pouring in every day.

Jean-Christophe




Itai Berli
Eli, what version number should I download?



