bug#42602: Wrong (not-)casechars value for "polish" in ispell-dictionary-base-alist

Sebastian Urban
Hello,

for words like:
    męski
    miód
    klątwa
    ślad
    łuk
    żaba
    źrebak
    grzać
    bańka
ispell.el sends to Aspell only part of the word, e.g. "lad" instead of
"ślad", or "kl"/"twa" (depending on the cursor position) instead of
"klątwa".

I think this is because of the wrong value of (NOT-)CASECHARS, which
covers the ASCII letters (A-Z, a-z) and a few other chars, of which
only ó/Ó is valid for Polish.

Although, for some reason, it doesn't recognize "ó" in the word "miód",
sending "mi" or "d". It is on the list of CASECHARS under \363, so it
should work.  Moreover, if I type in regexp-builder "[\363\323]" it
won't recognize ó/Ó, but it doesn't have a problem with other Polish
chars, like "ł" ("[\502]") or "ż" ("[\574]").

If I put in my init.el:
--8<---------------cut here---------------start------------->8---
(setq ispell-program-name "C:/cygwin64/bin/aspell")
(add-hook 'ispell-initialize-spellchecker-hook
          (lambda ()
            (add-to-list 'ispell-local-dictionary-alist
                         '("pl"
                           ;; "[[:alpha:]]"
                           ;; "[^[:alpha:]]"
                           ;; ęóąśłżźćńĘÓĄŚŁŻŹĆŃ
                           "[A-Za-z\431\363\405\533\502\574\572\407\504\430\323\404\532\501\573\571\406\503]"
                           "[^A-Za-z\431\363\405\533\502\574\572\407\504\430\323\404\532\501\573\571\406\503]"
                           "[.]" nil nil nil iso-8859-2))))
(setq ispell-dictionary "pl")
--8<---------------cut here---------------end--------------->8---

everything seems to work, even ó/Ó are recognised. "[[:alpha:]]" works
as well, so I left it as an alternative. Changing from iso-8859-2 to
utf-8 doesn't break anything.

Tested on:
- GNU Emacs 26.3 (build 1, x86_64-w64-mingw32) of 2019-08-29,
- GNU Emacs 28.0.50 (build 1, x86_64-w64-mingw32) of 2020-07-05,
with Aspell from Cygwin installation.


S. U.



Eli Zaretskii
> From: Sebastian Urban <[hidden email]>
> Date: Wed, 29 Jul 2020 18:12:02 +0200
>
> for words like:
>     męski
>     miód
>     klątwa
>     ślad
>     łuk
>     żaba
>     źrebak
>     grzać
>     bańka
> ispell.el sends to Aspell only part of the word, e.g. "lad" instead of
> "ślad", or "kl"/"twa" (depending on the cursor position) instead of
> "klątwa".
>
> I think this is because wrong value of (NOT-)CASECHARS, which is ASCII
> A-z letters and a few chars of which only ó/Ó is valid for Polish.
>
> Although, for some reason, it doesn't recognize "ó" in word "miód",
> sending "mi" or "d". It is on the list of CASECHARS under \363, so it
> should work.  Moreover, if I type in regexp-builder "[\363\323]" it
> won't recognize ó/Ó, but it doesn't have a problem with other Polish
> chars, like "ł" ("[\502]") or "ż" ("[\574]").
>
> If I put in my init.el:
> --8<---------------cut here---------------start------------->8---
> (setq ispell-program-name "C:/cygwin64/bin/aspell")
> (add-hook 'ispell-initialize-spellchecker-hook
>            (lambda ()
>            (add-to-list 'ispell-local-dictionary-alist
>                         '("pl"
>                           ;; "[[:alpha:]]"
>                           ;; "[^[:alpha:]]"
>                           ;; ęóąśłżźćńĘÓĄŚŁŻŹĆŃ
> "[A-Za-z\431\363\405\533\502\574\572\407\504\430\323\404\532\501\573\571\406\503]"
> "[^A-Za-z\431\363\405\533\502\574\572\407\504\430\323\404\532\501\573\571\406\503]"
>                           "[.]" nil nil nil iso-8859-2))))
> (setq ispell-dictionary "pl")
> --8<---------------cut here---------------end--------------->8---
>
> everything seems to work, even ó/Ó are recognised.

I don't understand this change.  Values above octal 377 cannot be
right in the above regexps, because they are supposed to be in Latin-2
encoding, which is a single-byte encoding, and so can only handle
values below octal 400.  How did you come up with those values?

Anyway, I'm quite sure some other factor is at work here.

> Tested on:
> - GNU Emacs 26.3 (build 1, x86_64-w64-mingw32) of 2019-08-29,
> - GNU Emacs 28.0.50 (build 1, x86_64-w64-mingw32) of 2020-07-05,
> with Aspell from Cygwin installation.

Your Emacs is a native MinGW build, whereas Aspell seems to be a
Cygwin build?  If so, you could have incompatibility in character
encoding.  What is your Windows locale?  And what does

  M-: (getenv "LANG") RET

yield inside Emacs?



Sebastian Urban
> I don't understand this change.  Values above octal 377 cannot be
> right in the above regexps, because they are supposed to be in
> Latin-2 encoding, which is a single-byte encoding, and so can only
> handle values below octal 400.  How did you come up with those
> values?

Basically, C-x = on a char, which gave me octal values.  I thought it
was recognising only A-z + ó/Ó and some other chars that I'm not
interested in, so I swapped those values for the ones corresponding to
the Polish chars.  That's the whole story.

> Anyway, I'm quite sure some other factor is at work here.

Well, I did some tests, e.g. switched back to the original value of
"polish" in my "pl" dictionary, and... it works.  And if I change from
iso-8859-2 to utf-8 in my "pl" (with original value from "polish") it
doesn't work.  So, as you later wrote - wrong character encoding,
I guess.

Looking for a cause (in default settings), I think I found it in
ispell-dictionary-base-alist and ispell-dictionary-alist.  During
"transfer" from *-base-* to ispell-dictionary-alist, the value of
CHARACTER-SET is changed in all cases from iso-* or cp1255 to utf-8,
then ispell uses these (from ispell-dictionary-alist) when it "talks"
with Aspell.
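
One way to check this observation is to compare the CHARACTER-SET field
(the eighth element of a dictionary entry) in both lists.  This is only
a sketch: the helper name my/ispell-charset is hypothetical, and it
assumes that ispell-program-name points at a spell checker Emacs can
actually run.

```elisp
(require 'ispell)

(defun my/ispell-charset (name alist)
  "Return the CHARACTER-SET field (8th element) of dictionary NAME in ALIST.
Hypothetical helper for illustration; not part of ispell.el."
  (nth 7 (assoc name alist)))

;; Fill `ispell-dictionary-alist' from the base list and from what the
;; spell checker reports.  Skipped when no spell checker is found.
(when (executable-find ispell-program-name)
  (ispell-set-spellchecker-params))

;; First element: value in the base list.
;; Second element: value ispell.el will actually use.
(list (my/ispell-charset "polish" ispell-dictionary-base-alist)
      (my/ispell-charset "polish" ispell-dictionary-alist))
```

If the description above is right, the first value should be
iso-8859-2 and the second utf-8 when Aspell is the spell checker.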

On the other hand, if I use Emacs 26.3 from Cygwin, everything works
out of the box, I don't even have to set "polish" as default
dictionary. But there, in Cygwin command line, "env | grep LANG" gives
"LANG=pl_PL.UTF-8".

> Your Emacs is a native MinGW build, whereas Aspell seems to be
> a Cygwin build?

Both Emacses are official Win builds, and Aspell is installed through
Cygwin.

> If so, you could have incompatibility in character encoding.  What
> is your Windows locale?

"Polish" everywhere in "Control Panel" -> "Regional and Language".

> And what does M-: (getenv "LANG") RET yield inside Emacs?

"PLK"


S. U.

P.S.
> Moreover, if I type in regexp-builder "[\363\323]" it won't
> recognize ó/Ó, but it doesn't have a problem with other Polish
> chars, like "ł" ("[\502]") or "ż" ("[\574]").

In the "Character List" buffer for unicode-bmp, regexp-builder
(numbers are octal values):
- 0-177 and 400-777 - highlights chars
- 240-377 - doesn't highlight chars (it highlights them if I use hex
   value, or insert them directly)
I didn't check "80h-9Fh" chars.  Chars like C-a were checked by
inserting them with quoted-insert in another buffer.




Eli Zaretskii
> From: Sebastian Urban <[hidden email]>
> Cc: [hidden email]
> Date: Thu, 30 Jul 2020 13:39:55 +0200
>
> > I don't understand this change.  Values above octal 377 cannot be
> > right in the above regexps, because they are supposed to be in
> > Latin-2 encoding, which is a single-byte encoding, and so can only
> > handle values below octal 400.  How did you come up with those
> > values?
>
> Basically, C-x = on a char, which gave me octal values.

This gives you the Unicode codepoint, not its Latin-2 encoding.  They
are different.  The database in ispell.el uses Latin-2 encodings of
Polish characters.
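
The difference can be seen with a small sketch (the helper name
my/latin2-vs-unicode is hypothetical; encode-coding-string and the
iso-8859-2 coding system are standard Emacs).  It returns, in octal,
the code point that C-x = reports and the single byte Latin-2 uses for
the same character:

```elisp
(defun my/latin2-vs-unicode (char)
  "Return (CODEPOINT-OCTAL LATIN2-OCTAL) for CHAR, both as strings.
Hypothetical helper for illustration; not part of ispell.el."
  (let ((byte (aref (encode-coding-string (string char) 'iso-8859-2) 0)))
    (list (format "%o" char)      ; what `C-x =' shows (Unicode)
          (format "%o" byte))))   ; what a Latin-2 regexp needs

;; U+00F3 is ó, U+0142 is ł, U+017C is ż.
(mapcar #'my/latin2-vs-unicode '(?\u00f3 ?\u0142 ?\u017c))
;; => (("363" "363") ("502" "263") ("574" "277"))
```

Note that ó has the same value, octal 363, in both, whereas ł and ż
do not, so a value taken from C-x = is only sometimes a valid Latin-2
byte.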

> Well, I did some tests, e.g. switched back to the original value of
> "polish" in my "pl" dictionary, and... it works.  And if I change from
> iso-8859-2 to utf-8 in my "pl" (with original value from "polish") it
> doesn't work.  So, as you later wrote - wrong character encoding,
> I guess.
>
> Looking for a cause (in default settings), I think I found it in
> ispell-dictionary-base-alist and ispell-dictionary-alist.  During
> "transfer" from *-base-* to ispell-dictionary-alist, the value of
> CHARACTER-SET is changed in all cases from iso-* or cp1255 to utf-8,
> then ispell uses these (from ispell-dictionary-alist) when it "talks"
> with Aspell.
>
> On the other hand, if I use Emacs 26.3 from Cygwin, everything works
> out of the box, I don't even have to set "polish" as default
> dictionary. But there, in Cygwin command line, "env | grep LANG" gives
> "LANG=pl_PL.UTF-8".

Native MinGW builds cannot use the UTF-8 encoding.

So, do we have a problem to solve, or can this issue be closed?



Sebastian Urban
>>> I don't understand this change.  Values above octal 377 cannot be
>>> right in the above regexps, because they are supposed to be in
>>> Latin-2 encoding, which is a single-byte encoding, and so can only
>>> handle values below octal 400.  How did you come up with those
>>> values?
>>
>> Basically, C-x = on a char, which gave me octal values.
>
> This gives you the Unicode codepoint, not its Latin-2 encoding.
> They are different.

So, it would work even if I added "\999999999", because Emacs would
not recognize it and would simply ignore it, which means the only
reason it worked was the explicitly set encoding (iso-8859-2)?

> The database in ispell.el uses Latin-2 encodings of Polish
> characters.

As base, but before ispell.el sends the string to Aspell it
translates it to utf-8, right?  Because that's the only difference
between my custom "pl" dictionary and value of "polish" in
ispell-dictionary-alist.

> Native MinGW builds cannot use the UTF-8 encoding.

So, with my setup (I'm not saying it's the best one, it's just the
current one, and I can change it if there is a better one), for Polish
I have to define a local dictionary with the iso-8859-2 coding?

> So, do we have a problem to solve, or can this issue be closed?

If it's a problem of MinGW, and my setup, then I guess it's not an
Emacs problem, so yes, it can be closed.


S. U.



Stefan Kangas
Sebastian Urban <[hidden email]> writes:

>> So, do we have a problem to solve, or can this issue be closed?
>
> If it's a problem of MinGW, and my setup, then I guess it's not an
> Emacs problem, so yes, it can be closed.

I'm therefore closing this bug report.

Best regards,
Stefan Kangas