bug#46933: Possible bugs in filepos-to-bufferpos / bufferpos-to-filepos

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

bug#46933: Possible bugs in filepos-to-bufferpos / bufferpos-to-filepos

Eli Zaretskii
> Date: Thu, 04 Mar 2021 21:21:24 +0000
> From: Gregory Heytings <[hidden email]>
>
> (Disclaimer: I have no knowledge whatsoever about the ISO-2022-JP
> encoding, and although this looks like a bug, I'm not sure this is
> actually a bug; I report this at the suggesion of Eli in bug#46859.)
>
> I downloaded the file [1], and converted it to the ISO-2022-JP encoding
> with iconv -t iso-2022-jp one.txt > iso-2022-jp.txt.  The resulting file
> is attached to this bug report.  It ends with two CRLFs, at byte offsets
> 2993 and 2995.  However, after emacs -Q iso-2022-jp.txt, with M-:
> (goto-char (filepos-to-bufferpos POS 'exact)) we get:
>
> POS = 2991, 2992: last but one visible character (HIRAGANA LETTER RU)
> POS = 2993, 2994: last visible character (IDEOGRAPHIC FULL STOP)
> POS = 2995, 2996: first CRLF
> POS = 2997: second CRLF
> POS = 2998: point-max
> POS = 2999: first CRLF
> POS = 3000, 3001: second CRLF
> POS >= 3002: point-max
>
> I would have expected:
>
> POS = 2989, 2990: last but one visible character (HIRAGANA LETTER RU)
> POS = 2991, 2992: last visible character (IDEOGRAPHIC FULL STOP)
> POS = 2993, 2994: first CRLF
> POS = 2995, 2996: second CRLF
> POS >= 2997: point-max
>
> The opposite operation M-: (bufferpos-to-filepos (- (point) POS) 'exact)
> apparently also has bugs; its return values are not coherent with the
> above ones:
>
> POS = 0: 3003
> POS = 1: 3001
> POS = 2: 2999
> POS = 3 (IDEOGRAPHIC FULL STOP): 2997
> POS = 4 (HIRAGANA LETTER RU): 2995
>
> I would have expected:
>
> POS = 0: 2997
> POS = 1: 2995
> POS = 2: 2993
> POS = 3 (IDEOGRAPHIC FULL STOP): 2991
> POS = 4 (HIRAGANA LETTER RU): 2989
>
> [1] https://darza.com/ecbackend/vendor/symfony/mime/Tests/Fixtures/samples/charsets/iso-2022-jp/one.txt

There's something strange going on here with encoding of the buffer
using iso-2022-jp-dos: near the end of the encoded bytestream, between
the encoded HIRAGANA LETTER KO (こ) and HIRAGANA LETTER TO (と), we
get 6 extra bytes: "ESC ( B ESC $ B".  AFAIU, this sequence mean
switch to ASCII and then switch back to Japanese.  So together these 6
bytes are a no-op as regards to their effect on the text, but they
disrupt the logic of filepos-to-bufferpos because they introduce extra
bytes that aren't there in the original file.

Kenichi, why are these 6 bytes inserted by encode-coding-region, but
not when we encode the same text as part of saving the buffer to its
file?  And why does it happen near the end of the text, between those
2 particular letters?



Reply | Threaded
Open this post in threaded view
|

bug#46933: Possible bugs in filepos-to-bufferpos / bufferpos-to-filepos

handa
In article <[hidden email]>, Eli Zaretskii <[hidden email]> writes:

> Kenichi, why are these 6 bytes inserted by encode-coding-region, but
> not when we encode the same text as part of saving the buffer to its
> file?  And why does it happen near the end of the text, between those
> 2 particular letters?

There surely exists a bug.  Could you please try the attached patch?

The reason why that bug did not happen on file writing is that the code
in write_region calls encoding routine repeatedly without
CODING_MODE_LAST_BLOCK flag, and only in the case that flushing is
required (e.g. the case of iso-2022-jp), just for flushing, it calls
enoding routine again with CODING_MODE_LAST_BLOCK flag.  In that case,
carryover does not happen in encode_coding ().

---
K. Handa
[hidden email]

diff --git a/src/coding.c b/src/coding.c
index 221a9cad89..a9d5a7ccdc 100644
--- a/src/coding.c
+++ b/src/coding.c
@@ -7799,7 +7799,14 @@ encode_coding (struct coding_system *coding)
     coding_set_source (coding);
     consume_chars (coding, translation_table, max_lookup);
     coding_set_destination (coding);
+    /* If consume_chars did not consume all source chars, we call
+       coding->encoder again in the next iteration, and thus, for this
+       iteration, we must clear CODING_MODE_LAST_BLOCK flag.  */
+    unsigned saved_mode = coding->mode;
+    if (coding->consumed_char < coding->src_chars)
+      coding->mode &= ~CODING_MODE_LAST_BLOCK;
     (*(coding->encoder)) (coding);
+    coding->mode = saved_mode;
   } while (coding->consumed_char < coding->src_chars);
 
   if (BUFFERP (coding->dst_object) && coding->produced_char > 0)



Reply | Threaded
Open this post in threaded view
|

bug#46933: Possible bugs in filepos-to-bufferpos / bufferpos-to-filepos

Eli Zaretskii
> From: handa <[hidden email]>
> Cc: [hidden email], [hidden email]
> Date: Sat, 27 Mar 2021 14:38:56 +0900
>
> In article <[hidden email]>, Eli Zaretskii <[hidden email]> writes:
>
> > Kenichi, why are these 6 bytes inserted by encode-coding-region, but
> > not when we encode the same text as part of saving the buffer to its
> > file?  And why does it happen near the end of the text, between those
> > 2 particular letters?
>
> There surely exists a bug.  Could you please try the attached patch?
>
> The reason why that bug did not happen on file writing is that the code
> in write_region calls encoding routine repeatedly without
> CODING_MODE_LAST_BLOCK flag, and only in the case that flushing is
> required (e.g. the case of iso-2022-jp), just for flushing, it calls
> enoding routine again with CODING_MODE_LAST_BLOCK flag.  In that case,
> carryover does not happen in encode_coding ().

Thanks.  The patch fixes the problem with the extra 6 bytes, so I
installed it.

The results of filepos-to-bufferpos with the file attached by Gregory
are better now, but there are still problems for some values of BYTE
argument.  The problem is that ISO-2022 encoding (and others like it)
include shift-in and shift-out sequences, used to switch between
character sets.  As a trivial example, each CR+LF sequence has the
"ESC ( B" sequence before it and "ESC $ B" sequence after it, to
switch to ASCII before the newline, then switch to Japanese after it.
And likewise whenever there's Latin text within Japanese (there are
quite a lot of them in this particular file).  These shift-in and
shift-out sequences consume bytes, but don't produce any characters.
So if the BYTE argument of filepos-to-bufferpos specifies a byte in
the middle of one of these shift sequences, the result will be
incorrect, because decoding a partial sequence produces the bytes of
that sequence verbatim, and the logic in filepos-to-bufferpos of using
the length of the decoded text breaks.  We need special handling of
this and other similar coding-systems to fix these corner use cases,
similarly to what we do in filepos-to-bufferpos--dos.  Patches
welcome.

I'm leaving this bug open because not all of the problem was fixed.



Reply | Threaded
Open this post in threaded view
|

bug#46933: Possible bugs in filepos-to-bufferpos / bufferpos-to-filepos

handa
In article <[hidden email]>, Eli Zaretskii <[hidden email]> writes:
> Thanks.  The patch fixes the problem with the extra 6 bytes, so I
> installed it.

Thank you for the improved concise comment in the code.

> The results of filepos-to-bufferpos with the file attached by Gregory
> are better now, but there are still problems for some values of BYTE
> argument.  The problem is that ISO-2022 encoding (and others like it)
> include shift-in and shift-out sequences, used to switch between
> character sets.  As a trivial example, each CR+LF sequence has the
> "ESC ( B" sequence before it and "ESC $ B" sequence after it, to
> switch to ASCII before the newline, then switch to Japanese after it.
> And likewise whenever there's Latin text within Japanese (there are
> quite a lot of them in this particular file).  These shift-in and
> shift-out sequences consume bytes, but don't produce any characters.
> So if the BYTE argument of filepos-to-bufferpos specifies a byte in
> the middle of one of these shift sequences, the result will be
> incorrect, because decoding a partial sequence produces the bytes of
> that sequence verbatim, and the logic in filepos-to-bufferpos of using
> the length of the decoded text breaks.  We need special handling of
> this and other similar coding-systems to fix these corner use cases,
> similarly to what we do in filepos-to-bufferpos--dos.  Patches
> welcome.

How about something like this method:
1. Encode the buffer text one line by one until we get a longer byte
sequence than BYTE.
2. Delete the result of enoding the last line above.
3. Provided that the above last line has chars C1 C2 ... Cn,
encode characters C1...Cn, C1...Cn-1, C1...Cn-2 until we get a shorter
byte sequence than BYTE.

The first step may be optimized by encode multiple lines instead of
single line.

---
K. Handa
[hidden email]



Reply | Threaded
Open this post in threaded view
|

bug#46933: Possible bugs in filepos-to-bufferpos / bufferpos-to-filepos

Eli Zaretskii
> From: handa <[hidden email]>
> Cc: [hidden email], [hidden email]
> Date: Sat, 27 Mar 2021 22:23:59 +0900
>
> How about something like this method:
> 1. Encode the buffer text one line by one until we get a longer byte
> sequence than BYTE.
> 2. Delete the result of enoding the last line above.
> 3. Provided that the above last line has chars C1 C2 ... Cn,
> encode characters C1...Cn, C1...Cn-1, C1...Cn-2 until we get a shorter
> byte sequence than BYTE.
>
> The first step may be optimized by encode multiple lines instead of
> single line.

Even if we do optimize, this would be very slow, I think.  And what if
the buffer has no newlines?

In any case, the problem is not with encoding, the problem is with
decoding.  Encoding doesn't have this problem because we always encode
more than enough (we use the value of BYTE as the count of
_characters_ to encode, so for ISO-2022 encoding it is usually much
more than needed).  By contrast, when decoding, we decode exactly
BYTE+1 bytes, which then hits the problem if that offset is inside a
shift sequence.