bug#41321: 27.0.91; Emacs aborts due to invalid pseudovector objects

Pip Cet
On Sat, May 16, 2020 at 10:34 AM Eli Zaretskii <[hidden email]> wrote:
> So far, I've seen this in a C Mode buffer reverted because "git pull"
> brought a modified version, and in an Info mode buffer reverted
> because the manual was rebuilt after the Texinfo sources were
> modified.  In the latter case I captured a backtrace, see below.
>
> The problem seems to involve invalid markers, perhaps markers that were
> unchained and put on the free list

Even unchained markers shouldn't be put on the free list as long as
they're still reachable, so I suspect the problem is more likely to be
caused by that.

> (witness the PVEC_FREE object that
> caused the abort in the backtrace below, where Emacs seems to be
> trying to display an error message about an invalid marker).

What I would do next is run with a breakpoint on wrong_type_argument
(if that's impossible, change the code in CHECK_MARKER to abort upon
encountering a PVEC_FREE vector) to see where the reference to the
freed pseudovector came from. An undo list, maybe?
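
A rough sketch of the second suggestion, as a toy model rather than the
real marker.c code: the toy_ names and the simplified header struct are
assumptions made for illustration, and only the idea of aborting on a
PVEC_FREE object comes from the message above. The point is to stop at
the first use of a freed pseudovector, so that the backtrace shows who
still holds the reference.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy stand-ins for pseudovector types; NOT the real lisp.h definitions.  */
enum toy_pvec_type { TOY_PVEC_NORMAL_VECTOR, TOY_PVEC_FREE, TOY_PVEC_MARKER };

/* Simplified header: real pseudovectors encode the type in a size word.  */
struct toy_vectorlike { enum toy_pvec_type type; };

enum toy_check_result { TOY_IS_MARKER, TOY_WRONG_TYPE, TOY_FREED };

/* Classify an object that the caller expects to be a marker.  */
static enum toy_check_result
toy_classify_marker (const struct toy_vectorlike *v)
{
  assert (v != NULL);
  if (v->type == TOY_PVEC_FREE)
    return TOY_FREED;
  return v->type == TOY_PVEC_MARKER ? TOY_IS_MARKER : TOY_WRONG_TYPE;
}

/* Hypothetical stricter check: abort on a freed object so the debugger
   stops at the first use, before any error message is displayed.  */
static void
toy_check_marker (const struct toy_vectorlike *v)
{
  switch (toy_classify_marker (v))
    {
    case TOY_FREED:
      fprintf (stderr, "reference to freed pseudovector %p\n",
               (const void *) v);
      abort ();
    case TOY_WRONG_TYPE:
      /* The real code would signal a Lisp error here.  */
      fprintf (stderr, "wrong type argument: markerp\n");
      break;
    case TOY_IS_MARKER:
      break;
    }
}
```

Calling toy_check_marker on an object whose type is TOY_PVEC_FREE
aborts immediately; a live marker passes through silently.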




Eli Zaretskii
> From: Pip Cet <[hidden email]>
> Date: Sun, 17 May 2020 10:56:28 +0000
> Cc: [hidden email]
>
> What I would do next is run with a breakpoint on wrong_type_argument
> (if that's impossible, change the code in CHECK_MARKER to abort upon
> encountering a PVEC_FREE vector) to see where the reference to the
> freed pseudovector came from. An undo list, maybe?

I'm already running with such a breakpoint, let's how it will catch
something.




Eli Zaretskii
> Date: Sun, 17 May 2020 18:28:04 +0300
> From: Eli Zaretskii <[hidden email]>
> Cc: [hidden email]
>
> I'm already running with such a breakpoint, let's how it will catch
> something.                                        ^^^

Should have been "hope".  Sorry.




Eli Zaretskii
> Date: Fri, 22 May 2020 10:22:56 +0300
> From: Eli Zaretskii <[hidden email]>
> Cc: [hidden email]
>
> Since the previous call to before-change-functions already used the
> same overlay markers, I suspect that the call to
> before-change-functions caused the memory to be unmapped (perhaps due
> to GC).

FTR: I am now running the 27.0.91 pretest with the patch for bug#40661
applied.  It's a long shot, since the problem here is not with
pointers to buffer text, but I just want to be sure I didn't
rediscover a complicated way to reproduce that bug ;-)




Eli Zaretskii
In reply to this post by Eli Zaretskii
> From: Andrea Corallo <[hidden email]>
> Cc: [hidden email], Stefan Monnier <[hidden email]>,
>         [hidden email]
> Date: Fri, 22 May 2020 08:35:55 +0000
>
> I'd be curious about the outcome if you had a look at your 'garbage_collect'
> assembly to investigate the possible relation with 41357 as suggested
> here
> https://lists.gnu.org/archive/html/bug-gnu-emacs/2020-05/msg01095.html

Sorry, I'm not sure I understand what you mean by the above.  Did you
mean whether I disassembled garbage_collect and looked at the code?
If so, the answer is NO, I didn't yet have time for that.

However, given the latest findings, I now doubt even more that the
issue you identified can have any relation to this problem.  As seen
by the backtrace I've shown in my last message, the buffer's overlay
list has invalid overlay objects at the point of the crash.  The 2
pointers to the overlay lists of a buffer are unconditionally marked
in mark_buffer, so I don't see how problems in GC with Lisp objects in
registers could interfere in this case.  Am I missing something?




Pip Cet
In reply to this post by Eli Zaretskii
On Fri, May 22, 2020 at 7:22 AM Eli Zaretskii <[hidden email]> wrote:
>   (gdb) p current_buffer->overlays_before
>   $28 = (struct Lisp_Overlay *) 0x170cb080
>   (gdb) p $28->start
>   $29 = XIL(0xa0000000170cb040)
>   (gdb) xtype
>   Lisp_Vectorlike
>   Cannot access memory at address 0x18ac04f8

Note that it didn't try to print $29, but the original invalid marker. In
particular, I believe 0x170cb040 is a pointer to a valid marker.

>   (gdb) p $28->next
>   $30 = (struct Lisp_Overlay *) 0x13050320
>   (gdb) p $28->next->start
>   $31 = XIL(0xa000000016172310)
>   (gdb) xtype
>   Lisp_Vectorlike
>   Cannot access memory at address 0x18ac04f8

Same here.

If you could disassemble signal_before_change, we'd know whether
start_marker and end_marker live in callee-saved registers, and thus
whether this is likely to be Andrea's bug.




Eli Zaretskii
> From: Pip Cet <[hidden email]>
> Date: Fri, 22 May 2020 11:47:03 +0000
> Cc: Stefan Monnier <[hidden email]>, [hidden email]
>
> On Fri, May 22, 2020 at 7:22 AM Eli Zaretskii <[hidden email]> wrote:
> >   (gdb) p current_buffer->overlays_before
> >   $28 = (struct Lisp_Overlay *) 0x170cb080
> >   (gdb) p $28->start
> >   $29 = XIL(0xa0000000170cb040)
> >   (gdb) xtype
> >   Lisp_Vectorlike
> >   Cannot access memory at address 0x18ac04f8
>
> Note that it didn't try to print $29, but the original invalid marker.

Sorry, I don't follow.  "xtype" shows the type of the last result,
AFAIK, in this case the type of $29.  If this changed somehow, either
we have a bug in .gdbinit or I have been using GDB incorrectly for I
don't know how many years.

What am I missing?




Eli Zaretskii
In reply to this post by Pip Cet
> From: Pip Cet <[hidden email]>
> Date: Fri, 22 May 2020 11:47:03 +0000
> Cc: Stefan Monnier <[hidden email]>, [hidden email]
>
> On Fri, May 22, 2020 at 7:22 AM Eli Zaretskii <[hidden email]> wrote:
> >   (gdb) p current_buffer->overlays_before
> >   $28 = (struct Lisp_Overlay *) 0x170cb080
> >   (gdb) p $28->start
> >   $29 = XIL(0xa0000000170cb040)
> >   (gdb) xtype
> >   Lisp_Vectorlike
> >   Cannot access memory at address 0x18ac04f8
>
> Note that it didn't try to print $29, but the original invalid marker. In
> particular, I believe 0x170cb040 is a pointer to a valid marker.
>
> >   (gdb) p $28->next
> >   $30 = (struct Lisp_Overlay *) 0x13050320
> >   (gdb) p $28->next->start
> >   $31 = XIL(0xa000000016172310)
> >   (gdb) xtype
> >   Lisp_Vectorlike
> >   Cannot access memory at address 0x18ac04f8
>
> Same here.
>
> If you could disassemble signal_before_change, we'd know whether
> start_marker and end_marker live in callee-saved registers, and thus
> whether this is likely to be Andrea's bug.

Since $28 is neither start_marker nor end_marker, but the first
overlay on the buffer's overlay chain, how could it be affected by
whether start_marker or end_marker are in a callee-saved register?
What am I missing here?




Pip Cet
In reply to this post by Eli Zaretskii
On Fri, May 22, 2020 at 12:13 PM Eli Zaretskii <[hidden email]> wrote:

> > From: Pip Cet <[hidden email]>
> > Date: Fri, 22 May 2020 11:47:03 +0000
> > Cc: Stefan Monnier <[hidden email]>, [hidden email]
> >
> > On Fri, May 22, 2020 at 7:22 AM Eli Zaretskii <[hidden email]> wrote:
> > >   (gdb) p current_buffer->overlays_before
> > >   $28 = (struct Lisp_Overlay *) 0x170cb080
> > >   (gdb) p $28->start
> > >   $29 = XIL(0xa0000000170cb040)
> > >   (gdb) xtype
> > >   Lisp_Vectorlike
> > >   Cannot access memory at address 0x18ac04f8
> >
> > Note that it didn't try to print $29, but the original invalid marker.
>
> Sorry, I don't follow.  "xtype" shows the type of the last result,
> AFAIK, in this case the type of $29.  If this changed somehow, either
> we have a bug in .gdbinit or I have been using GDB incorrectly for I
> don't know how many years.

I think it's most likely to be a GDB bug, and I can't reproduce it here.

But it's definitely trying to access memory at address 0x18ac04f8,
which corresponds to start_marker.

  (gdb) p rvoe_arg.location
  $35 = (Lisp_Object *) 0x15c9298 <globals+120>
  (gdb) xtype
  Lisp_Vectorlike
  Cannot access memory at address 0x18ac04f8
  (gdb) p rvoe_arg.errorp
  $36 = false

Surely rvoe_arg.location isn't a vectorlike, so that also points to
GDB not dealing with things correctly.




Eli Zaretskii
> From: Pip Cet <[hidden email]>
> Date: Fri, 22 May 2020 12:39:27 +0000
> Cc: Stefan Monnier <[hidden email]>, [hidden email]
>
> > Sorry, I don't follow.  "xtype" shows the type of the last result,
> > AFAIK, in this case the type of $29.  If this changed somehow, either
> > we have a bug in .gdbinit or I have been using GDB incorrectly for I
> > don't know how many years.
>
> I think it's most likely to be a GDB bug, and I can't reproduce it here.
>
> But it's definitely trying to access memory at address 0x18ac04f8,
> which corresponds to start_marker.

My interpretation of that equality was that both start_marker and the
buffer's overlay chain got invalidated because some code relocated
objects and unmapped the previously referenced memory, perhaps due to
GC.  I don't yet have an explanation for how this could happen, so
maybe this hypothesis is wrong.

>   (gdb) p rvoe_arg.location
>   $35 = (Lisp_Object *) 0x15c9298 <globals+120>
>   (gdb) xtype
>   Lisp_Vectorlike
>   Cannot access memory at address 0x18ac04f8
>   (gdb) p rvoe_arg.errorp
>   $36 = false
>
> Surely rvoe_arg.location isn't a vectorlike, so that also points to
> GDB not dealing with things correctly.

rvoe_arg.location should be a pointer to the value of
before-change-functions, so yes, it isn't supposed to be vectorlike.
But I very much doubt there's such a blatant bug in GDB: this is the
latest GDB 9.1, and I'm using these commands from .gdbinit all the
time.  I tend to think this is somehow part of the bug that caused the
crash.




Andrea Corallo
In reply to this post by Eli Zaretskii
Eli Zaretskii <[hidden email]> writes:

>> From: Andrea Corallo <[hidden email]>
>> Cc: [hidden email], Stefan Monnier <[hidden email]>,
>>         [hidden email]
>> Date: Fri, 22 May 2020 08:35:55 +0000
>>
>> I'd be curious about the outcome if you had a look at your 'garbage_collect'
>> assembly to investigate the possible relation with 41357 as suggested
>> here
>> https://lists.gnu.org/archive/html/bug-gnu-emacs/2020-05/msg01095.html
>
> Sorry, I'm not sure I understand what you mean by the above.  Did you
> mean whether I disassembled garbage_collect and looked at the code?

Yes, should be quick to see if callee-save regs are pushed.

> However, given the latest findings, I now doubt even more that the
> issue you identified can have any relation to this problem.  As seen
> by the backtrace I've shown in my last message, the buffer's overlay
> list has invalid overlay objects at the point of the crash.  The 2
> pointers to the overlay lists of a buffer are unconditionally marked
> in mark_buffer, so I don't see how problems in GC with Lisp objects in
> registers could interfere in this case.  Am I missing something?

Not that I'm aware of.  I'm no expert on the piece of code you are
looking at and haven't investigated it.  It was just a 'cheap' idea to
take a potential problem off the table.

  Andrea

--
[hidden email]




Pip Cet
In reply to this post by Eli Zaretskii
On Fri, May 22, 2020 at 12:48 PM Eli Zaretskii <[hidden email]> wrote:

> > From: Pip Cet <[hidden email]>
> > Date: Fri, 22 May 2020 12:39:27 +0000
> > Cc: Stefan Monnier <[hidden email]>, [hidden email]
> >
> > > Sorry, I don't follow.  "xtype" shows the type of the last result,
> > > AFAIK, in this case the type of $29.  If this changed somehow, either
> > > we have a bug in .gdbinit or I have been using GDB incorrectly for I
> > > don't know how many years.
> >
> > I think it's most likely to be a GDB bug, and I can't reproduce it here.
> >
> > But it's definitely trying to access memory at address 0x18ac04f8,
> > which corresponds to start_marker.
>
> My interpretation of that equality was that both start_marker and the
> buffer's overlay chain got invalidated because some code relocated
> objects and unmapped the previously referenced memory, perhaps due to
> GC.  I don't yet have an explanation for how this could happen, so
> maybe this hypothesis is wrong.

I think it has to be, because the error message would then read
"Cannot access memory at address 0x170cb040", which is the only
address xvectype is supposed to look at.

> But I very much doubt there's such a blatant bug in GDB: this is the
> latest GDB 9.1, and I'm using these commands from .gdbinit all the
> time.  I tend to think this is somehow part of the bug that caused the
> crash.

I'm not sure how it could be. I don't think posting the disassembled
code for `signal_before_change' can hurt, since there's no easy way
for anyone else to reproduce it.




Andrea Corallo
Eli Zaretskii <[hidden email]> writes:

>> From: Pip Cet <[hidden email]>
>> Date: Fri, 22 May 2020 14:04:03 +0000
>> Cc: Stefan Monnier <[hidden email]>, [hidden email]
>>
>> I don't think posting the disassembled code for
>> `signal_before_change' can hurt, since there's no easy way for
>> anyone else to reproduce it.
>
> I see this on two different systems where Emacs was compiled with two
> different versions of GCC.  So if you want to see the disassembly, any
> 32-bit GCC will do, I think.

I believe the target triplet can make a difference, given that the
calling convention can change, no?  Also, CFLAGS are clearly a factor.

--
[hidden email]




Eli Zaretskii
> From: Andrea Corallo <[hidden email]>
> Cc: Pip Cet <[hidden email]>, [hidden email],
>         [hidden email]
> Date: Fri, 22 May 2020 14:40:05 +0000
>
> > I see this on two different systems where Emacs was compiled with two
> > different versions of GCC.  So if you want to see the disassembly, any
> > 32-bit GCC will do, I think.
>
> I believe the target triplet can make a difference, given that the
> calling convention can change, no?  Also, CFLAGS are clearly a factor.

My CFLAGS are in my original report of this bug.




Pip Cet
I believe this isn't the problem we're looking for, but it might be
related anyway.

I'm seeing this in the assembler source code for insdel.c produced
with the mingw cross compiler (i686-w64-mingw32-gcc-win32):

    movl    60(%esp), %eax
    movl    %eax, (%esp)
    movl    72(%esp), %eax
    movl    %eax, 4(%esp)
    call    _Fmarker_position

If I'm reading this correctly, it's of some concern for wide-int
builds: the two 32-bit halves of a Lisp_Object are stored
non-consecutively.

Our stack marking doesn't catch that; at least, it doesn't for
symbols, where the less-significant half isn't a valid pointer. For
pseudovectors, things should still work...

So I think we have a problem with such --with-wide-int builds in cases
where a stack temporary holds an unpinned uninterned symbol while GC
is called. Something like

(prog1
  (gensym)
  (garbage-collect))

might trigger it. No problem with gcc -m32 on GNU/Linux, for some reason.




Andrea Corallo
Pip Cet <[hidden email]> writes:

> I believe this isn't the problem we're looking for, but it might be
> related anyway.
>
> I'm seeing this in the assembler source code for insdel.c produced
> with the mingw cross compiler (i686-w64-mingw32-gcc-win32):
>
>     movl    60(%esp), %eax
>     movl    %eax, (%esp)
>     movl    72(%esp), %eax
>     movl    %eax, 4(%esp)
>     call    _Fmarker_position
>
> If I'm reading this correctly, it's of some concern for wide-int
> builds: the two 32-bit halves of a Lisp_Object are stored
> non-consecutively.
>
> Our stack marking doesn't catch that; at least, it doesn't for
> symbols, where the less-significant half isn't a valid pointer. For
> pseudovectors, things should still work...
>
> So I think we have a problem with such --with-wide-int builds in cases
> where a stack temporary holds an unpinned uninterned symbol while GC
> is called. Something like
>
> (prog1
>   (gensym)
>   (garbage-collect))
>
> might trigger it. No problem with gcc -m32 on GNU/Linux, for some reason.

Very interesting.  AFAIK there's no guarantee that the compiler will
spill a DI reg to adjacent memory.  Also, reading the GC code, your
observation seems correct to me.

--
[hidden email]




Stefan Monnier
>> If I'm reading this correctly, it's of some concern for wide-int
>> builds: the two 32-bit halves of a Lisp_Object are stored
>> non-consecutively.

This shouldn't be a problem: wide-int builds use MSB tagging, so all
Lisp_Objects which contain a pointer have their lowest 32bits exactly
identical to that pointer (and the higher 32bits just contain the tag).
So we'll find them in the stack even if the two halves are separate
simply because the pointer-part will be found like any other pointer.


        Stefan





Pip Cet
On Sat, May 23, 2020 at 10:38 PM Stefan Monnier
<[hidden email]> wrote:
> >> If I'm reading this correctly, it's of some concern for wide-int
> >> builds: the two 32-bit halves of a Lisp_Object are stored
> >> non-consecutively.
>
> This shouldn't be a problem: wide-int builds use MSB tagging, so all
> Lisp_Objects which contain a pointer have their lowest 32bits exactly
> identical to that pointer (and the higher 32bits just contain the tag).

As I said, I don't believe that's true for symbols. Qnil is always
binary 0, so we offset all symbols by the offset of lispsym.

> So we'll find them in the stack even if the two halves are separate
> simply because the pointer-part will be found like any other pointer.

Yes, that's what I meant to say when I said it should still work for
pseudovectors.
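
The point under discussion can be illustrated with a toy model of MSB
tagging. This is a sketch under assumptions, not the actual lisp.h
code: the toy_ names, the symbol address and the lispsym base value are
made up, and only the tag in the top bits of a 64-bit word and the
symbol offset follow the discussion. The vectorlike value matches the
XIL(0xa000000018ac0518) that appears in the backtraces in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_object;   /* 64-bit Lisp word on a 32-bit host */

enum { TOY_TAG_SHIFT = 61 };   /* assume 3 tag bits at the top of the word */
enum toy_tag { TOY_TAG_SYMBOL = 0, TOY_TAG_VECTORLIKE = 5 };

/* Vectorlike objects: the low 32 bits are the pointer itself.  */
static toy_object
toy_make_vectorlike (uint32_t addr)
{
  return ((toy_object) TOY_TAG_VECTORLIKE << TOY_TAG_SHIFT) | addr;
}

/* Symbols: the low 32 bits are an offset from the symbol table base,
   so that the first symbol (nil) is the all-zero word.  */
static toy_object
toy_make_symbol (uint32_t addr, uint32_t lispsym_base)
{
  return ((toy_object) TOY_TAG_SYMBOL << TOY_TAG_SHIFT)
         | (uint32_t) (addr - lispsym_base);
}

static uint32_t
toy_low_half (toy_object obj)
{
  return (uint32_t) obj;
}

/* What a pointer-only scan can conclude from one 32-bit stack slot.  */
static bool
toy_word_points_to (uint32_t stack_word, uint32_t addr)
{
  return stack_word == addr;
}

static void
toy_demo (void)
{
  uint32_t lispsym_base = 0x015c9000;   /* made-up base address */
  uint32_t vec_addr = 0x18ac0518;       /* marker address from the backtrace */
  uint32_t sym_addr = 0x17000040;       /* made-up heap symbol address */

  toy_object vec = toy_make_vectorlike (vec_addr);
  toy_object sym = toy_make_symbol (sym_addr, lispsym_base);

  assert (vec == UINT64_C (0xa000000018ac0518));
  /* The low half of a vectorlike is found like any other pointer.  */
  assert (toy_word_points_to (toy_low_half (vec), vec_addr));
  /* The low half of a symbol is an offset, not the symbol's address.  */
  assert (!toy_word_points_to (toy_low_half (sym), sym_addr));
  /* The symbol at the base of the table is the all-zero word.  */
  assert (toy_make_symbol (lispsym_base, lispsym_base) == 0);
}
```

In this model, a scan that sees only the low half of a split word
still finds the pseudovector, but the symbol's low half does not equal
its address, which is the case the two messages above agree on.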




Stefan Monnier
>> This shouldn't be a problem: wide-int builds use MSB tagging, so all
>> Lisp_Objects which contain a pointer have their lowest 32bits exactly
>> identical to that pointer (and the higher 32bits just contain the tag).
> As I said, I don't believe that's true for symbols.  Qnil is always
> binary 0, so we offset all symbols by the offset of lispsym.

Oh, right, good point: I had completely forgotten about that "detail".
We should probably adjust our conservative stack scanning accordingly.


        Stefan





Pip Cet
In reply to this post by Eli Zaretskii
On Fri, May 22, 2020 at 7:22 AM Eli Zaretskii <[hidden email]> wrote:
>   #0  PSEUDOVECTORP (code=<optimized out>, a=<optimized out>) at lisp.h:1720
>   #1  MARKERP (x=<optimized out>) at lisp.h:2618
>   #2  CHECK_MARKER (x=XIL(0xa000000018ac0518)) at marker.c:133
>   #3  0x010f073c in Fmarker_position (marker=XIL(0xa000000018ac0518))
>       at marker.c:452

I think I've worked it out: it's this mingw bug:
https://sourceforge.net/p/mingw-w64/bugs/778/

On mingw, if <stdint.h> is included before/instead of stddef.h,
alignof (max_align_t) == 16. However, as can be seen by the backtrace
above, Eli's malloc only returned an 8-byte-aligned block. That's not
normally a problem, because mark_maybe_object doesn't care about
alignment; but in conjunction with the gcc behavior change, we rely on
mark_maybe_pointer to mark the pointer, and it doesn't, because the
pointer is not aligned to a LISP_ALIGNMENT = 16-byte boundary.

Brute-force patch attached until we can work out how to fix this properly.
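
The alignment argument can be checked with a little arithmetic. The
following is a toy version of the test, assuming (as the name of the
attached patch suggests) that the conservative scan only treats a word
as a possible Lisp pointer when it is a multiple of the required
alignment; toy_maybe_lisp_pointer is an illustrative name, not the
real function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true if ADDR could be a Lisp pointer under the assumption
   that every Lisp object starts on an ALIGNMENT-byte boundary.  */
static bool
toy_maybe_lisp_pointer (uintptr_t addr, uintptr_t alignment)
{
  assert (alignment != 0);
  return addr % alignment == 0;
}
```

With the marker address from the backtrace, 0x18ac0518, the test fails
for a 16-byte alignment but passes for an 8-byte one, which is
consistent with the explanation above.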

Attachment: 0001-Accept-unaligned-pointers-in-maybe_lisp_pointer.patch (1K)