bug#38748: 28.0.50; crash on MacOS 10.15.2

Andrii Kolomoiets
Unfortunately I have no recipe to reproduce this issue.  Emacs just
crashes from time to time.

See attached crash info.

Emacs is built from a fairly recent master (commit
7c5d6a2afc6c23a7fff8456f506ee2aa2d37a3b9)

In GNU Emacs 28.0.50 (build 2, x86_64-apple-darwin19.2.0, NS appkit-1894.20 Version 10.15.2 (Build 19C57))
Windowing system distributor 'Apple', version 10.3.1894
System Description:  Mac OS X 10.15.2

Configured using:
 'configure --disable-dependency-tracking --disable-silent-rules
 --enable-locallisppath=/usr/local/share/emacs/site-lisp
 --infodir=/usr/local/Cellar/emacs/dev/share/info/emacs
 --prefix=/usr/local/Cellar/emacs/dev --with-gnutls --without-x
 --with-xml2 --without-dbus --with-modules --disable-ns-self-contained
 --with-ns'

Configured features:
NOTIFY KQUEUE ACL GNUTLS LIBXML2 ZLIB TOOLKIT_SCROLL_BARS NS MODULES
THREADS JSON PDUMPER GMP

emacs-crash.txt (105K) Download Attachment

bug#38748: 28.0.50; crash on MacOS 10.15.2

Alan Third
On Thu, Dec 26, 2019 at 11:47:29AM +0200, Andrii Kolomoiets wrote:
> Unfortunately I have no recipe to reproduce this issue.  Emacs just
> crashes from time to time.
>
> See attached crash info.
>
> Emacs is built from a fairly recent master (commit
> 7c5d6a2afc6c23a7fff8456f506ee2aa2d37a3b9)
>
<snip>
>
> Exception Type:        EXC_BAD_ACCESS (SIGABRT)
> Exception Codes:       KERN_INVALID_ADDRESS at 0x00000000434f4e44
> Exception Note:        EXC_CORPSE_NOTIFY
>
<snip>
>
> 20  org.gnu.Emacs                 0x00000001084a7c86 handle_sigsegv + 168
> 21  libsystem_platform.dylib       0x00007fff6b73a42d _sigtramp + 29
> 22  ???                           000000000000000000 0 + 0
> 23  org.gnu.Emacs                 0x00000001084ddd80 mark_object + 272
> 24  org.gnu.Emacs                 0x00000001084ddd80 mark_object + 272

Looks like a crash in GC.
--
Alan Third




bug#38748: 28.0.50; crash on MacOS 10.15.2

Eli Zaretskii
> Date: Thu, 26 Dec 2019 13:04:20 +0000
> From: Alan Third <[hidden email]>
> Cc: [hidden email]
>
> > 20  org.gnu.Emacs                 0x00000001084a7c86 handle_sigsegv + 168
> > 21  libsystem_platform.dylib       0x00007fff6b73a42d _sigtramp + 29
> > 22  ???                           000000000000000000 0 + 0
> > 23  org.gnu.Emacs                 0x00000001084ddd80 mark_object + 272
> > 24  org.gnu.Emacs                 0x00000001084ddd80 mark_object + 272
>
> Looks like a crash in GC.

Yes, but why?

One possibility is stack overflow.  If that's not the reason, then one
needs to employ the technique described in etc/DEBUG to find out which
object got corrupted and why.




bug#38748: 28.0.50; crash on MacOS 10.15.2

Andrii Kolomoiets
Eli Zaretskii <[hidden email]> writes:

>> Date: Thu, 26 Dec 2019 13:04:20 +0000
>> From: Alan Third <[hidden email]>
>> Cc: [hidden email]
>>
>> > 20  org.gnu.Emacs                 0x00000001084a7c86 handle_sigsegv + 168
>> > 21  libsystem_platform.dylib       0x00007fff6b73a42d _sigtramp + 29
>> > 22  ???                           000000000000000000 0 + 0
>> > 23  org.gnu.Emacs                 0x00000001084ddd80 mark_object + 272
>> > 24  org.gnu.Emacs                 0x00000001084ddd80 mark_object + 272
>>
>> Looks like a crash in GC.
>
> Yes, but why?
>
> One possibility is stack overflow.  If that's not the reason, then one
> needs to employ the technique described in etc/DEBUG to find out which
> object got corrupted and why.
I followed the steps described in etc/DEBUG.

Emacs is configured using:
'configure --without-xml2 --with-ns --with-modules
 --disable-ns-self-contained --enable-checking=yes,glyphs
 --enable-check-lisp-object-type 'CFLAGS=-O3 -g3''

See gdb session output attached.

Hope this will help.


gdb-bt-full.txt (5K) Download Attachment

bug#38748: 28.0.50; crash on MacOS 10.15.2

Eli Zaretskii
> From: Andrii Kolomoiets <[hidden email]>
> Cc: Alan Third <[hidden email]>,  [hidden email]
> Date: Fri, 27 Dec 2019 13:28:11 +0200
>
> > One possibility is stack overflow.  If that's not the reason, then one
> > needs to employ the technique described in etc/DEBUG to find out which
> > object got corrupted and why.
>
> I followed the steps described in etc/DEBUG.
>
> See gdb session output attached.

The attachment just shows the output of "bt full", I see nothing there
that should have been produced by following the etc/DEBUG instructions
under "Debugging problems which happen in GC".

Are you sure you posted the file you intended to?

Thanks.




bug#38748: 28.0.50; crash on MacOS 10.15.2

Andrii Kolomoiets
Eli Zaretskii <[hidden email]> writes:

>> From: Andrii Kolomoiets <[hidden email]>
>> Cc: Alan Third <[hidden email]>,  [hidden email]
>> Date: Fri, 27 Dec 2019 13:28:11 +0200
>>
>> > One possibility is stack overflow.  If that's not the reason, then one
>> > needs to employ the technique described in etc/DEBUG to find out which
>> > object got corrupted and why.
>>
>> I followed the steps described in etc/DEBUG.
>>
>> See gdb session output attached.
>
> The attachment just shows the output of "bt full", I see nothing there
> that should have been produced by following the etc/DEBUG instructions
> under "Debugging problems which happen in GC".
>
> Are you sure you posted the file you intended to?

My bad, didn't read that section at all.  I read only "Configuring Emacs
for debugging" section because of this text in `report-emacs-bug'
letter: "If Emacs crashed, include the output from 'bt full' and 'xbacktrace'".

Now Emacs is built with -O0 and I need some help, please.

(gdb) bt full
#0  terminate_due_to_signal (sig=607650026, backtrace_limit=1116) at ../../emacs/src/emacs.c:370
No locals.
#1  0x0000000100a28660 in ?? ()
No symbol table info available.
#2  0x0000000000000000 in ?? ()
No symbol table info available.

Lisp Backtrace:
Cannot access memory at address 0xadf0

I can print the 'last_marked_index':

(gdb) p last_marked_index
$2 = 41

But what can I do with 'last_marked'?

(gdb) p last_marked[40]
'last_marked' has unknown type; cast it to its declared type

Give me some tips, please. TIA.




bug#38748: 28.0.50; crash on MacOS 10.15.2

Eli Zaretskii
> From: Andrii Kolomoiets <[hidden email]>
> Cc: [hidden email],  [hidden email]
> Date: Sun, 29 Dec 2019 21:01:42 +0200
>
> I can print the 'last_marked_index':
>
> (gdb) p last_marked_index
> $2 = 41
>
> But what can I do with 'last_marked'?
>
> (gdb) p last_marked[40]
> 'last_marked' has unknown type; cast it to its declared type

last_marked is an array of Lisp objects, arranged in circular order,
i.e. when the index reaches the last element, it is reset back to
zero.

To print the object at last_marked[i], for some i, you do

  (gdb) p last_marked[i]
  (gdb) xtype

The xtype command will tell you the type of the Lisp object.  You then
display it with the corresponding xTYPE command: xint for an integer,
xcons for a cons cell, xstring for a string, xvector for a vector,
xbuffer for a buffer, etc.  Here's a short example:

  (gdb) p last_marked_index
  $2 = 1
  (gdb) p last_marked[0]
  $3 = XIL(0x8000000006287630)
  (gdb) xtype
  Lisp_String
  (gdb) xstring
  $4 = (struct Lisp_String *) 0x6287630
  " *buffer-defaults*"

So in this example, the last marked object was a Lisp string whose
contents is " *buffer-defaults*".  GDB stores its C definition in
history slot $4, so we can look at its details:

  (gdb) p *$4
  $5 = {
    u = {
      s = {
        size = 18,
        size_byte = -2,
        intervals = 0x0,
        data = 0x19a1dea <DEFAULT_REHASH_SIZE+14054> " *buffer-defaults*"
      },
      next = 0x12,
      gcaligned = 18 '\022'
    }
  }

All of those commands are in src/.gdbinit; if GDB says it doesn't know
these commands, tell it to read that file:

  (gdb) source /path/to/emacs/src/.gdbinit

If last_marked_index is 41, you should print the objects starting from
last_marked[40], going back (39, 38, 37, etc.), trying to find the
object that is corrupted (e.g., the corresponding xTYPE command will
error out trying to display it).
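
As an illustration of the circular order described above, here is a minimal C sketch of such a ring and of walking it backwards from the index.  It is not Emacs code: RING_SIZE, struct ring, ring_record and ring_recent are hypothetical names, and the entries are plain integers rather than Lisp objects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for LAST_MARKED_SIZE; the real value is
   defined in the Emacs sources and may differ.  */
enum { RING_SIZE = 8 };

/* Toy ring of "objects" (plain integers here, not Lisp_Object).  */
struct ring
{
  long slot[RING_SIZE];
  size_t next;   /* plays the role of last_marked_index */
};

/* Record OBJ and advance the index, wrapping to zero at the end.  */
static void
ring_record (struct ring *r, long obj)
{
  r->slot[r->next] = obj;
  r->next = (r->next + 1) % RING_SIZE;
}

/* Return the entry recorded AGE steps ago.  AGE 0 is the most recent
   entry, i.e. the slot just before the index, wrapping backwards.  */
static long
ring_recent (const struct ring *r, size_t age)
{
  assert (age < RING_SIZE);
  size_t i = (r->next + RING_SIZE - 1 - age) % RING_SIZE;
  return r->slot[i];
}
```

With an index of 41, age 0 corresponds to slot 40, age 1 to slot 39, and so on; once the index has wrapped, the walk continues from the last slot of the array.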




bug#38748: 28.0.50; crash on MacOS 10.15.2

Andrii Kolomoiets
Eli Zaretskii <[hidden email]> writes:

>> From: Andrii Kolomoiets <[hidden email]>
>> Cc: [hidden email],  [hidden email]
>> Date: Sun, 29 Dec 2019 21:01:42 +0200
>>
>> I can print the 'last_marked_index':
>>
>> (gdb) p last_marked_index
>> $2 = 41
>>
>> But what can I do with 'last_marked'?
>>
>> (gdb) p last_marked[40]
>> 'last_marked' has unknown type; cast it to its declared type
>
> last_marked is an array of Lisp objects, arranged in circular order,
> i.e. when the index reaches the last element, it is reset back to
> zero.
>
> To print the object at last_marked[i], for some i, you do
>
>   (gdb) p last_marked[i]
>   (gdb) xtype
>
> The xtype command will tell you the type of the Lisp object.  You then
> display it with the corresponding xTYPE command: xint for an integer,
> xcons for a cons cell, xstring for a string, xvector for a vector,
> xbuffer for a buffer, etc.  Here's a short example:
>
>   (gdb) p last_marked_index
>   $2 = 1
>   (gdb) p last_marked[0]
>   $3 = XIL(0x8000000006287630)
>   (gdb) xtype
>   Lisp_String
>   (gdb) xstring
>   $4 = (struct Lisp_String *) 0x6287630
>   " *buffer-defaults*"

I still have no luck printing a last_marked item:

(gdb) p last_marked_index
$1 = 278
(gdb) p last_marked[277]
'last_marked' has unknown type; cast it to its declared type

I don't know if it makes sense, but casting last_marked to Lisp_Object
gives me this:

(gdb) p (Lisp_Object)last_marked
$6 = XIL(0x102dc4203)
(gdb) xtype
Lisp_Cons
(gdb) xcons
$7 = (struct Lisp_Cons *) 0x102dc4200
{
  u = {
    s = {
      car = XIL(0x102a3aa15),
      u = {
        cdr = XIL(0x102dc4213),
        chain = 0x102dc4213
      }
    },
    gcaligned = 0x15
  }
}

But I found the commit after which the error occurs:
b2949d39261e82c33572ba8a250298ef0b165b95

Commenting out that 'ok = false;' line makes Emacs work without errors.

Justin, can you please check whether Emacs prior to that commit works
fine for you?




bug#38748: 28.0.50; crash on MacOS 10.15.2

Eli Zaretskii
> From: Andrii Kolomoiets <[hidden email]>
> Cc: [hidden email],  [hidden email],  [hidden email]
> Date: Wed, 01 Jan 2020 22:42:19 +0200
>
> >   (gdb) p last_marked_index
> >   $2 = 1
> >   (gdb) p last_marked[0]
> >   $3 = XIL(0x8000000006287630)
> >   (gdb) xtype
> >   Lisp_String
> >   (gdb) xstring
> >   $4 = (struct Lisp_String *) 0x6287630
> >   " *buffer-defaults*"
>
> I still have no luck printing a last_marked item:
>
> (gdb) p last_marked_index
> $1 = 278
> (gdb) p last_marked[277]
> 'last_marked' has unknown type; cast it to its declared type

This looks like some compiler bug, or maybe a bug in GDB on your
platform?  Because the source clearly says

   Lisp_Object last_marked[LAST_MARKED_SIZE] EXTERNALLY_VISIBLE;

so the type should be known to GDB.  But this is just an aside.

> But I found the commit after which the error occurs:
> b2949d39261e82c33572ba8a250298ef0b165b95
>
> Commenting out that 'ok = false;' line makes Emacs work without errors.

I cannot explain how that change could cause any harm.  Here's the
relevant code fragment:

      if (CONSP (parent_face))
        {
          Lisp_Object tail;
          ok = false;
          for (tail = parent_face; !NILP (tail); tail = XCDR (tail))
            {
              ok = get_lface_attributes (w, f, XCAR (tail), inherited_attrs,
                                         false, named_merge_points);
              if (!ok)
                break;
              attr_val = face_inherited_attr (w, f, inherited_attrs, attr_idx,
                                              named_merge_points);
              if (!UNSPECIFIEDP (attr_val))
                break;
            }
          if (!ok) /* bad face? */
            break;  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        }
      else
        {
          ok = get_lface_attributes (w, f, parent_face, inherited_attrs,
                                     false, named_merge_points);
          if (!ok)
            break;
          attr_val = inherited_attrs[attr_idx];
        }

Since parent_face is a cons cell, then we enter the for-loop (since a
cons cell cannot be nil), and then we immediately call
get_lface_attributes whose return value overwrites the initial value
of 'ok'.

So how could the initial value of 'ok' matter here?  What am I
missing?
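
The reasoning above can be checked on a toy model.  The following C sketch is not the Emacs code: struct cell, face_is_valid and check_parents are hypothetical names, the validity test stands in for get_lface_attributes, and the second break on the attribute value is left out because it does not assign 'ok'.  It only shows that, for a non-empty list, the value 'ok' holds before the loop cannot affect the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy cons cell; stands in for a Lisp list of parent faces.  */
struct cell
{
  int face_id;
  struct cell *next;
};

/* Hypothetical validity check, standing in for get_lface_attributes.  */
static bool
face_is_valid (int face_id)
{
  return face_id >= 0;
}

/* Walk LIST the way the fragment above does.  INITIAL_OK is the value
   'ok' holds before the loop starts.  */
static bool
check_parents (const struct cell *list, bool initial_ok)
{
  assert (list != NULL);   /* mirrors CONSP (parent_face) */
  bool ok = initial_ok;
  for (const struct cell *tail = list; tail != NULL; tail = tail->next)
    {
      /* The first iteration always runs, so this overwrites INITIAL_OK.  */
      ok = face_is_valid (tail->face_id);
      if (!ok)
        break;
    }
  return ok;
}
```

In this model check_parents returns the same value for both initial values, which is why a behavioral difference in the real build points at something other than the C semantics of that assignment.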

Can you run the unmodified code with a breakpoint on the line
indicated by "<<<<<" above, and see if the breakpoint ever breaks?  If
it does break, can you show the face being merged in this case?

Also, if you build Emacs with exactly the same configure options, but
without optimizations, does the problem persist?

Thanks.




bug#38748: 28.0.50; crash on MacOS 10.15.2

Pieter van Oostrum-2
In reply to this post by Andrii Kolomoiets
Andrii Kolomoiets <[hidden email]> writes:

> But I found the commit after which the error occurs:
> b2949d39261e82c33572ba8a250298ef0b165b95
>
> Commenting out that 'ok = false;' line makes Emacs work without errors.
>
> Justin, can you please check whether Emacs prior to that commit works
> fine for you?

I had Emacs built from master a few days ago, and got the same crashes, about twice a day, often when Emacs was idle.
So I decided to compile from the parent of the commit mentioned above, which is 73f37da12d.

However, this one also crashed, albeit with a different crash. See the attachment.



--
Pieter van Oostrum
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]

Emacs_2020-01-04-165858_Cochabamba.crash (69K) Download Attachment

bug#38748: 28.0.50; crash on MacOS 10.15.2

Alan Third
On Sat, Jan 04, 2020 at 05:48:04PM +0100, Pieter van Oostrum wrote:

> Andrii Kolomoiets <[hidden email]> writes:
>
> > But I found the commit after which the error occurs:
> > b2949d39261e82c33572ba8a250298ef0b165b95
> >
> > Commenting out that 'ok = false;' line makes Emacs work without errors.
> >
> > Justin, can you please check whether Emacs prior to that commit works
> > fine for you?
>
> I had Emacs built from master a few days ago, and got the same crashes, about twice a day, often when Emacs was idle.
> So I decided to compile from the parent of the commit mentioned above, which is 73f37da12d.
>
> However, this one also crashed, albeit with a different crash. See the attachment.
>
> 8   org.gnu.Emacs                 0x00000001011cdb58 handle_fatal_signal + 24
> 9   org.gnu.Emacs                 0x00000001011cdbf2 deliver_thread_signal + 146
> 10  org.gnu.Emacs                 0x00000001011cb3da deliver_fatal_thread_signal + 26
> 11  org.gnu.Emacs                 0x00000001011cdc96 handle_sigsegv + 134
> 12  libsystem_platform.dylib       0x00007fff756adf5a _sigtramp + 26
> 13  ???                           000000000000000000 0 + 0
> 14  org.gnu.Emacs                 0x0000000101053bab Fmouse_pixel_position + 187

Hmm, I made a change to the NS mouse position code recently

fbf9fea4fdad467429058077b8087dbd0758b964

Perhaps that’s related somehow.

--
Alan Third




bug#38748: 28.0.50; crash on MacOS 10.15.2

Pieter van Oostrum-2
Alan Third <[hidden email]> writes:

> On Sat, Jan 04, 2020 at 05:48:04PM +0100, Pieter van Oostrum wrote:
>> Andrii Kolomoiets <[hidden email]> writes:
>>
>> > But I found the commit after which the error occurs:
>> > b2949d39261e82c33572ba8a250298ef0b165b95
>> >
>> > Commenting out that 'ok = false;' line makes Emacs work without errors.
>> >
>> > Justin, can you please check whether Emacs prior to that commit works
>> > fine for you?
>>
>> I had Emacs built from master a few days ago, and got the same
>> crashes, about twice a day, often when Emacs was idle.
>> So I decided to compile from the parent of the commit mentioned above, which is 73f37da12d.
>>
>> However, this one also crashed, albeit with a different crash. See the attachment.
>>
>> 8   org.gnu.Emacs                 0x00000001011cdb58 handle_fatal_signal + 24
>> 9   org.gnu.Emacs                 0x00000001011cdbf2 deliver_thread_signal + 146
>> 10  org.gnu.Emacs                 0x00000001011cb3da deliver_fatal_thread_signal + 26
>> 11  org.gnu.Emacs                 0x00000001011cdc96 handle_sigsegv + 134
>> 12  libsystem_platform.dylib       0x00007fff756adf5a _sigtramp + 26
>> 13  ???                           000000000000000000 0 + 0
>> 14  org.gnu.Emacs                 0x0000000101053bab Fmouse_pixel_position + 187
>
> Hmm, I made a change to the NS mouse position code recently
>
> fbf9fea4fdad467429058077b8087dbd0758b964
>
> Perhaps that’s related somehow.

No. I compiled the version before that (9042ece787cf93665776ffb69893fcb1357aacbe) and it crashed with exactly the same crash. So, no, it must have been introduced before that.
On the other hand, before this I had been working with a version from Dec 1, 2019 (I think 9f2145f42daab13aed5cf89fdb6a7c5579819ec0) and used it for quite some time without crashes, whereas the other versions crashed 1-2 times a day.
--
Pieter van Oostrum
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]




bug#38748: 28.0.50; crash on MacOS 10.15.2

Robert Pluim
In reply to this post by Eli Zaretskii
>>>>> On Thu, 02 Jan 2020 16:06:23 +0200, Eli Zaretskii <[hidden email]> said:

Iʼm now seeing this as well on both master and emacs-27

    Eli> This looks like some compiler bug, or maybe a bug in GDB on your
    Eli> platform?  Because the source clearly says

    Eli>    Lisp_Object last_marked[LAST_MARKED_SIZE] EXTERNALLY_VISIBLE;

    Eli> so the type should be known to GDB.  But this is just an aside.

    >> But I found the commit after which the error occurs:
    >> b2949d39261e82c33572ba8a250298ef0b165b95
    >>
    >> Commenting out that 'ok = false;' line makes Emacs work without errors.

I can confirm this.

    Eli> I cannot explain how that change could cause any harm.  Here's the
    Eli> relevant code fragment:

    Eli>       if (CONSP (parent_face))
    Eli>         {
    Eli>           Lisp_Object tail;
    Eli>           ok = false;
    Eli>           for (tail = parent_face; !NILP (tail); tail = XCDR (tail))
    Eli>             {
    Eli>               ok = get_lface_attributes (w, f, XCAR (tail), inherited_attrs,
    Eli>                                          false, named_merge_points);
    Eli>               if (!ok)
    Eli>                 break;
    Eli>               attr_val = face_inherited_attr (w, f, inherited_attrs, attr_idx,
    Eli>                                               named_merge_points);
    Eli>               if (!UNSPECIFIEDP (attr_val))
    Eli>                 break;
    Eli>             }
    Eli>           if (!ok) /* bad face? */
    Eli>             break;  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    Eli>         }
    Eli>       else
    Eli>         {
    Eli>           ok = get_lface_attributes (w, f, parent_face, inherited_attrs,
    Eli>                                      false, named_merge_points);
    Eli>           if (!ok)
    Eli>             break;
    Eli>           attr_val = inherited_attrs[attr_idx];
    Eli>         }

    Eli> Since parent_face is a cons cell, then we enter the for-loop (since a
    Eli> cons cell cannot be nil), and then we immediately call
    Eli> get_lface_attributes whose return value overwrites the initial value
    Eli> of 'ok'.

    Eli> So how could the initial value of 'ok' matter here?  What am I
    Eli> missing?

    Eli> Can you run the unmodified code with a breakpoint on the line
    Eli> indicated by "<<<<<" above, and see if the breakpoint ever breaks?  If
    Eli> it does break, can you show the face being merged in this case?

It never breaks there for me.

    Eli> Also, if you build Emacs with exactly the same configure options, but
    Eli> without optimizations, does the problem persist?

Yes. Iʼll note that when this happens there are over 9000 stackframes,
so perhaps itʼs stack exhaustion. macOS has a default stack of 8192
kB, Iʼll see if increasing it helps.
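
For reference, the soft stack limit mentioned above can be read from inside a process with the POSIX getrlimit call.  This is only a sketch for checking the number: soft_stack_limit and print_stack_limit are hypothetical helper names, and nothing here is taken from the Emacs sources.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Store the soft stack limit, in bytes, in *BYTES.  Return 0 on
   success and -1 if getrlimit fails.  RLIM_INFINITY is passed
   through unchanged.  */
static int
soft_stack_limit (rlim_t *bytes)
{
  assert (bytes != NULL);
  struct rlimit rl;
  if (getrlimit (RLIMIT_STACK, &rl) != 0)
    return -1;
  *bytes = rl.rlim_cur;
  return 0;
}

/* Print the limit in kB, the unit used above.  */
static void
print_stack_limit (void)
{
  rlim_t bytes;
  if (soft_stack_limit (&bytes) != 0)
    perror ("getrlimit");
  else if (bytes == RLIM_INFINITY)
    puts ("stack limit: unlimited");
  else
    printf ("stack limit: %llu kB\n", (unsigned long long) (bytes / 1024));
}
```

From a shell, 'ulimit -s' reports the same soft limit in kB and can raise it, up to the hard limit, for processes started from that shell.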

Iʼm running under lldb as well, perhaps that will work better with
'last_marked'.

Robert




bug#38748: 28.0.50; crash on MacOS 10.15.2

Pip Cet
On Wed, Jan 8, 2020 at 5:40 PM Robert Pluim <[hidden email]> wrote:
>     >> But I found the commit after which the error occurs:
>     >> b2949d39261e82c33572ba8a250298ef0b165b95
>     >>
>     >> Commenting out that 'ok = false;' line makes Emacs work without errors.
>
> I can confirm this.

I think we should disassemble the two versions and see where the
differences are, unless this is too difficult because of inlining. Can
you provide compiler details?

>     Eli> I cannot explain how that change could cause any harm.  Here's the
>     Eli> relevant code fragment:

>     Eli> So how could the initial value of 'ok' matter here?  What am I
>     Eli> missing?

I think it's likely to be the stack thing; the ok = false might make
the difference between allocating inherited_attrs on the stack once
and doing so once per recursion of face_inherited_attr. The latter
case might lead to a stack overflow more easily.

> Yes. Iʼll note that when this happens there are over 9000 stackframes,
> so perhaps itʼs stack exhaustion. macOS has a default stack of 8192
> kB, Iʼll see if increasing it helps.

That does sound like infinite recursion, or infinite recursion waiting
for something to change asynchronously that breaks the loop. If the
"ok = false" prevents the compiler from recognizing
face_inherited_attr is effectively tail-recursive, that might be it?

Changing the line to "ok = true" would be an interesting experiment.




bug#38748: 28.0.50; crash on MacOS 10.15.2

Eli Zaretskii
> From: Pip Cet <[hidden email]>
> Date: Wed, 8 Jan 2020 19:18:15 +0000
> Cc: Eli Zaretskii <[hidden email]>, [hidden email], [hidden email],
> Andrii Kolomoiets <[hidden email]>, [hidden email]
>
> > Yes. Iʼll note that when this happens there are over 9000 stackframes,
> > so perhaps itʼs stack exhaustion. macOS has a default stack of 8192
> > kB, Iʼll see if increasing it helps.
>
> That does sound like infinite recursion, or infinite recursion waiting
> for something to change asynchronously that breaks the loop.

No, GC is known to take many thousands of recursive calls to
mark_object.  9000 is not a particularly high number, and doesn't
necessarily signal infinite recursion.
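
To illustrate why a deep stack is not by itself a sign of runaway recursion, here is a toy recursive marker in C.  It is not Emacs's mark_object: struct obj and mark are hypothetical, and the choice to recurse on one reference while iterating on the other is an assumption of this sketch.  The recursion depth simply follows the shape of the data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy heap object with two references, loosely like a cons cell.  */
struct obj
{
  bool marked;
  struct obj *car;
  struct obj *cdr;
};

/* Mark everything reachable from O.  Recurse on car, iterate on cdr.
   DEPTH is the current recursion depth; *MAX_DEPTH records the
   deepest level at which an object was marked.  */
static void
mark (struct obj *o, size_t depth, size_t *max_depth)
{
  assert (max_depth != NULL);
  while (o != NULL && !o->marked)
    {
      o->marked = true;
      if (depth > *max_depth)
        *max_depth = depth;
      mark (o->car, depth + 1, max_depth);
      o = o->cdr;
    }
}
```

A chain of 9000 objects linked through car needs 9000 nested calls to mark even though nothing is wrong with the data, while the same number of objects linked through cdr is handled at depth 1.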




bug#38748: 28.0.50; crash on MacOS 10.15.2

Pip Cet
On Wed, Jan 8, 2020 at 7:58 PM Eli Zaretskii <[hidden email]> wrote:
> > > Yes. Iʼll note that when this happens there are over 9000 stackframes,
> > > so perhaps itʼs stack exhaustion. macOS has a default stack of 8192
> > > kB, Iʼll see if increasing it helps.
> > That does sound like infinite recursion, or infinite recursion waiting
> > for something to change asynchronously that breaks the loop.
> No, GC is known to take many thousands of recursive calls to
> mark_object.  9000 is not a particularly high number, and doesn't
> necessarily signal infinite recursion.

In general, you're absolutely correct. But in this case, it still
sounds very likely: infinite recursion of a properly tail-recursive
function would loop rather than cause a stack overflow, which would
explain everything, except for why it's not actually an infinite loop;
I suspect the macOS code somewhere does modify things asynchronously.
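
A small C sketch of the distinction drawn above, with hypothetical function names unrelated to the Emacs sources: a tail-recursive function next to the loop that tail-call elimination effectively turns it into.  Whether a given compiler performs that transformation depends on the compiler and the optimization level, so this only shows the two forms compute the same thing.

```c
#include <assert.h>

/* Tail-recursive form: the recursive call is the last thing done, so
   no work remains in the caller's frame afterwards.  Without
   tail-call elimination each call still consumes a stack frame.  */
static unsigned long
count_down_rec (unsigned long n, unsigned long acc)
{
  if (n == 0)
    return acc;
  return count_down_rec (n - 1, acc + 1);
}

/* Loop form: the same computation using constant stack space.  */
static unsigned long
count_down_loop (unsigned long n, unsigned long acc)
{
  while (n != 0)
    {
      n--;
      acc++;
    }
  return acc;
}

/* Check that both forms agree for N.  */
static void
check_equivalent (unsigned long n)
{
  assert (count_down_rec (n, 0) == count_down_loop (n, 0));
}
```

If the termination condition never became true, the loop form would spin forever, whereas the recursive form compiled without tail-call elimination would eventually exhaust the stack.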




bug#38748: 28.0.50; crash on MacOS 10.15.2

Robert Pluim
In reply to this post by Pip Cet
>>>>> On Wed, 8 Jan 2020 19:18:15 +0000, Pip Cet <[hidden email]> said:

    Pip> On Wed, Jan 8, 2020 at 5:40 PM Robert Pluim <[hidden email]> wrote:
    >> >> But I found the commit after which the error occurs:
    >> >> b2949d39261e82c33572ba8a250298ef0b165b95
    >> >>
    >> >> Commenting out that 'ok = false;' line makes Emacs work without errors.
    >>
    >> I can confirm this.

    Pip> I think we should disassemble the two versions and see where the
    Pip> differences are, unless this is too difficult because of inlining. Can
    Pip> you provide compiler details?

gcc --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Iʼve attached the disassembly of the two versions. They're very very
similar (this is with -g3 -O0).

    Eli> I cannot explain how that change could cause any harm.  Here's the
    Eli> relevant code fragment:

    Eli> So how could the initial value of 'ok' matter here?  What am I
    Eli> missing?

    Pip> I think it's likely to be the stack thing; the ok = false might make
    Pip> the difference between allocating inherited_attrs on the stack once
    Pip> and doing so once per recursion of face_inherited_attr. The latter
    Pip> case might lead to a stack overflow more easily.

The allocation of inherited_attrs is the same in both.

    >> Yes. Iʼll note that when this happens there are over 9000 stackframes,
    >> so perhaps itʼs stack exhaustion. macOS has a default stack of 8192
    >> kB, Iʼll see if increasing it helps.

    Pip> That does sound like infinite recursion, or infinite recursion waiting
    Pip> for something to change asynchronously that breaks the loop. If the
    Pip> "ok = false" prevents the compiler from recognizing
    Pip> face_inherited_attr is effectively tail-recursive, that might be it?

    Pip> Changing the line to "ok = true" would be an interesting experiment.

Hmm, yes. Iʼll try that.

BTW, running under lldb, last_marked can be accessed successfully, but
of course under lldb you donʼt get all the nice commands from
.gdbinit. Iʼd build a newer version of gdb, but signing binaries on
macOS is a real hassle.

Robert


modified.txt (9K) Download Attachment
unmodified.txt (9K) Download Attachment

bug#38748: 28.0.50; crash on MacOS 10.15.2

Pip Cet
On Wed, Jan 8, 2020 at 9:43 PM Robert Pluim <[hidden email]> wrote:
> gcc --version
> Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
> Apple LLVM version 10.0.1 (clang-1001.0.46.4)
> Target: x86_64-apple-darwin18.7.0
> Thread model: posix
> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
>
> Iʼve attached the disassembly of the two versions. They're very very
> similar (this is with -g3 -O0).

But wait, doesn't the bug happen in both unoptimized versions? I
should have been clearer: my suspicion is the bug only goes away if
tail calls are optimized, which happens only with optimizations
enabled.




bug#38748: 28.0.50; crash on MacOS 10.15.2

Robert Pluim
>>>>> On Wed, 8 Jan 2020 22:18:11 +0000, Pip Cet <[hidden email]> said:

    Pip> On Wed, Jan 8, 2020 at 9:43 PM Robert Pluim <[hidden email]> wrote:
    >> gcc --version
    >> Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
    >> Apple LLVM version 10.0.1 (clang-1001.0.46.4)
    >> Target: x86_64-apple-darwin18.7.0
    >> Thread model: posix
    >> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
    >>
    >> Iʼve attached the disassembly of the two versions. They're very very
    >> similar (this is with -g3 -O0).

    Pip> But wait, doesn't the bug happen in both unoptimized versions? I
    Pip> should have been clearer: my suspicion is the bug only goes away if
    Pip> tail calls are optimized, which happens only with optimizations
    Pip> enabled.

No, it only happens with the initialisation of 'ok', optimised or not.

As another data point, Iʼm writing this from an emacs with 'ok =
true', which has not crashed yet....

Robert




bug#38748: 28.0.50; crash on MacOS 10.15.2

Eli Zaretskii
In reply to this post by Pip Cet
> From: Pip Cet <[hidden email]>
> Date: Wed, 8 Jan 2020 20:39:43 +0000
> Cc: [hidden email], [hidden email], [hidden email],
> [hidden email], [hidden email]
>
> > No, GC is known to take many thousands of recursive calls to
> > mark_object.  9000 is not a particularly high number, and doesn't
> > necessarily signal infinite recursion.
>
> In general, you're absolutely correct. But in this case, it still
> sounds very likely: infinite recursion of a properly tail-recursive
> function would loop rather than cause a stack overflow, which would
> explain everything, except for why it's not actually an infinite loop;
> I suspect the macOS code somewhere does modify things asynchronously.

The backtrace shows a very recursive GC; it doesn't show any other
function being deeply recursive.  So I'm not sure I understand which
tail-recursive function you had in mind.  Can you elaborate?


