Emacs-26 threads problem [win64]

classic Classic list List threaded Threaded
36 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Emacs-26 threads problem [win64]

Fabrice Popineau-3
There is a segmentation fault on ms-windows when running threads tests :

$ HOME=/nonexistent EMACSLOADPATH= LC_ALL=C EMACS_TEST_DIRECTORY=/d/Source/emacs/emacs/test "../src/emacs" -batch --no-site-file --no-site-lisp -L ";../../emacs/test" -l ert -l src/thread-tests.el --eval "(ert-run-tests-batch-and-exit (quote (not (tag :expensive-test))))"
Running 28 tests (2017-10-09 21:45:59+0200)
   passed   1/28  thread-errors
   passed   2/28  thread-signal-early
   passed   3/28  thread-sticky-point
   passed   4/28  threads-alive
   passed   5/28  threads-all-threads
   passed   6/28  threads-basic
   passed   7/28  threads-condvar-mutex
   passed   8/28  threads-condvar-name
   passed   9/28  threads-condvar-name-2
   passed  10/28  threads-condvar-type
   passed  11/28  threads-condvar-wait
   passed  12/28  threads-condvarp
   passed  13/28  threads-condvarp-2
   passed  14/28  threads-io-switch
   passed  15/28  threads-is-one
   passed  16/28  threads-join
   passed  17/28  threads-join-self
   passed  18/28  threads-let-binding
Segmentation fault

This is with the latest emacs-26 sources.

Fabrice
Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Eli Zaretskii
> From: Fabrice Popineau <[hidden email]>
> Date: Mon, 9 Oct 2017 21:48:39 +0200
>
> passed 18/28 threads-let-binding
> Segmentation fault
>
> This is with the latest emacs-26 sources.

Doesn't segfault for me, so it could be something specific to the
64-bit build.  Can you attach a debugger, show the backtrace, and look
around to find the immediate reason for the segfault?

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Fabrice Popineau-3
Here it is :

[New Thread 21364.0x3eb8]

Thread 15 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 21364.0x3eb8]
0x00000004001ecb07 in run_thread (state=0x400ae5068 <dumped_data+4654312>) at ../../emacs/src/thread.c:700
700       self->m_stack_bottom = self->stack_top = (char *) &stack_pos;
(gdb) p self
$1 = (struct thread_state *) 0x400ae5068 <dumped_data+4654312>
(gdb) p *self
$2 = {header = {size = 4611686018662539271}, m_last_thing_searched = {i = 0}, m_saved_last_thing_searched = {i = 0}, name = {
    i = 0}, function = {i = -17087155184}, error_symbol = {i = 0}, error_data = {i = 0}, event_object = {i = 0},
  m_stack_bottom = 0x0, stack_top = 0x0, m_catchlist = 0x0, m_handlerlist = 0x0, m_handlerlist_sentinel = 0x0,
  m_specpdl_size = 50, m_specpdl = 0x5f270b8, m_specpdl_ptr = 0x5f270b8, m_lisp_eval_depth = 0, m_current_buffer = 0x5f26c80,
  m_search_regs = {num_regs = 0, start = 0x0, end = 0x0}, m_search_regs_saved = false, m_saved_search_regs = {num_regs = 0,
    start = 0x0, end = 0x0}, m_re_match_object = {i = 0}, m_waiting_for_user_input_p = 0, m_waiting_for_input = false,
  m_getcjmp = {{Part = {0, 0}} <repeats 16 times>}, thread_id = 0, thread_condvar = {wait_count = 0, wait_count_lock = {
      DebugInfo = 0xffffffffffffffff, LockCount = -1, RecursionCount = 0, OwningThread = 0x0, LockSemaphore = 0x0,
      SpinCount = 33556432}, events = {0x40c, 0x410}, initialized = true}, wait_condvar = 0x0, not_holding_lock = 0,
  next_thread = 0x400664500 <main_thread>}
(gdb) p stack_pos
$3 = {__max_align_ll = 0, __max_align_ld = 3.587554638101247699761989924457637e-4943}

I am not sure what causes the segfault.
m_stack_bottom is of type char* but stack_top is of type void*. Could that be a problem ?

Fabrice


2017-10-10 8:25 GMT+02:00 Eli Zaretskii <[hidden email]>:
> From: Fabrice Popineau <[hidden email]>
> Date: Mon, 9 Oct 2017 21:48:39 +0200
>
> passed 18/28 threads-let-binding
> Segmentation fault
>
> This is with the latest emacs-26 sources.

Doesn't segfault for me, so it could be something specific to the
64-bit build.  Can you attach a debugger, show the backtrace, and look
around to find the immediate reason for the segfault?

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Eli Zaretskii
> From: Fabrice Popineau <[hidden email]>
> Date: Tue, 10 Oct 2017 18:13:19 +0200
> Cc: Emacs developers <[hidden email]>
>
> (gdb) p stack_pos
> $3 = {__max_align_ll = 0, __max_align_ld = 3.587554638101247699761989924457637e-4943}

What is &stack_pos, the address of stack_pos?  Is it properly aligned?

Also, can you tell what Windows exception was translated into SIGSEGV?
I think GDB will show that information if you issue the following
command before running the segfaulting program:

  (gdb) set debugexceptions 1

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Eli Zaretskii
> Date: Tue, 10 Oct 2017 19:30:11 +0300
> From: Eli Zaretskii <[hidden email]>
> Cc: [hidden email]
>
> > From: Fabrice Popineau <[hidden email]>
> > Date: Tue, 10 Oct 2017 18:13:19 +0200
> > Cc: Emacs developers <[hidden email]>
> >
> > (gdb) p stack_pos
> > $3 = {__max_align_ll = 0, __max_align_ld = 3.587554638101247699761989924457637e-4943}
>
> What is &stack_pos, the address of stack_pos?  Is it properly aligned?

Also, this very snippet runs every time a non-main thread is run by
Emacs, and yet all of the 18 tests before this one succeeded for you.
What is different in this 19th test that it fails?

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Fabrice Popineau-3
In reply to this post by Eli Zaretskii


2017-10-10 18:30 GMT+02:00 Eli Zaretskii <[hidden email]>:
> From: Fabrice Popineau <[hidden email]>
> Date: Tue, 10 Oct 2017 18:13:19 +0200
> Cc: Emacs developers <[hidden email]>
>
> (gdb) p stack_pos
> $3 = {__max_align_ll = 0, __max_align_ld = 3.587554638101247699761989924457637e-4943}

What is &stack_pos, the address of stack_pos?  Is it properly aligned?


(gdb) p &stack_pos
$1 = (max_align_t *) 0xbc1fec0
(gdb) p (char *)&stack_pos
$2 = 0xbc1fec0 "" 

I don't see anything wrong here ?

Also, can you tell what Windows exception was translated into SIGSEGV?
I think GDB will show that information if you issue the following
command before running the segfaulting program:

  (gdb) set debugexceptions 1

[New Thread 1144.0x5860]
gdb: Target exception EXCEPTION_ACCESS_VIOLATION at 0x4001ecb07

Fabrice 

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Fabrice Popineau-3
In reply to this post by Eli Zaretskii


2017-10-10 18:48 GMT+02:00 Eli Zaretskii <[hidden email]>:
> Date: Tue, 10 Oct 2017 19:30:11 +0300
> From: Eli Zaretskii <[hidden email]>
> Cc: [hidden email]
>
> > From: Fabrice Popineau <[hidden email]>
> > Date: Tue, 10 Oct 2017 18:13:19 +0200
> > Cc: Emacs developers <[hidden email]>
> >
> > (gdb) p stack_pos
> > $3 = {__max_align_ll = 0, __max_align_ld = 3.587554638101247699761989924457637e-4943}
>
> What is &stack_pos, the address of stack_pos?  Is it properly aligned?

Also, this very snippet runs every time a non-main thread is run by
Emacs, and yet all of the 18 tests before this one succeeded for you.
What is different in this 19th test that it fails?

What does this test do exactly ? (Well I could read the code but ... doing many other things in parallel)

Because if I print &stack_pos, I get :

(gdb) run -batch --no-site-file --no-site-lisp -L ";../../emacs/test" -l ert -l src/thread-tests.el --eval "(ert-run-tests-batch-and-exit (quote (not (tag :expensive-test))))"
Starting program: /d/Source/emacs/build-emacs-26/src/emacs -batch --no-site-file --no-site-lisp -L ";../../emacs/test" -l ert -l src/thread-tests.el --eval "(ert-run-tests-batch-and-exit (quote (not (tag :expensive-test))))"
[New Thread 20864.0x60c]
[New Thread 20864.0x5bb4]
[New Thread 20864.0x3070]
[New Thread 20864.0x332c]
[New Thread 20864.0x512c]
Running 28 tests (2017-10-10 19:54:11+0100)
[New Thread 20864.0x42ec]
[Thread 20864.0x42ec exited with code 0]
[New Thread 20864.0x25e8]
stack_pos = 741feb0
stack_pos = 7c1feb0
   passed   1/28  thread-errors
[New Thread 20864.0x1118]
[Thread 20864.0x25e8 exited with code 0]
[Thread 20864.0x1118 exited with code 0]
stack_pos = 841feb0
   passed   2/28  thread-signal-early
   passed   3/28  thread-sticky-point
[New Thread 20864.0x562c]
stack_pos = 8c1feb0
   passed   4/28  threads-alive
   passed   5/28  threads-all-threads
[New Thread 20864.0x471c]
[Thread 20864.0x562c exited with code 0]
[Thread 20864.0x471c exited with code 0]
stack_pos = 941feb0
   passed   6/28  threads-basic
   passed   7/28  threads-condvar-mutex
   passed   8/28  threads-condvar-name
   passed   9/28  threads-condvar-name-2
   passed  10/28  threads-condvar-type
[New Thread 20864.0x2970]
[Thread 20864.0x2970 exited with code 0]
stack_pos = 9c1feb0
   passed  11/28  threads-condvar-wait
   passed  12/28  threads-condvarp
   passed  13/28  threads-condvarp-2
[New Thread 20864.0x16a0]
[Thread 20864.0x16a0 exited with code 0]
stack_pos = a41feb0
   passed  14/28  threads-io-switch
   passed  15/28  threads-is-one
[New Thread 20864.0x4e68]
[Thread 20864.0x4e68 exited with code 0]
stack_pos = ac1feb0
   passed  16/28  threads-join
   passed  17/28  threads-join-self
[New Thread 20864.0x63f4]
[Thread 20864.0x63f4 exited with code 0]
stack_pos = b41feb0
   passed  18/28  threads-let-binding
[New Thread 20864.0x4340]

Thread 15 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 20864.0x4340]
run_thread (state=0x400ae5e78 <dumped_data+4670360>) at ../../emacs/src/thread.c:702
702       self->m_stack_bottom = (char *) &stack_pos;


The value of stack_pos keeps  increasing. Isn't there some limit that is reached here ?

Fabrice

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Eli Zaretskii
> From: Fabrice Popineau <[hidden email]>
> Date: Tue, 10 Oct 2017 20:57:20 +0200
> Cc: Emacs developers <[hidden email]>
>
> What does this test do exactly ? (Well I could read the code but ... doing many other things in parallel)

It executes the thread-related primitives by launching one or more
threads in each test.

> The value of stack_pos keeps increasing. Isn't there some limit that is reached here ?

Maybe.  I don't see anything like that, the stack_pos address remains
almost fixed here.  I think what you see means the threads created by
the tests don't exit, so the following tests create threads with stack
space higher and higher in the address space.

Does the problem go away if you insert at the beginning of each test a
loop like this:

  (while (> (length (all-threads)) 1)
    (sleep-for 0.1))

If this doesn't help, we will have to look at the other threads in a
debugger, to see where they are stuck, which prevents them from
exiting.

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Fabrice Popineau-3


2017-10-10 21:39 GMT+02:00 Eli Zaretskii <[hidden email]>:

> The value of stack_pos keeps increasing. Isn't there some limit that is reached here ?

Maybe.  I don't see anything like that, the stack_pos address remains
almost fixed here.  I think what you see means the threads created by
the tests don't exit, so the following tests create threads with stack
space higher and higher in the address space.

Does the problem go away if you insert at the beginning of each test a
loop like this:

  (while (> (length (all-threads)) 1)
    (sleep-for 0.1))

This makes all the 28 tests pass. A bit scary though.

Fabrice

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Eli Zaretskii
> From: Fabrice Popineau <[hidden email]>
> Date: Tue, 10 Oct 2017 21:58:35 +0200
> Cc: Emacs developers <[hidden email]>
>
>  Does the problem go away if you insert at the beginning of each test a
>  loop like this:
>
>  (while (> (length (all-threads)) 1)
>  (sleep-for 0.1))
>
> This makes all the 28 tests pass.

And the address of stack_pos doesn't grow anymore?

> A bit scary though.

Why scary?

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Fabrice Popineau-3


2017-10-10 22:09 GMT+02:00 Eli Zaretskii <[hidden email]>:
> From: Fabrice Popineau <[hidden email]>
> Date: Tue, 10 Oct 2017 21:58:35 +0200
> Cc: Emacs developers <[hidden email]>
>
>  Does the problem go away if you insert at the beginning of each test a
>  loop like this:
>
>  (while (> (length (all-threads)) 1)
>  (sleep-for 0.1))
>
> This makes all the 28 tests pass.

And the address of stack_pos doesn't grow anymore?

Yes, it still does. Apparently not out of bounds, because I have run the tests several times. 

> A bit scary though.

Why scary?

I don't like to be at the whim of such a waiting loop before being able to run another thread.

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Eli Zaretskii
> From: Fabrice Popineau <[hidden email]>
> Date: Tue, 10 Oct 2017 22:14:58 +0200
> Cc: Emacs developers <[hidden email]>
>
>  > (while (> (length (all-threads)) 1)
>  > (sleep-for 0.1))
>  >
>  > This makes all the 28 tests pass.
>
>  And the address of stack_pos doesn't grow anymore?
>
> Yes, it still does. Apparently not out of bounds, because I have run the tests several times.

Can you show the values?  What I see here is that it goes up one or 2
notches up to test #6, and then goes back and never grows in the rest
of the tests.

>  > A bit scary though.
>
>  Why scary?
>
> I don't like to be at the whim of such a waiting loop before being able to run another thread.

Since each thread uses up some stack space, why is it surprising that
given enough threads running, we get a C stack overflow?

Perhaps it's time to discuss whether we should place some restrictions
on the thread stack space.  Right now, we leave the stack size of the
thread to the system defaults, which in the Windows case means 8MB,
the size specified for the main program in the executable's header.
AFAIK, on Unix the default is 2MB, but can be changed via RLIMIT_STACK
setting.  Maybe these 2 values are too much, but OTOH these threads
need to be able to run the same Lisp programs as the main thread, so
it's unclear whether we can safely decrease the stack size.

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Eli Zaretskii
In reply to this post by Fabrice Popineau-3
> From: Fabrice Popineau <[hidden email]>
> Date: Tue, 10 Oct 2017 22:14:58 +0200
> Cc: Emacs developers <[hidden email]>
>
>  >  Does the problem go away if you insert at the beginning of each test a
>  >  loop like this:
>  >
>  >  (while (> (length (all-threads)) 1)
>  >  (sleep-for 0.1))
>  >
>  > This makes all the 28 tests pass.
>
>  And the address of stack_pos doesn't grow anymore?
>
> Yes, it still does. Apparently not out of bounds, because I have run the tests several times.
>
>  > A bit scary though.
>
>  Why scary?
>
> I don't like to be at the whim of such a waiting loop before being able to run another thread.

We've discussed this issue off-line with Fabrice, and the results of
the investigation seem to point to a GCC bug.  Here are the details:

The faulting instruction is this:

  706       self->m_stack_bottom = (char *) &stack_pos;
  => 0x4001ec91e <run_thread+158>:        movaps %xmm0,0x40(%rbx)

The faulting address:

  (gdb) p/x $rbx
  $1 = 0x400ae5e78
  (gdb) p/x self
  $2 = 0x400ae5e78
  (gdb) p/x $rbx+0x40
  $3 = 0x400ae5eb8
  (gdb) x/x $rbx+0x40
  0x400ae5eb8 <dumped_data+4670424>:      0x00000000

Note that GDB has no problems accessing that address, whereas Emacs
segfaulted in an instruction that accessed it.

The documentation of the MOVAPS instruction says this:

  When the source or destination operand is a memory operand, the
  operand must be aligned on a 16-byte (128-bit version) or 32-byte
  (VEX.256 encoded version) boundary or a general-protection exception
  (#GP) will be generated.

And sure enough, 0x400ae5eb8 is not 16-byte aligned.  Fabrice then
added a dummy member to the tread_state struct:

  diff --git a/src/thread.h b/src/thread.h
  index cb2133d72d..fd06a03071 100644
  --- a/src/thread.h
  +++ b/src/thread.h
  @@ -56,6 +56,8 @@ struct thread_state
        waiting on.  */
     Lisp_Object event_object;

  +  Lisp_Object foobar;
  +
     /* m_stack_bottom must be the first non-Lisp field.  */
     /* An address near the bottom of the stack.
        Tells GC how to save a copy of the stack.  */

and the problem went away.

Further investigation shows that the unaligned address comes from this
fragment of make-thread:

  new_thread = ALLOCATE_PSEUDOVECTOR (struct thread_state, m_stack_bottom,
                                      PVEC_THREAD);

The address of the new_thread structure is then passed as argument to
run_thread a few lines below:

  if (! sys_thread_create (&thr, c_name, run_thread, new_thread))

So it looks like new_thread could be aligned at just 8 bytes, but GCC
assumes for some reason that it's 16-byte aligned, and issues an
instruction that requires such an alignment.  That sounds like a GCC
bug, doesn't it?  Or do heap allocation functions on Posix hosts
return buffers that are always 16-byte aligned?

Note that ALLOCATE_PSEUDOVECTOR could sub-allocate via
allocate_vector_from_block: is this guaranteed to be 16-byte aligned?

I'm not sure how to resolve this issue, so ideas and comments on the
analysis are welcome.

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Richard Copley-2
> I'm not sure how to resolve this issue, so ideas and comments on the
> analysis are welcome.

I think GCC's assumption that 0x40(%rbx) is 16-byte aligned derives
through some arithmetic from the assumption that the stack was 16-byte
aligned on entry. Can you try adding
    __attribute__ ((force_align_arg_pointer))
to the thread function?


__attribute__ ((force_align_arg_pointer))
static void *
run_thread (void *state)
{

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Paul Eggert
In reply to this post by Eli Zaretskii
On 10/13/2017 11:52 AM, Eli Zaretskii wrote:
> So it looks like new_thread could be aligned at just 8 bytes, but GCC
> assumes for some reason that it's 16-byte aligned, and issues an
> instruction that requires such an alignment.

It's not a GCC bug. It's a bug in the way that Emacs allocates
pseudovectors. We assume that an alignment of 8 is enough, but
apparently this is incorrect for the struct thread type under
MS-Windows. I'll see if I can squeeze some time free to look into this.


Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Eli Zaretskii
In reply to this post by Richard Copley-2
> From: Richard Copley <[hidden email]>
> Date: Fri, 13 Oct 2017 23:00:56 +0100
> Cc: Fabrice Popineau <[hidden email]>, Emacs Development <[hidden email]>
>
> I think GCC's assumption that 0x40(%rbx) is 16-byte aligned derives
> through some arithmetic from the assumption that the stack was 16-byte
> aligned on entry. Can you try adding
>     __attribute__ ((force_align_arg_pointer))
> to the thread function?

Thanks, we already tried that.  It didn't work, because the offending
address doesn't come from stack, it comes from the heap.

(Nevertheless, the thread function should have this attribute on
Windows for the 32-bit builds to have the correct stack alignment.  I
will do that soon.  But Fabrice's build is 64-bit, where that
alignment is guaranteed by the ABI.)

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Eli Zaretskii
In reply to this post by Paul Eggert
> Cc: [hidden email]
> From: Paul Eggert <[hidden email]>
> Date: Fri, 13 Oct 2017 15:28:40 -0700
>
> It's not a GCC bug. It's a bug in the way that Emacs allocates
> pseudovectors. We assume that an alignment of 8 is enough, but
> apparently this is incorrect for the struct thread type under
> MS-Windows.

Sorry, I don't understand the nature of the bug.  Does GCC assume that
each struct starts at a 16-byte boundary?  (The fact that it issues
the movaps instruction seems to be evidence of such an assumption.)
If so, what ABI makes that assumption valid?

> I'll see if I can squeeze some time free to look into this.

Thanks.

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Paul Eggert
Eli Zaretskii wrote:
> Does GCC assume that each struct starts at a 16-byte boundary?

No, only structs needing such an alignment, as is the case here.

Fabrice, I installed the attached patch into the emacs-26 branch; would you
please give it a try? Thanks.

0001-Do-not-under-align-pseudovectors.patch (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Eli Zaretskii
> Cc: [hidden email], [hidden email]
> From: Paul Eggert <[hidden email]>
> Date: Sat, 14 Oct 2017 01:05:59 -0700
>
> Eli Zaretskii wrote:
> > Does GCC assume that each struct starts at a 16-byte boundary?
>
> No, only structs needing such an alignment, as is the case here.

What is special about this specific struct that it needs such an
alignment?

I guess what I'm asking is why did GCC emit the movaps instruction for
assigning a value to a struct member in this case?  To do that, GCC
must have decided that this particular struct should be 16-byte
aligned, but what was the rationale for such a decision?

Thanks.

Reply | Threaded
Open this post in threaded view
|

Re: Emacs-26 threads problem [win64]

Andreas Schwab-2
In reply to this post by Paul Eggert
On Okt 14 2017, Paul Eggert <[hidden email]> wrote:

> diff --git a/src/alloc.c b/src/alloc.c
> index 2e6399e..da0c3ad 100644
> --- a/src/alloc.c
> +++ b/src/alloc.c
> @@ -2923,9 +2923,13 @@ set_next_vector (struct Lisp_Vector *v, struct Lisp_Vector *p)
>  
>  enum
>    {
> -    /* Alignment of struct Lisp_Vector objects.  */
> -    vector_alignment = COMMON_MULTIPLE (FLEXALIGNOF (struct Lisp_Vector),
> - GCALIGNMENT),
> +    /* Alignment of struct Lisp_Vector objects.  Because pseudovectors
> +       can contain any C type, align at least as strictly as
> +       max_align_t.  On x86 and x86-64 this can waste up to 8 bytes
> +       for typical vectors, since alignof (max_align_t) is 16 but
> +       typical vectors need only an alignment of 8.  However, it is
> +       not worth the hassle to avoid wasting those bytes.  */

That can be a lot of wastage for small vectors.

Andreas.

--
Andreas Schwab, [hidden email]
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

12