wait_reading_process_ouput hangs in certain cases (w/ patches)

Matthias Dahl
Hello all,

Recursively calling accept-process-output, with the first call waiting
for the output of a specific process, can hang Emacs (as in: it waits
forever, though it can of course be canceled) even though it should not,
since data was in fact read.

This is actually not all that uncommon. One example of this is a hang
seen with Magit opening its COMMIT_MSG buffer, reported here [1]. I've
myself run into this problem continuously which is why I started to
debug it in the first place.

The hang with Magit happens in flyspell.el which waits for output from
its spellchecker process through accept-process-output and specifies
that specific process as wait_proc. Now depending on timing (race),
wait_reading_process_output can call the pending timers... which in
turn can call accept-process-output again. This almost always leads
to the spellchecker output being read back in full, so there is no
more data left to be read. Thus the original accept-process-output,
which called wait_reading_process_output, will wait for the data to
become available forever since it has no way to know that those have
already been read.

Naturally one could argue that a timeout should be specified when
calling accept-process-output. But nevertheless I still think this is
a breach of contract. The caller expects accept-process-output to
return, as promised, once data has been read from the specified process,
yet that is clearly not always the case: it can hang forever, depending
on timing and the specifics of the data being read.

The attached patches fix this by introducing process output read
accounting -- simply counting the bytes read per process. And using
that data to strategically check in wait_reading_process_output for
data being read while we handed over control to timers and/or filters.
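
In rough sketch form, the idea looks like this (simplified and abridged;
the attached patches are the authoritative version, and exactly where
the counter gets incremented is glossed over here):

  /* Each process carries a running byte count (the patches call it
     infd_num_bytes_read).  Wherever output is actually read from the
     process, the counter is bumped:  */
  p->infd_num_bytes_read += nbytes;

  /* wait_reading_process_output remembers that count for wait_proc
     before handing control to timers/filters ...  */
  unsigned long initial = wait_proc ? wait_proc->infd_num_bytes_read : 0;

  /* ... and if the count has moved, some nested call already consumed
     the output we are waiting for, so we stop waiting instead of
     hanging:  */
  if (wait_proc && wait_proc->infd_num_bytes_read != initial)
    break;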

I haven't seen any ill side effects in my tests, and it clearly fixes
the problem seen in [1] as well as, I would wager, quite a few others
experienced by users of all kinds of setups that seemed unpredictable
and mysterious and were never debugged.

As a side-note: I decided against an artificial metric and went with
simply counting the bytes read, since it comes in handy when debugging
to be able to see how much data was read from a process during specific
time intervals.

Also, this still leaves the possibility that wait_reading_process_output
could wait forever while being called without wait_proc and a timeout
set. This could be mitigated as well by some sort of a tick counter that
only increases when no wait_proc was specified and data from processes
were processed. I decided against implementing that for now since imho
the chances of that happening are marginal, if at all present. OTOH,
the semantics in that case are not all that clear and would further add
complexity to an already rather unhealthy function. I am naturally open
to other opinions and implementing this as well if requested. :-)

Any suggestions and/or comments are very welcome, as always.

Thanks for the patience to read this longish post. :-)

So long,
Matthias

[1] https://github.com/magit/magit/issues/2915

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu

Attachment: 0001-Add-process-output-read-accounting.patch (1K)
Attachment: 0002-src-process.c-wait_reading_process_output-Fix-wait_p.patch (3K)

Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Eli Zaretskii
> From: Matthias Dahl <[hidden email]>
> Date: Tue, 24 Oct 2017 20:52:20 +0200
>
> Recursively calling accept-process-output, with the first call waiting
> for the output of a specific process, can hang Emacs (as in: it waits
> forever, though it can of course be canceled) even though it should not,
> since data was in fact read.
>
> This is actually not all that uncommon. One example of this is a hang
> seen with Magit opening its COMMIT_MSG buffer, reported here [1]. I've
> myself run into this problem continuously which is why I started to
> debug it in the first place.
>
> The hang with Magit happens in flyspell.el which waits for output from
> its spellchecker process through accept-process-output and specifies
> that specific process as wait_proc. Now depending on timing (race),
> wait_reading_process_output can call the pending timers... which in
> turn can call accept-process-output again. This almost always leads
> to the spellchecker output being read back in full, so there is no
> more data left to be read. Thus the original accept-process-output,
> which called wait_reading_process_output, will wait forever for the
> data to become available, since it has no way of knowing that the data
> has already been read.

I'm not sure I understand the situation where this happens; can you
elaborate?  Are you saying that a Lisp program calls
accept-process-output and waits for a specific process to produce some
output, and meanwhile some timer runs a function that calls
accept-process-output with its 1st argument nil?  What timers do that?
I think this would be a bug in such timers.

Or are you saying that a Lisp program calls accept-process-output with
its 1st arg nil, and while it waits, a timer calls
accept-process-output for a specific process?

Or do you mean something else?

Thanks.


Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Matthias Dahl
Hello Eli,

Thanks for taking the time to review the issue and patches. :-)

On 25/10/17 16:53, Eli Zaretskii wrote:

> I'm not sure I understand the situation where this happens; can you
> elaborate?

Sure. Let's take the Magit issue [1] as an example:

When committing, Magit prepares a COMMIT_MSG buffer and does some
process magic of its own which is pretty much irrelevant for this.

At some point during that, while we are already in an instance of
wait_reading_process_output (due to sit_for), the post-command-hooks are
run.

And here things get interesting. Eventually flyspell-post-command-hook
is run, which executes flyspell-word synchronously. That basically
writes out a word that needs to be checked to the spellchecker process,
waits for the result via accept-process-output, and goes on. Of special
note here is that it specifies a wait_proc (the spellchecker process)
and no timeout whatsoever.

The output from the spellchecker is usually there instantaneously, so
that is actually unnoticeable, unless wait_reading_process_output, that
was invoked through that specific accept-process-output, decides to run
the timers.

And here comes the catch: At this point, usually the spellchecker output
is already available but not yet read. When the timers run, one of them
calls accept-process-output again which will read the entire available
output of the spellchecker process. Since there will be no more data on
that fd unless some interaction happens with the process, our original
call to accept-process-output/wait_reading_process_output will wait
endlessly for the data to become available (due to wait_proc being set
without a timeout).

Thus, it appears that Magit hangs while in truth, flyspell hangs waiting
for the spellchecker results to return that have already been read back.

The gist of it is: If we have an active wait_reading_process_output call
with a wait_proc set but no timeout that calls out to either timers or
filters, it is entirely possible that those directly or indirectly call
us again recursively, thus reading the output we are waiting for without
us ever noticing it, if no further output becomes available in addition
to what was read unnoticed... like it happens with flyspell.

That is what my patches fix: They simply add a bytes read metric to each
process structure that we can check for change at strategically relevant
points and decide if we got some data back that went unnoticed and break
out from wait_reading_process_output.

I know, flyspell should do its business asynchronously and also specify
a timeout since it is being run through hooks. Those are bugs in themselves.
But I also think that wait_reading_process_output violates its contract
and is buggy in this regard as well, since it should properly function
even if it calls out to filters or timers -- and it clearly does not.
I would wager that more hangs seen in the wild that were never debugged
could be attributed to this very bug.

I hope my rambling speech was somewhat helpful and clear and I could get
the problem described sufficiently without being too confusing. :-)

If there are any question marks left hanging over your head, please
don't hesitate to ask and I will try my best to clear them up -- but it
might end up being another longish mail, so be warned. ;)

Thanks,
Matthias

[1] https://github.com/magit/magit/issues/2915

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu



Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Eli Zaretskii
> Cc: [hidden email]
> From: Matthias Dahl <[hidden email]>
> Date: Thu, 26 Oct 2017 16:07:31 +0200
>
> When committing, Magit prepares a COMMIT_MSG buffer and does some
> process magic of its own which is pretty much irrelevant for this.
>
> At some point during that, while we are already in an instance of
> wait_reading_process_output (due to sit_for), the post-command-hooks are
> run.

AFAIK, post-command-hooks cannot be run while we are in sit-for, but I
guess this is not relevant to the rest of the description?

> And here things get interesting. Eventually flyspell-post-command-hook
> is run, which executes flyspell-word synchronously. That basically
> writes out a word that needs to be checked to the spellchecker process,
> waits for the result via accept-process-output, and goes on. Of special
> note here is that it specifies a wait_proc (the spellchecker process)
> and no timeout whatsoever.
>
> The output from the spellchecker is usually there instantaneously, so
> that is actually unnoticeable, unless wait_reading_process_output, that
> was invoked through that specific accept-process-output, decides to run
> the timers.
>
> And here comes the catch: At this point, usually the spellchecker output
> is already available but not yet read. When the timers run, one of them
> calls accept-process-output again which will read the entire available
> output of the spellchecker process.

I understand that this timer calls accept-process-output with its
argument nil, is that correct?  If so, isn't that a bug for a timer to
do that?  Doing that runs the risk of eating up output from some
subprocess for which the foreground Lisp program is waiting.

So please point out the timer that does this, because I think that
timer needs to be fixed.

> The gist of it is: If we have an active wait_reading_process_output call
> with a wait_proc set but no timeout that calls out to either timers or
> filters, it is entirely possible that those directly or indirectly call
> us again recursively, thus reading the output we are waiting for without
> us ever noticing it, if no further output becomes available in addition
> to what was read unnoticed... like it happens with flyspell.
>
> That is what my patches fix: They simply add a bytes read metric to each
> process structure that we can check for change at strategically relevant
> points and decide if we got some data back that went unnoticed and break
> out from wait_reading_process_output.

We already record the file descriptors on which we wait for process
output, see compute_non_keyboard_wait_mask.  Once
wait_reading_process_output exits, it clears these records.  So it
should be possible for us to prevent accept-process-output calls
issued by such runaway timers from waiting on the descriptors that are
already "taken": if, when we set the bits in the pselect mask, we find
that some of the descriptors are already watched by the same thread as
the current thread, we could exclude them from the pselect mask we are
setting up.  Wouldn't that be a better solution?  Because AFAIU, your
solution just avoids an infinite wait, but doesn't avoid losing the
process output, because it was read by the wrong Lisp code.  Right?
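
Roughly along these lines, just to sketch the idea (the helper names are
made up; the real bookkeeping would live next to
compute_non_keyboard_wait_mask):

  #include <sys/select.h>
  #include <stdbool.h>

  /* Hypothetical helpers standing in for the real bookkeeping.  */
  extern int max_desc;                                /* highest fd in use */
  extern bool fd_wants_process_output (int fd);       /* made-up name */
  extern bool fd_watched_by_current_thread (int fd);  /* made-up name */

  /* Build the pselect read mask, but leave out descriptors that an
     outer wait on this same thread is already waiting on, so that a
     nested accept-process-output cannot steal their output.  */
  static void
  compute_wait_mask_excluding_taken_fds (fd_set *mask)
  {
    FD_ZERO (mask);
    for (int fd = 0; fd <= max_desc; fd++)
      if (fd_wants_process_output (fd)
          && !fd_watched_by_current_thread (fd))
        FD_SET (fd, mask);
  }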

> But I also think that wait_reading_process_output violates its contract
> and is buggy in this regard as well, since it should properly function
> even if it calls out to filters or timers -- and it clearly does not.
> I would wager that more hangs seen in the wild that were never debugged
> could be attributed to this very bug.

I think the basic contract that is violated here is that the output
from a subprocess is "stolen" by an unrelated call to
accept-process-output, and we should prevent that if we can.

> If there are any question marks left hanging over your head, please
> don't hesitate to ask and I will try my best to clear them up -- but it
> might end up being another longish mail, so be warned. ;)

Well, I'd like to eyeball the timer which commits this crime.


Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Matthias Dahl
Hello Eli...

On 26/10/17 18:23, Eli Zaretskii wrote:

> AFAIK, post-command-hooks cannot be run while we are in sit-for, but I
> guess this is not relevant to the rest of the description?

This probably comes from server.el (server-visit-files) because Magit
uses emacsclient for some of its magic.

I have attached a backtrace, taken during the hang. Unfortunately it is
from an optimized build (would have needed to recompile just now, and I
am a bit in a hurry) but it at least shows the callstack (more or less)
nicely.

> I understand that this timer calls accept-process-output with its
> argument nil, is that correct?  If so, isn't that a bug for a timer to
> do that?  Doing that runs the risk of eating up output from some
> subprocess for which the foreground Lisp program is waiting.

I haven't actually checked which timer it is, to be quite honest since I
didn't think of it as a bug at all.

Correct me if I am wrong, but calling accept-process-output w/o arguments
is expected to be quite harmless and can be useful. If you specify a
specific process, you will most definitely wait at least as long as
it takes for that process to produce any output.

Nevertheless: If I am not completely mistaken, there is no data lost at
all. It is read and passed to the filter function which was registered
by the interested party -- otherwise the default filter will simply
append it to the buffer it belongs to.

The only thing that is lost is that it was ever read at all and thus
an endless wait begins.

> So please point out the timer that does this, because I think that
> timer needs to be fixed.

If you still need that, I will do some digging and try to find it.

> We already record the file descriptors on which we wait for process
> output, see compute_non_keyboard_wait_mask.  Once
> wait_reading_process_output exits, it clears these records.  So it
> should be possible for us to prevent accept-process-output calls
> issued by such runaway timers from waiting on the descriptors that are
> already "taken": if, when we set the bits in the pselect mask, we find
> that some of the descriptors are already watched by the same thread as
> the current thread, we could exclude them from the pselect mask we are
> setting up.  Wouldn't that be a better solution?  Because AFAIU, your
> solution just avoids an infinite wait, but doesn't avoid losing the
> process output, because it was read by the wrong Lisp code.  Right?

Hm... at the moment I don't see where data is lost with my solution.
Maybe I am being totally blind and making a fool out of myself but I
honestly don't see it.

What you suggest could be dangerous as well, depending on how it is
implemented and the circumstances. What fds get excluded in recursive
calls? Only wait_proc ones? Or every one that is watched somewhere up
in the callstack? Depending on what we do, we could end up with an
almost empty list that doesn't get "ready" as easily as one would
have expected by a naked accept-process-output call... and it could thus
potentially stall as well... worst-case, I know.

Just thinking out loud here. I would really need to check this, those
are just my initial thoughts.

> Well, I'd like to eyeball the timer which commits this crime.

If you still do, let me know and I will try to track it down...

So long,
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu

Attachment: emacs-bt.txt (13K)

Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Matthias Dahl
Hello Eli...

Just to make sure we are not waiting for each other and have a
deadlock here. :-)

Have you had any chance to look into this again? Or is there anything
you are waiting for from my side? If you have just been busy, no problem
at all.

Have a nice weekend... and thanks again!
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu



Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Eli Zaretskii
In reply to this post by Matthias Dahl
> Cc: [hidden email]
> From: Matthias Dahl <[hidden email]>
> Date: Thu, 26 Oct 2017 20:56:03 +0200
>
> > AFAIK, post-command-hooks cannot be run while we are in sit-for, but I
> > guess this is not relevant to the rest of the description?
>
> This probably comes from server.el (server-visit-files) because Magit
> uses emacsclient for some of its magic.

I see post-command-hook being run from server-visit-files, but I don't
see sit-for.  The sit_for (not sit-for) in your backtrace is called
from read_char, something that happens every time Emacs becomes idle.

> I have attached a backtrace, taken during the hang. Unfortunately it is
> from an optimized build (would have needed to recompile just now, and I
> am a bit in a hurry) but it at least shows the callstack (more or less)
> nicely.

It lacks the Lisp backtrace ("xbacktrace" can produce it), and the
fact that most arguments are either optimized out or complex data
types shown as "..." makes the backtrace much less useful.

> > I understand that this timer calls accept-process-output with its
> > argument nil, is that correct?  If so, isn't that a bug for a timer to
> > do that?  Doing that runs the risk of eating up output from some
> > subprocess for which the foreground Lisp program is waiting.
>
> I haven't actually checked which timer it is, to be quite honest since I
> didn't think of it as a bug at all.
>
> Correct me if I am wrong, but calling accept-process-output w/o arguments
> is expected to be quite harmless and can be useful. If you specify a
> specific process, you will most definitely wait at least as long as
> it takes for that process to produce any output.
>
> Nevertheless: If I am not completely mistaken, there is no data lost at
> all. It is read and passed to the filter function which was registered
> by the interested party -- otherwise the default filter will simply
> append it to the buffer it belongs to.
>
> The only thing that is lost is that it was ever read at all and thus
> an endless wait begins.

But if the wrong call to accept-process-output has read the process
output, it could have also processed it and delivered the results to
the wrong application, no?

> > So please point out the timer that does this, because I think that
> > timer needs to be fixed.
>
> If you still need that, I will do some digging and try to find it.

I think we need a thorough and detailed understanding of what's going
on in this case, before we can discuss the solutions, yes.  IME,
trying to fix a problem without a good understanding of what it is that
we are fixing tends to produce partial solutions at best, and new bugs
at worst.

So please reproduce this in an unoptimized build, and please also show
the Lisp backtrace in this scenario.  Then let's take it from there.

> > We already record the file descriptors on which we wait for process
> > output, see compute_non_keyboard_wait_mask.  Once
> > wait_reading_process_output exits, it clears these records.  So it
> > should be possible for us to prevent accept-process-output calls
> > issued by such runaway timers from waiting on the descriptors that are
> > already "taken": if, when we set the bits in the pselect mask, we find
> > that some of the descriptors are already watched by the same thread as
> > the current thread, we could exclude them from the pselect mask we are
> > setting up.  Wouldn't that be a better solution?  Because AFAIU, your
> > solution just avoids an infinite wait, but doesn't avoid losing the
> > process output, because it was read by the wrong Lisp code.  Right?
>
> Hm... at the moment I don't see where data is lost with my solution.
> Maybe I am being totally blind and making a fool out of myself but I
> honestly don't see it.

Maybe there is no loss, but I'm not really sure your proposal solves
the root cause, and without detailed understanding of what's exactly
going on, we have no way of discussing this rationally.

> > Well, I'd like to eyeball the timer which commits this crime.
>
> If you still do, let me know and I will try to track it down...

I do.  I believe we must understand this situation very well before we
reason about its solution.


Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Matthias Dahl
Hello Eli...

On 28/10/17 11:28, Eli Zaretskii wrote:

> It lacks the Lisp backtrace ("xbacktrace" can produce it), and the
> fact that most arguments are either optimized out or complex data
> types shown as "..." makes the backtrace much less useful.

Sorry, like I said, I was in a hurry; that was all I could provide at
the time. Attached you will find a full backtrace from a proper debug
build w/o optimization -- I usually only run those when really debugging
something; otherwise I keep debug symbols around for an optimized build,
which will at least give me a starting point and an idea when something
goes wrong.

> But if the wrong call to accept-process-output has read the process
> output, it could have also processed it and delivered the results to
> the wrong application, no?

Given the fact that a filter is registered for a given process, and
thus it is this filter that gets called whenever process output is ready
and needs to be processed, a timer or filter would have to replace that
filter with its own through set-process-filter for this to happen. And
that is something I would clearly consider a bug, since no filter or
timer should do something like that.

Naturally there is also the weird case when accept-process-output was
called from a hook by some package which expects that data and needs it
but doesn't consider that a timer could get called during that time and
the package itself has a timer setup that will also interact with that
very same process, trying to read data back from some interaction with
it. That will naturally fail as well... either way, with or without my
patch. And again, I would consider this a bug.

If a package interacts with its own processes, it should be careful and
consider that timers/filters get called while accept-process-output is
called and take that into account.

> I think we need a thorough and detailed understanding of what's going
> on in this case, before we can discuss the solutions, yes.  IME,
> trying to fix a problem without a good understanding of what it is that
> we are fixing tends to produce partial solutions at best, and new bugs
> at worst.

I fully agree with your opinion; I just thought I had already properly
explained in my lengthy mails what was going on behind the scenes and
how my (rather simple) solution fixes it.

> So please reproduce this in an unoptimized build, and please also show
> the Lisp backtrace in this scenario.  Then let's take it from there.

As I said earlier, you will find that attached. But keep in mind that
it will not be as helpful as you might hope since you will only see where
the hang occurs (I already stated that) but not why. At that point, the
timers have already run.

> I do.  I believe we must understand this situation very well before we
> reason about its solution.

One example is semantic. Its idle timers use semantic-throw-on-input
which calls accept-process-output without any arguments. Just as an
example.

So long,
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu


Attachment: emacs-bt-full.txt (122K)

Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Matthias Dahl
Hello Eli...

Very friendly ping. :-)

Is there any news on this issue, or anything else I can help with or
provide you with?

Thanks,
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu



Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Eli Zaretskii
> Cc: [hidden email]
> From: Matthias Dahl <[hidden email]>
> Date: Fri, 3 Nov 2017 09:52:39 +0100
>
> Very friendly ping. :-)

Thanks, I didn't forget.  I'll get to that soon.

> Is there any news on this issue, or anything else I can help with or
> provide you with?

I didn't yet have time to read your last message thoroughly, sorry.


Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Eli Zaretskii
In reply to this post by Matthias Dahl
> Cc: [hidden email]
> From: Matthias Dahl <[hidden email]>
> Date: Mon, 30 Oct 2017 10:48:05 +0100
>
> > But if the wrong call to accept-process-output has read the process
> > output, it could have also processed it and delivered the results to
> > the wrong application, no?
>
> Given the fact that a filter is registered for a given process, and
> thus it is this filter that gets called whenever process output is ready
> and needs to be processed, a timer or filter would have to replace that
> filter with its own through set-process-filter for this to happen. And
> that is something I would clearly consider a bug, since no filter or
> timer should do something like that.
>
> Naturally there is also the weird case when accept-process-output was
> called from a hook by some package which expects that data and needs it
> but doesn't consider that a timer could get called during that time and
> the package itself has a timer setup that will also interact with that
> very same process, trying to read data back from some interaction with
> it. That will naturally fail as well... either way, with or without my
> patch. And again, I would consider this a bug.

OK.  I think I'm okay with your patches, but IMO they need a minor
improvement: instead of setting got_some_output to an arbitrary value
of 1, the code should set it to the increment of the bytes read from
the sub-process.  This is what we do when we actually read from the
process, and I think this scenario should behave the same.  WDYT?

With that change, it's okay to commit this to the master branch.

Thanks.


Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Matthias Dahl
Hello Eli...

On 04/11/17 13:11, Eli Zaretskii wrote:

> OK.  I think I'm okay with your patches, but IMO they need a minor
> improvement: instead of setting got_some_output to an arbitrary value
> of 1, the code should set it to the increment of the bytes read from
> the sub-process.  This is what we do when we actually read from the
> process, and I think this scenario should behave the same.  WDYT?

I know what you mean -- and even though this was bugging me as well when
I implemented it, it is actually on purpose.

First of all, I wanted to make as few assumptions as possible about how
processes are used and be on the conservative/defensive side with this
solution, to make sure not to introduce new bugs or corner-cases.

In this particular case, I wanted to minimize any integer overflow
problems. Granted, depending on how many bits unsigned long has, it is
very (!) unlikely to cause any issues, but otoh that is usually how
bugs get introduced in the first place. So if I had calculated the
difference to set got_some_output properly, I would have had to take
the integer overflow problem into account.

The current solution has exactly one corner-case which is, imho,
extremely unlikely: You would have to run wait_reading_process_output
for a full wrap-around to exactly the same byte count that it was
started with in the first place. It is safe to assume this will
practically never happen.

Also, no user of wait_reading_process_output uses the return value
in such a way at all -- which I would also consider a bug since it
is not what is officially documented for the return value in the
first place. It simply says positive, negative or zero... without
stating that it will actually return the bytes read.

So, imho, I would lean toward the current solution since it is as
simple as possible with as few corner-cases as possible. If you
would like to keep the status quo though and to always return how
many bytes were actually read, then I suggest a different approach
altogether. Instead of keeping track of how many bytes were read for
the total lifetime of a process, we could track only how many bytes
were read for an in-flight wait. That would require a byte counter
and a counter for how many waits are currently in-flight, so the
last one (or first one) could zero the byte counter again. That
would make things quite a bit more complicated, imho... and without
having a proper gain to show for it.

> With that change, it's okay to commit this to the master branch.

May I suggest, even though it is rather late, to also consider this
for the emacs-26 branch? Since it is not a new feature but a bug fix
to a problem people are actually running into (see Magit), I think it
would be justified. Personally, I have run a patched emacs-26 all the
time here without any problems.

Thanks again for taking the time.

So long,
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu



Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Eli Zaretskii
> Cc: [hidden email]
> From: Matthias Dahl <[hidden email]>
> Date: Mon, 6 Nov 2017 15:15:48 +0100
>
> In this particular case, I wanted to minimize any integer overflow
> problems. Granted, depending on how many bits unsigned long has, it is
> very (!) unlikely to cause any issues, but otoh that is usually how
> bugs get introduced in the first place. So if I had calculated the
> difference to set got_some_output properly, I would have had to take
> the integer overflow problem into account.

Sorry, I don't understand.  When we call emacs_read in other cases in
that function, the result is an ssize_t, a signed type of the same
width as size_t.  So why would you need to handle overflow, when
reading via emacs_read doesn't?

And even if the result could overflow, we have the likes of
INT_ADD_WRAPV to make handling of that easier and safer.

So I don't see why we should give up in this case.

> Also, no user of wait_reading_process_output uses the return value
> in such a way at all -- which I would also consider a bug since it
> is not what is officially documented for the return value in the
> first place. It simply says positive, negative or zero... without
> stating that it will actually return the bytes read.

I don't agree with this line of reasoning.  We are not developing a
library whose users will be familiar only with the formal interface
definition and nothing else.  We are talking about code that is being
read, studied, and used by Emacs hack^H^H^H^Hdevelopers all the time.
Having a non-trivial function whose behavior is hard to describe and
remember accurately makes using it error-prone, especially if the
deviant behavior happens only in some corner use case that is hard to
reproduce.  In many cases, the information about these corner cases is
lost soon after the code is introduced, and since we have no formal
requirements for what functions like this one should do, and no good
coverage by the test suite, future changes are impossible to test
reliably, in order to make sure they don't break in those corner
cases.

So the situation where "the function FOO returns the number of bytes
read from the subprocess, except when this and that happens, in which
case it returns just 1" -- such situation should be avoided at all
costs, IME.  And in this case, the cost is actually quite low, unless
I'm missing something.

> If you would like to keep the status quo though and to always return
> how many bytes were actually read, then I suggest a different
> approach altogether. Instead of keeping track of how many bytes were
> read for the total lifetime of a process, we could track only how
> many bytes were read for an in-flight wait. That would require a byte
> counter and a counter for how many waits are currently in-flight, so
> the last one (or first one) could zero the byte counter again.

Sorry, you lost me half-way through this description.  I actually
meant to return only the increment of how many bytes were read since
the last call, but I don't see why you'd need more than one counter
that will be reset to zero once its value is consumed.

> > With that change, it's okay to commit this to the master branch.
>
> May I suggest, even though it is rather late, to also consider this
> for the emacs-26 branch?

It's a bug that has been there "forever", and the fix is too risky to
have it in the middle of a pretest.  I don't think we understand well
enough all of its implications, and reading from subprocesses is a
very central and very delicate part of Emacs.  Sorry.

Thanks for working on this.


Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Paul Eggert
On 11/06/2017 08:34 AM, Eli Zaretskii wrote:
> When we call emacs_read in other cases in
> that function, the result is an ssize_t, a signed type of the same
> width as size_t.

As a minor point, POSIX does not require that ssize_t be the same width
as size_t, or that ptrdiff_t be the same width as size_t. Emacs has run
(and as far as I know, still would run) on unusual platforms where
ssize_t is narrower than size_t.



Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Eli Zaretskii
> Cc: [hidden email]
> From: Paul Eggert <[hidden email]>
> Date: Mon, 6 Nov 2017 10:24:42 -0800
>
> On 11/06/2017 08:34 AM, Eli Zaretskii wrote:
> > When we call emacs_read in other cases in
> > that function, the result is an ssize_t, a signed type of the same
> > width as size_t.
>
> As a minor point, POSIX does not require that ssize_t be the same width
> as size_t, or that ptrdiff_t be the same width as size_t. Emacs has run
> (and as far as I know, still would run) on unusual platforms where
> ssize_t is narrower than size_t.

Yes, I know.  We could replace ssize_t with ptrdiff_t if needed.


Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Matthias Dahl
In reply to this post by Eli Zaretskii
Hello Eli...

On 06/11/17 17:34, Eli Zaretskii wrote:

> Sorry, I don't understand.  When we call emacs_read in other cases in
> that function, the result is an ssize_t, a signed type of the same
> width as size_t.  So why would you need to handle overflow, when
> reading via emacs_read doesn't?
>
> And even if the result could overflow, we have the likes of
> INT_ADD_WRAPV to make handling of that easier and safer.
>
> So I don't see why we should give up in this case.

The amount of output read (in total) from a process is tracked via the
following member variable of Lisp_Process:

  /* Byte-count for process output read from `infd'.  */
  unsigned long infd_num_bytes_read;

That gives us at least 4 GiB worth of data that can be read back and
tracked before it wraps around to 0.

When wait_reading_process_output starts, the following happens:

  unsigned long initial_wait_proc_num_bytes_read = (wait_proc) ?
                              wait_proc->infd_num_bytes_read : 0;

We save the current counter and compare it later on to see if data has
been read from that process without us noticing it, e.g.:

 if (wait_proc
     && wait_proc->infd_num_bytes_read !=
        initial_wait_proc_num_bytes_read)
 {
   got_some_output = 1;
   break;
 }

What might have happened now is that wait_proc->infd_num_bytes_read was
already at the edge of the maximum representable value range and wrapped
around during some recursive calls to wait_reading_process_output while
we still have a value in initial_wait_proc_num_bytes_read that has not
wrapped around. E.g.:

  wait_proc->infd_num_bytes_read = (2^32)-10
  initial_wait_proc_num_bytes_read = wait_proc->infd_num_bytes_read

  ... recursive calls happen, output is read ...

  wait_proc->infd_num_bytes_read = 256
  initial_wait_proc_num_bytes_read = (2^32)-10

  // this will be wrong
  bytes_read = wait_proc->infd_num_bytes_read -
               initial_wait_proc_num_bytes_read

That causes a problem -- even though this case is unlikely, it is still
possible and I don't think it should be dismissed.

To get how many bytes have been read for this current call, we have to
calculate the difference of initial_wait_proc_num_bytes_read and
wait_proc->infd_num_bytes_read. That naturally fails if there has been
a wrap around in wait_proc->infd_num_bytes_read.

gnulib's integer overflow helpers won't help in this case, at all.

That is the whole reason why I did not set got_some_output to the proper
byte count but simply to 1, to avoid this integer overflow situation
which, if dealt with properly, would have made things a bit more
complex and difficult.

Naturally you can still do a greater than comparison and choose the
operands (more or less) properly that way. Since that calculation
happens at two different spots in the code, and since I perceived the
bytes_read return value as unnecessary anyway (and did not really like
the whole idea), I forwent doing this and went with the simple return
value of 1.

But maybe this is the simplest of all solutions, keeping KISS in mind.

> I don't agree with this line of reasoning.  We are not developing a
> library whose users will be familiar only with the formal interface
> definition and nothing else.  We are talking about code that is being
> read, studied, and used by Emacs hack^H^H^H^Hdevelopers all the time.

Even the C interface? It is my impression that the C parts of Emacs are
usually what is being avoided at all costs. Besides that, I treated the
C parts like implementation details and kept strictly to the documented
interfaces, expecting no user to ever touch this.

Nevertheless, I agree, that consistency is important and I should have
made that more of a priority in this case, given what you said.

> Having a non-trivial function whose behavior is hard to describe and
> remember accurately makes using it error-prone, especially if the
> deviant behavior happens only in some corner use case that is hard to
> reproduce.

That is why it is important that the documentation and implementation of
a function actually align. In this case, no user should actually expect
the function to return the number of bytes read, just a positive or
negative number or zero... just like it is stated.

But, again, that might be a matter of perspective. For me, what is
documented is like a contract.

> Sorry, you lost me half-way through this description.  I actually
> meant to return only the increment of how many bytes were read since
> the last call, but I don't see why you'd need more than one counter
> that will be reset to zero once its value is consumed.

My suggestion was to have an in-flight bytes read counter, that will
only track the number of bytes read while /any/ wait for that specific
process is active.

The way to properly convey that is through another variable that will
be non-zero if we have anything waiting for that process's output.

Eventually that in-flight bytes read counter needs to be reset to zero
again. This can only happen, if we have no other parties waiting for
this process's output higher up in the call chain, as otherwise we would
lose information and make it difficult/impossible for them to detect
that something was actually read.

So, that "other variable" that is non-zero if we have anything waiting
for that process's output needs to actually track the number of parties
that are waiting, so we can actually know when we are the last one and
can safely reset the in-flight bytes read counter.
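
Spelled out as a sketch (the field names are hypothetical and purely
illustrative; none of this is in the attached patches):

  /* Hypothetical additions to struct Lisp_Process:  */
  unsigned long inflight_bytes_read;  /* bytes read while a wait is active */
  int num_active_waiters;             /* nested waits currently in flight  */

  /* On entry to wait_reading_process_output:  */
  if (wait_proc)
    wait_proc->num_active_waiters++;

  /* Whenever output from wait_proc is read, anywhere in the nest:  */
  wait_proc->inflight_bytes_read += nread;

  /* On the way out, the last remaining waiter resets the counter:  */
  if (wait_proc && --wait_proc->num_active_waiters == 0)
    wait_proc->inflight_bytes_read = 0;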

But like I already stated earlier, maybe simply checking if an overflow
happened through a greater than comparison is the simplest approach,
since this solution introduces even more complexity and corner-cases,
imho.

> It's a bug that has been there "forever", and the fix is too risky to
> have it in the middle of a pretest.  I don't think we understand well
> enough all of its implications, and reading from subprocesses is a
> very central and very delicate part of Emacs.  Sorry.

Ack. Even though this means that users running into this right now will
have to wait quite a while for a fix in a stable release, which is a
pity imho.

Thanks for all the feedback,
Matthias

PS. Sorry for the huge wall of text. :-( I just noticed how big this
    mail has gotten... yikes.

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu


Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Eli Zaretskii
> From: Matthias Dahl <[hidden email]>
> Cc: [hidden email]
> Date: Tue, 7 Nov 2017 15:18:02 +0100
>
> The amount of output read (in total) from a process is tracked via the
> following member variable of Lisp_Process:
>
>   /* Byte-count for process output read from `infd'.  */
>   unsigned long infd_num_bytes_read;
>
> That gives us at least 4 GiB worth of data that can be read back and
> tracked before it wraps around to 0.

If we want, we can make it EMACS_INT, which will give us as much as
Emacs can gobble.

> What might have happened now is that wait_proc->infd_num_bytes_read was
> already at the edge of the maximum representable value range and wrapped
> around during some recursive calls to wait_reading_process_output while
> we still have a value in initial_wait_proc_num_bytes_read that has not
> wrapped around. E.g.:
>
>   wait_proc->infd_num_bytes_read = (2^32)-10
>   initial_wait_proc_num_bytes_read = wait_proc->infd_num_bytes_read
>
>   ... recursive calls happen, output is read ...
>
>   wait_proc->infd_num_bytes_read = 256
>   initial_wait_proc_num_bytes_read = (2^32)-10
>
>   // this will be wrong
>   bytes_read = wait_proc->infd_num_bytes_read -
>                initial_wait_proc_num_bytes_read
>
> That causes a problem -- even though this case is unlikely, it is still
> possible and I don't think it should be dismissed.

That should be very rare (barring application-level bugs), and in any
case it is no worse than just returning 1.  So I don't think we should
give up better support of 90% or even 99% of use cases due to the
other 10% or 1%.

> gnulib's integer overflow helpers won't help in this case, at all.

Those helpers are meant to avoid the overflow itself, thus preventing
"undefined behavior" when a signed integer overflows.  They cannot,
and shouldn't be expected to, fix the reason for the overflow -- that
cannot be helped when it happens.
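
For example (a generic illustration of the gnulib macro, not code from
these patches):

  #include <intprops.h>   /* gnulib; bundled with Emacs in lib/ */

  /* INT_ADD_WRAPV stores the wrapped-around sum in its third argument
     and returns true iff the mathematical result did not fit, so the
     caller can detect the overflow instead of tripping over undefined
     behavior on signed types.  */
  ptrdiff_t total = 0;        /* running byte count (example value) */
  ptrdiff_t nread = 4096;     /* bytes just read (example value)    */
  ptrdiff_t sum;
  if (INT_ADD_WRAPV (total, nread, &sum))
    {
      /* Overflow: sum holds the wrapped value; handle as appropriate.  */
    }
  total = sum;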

> > I don't agree with this line of reasoning.  We are not developing a
> > library whose users will be familiar only with the formal interface
> > definition and nothing else.  We are talking about code that is being
> > read, studied, and used by Emacs hack^H^H^H^Hdevelopers all the time.
>
> Even the C interface?

Especially in the C interface.

> It is my impression that the C parts of Emacs are usually what is
> being avoided at all costs.

It depends on who we are talking about.  Some of us (yours truly
included) work on the C level quite frequently.  Any relatively
radical new feature needs at least some C infrastructure.  I had my
share of writing code based on what I read in the function commentary
and some general common sense and familiarity with the internals, only
to find out later that they were incomplete or even prone to wrong
interpretation.  I'd like to minimize such occurrences as much as I
can.

> Besides that, I treated the C parts like implementation details and
> kept strictly to the documented interfaces, expecting no user to
> ever touch this.

But that's exactly the problem: we _don't_have_ documented interfaces
on the C level, at least not of the quality we would like to have.  Do
you really think that the commentary at the beginning of
wait_reading_process_output describes what the function does anywhere
close to completeness?  Far from it.

> > Having a non-trivial function whose behavior is hard to describe and
> > remember accurately makes using it error-prone, especially if the
> > deviant behavior happens only in some corner use case that is hard to
> > reproduce.
>
> That is why it is important that the documentation and implementation of
> a function actually align.

Of course.  But it's hard to keep them in sync, partly because it's
incomplete to begin with, and that makes it not easy to recognize when
the commentary needs to change due to some code changes.  And the
result is that I quite frequently find comments that are blatantly
incorrect: mention parameters no longer there or fail to mention new
ones that were added, for example.

> In this case, no user should actually expect the function to return
> the number of bytes read, just a positive or negative number or
> zero... just like it is stated.

I'd actually say that the commentary is incomplete because it doesn't
say what that "positive" value means.  Returning an opaque value from
a function is not useful, unless you can pass it later to the same
function to get it to do something special.

> But, again, that might be a matter of perspective. For me, what is
> documented is like a contract.

IME with Emacs code, we are a long way from interfaces that are
documented on a level that could be considered a contract.  For
starters, we don't even say clearly enough which functions can throw
to top level, without which a contract is not really a contract, don't
you agree?

> > Sorry, you lost me half-way through this description.  I actually
> > meant to return only the increment of how many bytes were read since
> > the last call, but I don't see why you'd need more than one counter
> > that will be reset to zero once its value is consumed.
>
> My suggestion was to have an in-flight bytes read counter, that will
> only track the number of bytes read while /any/ wait for that specific
> process is active.

Ah, I see.  I think this is an unnecessary complication.  The risk of
overflow, while it isn't zero, is too low to justify such measures and
the resulting complexity, IMO.

> > It's a bug that has been there "forever", and the fix is too risky to
> > have it in the middle of a pretest.  I don't think we understand well
> > enough all of its implications, and reading from subprocesses is a
> > very central and very delicate part of Emacs.  Sorry.
>
> Ack. Even though this means that users running into this right now will
> have to wait quite a while for a fix in a stable release, which is a
> pity imho.

We can publish the patch here, and then people who really want this
fixed ASAP can simply apply the patch.


Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Stefan Monnier
In reply to this post by Matthias Dahl
>   /* Byte-count for process output read from `infd'.  */
>   unsigned long infd_num_bytes_read;

BTW, we could count the number of (non-empty) "chunks" rather than the
number of bytes.

>  {
>    got_some_output = 1;
>    break;
>  }

Please try to use `true' and `false' for boolean values (there's still
a lot of code in src/*.c which uses 0 and 1, admittedly, but this should
slowly disappear over time).


        Stefan



Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Matthias Dahl
In reply to this post by Eli Zaretskii
Hello Eli...

Attached you will find the revised patches which now properly
set the number of bytes read.

Sorry for the delay, but real life intervened.

Turns out, things were _a lot_ easier than I thought since the
C standard actually does have well-defined rules for computation
on unsigned types, so that no overflow happens. And since it is
modular arithmetic, it perfectly fits our use-case. :-) Sometimes
the simplest solution is right in front of your eyes and you are
overcomplicating things.
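
In other words (a toy illustration, assuming a 32-bit unsigned long so
that the counter actually wraps):

  /* C defines unsigned arithmetic modulo 2^N, so the difference is
     exact even across a wrap-around.  Reusing the numbers from my
     earlier example:  */
  unsigned long initial = 4294967286UL;      /* (2^32) - 10            */
  unsigned long current = 256UL;             /* after the wrap-around  */
  unsigned long delta   = current - initial; /* == 266, i.e. 10 + 256  */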

So sorry for the trouble. Had I known this earlier, I would not
have skipped that part -- even though I had the best intentions,
obviously.

On with the rest of your mail...

On 07/11/17 17:40, Eli Zaretskii wrote:

> If we want, we can make it EMACS_INT, which will give us as much as
> Emacs can gobble.

Thanks for the suggestion, I changed it to EMACS_UINT -- we need an
unsigned type.

> I had my
> share of writing code based on what I read in the function commentary
> and some general common sense and familiarity with the internals, only
> to find out later that they were incomplete or even prone to wrong
> interpretation.  I'd like to minimize such occurrences as much as I
> can.

I guess I was overly idealistic. Sorry for that.

Nevertheless, it would be nice to improve the situation and make sure
that future changes to the codebase also take the commentaries into
account... and hold them to the same high standard as the
implementation itself. Just my two cents...

> But that's exactly the problem: we _don't_have_ documented interfaces
> on the C level, at least not of the quality we would like to have.  Do
> you really think that the commentary at the beginning of
> wait_reading_process_output describes what the function does anywhere
> close to completeness?  Far from it.

I know. OTOH, wait_reading_process_output might be a bad example as it
really could benefit greatly from a refactor since it is way too big,
complex and unclear. And describing its function properly and concisely
could prove quite difficult.

But generally, I pretty much agree on all that you said in the rest of
your mail.

> I'd actually say that the commentary is incomplete because it doesn't
> say what that "positive" value means.  Returning an opaque value from
> a function is not useful, unless you can pass it later to the same
> function to get it to do something special.

I agree. It also leads the user to make assumptions about the value
that might or might not turn out true for all cases... without properly
reading the code.

Thanks again for all your great feedback and review. I hope we now have
something that can be applied to the master branch. If anything comes up
in terms of bugs or problems that might be related, please poke me if I
miss it on the list and I will (try to) fix it.

Have a nice weekend,
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu

Attachment: 0001-Add-process-output-read-accounting.patch (1K)
Attachment: 0002-src-process.c-wait_reading_process_output-Fix-wait_p.patch (3K)

Re: wait_reading_process_ouput hangs in certain cases (w/ patches)

Matthias Dahl
In reply to this post by Stefan Monnier
Hello Stefan...

On 07/11/17 18:23, Stefan Monnier wrote:

> BTW, we could count the number of (non-empty) "chunks" rather than the
> number of bytes.

Thanks for the suggestion. In this case, I think just counting the bytes
keeps things simple and makes this value easier to understand when doing
a debugging session that benefits from this information.

Also, counting the chunks wouldn't really improve the situation much, if
I am not mistaken? It would just take a bit longer for a wrap-around to
happen.

>
>>  {
>>    got_some_output = 1;
>>    break;
>>  }
>
> Please try to use `true' and `false' for boolean values (there's still
> a lot of code in src/*.c which uses 0 and 1, admittedly, but this should
> slowly disappear over time).

That is actually legacy code right there. I did not introduce that
variable, nor is it a boolean. Actually it is an integer in disguise that
stores how many bytes have been read. ;-) It is very unfortunate naming,
I agree.

The revised patches don't set it to 1 but do calculate the proper value,
so this is a non-issue now.

Thanks for taking the time and have a nice weekend,
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu

