Hello all,
recursively calling accept-process-output, with the first call waiting for the output of a specific process, can hang Emacs (as in: it waits forever, though it can of course be canceled) even though it should return, since data was read.

This is actually not all that uncommon. One example is a hang seen when Magit opens its COMMIT_MSG buffer, reported here [1]. I have myself run into this problem continuously, which is why I started to debug it in the first place.

The hang with Magit happens in flyspell.el, which waits for output from its spellchecker process through accept-process-output and specifies that specific process as wait_proc. Now, depending on timing (a race), wait_reading_process_output can run the pending timers... which in turn can call accept-process-output again. This almost always leads to the spellchecker output being read back in full, so there is no more data left to read. Thus the original accept-process-output, which called wait_reading_process_output, will wait forever for the data to become available, since it has no way of knowing that it has already been read.

Naturally one could argue that a timeout should be specified when calling accept-process-output. Nevertheless, I still think this is a breach of contract: the caller expects accept-process-output to return, as promised, when data has been read from the specified process, and that is clearly not always the case -- it can hang forever, depending on timing and the specifics of the data being read.

The attached patches fix this by introducing process output read accounting -- simply counting the bytes read per process -- and using that data at strategic points in wait_reading_process_output to check for data having been read while we handed control over to timers and/or filters.
I haven't seen any ill side-effects in my tests, and it clearly fixes the problem seen in [1] as well as, I would wager, quite a few others that users of all kinds of setups probably hit -- hangs that seemed unpredictable and mysterious and were never debugged.

As a side-note: I decided against an artificial metric and went with simply counting the bytes read, since this comes in handy when debugging, to see how much data was read from a process during specific time intervals.

Also, this still leaves the possibility that wait_reading_process_output could wait forever when called without a wait_proc and without a timeout set. This could be mitigated as well by some sort of tick counter that only increases when no wait_proc was specified and data from processes was processed. I decided against implementing that for now since, imho, the chances of that happening are marginal, if present at all. OTOH, the semantics in that case are not all that clear, and it would further add complexity to an already rather unhealthy function. I am naturally open to other opinions, and to implementing this as well if requested. :-)

Any suggestions and/or comments are very welcome, as always. Thanks for the patience to read this longish post. :-)

So long,
Matthias

[1] https://github.com/magit/magit/issues/2915

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
> From: Matthias Dahl <[hidden email]>
> Date: Tue, 24 Oct 2017 20:52:20 +0200
>
> recursively calling accept-process-output with the first call waiting
> for the output of a specific process, can hang Emacs (as-in: it is
> waiting forever but can be canceled of course) even though it should
> not be the case since data was read.
>
> This is actually not all that uncommon. One example of this is a hang
> seen with Magit opening its COMMIT_MSG buffer, reported here [1]. I've
> myself run into this problem continuously which is why I started to
> debug it in the first place.
>
> The hang with Magit happens in flyspell.el which waits for output from
> its spellchecker process through accept-process-output and specifies
> that specific process as wait_proc. Now depending on timing (race),
> wait_reading_process_output can call the pending timers... which in
> turn can call accept-process-output again. This almost always leads
> to the spellchecker output being read back in full, so there is no
> more data left to be read. Thus the original accept-process-output,
> which called wait_reading_process_output, will wait for the data to
> become available forever since it has no way to know that those have
> already been read.

I'm not sure I understand the situation where this happens; can you elaborate?

Are you saying that a Lisp program calls accept-process-output and waits for a specific process to produce some output, and meanwhile some timer runs a function that calls accept-process-output with its 1st argument nil? What timers do that? I think this would be a bug in such timers.

Or are you saying that a Lisp program calls accept-process-output with its 1st arg nil, and while it waits, a timer calls accept-process-output for a specific process?

Or do you mean something else?

Thanks.
Hello Eli,
Thanks for taking the time to review the issue and patches. :-)

On 25/10/17 16:53, Eli Zaretskii wrote:
> I'm not sure I understand the situation where this happens; can you
> elaborate?

Sure. Let's take the Magit issue [1] as an example:

When committing, Magit prepares a COMMIT_MSG buffer and does some process magic of its own, which is pretty much irrelevant here.

At some point during that, while we are already in an instance of wait_reading_process_output (due to sit_for), the post-command-hooks are run. And here things get interesting: eventually flyspell-post-command-hook is run, which executes flyspell-word synchronously. That basically writes out a word that needs to be checked to the spellchecker process, waits for the results via accept-process-output, and goes on. Of special note here is that it specifies a wait_proc (the spellchecker process) and no timeout whatsoever.

The output from the spellchecker is usually there instantaneously, so the wait is actually unnoticeable -- unless wait_reading_process_output, invoked through that specific accept-process-output call, decides to run the timers.

And here comes the catch: at this point, the spellchecker output is usually already available but not yet read. When the timers run, one of them calls accept-process-output again, which reads the entire available output of the spellchecker process. Since there will be no more data on that fd unless some further interaction happens with the process, our original call to accept-process-output/wait_reading_process_output will wait endlessly for data to become available (due to wait_proc being set without a timeout).

Thus it appears that Magit hangs, while in truth flyspell hangs, waiting for spellchecker results that have already been read back.
The gist of it is: if we have an active wait_reading_process_output call with a wait_proc set but no timeout, and it calls out to either timers or filters, it is entirely possible that those directly or indirectly call us again recursively, thus reading the output we are waiting for without us ever noticing -- if no further output becomes available in addition to what was read unnoticed... like it happens with flyspell.

That is what my patches fix: they simply add a bytes-read metric to each process structure, which we can check for change at strategically relevant points to decide whether we got some data back that went unnoticed, and break out of wait_reading_process_output.

I know, flyspell should do its business asynchronously and also specify a timeout, since it is being run through hooks. Those are bugs in themselves. But I also think that wait_reading_process_output violates its contract and is buggy in this regard as well, since it should function properly even when it calls out to filters or timers -- and it clearly does not. I would wager that more hangs seen in the wild that were never debugged could be attributed to this very bug.

I hope my rambling speech was somewhat helpful and clear, and that I described the problem sufficiently without being too confusing. :-) If there are any question marks left hanging over your head, please don't hesitate to ask and I will try my best to clear them up -- but it might end up being another longish mail, so be warned. ;)

Thanks,
Matthias

[1] https://github.com/magit/magit/issues/2915

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
> Cc: [hidden email]
> From: Matthias Dahl <[hidden email]>
> Date: Thu, 26 Oct 2017 16:07:31 +0200
>
> When committing, Magit prepares a COMMIT_MSG buffer and does some
> process magic of its own which is pretty much irrelevant for this.
>
> At some point during that, while we are already in an instance of
> wait_reading_process_output (due to sit_for), the post-command-hooks are
> run.

AFAIK, post-command-hooks cannot be run while we are in sit-for, but I guess this is not relevant to the rest of the description?

> And here things get interesting. Eventually flyspell-post-command-hook
> is run which executes flyspell-word synchronously. That basically does
> write out a word that needs to be checked to the spellchecker process,
> waits for the results from stdin via accept-process-output and goes on.
> Of special note here is that it a) specifies a wait_proc (spellchecker
> process) and no timeout or whatsoever.
>
> The output from the spellchecker is usually there instantaneously, so
> that is actually unnoticeable, unless wait_reading_process_output, that
> was invoked through that specific accept-process-output, decides to run
> the timers.
>
> And here comes the catch: At this point, usually the spellchecker output
> is already available but not yet read. When the timers run, one of them
> calls accept-process-output again which will read the entire available
> output of the spellchecker process.

I understand that this timer calls accept-process-output with its argument nil, is that correct? If so, isn't that a bug for a timer to do that? Doing that runs the risk of eating up output from some subprocess for which the foreground Lisp program is waiting.

So please point out the timer that does this, because I think that timer needs to be fixed.
> The gist of it is: If we have an active wait_reading_process_output call
> with a wait_proc set but no timeout that calls out to either timers or
> filters, it is entirely possible that those directly or indirectly call
> us again recursively, thus reading the output we are waiting for without
> us ever noticing it, if no further output becomes available in addition
> to what was read unnoticed... like it happens with flyspell.
>
> That is what my patches fix: They simply add a bytes read metric to each
> process structure that we can check for change at strategically relevant
> points and decide if we got some data back that went unnoticed and break
> out from wait_reading_process_output.

We already record the file descriptors on which we wait for process output, see compute_non_keyboard_wait_mask. Once wait_reading_process_output exits, it clears these records. So it should be possible for us to prevent accept-process-output calls issued by such runaway timers from waiting on the descriptors that are already "taken": if, when we set the bits in the pselect mask, we find that some of the descriptors are already watched by the same thread as the current thread, we could exclude them from the pselect mask we are setting up.

Wouldn't that be a better solution? Because AFAIU, your solution just avoids an infinite wait, but doesn't avoid losing the process output, because it was read by the wrong Lisp code. Right?

> But I also think that wait_reading_process_output violates its contract
> and is buggy in this regard as well, since it should properly function
> even if it calls out to filters or timers -- and it clearly does not and
> I would wager more hangs seen in the wild that weren't debugged, could
> be attributed to this very bug.

I think the basic contract that is violated here is that the output from a subprocess is "stolen" by an unrelated call to accept-process-output, and we should prevent that if we can.
> If there are any question marks left hanging over your head, please
> don't hesitate to ask and I will try my best to clear them up -- but it
> might end up being another longish mail, so be warned. ;)

Well, I'd like to eyeball the timer which commits this crime.
Hello Eli...
On 26/10/17 18:23, Eli Zaretskii wrote:
> AFAIK, post-command-hooks cannot be run while we are in sit-for, but I
> guess this is not relevant to the rest of the description?

This probably comes from server.el (server-visit-files), because Magit uses emacsclient for some of its magic.

I have attached a backtrace, taken during the hang. Unfortunately it is from an optimized build (I would have needed to recompile just now, and I am a bit in a hurry), but it at least shows the callstack (more or less) nicely.

> I understand that this timer calls accept-process-output with its
> argument nil, is that correct? If so, isn't that a bug for a timer to
> do that? Doing that runs the risk of eating up output from some
> subprocess for which the foreground Lisp program is waiting.

I haven't actually checked which timer it is, to be quite honest, since I didn't think of it as a bug at all.

Correct me if I am wrong, but calling accept-process-output w/o arguments is expected to be quite harmless and can be useful. If you specify a specific process, you will most definitely wait at least as long as it takes for that process to produce any output.

Nevertheless: if I am not completely mistaken, there is no data lost at all. It is read and passed to the filter function which was registered by the interested party -- otherwise the default filter will simply append it to the buffer it belongs to. The only thing that is lost is the knowledge that it was ever read at all, and thus an endless wait begins.

> So please point out the timer that does this, because I think that
> timer needs to be fixed.

If you still need that, I will do some digging and try to find it.

> We already record the file descriptors on which we wait for process
> output, see compute_non_keyboard_wait_mask. Once
> wait_reading_process_output exits, it clears these records. So it
> should be possible for us to prevent accept-process-output calls
> issued by such runaway timers from waiting on the descriptors that are
> already "taken": if, when we set the bits in the pselect mask, we find
> that some of the descriptors are already watched by the same thread as
> the current thread, we could exclude them from the pselect mask we are
> setting up. Wouldn't that be a better solution? Because AFAIU, your
> solution just avoids an infinite wait, but doesn't avoid losing the
> process output, because it was read by the wrong Lisp code. Right?

Hm... at the moment I don't see where data is lost with my solution. Maybe I am being totally blind and making a fool of myself, but I honestly don't see it.

What you suggest could be dangerous as well, depending on how it is implemented and the circumstances: which fds get excluded in recursive calls? Only wait_proc ones? Or every one that is watched somewhere up in the callstack? Depending on what we do, we could end up with an almost empty list that doesn't become "ready" as easily as one would expect from a naked accept-process-output call... and it could thus potentially stall as well. Worst-case, I know; just thinking out loud here. I would really need to check this; those are just my initial thoughts.

> Well, I'd like to eyeball the timer which commits this crime.

If you still do, let me know and I will try to track it down...

So long,
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
Hello Eli...
Just to make sure we are not waiting for each other and have a deadlock here. :-)

Have you had any chance to look into this again? Or is there anything you are waiting for from my side? If you have just been busy, no problem at all.

Have a nice weekend... and thanks again!
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
> Cc: [hidden email]
> From: Matthias Dahl <[hidden email]>
> Date: Thu, 26 Oct 2017 20:56:03 +0200
>
> > AFAIK, post-command-hooks cannot be run while we are in sit-for, but I
> > guess this is not relevant to the rest of the description?
>
> This probably comes from server.el (server-visit-files) because Magit
> uses emacsclient for some of its magic.

I see post-command-hook being run from server-visit-files, but I don't see sit-for. The sit_for (not sit-for) in your backtrace is called from read_char, something that happens every time Emacs becomes idle.

> I have attached a backtrace, taken during the hang. Unfortunately it is
> from a optimized build (would have needed to recompile just now, and I
> am a bit in a hurry) but it at least shows the callstack (more or less)
> nicely.

It lacks the Lisp backtrace ("xbacktrace" can produce it), and the fact that most arguments are either optimized out or complex data types shown as "..." makes the backtrace much less useful.

> > I understand that this timer calls accept-process-output with its
> > argument nil, is that correct? If so, isn't that a bug for a timer to
> > do that? Doing that runs the risk of eating up output from some
> > subprocess for which the foreground Lisp program is waiting.
>
> I haven't actually checked which timer it is, to be quite honest since I
> didn't think of it as a bug at all.
>
> Correct me if I am wrong, calling accept-process-output w/o arguments
> is expected to be quite harmless and can be useful. If you specify a
> specific process, you will most definitely wait at least as long as
> it takes for that process to produce any output.
>
> Nevertheless: If am not completely mistaken, there is no data lost at
> all. It is read and passed to the filter function which was registered
> by the interested party -- otherwise the default filter will simply
> append it to the buffer it belongs to.
>
> The only thing that is lost is that it was ever read at all and thus
> an endless wait begins.

But if the wrong call to accept-process-output has read the process output, it could have also processed it and delivered the results to the wrong application, no?

> > So please point out the timer that does this, because I think that
> > timer needs to be fixed.
>
> If you still need that, I will do some digging and try to find it.

I think we need a thorough and detailed understanding of what's going on in this case before we can discuss the solutions, yes. IME, trying to fix a problem without a good understanding of what it is that we are fixing tends to produce partial solutions at best, and new bugs at worst.

So please reproduce this in an unoptimized build, and please also show the Lisp backtrace in this scenario. Then let's take it from there.

> > We already record the file descriptors on which we wait for process
> > output, see compute_non_keyboard_wait_mask. Once
> > wait_reading_process_output exits, it clears these records. So it
> > should be possible for us to prevent accept-process-output calls
> > issued by such runaway timers from waiting on the descriptors that are
> > already "taken": if, when we set the bits in the pselect mask, we find
> > that some of the descriptors are already watched by the same thread as
> > the current thread, we could exclude them from the pselect mask we are
> > setting up. Wouldn't that be a better solution? Because AFAIU, your
> > solution just avoids an infinite wait, but doesn't avoid losing the
> > process output, because it was read by the wrong Lisp code. Right?
>
> Hm... at the moment I don't see where data is lost with my solution.
> Maybe I am being totally blind and making a fool out of myself but I
> honestly don't see it.

Maybe there is no loss, but I'm not really sure your proposal solves the root cause, and without a detailed understanding of what exactly is going on, we have no way of discussing this rationally.

> > Well, I'd like to eyeball the timer which commits this crime.
>
> If you still do, let me know and I will try to track it down...

I do. I believe we must understand this situation very well before we reason about its solution.
Hello Eli...
On 28/10/17 11:28, Eli Zaretskii wrote:
> It lacks the Lisp backtrace ("xbacktrace" can produce it), and the
> fact that most arguments are either optimized out or complex data
> types shown as "..." makes the backtrace much less useful.

Sorry, like I said, I was in a hurry; that was all I could provide at the time. Attached you will find a full backtrace from a proper debug build w/o optimization. I usually only run those when really debugging something; otherwise I keep debug symbols around for an optimized build, which at least gives me a starting point and an idea when something goes wrong.

> But if the wrong call to accept-process-output have read the process
> output, it could have also processed it and delivered the results to
> the wrong application, no?

Given that a filter is registered for a given process, and thus it is this filter that gets called whenever process output is ready and needs to be processed, a timer or filter would have to replace that filter with its own through set-process-filter for this to happen. And that is something I would clearly consider a bug, since no filter or timer should do something like that.

Naturally there is also the weird case where accept-process-output is called from a hook by some package which expects that data and needs it, but doesn't consider that a timer could get called during that time, while the package itself has a timer set up that will also interact with that very same process, trying to read data back from some interaction with it. That will naturally fail as well... either way, with or without my patch. And again, I would consider this a bug: if a package interacts with its own processes, it should be careful, consider that timers/filters get called while accept-process-output runs, and take that into account.

> I think we need a thorough and detailed understanding of what's going
> on in this case, before we can discuss the solutions, yes. IME,
> trying to fix a problem without a good understanding what it is that
> we are fixing tends to produce partial solutions at best, and new bugs
> at worst.

I fully agree with your opinion; I just thought I had already properly explained in my lengthy mails what was going on behind the scenes and how my (rather simple) solution fixes it.

> So please reproduce this in an unoptimized build, and please also show
> the Lisp backtrace in this scenario. Then let's take it from there.

As said earlier, you will find that attached. But keep in mind that it will not be as helpful as you might hope, since you will only see where the hang occurs (I already stated that) but not why. At that point, the timers have already run.

> I do. I believe we must understand this situation very well before we
> reason about its solution.

One example is semantic. Its idle timers use semantic-throw-on-input, which calls accept-process-output without any arguments. Just as an example.

So long,
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
Hello Eli...
Very friendly ping. :-)

Are there any news on this issue, or anything else I can help or provide you with?

Thanks,
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
> Cc: [hidden email]
> From: Matthias Dahl <[hidden email]>
> Date: Fri, 3 Nov 2017 09:52:39 +0100
>
> Very friendly ping. :-)

Thanks, I didn't forget. I'll get to that soon.

> Are there any news on this issue or anything else I can help or provide
> you with?

I didn't yet have time to read your last message thoroughly, sorry.
> Cc: [hidden email]
> From: Matthias Dahl <[hidden email]>
> Date: Mon, 30 Oct 2017 10:48:05 +0100
>
> > But if the wrong call to accept-process-output have read the process
> > output, it could have also processed it and delivered the results to
> > the wrong application, no?
>
> Given that fact that a filter is registered for a given process, and
> thus it is this filter that gets called whenever process output is ready
> and needs to be processed, a timer or filter would have to replace that
> filter with its own through set-process-filter for this to happen. And
> that is something I would clearly consider a bug, since no filter or
> timer should do something like that.
>
> Naturally there is also the weird case when accept-process-output was
> called from a hook by some package which expects that data and needs it
> but doesn't consider that a timer could get called during that time and
> the package itself has a timer setup that will also interact with that
> very same process, trying to read data back from some interaction with
> it. That will naturally fail as well... either way, with or without my
> patch. And again, I would consider this a bug.

OK. I think I'm okay with your patches, but IMO they need a minor improvement: instead of setting got_some_output to an arbitrary value of 1, the code should set it to the increment of the bytes read from the sub-process. This is what we do when we actually read from the process, and I think this scenario should behave the same. WDYT?

With that change, it's okay to commit this to the master branch.

Thanks.
Hello Eli...
On 04/11/17 13:11, Eli Zaretskii wrote:
> OK. I think I'm okay with your patches, but IMO they need a minor
> improvement: instead of setting got_some_output to an arbitrary value
> of 1, the code should set it to the increment of the bytes read from
> the sub-process. This is what we do when we actually read from the
> process, and I think this scenario should behave the same. WDYT?

I know what you mean -- and even though this was bugging me as well when I implemented it, it is actually on purpose.

First of all, I wanted to make as few assumptions as possible about how processes are used, and to be on the conservative/defensive side with this solution, to make sure not to introduce new bugs or corner-cases.

In this particular case, I wanted to minimize any integer overflow problems. Granted, depending on how many bits unsigned long has, it is very (!) unlikely to cause any issues, but otoh that is usually how bugs get introduced in the first place. So if I had calculated the difference to set got_some_output properly, I would have had to take the integer overflow problem into account.

The current solution has exactly one corner-case, which is, imho, extremely unlikely: you would have to run wait_reading_process_output through a full wrap-around to exactly the same byte count it started with in the first place. It is safe to assume that this will practically never happen.

Also, no user of wait_reading_process_output uses the return value in such a way at all -- which I would also consider a bug, since it is not what is officially documented for the return value in the first place. The documentation simply says positive, negative or zero... without stating that it will actually return the number of bytes read.

So, imho, I would lean toward the current solution, since it is as simple as possible with as few corner-cases as possible.

If you would like to keep the status quo, though, and always return how many bytes were actually read, then I suggest a different approach altogether: instead of keeping track of how many bytes were read over the total lifetime of a process, we could track only how many bytes were read during an in-flight wait. This would require a byte counter and a counter for how many waits are currently in-flight, so the last one (or first one) could zero the byte counter again. That would make things quite a bit more complicated, imho... and without a proper gain to show for it.

> With that change, it's okay to commit this to the master branch.

May I suggest, even though it is rather late, to also consider this for the emacs-26 branch? Since it is not a new feature but a bug fix for a problem people are actually running into (see Magit), I think it would be justified. Personally, I have run a patched emacs-26 all the time here without any problems.

Thanks again for taking the time.

So long,
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
> Cc: [hidden email]
> From: Matthias Dahl <[hidden email]>
> Date: Mon, 6 Nov 2017 15:15:48 +0100
>
> In this particular case, I wanted to minimize any integer overflow
> problems. Granted, depending on how many bits unsigned long has, it is
> very (!) unlikely to cause any issues, but otoh that is usually how
> bugs get introduced in the first place. So if I had calculated the
> difference to set got_some_output properly, I would have had to take
> the integer overflow problematic into account.

Sorry, I don't understand. When we call emacs_read in other cases in that function, the result is an ssize_t, a signed type of the same width as size_t. So why would you need to handle overflow, when reading via emacs_read doesn't?

And even if the result could overflow, we have the likes of INT_ADD_WRAPV to make handling of that easier and safer. So I don't see why we should give up in this case.

> Also, no user of wait_reading_process_output uses the return value
> in such a way at all -- which I would also consider a bug since it
> is not what is officially documented for the return value in the
> first place. It simply says positive, negative or zero... without
> stating that it will actually return the bytes read.

I don't agree with this line of reasoning. We are not developing a library whose users will be familiar only with the formal interface definition and nothing else. We are talking about code that is being read, studied, and used by Emacs hack^H^H^H^Hdevelopers all the time. Having a non-trivial function whose behavior is hard to describe and remember accurately makes using it error-prone, especially if the deviant behavior happens only in some corner use case that is hard to reproduce.

In many cases, the information about these corner cases is lost soon after the code is introduced, and since we have no formal requirements for what functions like this one should do, and no good coverage by the test suite, future changes are impossible to test reliably, in order to make sure they don't break in those corner cases. So the situation where "the function FOO returns the number of bytes read from the subprocess, except when this and that happens, in which case it returns just 1" -- such a situation should be avoided at all costs, IME. And in this case, the cost is actually quite low, unless I'm missing something.

> If you would like to keep the status quo though and to always return
> how many bytes were actually read, then I suggest a different
> approach altogether. Instead of keeping track how many bytes were
> read for the total lifetime of a process, we could track only how
> many bytes were read for an in-flight wait. The would require a byte
> counter and a counter for how many waits are currently in-flight, so
> the last one (or first one) could zero the byte counter again.

Sorry, you lost me half-way through this description. I actually meant to return only the increment of how many bytes were read since the last call, but I don't see why you'd need more than one counter that will be reset to zero once its value is consumed.

> > With that change, it's okay to commit this to the master branch.
>
> May I suggest, even though it is rather late, to also consider this
> for the emacs-26 branch?

It's a bug that has been there "forever", and the fix is too risky to have in the middle of a pretest. I don't think we understand well enough all of its implications, and reading from subprocesses is a very central and very delicate part of Emacs. Sorry.

Thanks for working on this.
On 11/06/2017 08:34 AM, Eli Zaretskii wrote:
> When we call emacs_read in other cases in > that function, the result is an ssize_t, a signed type of the same > width as size_t. As a minor point, POSIX does not require that ssize_t be the same width as size_t, or that ptrdiff_t be the same width as size_t. Emacs has run (and as far as I know, still would run) on unusual platforms where ssize_t is narrower than size_t. |
> Cc: [hidden email]
> From: Paul Eggert <[hidden email]>
> Date: Mon, 6 Nov 2017 10:24:42 -0800
>
> On 11/06/2017 08:34 AM, Eli Zaretskii wrote:
> > When we call emacs_read in other cases in
> > that function, the result is an ssize_t, a signed type of the same
> > width as size_t.
>
> As a minor point, POSIX does not require that ssize_t be the same width
> as size_t, or that ptrdiff_t be the same width as size_t. Emacs has run
> (and as far as I know, still would run) on unusual platforms where
> ssize_t is narrower than size_t.

Yes, I know. We could replace ssize_t with ptrdiff_t if needed.
In reply to this post by Eli Zaretskii
Hello Eli...
On 06/11/17 17:34, Eli Zaretskii wrote:
> Sorry, I don't understand. When we call emacs_read in other cases in
> that function, the result is an ssize_t, a signed type of the same
> width as size_t. So why would you need to handle overflow, when
> reading via emacs_read doesn't?
>
> And even if the result could overflow, we have the likes of
> INT_ADD_WRAPV to make handling of that easier and safer.
>
> So I don't see why we should give up in this case.

The amount of output read (in total) from a process is tracked via the following member variable of Lisp_Process:

  /* Byte-count for process output read from `infd'. */
  unsigned long infd_num_bytes_read;

That gives us at least 4 GiB worth of data that can be read back and tracked before it wraps around to 0.

When wait_reading_process_output starts, the following happens:

  unsigned long initial_wait_proc_num_bytes_read
    = (wait_proc) ? wait_proc->infd_num_bytes_read : 0;

We save the current counter and compare it later on to see if data has been read from that process without us noticing it, e.g.:

  if (wait_proc
      && wait_proc->infd_num_bytes_read != initial_wait_proc_num_bytes_read)
    {
      got_some_output = 1;
      break;
    }

What might have happened now is that wait_proc->infd_num_bytes_read was already at the edge of the maximum representable value range and wrapped around during some recursive calls to wait_reading_process_output, while we still have a value in initial_wait_proc_num_bytes_read that has not wrapped around. E.g.:

  wait_proc->infd_num_bytes_read = (2^32)-10
  initial_wait_proc_num_bytes_read = wait_proc->infd_num_bytes_read

  ... recursive calls happen, output is read ...

  wait_proc->infd_num_bytes_read = 256
  initial_wait_proc_num_bytes_read = (2^32)-10

  // this will be wrong
  bytes_read = wait_proc->infd_num_bytes_read -
               initial_wait_proc_num_bytes_read

That causes a problem -- even though this case is unlikely, it is still possible and I don't think it should be dismissed.
To get how many bytes have been read for this current call, we have to calculate the difference between initial_wait_proc_num_bytes_read and wait_proc->infd_num_bytes_read. That naturally fails if there has been a wrap-around in wait_proc->infd_num_bytes_read. gnulib's integer overflow helpers won't help in this case, at all.

That is the whole reason why I did not set got_some_output to the proper byte count but simply to 1: to avoid this integer overflow situation which, if dealt with properly, would have made things a bit more complex and difficult. Naturally you can still do a greater-than comparison and choose the operands (more or less) properly that way. Since that calculation happens at two different spots in the code, and I generally perceived the bytes_read return value as unnecessary, and I did not really like that whole idea, I forwent doing this and went with the simple return value of 1. But maybe that is the simplest of all solutions, keeping KISS in mind.

> I don't agree with this line of reasoning. We are not developing a
> library whose users will be familiar only with the formal interface
> definition and nothing else. We are talking about code that is being
> read, studied, and used by Emacs hack^H^H^H^Hdevelopers all the time.

Even the C interface? It is my impression that the C parts of Emacs are usually what is being avoided at all costs. Besides that, I treated the C parts like implementation details and kept strictly to the documented interfaces, expecting no user to ever touch this. Nevertheless, I agree that consistency is important and I should have made that more of a priority in this case, given what you said.

> Having a non-trivial function whose behavior is hard to describe and
> remember accurately makes using it error-prone, especially if the
> deviant behavior happens only in some corner use case that is hard to
> reproduce.
That is why it is important that the documentation and implementation of a function actually align. In this case, no user should actually expect the function to return the number of bytes read, just a positive or negative number or zero... just like it is stated. But, again, that might be a matter of perspective. For me, what is documented is like a contract.

> Sorry, you lost me half-way through this description. I actually
> meant to return only the increment of how many bytes were read since
> the last call, but I don't see why you'd need more than one counter
> that will be reset to zero once its value is consumed.

My suggestion was to have an in-flight bytes-read counter that only tracks the number of bytes read while /any/ wait for that specific process is active. The way to properly convey that is through another variable that will be non-zero if we have anything waiting for that process's output.

Eventually that in-flight bytes-read counter needs to be reset to zero again. This can only happen if we have no other parties waiting for this process's output higher up in the call chain, as otherwise we would lose information and make it difficult/impossible for them to detect that something was actually read. So, that "other variable" that is non-zero if we have anything waiting for that process's output needs to actually track the number of parties that are waiting, so we can know when we are the last one and can safely reset the in-flight bytes-read counter.

But like I already stated earlier, maybe simply checking whether an overflow happened through a greater-than comparison is the simplest approach, since this solution introduces even more complexity and corner cases, imho.

> It's a bug that has been there "forever", and the fix is too risky to
> have it in the middle of a pretest. I don't think we understand well
> enough all of its implications, and reading from subprocesses is a
> very central and very delicate part of Emacs. Sorry.

Ack.
Even though this means that users running into this right now will have to wait quite a while for a fix in a stable release, which is a pity imho.

Thanks for all the feedback,
Matthias

PS. Sorry for the huge wall of text. :-( I just noticed how big this mail has gotten... yikes.

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
> From: Matthias Dahl <[hidden email]>
> Cc: [hidden email]
> Date: Tue, 7 Nov 2017 15:18:02 +0100
>
> The amount of output read (in total) from a process is tracked via the
> following member variable of Lisp_Process:
>
>   /* Byte-count for process output read from `infd'. */
>   unsigned long infd_num_bytes_read;
>
> That gives us at least 4 GiB worth of data that can be read back and
> tracked before it wraps around to 0.

If we want, we can make it EMACS_INT, which will give us as much as Emacs can gobble.

> What might have happened now is that wait_proc->infd_num_bytes_read was
> already at the edge of the maximum representable value range and wrapped
> around during some recursive calls to wait_reading_process_output while
> we still have a value in initial_wait_proc_num_bytes_read that has not
> wrapped around. E.g.:
>
>   wait_proc->infd_num_bytes_read = (2^32)-10
>   initial_wait_proc_num_bytes_read = wait_proc->infd_num_bytes_read
>
>   ... recursive calls happen, output is read ...
>
>   wait_proc->infd_num_bytes_read = 256
>   initial_wait_proc_num_bytes_read = (2^32)-10
>
>   // this will be wrong
>   bytes_read = wait_proc->infd_num_bytes_read -
>                initial_wait_proc_num_bytes_read
>
> That causes a problem -- even though this case is unlikely, it is still
> possible and I don't think it should be dismissed.

That should be very rare (barring application-level bugs), and in any case it is no worse than just returning 1. So I don't think we should give up better support of 90% or even 99% of use cases due to the other 10% or 1%.

> gnulib's integer overflow helpers won't help in this case, at all.

Those helpers are meant to avoid the overflow itself, thus preventing "undefined behavior" when a signed integer overflows. They cannot, and shouldn't be expected to, fix the reason for the overflow -- that cannot be helped when it happens.

> > I don't agree with this line of reasoning. We are not developing a
> > library whose users will be familiar only with the formal interface
> > definition and nothing else.
> > We are talking about code that is being
> > read, studied, and used by Emacs hack^H^H^H^Hdevelopers all the time.
>
> Even the C interface?

Especially in the C interface.

> It is my impression that the C parts of Emacs are usually what is
> being avoided at all costs.

It depends on who we are talking about. Some of us (yours truly included) work on the C level quite frequently. Any relatively radical new feature needs at least some C infrastructure. I had my share of writing code based on what I read in the function commentary and some general common sense and familiarity with the internals, only to find out later that they were incomplete or even prone to wrong interpretation. I'd like to minimize such occurrences as much as I can.

> Besides that, I treated the C parts like implementation details and
> kept strictly to the documented interfaces, expecting no user to
> ever touch this.

But that's exactly the problem: we _don't_have_ documented interfaces on the C level, at least not of the quality we would like to have. Do you really think that the commentary at the beginning of wait_reading_process_output describes what the function does anywhere close to completeness? Far from it.

> > Having a non-trivial function whose behavior is hard to describe and
> > remember accurately makes using it error-prone, especially if the
> > deviant behavior happens only in some corner use case that is hard to
> > reproduce.
>
> That is why it is important that the documentation and implementation of
> a function actually align.

Of course. But it's hard to keep them in sync, partly because the commentary is incomplete to begin with, which makes it not easy to recognize when it needs to change due to some code changes. And the result is that I quite frequently find comments that are blatantly incorrect: they mention parameters no longer there, or fail to mention new ones that were added, for example.
> In this case, no user should actually expect the function to return
> the number of bytes read, just a positive or negative number or
> zero... just like it is stated.

I'd actually say that the commentary is incomplete because it doesn't say what that "positive" value means. Returning an opaque value from a function is not useful, unless you can pass it later to the same function to get it to do something special.

> But, again, that might be a matter of perspective. For me, what is
> documented is like a contract.

IME with Emacs code, we are a long way from interfaces that are documented on a level that could be considered a contract. For starters, we don't even say clearly enough which functions can throw to top level, without which a contract is not really a contract, don't you agree?

> > Sorry, you lost me half-way through this description. I actually
> > meant to return only the increment of how many bytes were read since
> > the last call, but I don't see why you'd need more than one counter
> > that will be reset to zero once its value is consumed.
>
> My suggestion was to have an in-flight bytes read counter, that will
> only track the number of bytes read while /any/ wait for that specific
> process is active.

Ah, I see. I think this is an unnecessary complication. The risk of overflow, while it isn't zero, is too low to justify such measures and the resulting complexity, IMO.

> > It's a bug that has been there "forever", and the fix is too risky to
> > have it in the middle of a pretest. I don't think we understand well
> > enough all of its implications, and reading from subprocesses is a
> > very central and very delicate part of Emacs. Sorry.
>
> Ack. Even though this means that users running into this right now will
> have to wait quite a while for a fix in a stable release, which is a
> pity imho.

We can publish the patch here, and then people who really want this fixed ASAP can simply apply the patch.
In reply to this post by Matthias Dahl
> /* Byte-count for process output read from `infd'. */
> unsigned long infd_num_bytes_read;

BTW, we could count the number of (non-empty) "chunks" rather than the number of bytes.

>     {
>       got_some_output = 1;
>       break;
>     }

Please try to use `true' and `false' for boolean values (there's still a lot of code in src/*.c which uses 0 and 1, admittedly, but this should slowly disappear over time).

        Stefan
In reply to this post by Eli Zaretskii
Hello Eli...
Attached you will find the revised patches, which now properly set the number of bytes read. Sorry for the delay, but real life intervened.

Turns out, things were _a lot_ easier than I thought, since the C standard actually does have well-defined rules for computation on unsigned types, so that no overflow happens. And since it is modular arithmetic, it perfectly fits our use-case. :-) Sometimes the simplest solution is right in front of your eyes and you are thinking way too complicated. So sorry for the trouble. Had I known this earlier, I would not have skipped that part -- even though I had the best intentions, obviously.

On with the rest of your mail...

On 07/11/17 17:40, Eli Zaretskii wrote:
> If we want, we can make it EMACS_INT, which will give us as much as
> Emacs can gobble.

Thanks for the suggestion. I changed it to EMACS_UINT -- we need an unsigned type.

> I had my
> share of writing code based on what I read in the function commentary
> and some general common sense and familiarity with the internals, only
> to find out later that they were incomplete or even prone to wrong
> interpretation. I'd like to minimize such occurrences as much as I
> can.

I guess I was overly idealistic. Sorry for that. Nevertheless, it would be nice to improve the situation and make sure that future changes to the codebase also take the commentaries into account... and hold those themselves to the same high standard as the implementation itself. Just my two cents...

> But that's exactly the problem: we _don't_have_ documented interfaces
> on the C level, at least not of the quality we would like to have. Do
> you really think that the commentary at the beginning of
> wait_reading_process_output describes what the function does anywhere
> close to completeness? Far from it.

I know. OTOH, wait_reading_process_output might be a bad example, as it really could benefit greatly from a refactor, since it is way too big, complex and unclear.
And describing its function properly and concisely could prove quite difficult. But generally, I pretty much agree with all that you said in the rest of your mail.

> I'd actually say that the commentary is incomplete because it doesn't
> say what that "positive" value means. Returning an opaque value from
> a function is not useful, unless you can pass it later to the same
> function to get it to do something special.

I agree. It also leads the user to make assumptions about the value that might or might not turn out true for all cases... without properly reading the code.

Thanks again for all your great feedback and review. I hope we now have something that can be applied to the master branch. If anything comes up in terms of bugs or problems that might be related, please poke me if I miss it on the list and I will (try to) fix it.

Have a nice weekend,
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
In reply to this post by Stefan Monnier
Hello Stefan...
On 07/11/17 18:23, Stefan Monnier wrote:
> BTW, we could count the number of (non-empty) "chunks" rather than the
> number of bytes.

Thanks for the suggestion. In this case, I think just counting the bytes keeps things simple and makes the value easier to understand in a debugging session that benefits from this information. Also, counting the chunks wouldn't really improve the situation much, if I am not mistaken? It would just take a bit longer for a wrap-around to happen.

>>     {
>>       got_some_output = 1;
>>       break;
>>     }
>
> Please try to use `true' and `false' for boolean values (there's still
> a lot of code in src/*.c which uses 0 and 1, admittedly, but this should
> slowly disappear over time).

That is actually legacy code right there. I did not introduce that variable, nor is it a boolean. It is actually an integer in disguise that stores how many bytes have been read. ;-) It is very unfortunate naming, I agree. The revised patches don't set it to 1 but calculate the proper value, so this is a non-issue now.

Thanks for taking the time and have a nice weekend,
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu