5.0 regression: Script stuck when waiting in trap

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

5.0 regression: Script stuck when waiting in trap

mwnx
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS: -g -O2 -fdebug-prefix-map=/build/bash-Dl674z/bash-5.0=. -fstack-protector-strong -Wformat -Werror=format-security -Wall -Wno-parentheses -Wno-format-security
uname output: Linux cbb62a3d8393 4.15.0-50-generic #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Machine Type: x86_64-pc-linux-gnu

Bash Version: 5.0
Patch Level: 3
Release Status: release

Description:
        Since bash 5.0, a subshell can get stuck (wait forever) in
        what looks like a pretty specific set of circumstances,
        namely when combining a group command or a function call
        with process substitution and attempting to `wait` for said
        group command or function from within a trap.

        I ran a git bisect to pinpoint when exactly the regression
        was introduced and it looks like it happened in [1].

        [1] d233b485 (tag: bash-5.0) bash-5.0 distribution sources and documentation

Repeat-By:
        $ cat <<'EOF' | bash
        (
            trap 'wait' EXIT
            { sleep 2; } > >(cat)
        ) &
        sleep 1
        kill $!
        wait
        EOF
        <ctrl-C after a few seconds of waiting>
        # The process is actually still alive, even after ctrl-C...
        $ pstree -p
        sh(1)-+-bash(6)---pstree(321)
              `-bash(316)---bash(318)---cat(320)

        Replacing `{ sleep 2; }` above with `f() { sleep 2; }; f`
        gives the same result. On the other hand, replacing it with
        plain `sleep 2`, or `(sleep 2)` eliminates the problem.

--
mwnx
GPG: AEC9 554B 07BD F60D 75A3  AF6A 44E8 E4D4 0312 C726
     ipfs.io/ipfs/QmV4tMgNmqewgcPgX4ktAbrA9jHeiM2DhEFSB4BKxwj75c

Reply | Threaded
Open this post in threaded view
|

Re: 5.0 regression: Script stuck when waiting in trap

Chet Ramey
On 6/2/19 7:55 AM, mwnx wrote:

> Bash Version: 5.0
> Patch Level: 3
> Release Status: release
>
> Description:
> Since bash 5.0, a subshell can get stuck (wait forever) in
> what looks like a pretty specific set of circumstances,
> namely when combining a group command or a function call
> with process substitution and attempting to `wait` for said
> group command or function from within a trap.


>
> Repeat-By:
> $ cat <<'EOF' | bash
> (
>    trap 'wait' EXIT
>    { sleep 2; } > >(cat)
> ) &
> sleep 1
> kill $!
> wait
> EOF
> <ctrl-C after a few seconds of waiting>
> # The process is actually still alive, even after ctrl-C...
> $ pstree -p
> sh(1)-+-bash(6)---pstree(321)
>      `-bash(316)---bash(318)---cat(320)

Here's what happens. The relevant change is that wait without options now
waits for the last process substitution, since that sets $! and is "known"
to the shell.

The sequence of events is approximately:

1. Subshell starts, forks to run process substitution, opens a pipe to the
   process substitution, and runs the group command. The group command
   means the redirection is performed by the subshell, not the `sleep',
   since the redirections persist for the entire group command.

2. Main shell starts, runs sleep, kills the subshell started in step 1.
   This doesn't kill the sleep or the cat. The sleep exits on its own.
   The cat continues to run.

3. The main shell waits for the subshell.

4. The subshell, having received a fatal signal, runs the exit trap and
   waits for the process substitution ($!). It doesn't have a chance to do
   anything with the the file descriptor open to the process substitution
   as the result of the redirection, and the `cat' continues to run because
   it doesn't get the signal. The wait doesn't complete.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://tiswww.cwru.edu/~chet/

Reply | Threaded
Open this post in threaded view
|

Re: 5.0 regression: Script stuck when waiting in trap

mwnx
On Mon, Jun 03, 2019 at 03:42:22PM -0400, Chet Ramey wrote:

> Here's what happens. The relevant change is that wait without options now
> waits for the last process substitution, since that sets $! and is "known"
> to the shell.
>
> The sequence of events is approximately:
>
> 1. Subshell starts, forks to run process substitution, opens a pipe to the
>    process substitution, and runs the group command. The group command
>    means the redirection is performed by the subshell, not the `sleep',
>    since the redirections persist for the entire group command.
>
> 2. Main shell starts, runs sleep, kills the subshell started in step 1.
>    This doesn't kill the sleep or the cat. The sleep exits on its own.
>    The cat continues to run.
>
> 3. The main shell waits for the subshell.
>
> 4. The subshell, having received a fatal signal, runs the exit trap and
>    waits for the process substitution ($!). It doesn't have a chance to do
>    anything with the the file descriptor open to the process substitution
>    as the result of the redirection, and the `cat' continues to run because
>    it doesn't get the signal. The wait doesn't complete.
>
> Chet

Thanks for the explanation. In view of the change you describe,
there is another behaviour that I think might qualify as a bug. I'll
give you my actual use case first.

I simply want to make sure all processes running inside a given
subshell are killed on exit. To that means, I set up the following
trap in the shell (and potentially, its subshells and so on):

    trap 'kill $(jobs -p) &>/dev/null || true; wait' EXIT

This was working fine in bash 4.x, despite `jobs -p` not returning
the process ID of process substitutions. But now that `wait` (with
no arguments) waits for process substitutions in addition to
"ordinary" foreground and background processes, the situation is
asymmetric, leading to my subshells freezing when killed while a
process substitution is running.

So, although I realise that complete backwards compatibility is not
to be expected since this is a new major version, it would in my
opinion be consistent to have `jobs`'s output include process
substitutions, and would ensure that the use case I describe remains
supported. If you agree with the sentiment, I could try and submit a
patch for the `jobs` command myself if you wish.

--
mwnx
GPG: AEC9 554B 07BD F60D 75A3  AF6A 44E8 E4D4 0312 C726
     ipfs.io/ipfs/QmV4tMgNmqewgcPgX4ktAbrA9jHeiM2DhEFSB4BKxwj75c

Reply | Threaded
Open this post in threaded view
|

Re: 5.0 regression: Script stuck when waiting in trap

Chet Ramey
On 6/4/19 4:34 PM, mwnx wrote:

> Thanks for the explanation. In view of the change you describe,
> there is another behaviour that I think might qualify as a bug. I'll
> give you my actual use case first.
>
> I simply want to make sure all processes running inside a given
> subshell are killed on exit. To that means, I set up the following
> trap in the shell (and potentially, its subshells and so on):
>
>     trap 'kill $(jobs -p) &>/dev/null || true; wait' EXIT
>
> This was working fine in bash 4.x, despite `jobs -p` not returning
> the process ID of process substitutions. But now that `wait` (with
> no arguments) waits for process substitutions in addition to
> "ordinary" foreground and background processes, the situation is
> asymmetric, leading to my subshells freezing when killed while a
> process substitution is running.

Not quite. `wait' without arguments waits for the last process
substitution, and the pid of that process is available in $! for the
cases you care about. If you are sure that your script hasn't started
any asynchronous processes since the last process substitution, you
can use $! directly. Otherwise, you can capture it into a variable
and use it in the `kill' command.

You should also ensure that you're using bash-5.0 with patch 4 applied,
since that is relevant to this issue.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://tiswww.cwru.edu/~chet/

Reply | Threaded
Open this post in threaded view
|

Re: 5.0 regression: Script stuck when waiting in trap

mwnx
On Tue, Jun 04, 2019 at 07:19:02PM -0400, Chet Ramey wrote:

> On 6/4/19 4:34 PM, mwnx wrote:
>
> > Thanks for the explanation. In view of the change you describe,
> > there is another behaviour that I think might qualify as a bug. I'll
> > give you my actual use case first.
> >
> > I simply want to make sure all processes running inside a given
> > subshell are killed on exit. To that means, I set up the following
> > trap in the shell (and potentially, its subshells and so on):
> >
> >     trap 'kill $(jobs -p) &>/dev/null || true; wait' EXIT
> >
> > This was working fine in bash 4.x, despite `jobs -p` not returning
> > the process ID of process substitutions. But now that `wait` (with
> > no arguments) waits for process substitutions in addition to
> > "ordinary" foreground and background processes, the situation is
> > asymmetric, leading to my subshells freezing when killed while a
> > process substitution is running.
>
> Not quite. `wait' without arguments waits for the last process
> substitution, and the pid of that process is available in $! for the
> cases you care about. If you are sure that your script hasn't started
> any asynchronous processes since the last process substitution, you
> can use $! directly. Otherwise, you can capture it into a variable
> and use it in the `kill' command.

So, I tried the following:

     (echo a > >(cat; sleep 5); echo b > >(cat; sleep 1); wait)

which does confirm what you're saying; the command returns after
only 1 second, not 5. However, `help wait` states the following
(emphasis mine):

   If ID is not given, waits for all currently active child
   *processes*, and the return status is zero.

And, placing a call to `pstree` just before the `wait` in the above
command prints the following:

    [...]-bash-+-bash---sleep
               |-bash---sleep
               `-pstree

Which is to say both process substitutions are child processes of
the waiting process, therefore `wait` not waiting for the first one
seems to me to contradict the above quote.

Or, perhaps I'm misunderstanding the meaning of the term "active" in
the above quote and for some reason the first process substitution
is not to be considered "active"... Even so, I still don't really
get the rationale behind only having `wait` wait for the last
process substitution alone. After all, it does wait for all other
kinds of processes irrespective of when they were started or how
many there are, so why the special treatment for process
substitutions? This is just really confusing and error-prone —or at
least it has been for me, and I fail to see any upsides it.

By the way, I did find a workaround for my use case; replacing the
following in my trap:

    kill $(jobs -p)

with this:

    eval "kill \$(ps -o pid= --ppid $BASHPID)"

But I still think it would be nice if `wait` and `jobs` behaved more
consistently. To sum up, here are the two points I find to be
inconsistent:

1. `wait` takes into account process substitutions (or at least the
   last one) but `jobs` does not.

2. `wait` takes into account only the last process substitution but
   takes into account all processes of other kinds, not just the
   last of their kind (and this is not documented in the help or the
   man page).

What are your thoughts on these points? I feel like we should at
least be able to agree on point 2., especially in light of the fact
that the current behaviour is not even documented.

> You should also ensure that you're using bash-5.0 with patch 4 applied,
> since that is relevant to this issue.

Yep, ubuntu 19.04 is still on bash 5.0.3 but I've been making sure
to use the latest version, compiled from master (5.0.7).

--
mwnx
GPG: AEC9 554B 07BD F60D 75A3  AF6A 44E8 E4D4 0312 C726
     ipfs.io/ipfs/QmV4tMgNmqewgcPgX4ktAbrA9jHeiM2DhEFSB4BKxwj75c

Reply | Threaded
Open this post in threaded view
|

Re: 5.0 regression: Script stuck when waiting in trap

Robert Elz
    Date:        Thu, 6 Jun 2019 09:57:24 +0200
    From:        mwnx <[hidden email]>
    Message-ID:  <20190606075724.GA9670@noisy>

  | After all, it does wait for all other
  | kinds of processes irrespective of when they were started or how
  | many there are,

Shells aren't required to keep track of any process that the script
doesn't bother to track.   Many do, but it isn't required.   If you
want to know when a process ends, you need to save its PID when it
is created (ie: reference $! - save it in some variable).   Once you've
done that, the shell is supposed to track it, and "wait" should include
that process.

Whether bash works that way with processes created for process substitutions
(which are a non-standard thing to do) I don't know however.

kre


Reply | Threaded
Open this post in threaded view
|

Re: 5.0 regression: Script stuck when waiting in trap

Chet Ramey
In reply to this post by mwnx
On 6/6/19 3:57 AM, mwnx wrote:

>> Not quite. `wait' without arguments waits for the last process
>> substitution, and the pid of that process is available in $! for the
>> cases you care about. If you are sure that your script hasn't started
>> any asynchronous processes since the last process substitution, you
>> can use $! directly. Otherwise, you can capture it into a variable
>> and use it in the `kill' command.
>
> So, I tried the following:
>
>      (echo a > >(cat; sleep 5); echo b > >(cat; sleep 1); wait)
>
> which does confirm what you're saying; the command returns after
> only 1 second, not 5. However, `help wait` states the following
> (emphasis mine):
>
>    If ID is not given, waits for all currently active child
>    *processes*, and the return status is zero.

OK, so the wording is the issue? What would work better? We could use
the "known to the shell" wording POSIX does, or "background jobs" or
even "running jobs" that some other shells use.


> Or, perhaps I'm misunderstanding the meaning of the term "active" in
> the above quote and for some reason the first process substitution
> is not to be considered "active"... Even so, I still don't really
> get the rationale behind only having `wait` wait for the last
> process substitution alone.

Because that's the one corresponding to $! (unless you've started
another background job since, which it doesn't check for but probably
should).

>
> By the way, I did find a workaround for my use case; replacing the
> following in my trap:
>
>     kill $(jobs -p)
>
> with this:
>
>     eval "kill \$(ps -o pid= --ppid $BASHPID)"
>
> But I still think it would be nice if `wait` and `jobs` behaved more
> consistently. To sum up, here are the two points I find to be
> inconsistent:
>
> 1. `wait` takes into account process substitutions (or at least the
>    last one) but `jobs` does not.

Because they're not jobs. They can't be the target of any of the
commands that operate on jobs, can't be moved between the foreground
and background, and don't have a job id. Plus the issues with the
controlling terminal.

>
> 2. `wait` takes into account only the last process substitution but
>    takes into account all processes of other kinds, not just the
>    last of their kind (and this is not documented in the help or the
>    man page).

So you'd now like wait without arguments to wait on all process
substitutions? I don't think that's what most people want. It seems
like a reasonable compromise between waiting for none (pre-bash-5.0
behavior) and waiting for all. If you are interested in waiting for
all the process substitutions, you can collect the PIDs by getting
$! after each one starts, and wait for them explicitly.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://tiswww.cwru.edu/~chet/

Reply | Threaded
Open this post in threaded view
|

Re: 5.0 regression: Script stuck when waiting in trap

Chet Ramey
In reply to this post by Robert Elz
On 6/6/19 9:12 AM, Robert Elz wrote:

> Whether bash works that way with processes created for process substitutions
> (which are a non-standard thing to do) I don't know however.

The shell doesn't really have to remember them at all -- ksh93 doesn't, for
instance -- but bash remembers the last one, since it sets $!.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://tiswww.cwru.edu/~chet/

Reply | Threaded
Open this post in threaded view
|

Re: 5.0 regression: Script stuck when waiting in trap

Chet Ramey
On 6/6/19 4:50 PM, Chet Ramey wrote:
> On 6/6/19 9:12 AM, Robert Elz wrote:
>
>> Whether bash works that way with processes created for process substitutions
>> (which are a non-standard thing to do) I don't know however.
>
> The shell doesn't really have to remember them at all -- ksh93 doesn't, for
> instance -- but bash remembers the last one, since it sets $!.

And I just checked, and, in bash-5.0, wait without arguments attempts to
wait for the rest of the process substitutions after waiting for the last
one. My bad, I thought I put that in after bash-5.0 was released, but
it's there.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://tiswww.cwru.edu/~chet/

Reply | Threaded
Open this post in threaded view
|

Re: 5.0 regression: Script stuck when waiting in trap

mwnx
In reply to this post by Chet Ramey
On Thu, Jun 06, 2019 at 03:18:16PM -0400, Chet Ramey wrote:
> >    If ID is not given, waits for all currently active child
> >    *processes*, and the return status is zero.
>
> OK, so the wording is the issue? What would work better? We could use
> the "known to the shell" wording POSIX does, or "background jobs" or
> even "running jobs" that some other shells use.

"running jobs" seems fine to me, but with bash 5 I guess it would
have to be extended to "running jobs and the last started process
substitution if one exists and it is still running".

> > But I still think it would be nice if `wait` and `jobs` behaved more
> > consistently. To sum up, here are the two points I find to be
> > inconsistent:
> >
> > 1. `wait` takes into account process substitutions (or at least the
> >    last one) but `jobs` does not.
>
> Because they're not jobs. They can't be the target of any of the
> commands that operate on jobs, can't be moved between the foreground
> and background, and don't have a job id. Plus the issues with the
> controlling terminal.

Okay, that makes sense.

> > 2. `wait` takes into account only the last process substitution but
> >    takes into account all processes of other kinds, not just the
> >    last of their kind (and this is not documented in the help or the
> >    man page).
>
> So you'd now like wait without arguments to wait on all process
> substitutions?

Well, I guess I kind of liked the old behaviour (i.e. not waiting
for process substitutions at all). It just seemed less confusing.

> I don't think that's what most people want.

Maybe not indeed...

> It seems like a reasonable compromise between waiting for none
> (pre-bash-5.0 behavior) and waiting for all. If you are interested
> in waiting for all the process substitutions, you can collect the
> PIDs by getting $! after each one starts, and wait for them
> explicitly.

...but I don't really see the use case. If the PID can be retrieved
from `$!`, why not just `wait $!`? Because one might want to wait
for all background jobs _and_ the last process substitution? Sure,
this might help in this case:

    ... >(...) & ...; wait

But it won't in the following case, with a second process
substitution:

    ... >(...) >(...) & ...; wait

I guess what I'm trying to get at is that this new feature of `wait`
is lacking in generality, and that makes it confusing _and_ not all
that useful in, well, the general case.

And, by the way, I get an error message with the following command
in bash 5.0.7:

     $ echo >(sleep 5) >(sleep 1); pstree -p $BASHPID; wait
     /dev/fd/63 /dev/fd/62
     bash(10730)-+-bash(30435)---sleep(30437)
                 |-bash(30436)---sleep(30439)
                 `-pstree(30438)
     bash: wait_for: No record of process 30435

I also found out, while tinkering around, that it is not possible to
wait for any process substitution but the last one, even though this
should technically be possible in the underlying C implementation
since in my test examples the process substitution in question is
indeed a (direct) child process of the shell:

     $ echo > >(sleep 5); a=$!; echo > >(sleep 1); pstree -p $BASHPID; wait $a $!
     bash(30240)-+-bash(30241)---sleep(30243)
                 |-bash(30242)---sleep(30245)
                 `-pstree(30244)
     ./bash: wait: pid 30241 is not a child of this shell

This behaviour is present in both bash 4 and 5. Whether this is to
be considered a bug/undesirable or not...

Hopefully, I haven't come across as overly argumentative and
critical, and if I have, it is only because I actually really like
bash and have been using it daily, as a shell or as a scripting
language, for more than ten years. So, for what it's worth, thank
you very much for all the great work that you've put into
maintaining and improving bash for all these years.

--
mwnx
GPG: AEC9 554B 07BD F60D 75A3  AF6A 44E8 E4D4 0312 C726
     ipfs.io/ipfs/QmV4tMgNmqewgcPgX4ktAbrA9jHeiM2DhEFSB4BKxwj75c

Reply | Threaded
Open this post in threaded view
|

Re: 5.0 regression: Script stuck when waiting in trap

Chet Ramey
On 6/9/19 8:58 AM, mwnx wrote:

> On Thu, Jun 06, 2019 at 03:18:16PM -0400, Chet Ramey wrote:
>>>    If ID is not given, waits for all currently active child
>>>    *processes*, and the return status is zero.
>>
>> OK, so the wording is the issue? What would work better? We could use
>> the "known to the shell" wording POSIX does, or "background jobs" or
>> even "running jobs" that some other shells use.
>
> "running jobs" seems fine to me, but with bash 5 I guess it would
> have to be extended to "running jobs and the last started process
> substitution if one exists and it is still running".

Well, that's more wordy than I like, but it's on the right track.


>>> 2. `wait` takes into account only the last process substitution but
>>>    takes into account all processes of other kinds, not just the
>>>    last of their kind (and this is not documented in the help or the
>>>    man page).
>>
>> So you'd now like wait without arguments to wait on all process
>> substitutions?
>
> Well, I guess I kind of liked the old behaviour (i.e. not waiting
> for process substitutions at all). It just seemed less confusing.

Before this, there was no way to wait for a process substitution at all,
and if you're going to allow wait to wait for a single one, then you
might was well allow wait without arguments to wait for it as well.


>> It seems like a reasonable compromise between waiting for none
>> (pre-bash-5.0 behavior) and waiting for all. If you are interested
>> in waiting for all the process substitutions, you can collect the
>> PIDs by getting $! after each one starts, and wait for them
>> explicitly.
>
> ...but I don't really see the use case. If the PID can be retrieved
> from `$!`, why not just `wait $!`? Because one might want to wait
> for all background jobs _and_ the last process substitution?

Because `wait' without arguments waits for all processes "known to the
shell" (POSIX wording), and the last process substitution is certainly
known.

Sure,

> this might help in this case:
>
>     ... >(...) & ...; wait
>
> But it won't in the following case, with a second process
> substitution:
>
>     ... >(...) >(...) & ...; wait
>
> I guess what I'm trying to get at is that this new feature of `wait`
> is lacking in generality, and that makes it confusing _and_ not all
> that useful in, well, the general case.

There are three cases.

1. A process substitution that is associated with a file descriptor that
   persists between commands (exec 4< <(some command)).

2. A process substitution that is associated with a file descriptor that
   exists for the duration of a shell command. This is most often a shell
   function or builtin, but a group command is a special case.

3. A process substitution that is associated with a command that runs in
   a subshell (e.g., ls > >(some other command)).

The shell will wait for all process substitutions in case 1, some, but not
all, in case 2, and none in case 3 (since those are not children of the
shell).

The original report was about a deadlock caused by case 2.

>
> And, by the way, I get an error message with the following command
> in bash 5.0.7:
>
>      $ echo >(sleep 5) >(sleep 1); pstree -p $BASHPID; wait
>      /dev/fd/63 /dev/fd/62
>      bash(10730)-+-bash(30435)---sleep(30437)
>                  |-bash(30436)---sleep(30439)
>                  `-pstree(30438)
>      bash: wait_for: No record of process 30435
>
> I also found out, while tinkering around, that it is not possible to
> wait for any process substitution but the last one, even though this
> should technically be possible in the underlying C implementation
> since in my test examples the process substitution in question is
> indeed a (direct) child process of the shell:

It's a limitation of the current implementation. The shell closes file
descriptors to process substitutions as soon as it no longer needs them,
since not doing so will generally result in that process never exiting.
It doesn't remember the process substitution after closing the last file
descriptor to it, though it could, and should for the general case. It
will wait for running process substitutions to which it has open file
descriptors.

The shell assumes that the majority of commands used in process
substitutions do not persist very long after having their standard input
pipe closed, and will exit very quickly after consuming the available input
and reading EOF.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://tiswww.cwru.edu/~chet/