This document is my attempt to understand (and explain) Al Viro's
"shared subtrees" RFC:
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=110565591630267&w=2
(Note this has since been implemented by Ram Pai, and is included in recent
kernels. The following may be out of date.)
One way to introduce this is to imagine what we might add to the "mount" man
page:
    The mount --bind operation creates a clone of a file hierarchy so
    that any operations performed on one copy are instantly visible
    from the other. New mounts beneath one copy, however, do not
    automatically propagate to the other copy. To force mounts to
    propagate automatically, mark the tree as shareable before copying
    it with --bind:
        mount --mark-shared olddir
        mount --bind olddir newdir
    Mounts under olddir will then also appear atomically under newdir,
    and vice versa. If you want the sharing to occur in only one
    direction, you can subsequently mark one directory as a "slave":
        mount --make-slave newdir
    after which mounts under olddir will also appear under newdir,
    but not the reverse. To break the relationship completely,
        mount --make-private newdir
    and to undo the marking of olddir as shareable,
        mount --make-private olddir
    These operations also behave similarly for copies of hierarchies
    made by clone() with CLONE_NEWNS set.
The problem with the above discussion is that it doesn't explain how to handle
a lot of corner cases; e.g. if I mount --bind something shareable underneath a
mountpoint that itself already is propagating mounts to other mountpoints, then
what happens? Mostly it's what you expect, but the details need to be
specified carefully, which is what Al Viro's message does.
A few notes may help in reading Viro's detailed description:
1. When he talks about vfsmounts being contained in and/or owned
by p-nodes, all the vfsmounts in question are clones of each
other--they all have the same root dentry.
2. Note that "being contained in a p-node" and "being owned by a
p-node" are two different things: a vfsmount that is contained
in a p-node is completely equal to the others in that p-node
as far as mounts under them are concerned--a mount in any of
the vfsmounts contained in the p-node is reflected in all the
others. But when a vfsmount is *owned* by a p-node, the
propagation only happens in one direction: mounts made under a
vfsmount in the p-node also show up under the owned vfsmount,
but mounts under the owned vfsmount aren't propagated back to
the owning p-node.
The "p-node" terminology isn't really necessary; to me it's more intuitive to
start with the "propagates-to" relation.
Let A and B be vfsmounts with the same root dentry. Such pairs of vfsmounts
are created, for example, when we "mount --bind" or clone with CLONE_NEWNS.
Then we write A->B to mean "mounts onto mountpoints anywhere in A will also be
automatically made at the same point in B". We say that "mounts under A
propagate to mounts under B", or just "A propagates to B". We assume that this
relation has two fundamental properties:
1. If A->B and B->C, then A->C. (Transitivity)
2. If A->C and B->C, then either A->B or B->A. (So you can't inherit
mounts from two different vfsmounts unless one already inherits
from the other.)
It's also convenient to allow A->A; in practice this doesn't mean anything in
terms of propagation, except (as we shall see) it's the way that we mark
vfsmounts as "shareable" before they're actually cloned. So we'll call a
vfsmount A such that A->A a "shareable" vfsmount. Note that if we choose B in
property 2 above to be the same as A, we get a third property
3. If A->C, then A->A.
In other words, if mounts under A propagate to any other vfsmount, then A is
shareable.
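To make the three properties concrete, here is a minimal sketch that stores the
"propagates-to" relation as a plain set of (A, B) pairs and checks properties 1
and 2 directly. The representation, the string names for vfsmounts, and the
function names are all my own illustration; nothing here comes from the RFC or
from the kernel.

```python
from itertools import product

def is_transitive(rel):
    """Property 1: A->B and B->C imply A->C."""
    return all((a, d) in rel
               for (a, b), (c, d) in product(rel, rel) if b == c)

def has_single_lineage(rel):
    """Property 2: A->C and B->C imply A->B or B->A."""
    return all((a, b) in rel or (b, a) in rel
               for (a, c), (b, d) in product(rel, rel) if c == d)

def shareable(rel):
    """The vfsmounts A with A->A."""
    return {a for (a, b) in rel if a == b}

# Hypothetical example: A and A1 propagate to each other, and both
# propagate one way to S.
rel = {("A", "A"), ("A", "A1"), ("A1", "A"), ("A1", "A1"),
       ("A", "S"), ("A1", "S")}

print(is_transitive(rel), has_single_lineage(rel), shareable(rel))

# Property 3 falls out of property 2: a lone A->C without A->A is rejected.
print(has_single_lineage({("A", "C")}))
```

Note that has_single_lineage() compares every pair of sources, including a
source with itself, which is exactly the "choose B to be the same as A" step
that yields property 3.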
Now we're ready to explain how to perform the operations described above.
"mount --mark-shared dir": Add the relation A->A for every vfsmount
A under "dir".
"mount --bind olddir newdir": Let A be the vfsmount at olddir. Make a
copy A_1 of A, and graft A_1 into place at newdir as usual. However,
if A->A, then also add A->A_1, A_1->A, and A_1->A_1, thus setting up
propagation between A and A_1 and making A_1 shareable. (Note: this is
more complicated if newdir is in a shareable vfsmount; we ignore this
case for now.)
"mount --make-slave dir": remove any relation A->B with A in the given
tree (so that mounts no longer propagate out of that tree).
"mount --make-private dir": remove any relation A->B with either A or
B (or both) in the given tree.
Note that the latter two operations also make all the vfsmounts in question
unshareable. (Just take "A = B" in the statements above.)
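The four operations can be sketched as functions on the same kind of relation,
again a set of (A, B) pairs. This is only an illustration under simplifying
assumptions: a "tree" is given as a set of vfsmount names, the bind sketch
covers only the simple case where newdir is not in a shareable vfsmount, and it
adds just the three relations listed above. All function names are invented
for the example.

```python
def mark_shared(rel, tree):
    """Add A->A for every vfsmount A in the tree."""
    return rel | {(a, a) for a in tree}

def bind(rel, a, a1):
    """Record a new copy a1 of vfsmount a (simple case only)."""
    if (a, a) not in rel:
        return set(rel)          # source not shareable: no propagation
    return rel | {(a, a1), (a1, a), (a1, a1)}

def make_slave(rel, tree):
    """Drop every A->B with A in the tree."""
    return {(a, b) for (a, b) in rel if a not in tree}

def make_private(rel, tree):
    """Drop every A->B with A or B in the tree."""
    return {(a, b) for (a, b) in rel if a not in tree and b not in tree}

# Walk through the man page sequence, with A at olddir and A1 at newdir.
rel = mark_shared(set(), {"A"})
rel = bind(rel, "A", "A1")
slave = make_slave(rel, {"A1"})        # A still propagates to A1
private = make_private(slave, {"A1"})  # A1 fully detached
print(sorted(slave), sorted(private), make_private(private, {"A"}))
```

After make_slave the pair A1->A1 is gone as well, which is the "A = B" remark
above in action: the slave is no longer shareable.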
We should have some idea what should happen when we do a new mount beneath a
shareable vfsmount--the same mount should be replicated under any vfsmount that
the target mount propagates to--but we need to work out this procedure in
detail.
Let A be the vfsmount we're mounting, and let B be the vfsmount we're mounting
onto. Let B_1,...,B_n be the vfsmounts that B propagates to (so B->B_i for
each i). (Since B is shareable we have B->B, so B itself is among the B_i; say
without loss of generality that B = B_1.) Then we clone A to copies A_1,...,A_n
and mount each one at the same point in the corresponding vfsmount B_1,...,B_n.
This is an obvious enough interpretation of what we mean by propagating mounts.
However, we also want propagation to be recursive--if a tree is marked
shareable then we want not only mounts on the tree to be propagated, we also
want mounts on those mounts to be propagated, and so on recursively. So, for
each relation between B_1,...,B_n, we also add a corresponding relation between
the A_1,...,A_n. When we're done, we'll have A_i->A_j if and only if B_i->B_j.
Finally, one last wrinkle--if we're doing a --bind mount, and A itself is
shareable, then we also add the relations A->A_1, A_1->A, and A_1->A_1, as in
the description of "mount --bind" above. Note that we do this *only* for A_1,
not for the other copies A_i.
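The procedure just described can be sketched as follows. As before, the
relation is a set of (A, B) pairs, and the clone names such as "A@B2" are made
up purely so the example has something to print. The sketch adds only the
relations named in the text; in the --bind case it does not go on to add
whatever further relations transitivity would then require.

```python
def mount(rel, a, b, bind=False):
    """Mount a copy of vfsmount a on b and on everything b propagates to.

    Returns the updated relation and a dict mapping each target B_i to
    the (hypothetical) name of the clone A_i mounted on it.
    """
    # B comes first, so that B = B_1 and its clone is A_1.
    targets = [b] + sorted(x for (src, x) in rel if src == b and x != b)
    clone = {t: f"{a}@{t}" for t in targets}
    new = set(rel)
    # Mirror every relation among the B_i onto the corresponding A_i.
    new |= {(clone[s], clone[t])
            for s in targets for t in targets if (s, t) in rel}
    # Last wrinkle: a --bind of a shareable source ties A to A_1 only.
    if bind and (a, a) in rel:
        a1 = clone[b]
        new |= {(a, a1), (a1, a), (a1, a1)}
    return new, clone

# B and B2 propagate to each other; both propagate one way to B3.
rel = {("B", "B"), ("B", "B2"), ("B2", "B"), ("B2", "B2"),
       ("B", "B3"), ("B2", "B3")}
new, clone = mount(rel, "A", "B")
print(clone)
print(sorted(new - rel))
```

The relations printed at the end are exactly the original ones with each B_i
replaced by its A_i, so a later mount on any of the clones propagates the same
way a mount on the corresponding B_i would.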
This covers most of the important points.
To finish, Viro's "p-node" terminology may benefit from some explanation. Let
A be a shareable vfsmount, so A->A. We think of any vfsmount B such that A->B
and B->A as "equivalent" to A. By transitivity (property 1 above), any
vfsmounts that are equivalent to A are also shareable and are equivalent to
each other. We define the "p-node" containing A to be the set of all such
equivalent vfsmounts. Any shareable vfsmount is a member of a p-node, though
it may be the only member.
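In code, the definition amounts to grouping shareable vfsmounts by mutual
propagation. The following is a small sketch on the set-of-pairs
representation; the function name and the example vfsmount names are
illustrative only.

```python
def p_nodes(rel):
    """Group the shareable vfsmounts into p-nodes.

    Two vfsmounts land in the same p-node when each propagates to the other.
    """
    shareable = {a for (a, b) in rel if a == b}
    return {frozenset(b for b in shareable
                      if (a, b) in rel and (b, a) in rel)
            for a in shareable}

# A and A1 are equivalent, S only receives mounts, P is shareable but alone.
rel = {("A", "A"), ("A", "A1"), ("A1", "A"), ("A1", "A1"),
       ("A", "S"), ("A1", "S"), ("P", "P")}
for node in sorted(p_nodes(rel), key=sorted):
    print(sorted(node))
```

S appears in no p-node because it is not shareable, and P forms a p-node of
its own.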
There may also be vfsmounts B which A propagates to (so A->B) but which aren't
equivalent to A (so B->A is not true). However, the set of all vfsmounts that
A propagates to can be split up into p-nodes, and the set of such p-nodes forms
a tree. Actually, this is a slight lie--the leaves of the tree don't
themselves have to be shareable, so might not be in p-nodes. But every other
node of the tree must be (by property 3 above).
The set of all vfsmounts with the same root dentry is therefore divided into a
forest of trees of p-nodes (and, possibly, of p-nodeless vfsmounts at the
leaves).
When Viro says that a p-node p owns another p-node q, he means that q is
a child of p in this tree (but not a grandchild or other descendent).
Thus we can derive the propagation relationship from the tree of p-nodes by the
rules Viro gives at the beginning of his RFC: propagation occurs between all
vfsmounts in a p-node and passes from p-nodes to any p-nodes and vfsmounts they
own, and so on down the tree.
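As a rough sketch of that derivation, suppose the tree is given as two
dictionaries: one listing the vfsmounts in each p-node, and one listing what
each p-node owns. Both dictionaries, the p-node labels "p" and "q", and the
convention that a p-nodeless leaf is referred to by its own vfsmount name are
assumptions made for this example.

```python
def propagation(members, owns):
    """Rebuild the propagates-to relation from a tree of p-nodes.

    members: p-node label -> set of vfsmounts contained in it
    owns:    p-node label -> list of owned p-node labels or lone vfsmounts
    """
    def reach(node):
        # Vfsmounts in this node plus everything below it in the tree.
        out = set(members.get(node, {node}))
        for child in owns.get(node, ()):
            out |= reach(child)
        return out
    return {(a, b) for p in members for a in members[p] for b in reach(p)}

# p = {A, A1} owns q = {B}, and q owns the p-nodeless vfsmount C.
members = {"p": {"A", "A1"}, "q": {"B"}}
owns = {"p": ["q"], "q": ["C"]}
rel = propagation(members, owns)
print(sorted(rel))
```

C propagates to nothing in the result, as expected for a leaf that is not in
any p-node.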
--
Belabouring some technical points:
Implicit in the description of the various operations above is the claim that
they each preserve properties (1) and (2). This claim might require some
proof.
make-slave: Remove all A->B such that A is in the subtree, as specified above.
Property 1: If A->B and B->C still hold afterwards, then A is not contained in
the subtree. Therefore A->C, since that relation existed before and it was not
removed.
Property 2: Let A->C and B->C. Then neither A nor B is in the subtree, so
whichever of A->B or B->A held before still holds.
make private: Remove all A->B such that A or B is in the subtree.
Proofs of both properties are similarly trivial.
make shared: This creates only relations of the form A->A, which are obviously
OK.
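For anyone who would rather test than prove, the claims for make-slave and
make-private can be checked by brute force on a small universe. The sketch
below enumerates every relation on three vfsmounts, keeps those satisfying
properties 1 and 2, and applies both operations for every possible choice of
subtree. It is a sanity check on tiny cases under my set-of-pairs model, not a
substitute for the argument above; all names are illustrative.

```python
from itertools import combinations, product

def valid(rel):
    """True if rel satisfies properties 1 and 2."""
    transitive = all((a, d) in rel
                     for (a, b), (c, d) in product(rel, rel) if b == c)
    lineage = all((a, b) in rel or (b, a) in rel
                  for (a, c), (b, d) in product(rel, rel) if c == d)
    return transitive and lineage

def subsets(items):
    """Yield every subset of items as a set."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield set(combo)

def check(universe="ABC"):
    """Try make-slave and make-private on every valid relation and subtree."""
    pairs = list(product(universe, repeat=2))
    for rel in subsets(pairs):
        if not valid(rel):
            continue
        for tree in subsets(universe):
            slave = {(a, b) for (a, b) in rel if a not in tree}
            private = {(a, b) for (a, b) in slave if b not in tree}
            if not (valid(slave) and valid(private)):
                return False
    return True

print(check())
```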
mount --bind: Trickier, in part because my description above is a lie: in
addition to adding the relations A->A_1 and A_1->A, we also need to add all the
relations which are a consequence of these relations and transitivity. That
done, the result trivially satisfies property 1.
Property 2: Assume A->C and B->C; we want to establish A->B or B->A. Write M
for the original source vfsmount. Note that among the vfsmounts involved in
this process, all either existed beforehand, and propagate to M, or were
created during the mount process, and propagate from M. We already know
property 2 for the descendents of M and for M's ancestors. The problem is to
establish it when A, B, and C are a combination of the two.
If A->M and B->M, then property 2 for the ancestors of M implies that A->B or
B->A.
If A->M and M->B, then by transitivity A->B.
Similarly if M->A and B->M, then B->A.
Finally if M->A and M->B, then M->C also, so property 2 is a result of property
2 for descendents of M.
(This is really a much more trivial fact than the long discussion would imply:
it's intuitively obvious, for example, if you realize that you can describe the
mount --bind operation as cloning the p-node subtree rooted in the target
vfsmount and grafting it onto the p-node tree which the source vfsmount is a
member of, at the point of that source vfsmount. Adding a tree as a child to
another tree obviously results in a tree.)
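A sketch of the corrected --bind, with the transitive consequences added, might
look like this. It again works on a set of (A, B) pairs and still ignores the
case where newdir is itself in a shareable vfsmount; the function names and the
example vfsmounts M, A, S are hypothetical.

```python
def transitive_closure(rel):
    """Add A->C whenever A->B and B->C, until nothing new appears."""
    closed = set(rel)
    while True:
        extra = {(a, d) for (a, b) in closed
                 for (c, d) in closed if b == c} - closed
        if not extra:
            return closed
        closed |= extra

def bind(rel, a, a1):
    """Bind a shareable vfsmount a as a1, keeping the relation transitive."""
    if (a, a) not in rel:
        return set(rel)
    return transitive_closure(rel | {(a, a1), (a1, a), (a1, a1)})

# M propagates to A, and A propagates to S; A is shareable.
rel = {("M", "M"), ("M", "A"), ("M", "S"), ("A", "A"), ("A", "S")}
after = bind(rel, "A", "A1")
print(sorted(after - rel))
```

Besides the three relations from the simple description, the closure adds
M->A1 and A1->S: the new copy joins A's p-node, so it inherits everything A
receives and everything A propagates to.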