This document is my attempt to understand (and explain) Al Viro's "shared subtrees" RFC:

    http://marc.theaimsgroup.com/?l=linux-fsdevel&m=110565591630267&w=2

(Note this has since been implemented by Ram Pai, and is included in recent kernels. The following may be out of date.)

One way to introduce this is to imagine what we might add to the "mount" man page:

The mount --bind operation creates a clone of a file hierarchy so that any operations performed on one copy are instantly visible from the other. New mounts beneath one copy, however, do not automatically propagate to the other copy. To force mounts to propagate automatically, mark the tree as shareable before copying it with --bind:

    mount --make-shared olddir
    mount --bind olddir newdir

Mounts under olddir will then also appear atomically under newdir, and vice versa. If you want the sharing to occur in only one direction, you can subsequently mark one directory as a "slave":

    mount --make-slave newdir

after which mounts under olddir will also appear under newdir, but not the reverse. To break the relationship completely,

    mount --make-private newdir

and to undo the marking of olddir as shareable,

    mount --make-private olddir

These operations also behave similarly for copies of hierarchies made by clone() with CLONE_NEWNS set.

The problem with the above discussion is that it doesn't explain how to handle a lot of corner cases; e.g. if I mount --bind something shareable underneath a mountpoint that itself already is propagating mounts to other mountpoints, then what happens? Mostly it's what you expect, but the details need to be specified carefully, which is what Al Viro's message does.

A few notes may help in reading Viro's detailed description:

1. When he talks about vfsmounts being contained in and/or owned by p-nodes, all the vfsmounts in question are clones of each other--they all have the same root dentry.

2. Note that "being contained in a p-node" and "being owned by a p-node" are two different things: a vfsmount that is contained in a p-node is completely equal to the others in that p-node as far as mounts under them are concerned--a mount in any of the vfsmounts contained in the p-node is reflected in all the others. But when a vfsmount is *owned* by a p-node, the propagation only happens in one direction: mounts made under a vfsmount in the p-node also show up under the owned vfsmount, but mounts under the owned vfsmount aren't propagated back to the owning p-node.

The "p-node" terminology isn't really necessary; to me it's more intuitive to start with the "propagates-to" relation.

Let A and B be vfsmounts with the same root dentry. Such pairs of vfsmounts are created, for example, when we "mount --bind" or clone with CLONE_NEWNS. Then we write A->B to mean "mounts onto mountpoints anywhere in A will also be automatically made at the same point in B". We say that "mounts under A propagate to mounts under B", or just "A propagates to B".

We assume that this relation has two fundamental properties:

1. If A->B and B->C, then A->C. (Transitivity)

2. If A->C and B->C, then either A->B or B->A. (So you can't inherit mounts from two different vfsmounts unless one already inherits from the other.)

It's also convenient to allow A->A; in practice this doesn't mean anything in terms of propagation, except (as we shall see) it's the way that we mark vfsmounts as "shareable" before they're actually cloned. So we'll call a vfsmount A such that A->A a "shareable" vfsmount.

Note that if we choose B in property 2 above to be the same as A, we get a third property:

3. If A->C, then A->A.

In other words, if mounts under A propagate to any other vfsmount, then A is shareable.

Now we're ready to explain how to perform the operations described above.

"mount --make-shared dir": Add the relation A->A for every vfsmount A under "dir".

"mount --bind olddir newdir": Let A be the vfsmount at olddir. Make a copy A_1 of A, and graft A_1 into place at newdir as usual. However, if A->A, then also add A->A_1, A_1->A, and A_1->A_1, thus setting up propagation between A and A_1 and making A_1 shareable. (Note: this is more complicated if newdir is in a shareable vfsmount; we ignore this case for now.)

"mount --make-slave dir": Remove any relation A->B with A in the given tree (so that mounts no longer propagate out of that tree).

"mount --make-private dir": Remove any relation A->B with either A or B (or both) in the given tree.

Note that the latter two operations also make all the vfsmounts in question unshareable. (Just take "A = B" in the statements above.)

We should have some idea what should happen when we do a new mount beneath a shareable vfsmount--the same mount should be replicated under any vfsmount that the target mount propagates to--but we need to work out this procedure in detail.

Let A be the vfsmount we're mounting, and let B be the vfsmount we're mounting onto. Let B_1,...,B_n be the vfsmounts that B propagates to (so B->B_i for each i). (Note that, since B is shareable, B itself is among the B_i, so say without loss of generality that B = B_1.) Then we clone A to copies A_1,...,A_n and mount each one at the same point in the corresponding vfsmount B_1,...,B_n.

This is an obvious enough interpretation of what we mean by propagating mounts. However, we also want propagation to be recursive--if a tree is marked shareable then we want not only mounts on the tree to be propagated, we also want mounts on those mounts to be propagated, and so on recursively.

So, for each relation between B_1,...,B_n, we also add a corresponding relation between the A_1,...,A_n. When we're done, we'll have A_i->A_j if and only if B_i->B_j.

Finally, one last wrinkle--if we're doing a --bind mount, and A itself is shareable, then we also add the relations A->A_1, A_1->A, and A_1->A_1, as in the description of "mount --bind" above.
Note that we do this *only* for A_1, not for the other copies A_i.

This covers most of the important points. To finish, Viro's "p-node" terminology may benefit from some explanation.

Let A be a shareable vfsmount, so A->A. We think of any vfsmount B such that A->B and B->A as "equivalent" to A. By transitivity (property 1 above), any vfsmounts that are equivalent to A are also shareable and are equivalent to each other. We define the "p-node" containing A to be the set of all such equivalent vfsmounts. Any shareable vfsmount is a member of a p-node, though it may be the only member.

There may also be vfsmounts B which A propagates to (so A->B) but which aren't equivalent to A (so B->A is not true). However, the set of all vfsmounts that A propagates to can be split up into p-nodes, and the set of such p-nodes forms a tree.

Actually, this is a slight lie--the leaves of the tree don't themselves have to be shareable, so might not be in p-nodes. But every other node of the tree must be (by property 3 above).

The set of all vfsmounts with the same root dentry is therefore divided into a forest of trees of p-nodes (and, possibly, of p-nodeless vfsmounts at the leaves).

When Viro says that a p-node p owns another p-node q, he means that q is a child of p in this tree (but not a grandchild or other descendant).

Thus we can derive the propagation relationship from the tree of p-nodes by the rules Viro gives at the beginning of his RFC: propagation occurs between all vfsmounts in a p-node and passes from p-nodes to any p-nodes and vfsmounts they own, etc.

--

Belabouring some technical points:

Implicit in the description of the various operations above is the claim that they each preserve properties (1) and (2). This claim might require some proof.

make-slave: Remove all A->B such that A is in the subtree, as specified above.

    Property 1: If A->B and B->C still hold after the operation, then A is not contained in the subtree. Therefore A->C, since that relation existed before and it was not removed.

    Property 2: Let A->C and B->C. Then neither A nor B is in the subtree, so whichever of A->B or B->A held before still holds.

make-private: Remove all A->B such that A or B is in the subtree. Proofs of both properties are similarly trivial.

make-shared: This creates only relations of the form A->A, which are obviously OK.

mount --bind: Trickier, in part because my description above is a lie: in addition to adding the relations A->A_1 and A_1->A, we also need to add all the relations which are a consequence of these relations and transitivity. That done, the result trivially satisfies property 1.

    Property 2: Assume A->C and B->C; we want to establish A->B or B->A. Write M for the original source vfsmount. Note that among the vfsmounts involved in this process, all either existed beforehand, and propagate to M, or were created during the mount process, and propagate from M. We already know property 2 for the descendants of M and for M's ancestors. The problem is to establish it when A, B, and C are a combination of the two.

    If A->M and B->M, then property 2 for the ancestors of M implies that A->B or B->A. If A->M and M->B, then by transitivity A->B. Similarly if M->A and B->M, then B->A. Finally if M->A and M->B, then M->C also, so property 2 is a result of property 2 for descendants of M.

(This is really a much more trivial fact than the long discussion would imply: it's intuitively obvious, for example, if you realize that you can describe the mount --bind operation as cloning the p-node subtree rooted in the target vfsmount and grafting it onto the p-node tree which the source vfsmount is a member of, at the point of that source vfsmount. Adding a tree as a child to another tree obviously results in a tree.)
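
To make the relation-based description above a little more concrete, here is a rough Python sketch of it. This is only a toy model of the "propagates-to" relation as described in this document, not of the kernel implementation. The class name Propagation, its method names, and the "X@A" naming of copies are all made up for illustration; vfsmounts are represented by plain strings, and the more complicated case where newdir lies in a shareable vfsmount is ignored, as in the text.

```python
class Propagation:
    """Toy model of the "propagates-to" relation; vfsmounts are just strings."""

    def __init__(self):
        self.rel = set()  # (a, b) in rel means a->b

    def targets(self, a):
        """All vfsmounts that a propagates to."""
        return {y for (x, y) in self.rel if x == a}

    def make_shared(self, mounts):
        # Add A->A for every vfsmount A in the tree.
        self.rel |= {(a, a) for a in mounts}

    def make_slave(self, mounts):
        # Drop every A->B with A in the tree.
        self.rel = {(a, b) for (a, b) in self.rel if a not in mounts}

    def make_private(self, mounts):
        # Drop every A->B with A or B in the tree.
        self.rel = {(a, b) for (a, b) in self.rel
                    if a not in mounts and b not in mounts}

    def bind(self, a, a1):
        """Record that a1 is a fresh copy of a (newdir not in a shareable vfsmount)."""
        if (a, a) in self.rel:
            self.rel |= {(a, a1), (a1, a), (a1, a1)}
            self._close()

    def _close(self):
        # Add everything implied by transitivity until nothing changes.
        while True:
            implied = {(a, d) for (a, b) in self.rel
                       for (c, d) in self.rel if b == c}
            if implied <= self.rel:
                return
            self.rel |= implied

    def mount(self, a, b):
        """Mount a somewhere in b; return {B_i: name of the copy A_i}."""
        bs = self.targets(b) or {b}  # an unshareable b propagates nowhere
        copies = {bi: f"{a}@{bi}" for bi in bs}
        # A_i->A_j if and only if B_i->B_j.
        self.rel |= {(copies[x], copies[y]) for (x, y) in self.rel
                     if x in bs and y in bs}
        return copies

    def pnode(self, a):
        """Vfsmounts equivalent to a; empty if a is not shareable."""
        return {b for b in self.targets(a) if (b, a) in self.rel}

    def check(self):
        """True if properties 1 and 2 both hold."""
        r = self.rel
        prop1 = all((a, d) in r for (a, b) in r for (c, d) in r if b == c)
        prop2 = all((a, b) in r or (b, a) in r
                    for (a, c) in r for (b, d) in r if c == d)
        return prop1 and prop2


p = Propagation()
p.make_shared({"A"})
p.bind("A", "A1")
p.bind("A", "A2")
p.make_slave({"A2"})          # A and A1 still propagate to A2, but not back
copies = p.mount("X", "A")    # X is replicated under A, A1 and A2
print(sorted(p.pnode("A")), sorted(copies.values()), p.check())
```

In this example A and A1 end up in one p-node, A2 is a leaf owned by that p-node, and the three copies of X inherit exactly the same pattern of relations. The sketch leaves out the extra --bind wrinkle for mounts (adding A->A_1 and so on when the mounted vfsmount is itself shareable); check() is just a brute-force test of properties 1 and 2 and is useful mainly for experimenting with sequences of operations.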