Notes on fs/locks.c as of 2.6.17. The fl_link field of each lock may be on either of two lists, global to locks.c: file_lock_list: all locks that have been granted and are currently being held on some file. blocked_list: locks not yet granted, but which somebody is waiting on. For each lock on the blocked list, the fl_next field points to the currently held lock that it conflicts with. All the locks waiting on the same lock are linked together using the fl_block field to a list with head the blocking lock's fl_block field; new waiters are added to the tail of this list. For locks on the file_lock_list, fl_next is used to link the locks held on a given inode into a singly-linked list, referenced from the inode's i_flock. Locks on the i_flock list are in the order: leases, FLOCK locks, POSIX locks, with the POSIX locks grouped first by owner, and then, within each owner, ordered by byte range. The set of posix locks for a given owner is always represented by a series of nonoverlapping locks. When a new lock is applied it will split or merge with existing locks for the same owner as necessary. POSIX locks do not conflict when they are owned by the same pid, and must be removed when a pid closes a file. So for these purposes and others we need a notion of "ownership". However, we don't use the pid--we actually treat tasks that share file tables (tasks cloned with CLONE_FILES) as the same for locking purposes. So we actually store a pointer to the file table to fl_owner. (TODO: What are the posix requirements for threads vs. processes? Etc.) And even that isn't a sufficient notion of ownership for the case of locks acquired by lockd or nfsd on behalf of a process on a remote client. So posix_same_owner actually calls the fl_compare_owner() operation to determine ownership, only falling back on comparing fl_owner if the operation has not been provided. More about fl_lmops and fl_ops: all take BLK. according to Documentation/filesystems/Locking, none may block except fl_release_private: file_lock_operations: void fl_insert(struct file_lock *) void fl_remove(struct file_lock *) void fl_copy_lock(struct file_lock *, struct file_lock *) void fl_release_private(struct file_lock *) lock_manager_operations: int fl_compare_owner(struct file_lock *) int fl_notify(struct file_lock *) Only nlm actually uses the latter struct, filling it in with: nlmsvc_same_owner: returns true iff both fl_owners and fl_pids match. (XXX: but what are those two fields used for!?!) nlmsvc_notify_blocked: sets up some data structures so nlm can do something later (XXX) Similarly, only user of file_lock_operations appears to be nlm, with nlmclnt_lock_ops, which defines only fl_copy_lock and fl_release_private. XXX: Stopped revising here. Next: 1. understand nlm's interaction with locks.c 2. check rest for correctness. Why posix_locks_deadlock works: Each (fl_owner, fl_pid) is assumed to have at most one lock in the global blocked_list. The fl_next field in that lock points to the (granted and currently held) lock that this lock is waiting for. So the list_for_each loop here just replaces (blocked_owner,blocked_pid) by the (owner,pid) of the unique lock (if such exists) that the current (blocked_owner,blocked_pid) is waiting on. It repeats this until it either finds a loop or finds a (blocked_owner, blocked_pid) pair that isn't waiting on any lock. Consider the graph whose nodes are (owner, pid) pairs (either corresponding to held locks or blocked locks), and where there is an edge (o,p)->(o',p') iff there is a blocked lock l waiting on a held lock l' with the corresponding owners and pids. The above algorithm ensures that a newly added lock won't be part of a cycle, which guarantees that the graph will remain acyclic if it was before. Unlocking also won't add cycles; it just removes a node, or replaces a held lock by one of the locks blocking on it, collapsing two nodes into one. flock_lock_file: Basic outline: Under the BLK, search through locks held on this inode for a FLOCK lock with the same filp. If it has the same type (F_RDLCK, F_WRLCK), return 0; there's no need to do anything. Otherwise, just remove it. Return now if you were doing an unlock. Otherwise, cond_sched() to give someone else a chance, reacquire the BKL, then search the i_flock list again for a conflicting FLOCK(). If found, return -EAGAIN or insert a blocked lock. Otherwise, insert the new lock. * note "man flock" on my system at least seems to imply that flock() can upgrade a lock atomically; whereas from the code this is clearly untrue. * line 675: That first "break" is not a "continue", because of an assumption that FLOCK locks are listed before POSIX locks, so if we find a POSIX lock then we know we've already gone too far. __posix_lock_file: Basic outline: Take the BKL. Search the i_flock list for a conflicting posix lock; if found, and not called with FL_SLEEP set, return -EAGAIN; if FL_SEEP set then return -EDEADLOCK or insert a blocked lock and return -EAGAIN. If the FL_ACCESS flag was set, return 0 now, as we've confirmed there aren't any conflict locks. Search the i_flock list again, this time for posix locks with the same (fl_owner, fl_pid). Insert a lock between a lower and higher lock, or expand an existing lock, as necessary. (In the case of a lock of a different type (F_RDLCK, F_WRLCK, F_UNLCK), we may also be required to break up locks, replace existing locks, wake waiters, etc. I haven't attempted to understand all the cases.) If we didn't find locks with the same owner, go ahead and insert lock now. Drop the BKL, clean up, and return. How are races between trucation/read/write and lock/unlock handled when there's mandatory locking? I don't understand how a single call to locks_mandatory_area can do it since it doesn't hold any lock on return. Lifetimes of the lock structures themselves? (E.g., those returned by posix_test_lock?) flock_locks_conflict: * Why the LOCK_MAND tests? locks_remove_posix: "Can't use posix_lock_file here" comment only makes sense before Trond's patch. Leases ^^^^^^ See the fcntl(2) man page (in particular, discussion of F_SETLEASE and friends). break_lease: __setlease: Since we have a filp, we know on entry to __setlease that no one holds an exclusive lease on the file. lease_modify: Why zero out all the f_owner stuff when unlocking? lease_get_mtime: note comment applies to nfs *server*: so the server reports an updated mtime even in the absence of writes if somebody (possibly some other client) has an exclusive lease, under the theory that they may be in the process of modying the file, and hence that other clients should invalidate their caches (hence should explicitly break the lease?) time_out_leases: for all F_INPROGRESS leases (are those always at the head of the list of leases?) that have expired, clear the F_INPROGRESS bit, and downgrade or (in the F_UNLCK case) This could later result in a downgrade to read or an unlock. In the latter case, this will result in an fl_remove() call. --------- Email from Trond: "For posix locks we should pretty much be ignoring the value of fl_pid. It gets returned to the user in a F_GETLK call, but all the low-level functions (such as posix_same_owner) should be testing the value of fl_owner instead. "The exception here is nlmsvc_same_owner(), which has to follow the NLM byte range spec but doesn't rely on locks_remove_posix. The NFSv4 server code may do something cranky too, but that again won't rely on locks_remove_posix."