Add slightly more cluster repository documentation

Summary: Ref T10751. There are still some missing support tools here, but explain some of this a little better.

Test Plan: Read documentation.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T10751

Differential Revision: https://secure.phabricator.com/D15764
2025-03-11 03:44:48 +01:00 · 2016-04-19 20:15:39 -07:00 · 2016-04-19 20:15:39 -07:00 · 48b015a3fa
commit 48b015a3fa
parent bab3690b54
1 changed files with 90 additions and 4 deletions
--- a/src/docs/user/cluster/cluster_repositories.diviner
+++ b/src/docs/user/cluster/cluster_repositories.diviner
@@ -19,19 +19,19 @@ advantages of doing this are:

 This configuration is complex, and many installs do not need to pursue it.

-This configuration is not currently supported with Subversion.
+This configuration is not currently supported with Subversion or Mercurial.


 Repository Hosts
 ================

 Repository hosts must run a complete, fully configured copy of Phabricator,
-including a webserver. If you make repositories available over SSH, they must
-also run a properly configured `sshd`.
+including a webserver. They must also run a properly configured `sshd`.

 Generally, these hosts will run the same set of services and configuration that
 web hosts run. If you prefer, you can overlay these services and put web and
-repository services on the same hosts.
+repository services on the same hosts. See @{article:Clustering Introduction}
+for some guidance on overlaying services.

 When a user requests information about a repository that can only be satisfied
 by examining a repository working copy, the webserver receiving the request
@@ -57,6 +57,17 @@ If it isn't, they block the read until they can complete a fetch.
 Before responding to a write, replicas obtain a global lock, perform the same
 version check and fetch if necessary, then allow the write to continue.

+Additionally, repositories passively check other nodes for updates and
+replicate changes in the background. After you push a change to a repository,
+it will usually spread passively to all other repository nodes within a few
+minutes.
+
+Even if passive replication is slow, the active replication makes acknowledged
+changes sequential to all observers: after a write is acknowledged, all
+subsequent reads are guaranteed to see it. The system does not permit stale
+reads, and you do not need to wait for a replication delay to see a consistent
+view of the repository no matter which node you ask.
+
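The active replication rules described above (check the version before a read, take a global lock and check the version before a write) can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not Phabricator's implementation: `Cluster`, `Node`, and every method name here are hypothetical, and the "fetch" is simulated by copying a list.

```python
import threading


class Cluster:
    """Hypothetical stand-in for the global state kept in the database."""

    def __init__(self):
        self.version = 0                    # newest acknowledged version
        self.leader = None                  # a node known to hold that version
        self.write_lock = threading.Lock()  # the global write lock


class Node:
    """Hypothetical repository node with its own on-disk copy."""

    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self.version = 0
        self.commits = []

    def _sync(self):
        # Version check: if this node is behind the global version,
        # fetch from a node that has it before doing anything else.
        leader = self.cluster.leader
        if leader is not None and self.version < self.cluster.version:
            self.commits = list(leader.commits)  # simulated fetch
            self.version = leader.version

    def read(self):
        # Reads block until the node has caught up, so they are never stale.
        self._sync()
        return list(self.commits)

    def write(self, commit):
        # Writes take the global lock, catch up, then bump the version.
        with self.cluster.write_lock:
            self._sync()
            self.commits.append(commit)
            self.version = self.cluster.version + 1
            self.cluster.version = self.version
            self.cluster.leader = self
```

In this model, a push to node `X` followed immediately by a read from node `Y` still returns the pushed change, because `Y` fetches before answering. Passive background replication is deliberately left out; it only affects how often the fetch is needed.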

 HTTP vs HTTPS
 =============
@@ -84,6 +95,81 @@ Other mitigations are possible, but securing a network against the NSA and
 similar agents of other rogue nations is beyond the scope of this document.


+Monitoring Replication
+======================
+
+You can review the current status of a repository on cluster nodes in
+{nav Diffusion > (Repository) > Manage Repository > Cluster Configuration}.
+
+This screen shows all the configured devices which are hosting the repository
+and the available version.
+
+**Version**: When a repository is mutated by a push, Phabricator increases
+an internal version number for the repository. This column shows which version
+is on disk on the corresponding node.
+
+After a change is pushed, the node which received the change will have a larger
+version number than the other nodes. The change should be passively replicated
+to the remaining nodes after a brief period of time, although this can take
+a while if the change was large or the network connection between nodes is
+slow or unreliable.
+
+You can click the version number to see the corresponding push logs for that
+change. The logs contain details about what was changed, and can help you
+identify if replication is slow because a change is large or for some other
+reason.
+
+**Writing**: This shows that the node is currently holding a write lock. This
+normally means that it is actively receiving a push, but can also mean that
+there was a write interruption. See "Write Interruptions" below for details.
+
+
+Write Interruptions
+===================
+
+A repository cluster can be put into an inconsistent state by an interruption
+in a brief window immediately after a write.
+
+Phabricator cannot commit changes to a working copy (stored on disk) and to
+the global state (stored in a database) atomically, so there is a narrow window
+between committing these two different states when some tragedy (like a
+lightning strike) can befall a server, leaving the global and local views of
+the repository state divergent.
+
+In these cases, Phabricator fails into a "frozen" state where further writes
+are not permitted until the failure is investigated and resolved.
+
+TODO: Complete the support tooling and provide recovery instructions.
+
+
+Loss of Leaders
+===============
+
+A more straightforward failure condition is the loss of all servers in a
+cluster which have the most up-to-date copy of a repository. For example:
+
+  - There is a cluster setup with two nodes, X and Y.
+  - A new change is pushed to server X.
+  - Before the change can propagate to server Y, lightning strikes server X
+    and destroys it.
+
+Here, all of the "leader" nodes with the most up-to-date copy of the repository
+have been lost. Phabricator will refuse to serve this repository because it
+cannot serve it consistently, and cannot accept writes without data loss.
+
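The rule in the example above can be expressed as a short check: the leaders are the nodes holding the highest known version, and the repository can only be served while at least one of them is reachable. This is a minimal sketch that assumes the per-node versions are still readable from the global state after a node is lost; the function names and the dictionary layout are hypothetical.

```python
def leaders(versions):
    """Return the set of nodes holding the newest known version.

    `versions` maps a node name to the version recorded for it.
    """
    newest = max(versions.values())
    return {node for node, version in versions.items() if version == newest}


def can_serve(versions, online):
    """True if at least one leader is among the reachable nodes."""
    return bool(leaders(versions) & set(online))


# Node X received a push (version 5) that has not reached node Y (version 4).
versions = {"X": 5, "Y": 4}
x_lost = can_serve(versions, online={"Y"})           # X destroyed
x_restored = can_serve(versions, online={"X", "Y"})  # X back in service
```

With only `Y` online the check fails, which matches the refusal to serve described above; restoring `X` makes it pass again.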
+The most straightforward way to resolve this issue is to restore any leader to
+service. The change will be able to replicate to other nodes once a leader
+comes back online.
+
+If you are unable to restore a leader, or are unsure that you can restore one
+quickly, you can use the monitoring console to review which changes are
+present on the leaders but not present on the followers by examining the
+push logs.
+
+TODO: Complete the support tooling and provide recovery instructions.
+
+
 Backups
 =======