Geek Blight - Distributed versus centralized version control systems

Posted on 2007-06-10T20:55Z. Updated on 2012-09-30T06:58Z.

Version control systems, sometimes called revision control systems or source code management systems, are programs that let you track changes made to a set of files, usually plain text. They are mainly used to track changes in the source code of programs, but they can be used for other purposes. They’re very useful and, if you’re a programmer and don’t use one, you should consider starting. It doesn’t matter whether you work alone or with other people, or whether your project is very small or very big. A version control system will be helpful in the vast majority of cases: it works like a time machine for your files.

Nowadays there are several version control systems to choose from. Most of them fall into one of two categories, distributed or centralized, which differ in how you are expected to work, make changes and publish those changes. Still, they have some points in common.

Common concepts

There’s usually a repository, a place in which your changes are recorded. By accessing it, you can view a log of the changes or recover previous versions, or revisions, of the source code. New revisions are created when you commit a group of changes you’ve made. Let’s suppose you’re adding a function to a C program. You may add the prototype to a header file and the function definition to another source file. After that, you commit your changes and a new revision is created. There’s a revision before the function was added, and there’s a revision after the function was added. Together with the changes, most systems store the current time and date, the author of those changes and a log message you provide, which typically has a short, descriptive summary line and, optionally, a body explaining the changes in more detail.
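As a rough illustration of these concepts, here is a minimal Python sketch of a repository that stores full snapshots. It is not how any real system stores its data. The names Revision and Repository, the file names util.h and util.c, and the author "alice" are all made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Revision:
    """One recorded state of the project plus its metadata."""
    number: int
    author: str
    timestamp: datetime
    message: str
    files: dict  # file name -> full contents at this revision


class Repository:
    """Toy repository (hypothetical): an append-only list of revisions."""

    def __init__(self):
        self.revisions = []

    def commit(self, files, author, message):
        """Record a snapshot of `files` as a new revision and return it."""
        rev = Revision(
            number=len(self.revisions) + 1,
            author=author,
            timestamp=datetime.now(timezone.utc),
            message=message,
            files=dict(files),  # copy so later edits don't alter history
        )
        self.revisions.append(rev)
        return rev

    def log(self):
        """Return (number, summary line) pairs, oldest first."""
        return [(r.number, r.message.splitlines()[0]) for r in self.revisions]

    def checkout(self, number):
        """Return a copy of the files as they were at revision `number`."""
        return dict(self.revisions[number - 1].files)


repo = Repository()
work = {"util.h": "", "util.c": ""}
repo.commit(work, "alice", "Initial import")

# Add the prototype to the header and the definition to the source file,
# then commit both changes together as a single revision.
work["util.h"] = "int add(int a, int b);\n"
work["util.c"] = "int add(int a, int b) { return a + b; }\n"
repo.commit(
    work,
    "alice",
    "Add add() function\n\nPrototype in util.h, definition in util.c.",
)
```

Calling repo.log() lists one summary line per revision, and repo.checkout(1) returns the files as they were before the function was added, which is the "time machine" part.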

Centralized

The two most famous centralized version control systems are CVS and Subversion. They are called centralized because all the collaborators on a project work against one central repository. I have only used Subversion, because I’m relatively new to version control and, when I started using these systems, Subversion was already receiving positive reviews and was beginning to replace CVS in many projects, most notably KDE. Recently, Subversion replaced CVS as the version control system offered at SourceForge.net.

Like I said, everybody works against a central repository, and this repository lives in a different directory from the one holding the files you work with, which is called the "working directory". The repository can be on the same machine or on a different machine accessed over SSH, over a protocol specific to the system, or by other means. The repository must be available every time you want to commit your changes. Who can commit changes to a repository? In the simplest case, anyone with write access to the repository directories, either because they own them or because their user group can write to them. Or maybe there’s an authentication mechanism in place and you need to provide a username and password. For example, Subversion can use its own server (svnserve) to give access to a repository over the network, and it’s easily configured with text files inside the repository to require authentication to commit changes (settings in conf/svnserve.conf, users and passwords in conf/passwd).

You are expected to make changes and commit them to the repository, solving any conflicts that arise. Conflicts happen when, while you were changing a file, someone else committed a change to that same file and the program can’t automatically combine both sets of changes. After every commit or group of commits, you must run a command to update your working directory and bring into it the changes made by others.
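The following Python sketch shows one way a central repository might refuse a commit that is based on an outdated copy of a file, which is roughly the situation described above. It is a simplified model under my own assumptions, not Subversion’s actual implementation. CentralRepo, OutOfDate and the file name main.c are invented for the example, and the "merge" is just string concatenation.

```python
class OutOfDate(Exception):
    """Raised when a commit is based on an old revision of a file."""


class CentralRepo:
    """Toy central repository (hypothetical) with per-file staleness checks."""

    def __init__(self):
        self.head = 0            # latest revision number
        self.files = {}          # file name -> contents at head
        self.last_changed = {}   # file name -> revision that last changed it

    def commit(self, base_rev, changes):
        """Accept `changes` only if none of those files changed after base_rev."""
        stale = [n for n in changes if self.last_changed.get(n, 0) > base_rev]
        if stale:
            raise OutOfDate(f"update first: {sorted(stale)}")
        self.head += 1
        for name, text in changes.items():
            self.files[name] = text
            self.last_changed[name] = self.head
        return self.head

    def update(self):
        """Return the head revision number and a copy of its files."""
        return self.head, dict(self.files)


server = CentralRepo()
server.commit(0, {"main.c": "v1"})        # revision 1

# Two developers update to revision 1 and edit the same file.
alice_base, _ = server.update()
bob_base, _ = server.update()

server.commit(alice_base, {"main.c": "v2 by alice"})   # revision 2

try:
    server.commit(bob_base, {"main.c": "v2 by bob"})
    rejected = False
except OutOfDate:
    rejected = True   # the second committer's copy is out of date

# The second developer updates, reconciles the edit by hand, and commits again.
bob_base, files = server.update()
final_rev = server.commit(bob_base, {"main.c": files["main.c"] + " + bob"})
```

The second commit is rejected until that developer updates, which is the point at which conflicts have to be resolved in the modify-commit-update routine.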

That’s the working routine: modify-commit-update. In my humble opinion, this adapts very well to enterprise-like situations, where a group of developers in a flat hierarchy are working on a project and, with some exceptions, each developer is working on a different thing. In these situations, it’s unusual to have different project branches so there’s almost no need for branch merges (a weak point of Subversion according to many experts) and there’s almost no bureaucracy because everybody trusts everybody and, by having commit privileges to the repository, nobody needs to approve your changes. Everybody is committing changes and receiving changes all the time, via the central repository.

Distributed

There are several distributed version control systems. Probably the most famous ones are git, Mercurial and Monotone, but there are others. Some time ago I had a look at git and Mercurial and chose Mercurial. Being no expert, I found git more complicated and less well documented, and, among other factors, the OpenSolaris developers had chosen Mercurial over git.

With these systems there’s no central repository. Everybody has one or more, and there’s almost no distinction between working copies and repositories. To be more specific, each working copy has its own repository. The directory you’re working in at a given moment holds both the source files and the repository. You make changes and commit them, creating a new revision or, as Mercurial calls it, a changeset. Usually, there’s somebody who manages and owns the "official" repository. For example, Linus Torvalds manages the development of the Linux kernel and everybody tries to get their changes into his repository, because that’s the official kernel.

The mechanism to distribute changes is different from that of centralized systems. Changesets are pushed to or pulled from repositories. If somebody has given you push privileges, you can push your changesets into their tree. More frequently, I think, you ask somebody to pull changes from your repository. This is what Torvalds does. He frequently pulls changes from people he trusts. Let’s suppose you are working with a repository in which the most recent changeset or revision is number 1. You start adding a new feature to the program and commit changesets A, B and C. Your project’s history is 1-A-B-C. Somebody you work with started working that day from revision 1, like you did, and committed changesets D and E. Their project history is 1-D-E. Then, they ask you to pull their changes. The common practice in this system is to clone or copy your repository with 1-A-B-C (just in case problems arise) and pull from them into the clone. After the pull, the clone has two branches, both starting at revision 1. You then merge both branches, maybe resolving conflicts along the way, and end up with revision 2, which joins the two branches and combines the changes from both. If everything goes well, you can push everything, including the merge, to the "official" repository and everybody should pull from it. That’s the working routine: modify-commit-etc-pull-merge.
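The 1-A-B-C and 1-D-E example can be sketched in Python by modelling history as a graph in which every changeset records its parents. This is only a toy model of the idea and not the data structure Mercurial or git actually use. The Repo class and its methods are invented for the example, and the contents of the changesets are left out entirely.

```python
class Repo:
    """Toy distributed repository: changeset id -> tuple of parent ids."""

    def __init__(self, parents=None, tip=None):
        self.parents = dict(parents or {})
        self.tip = tip

    def commit(self, cid):
        """Add a changeset on top of the current tip."""
        self.parents[cid] = (self.tip,) if self.tip is not None else ()
        self.tip = cid

    def clone(self):
        """Return an independent copy of this repository."""
        return Repo(self.parents, self.tip)

    def ancestors(self, cid):
        """Return cid and every changeset reachable through its parents."""
        seen, stack = set(), [cid]
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(self.parents[c])
        return seen

    def pull(self, other):
        """Copy changesets we lack; return the other repo's tip (a new head)."""
        for cid, ps in other.parents.items():
            self.parents.setdefault(cid, ps)
        return other.tip

    def merge(self, other_head, cid):
        """Create a changeset with two parents, joining both branches."""
        self.parents[cid] = (self.tip, other_head)
        self.tip = cid


mine = Repo()
mine.commit("1")
theirs = mine.clone()          # both start from revision 1

for c in "ABC":
    mine.commit(c)             # my history:    1-A-B-C
for c in "DE":
    theirs.commit(c)           # their history: 1-D-E

scratch = mine.clone()         # work on a clone, just in case
head = scratch.pull(theirs)    # scratch now has two heads: C and E
shared = scratch.ancestors(scratch.tip) & scratch.ancestors(head)
scratch.merge(head, "2")       # revision 2 joins both branches
```

Here shared contains only revision 1, the point where the two branches diverged, and the merge changeset "2" has both C and E as parents. The original repository is left untouched until the merged result is pushed or pulled back.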

The advantages of this scheme should be obvious. First, this system scales much better with the number of people. It should also work better when there’s a hierarchy. Branching and merging fit much better into this model. Finally, I think it’s simpler when you work alone or with only a few people and setting up a shared repository is complicated or not possible. I remember working at college with a good friend of mine (who I think reads this blog. Hi, Álvaro!). We had to create a PHP website, and sending each other our changes was a somewhat chaotic process. After the first few days we got used to it and coped with the situation, but it would have been much easier if Mercurial had existed back then. One of us would have had the "official" version and we would have emailed each other the changesets. I remember that changes to the same files were exasperating because we had to do the merges by hand. Tools like KDiff3 that exist nowadays would have automated most of this process, and Mercurial has a nice feature to bundle one or more changesets into a platform-independent file that can be sent by email. Each of us worked at home and we had neither spare machines nor the time or energy to set up a CVS repository. A distributed version control system would have been the ideal solution in our situation.

Torvalds' controversy

Recently there was some controversy over this topic because Linus Torvalds talked about it in a meeting and expressed his preference for distributed version control over centralized systems, dismissing Subversion as a dumb idea. I think he always creates controversy because he has strong opinions and uses strong words to voice them. Still, some people were very upset. He argued two things. First, that a distributed version control system scales much better with the number of people. I won’t dispute that. I think it’s clearly true. However, he also tried to explain that a distributed solution removes a lot of bureaucracy, and I think that’s not true in many situations. It’s true that, in a project with a moderately high number of people or contributors, who gets commit privileges to a central repository is always a matter of discussion, causes problems and involves too much politics and bureaucracy. On the other hand, in an enterprise-like situation it’s obvious who has commit privileges: the ones working on the project. You set up privileges at the beginning and never discuss them again. Bureaucracy over. So I don’t think we should discard centralized systems at all. In an open source project with an enterprise-like organization, with no clear leader and in which 99% of the contributions come from a core group of developers, those are the ones with commit privileges. For the other 1% of changes, let the contributors mail you patches. In my opinion, it depends on the project, but many times you can use a centralized system and have less bureaucracy, because you don’t have to be pushing or pulling all the time, you don’t need to review other people’s changes if you don’t want to, etc.

Finally, I think people can adapt to situations. If you foresee that Subversion is going to be a better option, use Subversion. People will adapt to it with no problems. If you foresee a distributed solution is going to be better, use it. The safest and most flexible option is a distributed system, in my humble opinion. It’s the safe bet, it’s the one I use more nowadays, but not the only one and not always the best one.
