Thursday, August 13, 2009

Git, Mercurial, and Bazaar—simplicity through inflexibility

Git, Mercurial, and Bazaar, the most prominent DAG-based distributed version control systems, have rightly earned a reputation for making branching and merging easy enough to do every day. And their proponents often deride Subversion for making branching easy but merging difficult.

What is the magic pixie dust that allows DVCSs to merge so effortlessly? And why is Subversion too stupid to get it right? (And how can the author of this blog possibly use the words "simplicity" or "inflexibility" in the same breath as "git"???)

Partisanship disclaimer: I like both Subversion and git and use them both daily, and I expect that I would like Mercurial and Bazaar if I had time to work with them. I have contributed patches to the testing frameworks of both Subversion and git, and I am the maintainer of cvs2svn/cvs2git. In other words, I'm a fan of all of the VCSs discussed in this article.

Contrary to what you might think, the DVCSs' merging prowess does not come from rocket scientists developing infallible merge algorithms. Their algorithms work better than Subversion's, but not radically so. Merge conflicts are an unavoidable aspect of parallel development, and nasty ones will occur regardless of which VCS is being used.

Rather, it is because the DAG-based DVCSs cannot represent complicated histories and therefore do not have to merge them.

Inflexible DVCSs

To understand merging in DAG-based VCSs, it is necessary to understand the Fundamental Law of DAG-based VCSs:

Every change made by every commit that is an ancestor of a given revision is assumed to be incorporated into that revision.

The Fundamental Law simplifies merging because it vastly restricts the types of histories that can be represented in the VCS. For example, the DAG cannot represent the history of

  • Cherry picking, in which some (but not all) commits from one branch are incorporated into another branch
  • Rebases in which the new parent is not a descendant of the old parent
  • Rebases in which the order of the commits is changed
  • Merging changes from branch to branch at a less-than-full-tree granularity; for example, merging the changes to file1 but not the changes to file2

Of course I am not claiming that the DVCSs do not support these operations. Rather, I am claiming that their support, where it exists, is broken because the DAG cannot properly record the histories of such operations. That is why, for example, it is problematic to rebase branches that have been published (though for some kinds of rebases there may be a solution). It is also why it is common to have conflicts when merging two branches if some content has already been cherry-picked from one branch to the other.
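
To make the cherry-picking case concrete, here is a hedged git sketch (the branch name and commit reference are invented for illustration):

  git checkout master
  git cherry-pick <some-commit-from-feature>   # copies the change as a new commit

  # The new commit carries no DAG link back to the original commit on
  # "feature"; history now contains two unrelated commits with the same
  # change, so a later "git merge feature" has no record that the change was
  # already incorporated and may conflict on it.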

Flexible Subversion

Subversion, on the other hand, completely earned its reputation for making merging difficult, primarily because it didn't record merges at all. That made merging a nightmare of manual record-keeping, or left users dependent on third-party tools like svnmerge.py. But that was before Subversion release 1.5.

Starting with release 1.5, Subversion, ironically, supports a much more flexible model of merging than the DAG-based DVCSs. Changes from any commit can be merged to any branch at the single-file level of granularity, enabling all of the operations listed above and some even weirder things (for example, a change that was originally applied to one file can be "merged" onto a completely different file). If your workflow demands this sort of thing, Subversion might hold significant advantages for you.
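
For example, here is a hedged sketch of what such fine-grained merging looks like on the command line (URLs, paths, and revision numbers are invented):

  # Cherry-pick only the change committed in r1234 on trunk into a branch
  # working copy:
  svn merge -c 1234 http://svn.example.com/repos/trunk branch-wc

  # Merge only the part of r1234 that touched one file, leaving the rest of
  # the revision out; Subversion records mergeinfo on that file alone:
  svn merge -c 1234 http://svn.example.com/repos/trunk/file1.c branch-wc/file1.c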

But there are also many disadvantages to Subversion's flexibility:

  • Subversion's merging model is more complicated than that of DAG-based VCSs, and therefore more complicated to implement and less predictable.
  • It is much harder to visualize the history of a Subversion project (contrast that to DVCSs, whose history can be displayed as a single DAG).
  • Subversion merges are innately slow, because of the large quantities of metadata that have to be manipulated.
  • The bookkeeping of Subversion's mergeinfo requires more conscientiousness from the user, and mistakes are not as easy to spot and fix (see the example after this list).
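
To give a flavor of that bookkeeping: the merge history is stored in versioned svn:mergeinfo properties, which users can inspect and, with a careless merge or property edit, also corrupt. The output below is invented but representative:

  svn propget svn:mergeinfo .
  # typical output (one line per merge source):
  #   /trunk:1680-1702,1713
  #   /branches/feature-x:1690-1700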

Simplicity vs. flexibility

Of course, each of these systems has dramatic strengths and weaknesses related to implementation quality, user interface, etc. But on the single criterion of how merge history is recorded, I claim that git, Mercurial, and Bazaar are fundamentally simpler than Subversion because they are less flexible.

So, does Subversion's flexibility mean that it is better than the DVCSs? Definitely not. Many projects don't need the flexibility offered by Subversion -- or, even worse, they mistakenly think that they need it. For such projects, the simpler DAG-based model is adequate and will make branching and merging go more smoothly. Other projects, particularly those that rely heavily on cherry-picking, should consider whether Subversion, or perhaps a DVCS with a different model such as darcs, would be a better fit. And there are certainly other considerations, besides the history model, that will influence the decision of which VCS to use.

Me? I'm going to continue using both.

Sunday, August 9, 2009

Rebase with history -- implementation ideas

In my last two blog posts, I discussed an idea for retaining history when rebasing and gave a concrete example of how it improves on traditional rebasing without history. In this third post in the series, I would like to discuss how rebase with history could be implemented.

First, a reminder of what a simple rebase-with-history looks like:

M0----M1----M2           M0----M1----M2
  \                        \           \
   \               -->      \           B1'---B2'
    \                        \         /     /
     B1----B2                 -------B1----B2

The work B1+B2 on a feature branch has been rebased onto the current version M2 of the master branch, producing two rebased revisions B1' and B2'. A traditional rebase-without-history would discard commits B1 and B2, whereas the point of rebase-with-history is to retain them.

A rebase-with-history can be done manually, today, in any of the DAG-based DVCSs. The algorithm is basically:

  1. Ensure that the old parent of the rebased branch (M0 in the example above) is a direct ancestor of the new parent (M2).
  2. Create a new temporary branch at M2.
  3. For each commit that is to be rebased, merge that commit onto the temporary branch, resolving conflicts if necessary. This creates merge commits B1' then B2'.
  4. Designate the last new merge commit (B2') at the head of the temporary branch to be the new head of the feature branch.
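
Here is a hedged sketch of what those steps might look like with today's git, using the commit names from the diagram in place of real branch names or SHA-1s:

  # Step 1: check (for example with "git merge-base" or by inspecting
  # "git log") that M0 is an ancestor of M2.

  # Step 2: create a temporary branch at M2.
  git checkout -b feature-rebased M2

  # Step 3: merge the branch commits one at a time, resolving any conflicts;
  # this creates the merge commits B1' and then B2'.
  git merge B1
  git merge B2

  # Step 4: make the head of the temporary branch the new head of the
  # feature branch.
  git branch -f feature
  git checkout feature
  git branch -d feature-rebased

Note that each of these merges records the previous tip (M2, then B1') as its first parent and the pre-rebase commit as its second parent; that ordering becomes relevant below.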

Avoiding information overload

Earlier I argued that the full DAG resulting from a rebase-with-history contains more detail than the average user wants to see. After all, many people are happy with the current style of rebase-without-history, which completely discards B1 and B2. So how can we reduce the visual clutter without losing the historical information that is needed to support future merges? In other words, assume that the VCS wants to store the above DAG, but that the typical user wants to see this:

M0----M1----M2
              \
               B1'---B2'

How can this dualism be implemented?

It is clear that the commits B1 and B2 have to be retained in the DAG rather than recorded in some kind of supplemental metadata. Otherwise, many VCS commands would have to be modified to be aware of this metadata and treat B1 and B2 as if they were in the DAG. It is also necessary to prevent B1 and B2 from being garbage collected. So it is simpler to retain them in the DAG proper.

It is also clear that somehow marking the pre-rebase commits B1 and B2 themselves "historical" is not correct. After all, those commits may be integral to other branches in which they are not "historical". Perhaps the rebased version of the branch will turn out to be a false start and never even be pushed upstream. So it is not the commits themselves that are "historical"; it is only their relationship with the rebased commits B1' and B2' that is "historical".

And that, of course, suggests the correct solution. Commit B1' has two parents, M2 and B1. M2 is always interesting and should be recorded as a normal parent. B1, on the other hand, is interesting mainly as the historical predecessor of B1', and usually not as part of the visible history in the upstream repository. Therefore, within the B1' node, B1 should be recorded as a parent but with a special "historical" flag.

Similarly, commit B2' has two parents--B1', which is a "normal" parent, and B2, which should be recorded as a "historical" parent.

The "historical" flag would be mainly a hint to GUI-related commands that the linked-to parent can be suppressed in most user views. When walking the DAG for display purposes, "historical" parents would simply not be followed. The historical commits B1 and B2 are only connected to the branch head via "historical" parent links, so they will not be traversed. By this mechanism, the DAG of the branch is displayed just like the DAG of a rebase-without-history. Of course, if commits B1 and B2 are non-historical ancestors of other branches, they will be displayed normally when the ancestry of those branches is displayed.

The GUI-related commands should have an extra --full-history flag to override the default behavior, causing historical parents to be displayed along with non-historical parents. (Optionally, historical links could be flagged somehow to allow them to be colored differently, etc.)
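
As an aside, today's git can already approximate the proposed default view without any new flags: if each rebase merge records the upstream side as its first parent and the pre-rebase commit as its second parent (as in the manual procedure sketched above), then a first-parent traversal hides the pre-rebase commits. It is only an approximation, of course, because it hides all second parents, not just "historical" ones.

  git log --graph --first-parent feature
  # follows only first parents, so B1 and B2 (reachable from the branch head
  # only as second parents of B1' and B2') do not appear in the output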

CAVEAT: Please recall that this whole discussion only applies to the "feature branch tracking upstream" class of rebase operations, namely those where:

  • the order of commits in the branch is unchanged, and
  • the new location of the branch is a descendant of the old location of the branch.

However, this is a common and important class of rebases. Other types of rebases (e.g., those that change the order of commits) cannot be done with history retention, because there is no way to represent such history within a DAG model. (darcs uses different technology and therefore doesn't share this limitation.)

Feedback: I would be very happy to receive feedback about this proposal for "rebase with history". Would it be helpful? Are there corner-cases that would prevent its implementation? Where has it been discussed before? (There's nothing new under the sun, you know...) Please add comments!

[Modified 2009-08-13 to add an outline of the rebase-with-history algorithm.]

Upstream rebase Just Works™ if history is retained

The git-rebase documentation has a section entitled "Recovering from Upstream Rebase" describing how to straighten out the mess that results from an upstream repository doing a rebase. In my last blog post I described how simple rebases can be done with history, and claimed that this makes it possible to share a repository that has been rebased. In this post, I would like to show how the example from the git-rebase manual page works out if rebase history is retained.

First, the example without history. The premise is that you are working on a topic branch that is dependent on a subsystem that somebody else is working on:

m---m---m---m---m---m---m---m  master
     \
      o---o---o---o---o  subsystem
                       \
                        *---*---*  topic

Then the subsystem developer rebases subsystem onto master:

m---m---m---m---m---m---m---m  master
     \                       \
      o---o---o---o---o       o'--o'--o'--o'--o'  subsystem
                       \
                        *---*---*  topic

If you now merge the topic branch to the subsystem branch, a mess results:

m---m---m---m---m---m---m---m  master
     \                       \
      o---o---o---o---o       o'--o'--o'--o'--o'--M  subsystem
                       \                         /
                        *---*---*-..........-*--*  topic

The first problem here is that git has no idea that the two versions of the subsystem revisions, namely "o-o-o-o-o" and "o'-o'-o'-o'-o'", actually contain the same logical changes. Therefore, the merge that creates commit M can result in spurious conflicts. Specifically, git will try to combine the changes in "o-o-o-o-o" with those in "o'-o'-o'-o'-o'", and may or may not be able to make sense out of the situation.

Second, in some (admittedly rather unlikely) situations, git's ignorance can cause problems with future merges.

Finally, there is no easy way to convert the merged topic branch into a series of patches that apply cleanly to the rebased subsystem branch and can therefore be submitted upstream. In this case, the only available patch that applies cleanly to the rebased subsystem branch is the merge patch M, which is a single commit that squashes together the entire topic branch and is therefore difficult to review.

In fact, in the git world you typically wouldn't merge the topic branch into the subsystem branch. Instead, you would rebase the topic branch onto the rebased subsystem branch:

m---m---m---m---m---m---m---m  master
                             \
                              o'--o'--o'--o'--o'  subsystem
                                               \
                                                *'--*'--*'  topic

This gives a usable result, but it requires extra effort by the topic branch maintainer (and the maintainers of any other branches that rely on the subsystem branch or the topic branch). This rebase is also prone to conflicts unless you explicitly tell git what interval of revisions to rebase, using a command like git rebase --onto subsystem SHA1 topic, where SHA1 refers to the last commit in the pre-rebase "o-o-o-o-o" series and you have to determine it yourself. Rebase-with-history relieves the user of this kind of manual bookkeeping.
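
For the record, here is a hedged sketch of that kind of command, roughly along the lines of what the git-rebase documentation suggests; it assumes that the pre-rebase tip of subsystem is still available in the reflog of your remote-tracking branch after fetching the rebased version:

  git fetch
  git rebase --onto origin/subsystem origin/subsystem@{1} topic
  # origin/subsystem@{1} names the old (pre-rebase) subsystem tip; if it has
  # already expired from the reflog, you have to dig up the SHA-1 by hand.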

If, instead, the subsystem maintainer had retained history when rebasing as described in my earlier post, the DAG would look like this:

m---m---m---m---m---m---m---m  master
     \                       \
      \                       o'--o'--o'--o'--o'  subsystem
       \                     /   /   /   /   /
        --------------------o---o---o---o---o
                                             \
                                              *---*---*  topic

Now you are free to merge the topic branch with the subsystem branch:

m---m---m---m---m---m---m---m  master
     \                       \
      \                       o'--o'--o'--o'--o'  subsystem
       \                     /   /   /   /   / \
        --------------------o---o---o---o---o   ---------
                                             \           \
                                              *---*---*---x  topic

The fact that both the old and new versions of the subsystem revisions are still in the DAG is not a problem, because the dependencies make it clear to git that the old version is already incorporated into the new version, so git will never try to merge them again. But this is also not ideal, because it still doesn't allow the topic branch to be converted into a tidy patch series.

Better yet, you can rebase your changes onto the subsystem branch, preferably also retaining rebase history:

m---m---m---m---m---m---m---m  master
     \                       \
      \                       o'--o'--o'--o'--o'  subsystem
       \                     /   /   /   /   / \
        --------------------o---o---o---o---o   \
                                             \   \
                                              \   *'--*'--*'  topic
                                               \ /   /   /
                                                *---*---*

Whether you retain history during this rebase or not, the rebase is simple because git can figure out for itself what range of revisions to rebase. The result is also a well-formed repository, and will not create any problems for other developers who might have based their work on your topic branch. And when you are ready, you can submit the patches "*'-*'-*'" as-is to the subsystem maintainer.
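
Concretely, the plain command suffices here (branch names as in the figures; a hedged sketch of the no-history variant):

  git rebase subsystem topic
  # git finds the correct range by itself, because the old "o" commits are
  # now ancestors of the rebased subsystem branch

To retain history on the topic side as well, you would instead do the iterative per-commit merges onto subsystem, as sketched above.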

Remember that just because the old and new versions of rebased revisions are included in the DAG doesn't mean that they have to be displayed in the UI all of the time. If there were a way to mark the old versions "uninteresting", then the "interesting" part of the DAG could be displayed just like in the rebase-without-history case:

m---m---m---m---m---m---m---m  master
                             \
                              o'--o'--o'--o'--o'  subsystem
                                               \
                                                *'--*'--*'  topic

but the retention of the hidden pre-rebase commits enables future rebases and merges to be carried out smoothly.

[Modified 2009-08-16 based on discussion the git mailing list; thanks especially to Björn Steinbrink for his comments.]


Saturday, August 8, 2009

A truce in the merge vs. rebase war?

There is a bit of a cultural war going on between the various DVCS communities. Git users love to use "git rebase" to rewrite the history in their local repository before pushing a perfect set of patches upstream, even though they realize that using rebase on work that has already been published causes lots of trouble. Bazaar users, on the other hand, tend to insist that history should be immutable and that merging is preferred over rebasing, and are shocked, SHOCKED at the libertine ways of their git counterparts.

In this blog post, I want to talk about the correspondence between rebasing and merging, and hope to show that there is a middle ground that has some of the advantages of both extremes. I will restrict myself to discussing a very specific kind of rebasing, namely tracking upstream changes while working on a feature branch.

In this post I will use ASCII art to represent repository DAGs; in these drawings, time progresses from left to right.

Suppose somebody is working on a feature branch that is logically developed as several smaller steps B1, B2, ..., and meanwhile development is continuing on master as M1, M2, ....

The problem with repeated merges from master to branch

After the feature branch has been developed for a while, the DAG looks like this.

M0----M1----M2
  \
   \
    \
     B1----B2

If master is merged to the feature branch, we get

M0----M1----M2
  \           \
   \           \
    \           \
     B1----B2----X1

where X1 is a merge commit that "corrects" B1+B2 for the changes that have occurred in master as M1+M2. A little while later, after work has proceeded on both master and the feature branch, the DAG looks like this

M0----M1----M2----M3----M4
  \           \
   \           \
    \           \
     B1----B2----X1----B3----B4

and the next merge changes it to

M0----M1----M2-------M3-------M4
  \           \                 \
   \           \                 \
    \           \                 \
     B1----B2----X1----B3----B4----X2

where X2 is a merge commit that "corrects" B1+B2+X1+B3+B4 for the changes that occurred in master as M3 and M4.

The problem is that it is very difficult for a reviewer to check changes that are submitted in this form.

The reviewer could review the overall difference between M4 and X2, but then all of the changes would be mashed together and he wouldn't have the benefits of the commit messages explaining the individual steps.

The reviewer could review the changes patch by patch, but

  • He is most familiar with the current state of master. But he has to analyze the earlier patches in the context of older versions of master (M0 and M2).
  • The merge patches X1 and X2 are especially difficult to analyze. X2, for example, reflects the changes needed to adapt multiple patches (B1+B2+X1+B3+B4) to multiple changes on master (M1+M2+M3+M4). X1 cannot be checked without understanding the changes made between two historical master versions (M0..M2). Can you remember exactly what happened in your code base from four weeks ago until two weeks ago?

What's more, the patch series contains information that is of marginal usefulness to a reviewer. Why should the reviewer care how a logical change would have been made against a historical master revision, when his decision is actually whether to apply a change to the current master revision?

It is true that no information is lost, but the information is not retained in a very convenient form.


The problems with rebasing

The same situation, after the first rebase, now looks like this:

M0----M1----M2
              \
               \
                \
                 B1'---B2'

The original patches B1 and B2 have been transformed into patches B1' and B2', which are hopefully logically equivalent.

After further work, the DAG looks like

M0----M1----M2----M3----M4
              \
               \
                \
                 B1'---B2'---B3----B4

which is rebased a second time to

M0----M1----M2----M3----M4
                          \
                           \
                            \
                             B1''--B2''--B3'---B4'

This patch series is all very neat and clean and very easy to review. The patches can all be understood within a single context -- the current state of master -- that the reviewer is presumably most familiar with. The rebased patch series is optimized for the reviewer.

The problem is that rebasing discards history, perhaps not in the strict sense that intermediate commits are gone, but in the sense that there is no useful record of how the rebased patches are related to the original patches.

Specifically, the history of the individual patches has been lost (e.g., B1 -> B1' -> B1''). This loss is particularly unfortunate because these transformations were presumably done automatically by the VCS, which might have introduced errors. The author's original intention, as expressed in the original hand-written patches (B1, B2, B3, B4), has been lost. There is no way for the reviewer to drill down and tell the author, for example, that "when rebasing from B2' to B2'', you forgot to take into account change foo in M3".

Perhaps the worst problem with traditional rebasing is that it prevents collaboration. Somebody who pulls from a repository before and after a rebasing will experience conflicts because the two histories contradict each other. Thus the standard caveat "don't rebase within a published repository", which can be reworded to "don't collaborate on work that you might later want to rebase".


How should rebase really work?

For the simplest type of rebasing discussed above, it is easy to get both reviewing convenience and historical accuracy by changing rebase to retain a little bit more information about what it did.

Consider again the first rebasing:

M0----M1----M2          M0----M1----M2
  \                                   \
   \               -->                 \
    \                                   \
     B1----B2                            B1'---B2'

In fact, consider the first step of the first rebasing, in which B1 is "replayed" on top of M2:

M0----M1----M2           M0----M1----M2
  \                        \           \
   \               -->      \           B1'
    \                        \
     B1----B2                 B1----B2

This really involves merging the changes of (M1+M2) with the changes of B1. B1' is really a merge commit. So let's record it like a merge commit, with both of its ancestors:

M0----M1----M2           M0----M1----M2
  \                        \           \
   \               -->      \           B1'
    \                        \         /
     B1----B2                 -------B1----B2

The second step of the rebasing, similarly, merges B2 with (M1+M2+B1'), giving

M0----M1----M2           M0----M1----M2
  \                        \           \
   \               -->      \           B1'---B2'
    \                        \         /     /
     B1----B2                 -------B1----B2

The semantics of the resulting DAG is correct; B1' really does include all of the changes from M0+M1+M2 and also the changes from B1. Similarly, B2' includes the changes from B1' and from B2.

The resulting DAG is certainly more complicated than the one resulting from a traditional rebase, but it includes both the easy-to-review patches B1' and B2' and also the history of the individual patches B1 -> B1' and B2 -> B2'. It is an honest representation of the repository's history.

The key difference between this procedure and the simple merge described above is that the merge is done patch by patch. This avoids the creation of merge patches like X1 and X2 above that mix together the merge work needed for multiple patches.

After further work, the DAG looks like this:

M0----M1----M2----M3----M4
  \           \
   \           B1'---B2'---B3----B4
    \         /     /
     -------B1----B2

Now we rebase a second time:

M0----M1----M2----M3----M4
  \           \           \
   \           \           B1''--B2''--B3'---B4'
    \           \         /     /     /     /
     \           -------B1'---B2'---B3----B4
      \                /     /
       --------------B1----B2

The DAG is yet more complicated, but it is still correct and has the desired properties. It is easy to read off the easy-to-review patches (B1'', B2'', B3', B4'), and it is also easy to read off the history of individual patches (e.g., B1 -> B1' -> B1'').

Now, if somebody pulls from the repository between the two rebases and then again after the second rebase, the right thing will happen. The retention of the merge information allows rebased work to be shared. Indeed, one collaborator can do one rebase and another can do the next.

Maybe neither merging nor rebasing is optimal for maintaining a long-lived feature branch.

Is this overkill?

The DAGs resulting from these history-retaining rebases are quite complicated. Does anybody really want to see all this complexity in their main upstream repository?

Probably not, or at least not usually. So I see two alternatives:

  • Retain the rebase history while working on the feature, but discard it when submitting patches upstream. This approach has the advantage that multiple people can collaborate on the feature without stepping on each other's feet, but the upstream repository is spared the intermediate complexity. It also doesn't require any changes to the tools, except maybe some simple scripts to help with the iterative merge needed to do a history-retaining rebase (one way to extract the clean patch series for submission is sketched after this list).
  • Retain the rebase history permanently, but suppress the display of the pre-rebase commits in most cases. This would require tool changes to allow commits to be marked "uninteresting" and to omit them from most types of operations.
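
For the first alternative, here is a hedged sketch of such a script. Because each rebased commit is a merge whose first parent is the previous tip, the "clean" patches can be read off by diffing each first-parent step (NEWBASE and REBASED_HEAD are placeholders for the new base and the final rebased head):

  for c in $(git rev-list --reverse --first-parent NEWBASE..REBASED_HEAD); do
      git diff "$c^1" "$c" > "$c.patch"   # the patch as applied on the new base
  done

The corresponding commit messages can be recovered with git log --first-parent over the same range.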


Benchmarking build systems

I've long been on a quest to find a replacement for "make" for building projects. I will be writing more about this, but for now I would like to report some preliminary results from benchmarking some build systems, including a newcomer to the scene.

The benchmarking effort was inspired by my discovery of a couple of excellent blog posts written by Noel Llopis in 2005. I especially wanted to extend his results to cover non-recursive use of make (PDF) and fabricate. Happily, Noel published the scripts that he used for his work, so I downloaded them and added two new benchmarks.

There are many other build tools (and build tool advocates!) so I made a public git repository to hold the benchmarking tools. I welcome patches containing benchmarks for other build tools! Personally, I would be very interested to see benchmarks of waf, which claims to be much faster than scons. [I have added data for waf to the results table below.]

Here I would like to report some preliminary results. Briefly, the test "project" consists of 50 libraries; each library consists of 100 C++ header files and 100 C++ source files; each header+source defines one trivial class; and the source files each include 15 other headers from the same library plus 5 headers from other, earlier-numbered libraries. The interlibrary dependencies only go one way, making it easy to define a self-consistent build order. Altogether the project includes about 10000 files, meaning that it is quite large but not enormous.

For comparison, I repeated some of Noel's original measurements on my computer (a Linux notebook), plus a few new ones. I only bothered with the "full" build (build everything from scratch) and the global "do-nothing" build (run the build again when there is nothing to do). The builds that I added are as follows:

A "make_nonrecursive" build that uses "gcc -MM" to generate include-file dependencies for each file, and includes all of the dependency files and the library-wide Makefiles into a single project-wide Makefile that knows about all of the dependencies. See "Recursive Make Considered Harmful" (PDF) for more information about this style of Makefile.

I added two fabricate benchmarks, one using atimes_runner and one using strace_runner. In both cases the build file does the obvious loop over libraries and source files, invoking g++ via fabricate.run() once for each file. The tests use fabricate trunk r21.

Here are my results:

Build system                                       Full compilation   Do-nothing build
make (recursively invoked)                         194 s              3.7 s
make_nonrecursive                                  319 s              1.7 s
scons                                              312 s              52 s
waf                                                207 s              2.7 s
fabricate, using atimes_runner                     4245 s             17.7 s
fabricate, using strace_runner                     251 s              18.9 s
fabricate, using strace_runner and mtime_hasher    328 s              352 s [1]

[1] I.e., this combination didn't work; it rebuilt everything.

Summary

make_nonrecursive is slower than recursive make for the full build, but this is probably mostly because make_nonrecursive uses "gcc -MM" to generate the dependencies rather than "makedepend", as used by the recursive make. This is an implementation difference that could be changed. For the do-nothing build, make_nonrecursive is faster.

fabricate with the atimes_runner is so excruciatingly slow at doing a full build that it cannot seriously be considered for a project of this size. In retrospect, it is clear that it has to be slow for large projects, because before and after every command it has to monkey around with the atimes of every single file in the project. Thus the time for building a project with N files goes something like O(N^2): roughly N commands, each of which requires inspecting the atimes of all N files.

A full build using fabricate and strace_runner is much better, almost as fast as make.

For do-nothing builds, fabricate is considerably slower than make, though not as slow as scons. A do-nothing build is the extreme case of an incremental build, which is the bread-and-butter of daily development. So the fabricate project should definitely invest time to speed up do-nothing builds.

For some reason, fabricate with strace_runner and mtime_hasher didn't work for me. The do-nothing build in fact rebuilt everything, making it even slower than the corresponding full build.

EDIT, 2009-09-15: Reformatted results in tabular form, and added numbers for waf.