Saturday, August 8, 2009

A truce in the merge vs. rebase war?

There is a bit of a cultural war going on between the various DVCS communities. Git users love to use "git rebase" to rewrite the history in their local repository before pushing a perfect set of patches upstream, even though they realize that using rebase on work that has already been published causes lots of trouble. Bazaar users, on the other hand, tend to insist that history should be immutable and that merging is preferred over rebasing, and are shocked, SHOCKED at the libertine ways of their git counterparts.

In this blog post, I want to talk about the correspondence between rebasing and merging, and hope to show that there is a middle ground that has some of the advantages of both extremes. I will restrict myself to discussing a very specific kind of rebasing, namely tracking upstream changes while working on a feature branch.

In this post I will use ASCII art to represent repository DAGs; in these drawings, time progresses from left to right.

Suppose somebody is working on a feature branch that is logically developed as several smaller steps B1, B2, ..., and meanwhile development is continuing on master as M1, M2, ....

The problem with repeated merges from master to branch

After the feature branch has been developed for a while, the DAG looks like this.

M0----M1----M2
  \
   \
    \
     B1----B2

If master is merged to the feature branch, we get

M0----M1----M2
  \           \
   \           \
    \           \
     B1----B2----X1

where X1 is a merge commit that "corrects" B1+B2 for the changes that have occurred in master as M1+M2. A little while later, after work has proceeded on both master and the feature branch, the DAG looks like this

M0----M1----M2----M3----M4
  \           \
   \           \
    \           \
     B1----B2----X1----B3----B4

and the next merge changes it to

M0----M1----M2-------M3-------M4
  \           \                 \
   \           \                 \
    \           \                 \
     B1----B2----X1----B3----B4----X2

Where X2 is a merge commit that "corrects" B1+B2+X1+B3+B4 for the changes that occurred in master as M3 and M4.

The problem is that it is very difficult for a reviewer to check changes that are submitted in this form.

The reviewer could review the overall difference between M4 and X2, but then all of the changes would be mashed together and he wouldn't have the benefits of the commit messages explaining the individual steps.

The reviewer could review the changes patch by patch, but

  • He is most familiar with the current state of master. But he has to analyze the earlier patches in the context of older versions of master (M0 and M2).
  • The merge patches X1 and X2 are especially difficult to analyze. X2, for example, reflects the changes needed to adapt multiple patches (B1+B2+X1+B3+B4) to multiple changes on master (M1+M2+M3+M4). X1 cannot be checked without understanding the changes made between two historical master versions (M0..M2). Can you remember exactly what happened in your code base from four weeks ago until two weeks ago?
What's more, the patch series contains information that is of marginal usefulness to a reviewer. Why should the reviewer care how a logical change would have been made against a historical master revision, when his decision is actually whether to apply a change to the current master revision?

It is true that no information is lost, but the information is not retained in a very convenient form.


The problems with rebasing

The same situation, after the first rebase, now looks like this:

M0----M1----M2
              \
               \
                \
                 B1'---B2'

The original patches B1 and B2 have been transformed into patches B1' and B2', which are hopefully logically equivalent.

After further work, the DAG looks like

M0----M1----M2----M3----M4
              \
               \
                \
                 B1'---B2'---B3----B4

which is rebased a second time to

M0----M1----M2----M3----M4
                          \
                           \
                            \
                             B1''--B2''--B3'---B4'

This patch series is all very neat and clean and very easy to review. The patches can all be understood within a single context -- the current state of master -- that the reviewer is presumably most familiar with. The rebased patch series is optimized for the reviewer.

The problem is that rebasing discards history, perhaps not in the strict sense that intermediate commits are gone, but in the sense that there is no useful record of how the rebased patches are related to the original patches.

Specifically, the history of the individual patches has been lost (e.g., B1 -> B1' -> B1''). This loss is particularly unfortunate because these transformations were presumably done automatically by the VCS, which might have introduced errors. The author's original intention, as expressed in the original hand-written patches (B1, B2, B3, B4), has been lost. There is no way for the reviewer to drill down and tell the author, for example, that "when rebasing from B2' to B2'', you forgot to take into account change foo in M3".

Perhaps the worst problem with traditional rebasing is that it prevents collaboration. Somebody who pulls from a repository before and after a rebasing will experience conflicts because the two histories contradict each other. Thus the standard caveat "don't rebase within a published repository", which can be reworded to "don't collaborate on work that you might later want to rebase".


How should rebase really work?

For the simplest type of rebasing discussed above, it is simple to get both reviewing convenience and historical accuracy by changing rebase to retain a little bit more information about what it did.

Consider again the first rebasing:

M0----M1----M2          M0----M1----M2
  \                                   \
   \               -->                 \
    \                                   \
     B1----B2                            B1'---B2'

In fact, consider the first step of the first rebasing, in which B1 is "replayed" on top of M2:

M0----M1----M2           M0----M1----M2
  \                        \           \
   \               -->      \           B1'
    \                        \
     B1----B2                 B1----B2

This really involves merging the changes of (M1+M2) with the changes of B1. B1' is really a merge commit. So let's record it like a merge commit, with both of its ancestors:

M0----M1----M2           M0----M1----M2
  \                        \           \
   \               -->      \           B1'
    \                        \         /
     B1----B2                 -------B1----B2

The second step of the rebasing, similarly, merges B2 with (M1+M2+B1'), giving

M0----M1----M2           M0----M1----M2
  \                        \           \
   \               -->      \           B1'---B2'
    \                        \         /     /
     B1----B2                 -------B1----B2

The semantics of the resulting DAG is correct; B1' really does include all of the changes from M0+M1+M2 and also the changes from B1. Similarly, B2' includes the changes from B1' and from B2.

The resulting DAG is certainly more complicated than the one resulting from a traditional rebase, but it includes both the easy-to-review patches B1' and B2' and also the history of the individual patches B1 -> B1' and B2 -> B2'. It is an honest representation of the repository's history.

The key difference between this procedure and the simple merge described above is that the merge is done patch by patch. This avoids the creation of merge patches like X1 and X2 above that mix together the merge work needed for multiple patches.

After further work, the DAG looks like this:

M0----M1----M2----M3----M4
  \           \
   \           B1'---B2'---B3----B4
    \         /     /
     -------B1----B2

Now we rebase a second time:

M0----M1----M2----M3----M4
  \           \           \
   \           \           B1''--B2''--B3'---B4'
    \           \         /     /     /     /
     \           -------B1'---B2'---B3----B4
      \                /     /
       --------------B1----B2

The DAG is yet more complicated, but it is still correct and has the desired properties. It is easy to read off the easy-to-review patches (B1'', B2'', B3', B4'), and it is also easy to read off the history of individual patches (e.g., B1 -> B1' -> B1'').

Now, if somebody pulls from the repository between the two rebases and then again after the second rebase, the right thing will happen. The retention of the merge information allows rebased work to be shared. Indeed, one collaborator can do one rebase and another can do the next.

Maybe neither merging nor rebasing is optimal for maintaining a long-lived feature branch.

Is this overkill?

The DAGs resulting from these history-retaining rebases are quite complicated. Does anybody really want to see all this complexity in their main upstream repository?

Probably not, or at least not usually. So I see two alternatives:

  • Retain the rebase history while working on the feature, but discard it when submitting patches upstream. This approach has the advantage that multiple people can collaborate on the feature without stepping on each others feet, but the upstream repository is spared the intermediate complexity. It also doesn't require any changes to the tools, except maybe some simple scripts to help with the iterative merge needed to do a history-retaining rebase.
  • Retain the rebase history permanently, but suppress the display of the pre-rebase commits in most cases. This would require tool changes to allow commits to be marked "uninteresting" and to omit them from most types of operations.

See also my followup posts:

Post a Comment