Saturday, August 8, 2009

A truce in the merge vs. rebase war?

There is a bit of a cultural war going on between the various DVCS communities. Git users love to use "git rebase" to rewrite the history in their local repository before pushing a perfect set of patches upstream, even though they realize that using rebase on work that has already been published causes lots of trouble. Bazaar users, on the other hand, tend to insist that history should be immutable and that merging is preferred over rebasing, and are shocked, SHOCKED at the libertine ways of their git counterparts.

In this blog post, I want to talk about the correspondence between rebasing and merging, and hope to show that there is a middle ground that has some of the advantages of both extremes. I will restrict myself to discussing a very specific kind of rebasing, namely tracking upstream changes while working on a feature branch.

In this post I will use ASCII art to represent repository DAGs; in these drawings, time progresses from left to right.

Suppose somebody is working on a feature branch that is logically developed as several smaller steps B1, B2, ..., and meanwhile development is continuing on master as M1, M2, ....

The problem with repeated merges from master to branch

After the feature branch has been developed for a while, the DAG looks like this.

M0----M1----M2
  \
   \
    \
     B1----B2

If master is merged to the feature branch, we get

M0----M1----M2
  \           \
   \           \
    \           \
     B1----B2----X1

where X1 is a merge commit that "corrects" B1+B2 for the changes that have occurred in master as M1+M2. A little while later, after work has proceeded on both master and the feature branch, the DAG looks like this

M0----M1----M2----M3----M4
  \           \
   \           \
    \           \
     B1----B2----X1----B3----B4

and the next merge changes it to

M0----M1----M2-------M3-------M4
  \           \                 \
   \           \                 \
    \           \                 \
     B1----B2----X1----B3----B4----X2

Where X2 is a merge commit that "corrects" B1+B2+X1+B3+B4 for the changes that occurred in master as M3 and M4.

The problem is that it is very difficult for a reviewer to check changes that are submitted in this form.

The reviewer could review the overall difference between M4 and X2, but then all of the changes would be mashed together and he wouldn't have the benefits of the commit messages explaining the individual steps.

The reviewer could review the changes patch by patch, but

  • He is most familiar with the current state of master. But he has to analyze the earlier patches in the context of older versions of master (M0 and M2).
  • The merge patches X1 and X2 are especially difficult to analyze. X2, for example, reflects the changes needed to adapt multiple patches (B1+B2+X1+B3+B4) to multiple changes on master (M1+M2+M3+M4). X1 cannot be checked without understanding the changes made between two historical master versions (M0..M2). Can you remember exactly what happened in your code base from four weeks ago until two weeks ago?
What's more, the patch series contains information that is of marginal usefulness to a reviewer. Why should the reviewer care how a logical change would have been made against a historical master revision, when his decision is actually whether to apply a change to the current master revision?

It is true that no information is lost, but the information is not retained in a very convenient form.


The problems with rebasing

The same situation, after the first rebase, now looks like this:

M0----M1----M2
              \
               \
                \
                 B1'---B2'

The original patches B1 and B2 have been transformed into patches B1' and B2', which are hopefully logically equivalent.

After further work, the DAG looks like

M0----M1----M2----M3----M4
              \
               \
                \
                 B1'---B2'---B3----B4

which is rebased a second time to

M0----M1----M2----M3----M4
                          \
                           \
                            \
                             B1''--B2''--B3'---B4'

This patch series is all very neat and clean and very easy to review. The patches can all be understood within a single context -- the current state of master -- that the reviewer is presumably most familiar with. The rebased patch series is optimized for the reviewer.

The problem is that rebasing discards history, perhaps not in the strict sense that intermediate commits are gone, but in the sense that there is no useful record of how the rebased patches are related to the original patches.

Specifically, the history of the individual patches has been lost (e.g., B1 -> B1' -> B1''). This loss is particularly unfortunate because these transformations were presumably done automatically by the VCS, which might have introduced errors. The author's original intention, as expressed in the original hand-written patches (B1, B2, B3, B4), has been lost. There is no way for the reviewer to drill down and tell the author, for example, that "when rebasing from B2' to B2'', you forgot to take into account change foo in M3".

Perhaps the worst problem with traditional rebasing is that it prevents collaboration. Somebody who pulls from a repository before and after a rebasing will experience conflicts because the two histories contradict each other. Thus the standard caveat "don't rebase within a published repository", which can be reworded to "don't collaborate on work that you might later want to rebase".


How should rebase really work?

For the simplest type of rebasing discussed above, it is simple to get both reviewing convenience and historical accuracy by changing rebase to retain a little bit more information about what it did.

Consider again the first rebasing:

M0----M1----M2          M0----M1----M2
  \                                   \
   \               -->                 \
    \                                   \
     B1----B2                            B1'---B2'

In fact, consider the first step of the first rebasing, in which B1 is "replayed" on top of M2:

M0----M1----M2           M0----M1----M2
  \                        \           \
   \               -->      \           B1'
    \                        \
     B1----B2                 B1----B2

This really involves merging the changes of (M1+M2) with the changes of B1. B1' is really a merge commit. So let's record it like a merge commit, with both of its ancestors:

M0----M1----M2           M0----M1----M2
  \                        \           \
   \               -->      \           B1'
    \                        \         /
     B1----B2                 -------B1----B2

The second step of the rebasing, similarly, merges B2 with (M1+M2+B1'), giving

M0----M1----M2           M0----M1----M2
  \                        \           \
   \               -->      \           B1'---B2'
    \                        \         /     /
     B1----B2                 -------B1----B2

The semantics of the resulting DAG is correct; B1' really does include all of the changes from M0+M1+M2 and also the changes from B1. Similarly, B2' includes the changes from B1' and from B2.

The resulting DAG is certainly more complicated than the one resulting from a traditional rebase, but it includes both the easy-to-review patches B1' and B2' and also the history of the individual patches B1 -> B1' and B2 -> B2'. It is an honest representation of the repository's history.

The key difference between this procedure and the simple merge described above is that the merge is done patch by patch. This avoids the creation of merge patches like X1 and X2 above that mix together the merge work needed for multiple patches.

After further work, the DAG looks like this:

M0----M1----M2----M3----M4
  \           \
   \           B1'---B2'---B3----B4
    \         /     /
     -------B1----B2

Now we rebase a second time:

M0----M1----M2----M3----M4
  \           \           \
   \           \           B1''--B2''--B3'---B4'
    \           \         /     /     /     /
     \           -------B1'---B2'---B3----B4
      \                /     /
       --------------B1----B2

The DAG is yet more complicated, but it is still correct and has the desired properties. It is easy to read off the easy-to-review patches (B1'', B2'', B3', B4'), and it is also easy to read off the history of individual patches (e.g., B1 -> B1' -> B1'').

Now, if somebody pulls from the repository between the two rebases and then again after the second rebase, the right thing will happen. The retention of the merge information allows rebased work to be shared. Indeed, one collaborator can do one rebase and another can do the next.

Maybe neither merging nor rebasing is optimal for maintaining a long-lived feature branch.

Is this overkill?

The DAGs resulting from these history-retaining rebases are quite complicated. Does anybody really want to see all this complexity in their main upstream repository?

Probably not, or at least not usually. So I see two alternatives:

  • Retain the rebase history while working on the feature, but discard it when submitting patches upstream. This approach has the advantage that multiple people can collaborate on the feature without stepping on each others feet, but the upstream repository is spared the intermediate complexity. It also doesn't require any changes to the tools, except maybe some simple scripts to help with the iterative merge needed to do a history-retaining rebase.
  • Retain the rebase history permanently, but suppress the display of the pre-rebase commits in most cases. This would require tool changes to allow commits to be marked "uninteresting" and to omit them from most types of operations.

See also my followup posts:

5 comments:

Anonymous said...

Is this overkill indeed? Such a process would allow the publication of any rebase branch (since a new rebase would result into yes another new branch), but is that advisable? See here for the "rules" Linus expect us to follow when it comes to rebase and merges ( http://www.mail-archive.com/dri-devel@lists.sourceforge.net/msg39091.html ).
I like the idea, I am unsure at how it would be used (private branches only?)

Alan said...

I've been trying to understand "rebasing", and how I might be able to use it. So I loved this article, because I think it clarified some things for me. However, rebasing seems unusable to me.

To put it in terms of your example, I'm developing a feature, making small steps B1, B2, etc. At some point I'd like to pull upstream changes (M1, M2, etc.) and rebase my changes onto the end of them. There is little or no natural conflict between my changes and the work on the mainline.

As your article explained, rebasing apparently works by running a series of merge operations, one at a time, to migrate my changes onto the end of the mainline changes. But there's a problem: there's an extremely high probability that change B1 "conflicts with" B2. What I mean is, it's very likely that in the course of working on my feature I might have changed the same line of a file two separate times.

Normally, when B1, B2, etc. are considered as a sequential line of development, this is no problem. And if I were to do a single big merge of all of the "M" changes into the end of my "B" changes, there would be no problem. Indeed, the first step of the rebase works fine: B1 goes in without a hitch. But as soon as the second merge tries to add in B2, it thinks there's a conflict.

This means that if I were to try to use this I would be having to manually resolve many perceived conflicts, when in fact there were no true conflicts. It seems the whole notion of implementing (or thinking about) rebasing in terms of a series of separate merge steps is fundamentally doomed to suffer from this problem.

Since I'm new to all of this, perhaps there's something I'm still missing in my understanding. But, what do you think?

- arb

Michael Haggerty said...

flyboy: It is not correct that "as soon as the second merge tries to add in B2, it thinks there's a conflict." Because B1', the result of the first step of the merge, has the original B1 as one of its ancestors, git knows that the changes in B1 are already included in B1'. Therefore it does a three-way merge of B1' and B2 using B1 as a merge base, and there should not be any conflicts from "B1 conflicting with B2".

(Of course there can still be a conflict at this stage if the changes in B2 conflict with changes introduced by M1+M2, but such conflicts reflect true problems that require human attention.)

Alan said...

Thank you so much for your reply to my comment.

I apologize for my confusion. I was using Mercurial and its rebase extension, rather than git. Somehow in searching google for "rebase" I found your article and neglected to notice that you were talking about git. I guess git and Mercurial work differently in this respect: I certainly observed the behavior I described while trying to use Mercurial.

Or perhaps there's still something I'm missing, because as I say the way Mercurial works seems unusable to me. Perhaps I should post this question on a Mercurial mailing list.

Mr. Frost said...

This scheme, while being overly complex also makes using the more powerful interactive rebase impossible.

If you absolutely _had_ to retain history a better approach would be to 'merge' (not really a merge, just listing a parent) ORIG_HEAD with the final rebase commit. This way the entire original branch is preserved, but there is only a single two-parent commit at the end of the (possibly re-ordered/squashed/ammended) rebase.