tag:blogger.com,1999:blog-67716538746869982112024-02-02T16:45:20.603+01:00SoftwareSwirlThoughts about software developmentMichael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.comBlogger19125tag:blogger.com,1999:blog-6771653874686998211.post-59702015521686736892022-09-12T05:59:00.000+02:002022-09-12T05:59:29.295+02:00Beyond three-way merge
<div class="document" id="beyond-three-way-merge">
<p>In this article, I explain three-way merge, scalar merge, and Git's
"recursive merge" algorithm. I analyze how recursive merge fares in
various simple criss-cross merge scenarios, and find several cases
where it produces questionable results or unnecessary conflicts.</p>
<div class="section" id="background">
<h4>Background</h4>
<p>This article is going to be for the real merge nerds. To understand
it, you will have to understand what a <a class="reference external" href="https://git-scm.com/docs/git-merge-base">merge base</a> is and why it is
significant, and possibly also something about <a class="reference external" href="https://github.com/mhagger/git-imerge">git-imerge</a> and
<a class="reference external" href="https://softwareswirl.blogspot.com/2012/12/the-conflict-frontier-of-nightmare-merge.html">incremental merge diagrams</a>.</p>
<p>A big part of using version control in general, and Git in particular,
is merging branches together. Consider the following Git history,
where letters and numbers denote commits and time progresses from left
to right (→):</p>
<pre class="literal-block">
---0---1---2---3 ← main
\
A---B---C ← branch
</pre>
<p>If the changes that have been made on <tt class="docutils literal">branch</tt> (<tt class="docutils literal">A</tt>, <tt class="docutils literal">B</tt>, and
<tt class="docutils literal">C</tt>) are OK, then the next task is to merge <tt class="docutils literal">branch</tt> into
<tt class="docutils literal">main</tt>. The problem is that other work (<tt class="docutils literal">1</tt>, <tt class="docutils literal">2</tt>, and <tt class="docutils literal">3</tt>) has
been added to <tt class="docutils literal">main</tt> since the work on <tt class="docutils literal">branch</tt> was started. So
the changes on <tt class="docutils literal">main</tt> and the changes on <tt class="docutils literal">branch</tt> have to be
merged together, producing a merge commit <tt class="docutils literal">M</tt>:</p>
<pre class="literal-block">
---0---1---2---3
\ \
\ M
\ /
A---B---C
</pre>
<p>Usually, this is done using the <a class="reference external" href="https://en.wikipedia.org/wiki/Merge_(version_control)#Three-way_merge">three-way merge algorithm</a>.
Three-way merge considers only three states of the project: the tip of
each branch (i.e., <tt class="docutils literal">3</tt> and <tt class="docutils literal">C</tt>), and the "merge base". The merge
base is, roughly speaking, the most recent that was on both of the
branches that are being merged. In this case, it is commit <tt class="docutils literal">0</tt>.</p>
<p>So let's simplify the diagram by omitting intermediate commits:</p>
<pre class="literal-block">
---0 ⋯ 3
⋱ \
C - M
</pre>
<p>and then deform it to a rectangular grid, which will make it easier to
recognize patterns:</p>
<pre class="literal-block">
0 ⋯ 3
⋮ |
C---M
</pre>
<p>Now, time progresses downwards and to the right (↘)</p>
<div class="section" id="scalar-merge">
<h5>Scalar merge</h5>
<p>Let's simplify the situation still further by discussing not a merge
of source trees, but rather a "<a class="reference external" href="https://tonyg.github.io/revctrl.org/ScalarMerge.html">scalar merge</a>". A scalar merge is a
merge of "scalars". Scalars have no internal structure. They are
either identical or different, and if they're different then there is
no way to subdivide them. For example, if every file had only a single
line, then the files would act like scalars.</p>
<p>Note that in real-life merges, files are (approximately) broken into
hunks with differences, and the hunks are treated like scalars. So if
we can get a better understanding of scalar merges, then we can apply
what we learn to actual merges.</p>
</div>
<div class="section" id="three-way-merge">
<h5>Three-way merge</h5>
<p>How does three-way merge work? The rules for three-way merges are as
follows. I will use lower-case letters to indicate the <em>contents</em> of
files.</p>
<ul>
<li><p class="first">No change on either branch:</p>
<pre class="literal-block">
a---a a---a
| | → | |
a---? a---a
</pre>
<p>That is, if the contents didn't change on either branch, then the
result of the merge is the same as the original contents.</p>
</li>
<li><p class="first">Change on one branch (from contents <tt class="docutils literal">a</tt> to contents <tt class="docutils literal">b</tt>) but not
on the other:</p>
<pre class="literal-block">
a---b a---b
| | → | |
a---? a---b
a---a a---a
| | → | |
b---? b---b
</pre>
<p>In this scenario, one of the branches changed the contents from
<tt class="docutils literal">a</tt> to <tt class="docutils literal">b</tt>, but the other branch didn't change the contents. In
this case, the result of the merge is <tt class="docutils literal">b</tt>.</p>
<p>Note that these pairs of diagrams are mirror images of each other,
reflected along the diagonal. A general property of three-way
merge is that it doesn't care which branch is which. From now on I
won't bother showing diagrams that are mirror-images of each
other.</p>
<p>This rule is the bread and butter of three-way merge. Look for it in
the examples below.</p>
</li>
<li><p class="first">The same change on both branches:</p>
<pre class="literal-block">
a---b a---b
| | → | |
b---? b---b
</pre>
<p>In this scenario, both branches changed the contents from <tt class="docutils literal">a</tt> to
<tt class="docutils literal">b</tt>. In this case, three-way merge selects <tt class="docutils literal">b</tt> as the result of
the merge.</p>
<p>This is a slightly more questionable rule than the previous one.
Why was the same change made on two different branches? Often this
scenario arises due to cherry-picking. Even though this rule is
less robust than the previous rule, in real life there would be a
lot more merge conflicts without it.</p>
</li>
</ul>
<p>There is only one scalar merge pattern that three-way merge can't
resolve:</p>
<ul>
<li><p class="first">Different changes on the two branches:</p>
<pre class="literal-block">
a---b a---b
| | → | |
c---? c---X ← conflict
</pre>
<p>In this scenario, one branch changed the contents from <tt class="docutils literal">a</tt> to
<tt class="docutils literal">b</tt>, but the other changed the contents from <tt class="docutils literal">a</tt> to <tt class="docutils literal">c</tt>.
There's no way for three-way merge to resolve this merge, so it
results in a conflict (which I will denote with <tt class="docutils literal">X</tt>). In real
life, conflicts have to be resolved by a human.</p>
</li>
</ul>
</div>
</div>
<div class="section" id="criss-cross-merges">
<h4>Criss-cross merges</h4>
<p>But it is not always the case that a merge has a single merge base.
Consider, for example, a criss-cross merge:</p>
<pre class="literal-block">
---0-------1---2---3---4 ← main
\ \ / \
\ \ / \
\ \ / \
\ ⋅ ? ← desired merge commit
\ / \ /
\ / \ /
\ / \ /
A---B---C---D ← branch
</pre>
<p>In this scanario, part of <tt class="docutils literal">branch</tt> has been merged into <tt class="docutils literal">main</tt>,
but also part of <tt class="docutils literal">main</tt> has been merged into <tt class="docutils literal">branch</tt>. This merge
has two merge bases, <tt class="docutils literal">1</tt> and <tt class="docutils literal">A</tt>. What happens if we now try to
merge the rest of <tt class="docutils literal">branch</tt> (commits <tt class="docutils literal">B</tt> and <tt class="docutils literal">D</tt>) into <tt class="docutils literal">main</tt>?
This is tricky because there are some changes (namely <tt class="docutils literal">1</tt> and <tt class="docutils literal">A</tt>)
that are on both branches, but neither merge base includes both of
those changes.</p>
<p>By default, Git tries to perform a merge like this using "recursive
merge". This algorithm first does a synthetic helper merge <tt class="docutils literal">H</tt>
between the two merge bases, <tt class="docutils literal">1</tt> and <tt class="docutils literal">A</tt>:</p>
<pre class="literal-block">
---0-------1---2---3---4 ← main
\ \ ⋅
\ \ ⋅
\ \ ⋅
\ H
\ / ⋅
\ / ⋅
\ / ⋅
A---B---C---D ← branch
</pre>
<p>Then it computes the three-way merge between <tt class="docutils literal">4</tt> and <tt class="docutils literal">D</tt> using
<tt class="docutils literal">H</tt> as the merge base:</p>
<pre class="literal-block">
---0-------1---2---3---4 ← main
\ \ ⋅ \
\ \ ⋅ \
\ \ ⋅ \
\ H M
\ / ⋅ /
\ / ⋅ /
\ / ⋅ /
A---B---C---D ← branch
</pre>
<p>The helper merge <tt class="docutils literal">H</tt> is useful because it includes the changes from
commits <tt class="docutils literal">1</tt> and <tt class="docutils literal">A</tt>, but doesn't include any commits twice. So it
is a reasonable choice of merge base for the final merge <tt class="docutils literal">M</tt>.</p>
</div>
<div class="section" id="recursive-merge">
<h4>Recursive merge</h4>
<p>Let's simplify the criss-cross merge to its bare minimum:</p>
<pre class="literal-block">
---A-------B---D---F ← main
\ \ / \
\ \ / \
\ \ / \
\ ⋅ ? ← desired merge commit
\ / \ /
\ / \ /
\ / \ /
C---E---G ← branch
</pre>
<p>Now let's rearrange it on a rectangular grid to make the patterns
easier to recognize:</p>
<pre class="literal-block">
A---B---D
| | |
C---⋅---F
| | |
E---G---?
</pre>
<p>By default, Git uses the "<a class="reference external" href="https://git-scm.com/docs/git-merge#Documentation/git-merge.txt-recursive">recursive merge</a>" algorithm to deal with
merges that have multiple merge bases. How does recursive merge deal
with this particular scenario? And, more importantly, does it always
do the right thing?</p>
<p>Recursive merge first tries to create a helper merge <tt class="docutils literal">H</tt> by doing a
three-way merge of <tt class="docutils literal">B</tt> and <tt class="docutils literal">C</tt> using <tt class="docutils literal">A</tt> as the merge base:</p>
<pre class="literal-block">
A---B---D
| | |
C--(H)--F
| | |
E---G---?
</pre>
<p>Since this is done by three-way merge, this helper merge only depends
on the upper-left square:</p>
<pre class="literal-block">
A---B
| |
C---?
</pre>
<p>Then it does a three-way merge between <tt class="docutils literal">F</tt> and <tt class="docutils literal">G</tt>, using <tt class="docutils literal">H</tt> as the
merge base, producing <tt class="docutils literal">M</tt>:</p>
<pre class="literal-block">
A---B---D
| | |
C--(H)--F
| | |
E---G---M
</pre>
<p>This is also a three-way merge, this time involving the lower-right
square:</p>
<pre class="literal-block">
H---F
| |
G---?
</pre>
<p>Each of the merges (<tt class="docutils literal">H</tt> and <tt class="docutils literal">M</tt>) is done using three-way merge,
following the rules from the previous section. <strong>But there is one very
curious and important twist</strong>:</p>
<p>If the helper merge <tt class="docutils literal">H</tt> results in a conflict, it might be that the
recursive merge can succeed anyway. For example, in this scenario:</p>
<pre class="literal-block">
a---b---d a---b---d a---b---d
| | | | | | | | |
c---⋅---f → c--(X)--f → c--(X)--f
| | | | | | | | |
e---f---? e---f---? e---f---f
</pre>
<p>Note that the first merge conflicts, because it doesn't match any of
our three-way merge patterns:</p>
<pre class="literal-block">
a---b
| |
c--(X)
</pre>
<p>But recursive merge continues anyway. Now it tries to perform the
three-way merge from the lower-right square:</p>
<pre class="literal-block">
(X)--f
| |
f---?
</pre>
<p>But this <em>is</em> a recognized three-way merge pattern, and its result
doesn't depend on what "(X)" is. So the resolution is <tt class="docutils literal">f</tt>, because
<tt class="docutils literal">f</tt> is the value at the tip of each branch:</p>
<pre class="literal-block">
(X)--f
| |
f---f
</pre>
<p>The bottom line is that recursive merge always resolves any
criss-cross merge where the two tips have the same contents to those
contents, regardless of what happened earlier:</p>
<pre class="literal-block">
*---*---* *---*---*
| | | | | |
*---⋅---f → *---⋅---f
| | | | | |
*---f---? *---f---f
</pre>
</div>
<div class="section" id="menagerie-of-merges">
<h4>Menagerie of merges</h4>
<p>How well does recursive merge match real-life expectations?</p>
<p>There are some scenarios that recursive merge handles just fine. There
are others where it gives questionable non-conflict results, and still
others where it produces a conflict even though the user's intention
was arguably possible to infer. Let's go through some examples.</p>
<div class="section" id="scenarios-where-recursive-merge-succeeds">
<h5>Scenarios where recursive merge succeeds</h5>
<p>This section presents several scenarios where recursive merge gives a
result that seems to align with what the user probably intended:</p>
<ul>
<li><p class="first">Same change made on both branches, staggered:</p>
<pre class="literal-block">
a---a---b a---a---b
| | | | | |
b---⋅---b → b--(b)--b
| | | | | |
b---b---? b---b---b
</pre>
</li>
<li><p class="first">Different changes made on each branch, staggered:</p>
<pre class="literal-block">
a---b---b a---b---b
| | | | | |
a---⋅---b → a--(b)--b
| | | | | |
c---d---? c---d---d
</pre>
<p>In this case the user has already resolved the combination of
<tt class="docutils literal">b</tt> and <tt class="docutils literal">c</tt>, namely <tt class="docutils literal">d</tt>, so that should be the overall
result.</p>
</li>
<li><p class="first">Change <tt class="docutils literal">b</tt> preferred over change <tt class="docutils literal">d</tt>:</p>
<pre class="literal-block">
a---b---c a---b---c
| | | | | |
a---⋅---c → a--(b)--c
| | | | | |
d---b---? d---b---c
</pre>
<p>In this case the user has already decided that when merging <tt class="docutils literal">b</tt>
and <tt class="docutils literal">d</tt>, the result <tt class="docutils literal">b</tt> is preferred. So the result of
applying change <tt class="docutils literal">c</tt> to <tt class="docutils literal">b</tt> should be <tt class="docutils literal">c</tt>, just as it is in
the first row.</p>
</li>
<li><p class="first">Same resolution on both branches. Note that in this case the
helper merge conflicts, but the final merge nevertheless works out
as hoped because it matches the third three-way merge pattern:</p>
<pre class="literal-block">
a---b---b a---b---b
| | | | | |
c---⋅---d → c--(X)--d
| | | | | |
c---d---? c---d---d
</pre>
</li>
<li><p class="first">Same final state on both branches:</p>
<pre class="literal-block">
a---b---d a---b---d
| | | | | |
c---⋅---f → c--(X)--f
| | | | | |
e---f---? e---f---f
</pre>
</li>
</ul>
</div>
<div class="section" id="scenarios-where-recursive-merge-gives-questionable-results">
<h5>Scenarios where recursive merge gives questionable results</h5>
<p>There are some scenarios where recursive merge gives an answer (i.e.,
not a conflict), but it is not clear that the answer is the correct
one:</p>
<ul>
<li><p class="first">Change <tt class="docutils literal">b</tt> made then reverted on one branch; kept on the other
branch. Note that there are two patterns in this family:</p>
<pre class="literal-block">
a---b---a a---b---a
| | | | | |
a---⋅---a → a--(b)--a
| | | | | |
b---b---? b---b---a
a---b---a a---b---a
| | | | | |
b---⋅---b → b--(b)--b
| | | | | |
b---b---? b---b---b
</pre>
<p>Worryingly, recursive merge gives different answers for the two
scenarios, even though they are logically pretty similar. In both
of these scenarios, the pre-existing merges are consistent with
three-way merges. Indeed, they might have been done automatically
without any user attention. But they each use the following, less
reliable three-way merge rule for one of those pre-existing
merges:</p>
<pre class="literal-block">
a---b
| |
* *
| |
b---b
</pre>
<p>There are two plausible narratives here:</p>
<ul class="simple">
<li>Maybe change <tt class="docutils literal">b</tt> was implemented then reverted on one branch,
but then it was independently implemented on the other branch,
possibly for another reason. In this case, it would be correct
to keep <tt class="docutils literal">b</tt> as the merge resolution.</li>
<li>Maybe change <tt class="docutils literal">b</tt> was cherry-picked from one branch to the other,
then on one branch it was discovered that the change was bad, so
it was reverted. In that case, one would want the revert back to
<tt class="docutils literal">a</tt> to be used as the merge resolution.</li>
</ul>
<p>Since the correct result depends on facts that can't be known to
the merge algorithm, it would probably be better to treat this as
a merge conflict and ask the user to intervene.</p>
</li>
<li><p class="first">Same change made then reverted on both branches:</p>
<pre class="literal-block">
a---b---a a---b---a
| | | | | |
b---⋅---b → b--(b)--b
| | | | | |
a---b---? a---b---b
</pre>
<p>In this case, arguably since change <tt class="docutils literal">b</tt> was made then reverted on
each of the branches, then the user merges are not needed, and the
final result should be <tt class="docutils literal">a</tt>. An alternative interpretation (which I
think is less persuasive) is that the pre-done merges in both
cases favored contents <tt class="docutils literal">b</tt> over <tt class="docutils literal">a</tt>, so the final result should
also favor <tt class="docutils literal">b</tt>. I think that the result should either be a
conflict or <tt class="docutils literal">a</tt>, whereas recursive merge gives <tt class="docutils literal">b</tt>.</p>
</li>
</ul>
</div>
<div class="section" id="scenarios-where-recursive-merge-unnecessarily-produces-conflicts">
<h5>Scenarios where recursive merge unnecessarily (?) produces conflicts</h5>
<p>There are also cases where the intention of the user seems relatively
clear, but the recursive merge strategy produces conflicts. In some of
these cases, it could be that git could do better than the recursive
merge strategy, requiring the user to resolve fewer conflicts
manually.</p>
<p>Most of these scenarios involve reverted changes:</p>
<ul>
<li><p class="first">Change <tt class="docutils literal">b</tt> was made and reverted on one branch; a different change
was made on the other branch:</p>
<pre class="literal-block">
a---b---a a---b---a
| | | | | |
c---⋅---c → c--(X)--c
| | | | | |
c---d---? c---d---X
</pre>
<p>Arguably, since <tt class="docutils literal">b</tt> was reverted, merge <tt class="docutils literal">d</tt> is irrelevant: it only
shows how to merge <tt class="docutils literal">b</tt> and <tt class="docutils literal">c</tt>, but we don't want <tt class="docutils literal">b</tt> in the final
result. So the correct resolution in this scenario should be <tt class="docutils literal">c</tt>:</p>
<pre class="literal-block">
a---b---a a---b---a
| | | | | |
c---⋅---c → c---⋅---c
| | | | | |
c---d---? c---d---c
</pre>
</li>
<li><p class="first">Different changes made and reverted on each branch:</p>
<pre class="literal-block">
a---b---a a---b---a
| | | | | |
c---⋅---c → c--(X)--c
| | | | | |
a---b---? a---b---X
</pre>
<p>This scenario is similar to the previous case, but now <em>both</em>
user-supplied merges are irrelevant because they merge changes <tt class="docutils literal">b</tt>
and <tt class="docutils literal">c</tt> respectively, neither of which we want in the final
result. Arguably, the result should be <tt class="docutils literal">a</tt>:</p>
<pre class="literal-block">
a---b---a a---b---a
| | | | | |
c---⋅---c → c---⋅---c
| | | | | |
a---b---? a---b---a
</pre>
</li>
<li><p class="first">Change <tt class="docutils literal">b</tt> made and reverted on one branch, but it had no effect on
the other branch:</p>
<pre class="literal-block">
a---b---a a---b---a
| | | | | |
c---⋅---c → c--(X)--c
| | | | | |
d---d---? d---d---X
</pre>
<p>In this scenario, change <tt class="docutils literal">b</tt> was made then reverted on one branch,
meaning that it should probably have no effect on the result.
Moreover, when it was merged with the changes <tt class="docutils literal">c</tt> and <tt class="docutils literal">d</tt> made on
the other branch, it was also ignored in favor of the other
branch's tip <tt class="docutils literal">d</tt>. So arguably, the correct resolution of this
merge should be <tt class="docutils literal">d</tt>:</p>
<pre class="literal-block">
a---b---a a---b---a
| | | | | |
c---⋅---c → c---⋅---c
| | | | | |
d---d---? d---d---d
</pre>
</li>
<li><p class="first">Change <tt class="docutils literal">b</tt> made and reverted on one branch:</p>
<pre class="literal-block">
a---b---a a---b---a
| | | | | |
c---⋅---c → c--(X)--c
| | | | | |
d---e---? d---e---X
</pre>
<p>This is a slightly less clear extension of the previous case. In
this scenario, change <tt class="docutils literal">b</tt> was made and reverted on one branch.
Therefore, it, and merge <tt class="docutils literal">e</tt>, are both irrelevant to the final
result, and the correct resolution should arguably be <tt class="docutils literal">d</tt>:</p>
<pre class="literal-block">
a---b---a a---b---a
| | | | | |
c---⋅---c → c---⋅---c
| | | | | |
d---e---? d---e---d
</pre>
</li>
</ul>
<div class="section" id="the-example-that-motivated-this-study">
<h6>The example that motivated this study</h6>
<p>There is one more interesting scenario, where recursive merge produces
a conflict, but the correct merge resolution is pretty clear. This
does not involve any reverts. This is the example that motivated this
whole investigation:</p>
<ul>
<li><p class="first">The helper merge conflicts, but the exact same conflict has
already been resolved by the user in one of the pre-existing
merges:</p>
<pre class="literal-block">
a---b---d a---b---d
| | | | | |
c---⋅---e → c--(X)--e
| | | | | |
c---f---? c---f---X
</pre>
<p>In this case, recursive merge is unable to compute the helper
merge because of a conflict between the change made in <tt class="docutils literal">b</tt> and
the change made in <tt class="docutils literal">c</tt>. But recursive merge doesn't notice that
that conflict has already been resolved by the user (twice!) in
the form of commits <tt class="docutils literal">f</tt> and <tt class="docutils literal">e</tt>. Using this fact, we can fill
in the middle blank:</p>
<pre class="literal-block">
a---b---d a---b---d
| | | | | |
c---⋅---e → c--(f)--e
| | | | | |
c---f---? c---f---?
</pre>
<p>Now note that merge <tt class="docutils literal">e</tt> already includes all of the changes from
both branches: <tt class="docutils literal">b</tt>, <tt class="docutils literal">c</tt>, and also <tt class="docutils literal">d</tt>. But now the lower
right corner can be done by three-way merge:</p>
<pre class="literal-block">
a---b---d a---b---d
| | | | | |
c--(f)--e → c--(f)--e
| | | | | |
c---f---? c---f---e
</pre>
<p>Therefore, <tt class="docutils literal">e</tt> can be taken as the correct resolution in this
scenario, and no conflict need be resolved by the user.</p>
</li>
</ul>
</div>
</div>
</div>
<div class="section" id="seven-way-merge">
<h4>Seven-way merge</h4>
<p>I think that these criss-cross scenarios could theoretically be
handled better by a new "Seven-way merge algorithm" that takes into
account the patterns of seven commits that can appear in a merge with
two merge bases. It would match those commits to a table of seven-way
merge rules, and return the appropriate result (which will sometimes
be a conflict). This algorithm could be applied one hunk at a time. I
think that there is a good chance that it would result in fewer
incorrect merges and fewer unnecessary conflicts.</p>
<p>One big question is, would this approach generalize to tackling even
more complicated merges? For example, what if the merge has even more
than two merge bases? Can seven-way merges be composed?</p>
</div>
<div class="section" id="how-often-does-it-come-up">
<h4>How often does it come up?</h4>
<p>Seven-way merge would probably only be useful for merges with multiple
merge bases. Do these come up often enough to make the effort
worthwhile?</p>
<p>Maybe not. I sampled a few largish repositories (see below), and only
on the order of 1% of merges in those repos have multiple merge bases.
And not even all of those merges are criss-cross merges (some have
three or more merge bases).</p>
<p>On the other hand, merge conflicts are a very frustrating aspect of
working with Git, and almost everybody has experienced merge conflicts
that were too difficult to resolve. Moreover, when a
multiple-merge-base merge fails, it can fail in a most baffling way,
with conflict markers nested inside of other conflict markers, that is
almost impossible to make sense out of. 7-way merge might be able to
produce more understandable conflicts in some of these cases. Finally,
some workflows generate more criss-cross merges than we might be
accustomed to (I've heard of a very large repository in which more
like 10% of merges have multiple merge bases). Is it possible that
smarter handling of criss-cross merges could remove some of this
frustration?</p>
<p>If we decide to pursue this, I would suggest the following as next
steps:</p>
<ol class="arabic simple">
<li>Choose some public repositories with various workflows (i.e.,
don't just pick the best-organized projects). Perhaps include both
well-maintained projects but also "run of the mill" projects.
Locate merges in the history that had more than one merge base.
Check if they had conflicts. Also check whether there were diff
hunks that didn't conflict, but where users changed the automatic
merge resolution (sometimes these will be incorrect merges).</li>
<li>Implement smarter handling of criss-cross merges. Instead of trying
to find a clever algorithm that does the right thing in all
situations, I think that it would be better to recognize some of
the above patterns using a table that also tells how to handle the
corresponding scenario. At first this can be a hacky, inefficient
implementation.</li>
<li>Apply the standard recursive merge algorithm and the new
criss-cross merge algorithm to the sample of real-life merges.
Compare the numbers of conflicts and the numbers of seemingly
incorrect merges. Decide which of the patterns should be enabled
and which should be treated as conflicts.</li>
<li>If the results are encouraging, implement whatever is decided on
in Git, so that users get a little bit better help with their merges.</li>
</ol>
</div>
<div class="section" id="appendix-numbers-of-merge-bases">
<h4>Appendix: numbers of merge bases</h4>
<p>What fraction of merge commits have multiple merge bases? I ran the
following to count the number of non-octopus merges with each number
of merge bases in the project's default branch:</p>
<pre class="literal-block">
for f in $(git rev-list --min-parents=2 --max-parents=2 HEAD)
do
echo "$f" $(git merge-base --all $f^1 $f^2)
done | awk '{print NF-1}' | sort -n | uniq -c
</pre>
<p>I ran the above command in <tt class="docutils literal">git/git</tt>, <tt class="docutils literal">torvalds/linux</tt>, and a large
proprietary repository.</p>
<p>Note that 7-way merge as strictly described above would only apply to
merges with exactly two merge bases.</p>
<div class="section" id="git-git">
<h5><tt class="docutils literal">git/git</tt></h5>
<p>In <tt class="docutils literal">git/git</tt>, 1.39% of merges have exactly two merge bases:</p>
<table border="1" class="colwidths-given docutils">
<colgroup>
<col width="50%" />
<col width="50%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Merge bases</th>
<th class="head">Count</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>0</td>
<td>6</td>
</tr>
<tr><td>1</td>
<td>15416</td>
</tr>
<tr><td>2</td>
<td>223</td>
</tr>
<tr><td>3</td>
<td>69</td>
</tr>
<tr><td>4</td>
<td>40</td>
</tr>
<tr><td>5</td>
<td>42</td>
</tr>
<tr><td>6</td>
<td>26</td>
</tr>
<tr><td>7</td>
<td>26</td>
</tr>
<tr><td>8</td>
<td>17</td>
</tr>
<tr><td>9</td>
<td>17</td>
</tr>
<tr><td>10</td>
<td>20</td>
</tr>
<tr><td>11</td>
<td>15</td>
</tr>
<tr><td>12</td>
<td>13</td>
</tr>
<tr><td>13</td>
<td>7</td>
</tr>
<tr><td>14</td>
<td>10</td>
</tr>
<tr><td>15</td>
<td>8</td>
</tr>
<tr><td>16</td>
<td>5</td>
</tr>
<tr><td>17</td>
<td>3</td>
</tr>
<tr><td>18</td>
<td>1</td>
</tr>
<tr><td>19</td>
<td>4</td>
</tr>
<tr><td>20</td>
<td>6</td>
</tr>
<tr><td>21</td>
<td>2</td>
</tr>
<tr><td>22</td>
<td>6</td>
</tr>
<tr><td>23</td>
<td>6</td>
</tr>
<tr><td>24</td>
<td>2</td>
</tr>
<tr><td>26</td>
<td>3</td>
</tr>
<tr><td>28</td>
<td>3</td>
</tr>
<tr><td>29</td>
<td>1</td>
</tr>
<tr><td>34</td>
<td>1</td>
</tr>
<tr><td>38</td>
<td>1</td>
</tr>
<tr><td>40</td>
<td>1</td>
</tr>
<tr><td>42</td>
<td>1</td>
</tr>
<tr><td>47</td>
<td>1</td>
</tr>
<tr><td>65</td>
<td>1</td>
</tr>
</tbody>
</table>
</div>
<div class="section" id="torvalds-linux">
<h5><tt class="docutils literal">torvalds/linux</tt></h5>
<p>In <tt class="docutils literal">torvalds/linux</tt>, 2.51% of merges have exactly two merge bases:</p>
<table border="1" class="colwidths-given docutils">
<colgroup>
<col width="50%" />
<col width="50%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Merge bases</th>
<th class="head">Count</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>0</td>
<td>3</td>
</tr>
<tr><td>1</td>
<td>63182</td>
</tr>
<tr><td>2</td>
<td>1634</td>
</tr>
<tr><td>3</td>
<td>212</td>
</tr>
<tr><td>4</td>
<td>81</td>
</tr>
<tr><td>5</td>
<td>37</td>
</tr>
<tr><td>6</td>
<td>21</td>
</tr>
<tr><td>7</td>
<td>17</td>
</tr>
<tr><td>8</td>
<td>9</td>
</tr>
<tr><td>9</td>
<td>5</td>
</tr>
<tr><td>10</td>
<td>7</td>
</tr>
<tr><td>12</td>
<td>2</td>
</tr>
<tr><td>13</td>
<td>1</td>
</tr>
<tr><td>29</td>
<td>1</td>
</tr>
<tr><td>52</td>
<td>1</td>
</tr>
<tr><td>53</td>
<td>1</td>
</tr>
<tr><td>55</td>
<td>1</td>
</tr>
<tr><td>78</td>
<td>1</td>
</tr>
<tr><td>85</td>
<td>1</td>
</tr>
<tr><td>93</td>
<td>2</td>
</tr>
</tbody>
</table>
</div>
<div class="section" id="example-large-proprietary-repository">
<h5>Example large proprietary repository</h5>
<p>In this repository, 0.88% of merges have exactly two merge bases:</p>
<table border="1" class="colwidths-given docutils">
<colgroup>
<col width="50%" />
<col width="50%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Merge bases</th>
<th class="head">Count</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>0</td>
<td>4</td>
</tr>
<tr><td>1</td>
<td>330375</td>
</tr>
<tr><td>2</td>
<td>2948</td>
</tr>
<tr><td>3</td>
<td>29</td>
</tr>
<tr><td>4</td>
<td>5</td>
</tr>
<tr><td>7</td>
<td>1</td>
</tr>
<tr><td>13</td>
<td>1</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com0tag:blogger.com,1999:blog-6771653874686998211.post-51989727680001174552014-03-03T06:41:00.001+01:002014-03-03T06:43:28.923+01:00My secret tip for GSoC success
<div class="document" id="my-secret-tip-for-gsoc-success">
<p>If you are applying for Google Summer of Code, there is one thing you
can do to make yourself beloved in your project (and isn't bad advice
in general, either):</p>
<blockquote>
<strong>DON'T WASTE OTHER PEOPLE'S TIME</strong></blockquote>
<p>I don't say this in a cantankerous, "get off of my lawn" way. I mean
it literally.</p>
<div class="section" id="time">
<h4>Time</h4>
<p>It is important to understand the motivations of a typical open-source
developer. Most people contribute to open-source in their free time,
out of joy. The time we choose to work on open source is time we
could otherwise spend hanging out with friends and family, or
programming for money, or reading books, or watching television, or
just sleeping. We steal time from the rest of our lives because we
love to code and we want to make the world a little bit better by
creating great software <a class="footnote-reference" href="#id5" id="id1">[1]</a>.</p>
<p>In the open-source world, <strong>time is the coin of the realm</strong>.</p>
<p>Now evaluate GSoC from an OS contributor's point of view:</p>
<ul class="simple">
<li>Google will pay a student to work full time on my project. <em>Great!</em></li>
<li>Maybe the student will turn into a regular contributor, and stick
around after GSoC is over. <em>Awesome!</em></li>
<li>Maybe I can help the student become a better developer. <em>Sounds
good!</em></li>
</ul>
<p>but on the other hand...</p>
<ul class="simple">
<li>It will probably take as much time to mentor the student as it would
take me to implement the feature myself.</li>
<li>I will have to spend that time mentoring (which I may or may not
enjoy), rather than coding (which I love).</li>
<li>The student might turn out to be a dud, who sucks up lots of my time
without producing anything.</li>
</ul>
<p>Note that all of the downsides of mentoring a student have to do with
its cost in <strong>time</strong>. The way to be an awesome GSoC student is to
<strong>optimize the way you use your mentor's time</strong>.</p>
</div>
<div class="section" id="recipe-for-success">
<h4>Recipe for success</h4>
<p>That brings us back to the advice of this article:</p>
<blockquote>
<strong>DON'T WASTE OTHER PEOPLE'S TIME</strong></blockquote>
<p>Mentors will generally be happy to give you the help you need <a class="footnote-reference" href="#id6" id="id2">[2]</a>.
They will point you to the documentation and code that you need to
study. They will be delighted to discuss large-scale issues of code
design (e.g., where do your changes fit in to the larger project?)
They will be willing to review your code. All of these things are
good uses of your mentor's time.</p>
<p>Your side of the bargain is to use the mentor's time as efficiently as
possible. You need to extract the most benefit out of every
interaction with your mentor.</p>
<ul class="simple">
<li>If your mentor points you to some documentation, <em>read it</em>.
<strong>Memorize it!</strong> Read other documentation that is nearby. Don't ever,
<em>ever</em>, ask another question that is covered by that piece of
documentation. If the documentation is unclear, it is OK to ask for a
clarification, but then <em>fix the documentation</em> so that your mentor
never has to answer the same question again.</li>
<li>If your mentor points you to some code, <em>study it</em>. <strong>Understand
it.</strong> Figure out not just <em>how</em> but also <em>why</em> it does what it does.
Read the other code that calls it, and the other code that it calls.
Run <tt class="docutils literal">git blame</tt> on the code and read the commit messages from when
the code was implemented. If you think you have found a problem or
bug in the code, <em>fix it</em> if you can. If you can't fix it, then at
least <em>investigate it thoroughly</em> and write a complete bug report to
the mailing list explaining what you think is wrong and where you
got stuck. Add a test to the project's test suite that demonstrates
the problem.</li>
<li>Don't waste your mentor's time on trivia. Follow your project's
coding guidelines about submitting patches, code formatting, etc.
<strong>Make your code look like existing code!</strong> If your mentor points
out an error, no matter how trivial you might think it to be, <em>fix
the error and never make the same error again</em> <a class="footnote-reference" href="#id7" id="id3">[3]</a>.</li>
<li>Help your mentor save time in other areas. <strong>Get involved in the
project like any other developer.</strong> Help review other people's
patches. Help answer user questions on the mailing list or on IRC.
Write documentation and tests. Improve the project's website.</li>
</ul>
<p>None of this is to say that you shouldn't ask for help when you need
it. For example, it is a <em>good idea</em> to describe your plans and ask
for feedback before you start coding--such high-level interactions are
a very effective use of your mentor's time and will actually <em>save
time</em> later. But remember that you have a finite budget of your
mentor's time, so <strong>use it wisely!</strong></p>
</div>
<div class="section" id="summary">
<h4>Summary</h4>
<p>Always keep in mind your role as a GSoC student: <strong>you are an
apprentice</strong> <a class="footnote-reference" href="#id8" id="id4">[4]</a>. This means you
will be expected to spend some time running errands and sweeping
floors. In return, you will have a great opportunity to learn from
the world's best developers on a one-on-one basis, get invaluable
experience, and have lots of time doing what you love--writing great
code!</p>
</div>
<div class="section" id="notes">
<h4>Notes</h4>
<table class="docutils footnote" frame="void" id="id5" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>By the way, if these are not the reasons that you want to
participate in GSoC, then maybe you are in the wrong place to
begin with.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id6" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>Indeed, you have the right to <strong>politely</strong> insist that your
mentor gives you the time that you need. After all, your
mentor has voluntarily signed up with the GSoC to be a mentor,
and mentoring a student is, well, the basic obligation.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id7" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id3">[3]</a></td><td>If you think that the project should actually move its curly
braces to the same line as the if statement (or vice versa),
<strong>you are wrong</strong>. Seasoned developers know that there are
only two important things about code formatting: <em>all code
within a project must be formatted the same way</em>, and <em>all
debates about code formatting are a waste of everybody's time</em>.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id8" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id4">[4]</a></td><td>Though ironically, you might be the only person in your project
actually getting paid for your time!</td></tr>
</tbody>
</table>
</div>
</div>
Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com0tag:blogger.com,1999:blog-6771653874686998211.post-87469787299416872372013-05-15T05:33:00.000+02:002013-05-15T05:33:42.744+02:00"Obsolete markers" in Mercurial<span style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;">I just heard on <a href="http://episodes.gitminutes.com/2013/05/gitminutes-07-martin-geisler-on.html">GitMinutes #07</a> that Mercurial will soon have <a href="http://hg-lab.logilab.org/doc/mutable-history/html/obs-concept.html">a feature</a> that records the relationship between old and new versions of changesets when a changeset is rewritten. This is a topic that I <a href="http://softwareswirl.blogspot.com/2009/08/upstream-rebase-just-works-if-history.html">find very interesting</a>, so I did some reading.</span><br style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;" /><br style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;" /><span style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;">Each "obsolete marker" is a many-to-one record stating that changesets [A1,A2,...] obsolete changeset A. This information can be shared between repositories. The obsolete changeset is retained as long as any non-obsolete changesets depend on it, after which it can be discarded. (Though it wasn't clear to me how repository 1 can know whether repository 2 contains commits based on a commit that has been marked obsolete in repository 1; perhaps human-to-human communication is required?)</span><br style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;" /><br style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;" /><span style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;">Unlike regular parent relationships, these markers have <i>no</i> implications for commit ancestry. That is, the fact that changeset A' obsoletes changeset A does not imply that commit A' includes all of the ancestors of commit A. </span><span style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;">Because of this, obsolete markers can be used even if changesets are reordered, split, or squashed--actions that cannot be represented in an ancestry-only DAG.</span><br style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;" /><br style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;" /><span style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;">If I understand correctly, obsolete markers should really be thought of as relationships between changesets (i.e., the deltas introduced by commits) as opposed to the commits (and tree states) themselves. It appears that they are intended mostly for transitional use before the history has solidified, and not as a permanent part of the record.</span><br style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;" /><br style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;" /><span style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;">Because obsolete markers introduce a new type of "history", they complicate the history model. There are new types of conflicts that can arise because of them, and such conflicts have to be resolved in new ways.</span><br style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;" /><br style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;" /><span style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;">I'm very curious how obsolete markers pan out in practice. It's definitely a feature that is worth keeping an eye on.</span><br />
<span style="background-color: white; font-family: arial, sans-serif; font-size: 12.727272033691406px; line-height: 16.363636016845703px;"><br /></span>Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com0tag:blogger.com,1999:blog-6771653874686998211.post-69172351643997358832013-05-08T14:11:00.000+02:002013-07-15T09:43:07.741+02:00git-imerge: a practical introduction<div class="document" id="git-imerge-a-practical-introduction">
<strong>git-merge on steroids / git-rebase for the masses</strong><br />
This article is a practical introduction to incremental merging using
<tt class="docutils literal"><span class="pre">git-imerge</span></tt>: why you need it, what it does, and how to use it.<br />
<em>(Don't fear; this article is not as long as it looks. Most of it
consists of big ASCII-art illustrations and example program output.)</em><br />
<div class="section" id="why-incremental-merge">
<h4>
Why incremental merge?</h4>
The traditional alternatives for combining two branches both have
serious problems.<br />
<div class="section" id="git-merge-pain">
<h5>
<tt class="docutils literal">git merge</tt> pain</h5>
<ul>
<li><div class="first">
You need to resolve <strong>one big conflict</strong> that mixes up a lot of
little changes on both sides of the merge. (Resolving big conflicts
is <strong>hard</strong>!)</div>
</li>
<li><div class="first">
Merging is <strong>all-or-nothing</strong>:</div>
<ul>
<li><div class="first">
There is <strong>no way to save</strong> a partly-done merge, so</div>
<ul class="simple">
<li>You can't record your progress.</li>
<li>You can't switch to another branch temporarily.</li>
<li>If you make a mistake, you cannot revert only part of the merge.</li>
</ul>
If you cannot resolve the whole conflict, there is nothing to do
but start over.<br />
</li>
<li><div class="first">
There is <strong>no way to test</strong> a partly-done merge--the code won't
even build until the conflict is completely resolved.</div>
</li>
</ul>
</li>
<li><div class="first">
It is <strong>difficult to collaborate</strong> on a merge with a colleague.</div>
</li>
</ul>
</div>
<div class="section" id="git-rebase-pain">
<h5>
<tt class="docutils literal">git rebase</tt> pain</h5>
<ul class="simple">
<li>It is very problematic to rebase work that has been published.<ul>
<li>Rebasing is <strong>unfriendly to collaboration</strong>.</li>
</ul>
</li>
<li>You have to reconcile each of the branch commits with <em>all</em> of the
changes on master.</li>
<li>Rebasing is <strong>all-or-nothing</strong>:<ul>
<li>It is awkward to interrupt a rebase while it is in progress:
you're not even on a branch, and rebase doesn't preserve the
relationship between old commits and new.</li>
</ul>
</li>
<li>Rebasing <strong>discards history</strong> (namely the old version of the
branch).</li>
<li>Rebasing often requires similar conflicts to be <strong>resolved multiple
times</strong>.</li>
</ul>
</div>
</div>
<div class="section" id="incremental-merge">
<h4>
Incremental merge</h4>
<tt class="docutils literal"><span class="pre">git-imerge</span></tt> implements a new merging method, <strong>incremental merge</strong>,
that reduces the pain of merging to its essential minimum.
<tt class="docutils literal"><span class="pre">git-imerge</span></tt>:<br />
<ul class="simple">
<li>Presents conflicts pairwise: <strong>you only ever need to resolve one
commit from each branch at a single time</strong>.<ul>
<li>Small conflicts (<em>much</em> easier to resolve than large conflicts).</li>
<li>You can view commit messages and individual diffs to see what each
commit was trying to do.</li>
</ul>
</li>
<li><strong>Records all intermediate merges</strong> with their correct ancestry, as
commits in your repository. An incremental merge that is in
progress<ul>
<li>...can be <strong>interrupted</strong>.</li>
<li>...can be <strong>pushed to a server</strong>.</li>
<li>...can be <strong>pulled by a colleague</strong>, worked on, and pushed again.</li>
</ul>
</li>
<li><strong>Never shows the same conflict twice</strong>. Once a conflict has been
resolved, it is stored in the DAG to make future merges easier.</li>
<li>Lets you <strong>test every intermediate state</strong>. If there is a problem,
you can use "git bisect" to find the exact pairwise merge that was
faulty. You can redo that merge and continue the incremental merge
from there (retaining earlier pairwise merges).</li>
<li>Is <strong>largely automated</strong> and <strong>surprisingly efficient</strong>.</li>
<li>Allows the final merge to be <strong>simplified for the permanent
record</strong>, omitting the intermediate results.</li>
</ul>
</div>
<div class="section" id="the-basic-idea">
<h4>
The basic idea</h4>
Suppose you want to merge "branch" into "master":<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
\
A - B - C - D - E - F - G - H - I ← branch
</pre>
First draw the diagram a bit differently:<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
|
A
|
B
|
C
|
D
|
E
|
F
|
G
|
H
|
I
↑
branch
</pre>
Now start filling it in, merging one pair at a time:<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| |
A - A1
|
B
|
C
|
D
|
E
|
F
|
G
|
H
|
I
↑
branch
</pre>
"A1" is a merge between a single commit ("1") on master and a single
commit ("A") on branch.<br />
Continue...<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| |
A - A1
| |
B - B1
|
C
|
D
|
E
|
F
|
G
|
H
|
I
↑
branch
</pre>
"B1" is a merge between "A1" and "B". Since "A1" already includes the
changes from "1" and "A", this merge only has to add the changes from
commit "B", which are hopefully limited in scope. If there is a
conflict, it is hopefully small.<br />
Most of these pairwise merges will not conflict, and <tt class="docutils literal"><span class="pre">git-imerge</span></tt>
will let do them for you automatically until it finds a conflict:<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | |
A - A1 - A2
| | |
B - B1 - B2
| | |
C - C1 - C2
| | |
D - D1 - D2
| | |
E - E1 - E2
| |
F - F1 X
| |
G - G1
| |
H - H1
| |
I - I1
↑
branch
</pre>
Now you have to help resolve the conflict that arose at "X" when
merging "E2" and "F1". But Git knows that these two commits share
"E1" as a common ancestor ("merge base"), so all you have to do is
resolve the change originally made in master commit "2" with the
change that was made in branch commit "F":<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | |
A - A1 - A2
| | |
B - B1 - B2
| | |
C - C1 - C2
| | |
D - D1 - D2
| | |
E - E1 - E2
| | |
F - F1 - F2
| |
G - G1
| |
H - H1
| |
I - I1
↑
branch
</pre>
Continue in this manner until the diagram is complete:<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8 - A9 - A10 - A11
| | | | | | | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8 - B9 - B10 - B11
| | | | | | | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6 - C7 - C8 - C9 - C10 - C11
| | | | | | | | | | | |
D - D1 - D2 - D3 - D4 - D5 - D6 - D7 - D8 - D9 - D10 - D11
| | | | | | | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 - E7 - E8 - E9 - E10 - E11
| | | | | | | | | | | |
F - F1 - F2 - F3 - F4 - F5 - F6 - F7 - F8 - F9 - F10 - F11
| | | | | | | | | | | |
G - G1 - G2 - G3 - G4 - G5 - G6 - G7 - G8 - G9 - G10 - G11
| | | | | | | | | | | |
H - H1 - H2 - H3 - H4 - H5 - H6 - H7 - H8 - H9 - H10 - H11
| | | | | | | | | | | |
I - I1 - I2 - I3 - I4 - I5 - I6 - I7 - I8 - I9 - I10 - I11
↑
branch
</pre>
Done!</div>
<div class="section" id="simplifying-the-results">
<h4>
Simplifying the results</h4>
A completed incremental merge contains all of the information you
could possibly want to know about combining two branches; all you need
to do now is discard any information that you don't want in your
permanent record. For example,<br />
<ul>
<li><div class="first">
Commit "I11" is the simple merge of "branch" into "master":</div>
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11
| |
A |
| |
B |
| |
C |
| |
D |
| |
E |
| |
F |
| |
G |
| |
H |
| |
I ---------------------------------------------------- I11' ← master
↑
branch
</pre>
(Usually such a diagram is drawn like so:<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 - I11' ← master
\ /
A -- B -- C --- D --- E --- F --- G --- H --- I ← branch
</pre>
.)<br />
</li>
<li><div class="first">
The rightmost column is the rebase of "branch" onto "master":</div>
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
|
A11'
|
B11'
|
C11'
|
D11'
|
E11'
|
F11'
|
G11'
|
H11'
|
I11' ← branch
</pre>
</li>
<li><div class="first">
The bottommost row is the rebase of "master" onto "branch":</div>
<pre class="literal-block">o - 0
|
A
|
B
|
C
|
D
|
E
|
F
|
G
|
H
|
I - I1'- I2'- I3'- I4'- I5'- I6'- I7'- I8'- I9'- I10'- I11' ← master
↑
branch
</pre>
</li>
<li><div class="first">
If you have already published branch, consider converting this into
a <a class="reference external" href="http://softwareswirl.blogspot.com/2009/04/truce-in-merge-vs-rebase-war.html">rebase with history</a>:</div>
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| |
A ---------------------------------------------------- A11'
| |
B ---------------------------------------------------- B11'
| |
C ---------------------------------------------------- C11'
| |
D ---------------------------------------------------- D11'
| |
E ---------------------------------------------------- E11'
| |
F ---------------------------------------------------- F11'
| |
G ---------------------------------------------------- G11'
| |
H ---------------------------------------------------- H11'
| |
I ---------------------------------------------------- I11'
↑
branch
</pre>
Rebase-with-history is a useful hybrid between rebasing and merging,
having the advantages that it retains both the old and the new
versions of the rebased commits, and that it is a valid way of
rebasing already-published history without the known problems of a
simple rebase.<br />
</li>
</ul>
</div>
<div class="section" id="efficiency">
<h4>
Efficiency</h4>
In most cases <tt class="docutils literal"><span class="pre">git-imerge</span></tt> does not have to construct the full
incremental merge. It uses an <a class="reference external" href="http://softwareswirl.blogspot.com/2012/12/mapping-merge-conflict-frontier.html">efficient algorithm</a>, based on
bisection, for filling in large blocks of the incremental merge
quickly and homing in on the conflicts. A typical in-progress merge
might look like this:<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | | | | | | |
A - -- -- -- -- -- A6 - -- A8 - A9 - A10 - A11
| | | | | | | | |
B - -- -- -- -- -- B6 - B7 - B8 X
| | | | | | |
C - -- -- -- -- -- C6 X
| | | | | | |
D - -- -- -- -- -- D6
| | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6
| |
F - F1 X
| |
G - G1
| |
H - H1
| |
I - I1
↑
branch
</pre>
where the "X"s marks pairwise merges that are known to conflict (i.e.,
they need user attention), and the cross-hatched rectangles are merges
that could be skipped because the merges on the boundaries of the
rectangles were all successful.</div>
<div class="section" id="git-imerge">
<h4>
git-imerge</h4>
<a class="reference external" href="https://github.com/mhagger/git-imerge">git-imerge</a> implements incremental merging for Git. This section
shows how to use <cite>git-imerge</cite> to do a simple merge.<br />
<ul>
<li><div class="first">
You begin much like with <tt class="docutils literal">git merge</tt>: check out the destination
branch and then tell <tt class="docutils literal">git imerge</tt> what branch you want to merge
into it:</div>
<pre class="literal-block">$ git checkout master
$ git imerge start --name=merge-branch --first-parent branch
</pre>
Multiple imerges can be in progress simultaneously, so you have to
give each one a name, in this case "merge-branch".<br />
</li>
<li><div class="first">
<tt class="docutils literal"><span class="pre">git-imerge</span></tt> then uses bisection to find pairwise merges that
conflict, and asks user to fix them one pair at a time:</div>
<pre class="literal-block">Attempting automerge of 1-1...success.
Attempting automerge of 1-4...success.
Attempting automerge of 1-6...success.
Attempting automerge of 9-6...failure.
Attempting automerge of 5-6...failure.
Attempting automerge of 3-6...success.
Attempting automerge of 4-6...failure.
Attempting automerge of 4-1...success.
Attempting automerge of 4-4...failure.
Attempting automerge of 4-3...failure.
Attempting automerge of 4-2...success.
Attempting automerge of 9-2...success.
Autofilling 1-6...success.
Autofilling 2-6...success.
Autofilling 3-1...success.
Autofilling 3-2...success.
Autofilling 3-3...success.
Autofilling 3-4...success.
Autofilling 3-5...success.
Autofilling 3-6...success.
Autofilling 4-2...success.
Autofilling 5-2...success.
Autofilling 6-2...success.
Autofilling 7-2...success.
Autofilling 8-2...success.
Autofilling 9-1...success.
Autofilling 9-2...success.
Attempting automerge of 4-3...failure.
Switched to branch 'imerge/c-d'
Auto-merging conflict.txt
CONFLICT (add/add): Merge conflict in conflict.txt
Automatic merge failed; fix conflicts and then commit the result.
Original first commit:
commit b7fe8a65d0cee2e388e971c4b29be8c6cbb25ee1
Author: Lou User <luser@example.com>
Date: Fri May 3 14:03:05 2013 +0200
c conflict
Original second commit:
commit bd0373cadae08d872536bcda8214c0631e19945a
Author: Lou User <luser@example.com>
Date: Fri May 3 14:03:05 2013 +0200
d conflict
There was a conflict merging commit 4-3, shown above.
Please resolve the conflict, commit the result, then type
git-imerge continue
</pre>
Please note that <tt class="docutils literal"><span class="pre">git-imerge</span></tt> tells you exactly which of the
<em>original</em> commits need to be combined in this merge, and displays
their log messages.<br />
</li>
<li><div class="first">
You can get a diagram of the current state of the merge (it is in
crude ASCII-art for now):</div>
<pre class="literal-block">$ git imerge diagram
**********
*??|?????|
*--+-----+
*??|#?????
*??|??????
*??|??????
*--+??????
Key:
|,-,+ = rectangles forming current merge frontier
* = merge done manually
. = merge done automatically
# = conflict that is currently blocking progress
@ = merge was blocked but has been resolved
? = no merge recorded
</pre>
</li>
<li><div class="first">
After you fix the indicated conflict and commit the result, run
<tt class="docutils literal">git imerge continue</tt> to tell <tt class="docutils literal"><span class="pre">git-imerge</span></tt> to incorporate the
result and proceed to the next conflict:</div>
<pre class="literal-block">$ git add ...
$ git commit
$ git imerge continue
Merge has been recorded for merge 4-3.
Attempting automerge of 5-3...success.
Attempting automerge of 5-3...success.
Attempting automerge of 9-3...success.
Autofilling 5-3...success.
Autofilling 6-3...success.
Autofilling 7-3...success.
Autofilling 8-3...success.
Autofilling 9-3...success.
Attempting automerge of 4-4...success.
Attempting automerge of 4-5...success.
Attempting automerge of 4-6...success.
Attempting automerge of 9-6...success.
Autofilling 4-6...success.
Autofilling 5-6...success.
Autofilling 6-6...success.
Autofilling 7-6...success.
Autofilling 8-6...success.
Autofilling 9-4...success.
Autofilling 9-5...success.
Autofilling 9-6...success.
Merge is complete!
$ git imerge diagram
**********
*??.?????|
*??......|
*??.*....|
*??.?????|
*??.?????|
*--------+
Key:
|,-,+ = rectangles forming current merge frontier
* = merge done manually
. = merge done automatically
# = conflict that is currently blocking progress
@ = merge was blocked but has been resolved
? = no merge recorded
</pre>
</li>
<li><div class="first">
When you are done, simplify the incremental merge into a simple
merge, a simple rebase, or a rebase-with-history:</div>
<pre class="literal-block">$ git imerge finish --goal=merge
</pre>
By default, this creates a new branch NAME that points at the
result, and checks out that branch:<br />
<pre class="literal-block">$ git log -1 --decorate
commit 79afd870a52e216114596b52c800e45139badf74 (HEAD, merge-branch)
Merge: 8453321 993a8a9
Author: Lou User <luser@example.com>
Date: Wed May 8 10:08:07 2013 +0200
Merge frobnicator into floobifier.
</pre>
(It also deletes all of the temporary data.)<br />
</li>
</ul>
<div class="section" id="intermediate-data">
<h5>
Intermediate data</h5>
During an incremental merge, intermediate results are stored directly
in your repository as special references:<br />
<dl class="docutils">
<dt><tt class="docutils literal">refs/imerge/NAME/state</tt></dt>
<dd>A blob containing a little bit of metadata.</dd>
<dt><tt class="docutils literal"><span class="pre">refs/imerge/NAME/manual/M-N</span></tt></dt>
<dd>Manual merge including all of the changes through commits <tt class="docutils literal">M</tt> on
master and <tt class="docutils literal">N</tt> on branch.</dd>
<dt><tt class="docutils literal"><span class="pre">refs/imerge/NAME/auto/M-N</span></tt></dt>
<dd>Automatic merge including all of the changes through commits <tt class="docutils literal">M</tt>
on master and <tt class="docutils literal">N</tt> on branch.</dd>
<dt><tt class="docutils literal">refs/heads/imerge/NAME</tt></dt>
<dd>Temporary branch used when resolving merge conflicts.</dd>
<dt><tt class="docutils literal">refs/heads/NAME</tt></dt>
<dd>Default reference name for storing final results.</dd>
</dl>
</div>
<div class="section" id="project-status">
<h5>
Project status</h5>
As of this writing, <tt class="docutils literal"><span class="pre">git-imerge</span></tt> is very new and still experimental.
Please try it out, but use it cautiously (e.g., on a clone of your
main Git repository) and <a class="reference external" href="mailto:mhagger@alum.mit.edu?Subject=git-imerge">give me your feedback</a>!</div>
<div class="section" id="contributing">
<h5>
Contributing</h5>
I have lots of ideas for additional features, and am sure that you do
too. See <a class="reference external" href="https://github.com/mhagger/git-imerge/blob/master/TODO.rst">the project TODO list</a> for inspiration, and <strong>get
involved</strong>!</div>
</div>
<div class="section" id="for-more-information">
<h4>
For more information:</h4>
<ul class="simple">
<li><a class="reference external" href="https://github.com/mhagger/git-imerge">The git-imerge project home page</a> on GitHub.</li>
</ul>
<ul class="simple">
<li><a class="reference external" href="http://softwareswirl.blogspot.com/search/label/git-imerge">Blog articles about incremental merge</a></li>
</ul>
<ul class="simple">
<li><a class="reference external" href="http://softwareswirl.blogspot.com/search/label/rebase-with-history">Blog articles about rebase-with-history</a></li>
</ul>
</div>
</div>
Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com0tag:blogger.com,1999:blog-6771653874686998211.post-57906780448143488602013-05-05T18:09:00.002+02:002013-05-08T10:25:49.506+02:00Incremental merge vs. direct merge vs. rebase
<div class="document" id="incremental-merge-vs-direct-merge-vs-rebase">
<table border="1" class="docutils"><colgroup><col width="13%" /><col width="27%" /><col width="27%" /><col width="33%" /></colgroup><thead valign="bottom"><tr><th class="head stub"> </th>
<th class="head">Direct merge</th>
<th class="head">Rebase</th>
<th class="head"><a class="reference external" href="http://softwareswirl.blogspot.com/2013/05/git-incremental-merge.html">Incremental merge</a></th>
</tr></thead><tbody valign="top"><tr><th class="stub">Basic commands</th>
<td><pre class="first last literal-block">
git checkout master
git merge branch
<resolve big conflict>
git commit
</pre>
</td>
<td><pre class="first last literal-block">
git checkout branch
git rebase master
while not done:
<resolve smaller conflict>
git rebase --continue
</pre>
</td>
<td><pre class="first literal-block">
git checkout master
git-imerge start [...] branch
while not done:
<resolve pairwise conflict>
git commit [...]
git-imerge continue
git-imerge simplify [...]
</pre>
<p class="last">(This uses the <tt class="docutils literal"><span class="pre">git-imerge</span></tt> tool <a class="footnote-reference" href="#id5" id="id1">[1]</a>.)</p>
</td>
</tr><tr><th class="stub">Size of conflicts</th>
<td><em>All</em> of the changes in master have to be combined with <em>all</em>
of the changes in branch in one go. Various changes are mixed
up together, even if they were originally committed separately,
which can make the conflict large or even intractable.</td>
<td>One step of a rebase requires <em>one</em> change from branch to be
resolved with <em>all</em> of the changes from master. This is easier
than doing a direct merge in one step, but still mixes up
unrelated changes.</td>
<td>At any step, only <em>one</em> change from master has to be combined
with <em>one</em> change from branch. The commit messages and deltas
from each of the two changes are available to help you
understand the two changes.</td>
</tr><tr><th class="stub">Effort</th>
<td>The least amount of effort, if there are only a few simple
conflicts. Effort increases highly nonlinearly in the size of
the conflict.</td>
<td>A bit more effort, if conflicts are simple and isolated.
Conflicts tend to be smaller than for direct merge, but more
numerous. If there are sequential overlapping conflicts, you
might need to resolve similar conflicts multiple times.</td>
<td>More processing power required at start of merge. Conflicts
are resolved as lots of little conflicts, so the effort scales
closer to linearly with the number of pairwise conflicts <a class="footnote-reference" href="#id6" id="id2">[2]</a>.
Benefits decrease if original changes were not committed in
small, self-contained steps.</td>
</tr><tr><th class="stub">Interruptibility</th>
<td>After you start a direct merge, there is no way to record
partial results. You are forced to either complete the whole
merge and then commit it, or abandon the whole merge using
<tt class="docutils literal">git merge <span class="pre">--abort</span></tt>. Progress on a direct merge is
all-or-nothing.</td>
<td>Once a rebase is started, it is awkward to interrupt it <a class="footnote-reference" href="#id7" id="id3">[3]</a>.
It is true that intermediate results are committed, but because
the rebase discards the connection between original commits and
rebased commits, you have to do the bookkeeping yourself.</td>
<td>Each of the pairwise merges is resolved and committed
separately. After each pairwise merge is committed, you can
switch branches, do something else for a while, then return
later to the incremental merge. Progress on an incremental
merge can be locked in in small increments.</td>
</tr><tr><th class="stub">Testability, fixability</th>
<td>The results of a direct merge cannot be tested until the whole
merge is done. If you mess up one part of the merge, there is
no easy way to revert only that part.</td>
<td>The individual rebased commits can be tested. But if there is
a failure, it is nontrivial to figure out which of the changes
on master interacted badly with the rebased commit.</td>
<td>Each pairwise merge can be tested individually. If one fails
testing, it can be discarded and redone without having to
discard earlier pairwise merges. If a failure is discovered
after the incremental merge is done, it is possible to use
<tt class="docutils literal">git bisect</tt> to find out exactly which of the pairwise merges
was faulty.</td>
</tr><tr><th class="stub">Collaboration</th>
<td>Since an in-progress direct merge is not savable, the only way
to get help from a colleague on a direct merge is to call the
colleague to your computer and have him/her work in your
working copy.</td>
<td>It is strongly recommended that a branch that has been
published <em>not</em> be rebased. Thus rebasing is usually applied
only to local work done by a single developer. It is awkward
to share the work of rebasing a branch <a class="footnote-reference" href="#id7" id="id4">[3]</a>.</td>
<td>Each pairwise merge commit is a normal commit that can be
exchanged with other colleagues via push/fetch/etc. Each
pairwise conflict can be resolved by the person most familiar
with that part of the code. In fact, if you plan to retain the
whole incremental merge in your repository history, then no
history rewriting is necessary at all, and there is no problem
merging a published branch, or publishing the intermediate
merge commits as you go.</td>
</tr><tr><th class="stub">Recommendations</th>
<td>In a merge-based workflow, use direct merge for simple merges.
If a difficult conflict arises, switch to incremental merge and
then simplify the results into a simple merge.</td>
<td>In a rebase-based workflow with unpublished local changes, use
rebase to update your work onto upstream changes if you only
expect trivial conflicts. If you expect (or, midway through
the rebase, encounter) nontrivial conflicts, switch to
incremental merge and then simplify the results into a rebase.</td>
<td><p class="first">Use incremental merge when:</p>
<ul class="last simple"><li>a simple merge or rebase is desired but the conflicts are
nontrivial</li>
<li>you would like to rebase work that has already been published
(simplify results to rebase-with-history)</li>
<li>you want to retain a permanent record of how conflicts were
resolved</li>
<li>you want to collaborate on a merge or rebase</li>
</ul></td>
</tr></tbody></table><div class="section" id="notes">
<h4>Notes:</h4>
<table class="docutils footnote" frame="void" id="id5" rules="none"><colgroup><col class="label" /><col /></colgroup><tbody valign="top"><tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td><tt class="docutils literal"><span class="pre">git-imerge</span></tt> is a new tool that I am working on to automate
incremental merging. It is available as open-source from <a class="reference external" href="http://github.com/mhagger/git-imerge">its
public Git repository</a>.</td></tr></tbody></table><table class="docutils footnote" frame="void" id="id6" rules="none"><colgroup><col class="label" /><col /></colgroup><tbody valign="top"><tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>If development was truly chaotic, and especially if it involved
changes that were later reverted, then an incremental merge
might be much more effort than a simple merge because it
requires you to resolve conflicts between changes that are
obsolete.</td></tr></tbody></table><table class="docutils footnote" frame="void" id="id7" rules="none"><colgroup><col class="label" /><col /></colgroup><tbody valign="top"><tr><td class="label">[3]</td><td><em>(<a class="fn-backref" href="#id3">1</a>, <a class="fn-backref" href="#id4">2</a>)</em> <p>To pause and/or share the work of rebasing, add a reference to
the last successful rebased commit using <tt class="docutils literal">git branch
<span class="pre">rebase-wip</span></tt> <em>before</em> aborting the rebase with <tt class="docutils literal">git rebase
<span class="pre">--abort</span></tt>. The result will look something like:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| |
A A'
| |
B B'
| |
C C'
| |
D D'
|
E ↑
| rebase-wip
F
|
G
|
H
|
I
↑
branch
</pre>
<p class="last">Such an interrupted rebase can be continued using a command
like <tt class="docutils literal">git rebase <span class="pre">--onto</span> <span class="pre">rebase-wip</span> D branch</tt>, where <tt class="docutils literal">D</tt> is
the SHA1 of the last commit that was successfully rebased in
the first attempt. The work can be shared by publishing tips
<tt class="docutils literal">branch</tt> and <tt class="docutils literal"><span class="pre">rebase-wip</span></tt>, but care should be taken when
doing so to prevent other people from basing their own work on
the pre-rebased branch <tt class="docutils literal">branch</tt>.</p>
</td></tr></tbody></table></div>
</div>
Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com0tag:blogger.com,1999:blog-6771653874686998211.post-17844441869294757402013-05-05T17:50:00.000+02:002013-07-15T09:45:43.182+02:00One merge to rule them all<div class="document" id="one-merge-to-rule-them-all">
In earlier articles, I explained <a class="reference external" href="http://softwareswirl.blogspot.com/2012/12/the-conflict-frontier-of-nightmare-merge.html">how to break up a difficult merge</a>
into simple pairwise merges, how to efficiently <a class="reference external" href="http://softwareswirl.blogspot.com/2012/12/mapping-merge-conflict-frontier.html">locate the pairwise
merges that block the full merge</a>, how to make <a class="reference external" href="http://softwareswirl.blogspot.com/2012/12/real-world-conflict-diagrams.html">pictures of the merge
conflict frontier</a>, and how to <a class="reference external" href="http://softwareswirl.blogspot.com/2013/05/git-incremental-merge.html">complete a full incremental merge</a>.
What can we do with an incremental merge once it is complete?<br />
An incremental merge, once completed, <strong>contains all of the
information you could possibly want about combining two branches</strong>.
Specifically, it records how to do each of the <tt class="docutils literal">N*M</tt> pairwise
merges, where <tt class="docutils literal">N</tt> and <tt class="docutils literal">M</tt> are the number of commits on each
branch. Each of these submerges is stored as a git commit with an
ancestry that correctly reflects its contents. In fact, an
incremental merge might contain <em>more</em> information than you want to
store permanently in your repository. In this article, I will explain
how to use an incremental merge to derive a simple merge, a rebase (in
either direction), or a rebase-with-history.<br />
In later articles I will provide a <a class="reference external" href="http://softwareswirl.blogspot.com/2013/05/incremental-merge-vs-direct-merge-vs.html">table comparing incremental merge
with simple merge and simple rebase</a>, explain how to <a class="reference external" href="http://softwareswirl.blogspot.com/">compute an
incremental merge more efficiently</a>, and explain <a class="reference external" href="http://github.com/mhagger/git-imerge">git-imerge</a>, a new
tool that automates incremental merging.<br />
<div class="section" id="the-starting-point">
<h4>
The starting point</h4>
For the purposes of this discussion, we will assume that we have
already completed a full incremental merge as described in <a class="reference external" href="http://softwareswirl.blogspot.com/2013/05/git-incremental-merge.html">my
previous article</a> and that it looks like this <a class="footnote-reference" href="http://www.blogger.com/blogger.g?blogID=6771653874686998211#id3" id="id1">[1]</a>:<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8 - A9 - A10 - A11
| | | | | | | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8 - B9 - B10 - B11
| | | | | | | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6 - C7 - C8 - C9 - C10 - C11
| | | | | | | | | | | |
D - D1 - D2 - D3 - D4 - D5 - D6 - D7 - D8 - D9 - D10 - D11
| | | | | | | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 - E7 - E8 - E9 - E10 - E11
| | | | | | | | | | | |
F - F1 - F2 - F3 - F4 - F5 - F6 - F7 - F8 - F9 - F10 - F11
| | | | | | | | | | | |
G - G1 - G2 - G3 - G4 - G5 - G6 - G7 - G8 - G9 - G10 - G11
| | | | | | | | | | | |
H - H1 - H2 - H3 - H4 - H5 - H6 - H7 - H8 - H9 - H10 - H11
| | | | | | | | | | | |
I - I1 - I2 - I3 - I4 - I5 - I6 - I7 - I8 - I9 - I10 - I11
↑
branch
</pre>
Remember that commits "0"-"11" are on master, and commits "A"-"I" are
on the branch. (It goes without saying that there is nothing special
about master; <em>any</em> two branches can be merged in this manner.) The
rest of the commits are pairwise merges between the neighbor above and
the neighbor to the left. For example, commit "C3" is the pairwise
merge between "B3" and "C2", and it includes all of the changes
through master commit "C" and branch commit "3".<br />
The most interesting commit is "I11". It contains all of the changes
through master commit "11" and branch commit "I". In other words,
<strong>its contents are the merge of "master" and "branch"</strong>. By resolving
simple pairwise conflicts, we have managed indirectly to resolve the
full conflict.<br />
Incremental merging is basically a <a class="reference external" href="http://en.wikipedia.org/wiki/Divide_and_conquer_algorithm">divide and conquer</a> approach to
merging. It replaces the task of resolving one big hairy merge
conflict with the task of resolving many little conflicts. Since big
conflicts are <em>much</em> harder to resolve than small conflicts, this can
be a huge win, and can even make it possible to perform incrementally
a merge that would be intractable in a single step.</div>
<div class="section" id="using-the-results-of-an-incremental-merge">
<h4>
Using the results of an incremental merge</h4>
Now that we have a complete incremental merge, how can we use the
results?<br />
In following sections I will explain how to winnow the information
from an incremental merge down into any one of the following:<br />
<ul class="simple">
<li><a class="reference internal" href="http://www.blogger.com/blogger.g?blogID=6771653874686998211#simple-merge">The simple one-step merge of "master" and "branch"</a></li>
<li><a class="reference internal" href="http://www.blogger.com/blogger.g?blogID=6771653874686998211#rebase-of-branch-onto-master-or-vice-versa">"branch" rebased on top of "master"</a></li>
<li><a class="reference internal" href="http://www.blogger.com/blogger.g?blogID=6771653874686998211#rebase-of-branch-onto-master-or-vice-versa">"master" rebased on top of "branch"</a></li>
<li><a class="reference internal" href="http://www.blogger.com/blogger.g?blogID=6771653874686998211#rebase-with-history">Either type of rebase with history</a></li>
</ul>
Obtaining one of these results is done by rewriting history to
<strong>selectively discard</strong> the intermediate commits that are not wanted,
while <strong>adjusting the parentage</strong> of the commits that are retained.</div>
<div class="section" id="simple-merge">
<h4>
Simple merge</h4>
If your goal is a simple merge of branch into master, (with <em>none</em> of
the intermediate results), then all you need from the incremental
merge is commit "I11". Specifically, you need a single, new merge
commit with the <em>contents</em> of "I11" but with commits "11" and "I" as
parents:<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11
| |
A |
| |
B |
| |
C |
| |
D |
| |
E |
| |
F |
| |
G |
| |
H |
| |
I ---------------------------------------------------- I11' ← master
↑
branch
</pre>
It may be easier to recognize this diagram when it is redrawn in the
conventional orientation:<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 - I11' ← master
\ /
A -- B -- C --- D --- E --- F --- G --- H --- I ← branch
</pre>
</div>
<div class="section" id="rebase-of-branch-onto-master-or-vice-versa">
<h4>
Rebase of "branch" onto "master" or vice versa</h4>
If you prefer to rebase "branch" onto "master", what you want are the
contents of commits "A11" through "I11", but with their second parents
omitted to convert them from merge commits into simple, linear
commits:<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
|
A11'
|
B11'
|
C11'
|
D11'
|
E11'
|
F11'
|
G11'
|
H11'
|
I11' ← branch
</pre>
Similarly, it is possible to use the results to rebase "master" onto
"branch" (i.e., as if the changes made on "branch" had actually been
made earlier in the history of "master") by retaining commits
"I1".."I11" with their first parents omitted:<br />
<pre class="literal-block">o - 0
|
A
|
B
|
C
|
D
|
E
|
F
|
G
|
H
|
I - I1'- I2'- I3'- I4'- I5'- I6'- I7'- I8'- I9'- I10'- I11' ← master
↑
branch
</pre>
</div>
<div class="section" id="rebase-with-history">
<h4>
Rebase with history</h4>
It is also possible to rebase "branch" onto "master", but <a class="reference external" href="http://softwareswirl.blogspot.com/2009/04/truce-in-merge-vs-rebase-war.html">retaining
part of the history</a>. This is a very useful hybrid between rebasing
and merging, having the advantages that it retains both the old and
the new versions of the rebased commits, and that it is a valid way to
rebase already-published history without the known problems of a
simple rebase <a class="footnote-reference" href="http://www.blogger.com/blogger.g?blogID=6771653874686998211#id4" id="id2">[2]</a>. A rebase with history is extracted from a full
incremental merge by retaining commits "A11" through "I11", but
adjusting their second parents to point at "A" through "I"
respectively:<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| |
A ---------------------------------------------------- A11'
| |
B ---------------------------------------------------- B11'
| |
C ---------------------------------------------------- C11'
| |
D ---------------------------------------------------- D11'
| |
E ---------------------------------------------------- E11'
| |
F ---------------------------------------------------- F11'
| |
G ---------------------------------------------------- G11'
| |
H ---------------------------------------------------- H11'
| |
I ---------------------------------------------------- I11'
↑
branch
</pre>
Analogously, it is possible to rebase "master" onto "branch" with
history.</div>
<div class="section" id="maybe-keep-everything">
<h4>
Maybe keep everything?</h4>
As we have seen, a full incremental merge is quite powerful: it
contains enough information to derive <em>all</em> of the common rebase/merge
results as special cases.<br />
Come to think of it, why should we have to decide between <a class="reference external" href="http://softwareswirl.blogspot.com/2009/04/truce-in-merge-vs-rebase-war.html">merge
vs. rebase</a> at all? Why not retain <em>all</em> of the intermediate commits
in our history? Retaining the full merge history is simple; we just
reset the "master" reference to "I11" without rewriting <em>any</em>
commits:<br />
<pre class="literal-block">o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11
| | | | | | | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8 - A9 - A10 - A11
| | | | | | | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8 - B9 - B10 - B11
| | | | | | | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6 - C7 - C8 - C9 - C10 - C11
| | | | | | | | | | | |
D - D1 - D2 - D3 - D4 - D5 - D6 - D7 - D8 - D9 - D10 - D11
| | | | | | | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 - E7 - E8 - E9 - E10 - E11
| | | | | | | | | | | |
F - F1 - F2 - F3 - F4 - F5 - F6 - F7 - F8 - F9 - F10 - F11
| | | | | | | | | | | |
G - G1 - G2 - G3 - G4 - G5 - G6 - G7 - G8 - G9 - G10 - G11
| | | | | | | | | | | |
H - H1 - H2 - H3 - H4 - H5 - H6 - H7 - H8 - H9 - H10 - H11
| | | | | | | | | | | |
I - I1 - I2 - I3 - I4 - I5 - I6 - I7 - I8 - I9 - I10 - I11 ← master
↑
branch
</pre>
The resulting repository holds complete information about how the
individual commits were pairwise merged, which conflicts had to be
resolved manually, and how they were resolved (hopefully explained in
the individual log messages). It contains (as special cases) all of
the information that would be recorded for a simple merge, as well as
all of the information that would be retained by a rebase of "branch"
on "master" or vice versa, with or without history.<br />
However, retaining the whole history of the incremental merge, though
arguably the correct approach, causes the permanent retention of a lot
of intermediate commits that will--in practice--probably never be
needed. Moreover, some git tools (like gitk) aren't able to display
such a complicated history very well. Therefore, it is probably best
to use incremental merge as a tactic to achieve a more limited result.</div>
<div class="section" id="still-to-come">
<h4>
Still to come</h4>
In future articles I plan to present a sparse variant of the full
incremental merge and describe <a class="reference external" href="http://github.com/mhagger/git-imerge">git-imerge</a>, a tool that automates
incremental merging.</div>
<div class="section" id="notes">
<h4>
Notes:</h4>
<table class="docutils footnote" frame="void" id="id3" rules="none"><colgroup><col class="label"></col><col></col></colgroup><tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="http://www.blogger.com/blogger.g?blogID=6771653874686998211#id1">[1]</a></td><td>In most cases it is overkill to complete the full incremental
merge, and a sparser variant would be sufficient. But it is
easier to understand sparse incremental merges after we have
analyzed and discussed the full version. So this article we
assume that the full incremental merge is available.</td></tr>
</tbody></table>
<table class="docutils footnote" frame="void" id="id4" rules="none"><colgroup><col class="label"></col><col></col></colgroup><tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="http://www.blogger.com/blogger.g?blogID=6771653874686998211#id2">[2]</a></td><td>The problems of rebasing published work is described, for
example, in the <a class="reference external" href="ftp://www.kernel.org/pub/software/scm/git/docs/git-rebase.html">git-rebase(1)</a> manpage section "Recovering
from Upstream Rebase".</td></tr>
</tbody></table>
</div>
</div>
Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com0tag:blogger.com,1999:blog-6771653874686998211.post-76938038971618822832013-05-05T17:05:00.000+02:002013-05-08T10:25:18.803+02:00Git incremental merge
<div class="document" id="git-incremental-merge">
<p>In earlier articles, I explained <a class="reference external" href="http://softwareswirl.blogspot.com/2012/12/the-conflict-frontier-of-nightmare-merge.html">how to break up a difficult merge</a>
into simple pairwise merges, how to efficiently <a class="reference external" href="http://softwareswirl.blogspot.com/2012/12/mapping-merge-conflict-frontier.html">locate the pairwise
merges that block the full merge</a>, and how to make <a class="reference external" href="http://softwareswirl.blogspot.com/2012/12/real-world-conflict-diagrams.html">pictures of the
merge conflict frontier</a>. How can these insights be used to actually
resolve a full, nightmare merge?</p>
<p>This article will describe how to incrementally resolve the conflicts
in a merge diagram, one pairwise conflict at a time. Each time a
conflict is manually resolved, the conflict frontier is pushed back
and more of the merge diagram becomes green. When the whole conflict
diagram is green, the full incremental merge is done.</p>
<p>The amazing thing is that <strong>an incremental merge, once completed,
contains all of the information you could possibly want about
combining two branches</strong>. In <a class="reference external" href="http://softwareswirl.blogspot.com/2013/05/one-merge-to-rule-them-all.html">my next article</a> I will show, given a
full incremental merge, how to derive a simple merge, a rebase (in
either direction), or a rebase-with-history.</p>
<p>In future articles I will <a class="reference external" href="http://softwareswirl.blogspot.com/2013/05/incremental-merge-vs-direct-merge-vs.html">summarize the differences between
incremental merge, git merge, and git rebase</a>, discuss how to reduce
the amount of work needed to resolve a merge conflict, and explain
<a class="reference external" href="http://github.com/mhagger/git-imerge">git-imerge</a>, a tool that automates the process of incremental
merging.</p>
<div class="section" id="pushing-back-the-conflict-frontier">
<h4>Pushing back the conflict frontier</h4>
<p>In <a class="reference external" href="http://softwareswirl.blogspot.com/2012/12/the-conflict-frontier-of-nightmare-merge.html">an earlier article</a>, I showed how to create a diagram of a merge
that was partly blocked by conflicting pairwise merges:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8 - A9 - A10 - A11
| | | | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8 x x x
| | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6 x x x x x
| | | | | | |
D - D1 - D2 - D4 - D4 - D5 - D6 x x x x x
| | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 x x x x x
| |
F - F1 x x x x x x x x x x
| |
G - G1 x x x x x x x x x x
| |
H - H1 x x x x x x x x x x
| |
I - I1 x x x x x x x x x x
↑
branch
</pre>
<p>Remember: in this diagram, commits "0"-"11" are on master and commits
"A"-"I" are on the branch. The rest of the diagram shows incremental
merges when they are possible; e.g., commit "C3" is a successful merge
between commit "C2" and commit "B3". When a pairwise merge fails, the
diagram shows an "x"; for example Git was unable to merge "E2" and
"F1", which implies that there was some conflict between the changes
introduced in master commit "2" and those introduced in branch commit
"F".</p>
<p>But you are smarter than Git, so (hopefully) you can merge "E2" and
"F1" manually. Among your advantages over Git is that you can
understand the commit messages for commits "2" and "F", so you know
what those two commits were trying to achieve. Your task is to create
a commit "F2" that contains the changes made in <em>both</em> "2" and "F"
<a class="footnote-reference" href="#id2" id="id1">[1]</a>. Once you resolve those two changes and commit the results as
"F2", the diagram is transformed to:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8 - A9 - A10 - A11
| | | | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8 x x x
| | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6 x x x x x
| | | | | | |
D - D1 - D2 - D4 - D4 - D5 - D6 x x x x x
| | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 x x x x x
| | |
F - F1 - F2 x x x x x
| |
G - G1 x x x x x
| |
H - H1 x x x x x
| |
I - I1 x x x x x
↑
branch
</pre>
<p>Not only has "F2" been added to the diagram, but also a white area has
opened up below and to the right of "F2". These merges used to be
blocked by the conflict at "F2". But since we have merged "F2"
manually, it might be that Git can now do some or all of these merges
automatically. For example, we can now attempt to merge "E3" and "F2"
to create "F3", or merge "F2" and "G1" to create "G2", and continue
filling in the white area until we are blocked by new conflicts.</p>
<p>If we are lucky, the conflict "F2" was isolated and the whole block is
now mergeable:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8 - A9 - A10 - A11
| | | | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8 x x x
| | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6 x x x x x
| | | | | | |
D - D1 - D2 - D4 - D4 - D5 - D6 x x x x x
| | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 x x x x x
| | | | | | |
F - F1 - F2 - F3 - F4 - F5 - F6 x x x x x
| | | | | | |
G - G1 - G2 - G3 - G4 - G5 - G6 x x x x x
| | | | | | |
H - H1 - H2 - H3 - H4 - H5 - H6 x x x x x
| | | | | | |
I - I1 - I2 - I3 - I4 - I5 - I6 x x x x x
↑
branch
</pre>
<p>Of course we might not be so lucky, and new conflicts might be
revealed in the rectangle previously blocked by "F2". But either way,
we have made at least <em>some</em> progress in pushing back the conflict
frontier.</p>
<p>If we continue merging commits pairwise--automatically when possible,
manually when necessary--eventually we will clear the whole diagram of
conflicts. When this is accomplished, the incremental merge is done:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8 - A9 - A10 - A11
| | | | | | | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8 - B9 - B10 - B11
| | | | | | | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6 - C7 - C8 - C9 - C10 - C11
| | | | | | | | | | | |
D - D1 - D2 - D3 - D4 - D5 - D6 - D7 - D8 - D9 - D10 - D11
| | | | | | | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 - E7 - E8 - E9 - E10 - E11
| | | | | | | | | | | |
F - F1 - F2 - F3 - F4 - F5 - F6 - F7 - F8 - F9 - F10 - F11
| | | | | | | | | | | |
G - G1 - G2 - G3 - G4 - G5 - G6 - G7 - G8 - G9 - G10 - G11
| | | | | | | | | | | |
H - H1 - H2 - H3 - H4 - H5 - H6 - H7 - H8 - H9 - H10 - H11
| | | | | | | | | | | |
I - I1 - I2 - I3 - I4 - I5 - I6 - I7 - I8 - I9 - I10 - I11
↑
branch
</pre>
<p>But when the incremental merge is done, the full merge is also done!
Commit "I11" combines all of the changes from "master" with all of the
changes from "branch"; in other words, it is the merge of "master" and
"branch". The merge nightmare is over.</p>
</div>
<div class="section" id="still-to-come">
<h4>Still to come</h4>
<p><a class="reference external" href="http://softwareswirl.blogspot.com/2013/05/one-merge-to-rule-them-all.html">My next article</a> will explain how to convert an incremental merge
into a simple merge, a simple rebase, or a rebase <a class="reference external" href="http://softwareswirl.blogspot.com/2009/04/truce-in-merge-vs-rebase-war.html">with history</a>. In
future articles I will <a class="reference external" href="http://softwareswirl.blogspot.com/2013/05/incremental-merge-vs-direct-merge-vs.html">compare the available merge alternatives</a>,
present less exhaustive variants of the full incremental merge and
describe <a class="reference external" href="http://github.com/mhagger/git-imerge">git-imerge</a>, a tool that automates incremental merging.</p>
</div>
<div class="section" id="notes">
<h4>Notes:</h4>
<table class="docutils footnote" frame="void" id="id2" rules="none"><colgroup><col class="label" /><col /></colgroup><tbody valign="top"><tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>The technical reason why merging "E2" and "F1" into "F2" is
simpler than merging "2" and "F" directly is that "E2" and "F1"
share nearby commit "E1" as a merge base (see the documentation
for <a class="reference external" href="ftp://www.kernel.org/pub/software/scm/git/docs/git-merge-base.html">git-merge-base(1)</a>). By contrast, the merge base of "2"
and "F" is commit "0", which is further away from the commits
to be merged.</td></tr></tbody></table></div>
</div>
Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com0tag:blogger.com,1999:blog-6771653874686998211.post-15350721022094397052012-12-28T00:15:00.001+01:002013-05-08T10:24:59.833+02:00Real-world conflict diagrams<div class="document" id="real-world-examples-of-merge-diagrams">
<p>In earlier articles, I described a way to <a class="reference external" href="http://softwareswirl.blogspot.de/2012/12/the-conflict-frontier-of-nightmare-merge.html">diagram a conflicting
merge</a>, and presented an <a class="reference external" href="http://softwareswirl.blogspot.de/2012/12/mapping-merge-conflict-frontier.html">algorithm for efficiently mapping the
conflict frontier</a> which is implemented in an experimental program,
<a class="reference external" href="https://github.com/mhagger/git-mergemate">git-mergemate</a>. In this article, I would like to present a few
examples of real-world merge diagrams.</p>
<p>To generate these diagrams, I first used the following command to
search for merge commits in "master" that had manually-resolved
conflicts:</p>
<pre class="literal-block">
git rev-list --merges --grep='Conflicts:' master
</pre>
<p>Assuming that the SHA1 of the merge is stored in <tt class="docutils literal">$s</tt>, the two
parents of such a commit can be written as <tt class="docutils literal">$s^1</tt> and <tt class="docutils literal">$s^2</tt>. So
I then ran <a class="reference external" href="https://github.com/mhagger/git-mergemate">git-mergemate</a> like so:</p>
<pre class="literal-block">
git-mergemate diagram $s^1...$s^2
</pre>
<p>and converted the output from PPM format to PNG using the <tt class="docutils literal">pnm2png</tt>
utility <a class="footnote-reference" href="#id2" id="id1">[1]</a>.</p>
<div class="section" id="show-me-the-pixels">
<h4>Show me the pixels!</h4>
<p>Each pixel in these diagrams represents one pairwise merge. Thus the
images are quite compact; the width (in pixels) is the number of
commits on "master" after the branch point, and the height is the
number of commits on the branch. I am leaving the images at 1:1
scaling so that their relative sizes can easily be seen (hopefully you
can zoom them using your browser if you want to see more detail).</p>
<p>A pixel is green if the corresponding merge was successful and red if
the merge resulted in a conflict. The merges that were explicitly
tested are shown as bright green/red; the other merges were inferred
using the assumptions described in <a class="reference external" href="http://softwareswirl.blogspot.de/2012/12/mapping-merge-conflict-frontier.html">the algorithm article</a>. As you can
see, it takes very few test merges to compute a whole merge diagram.</p>
<div class="section" id="ho-hum">
<h5>Ho-hum</h5>
<p>Most of the merge diagrams (about 80% of them) were quite boring.</p>
<p>A typical short branch with a conflict:</p>
<img alt="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOy0xGXDmTDeDpq0BdRLzEoA1W3R39seX7_RD0u4FD01ViLwV_q_lNlEBQjZbgPMvxN2zKN3T8oxMo3nUFybe1qwAFpmFLmu3IBpuz42gAnOL412AbKqR5r3wuQqxem_81I8wstU3Xvqk/s1600/diagram-43c456bdfa3127a2f73b3c99fb1a3e21674613ea.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOy0xGXDmTDeDpq0BdRLzEoA1W3R39seX7_RD0u4FD01ViLwV_q_lNlEBQjZbgPMvxN2zKN3T8oxMo3nUFybe1qwAFpmFLmu3IBpuz42gAnOL412AbKqR5r3wuQqxem_81I8wstU3Xvqk/s1600/diagram-43c456bdfa3127a2f73b3c99fb1a3e21674613ea.png" /><p>A short branch with an early conflict:</p>
<img alt="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjybk4IykVYxAiJ9u2U5-nXqiXJfO7kbjSQODQai7tQT5u65aBikUGzkjwBC_2u5hAEWjo39xYv1r7KRqyCFUEciKgpWxb2g0IOuHCUARq5_RoXlErTcWAA9ujCOpRq6FadEvhEsgmEjs/s1600/diagram-3e8ebc18283a23305179aa2055619985163fbd93.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjybk4IykVYxAiJ9u2U5-nXqiXJfO7kbjSQODQai7tQT5u65aBikUGzkjwBC_2u5hAEWjo39xYv1r7KRqyCFUEciKgpWxb2g0IOuHCUARq5_RoXlErTcWAA9ujCOpRq6FadEvhEsgmEjs/s1600/diagram-3e8ebc18283a23305179aa2055619985163fbd93.png" /><p>A very short branch merged long after it was created:</p>
<img alt="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjosveq9_j9xYlko1NKeCqW_fmu94dE8In_Gv_uqECQYIp4Ur-3KzQ7gHWzO2apKIfTaL54hIqw8Eb374f2_GydTMr1KBwjB33nR2YNJmCjthCAXacL4z5yU_m2VXA4q5zqQWc6oceBa0k/s320/diagram-6cc13d7703e8dc0edccd134f44676ca9ad73dd6a.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjosveq9_j9xYlko1NKeCqW_fmu94dE8In_Gv_uqECQYIp4Ur-3KzQ7gHWzO2apKIfTaL54hIqw8Eb374f2_GydTMr1KBwjB33nR2YNJmCjthCAXacL4z5yU_m2VXA4q5zqQWc6oceBa0k/s320/diagram-6cc13d7703e8dc0edccd134f44676ca9ad73dd6a.png" /><p>A longer branch merged soon after it was created, with a conflict with
one of its later commits:</p>
<img alt="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiczEd_xAHdY972whFdL8cbi0JUeUc8EpdevWTmV1FI9RxOwbUb48C6UVXiVqY8Su8Bc95mZP3neWoSovHL4hyupeb6LiaBhKKgxlAIYyBBzwlYN0U4c2qK94gqq_KXRQLkwjz_kiVXJeE/s1600/diagram-7f28917bfa6aef7c38a216d0e30fae40f216e8f4.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiczEd_xAHdY972whFdL8cbi0JUeUc8EpdevWTmV1FI9RxOwbUb48C6UVXiVqY8Su8Bc95mZP3neWoSovHL4hyupeb6LiaBhKKgxlAIYyBBzwlYN0U4c2qK94gqq_KXRQLkwjz_kiVXJeE/s1600/diagram-7f28917bfa6aef7c38a216d0e30fae40f216e8f4.png" /></div>
<div class="section" id="single-conflicts">
<h5>Single conflicts</h5>
<p>However, about 20% of the merge diagrams were more interesting. The
interesting diagrams tend to be bigger because they represent
long-lived branches merged a considerable time after they were
created.</p>
<p>Several diagrams show merges that might be blocked by a single
pairwise conflict:</p>
<p>The conflict at the upper-left corner of the red rectangle might be
all that is blocking this merge:</p>
<img alt="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpchBoy8mhuo9U140LM8T_CwhBGztnsYQWWfMjgimbxe02FwhVdOZBHQ_FETMYYHMjnzHscLbFZBL08FJpApAC1Ds-hYeLal2QL_NFwmBBBmDvPsKNKyaJGTWGulB3R-6aj6vcFAzVQko/s1600/diagram-737dae7c79f05d06fada3246d2269fa4f33dd90a.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpchBoy8mhuo9U140LM8T_CwhBGztnsYQWWfMjgimbxe02FwhVdOZBHQ_FETMYYHMjnzHscLbFZBL08FJpApAC1Ds-hYeLal2QL_NFwmBBBmDvPsKNKyaJGTWGulB3R-6aj6vcFAzVQko/s1600/diagram-737dae7c79f05d06fada3246d2269fa4f33dd90a.png" /><p>A long branch, merged before much was changed on "master":</p>
<img alt="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs07m54MVuWlZnxiumaXGenldz_R8VgksvnXBsykEwnCyhqdo_gyOyedz201Wm33fV27xwjfsiKx6xxCgpnif5ulcOLjlnxNhwlO_0JSP3MFh4wxlnv3e2ZRAhTVjbDgkyAeDshRa__eI/s1600/diagram-8922c2d3f89fc35636d5e41ca17dfa36b15fbd2d.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs07m54MVuWlZnxiumaXGenldz_R8VgksvnXBsykEwnCyhqdo_gyOyedz201Wm33fV27xwjfsiKx6xxCgpnif5ulcOLjlnxNhwlO_0JSP3MFh4wxlnv3e2ZRAhTVjbDgkyAeDshRa__eI/s1600/diagram-8922c2d3f89fc35636d5e41ca17dfa36b15fbd2d.png" /><p>A short branch, merged quite a while after it was created:</p>
<img alt="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBrfpWjOEowIiYg2mY7FLkpVzQtpHK0zJDimeqUGGfLnDadUmYJBL49zB0I6YwDB4RHMqX89h1SD9a2pgH3bhrFiIYebgUM5MQPAIGQNZZNCqAI41T__bb8hdyH4HZS_Jzp-jjBL5OcZs/s1600/diagram-9747028e60b994b5df12cf0805c0a3b3f04d9edd.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBrfpWjOEowIiYg2mY7FLkpVzQtpHK0zJDimeqUGGfLnDadUmYJBL49zB0I6YwDB4RHMqX89h1SD9a2pgH3bhrFiIYebgUM5MQPAIGQNZZNCqAI41T__bb8hdyH4HZS_Jzp-jjBL5OcZs/s1600/diagram-9747028e60b994b5df12cf0805c0a3b3f04d9edd.png" /><p>A branch that has diverged widely from "master", but might have only
one pairwise conflict:</p>
<img alt="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnxejhfQN-YDPK0rAV6VoJ79i1vmGj9LbZg17n-VCQrVHiX5Byo5RmlK1afHGgmisVh2w4HIZpq1hMf34zjQz_Muq9YgY7JnvtVMNXx-XoPWSx7ezVmLsqgqMmZTZY2IGs2Gyi_-SBFjc/s1600/diagram-e51c4480518a4b30e241e35218385ef357f0f71b.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnxejhfQN-YDPK0rAV6VoJ79i1vmGj9LbZg17n-VCQrVHiX5Byo5RmlK1afHGgmisVh2w4HIZpq1hMf34zjQz_Muq9YgY7JnvtVMNXx-XoPWSx7ezVmLsqgqMmZTZY2IGs2Gyi_-SBFjc/s1600/diagram-e51c4480518a4b30e241e35218385ef357f0f71b.png" /></div>
<div class="section" id="multiple-conflicts">
<h5>Multiple conflicts</h5>
<p>Other diagrams are yet more complicated, with merge frontiers shaped
by two pairwise conflicts:</p>
<p>A smallish diagram with two early conflict blocks:</p>
<img alt="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_dm1u1ovnnbnJLnoE6-_afuQBoVI4L8bmlR51cZgV_m1GygZDqwEjmafWCDiD7kv6UCEdvqvd2LohoitGAqbCQFWzVYLUydCtvk9q_jC4VokM8IemQykrfFUZyp81IAX8BLrkUKkdmcY/s1600/diagram-fd1165fbe2aca4400e46904ac29f269e42d8da50.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_dm1u1ovnnbnJLnoE6-_afuQBoVI4L8bmlR51cZgV_m1GygZDqwEjmafWCDiD7kv6UCEdvqvd2LohoitGAqbCQFWzVYLUydCtvk9q_jC4VokM8IemQykrfFUZyp81IAX8BLrkUKkdmcY/s1600/diagram-fd1165fbe2aca4400e46904ac29f269e42d8da50.png" /><p>Note that the top line in this diagram is green, meaning that the
first branch commit doesn't conflict with any of the commits on
master:</p>
<img alt="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggWXPtiga7NwSjwjvx0kIJTxpwgWMG2aw5k7c1HshAnLnPAsluUyWIOa8z0EpPAvOS7xMm1_LPNwjF7QwrFE5vbLny90PhKu8kWQ_eiLa0mz37QbnjgIOUgYwK28QSUXXSpqNzIpk2c9E/s1600/diagram-b6508be06fde5b869f6119f883bb7d838425d4fe.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggWXPtiga7NwSjwjvx0kIJTxpwgWMG2aw5k7c1HshAnLnPAsluUyWIOa8z0EpPAvOS7xMm1_LPNwjF7QwrFE5vbLny90PhKu8kWQ_eiLa0mz37QbnjgIOUgYwK28QSUXXSpqNzIpk2c9E/s1600/diagram-b6508be06fde5b869f6119f883bb7d838425d4fe.png" /><p>A larger merge with two early conflict blocks:</p>
<img alt="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgvGZe0U1b_pbvLyfS1ewJoj_vfaEwfDuWbmwnJKUt3NAV2TMIdzahZGAHaDou9edllwMtoHjkLxGIm32ESWEMoGiBhQFuACBx_r7Z-7XMyomLNXd0gq0Jcug_RPQ_47gHoayMH1_IcS0/s1600/diagram-04d07c8da4eec76f1d4c32cd438ddfbf9e2ccb3d.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgvGZe0U1b_pbvLyfS1ewJoj_vfaEwfDuWbmwnJKUt3NAV2TMIdzahZGAHaDou9edllwMtoHjkLxGIm32ESWEMoGiBhQFuACBx_r7Z-7XMyomLNXd0gq0Jcug_RPQ_47gHoayMH1_IcS0/s1600/diagram-04d07c8da4eec76f1d4c32cd438ddfbf9e2ccb3d.png" /><p>The biggest merge of all (281 commits on master, 235 commits on
branch). Please note that even though the branch and master have
diverged significantly, conflicts did not arise until quite some time
after the branch was created:</p>
<img alt="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7WtlSrOU8vWlNKOBc8_lEs74QBu6StpDeryMqazS70Lx29SNww6JEa2Zfeqpd0MXe2T4INEdIAV3InstY81oUkkBJYJDvPXbBbyLZIdPY8aDcKoCrKy2r3PB1HitRqxZFiScYhSRFzis/s1600/diagram-698240f65bd4eb19f122ac2a725818b965445155.png" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7WtlSrOU8vWlNKOBc8_lEs74QBu6StpDeryMqazS70Lx29SNww6JEa2Zfeqpd0MXe2T4INEdIAV3InstY81oUkkBJYJDvPXbBbyLZIdPY8aDcKoCrKy2r3PB1HitRqxZFiScYhSRFzis/s1600/diagram-698240f65bd4eb19f122ac2a725818b965445155.png" /></div>
</div>
<div class="section" id="conclusion">
<h4>Conclusion</h4>
<p>A merge diagram provides a quick, intuitive overview of a conflicting
merge, and highlights the individual commits on each branch that
conflict with each other.</p>
<p>Most merge diagrams are simple (in fact I didn't observe any with more
than two blocking pairwise merges). Thus there is hope that, by
focusing our attention on these critical pairwise merges, we can
resolve a nightmare merge with an acceptable level of pain and (more
importantly) some confidence in the result.</p>
<p>In upcoming articles I will explore how to use incremental merging to
resolve a nightmare merge.</p>
<div class="section" id="notes">
<h5>Notes:</h5>
<table class="docutils footnote" frame="void" id="id2" rules="none"><colgroup><col class="label" /><col /></colgroup><tbody valign="top"><tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td><p class="first">Putting this all together, we have</p>
<pre class="literal-block">
for s in $(git rev-list --grep='Conflicts:' --merges master)
do
git-mergemate diagram $s^1...$s^2 | pnmtopng >../diagram-$s.png
done
</pre>
<p class="last">If you want to try this with your own git repository, please make a
backup first, as the script is only lightly tested!</p>
</td></tr></tbody></table></div>
</div>
</div>
Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com0tag:blogger.com,1999:blog-6771653874686998211.post-43383679125111890862012-12-28T00:15:00.000+01:002013-05-08T10:24:44.458+02:00Mapping the merge conflict frontier<div class="document" id="mapping-the-merge-conflict-frontier">
<p>In <a class="reference external" href="http://softwareswirl.blogspot.de/2012/12/the-conflict-frontier-of-nightmare-merge.html">my last article</a>, I described how to define a "conflict frontier"
for a failed merge and conceptually how to diagram it. In this
article, I will describe an automated and efficient way to compute the
conflict frontier using git. In my next article, I will show some
<a class="reference external" href="http://softwareswirl.blogspot.de/2012/12/real-world-conflict-diagrams.html">examples of real-world merge diagrams</a>.</p>
<p>Remember, a merge diagram summarizes the combinations of commits from
two branches that can be merged successfully, and which of them
conflict. The example used in the previous article looked like this:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8 - A9 - A10 - A11
| | | | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8 x x x
| | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6 x x x x x
| | | | | | |
D - D1 - D2 - D4 - D4 - D5 - D6 x x x x x
| | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 x x x x x
| |
F - F1 x x x x x x x x x x
| |
G - G1 x x x x x x x x x x
| |
H - H1 x x x x x x x x x x
| |
I - I1 x x x x x x x x x x
↑
branch
</pre>
<p>In this diagram, commits "0"-"11" are on master and commits "A"-"I"
are on the branch. The rest of the diagram shows incremental merges
when they are possible; e.g., commit "C3" is a successful merge
between commit "C2" and commit "B3", and it contains the logical
changes that were originally introduced by commits "0"-"3" on master
plus those introduced by commits "A"-"C" on branch. When an
incremental merge failed, the diagram shows an "x". The "conflict
frontier" is the border between merges that are successful and those
that result in conflicts.</p>
<p>This article describes an algorithm for computing the merge frontier
efficiently.</p>
<div class="section" id="assumptions">
<h4>Assumptions</h4>
<p>In this article we make the following assumptions:</p>
<ol class="arabic simple"><li>If a merge can be done directly, then all of the incremental merges
above and to the left of the merge can also be done successfully.
For example, if we can merge "C" and "3" directly, then we assume
that all of the merges "A1", "A2", "A3", "B1", "B2", "B3", "C1",
and "C2" could also be done successfully. This is typically
(though not always) the case in the real world.</li>
<li>If a direct merge conflicts, then all of the incremental merges
below and to the right of it would also conflict. For example, if
a merge of "F" and "2" fails, we assume that the whole rectangle of
incremental merges bounded by "F2", "F11", "I11", and "I2" would
also conflict. Again, this is typically (though not always) the
case in the real world.</li>
<li>We use "git merge" to define whether a merge is successful. If the
command completes without an error, then we assume that the merge
is good. (For a more accurate determination of which merges are
successful, one could also check whether the merged version builds
and passes the project's automated tests.)</li>
</ol></div>
<div class="section" id="bisection">
<h4>Bisection</h4>
<p>Given the above assumptions, it is possible to locate the merge
frontier in any line or column of the diagram via a process of
bisection. For example, to find the merge frontier in row "E" of the
diagram:</p>
<ol class="arabic simple"><li>Test merging "E" and "6" -> SUCCESS. Merge frontier must be to the
right of this entry.</li>
<li>Test merging "E" and "9" -> CONFLICT. Merge frontier must be to
the left of this entry.</li>
<li>Test merging "E" and "7" -> FAIL. Merge frontier must be between
"E6" and "E7".</li>
</ol><p>In the diagram, the bisection look like this:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
|
A
|
B
|
C
|
D
|
E E6 x x
|
F
|
G
|
H
|
I
↑
branch
</pre>
</div>
<div class="section" id="filling">
<h4>Filling</h4>
<p>Given the results of a bisection, assumptions 1 and 2 allow us to fill
in big chunks of the merge diagram. In particular, since merge "E6"
was successful, all of the merges in the rectangle above and to the
left of "E6" are also successful. And since "E7" conflicts, all of
the merges in the rectangle below and to the right of "E7" are also
conflicts:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6
| | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6
| | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6
| | | | | | |
D - D1 - D2 - D4 - D4 - D5 - D6
| | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 x x x x x
|
F x x x x x
|
G x x x x x
|
H x x x x x
|
I x x x x x
↑
branch
</pre>
<p>The combination of bisection and filling allows the merge frontier to
be computed very efficiently.</p>
</div>
<div class="section" id="the-zig-zag-method">
<h4>The zig-zag method</h4>
<p>There are many ways to map out the frontier using bisection and
filling. The one that I use I call the zig-zag method, because it
starts at the bottom-left and zig-zags up the frontier until it
reaches the top-right. In pseudo-code, it is roughly:</p>
<pre class="literal-block">
i1 = 0
i2 = number of commits on branch
while True:
i1 = index of first failing merge in row i2-1
mark block above and to the left of (i1,i2) as "successful"
if i1 == number of commits on master:
break
i2 = index of first failing merge in column i1
mark block below and to the right of (i1,i2) as "conflict"
if i2 == 0:
break
</pre>
<p>Whenever the "index of first failing merge" is needed for a row or a
column, it is found by bisection. For our example, the progression
looks like this (with the position (i1,i2) marked as <tt class="docutils literal">*</tt>):</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| |
A - A1
| |
B - B1
| |
C - C1
| |
D - D1
| |
E - E1
| |
F - F1
| |
G - G1
| |
H - H1
| |
I - I1 *
↑
branch
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| |
A - A1
| |
B - B1
| |
C - C1
| |
D - D1
| |
E - E1
| |
F - F1 * x x x x x x x x x
| |
G - G1 x x x x x x x x x x
| |
H - H1 x x x x x x x x x x
| |
I - I1 x x x x x x x x x x
↑
branch
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6
| | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6
| | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6
| | | | | | |
D - D1 - D2 - D4 - D4 - D5 - D6
| | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 *
| |
F - F1 x x x x x x x x x x
| |
G - G1 x x x x x x x x x x
| |
H - H1 x x x x x x x x x x
| |
I - I1 x x x x x x x x x x
↑
branch
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6
| | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6
| | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6 * x x x x
| | | | | | |
D - D1 - D2 - D4 - D4 - D5 - D6 x x x x x
| | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 x x x x x
| |
F - F1 x x x x x x x x x x
| |
G - G1 x x x x x x x x x x
| |
H - H1 x x x x x x x x x x
| |
I - I1 x x x x x x x x x x
↑
branch
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8
| | | | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8 *
| | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6 x x x x x
| | | | | | |
D - D1 - D2 - D4 - D4 - D5 - D6 x x x x x
| | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 x x x x x
| |
F - F1 x x x x x x x x x x
| |
G - G1 x x x x x x x x x x
| |
H - H1 x x x x x x x x x x
| |
I - I1 x x x x x x x x x x
↑
branch
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8
| | | | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8 * x x
| | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6 x x x x x
| | | | | | |
D - D1 - D2 - D4 - D4 - D5 - D6 x x x x x
| | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 x x x x x
| |
F - F1 x x x x x x x x x x
| |
G - G1 x x x x x x x x x x
| |
H - H1 x x x x x x x x x x
| |
I - I1 x x x x x x x x x x
↑
branch
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8 - A9 - A10 - A11 *
| | | | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8 x x x
| | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6 x x x x x
| | | | | | |
D - D1 - D2 - D4 - D4 - D5 - D6 x x x x x
| | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 x x x x x
| |
F - F1 x x x x x x x x x x
| |
G - G1 x x x x x x x x x x
| |
H - H1 x x x x x x x x x x
| |
I - I1 x x x x x x x x x x
↑
branch
</pre>
<p>Using this algorithm, the whole diagram can be computed by doing only
<tt class="docutils literal">O(B * lg max(M,N))</tt> test merges, where <tt class="docutils literal">B</tt> is the number of
blocks in the resulting diagram and <tt class="docutils literal">M</tt> and <tt class="docutils literal">N</tt> are the number of
commits on master and branch, respectively. In the worst case, <tt class="docutils literal">B =
min(M,N)</tt> and the algorithm scales like <tt class="docutils literal">O(min(M,N) * lg
max(M,N))</tt>.</p>
<p>I have written an example program, <a class="reference external" href="https://github.com/mhagger/git-mergemate">git-mergemate</a>, that computes
merge diagrams using the algorithm described here. To use it, put it
in your <tt class="docutils literal">$PATH</tt>, switch to your git project working directory (which
must be clean), and type</p>
<pre class="literal-block">
git-mergemate diagram master...branch >diagram.ppm
</pre>
<p>The output is a <a class="reference external" href="http://en.wikipedia.org/wiki/Netpbm_format">PPM file</a> in which successful merges are displayed
as green pixels and conflicting merges are displayed as red pixels.
Pixels representing merges that were actually tested are displayed as
bright red/green; the other pixels are inferred using the assumptions
above.</p>
<p>My next article will show some <a class="reference external" href="http://softwareswirl.blogspot.de/2012/12/real-world-conflict-diagrams.html">examples of real-world merge
diagrams</a>.</p>
</div>
<div class="section" id="some-technical-details">
<h4>Some technical details</h4>
<ol class="arabic simple"><li>Each of the bisect steps only needs to search the part of the
row/column that hasn't already been filled in.</li>
<li>If the merges are done using git, it is important to do them with
git's "rerere" feature <em>turned off</em>. Otherwise the results of test
merges are not deterministic because they depend on what
information "rerere" has already gathered.</li>
</ol></div>
</div>
Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com0tag:blogger.com,1999:blog-6771653874686998211.post-75521001580995332482012-12-28T00:10:00.000+01:002013-05-08T10:23:47.953+02:00The conflict frontier of a nightmare merge<div class="document" id="the-conflict-frontier-of-a-nightmare-merge">
<p>If you use feature branches in git, you know the importance of getting
features (or featurettes) merged back to master as soon as possible.
Otherwise, it is all too likely that the branch and master will grow
so far apart that merging them becomes a nightmare. You probably know
that sick feeling in the pit of your stomach when you attempt a merge,
only to be confronted with dozens or even hundreds of conflicts. How
can you possibly understand all of the conflicts, let alone
disentangle them? Sometimes it seems like there is nothing to do but
abandon the branch entirely and start over.</p>
<p>Exactly this situation happened recently at <tt class="docutils literal">$WORK</tt>, and it got me
thinking about what can be done to attack the nightmare merge.</p>
<div class="section" id="explode-the-problem">
<h4>Explode the problem</h4>
<p>The first step to solving a problem is to understand it, and (for me
at least) the first step to understanding a problem is to visualize it
somehow. What does a nightmare merge look like?</p>
<p>Specifically, one doesn't learn very much from the fact that two large
and widely-diverged branches don't merge successfully. Let's explode
the merge into many tiny, simpler merges to try to get a more
intuitive idea of the situation.</p>
<p>Assume, for the sake of simplicity, that both master and the feature
branch have developed linearly since they bifurcated <a class="footnote-reference" href="#id3" id="id1">[1]</a>:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
\
A - B - C - D - E - F - G - H - I ← branch
</pre>
<p>Now, the fact that master and branch cannot be merged means that
<em>somewhere</em> among the commits on master there are one or more changes
that conflict with one or more changes <em>somewhere</em> on branch. But the
majority of the changes probably do not conflict. Let's find the
problematic pairs so that we can focus our attention on them.</p>
<p>We start by drawing the above diagram a little bit differently, with
the branch drawn downwards at a 90 degree angle to master:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
|
A
|
B
|
C
|
D
|
E
|
F
|
G
|
H
|
I
↑
branch
</pre>
<p>Now let's divide and conquer by merging individual pairs of commits,
one at a time. The first merge is between the first post-bifurcation
commits on branch and master, "1" and "A":</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| |
A - A1
|
B
|
C
|
D
|
E
|
F
|
G
|
H
|
I
↑
branch
</pre>
<p>If we're lucky, the merge will go through without conflicts. So let's
merge some more branch commits with the first commit from master:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| |
A - A1
| |
B - B1
| |
C - C1
| |
D - D1
| |
E - E1
| |
F - F1
| |
G - G1
| |
H - H1
| |
I - I1
↑
branch
</pre>
<p>Woohoo! Nothing on the branch conflicted with commit 1 from master.
This is a little bit of progress. For example, it means that we could
probably cleanly rebase the branch to commit "1" (though that is
<a class="reference external" href="http://softwareswirl.blogspot.com/2009/04/truce-in-merge-vs-rebase-war.html">not recommended</a>).</p>
<dl class="docutils"><dt><strong>Aside</strong>:</dt>
<dd><p class="first">Please note that we haven't actually defined what it means for a
merge to be "successful". The simplest approach is simply to
attempt the merge in git and see if it reports a conflict. This
is the approach that I will use. But please note that even a
merge that is <em>textually</em> successful is not necessarily correct:
it might not build, or the merge might have introduced bugs, etc.
It would be possible to define success more strictly, using
criteria like</p>
<ul class="last simple"><li>The merged source tree compiles without errors</li>
<li>The test suite runs correctly</li>
<li>The merged code passes manual inspection</li>
</ul></dd>
</dl></div>
<div class="section" id="blocked">
<h4>Blocked!</h4>
<p>Now let's start successively merging the first column of merge commits
with commit 2 from master:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | |
A - A1 - A2
| | |
B - B1 - B2
| | |
C - C1 - C2
| | |
D - D1 - D2
| | |
E - E1 - E2
| | |
F - F1 - x
| |
G - G1
| |
H - H1
| |
I - I1
↑
branch
</pre>
<p>This time, we hit a problem: there was a conflict (marked with "x")
when trying to merge commit "F1" with commit "E2". What does this
mean?</p>
<p>Commit "E1" essentially contains all of the changes from commits
"A"-"E" on the branch plus the change from commit "1" on master. This
merge was successful. If we denote the change introduced by commit
"X" as "∆X", we can write (schematically):</p>
<pre class="literal-block">
E1 = 0 + ∆1 + ∆A + ∆B + ∆C + ∆D + ∆E (successful)
</pre>
<p>"E2" contains everything in "E1", plus additionally it contains the
change that was originally made in master commit "2". Similarly, "F1"
contains all of the changes from "E1", plus additionally the change
that was originally made on branch commit "F". Both merges "E2" and
"F1" were also successful:</p>
<pre class="literal-block">
E2 = 0 + ∆1 + ∆2 + ∆A + ∆B + ∆C + ∆D + ∆E (successful)
F1 = 0 + ∆1 + ∆A + ∆B + ∆C + ∆D + ∆E + ∆F (successful)
</pre>
<p>But there was a conflict when merging "E2" and "F1":</p>
<pre class="literal-block">
F2 = 0 + ∆1 + ∆2 + ∆A + ∆B + ∆C + ∆D + ∆E + ∆F (conflict)
</pre>
<p>How do we understand this conflict? If we compare "E2", "F1", and
"F2", we see that "F2" was the first time that the changes introduced
in commit "2" came together with the changes introduced in commit "F".
So the fact that "F2" conflicts indicates that a change made in commit
"2" conflicts with a change made in commit "F".</p>
</div>
<div class="section" id="conflict-breeds-more-conflict">
<h4>Conflict breeds more conflict</h4>
<p>Once there is a merge conflict, it blocks our progress. If merge "F2"
fails, then there is no way that we can merge "F2" and "G1" to create
"G2" <a class="footnote-reference" href="#id4" id="id2">[2]</a>. Nor is it possible to merge "F2" with "E3" to create "F3".
Continuing this line of reasoning, we conclude that the one failed
merge at "F2" implies that all of the merges in the rectangle below
and to the right of it will also fail. So let's add that information
to the diagram:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | |
A - A1 - A2
| | |
B - B1 - B2
| | |
C - C1 - C2
| | |
D - D1 - D2
| | |
E - E1 - E2
| |
F - F1 x x x x x x x x x x
| |
G - G1 x x x x x x x x x x
| |
H - H1 x x x x x x x x x x
| |
I - I1 x x x x x x x x x x
↑
branch
</pre>
</div>
<div class="section" id="filling-the-rest-of-the-diagram">
<h4>Filling the rest of the diagram</h4>
<p>However, that doesn't mean that we are completely done. It is quite
possible that a merge of "3" and "A2" will still be successful, and in
general we can keep filling in the table in other directions until
<em>they</em> are blocked by other merge conflicts. The resulting diagram
might look like this:</p>
<pre class="literal-block">
o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 ← master
| | | | | | | | | | | |
A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8 - A9 - A10 - A11
| | | | | | | | |
B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8 x x x
| | | | | | |
C - C1 - C2 - C3 - C4 - C5 - C6 x x x x x
| | | | | | |
D - D1 - D2 - D4 - D4 - D5 - D6 x x x x x
| | | | | | |
E - E1 - E2 - E3 - E4 - E5 - E6 x x x x x
| |
F - F1 x x x x x x x x x x
| |
G - G1 x x x x x x x x x x
| |
H - H1 x x x x x x x x x x
| |
I - I1 x x x x x x x x x x
↑
branch
</pre>
</div>
<div class="section" id="the-conflict-frontier">
<h4>The conflict frontier</h4>
<p>What does a diagram like this tell us?</p>
<p>Even when it is impossible to merge a branch back to master directly,
it <em>is</em> usually possible to merge various parts of the branch with
parts of master. This is logical; not <em>every</em> change on the branch
necessarily conflicts with <em>every</em> change on master.</p>
<p>But there is a "conflict frontier" between the merges that are
successful and the merges that conflict. This frontier has a very
particular shape: the conflicted region is in the bottom-right corner
of the diagram, and it consists of rectangles jutting towards the
upper-left.</p>
<p>The apexes of these rectangles (i.e., "F2", "C7", and "B9" in our
example) indicate the earliest pairs of changes, one on each branch,
that conflict with each other (i.e., "F" vs. "2", "C" vs. "7", and "B"
vs. "9").</p>
<p>In future articles I will present an algorithm for <a class="reference external" href="http://softwareswirl.blogspot.de/2012/12/mapping-merge-conflict-frontier.html">efficiently
mapping out the conflict frontier</a>, show some <a class="reference external" href="http://softwareswirl.blogspot.de/2012/12/real-world-conflict-diagrams.html">examples of real-world
merge diagrams</a>, and discuss how to use incremental merging to attack
the nightmare merge.</p>
</div>
<div class="section" id="notes">
<h4>Notes:</h4>
<table class="docutils footnote" frame="void" id="id3" rules="none"><colgroup><col class="label" /><col /></colgroup><tbody valign="top"><tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td><p class="first">There are two obvious ways to generalize this approach when the
histories of master and branch are not linear:</p>
<ol class="last arabic simple"><li>Merge the individual commits anyway, preserving the topology
of each branch as you go. I think this should work, though
I haven't actually tried it. But in any case, it would be
difficult to visualize the result.</li>
<li>Apply the described procedure to only the main backbone of
commits on each branch, namely the commits along the
first-parent chain from each branch tip to the branch point.
Using git, this would correspond to the output of <tt class="docutils literal">git log
<span class="pre">--first-parent</span></tt>. This is the approach that I will take in
future articles.</li>
</ol></td></tr></tbody></table><table class="docutils footnote" frame="void" id="id4" rules="none"><colgroup><col class="label" /><col /></colgroup><tbody valign="top"><tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>Strictly speaking, even though merge "F2" failed, it might
still be possible to merge "G" and "2". For example, commit
"G" might have reverted commit "F", meaning that "G2" is
textually identical to the successful merge "E2". But in the
usual case, the failure of one merge means that all merges
below and to the right of it will fail, too.</td></tr></tbody></table></div>
</div>
Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com0tag:blogger.com,1999:blog-6771653874686998211.post-55203308435508929652012-11-08T14:32:00.005+01:002012-11-08T14:32:59.618+01:00Compressing permuted dataMark Nelson posed <a href="http://marknelson.us/2012/10/09/the-random-compression-challenge-turns-ten/">a challenge</a>:<br />
<blockquote>
[...] Challenge 2 is more along the lines of troll bait, because it is patently impossible: create a system to compress and then decompress <em>any</em> file of size 415,241 bytes. In other words, create a compressed file that is smaller than the input, then use only that compressed file to restore the original data.<br />
Unlike Challenge 1, there are no size limitations on Challenge 2. Your compressor and decompressor can be as large as you like. [...]<br />
To keep it simple, the files will be simple permutations of the million digit file – scrambled with an encryption system, or perhaps a random number generator, or whatever. The point is, they should have all the same numeric characteristics as the original file, just organized slightly differently.</blockquote>
He also supplied a file <span style="font-family: Courier New, Courier, monospace;">AMillionRandomDigits.bin</span>, which contains 415,241 random bytes.<br />
Of course it is well-known that it is impossible to write a compressor that can compress <i>every</i> possible data file. But the last paragraph states that the files to be compressed "will be simple permutations of the million digit file" [1]. As I will show below, it <i>is</i> possible to compress files that are bytewise permutations of a known file, albeit by only a small amount.<br />
<br />
The key point is that the number of permutations of <span style="font-family: Courier New, Courier, monospace;">AMillionRandomDigits.bin</span> is considerably smaller than the number of files of length 415241. In fact, it is possible to compute the numbers exactly:<br />
<br />
<pre>Number of files of length N = 256^N
Number of permutations of a file of length N = N! / (n_0! n_1! ... n_255!)
</pre>
<br />
where <span style="font-family: Courier New, Courier, monospace;">n_b</span> is the number of times that the byte value <span style="font-family: Courier New, Courier, monospace;">b</span> appears in the original file.<br />
The minimum size, in bits, to which a file from a particular ensemble can be compressed is given by the logarithm base 2 of the number of possible files. Evaluating the above formulas given the byte frequencies in <span style="font-family: Courier New, Courier, monospace;">AMillionRandomDigits.bin</span>, we find that an optimal compressor could compress any of these files by exactly 235 bytes or 0.06%. So a tiny extra crumb of information can be leveraged to effect a tiny bit of compression.<br />
I was initially surprised how little compression is made possible by knowledge of character frequency counts, though in retrospect it is pretty obvious:<br />
<br />
<pre>N = 415241
Each byte value appears approximately N/256 = 1622 times</pre>
<pre>The variation in byte frequencies is about ±sqrt(1622) ≈ ±40
That fact that byte b appears n_b times therefore gives you (very roughly)</pre>
<pre> lg(80) or 6.3 bits of information.</pre>
<pre>Since we have independent counts for 255 bytes, that gives about</pre>
<pre> 6.3 bits * 255 ≈ 200 bytes of information
</pre>
<br />
Thus this simple estimate suggests that the file can be compressed by about 200 bytes (not far off of the correct value, 235, as computed above).
<br />
<br />
Can we actually <i>write</i> a compressor/decompressor for such files? In fact, it is easy to compress the file by one byte. The compressor is trivial: simply drop the last byte from the file. To decompress,
<br />
<ol>
<li>Compute the byte frequencies in <span style="font-family: Courier New, Courier, monospace;">AMillionRandomDigits.bin</span></li>
<li>Compute the byte frequencies in the compressed file</li>
<li>Determine which byte value appears fewer times than expected in the compressed file</li>
<li>Append that byte to the end of the compressed file to produce the original</li>
</ol>
Implementing <i>optimal</i> compression for such a file is left as an exercise for the reader.<br />
<br />
<br />
[1] After I submitted my solution, Mark claimed that what he <i>meant</i> by "permutations" was not the mathematical meaning of the word, but rather...well, I don't really know; if it's not meant precisely then the whole paragraph is kindof meaningless.<br />
<br />Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com0tag:blogger.com,1999:blog-6771653874686998211.post-27709824374898493732011-06-24T05:51:00.005+02:002011-06-24T10:52:29.903+02:00Starting a Python script intelligentlyUnder Linux (or any Unix-style operating system), it is easy to make a Python script executable. You simply add a <a href="http://en.wikipedia.org/wiki/Shebang_%28Unix%29">"shebang" line</a> at the top of the script:
<pre>#! /usr/bin/python
</pre>
and make the script executable:
<pre>chmod +x myscript
</pre>
But real life is sometimes more complicated than this.
<ol>
<li>The Python interpreter might be somewhere else in the user's $PATH, for example <code>/usr/local/bin/python</code> or <code>$HOME/bin/python</code>.</li>
<li>There might be multiple Python interpreters and you might want to select one intelligently (e.g., use <code>/usr/bin/python2.6</code> if it is installed, otherwise fall back to <code>/usr/bin/python2.5</code>, but if neither is available then emit a warning and abort). </li>
<li>Other arbitrary work might have to be done before starting the Python interpreter.</li>
</ol>
So I was looking for a way to use a small blurb of shell code to intelligently start a Python script. Of course one could write a shell script that invokes a separate Python script, but for convenience, I wanted everything in a single file. And I wanted the bulk of the script to look like plain old Python code, not like a giant string that requires extra quoting.
Google failed me, so I put together the following solution:
<pre>#! /bin/sh
# -*- mode: python; coding: utf-8 -*-
# This file is used as both a shell script and as a Python script.
""":"
# This part is run by the shell. It looks for an appropriate Python
# interpreter then uses it to re-exec this script.
if test -x /usr/bin/python2.6
then
PYTHON=/usr/bin/python2.6
elif test -x /usr/bin/python2.5
then
PYTHON=/usr/bin/python2.5
else
echo 1>&2 "No usable Python interpreter was found!"
exit 1
fi
exec $PYTHON "$0" "$@"
" """
# The rest of the file is run by the Python interpreter.
__doc__ = """This string is treated as the module docstring."""
print "Hello world!"
</pre>
When this script is run:
<ul>
<li>The first line is a shebang that causes the script to be interpreted using /bin/sh; i.e., as a standard shell script.</li>
<li>The shell interpreter ignores the second line as a comment.</li>
<li>The line <code>""":"</code> is interpreted as the shell command <code>:</code> (yes, the colon character is the name of a shell command) with some funny quoting around it. The colon command does nothing, so interpretation continues on the next line.</li>
<li>The following lines are interpreted, one by one, as shell code. When interpretation reaches the line <code>exec $PYTHON "$0" "$@"</code>, it causes the shell process to end and the selected Python interpreter to be executed to interpret the same script with the original command-line arguments. Because the shell interprets lines as they are read, it does not matter that the rest of the script is not valid shell syntax.</li>
<li>The Python interpreter starts executing the script. It ignores the shebang line, which just looks like a comment.</li>
<li>It interprets the second line as per <a href="http://www.python.org/dev/peps/pep-0263/">PEP 0263</a> as an encoding declaration. It wouldn't be wise to pick a funky encoding, but utf-8 should be safe. This line also tells emacs to edit the file in Python mode even though it has a shell shebang line.</li>
<li>The lines between <code>""":"</code> and <code>" """</code> are seen as a Python multiline string and ignored. (Actually, they are used as the module docstring, but we will overwrite that in a moment.) The only constraint is that the string cannot contain three consecutive quotation marks.</li>
<li>The line <code>__doc__ = """..."""</code> overwrites the module docstring. This is where you should document the module.</li>
<li>The rest of the file can be arbitrary Python code. This is where you put your main Python script.</li>
</ul>
This is the tidiest way that I could come up with for making a single file that is both valid shell code and valid Python code. If you know of a better way, please leave a comment.Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com3tag:blogger.com,1999:blog-6771653874686998211.post-87025470105103830192010-10-11T11:25:00.003+02:002010-10-11T11:33:38.386+02:00The wacky world of compiling Java<p>Java compilers can supposedly figure out what needs to be compiled, but this feature is broken.</p>
<p>Consider the following three files:</p><p>
</p><ul>
<li><tt>A.java</tt>:
<pre>
public class A {
public static final int x = 1;
}
</pre>
</li>
<li><tt>B.java</tt>:
<pre>
public class B {
public static final int y = A.x * 10 + 1;
}
</pre>
</li>
<li><tt>C.java</tt>:
<pre>
public class C extends B {
public static void main(String[] argv) {
System.out.println(Integer.toString(B.y * 10 + 1));
}
}
</pre>
</li>
</ul>
<p>Usually, when one class depends on another, the Java compiler can figure out the dependency by looking at the associated class file. The trick in this case is that each class depends only on a <tt>public static final</tt> member of the previous class. The Java specification allows <tt>static final</tt> constants to be inlined, in which case the corresponding class file contains no trace of the dependency. This causes javac problems.</p>
<p>To illustrate the problem, I ran several tests. In each case I erased all files, created <tt>A.java</tt>, <tt>B.java</tt>, and <tt>C.java</tt> as shown above, and compiled them. Then I modified one or more of the files (by changing the numerical constant) and triggered a new compilation by doing the equivalent of "<tt>javac C.java</tt>". If the Java compiler could detect dependencies correctly, then it would determine that <tt>C.java</tt> depends on <tt>B.java</tt> which depends on <tt>A.java</tt>, and it would compile the modified file(s) and all dependent file(s). So, for example, if <tt>B.java</tt> is modified, then <tt>B.java</tt> and <tt>C.java</tt> should be recompiled, whereas the old <tt>A.class</tt> file should be used as-is. I ran these tests using the Java compilers that I have handy: OpenJDK 1.6.0_18, the GNU compiler for Java 4.4.3 (gcj), the Eclipse batch Java compiler 3.5.1, and the Eclipse IDE. The results are tabulated below.</p>
<table border="1" class="docutils">
<colgroup>
<col width="22%" />
<col width="28%" />
<col width="12%" />
<col width="12%" />
<col width="12%" />
<col width="16%" />
</colgroup>
<tbody valign="top">
<tr><td colspan="2">OpenJDK 1.6.0_18</td>
<td colspan="3">File(s) recompiled</td>
<td rowspan="2">Result?</td>
</tr>
<tr><td colspan="2"><tt>javac C.java</tt></td>
<td><tt>A.java</tt></td>
<td><tt>B.java</tt></td>
<td><tt>C.java</tt></td>
</tr>
<tr><td rowspan="6">Files changed</td>
<td>No files</td>
<td> </td>
<td> </td>
<td>X</td>
<td>Debatable</td>
</tr>
<tr><td><tt>A.java</tt></td>
<td> </td>
<td> </td>
<td>X</td>
<td>BROKEN!</td>
</tr>
<tr><td><tt>B.java</tt></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Too much</td>
</tr>
<tr><td><tt>A.java</tt> and <tt>B.java</tt></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td><tt>C.java</tt></td>
<td> </td>
<td> </td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td>All files</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
</tbody>
</table>
<table class="docutils" border="1">
<colgroup>
<col width="22%">
<col width="28%">
<col width="12%">
<col width="12%">
<col width="12%">
<col width="16%">
</colgroup>
<tbody valign="top">
<tr><td colspan="2">OpenJDK 1.6.0_18</td>
<td colspan="3">File(s) recompiled</td>
<td rowspan="2">Result?</td>
</tr>
<tr><td colspan="2"><tt>javac -Xprefer:newer C.java</tt></td>
<td><tt>A.java</tt></td>
<td><tt>B.java</tt></td>
<td><tt>C.java</tt></td>
</tr>
<tr><td rowspan="6">Files changed</td>
<td>No files</td>
<td> </td>
<td> </td>
<td>X</td>
<td>Debatable</td>
</tr>
<tr><td><tt>A.java</tt></td>
<td> </td>
<td> </td>
<td>X</td>
<td>BROKEN!</td>
</tr>
<tr><td><tt>B.java</tt></td>
<td> </td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td><tt>A.java</tt> and <tt>B.java</tt></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td><tt>C.java</tt></td>
<td> </td>
<td> </td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td>All files</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
</tbody>
</table>
<table class="docutils" border="1">
<colgroup>
<col width="22%">
<col width="28%">
<col width="12%">
<col width="12%">
<col width="12%">
<col width="16%">
</colgroup>
<tbody valign="top">
<tr><td colspan="2">OpenJDK 1.6.0_18</td>
<td colspan="3">File(s) recompiled</td>
<td rowspan="2">Result?</td>
</tr>
<tr><td colspan="2"><tt>javac -Xprefer:source C.java</tt></td>
<td><tt>A.java</tt></td>
<td><tt>B.java</tt></td>
<td><tt>C.java</tt></td>
</tr>
<tr><td rowspan="6">Files changed</td>
<td>No files</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Too much</td>
</tr>
<tr><td><tt>A.java</tt></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td><tt>B.java</tt></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Too much</td>
</tr>
<tr><td><tt>A.java</tt> and <tt>B.java</tt></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Too much</td>
</tr>
<tr><td><tt>C.java</tt></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Too much</td>
</tr>
<tr><td>All files</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Too much</td>
</tr>
</tbody>
</table>
<table border="1" class="docutils">
<colgroup>
<col width="22%" />
<col width="28%" />
<col width="12%" />
<col width="12%" />
<col width="12%" />
<col width="16%" />
</colgroup>
<tbody valign="top">
<tr><td colspan="2">gcj 4.4.3 (GNU compiler for Java)</td>
<td colspan="3">File(s) recompiled</td>
<td rowspan="2">Result?</td>
</tr>
<tr><td colspan="2"><tt>gcj -C C.java</tt></td>
<td><tt>A.java</tt></td>
<td><tt>B.java</tt></td>
<td><tt>C.java</tt></td>
</tr>
<tr><td rowspan="6">Files changed</td>
<td>No files</td>
<td> </td>
<td> </td>
<td>X</td>
<td>Debatable</td>
</tr>
<tr><td><tt>A.java</tt></td>
<td> </td>
<td> </td>
<td>X</td>
<td>BROKEN!</td>
</tr>
<tr><td><tt>B.java</tt></td>
<td> </td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td><tt>A.java</tt> and <tt>B.java</tt></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td><tt>C.java</tt></td>
<td> </td>
<td> </td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td>All files</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
</tbody>
</table>
<table class="docutils" border="1">
<colgroup>
<col width="22%">
<col width="28%">
<col width="12%">
<col width="12%">
<col width="12%">
<col width="16%">
</colgroup>
<tbody valign="top">
<tr><td colspan="2">Eclipse batch compiler 3.5.1</td>
<td colspan="3">File(s) recompiled</td>
<td rowspan="2">Result?</td>
</tr>
<tr><td colspan="2"><tt>ecj C.java</tt></td>
<td><tt>A.java</tt></td>
<td><tt>B.java</tt></td>
<td><tt>C.java</tt></td>
</tr>
<tr><td rowspan="6">Files changed</td>
<td>No files</td>
<td> </td>
<td> </td>
<td>X</td>
<td>Debatable</td>
</tr>
<tr><td><tt>A.java</tt></td>
<td> </td>
<td> </td>
<td>X</td>
<td>BROKEN!</td>
</tr>
<tr><td><tt>B.java</tt></td>
<td> </td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td><tt>A.java</tt> and <tt>B.java</tt></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td><tt>C.java</tt></td>
<td> </td>
<td> </td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td>All files</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
</tbody>
</table>
<table class="docutils" border="1">
<colgroup>
<col width="22%">
<col width="28%">
<col width="12%">
<col width="12%">
<col width="12%">
<col width="16%">
</colgroup>
<tbody valign="top">
<tr><td colspan="2">Eclipse IDE</td>
<td colspan="3">File(s) recompiled</td>
<td rowspan="2">Result?</td>
</tr>
<tr><td colspan="2">Files edited within Eclipse than saved</td>
<td><tt>A.java</tt></td>
<td><tt>B.java</tt></td>
<td><tt>C.java</tt></td>
</tr>
<tr><td rowspan="6">Files changed</td>
<td>No files</td>
<td> </td>
<td> </td>
<td> </td>
<td>Success</td>
</tr>
<tr><td><tt>A.java</tt></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td><tt>B.java</tt></td>
<td> </td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td><tt>A.java</tt> and <tt>B.java</tt></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td><tt>C.java</tt></td>
<td> </td>
<td> </td>
<td>X</td>
<td>Success</td>
</tr>
<tr><td>All files</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Success</td>
</tr>
</tbody>
</table>
<h2>Flavors of fail</h2>
<p>The Java compilers err in two ways:</p>
<ul>
<li>Compiling too much — one or more files were compiled even though neither the file nor any of its dependencies were changed. This mistake results in correct build products, but creates extra work for the compiler (namely, a full build every time).</li>
<li>Compiling too little — a file that was changed was not recompiled, even though it was a (direct or indirect) dependency of <tt>C.java</tt>. This mistake results in incorrect build products and a <em>broken application</em>!</li>
</ul>
<p>As you can see, the only method of compilation that compiled the correct, minimum amount of code in each case was the automatic compilation that is done by the Eclipse IDE. Unfortunately, an IDE cannot practically be used in an automated build system.</p>
<p>The only way reliably to get correct build results from a command line compiler is to use the <tt>-Xprefer:source</tt> option to the OpenJDK compiler. What does this option do? It causes the compiler to ignore any existing class files and compile <em>everything</em> from source at each build. Forget incremental compilation.</p>
<p>All other options fail to notice that <tt>A.java</tt> and <tt>B.java</tt> need to be recompiled when <tt>A.java</tt> is changed. Presumably the compiler notices, when compiling <tt>C.java</tt>, that it depends on <tt>B.class</tt>. But, not being able to tell from <tt>B.class</tt> that it depends on <tt>A.class</tt>, it thinks that the <tt>B.class</tt> file is up to date and so never even looks in <tt>B.java</tt> or <tt>A.java</tt>. The result is that the build is incomplete and the application is broken.</p>
<p>A broken build is a nasty thing because it introduces an inconsistency at a level where it is not expected — the application runs differently than the source code specifies. This can easily result in a debugging wild goose chase (correct source code that mysteriously doesn't run correctly) or a broken deployment (the incorrectly-compiled code passed all tests, but a full build at deployment exposed the broken code). I am not willing to tolerate even a small chance of a broken build.</p>
<h2>What can be done?</h2>
<p>It is not reasonable to rely on an IDE for building code to be deployed. Among other things, an IDE is hard to automate, and one would need to ensure that developers are all using exactly the same version of the IDE. Therefore, using the Eclipse IDE for deployment builds is not an alternative.</p>
<p>If each class file would note which other files it depends on (including via <tt>static final</tt> uses), then it would be easy to teach <tt>javac</tt> to compile the right thing automatically. I'm no expert on this topic but I fear that this would require a change to the class file format, which more or less needs unanimous approval of the United Nations General Assembly.</p>
<p>If there were a tool that could reliably determine Java inter-class dependencies, it could be used by other tools to deduce which files need to be recompiled given a particular source code change. (What is needed is a sort of <tt>gcc -M</tt> for Java.) I don't know of such a tool, but feedback would be very welcome.</p>
<p>So to summarize, I don't know a good solution to this problem. In practice, I rely on Eclipse's recompilation for most day-to-day use, and do an automated full build when I want to be sure. But this means that I spend a lot of time waiting for javac when I would rather be coding.</p>
<p>Feedback about this topic would be very welcome. What do you use for compiling Java code? Is it reliable? Has the lack of reliable tools caused you grief? What projects should we keep an eye on?</p>Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com1tag:blogger.com,1999:blog-6771653874686998211.post-76924703328918003572009-10-02T21:45:00.003+02:002010-10-11T09:14:45.661+02:00GNU make trick for handling long lists<p>If you compile a big Java project using Make, you might have discovered that there is a limit to the length of a command line.</p>
<p>The problem is that the Java virtual machine is so slow to start up, and it is so difficult to determine Java dependencies reliably, that it is advantageous to compile all of your Java files in a single invocation of javac. You also know that <a href="http://miller.emu.id.au/pmiller/books/rmch/">recursive make is considered harmful</a>, so you want to create your big list of Java files by including module makefiles in the project directories [1]. So you naively do the following:</p>
<pre>
JAVA_SRC :=
# ...here include lots of module makefiles that add files to JAVA_SRC...
.PHONY: java_class_files
java_class_files: $(JAVA_SRC)
$(JAVAC) $(JAVAC_FLAGS) $(JAVA_SRC)
</pre>
<p>But sooner or later this starts to fail, because the javac command line gets too long. How long is too long? On my Linux computer, the maximum length of a command line is about 128 kB, which (at, say, 40 characters per filename) comes out to about 3200 files. Sure, that's a lot of files, but our main project at work passed that number a couple of years ago. So what do you do?</p>
<p>javac has the nice feature that you can put the list of files to compile into a text file, and specify the name of this text file on the command line preceded by an "@" sign: <tt>javac @java-file-list</tt>. This sounds promising, so you try writing the list of Java source files to a file:</p>
<pre>
JAVA_SRC :=
# ...here include lots of module makefiles that add files to JAVA_SRC...
.PHONY: java_class_files
java_class_files: $(JAVA_SRC)
@echo $JAVA_SRC >java-filelist
$(JAVAC) $(JAVAC_FLAGS) @java-filelist
</pre>
<p>But that obviously won't work, because the "echo" command line is just as long as the old "javac" command line. Similarly, xargs(1) doesn't get you anywhere because there is no obvious way to get the list of filenames onto the standard input of the xargs command. It's a tough nut to crack.</p>
<p>(At about this time you wonder why you ever decided to use make on a Java project, but we'll leave <span style="font-style: italic;">that</span> topic for another day.)</p>
<p>It turns out that there is a solution. Recent versions of GNU make provide <span style="font-style: italic;">just enough</span> functionality to barely allow a general solution to this problem. It relies on a few of make's primitive functions:</p>
<ul>
<li> $(if CONDITION,THEN-PART[,ELSE-PART]) — If CONDITION is a non-empty string, then expand to THEN-PART; otherwise expand to ELSE-PART.
</li>
<li>$(call VARIABLE,PARAM1,PARAM2,...) — Expand the contents of VARIABLE, substituting the PARAM1 wherever $(1) appears, PARAM2 for $(2), etc.</li>
<li>$(words TEXT) — Return the number of whitespace-separated words in TEXT.
</li>
<li>$(word N,TEXT) — Return the Nth whitespace-separated word of TEXT, or the empty string if there are fewer than N words in TEXT.</li>
<li>$(wordlist S,E,TEXT) — Return the list of words in TEXT with indices S through E.</li>
</ul>
<p>With these tools and a little bit of recursion, we can write a make function that is the approximate equivalent of the unix xargs command:</p>
<pre>
# Macro for invoking a command on groups of 1000 words at a time
# (analogous to xargs(1)). The macro invokes itself recursively
# until the list of words is depleted.
#
# Usage: $(call xargs,COMMAND,LIST)
#
# COMMAND should be a shell command to which the words will be
# appended as arguments in groups of 1000.
define xargs
$(1) $(wordlist 1,1000,$(2))
$(if $(word 1001,$(2)),$(call xargs,$(1),$(wordlist 1001,$(words $(2)),$(2))))
endef
</pre>
<p>How does this work? The first line invokes the command (i.e., $(1)) with the first 1000 words as arguments. The second line checks whether there is a word number 1001; if so, it invokes xargs recursively with words 1001 through the last word in the list. [2]</p>
<p>Now we use the new xargs function to define a function that writes all of the words from a list into a file:</p>
<pre>
# Macro for writing the contents of a word list to a file, one word
# per line. This macro will work even for very long lists of words.
#
# Usage: $(call write-to-file,FILENAME,LIST)
define write-to-file
@: >$(1)
$(call xargs,@printf "%s\n" >>$(1),$(2))
endef
</pre>
<p>The first line uses the shell builtin ":" to create the new file (and erase any old contents that the file might have had). The second line uses xargs to invoke the shell builtin printf with the filenames as arguments, 1000 at a time, and appends the results to the new file. The formatting string passed to printf causes it to print a newline after each filename rather than, say, 1000 files on each line.</p>
<p>With these functions in hand, the Java files can be compiled as follows:</p>
<pre>
JAVA_SRC :=
# ...here include lots of module makefiles that add files to JAVA_SRC...
JAVA_FILE_LIST := java-file-list
.INTERMEDIATE: $(JAVA_FILE_LIST)
$(JAVA_FILE_LIST): $(JAVA_SRC)
$(call write-to-file,$@,$(JAVA_SRC))
.PHONY: java_class_files
java_class_files: $(JAVA_FILE_LIST)
$(JAVAC) $(JAVAC_FLAGS) @$(JAVA_FILE_LIST)
</pre>
<p>Well, as the Germans say, "nicht schön aber selten". But don't you get the feeling that this would all be a lot easier in a real programming language?</p>
<p>[1] If you want to compile <span style="font-style:italic;">every</span> Java file within a set of subdirectories, then you can generate the list of files using the find(1) command. But I'm assuming that you want to have finer control over exactly which Java files are compiled in a particular build.</p>
<p>[2] It is possible to make the number "1000" another parameter to the xargs function, but it is ugly because the macro also needs the analogue of "1001". Instead, to get the list of words starting at 1001, you can first get the list of words starting at 1000, then get the sublist of <span style="font-style:italic;">that</span> list starting at word number 2. Or, at the cost of starting another process, you could use shell facilities for doing arithmetic.</p>Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com1tag:blogger.com,1999:blog-6771653874686998211.post-60147132057284947242009-08-13T01:30:00.007+02:002010-10-11T09:08:55.320+02:00Git, Mercurial, and Bazaar—simplicity through inflexibility<p>Git, Mercurial, and Bazaar, the most prominent DAG-based distributed version control systems, have rightly earned a reputation for making branching and merging easy enough to do every day. And their proponents often deride Subversion for making branching easy but merging difficult.</p>
<p>What is the magic pixie dust that allows DVCSs to merge so effortlessly? And why is Subversion <a href="http://www.youtube.com/watch?v=4XpnKHJAok8">too stupid</a> to get it right? (And how can the author of this blog possibly use the words "simplicity" or "inflexibility" in the same breath as "git"???)</p>
<p><span style="font-weight: bold;">Partisanship disclaimer</span>: I like both Subversion and git and use them both daily, and I expect that I would like Mercurial and Bazaar if I had time to work with them. I have contributed patches to the testing frameworks of both Subversion and git, and I am the maintainer of <a href="http://cvs2svn.tigris.org/">cvs2svn</a>/<a href="http://cvs2svn.tigris.org/cvs2git.html">cvs2git</a>. In other words, I'm a fan of <span style="font-style: italic;">all</span> of the VCSs discussed in this article.</p>
<p>Contrary to what you might think, DVCSs' merging prowess is not because those projects have rocket scientists developing infallible merge algorithms. Their algorithms work better, but not radically so. Merge conflicts are an unavoidable aspect of parallel development, and nasty ones will occur regardless of what VCS is being used.</p>
<p>Rather, it is because the DAG-based DVCSs <span style="font-weight: bold;">cannot represent complicated histories</span> and therefore <span style="font-weight:bold;">do not have</span> to merge them.</p>
<h2>Inflexible DVCSs</h2>
<p>To understand merging in DAG-based VCSs, it is necessary to understand the Fundamental Law of DAG-based VCSs:</p>
<blockquote><span style="font-style: italic;"><span style="font-weight: bold;">Every</span> change made by <span style="font-weight: bold;">every</span> commit that is an ancestor of a given revision is assumed to be incorporated into that revision.</span>
</blockquote>
<p>The Fundamental Law simplifies merging because it vastly restricts the types of histories that can be represented in the VCS. For example, the DAG cannot represent the history of</p>
<ul><li>Cherry picking, in which some (but not all) commits from one branch are incorporated into another branch</li><li>Rebases in which the new parent is not a descendant of the old parent</li><li>Rebases in which the order of the commits is changed
</li><li>Merging changes from branch to branch at a less-than-full-tree granularity; for example, merging the changes to file1 but not the changes to file2
</li></ul>
<p>Of course I am not claiming that the DVCSs do not support these operations. Rather, I am claiming that their support, where it exists, is broken because the DAG cannot properly record the histories of such operations. That is why, for example, it is problematic to rebase branches that have been published (though for some kinds of rebases <a href="http://softwareswirl.blogspot.com/2009/04/truce-in-merge-vs-rebase-war.html">there may be a solution</a>). It is also why it is common to have conflicts when merging two branches if some content has already been cherry-picked from one branch to the other.</p>
<h2>Flexible Subversion</h2>
<p>Subversion, on the other hand, completely earned its reputation for making merging difficult—primarily because it didn't record merges at all. That made merging a nightmare of manual record-keeping or reliant on third-party tools like <a href="http://www.orcaware.com/svn/wiki/Svnmerge.py">svnmerge.py</a>. But that was before Subversion release 1.5.</p>
<p>Starting with release 1.5, Subversion, ironically, supports a much more flexible model of merging than the DAG-based DVCSs. Changes from any commit can be merged to any branch at the single-file level of granularity, enabling all of the operations listed above and some even weirder things (for example, a change that was originally applied to one file can be "merged" onto a completely different file). If your workflow demands this sort of thing, Subversion might hold significant advantages for you.</p>
<p>But there are also many disadvantage to Subversion's flexibility:</p>
<ul>
<li>Subversion's merging model is more complicated than that of DAG-based VCSs, and therefore more complicated to implement and less predictable.</li>
<li>It is much harder to visualize the history of a Subversion project (contrast that to DVCSs, whose history can be displayed as a single DAG).</li>
<li>Subversion merges are innately slow, because of the large quantities of metadata that have to be manipulated.</li>
<li>The bookkeeping of SVN merge info requires more user conscientiousness, and mistakes are not as easy to spot and fix.</li>
</ul>
<h2>Simplicity vs. flexibility</h2>
<p>Of course, each of these systems has dramatic strengths and weaknesses related to implementation quality, user interface, etc. But on the single criterion of how merge history is recorded, I claim that git, Mercurial, and Bazaar are fundamentally simpler than Subversion because they are less flexible.</p>
<p>So, does Subversion's flexibility mean that it is better than the DVCSs? Definitely not. Many projects don't need the flexibility offered by Subversion—or even worse, they <span style="font-style: italic;">mistakenly think</span> that they need it. For such projects, the simpler DAG-based model is adequate and will make branching and merging go more smoothly. Other projects, particularly those that rely heavily on cherry-picking, might consider whether Subversion, or perhaps a DVCS like <a href="http://darcs.net/">darcs</a> that has a different model, might be a better fit. And there are certainly other considerations, besides the history model, that will influence the decision of which VCS to use.</p>
<p>Me? I'm going to continue using both.</p>Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com4tag:blogger.com,1999:blog-6771653874686998211.post-16457176633719152202009-08-09T23:50:00.004+02:002010-10-11T09:04:48.145+02:00Rebase with history -- implementation ideas<p>In my last blog posts, I discussed and idea for <a href="http://softwareswirl.blogspot.com/2009/04/truce-in-merge-vs-rebase-war.html">retaining history when rebasing</a> and <a href="http://softwareswirl.blogspot.com/2009/08/upstream-rebase-just-works-if-history.html">a concrete example</a> of how it improves on traditional rebasing without history. In this third post in the series, I would like to discuss how rebase with history could be implemented.</p>
<p>First, a reminder of what a simple rebase-with-history looks like:</p>
<pre>
M0----M1----M2 M0----M1----M2
\ \ \
\ --> \ B1'---B2'
\ \ / /
B1----B2 -------B1----B2
</pre>
<p>The work B1+B2 on a feature branch as been rebased onto the current version M2 of the master branch, producing two rebased revisions B1' and B2'. A traditional rebase-without-history would discard commits B1 and B2, whereas the point of rebase-with-history is to retain them.</p>
<p>A rebase-with-history can be done manually, today, in any of the DAG-based DVCSs. The algorithm is basically:</p>
<ol><li>Ensure that the old parent of the rebased branch (M0 in the example above) is a direct ancestor of the new parent (M2).</li><li>Create a new temporary branch at M2.</li><li>For each commit that is to be rebased, merge that commit onto the temporary branch, resolving conflicts if necessary. This creates merge commits B1' then B2'.</li><li>Designate the last new merge commit (B2') at the head of the temporary branch to be the new head of the feature branch.
</li></ol>
<h2>Avoiding information overload</h2>
<p>Earlier I argued that the full DAG resulting from a rebase-with-history contains more detail than the average user wants to see. After all, many people are happy with the current style of rebase-without-history, which completely discards B1 and B2. So how can we reduce the visual clutter without losing the historical information that is needed to support future merges? In other words, assume that the VCS wants to store the above DAG, but the typical user typically wants to see this:</p>
<pre>
M0----M1----M2
\
B1'---B2'
</pre>
<p>How can this dualism be implemented?</p>
<p>It is clear that the commits B1 and B2 have to be retained in the DAG rather than recorded in some kind of supplemental metadata. Otherwise, many VCS commands would have to be modified to be aware of this metadata and treat B1 and B2 <span style="font-style: italic;">as if</span> they were in the DAG. It is also necessary to prevent B1 and B2 from being garbage collected. So it is simpler to retain them in the DAG proper.</p>
<p>It is also clear that somehow marking the pre-rebase commits B1 and B2 themselves "historical" is not correct. After all, those commits may be integral to other branches in which they are not "historical". Perhaps the rebased version of the branch will turn out to be a false start and never even be pushed upstream. So it is not the commits themselves that are "historical"; it is only their relationship with the rebased commits B1' and B2' that is "historical".</p>
<p>And that, of course, suggests the correct solution. Commit B1' has two parents, M2 and B1. M2 is always interesting and should be recorded as a normal parent. But B1 is interesting mainly as the historical predecessor of B1', but usually <span style="font-style: italic;">not</span> as part of the visible history in the upstream repository. Therefore, within the B1' node, B1 should be recorded as a parent but with a special "historical" flag.</p>
<p>Similarly, commit B2' has two parents--B1', which is a "normal" parent, and B2, which should be recorded as a "historical" parent.</p>
<p>The "historical" flag would be mainly a hint to GUI-related commands that the linked-to parent can be suppressed in most user views. When walking the DAG for display purposes, "historical" parents would simply not be followed. The historical commits B1 and B2 are only connected to the branch head via "historical" parent links, so they will not be traversed. By this mechanism, the DAG of the branch is displayed just like the DAG of a rebase-without-history. Of course, if commits B1 and B2 are non-historical ancestors of other branches, they will be displayed normally when the ancestry of those branches is displayed.</p>
<p>The GUI-related commands should have an extra --full-history flag to override the default behavior, causing historical parents to be displayed along with non-historical parents. (Optionally, historical links could be flagged somehow to allow them to be colored differently, etc.)</p>
<p><span style="font-weight: bold;">CAVEAT:</span> Please recall that this whole discussion only applies to the "feature branch tracking upstream" class of rebase operations, namely those where:</p>
<ul><li>the order of commits in the branch is unchanged, and</li><li>the new location of the branch is a descendant of the old location of the branch.</li></ul>
<p>However, this is a common and important class of rebases. Other types of rebases (e.g., those that change the order of commits) cannot be done with history retention, because there is no way to represent such history within a DAG model. (darcs uses different technology and therefore doesn't share this limitation.)</p>
<p><span style="font-weight: bold;">Feedback:</span> I would be very happy to receive feedback about this proposal for "rebase with history". Would it be helpful? Are there corner-cases that would prevent its implementation? Where has it been discussed before? (There's nothing new under the sun, you know...) Please add comments!</p>
<p><span style="font-style:italic;">[Modified 2009-08-13 to add an outline of the rebase-with-history algorithm.]</span></p>Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com4tag:blogger.com,1999:blog-6771653874686998211.post-34512234584997307092009-08-09T13:56:00.012+02:002010-10-11T09:02:30.947+02:00Upstream rebase Just Works™ if history is retained<p>The <a href="http://www.kernel.org/pub/software/scm/git/docs/git-rebase.html">git-rebase documentation</a> has a section entitled "Recovering from Upstream Rebase" describing how to straighten out the mess that results from an upstream repository doing a rebase. In my <a href="http://softwareswirl.blogspot.com/2009/04/truce-in-merge-vs-rebase-war.html">last blog post</a> I described how simple rebases can be done with history, and claimed that this makes it possible to share a repository that has been rebased. In this post, I would like to show how the example from the git-rebase manual page work out if rebase history is retained.</p>
<p>First, the example <span style="font-style:italic;">without</span> history. The premise is that you are working on a topic branch that is dependent on a subsystem that somebody else is working on:</p>
<pre>
m---m---m---m---m---m---m---m master
\
o---o---o---o---o subsystem
\
*---*---* topic
</pre>
<p>Then the subsystem developer rebases subsystem onto master:</p>
<pre>
m---m---m---m---m---m---m---m master
\ \
o---o---o---o---o o'--o'--o'--o'--o' subsystem
\
*---*---* topic
</pre>
<p>If you now merge the topic branch to the subsystem branch, a mess results:</p>
<pre>
m---m---m---m---m---m---m---m master
\ \
o---o---o---o---o o'--o'--o'--o'--o'--M subsystem
\ /
*---*---*-..........-*--* topic
</pre>
<p>The first problem here is that git has no idea that the two versions of the subsystem revisions, namely "o-o-o-o-o" and "o'-o'-o'-o'-o'", actually contain the same logical changes. Therefore, the merge that creates commit M can result in spurious conflicts. Specifically, git will try to combine the changes in "o-o-o-o-o" with those in "o'-o'-o'-o'-o'", and may or may not be able to make sense out of the situation.</p>
<p>Second, in some (admittedly rather unlikely) situations, git's ignorance can cause problems with future merges.</p>
<p>Finally, there is no easy way to convert the merged topic branch into a series of patches that apply cleanly to the rebased subsystem branch and can therefore be submitted upstream. In this case, the only available patch that applies cleanly to the rebased subsystem branch is the merge patch M, which is a single commit that squashes together the entire topic branch and is therefore difficult to review.</p>
<p>In fact, in the git world you typically wouldn't merge the topic branch into the subsystem branch. Instead, you would rebase the topic branch onto the rebased subsystem branch:</p>
<pre>
m---m---m---m---m---m---m---m master
\
o'--o'--o'--o'--o' subsystem
\
*'--*'--*' topic
</pre>
<p>This gives a usable result, but requires extra effort by the topic branch maintainer (and the maintainers of any other branches that rely on the subsystem branch or the topic branch). Because this rebase is also prone to conflicts unless you explicitly tell git what interval of revisions to rebase, using a command like <code>git rebase --onto subsystem SHA1 topic</code>. SHA1 refers to the last commit in the pre-rebase "o-o-o-o-o" series, and you have to determine it yourself. Rebase-with-history relieves the user from having to do this kind of manual bookkeeping.</p>
<p>If, instead, the subsystem maintainer had retained history when rebasing as described in <a href="http://softwareswirl.blogspot.com/2009/04/truce-in-merge-vs-rebase-war.html">my earlier post</a>, the DAG would look like this:</p>
<pre>
m---m---m---m---m---m---m---m master
\ \
\ o'--o'--o'--o'--o' subsystem
\ / / / / /
--------------------o---o---o---o---o
\
*---*---* topic
</pre>
<p>Now you are free to merge the topic branch with the subsystem branch:</p>
<pre>
m---m---m---m---m---m---m---m master
\ \
\ o'--o'--o'--o'--o' subsystem
\ / / / / / \
--------------------o---o---o---o---o ---------
\ \
*---*---*---x topic
</pre>
<p>The fact that both the old and new versions of the subsystem revisions are still in the DAG is not a problem, because the dependencies make it clear to git that the old version is already incorporated into the new version, so git will never try to merge them again. But this is also not ideal, because it still doesn't allow the topic branch to be converted into a tidy patch series.</p>
<p>Even better, you can even <span style="font-style:italic;">rebase</span> your changes onto the subsystem branch, preferably also retaining rebase history:</p>
<pre>
m---m---m---m---m---m---m---m master
\ \
\ o'--o'--o'--o'--o' subsystem
\ / / / / / \
--------------------o---o---o---o---o \
\ \
\ *'--*'--*' topic
\ / / /
*---*---*
</pre>
<p>Whether you retain history during this rebase or not, the rebase is simple because git can figure out for itself what range of revisions to rebase. The result is also a well-formed repository, and will not create any problems for other developers who might have based their work on your topic branch. And when you are ready, you can submit the patches "*'-*'-*'" as-is to the subsystem maintainer.</p>
<p>Remember that just because the old and new versions of rebased revisions are included in the DAG doesn't mean that they have to be displayed in the UI all of the time. If there were a way to mark the old versions "uninteresting", then the "interesting" part of the DAG could be displayed just like in the rebase-without-history case:</p>
<pre>
m---m---m---m---m---m---m---m master
\
o'--o'--o'--o'--o' subsystem
\
*'--*'--*' topic
</pre>
<p>but the retention of the hidden pre-rebase commits enables future rebases and merges to be carried out smoothly.</p>
<p><span style="font-style:italic;">[Modified 2009-08-16 based on discussion the git mailing list; thanks especially to Björn Steinbrink for his comments.]</span></p>
<p>See also my related posts:</p>
<ul><li><a href="http://softwareswirl.blogspot.com/2009/04/truce-in-merge-vs-rebase-war.html">an introduction to rebase-with-history</a></li><li><a href="http://softwareswirl.blogspot.com/2009/08/rebase-with-history-implementation.html">thoughts about its implementation</a></ul>Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com0tag:blogger.com,1999:blog-6771653874686998211.post-31194694282502299012009-08-08T23:40:00.008+02:002010-10-11T08:59:42.019+02:00A truce in the merge vs. rebase war?<p>There is a bit of a cultural war going on between the various DVCS communities. Git users love to use "git rebase" to rewrite the history in their local repository before pushing a perfect set of patches upstream, even though they realize that using rebase on work that has already been published causes lots of trouble. Bazaar users, on the other hand, tend to insist that history should be immutable and that merging is preferred over rebasing, and are shocked, <span style="font-weight: bold;">SHOCKED</span> at the libertine ways of their git counterparts.</p>
<p>In this blog post, I want to talk about the correspondence between rebasing and merging, and hope to show that there is a middle ground that has some of the advantages of both extremes. I will restrict myself to discussing a very specific kind of rebasing, namely tracking upstream changes while working on a feature branch.</p>
<p>In this post I will use ASCII art to represent repository DAGs; in these drawings, time progresses from left to right.</p>
<p>Suppose somebody is working on a feature branch that is logically developed as several smaller steps B1, B2, ..., and meanwhile development is continuing on master as M1, M2, ....</p>
<h2>The problem with repeated merges from master to branch</h2>
<p>After the feature branch has been developed for a while, the DAG looks like this.</p>
<pre>
M0----M1----M2
\
\
\
B1----B2
</pre>
<p>If master is merged to the feature branch, we get</p>
<pre>
M0----M1----M2
\ \
\ \
\ \
B1----B2----X1
</pre>
<p>where X1 is a merge commit that "corrects" B1+B2 for the changes that have occurred in master as M1+M2. A little while later, after work has proceeded on both master and the feature branch, the DAG looks like this</p>
<pre>
M0----M1----M2----M3----M4
\ \
\ \
\ \
B1----B2----X1----B3----B4
</pre>
<p>and the next merge changes it to</p>
<pre>
M0----M1----M2-------M3-------M4
\ \ \
\ \ \
\ \ \
B1----B2----X1----B3----B4----X2
</pre>
<p>Where X2 is a merge commit that "corrects" B1+B2+X1+B3+B4 for the changes that occurred in master as M3 and M4.</p>
<p>The problem is that it is very difficult for a reviewer to check changes that are submitted in this form.</p>
<p>The reviewer could review the overall difference between M4 and X2, but then all of the changes would be mashed together and he wouldn't have the benefits of the commit messages explaining the individual steps.</p>
<p>The reviewer could review the changes patch by patch, but
<ul><li>He is most familiar with the current state of master. But he has to analyze the earlier patches in the context of older versions of master (M0 and M2).</li><li>The merge patches X1 and X2 are especially difficult to analyze. X2, for example, reflects the changes needed to adapt <em>multiple</em> patches (B1+B2+X1+B3+B4) to <em>multiple</em> changes on master (M1+M2+M3+M4). X1 cannot be checked without understanding the changes made between two historical master versions (M0..M2). Can <em>you</em> remember exactly what happened in your code base from four weeks ago until two weeks ago?</li></ul>What's more, the patch series contains information that is of marginal usefulness to a reviewer. Why should the reviewer care how a logical change <em>would have</em> been made against a historical master revision, when his decision is actually whether to apply a change to the current master revision?</p>
<p>It is true that no information is lost, but the information is not <em>retained</em> in a very convenient form.</p>
<hr />
<h2>The problems with rebasing</h2>
<p>The same situation, after the first rebase, now looks like this:</p>
<pre>
M0----M1----M2
\
\
\
B1'---B2'
</pre>
<p>The original patches B1 and B2 have been transformed into patches B1' and B2', which are hopefully logically equivalent.</p>
<p>After further work, the DAG looks like</p>
<pre>
M0----M1----M2----M3----M4
\
\
\
B1'---B2'---B3----B4
</pre>
<p>which is rebased a second time to</p>
<pre>
M0----M1----M2----M3----M4
\
\
\
B1''--B2''--B3'---B4'
</pre>
<p>This patch series is all very neat and clean and very easy to review. The patches can all be understood within a single context -- the current state of master -- that the reviewer is presumably most familiar with. The rebased patch series is optimized for the reviewer.</p>
<p>The problem is that rebasing discards history, perhaps not in the strict sense that intermediate commits are gone, but in the sense that there is no useful record of how the rebased patches are related to the original patches.</p>
<p>Specifically, the history of the individual patches has been lost (e.g., B1 -> B1' -> B1''). This loss is particularly unfortunate because these transformations were presumably done automatically by the VCS, which might have introduced errors. The author's original intention, as expressed in the original hand-written patches (B1, B2, B3, B4), has been lost. There is no way for the reviewer to drill down and tell the author, for example, that "when rebasing from B2' to B2'', you forgot to take into account change foo in M3".</p>
<p>Perhaps the worst problem with traditional rebasing is that it prevents collaboration. Somebody who pulls from a repository before and after a rebasing will experience conflicts because the two histories contradict each other. Thus the standard caveat "don't rebase within a published repository", which can be reworded to "don't collaborate on work that you might later want to rebase".</p>
<hr />
<h2>How should rebase really work?</h2>
<p>For the simplest type of rebasing discussed above, it is simple to get both reviewing convenience and historical accuracy by changing rebase to retain a little bit more information about what it did.</p>
<p>Consider again the first rebasing:</p>
<pre>
M0----M1----M2 M0----M1----M2
\ \
\ --> \
\ \
B1----B2 B1'---B2'
</pre>
<p>In fact, consider the first step of the first rebasing, in which B1 is "replayed" on top of M2:</p>
<pre>
M0----M1----M2 M0----M1----M2
\ \ \
\ --> \ B1'
\ \
B1----B2 B1----B2
</pre>
<p>This really involves <em>merging</em> the changes of (M1+M2) with the changes of B1. B1' is really a <em>merge commit</em>. So let's record it like a merge commit, with both of its ancestors:</p>
<pre>
M0----M1----M2 M0----M1----M2
\ \ \
\ --> \ B1'
\ \ /
B1----B2 -------B1----B2
</pre>
<p>The second step of the rebasing, similarly, <em>merges</em> B2 with (M1+M2+B1'), giving</p>
<pre>
M0----M1----M2 M0----M1----M2
\ \ \
\ --> \ B1'---B2'
\ \ / /
B1----B2 -------B1----B2
</pre>
<p>The semantics of the resulting DAG is correct; B1' really <em>does</em> include all of the changes from M0+M1+M2 and also the changes from B1. Similarly, B2' includes the changes from B1' and from B2.</p>
<p>The resulting DAG is certainly more complicated than the one resulting from a traditional rebase, but it includes both the easy-to-review patches B1' and B2' and also the history of the individual patches B1 -> B1' and B2 -> B2'. It is an honest representation of the repository's history.</p>
<p>The key difference between this procedure and the simple merge described above is that the merge is done <em>patch by patch</em>. This avoids the creation of merge patches like X1 and X2 above that mix together the merge work needed for multiple patches.</p>
<p>After further work, the DAG looks like this:</p>
<pre>
M0----M1----M2----M3----M4
\ \
\ B1'---B2'---B3----B4
\ / /
-------B1----B2
</pre>
<p>Now we rebase a second time:</p>
<pre>
M0----M1----M2----M3----M4
\ \ \
\ \ B1''--B2''--B3'---B4'
\ \ / / / /
\ -------B1'---B2'---B3----B4
\ / /
--------------B1----B2
</pre>
<p>The DAG is yet more complicated, but it is still correct and has the desired properties. It is easy to read off the easy-to-review patches (B1'', B2'', B3', B4'), and it is also easy to read off the history of individual patches (e.g., B1 -> B1' -> B1'').</p>
<p>Now, if somebody pulls from the repository between the two rebases and then again after the second rebase, the right thing will happen. The retention of the merge information allows rebased work to be shared. Indeed, one collaborator can do one rebase and another can do the next.</p>
<p>Maybe neither merging nor rebasing is optimal for maintaining a
long-lived feature branch.</p>
<h2>Is this overkill?</h2>
<p>The DAGs resulting from these history-retaining rebases are quite complicated. Does anybody really want to see all this complexity in their main upstream repository?</p>
<p>Probably not, or at least not usually. So I see two alternatives:</p>
<ul><li>Retain the rebase history while working on the feature, but discard it when submitting patches upstream. This approach has the advantage that multiple people can collaborate on the feature without stepping on each others feet, but the upstream repository is spared the intermediate complexity. It also doesn't require any changes to the tools, except maybe some simple scripts to help with the iterative merge needed to do a history-retaining rebase.</li>
<li>Retain the rebase history permanently, but suppress the display of the pre-rebase commits in most cases. This would require tool changes to allow commits to be marked "uninteresting" and to omit them from most types of operations.</li>
</ul>
<p>See also my followup posts:</p>
<ul><li><a href="http://softwareswirl.blogspot.com/2009/08/upstream-rebase-just-works-if-history.html">a concrete example of rebasing within a repository that has already been published</a></li><li><a href="http://softwareswirl.blogspot.com/2009/08/rebase-with-history-implementation.html">thoughts about the implementation of rebase-with-history</a></ul>Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com5tag:blogger.com,1999:blog-6771653874686998211.post-41058089186150738832009-08-08T13:31:00.004+02:002010-10-11T08:51:48.290+02:00Benchmarking build systems<p>I've long been on a quest to find a replacement for "make" for building projects. I will be writing more about this, but for now I would like to report some preliminary results from benchmarking some build systems, including a <a href="http://code.google.com/p/fabricate/">newcomer</a> to the scene.</p>
<p>The benchmarking effort was inspired by by discovery of a couple of excellent <a href="http://gamesfromwithin.com/?p=42">blog</a> <a href="http://gamesfromwithin.com/?p=44">posts</a> done by Noel Llopis in 2005. I especially wanted to extend his results to cover <a href="http://code.google.com/p/fabricate/">non-recursive use of make (PDF)</a> and <a href="http://code.google.com/p/fabricate/">fabricate</a>. Happily, Noel published <a href="http://www.gamesfromwithin.com/wp-content/uploads/bin/generate_libs.zip">the scripts</a> that he used for his work, so I downloaded them and added two new benchmarks.</p>
<p>There are many other build tools (and build tool advocates!) so I made a <a href="http://repo.or.cz/w/build-benchmarks.git">public git repository</a> to hold the benchmarking tools. I welcome patches containing benchmarks for other build tools! <strike>Personally, I would be very interested to see benchmarks of <a href="http://code.google.com/p/waf/">waf</a>, which claims to be much faster than scons.</strike> <span style="font-style:italic;">[I have added data for <a href="http://code.google.com/p/waf/">waf</a> to the results table below.]</span></p>
<p>Here I would like to report some preliminary results. Briefly, the test "project" consists of 50 libraries; each library consists of 100 C++ header files and 100 C++ source files; each header+source defines one trivial class; and the source files each include 15 other headers from the same library plus 5 headers from other, earlier-numbered libraries. The interlibrary dependencies only go one way, making it easy to define a self-consistent build order. Altogether the project includes about 10000 files, meaning that it is quite large but not enormous.</p>
<p>For comparison, I repeated some of the blog author's original measurements on my computer (a Linux notebook), plus a few new ones. I only bothered with the "full" build (build everything from scratch) and the global "do-nothing" build (run the build again when there is nothing to do). The builds that I added are as follows:</p>
<p>A "make_nonrecursive" build that uses "gcc -MM" to generate include-file dependencies for each file, and includes all of the dependency files and the library-wide Makefiles into a single project-wide Makefile that knows about all of the dependencies. See <a href="http://aegis.sourceforge.net/auug97.pdf">"Recursive Make Considered Harmful" (PDF)</a> for more information about this style of Makefile.</p>
<p>I added two fabricate benchmarks, one using atimes_runner and one using strace_runner. In both cases the build file does the obvious loop-over-libraries, loop-over-sourcefiles and invokes g++ using fabricate.run() once for each file. The tests use fabricate trunk r21.</p>
<p>Here are my results:</p>
<table>
<tbody><tr><th>Build system</th><th>Full compilation</th><th>Do-nothing build</th></tr>
<tr><td>make (recursively invoked)</td><td>194 s</td><td>3.7 s</td></tr>
<tr><td>make_nonrecursive</td><td>319 s</td><td>1.7 s</td></tr>
<tr><td>scons</td><td>312 s</td><td>52 s</td></tr>
<tr><td>waf</td><td>207 s</td><td>2.7 s</td></tr>
<tr><td>fabricate, using atimes_runner</td><td>4245 s</td><td>17.7 s</td></tr>
<tr><td>fabricate, using strace_runner</td><td>251 s</td><td>18.9 s</td></tr>
<tr><td>fabricate, using strace_runner and mtime_hasher</td><td>328 s</td><td>352 s [1]</td></tr>
</tbody></table>
<p>[1] I.e., this combination didn't work; it rebuilt everything.</p>
<h2>Summary</h2>
<p>make_nonrecursive is slower than recursive make for the full build, but this is probably mostly because make_recursive uses "gcc -MM" to generate the dependencies rather than "makedepend" as used by the recursive make. This is an implementation difference that could be changed. For the do-nothing build, make_nonrecursive is faster.</p>
<p>fabricate with the atimes_runner is so excruciatingly slow at doing a full build that it cannot seriously be considered for a project of this size. In retrospect, it is clear that it has to be slow for large projects, because before and after <span style="font-weight: bold;" class="moz-txt-star">every</span> command it has to monkey around with the atimes of <b class="moz-txt-star">every single file in the project</b>. Thus the time for building a project with N files goes something like O(N<sup class="moz-txt-sup">2</sup>).</p>
<p>A full build using fabricate and strace_runner is much better, almost as fast as make.</p>
<p>For do-nothing builds, fabricate is considerably slower than make, though not as slow as scons. A do-nothing build is the extreme case of an incremental build, which is the bread-and-butter of daily development. So the fabricate project should definitely invest time to speed up do-nothing builds.</p>
<p>For some reason, fabricate with strace_runner and mtime_hasher didn't work for me. The do-nothing build in fact rebuilt everything, making it even slower than the corresponding full build.</p>
<p>EDIT, 2009-09-15: Reformatted results in tabular form, and added numbers for <a href="http://code.google.com/p/waf/">waf</a>.</p>Michael Haggertyhttp://www.blogger.com/profile/00332393822754558070noreply@blogger.com5