Friday, December 28, 2012

Real-world conflict diagrams

In earlier articles, I described a way to diagram a conflicting merge, and presented an algorithm for efficiently mapping the conflict frontier which is implemented in an experimental program, git-mergemate.  In this article, I would like to present a few examples of real-world merge diagrams.

To generate these diagrams, I first used the following command to search for merge commits in "master" that had manually-resolved conflicts:

git rev-list --merges --grep='Conflicts:' master

Assuming that the SHA1 of the merge is stored in $s, the two parents of such a commit can be written as $s^1 and $s^2.  So I then ran git-mergemate like so:

git-mergemate diagram $s^1...$s^2

and converted the output from PPM format to PNG using the pnm2png utility [1].

Show me the pixels!

Each pixel in these diagrams represents one pairwise merge.  Thus the images are quite compact; the width (in pixels) is the number of commits on "master" after the branch point, and the height is the number of commits on the branch.  I am leaving the images at 1:1 scaling so that their relative sizes can easily be seen (hopefully you can zoom them using your browser if you want to see more detail).

A pixel is green if the corresponding merge was successful and red if the merge resulted in a conflict.  The merges that were explicitly tested are shown as bright green/red; the other merges were inferred using the assumptions described in the algorithm article.  As you can see, it takes very few test merges to compute a whole merge diagram.

Ho-hum

Most of the merge diagrams (about 80% of them) were quite boring.

A typical short branch with a conflict:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOy0xGXDmTDeDpq0BdRLzEoA1W3R39seX7_RD0u4FD01ViLwV_q_lNlEBQjZbgPMvxN2zKN3T8oxMo3nUFybe1qwAFpmFLmu3IBpuz42gAnOL412AbKqR5r3wuQqxem_81I8wstU3Xvqk/s1600/diagram-43c456bdfa3127a2f73b3c99fb1a3e21674613ea.png

A short branch with an early conflict:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjybk4IykVYxAiJ9u2U5-nXqiXJfO7kbjSQODQai7tQT5u65aBikUGzkjwBC_2u5hAEWjo39xYv1r7KRqyCFUEciKgpWxb2g0IOuHCUARq5_RoXlErTcWAA9ujCOpRq6FadEvhEsgmEjs/s1600/diagram-3e8ebc18283a23305179aa2055619985163fbd93.png

A very short branch merged long after it was created:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjosveq9_j9xYlko1NKeCqW_fmu94dE8In_Gv_uqECQYIp4Ur-3KzQ7gHWzO2apKIfTaL54hIqw8Eb374f2_GydTMr1KBwjB33nR2YNJmCjthCAXacL4z5yU_m2VXA4q5zqQWc6oceBa0k/s320/diagram-6cc13d7703e8dc0edccd134f44676ca9ad73dd6a.png

A longer branch merged soon after it was created, with a conflict with one of its later commits:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiczEd_xAHdY972whFdL8cbi0JUeUc8EpdevWTmV1FI9RxOwbUb48C6UVXiVqY8Su8Bc95mZP3neWoSovHL4hyupeb6LiaBhKKgxlAIYyBBzwlYN0U4c2qK94gqq_KXRQLkwjz_kiVXJeE/s1600/diagram-7f28917bfa6aef7c38a216d0e30fae40f216e8f4.png
Single conflicts

However, about 20% of the merge diagrams were more interesting.  The interesting diagrams tend to be bigger because they represent long-lived branches merged a considerable time after they were created.

Several diagrams show merges that might be blocked by a single pairwise conflict:

The conflict at the upper-left corner of the red rectangle might be all that is blocking this merge:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpchBoy8mhuo9U140LM8T_CwhBGztnsYQWWfMjgimbxe02FwhVdOZBHQ_FETMYYHMjnzHscLbFZBL08FJpApAC1Ds-hYeLal2QL_NFwmBBBmDvPsKNKyaJGTWGulB3R-6aj6vcFAzVQko/s1600/diagram-737dae7c79f05d06fada3246d2269fa4f33dd90a.png

A long branch, merged before much was changed on "master":

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs07m54MVuWlZnxiumaXGenldz_R8VgksvnXBsykEwnCyhqdo_gyOyedz201Wm33fV27xwjfsiKx6xxCgpnif5ulcOLjlnxNhwlO_0JSP3MFh4wxlnv3e2ZRAhTVjbDgkyAeDshRa__eI/s1600/diagram-8922c2d3f89fc35636d5e41ca17dfa36b15fbd2d.png

A short branch, merged quite a while after it was created:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBrfpWjOEowIiYg2mY7FLkpVzQtpHK0zJDimeqUGGfLnDadUmYJBL49zB0I6YwDB4RHMqX89h1SD9a2pgH3bhrFiIYebgUM5MQPAIGQNZZNCqAI41T__bb8hdyH4HZS_Jzp-jjBL5OcZs/s1600/diagram-9747028e60b994b5df12cf0805c0a3b3f04d9edd.png

A branch that has diverged widely from "master", but might have only one pairwise conflict:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnxejhfQN-YDPK0rAV6VoJ79i1vmGj9LbZg17n-VCQrVHiX5Byo5RmlK1afHGgmisVh2w4HIZpq1hMf34zjQz_Muq9YgY7JnvtVMNXx-XoPWSx7ezVmLsqgqMmZTZY2IGs2Gyi_-SBFjc/s1600/diagram-e51c4480518a4b30e241e35218385ef357f0f71b.png
Multiple conflicts

Other diagrams are yet more complicated, with merge frontiers shaped by two pairwise conflicts:

A smallish diagram with two early conflict blocks:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_dm1u1ovnnbnJLnoE6-_afuQBoVI4L8bmlR51cZgV_m1GygZDqwEjmafWCDiD7kv6UCEdvqvd2LohoitGAqbCQFWzVYLUydCtvk9q_jC4VokM8IemQykrfFUZyp81IAX8BLrkUKkdmcY/s1600/diagram-fd1165fbe2aca4400e46904ac29f269e42d8da50.png

Note that the top line in this diagram is green, meaning that the first branch commit doesn't conflict with any of the commits on master:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggWXPtiga7NwSjwjvx0kIJTxpwgWMG2aw5k7c1HshAnLnPAsluUyWIOa8z0EpPAvOS7xMm1_LPNwjF7QwrFE5vbLny90PhKu8kWQ_eiLa0mz37QbnjgIOUgYwK28QSUXXSpqNzIpk2c9E/s1600/diagram-b6508be06fde5b869f6119f883bb7d838425d4fe.png

A larger merge with two early conflict blocks:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgvGZe0U1b_pbvLyfS1ewJoj_vfaEwfDuWbmwnJKUt3NAV2TMIdzahZGAHaDou9edllwMtoHjkLxGIm32ESWEMoGiBhQFuACBx_r7Z-7XMyomLNXd0gq0Jcug_RPQ_47gHoayMH1_IcS0/s1600/diagram-04d07c8da4eec76f1d4c32cd438ddfbf9e2ccb3d.png

The biggest merge of all (281 commits on master, 235 commits on branch).  Please note that even though the branch and master have diverged significantly, conflicts did not arise until quite some time after the branch was created:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7WtlSrOU8vWlNKOBc8_lEs74QBu6StpDeryMqazS70Lx29SNww6JEa2Zfeqpd0MXe2T4INEdIAV3InstY81oUkkBJYJDvPXbBbyLZIdPY8aDcKoCrKy2r3PB1HitRqxZFiScYhSRFzis/s1600/diagram-698240f65bd4eb19f122ac2a725818b965445155.png

Conclusion

A merge diagram provides a quick, intuitive overview of a conflicting merge, and highlights the individual commits on each branch that conflict with each other.

Most merge diagrams are simple (in fact I didn't observe any with more than two blocking pairwise merges).  Thus there is hope that, by focusing our attention on these critical pairwise merges, we can resolve a nightmare merge with an acceptable level of pain and (more importantly) some confidence in the result.

In upcoming articles I will explore how to use incremental merging to resolve a nightmare merge.

Notes:
[1]

Putting this all together, we have

for s in $(git rev-list --grep='Conflicts:' --merges master)
do
    git-mergemate diagram $s^1...$s^2 | pnmtopng >../diagram-$s.png
done

If you want to try this with your own git repository, please make a backup first, as the script is only lightly tested!

Mapping the merge conflict frontier

 

In my last article, I described how to define a "conflict frontier" for a failed merge and conceptually how to diagram it.  In this article, I will describe an automated and efficient way to compute the conflict frontier using git.  In my next article, I will show some examples of real-world merge diagrams.

Remember, a merge diagram summarizes the combinations of commits from two branches that can be merged successfully, and which of them conflict.  The example used in the previous article looked like this:

o - 0 - 1  - 2  - 3  - 4  - 5  - 6  - 7  - 8  - 9  - 10  - 11    ← master
    |   |    |    |    |    |    |    |    |    |     |     |
    A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8 - A9 - A10 - A11
    |   |    |    |    |    |    |    |    |
    B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8   x     x     x
    |   |    |    |    |    |    |
    C - C1 - C2 - C3 - C4 - C5 - C6   x    x    x     x     x
    |   |    |    |    |    |    |
    D - D1 - D2 - D4 - D4 - D5 - D6   x    x    x     x     x
    |   |    |    |    |    |    |
    E - E1 - E2 - E3 - E4 - E5 - E6   x    x    x     x     x
    |   |
    F - F1   x    x    x    x    x    x    x    x     x     x
    |   |
    G - G1   x    x    x    x    x    x    x    x     x     x
    |   |
    H - H1   x    x    x    x    x    x    x    x     x     x
    |   |
    I - I1   x    x    x    x    x    x    x    x     x     x

    ↑
  branch

In this diagram, commits "0"-"11" are on master and commits "A"-"I" are on the branch.  The rest of the diagram shows incremental merges when they are possible; e.g., commit "C3" is a successful merge between commit "C2" and commit "B3", and it contains the logical changes that were originally introduced by commits "0"-"3" on master plus those introduced by commits "A"-"C" on branch.  When an incremental merge failed, the diagram shows an "x".  The "conflict frontier" is the border between merges that are successful and those that result in conflicts.

This article describes an algorithm for computing the merge frontier efficiently.

Assumptions

In this article we make the following assumptions:

  1. If a merge can be done directly, then all of the incremental merges above and to the left of the merge can also be done successfully. For example, if we can merge "C" and "3" directly, then we assume that all of the merges "A1", "A2", "A3", "B1", "B2", "B3", "C1", and "C2" could also be done successfully.  This is typically (though not always) the case in the real world.
  2. If a direct merge conflicts, then all of the incremental merges below and to the right of it would also conflict.  For example, if a merge of "F" and "2" fails, we assume that the whole rectangle of incremental merges bounded by "F2", "F11", "I11", and "I2" would also conflict.  Again, this is typically (though not always) the case in the real world.
  3. We use "git merge" to define whether a merge is successful.  If the command completes without an error, then we assume that the merge is good.  (For a more accurate determination of which merges are successful, one could also check whether the merged version builds and passes the project's automated tests.)

Bisection

Given the above assumptions, it is possible to locate the merge frontier in any line or column of the diagram via a process of bisection.  For example, to find the merge frontier in row "E" of the diagram:

  1. Test merging "E" and "6" -> SUCCESS.  Merge frontier must be to the right of this entry.
  2. Test merging "E" and "9" -> CONFLICT.  Merge frontier must be to the left of this entry.
  3. Test merging "E" and "7" -> FAIL.  Merge frontier must be between "E6" and "E7".

In the diagram, the bisection look like this:

o - 0 - 1  - 2  - 3  - 4  - 5  - 6  - 7  - 8  - 9  - 10  - 11    ← master
    |
    A
    |
    B
    |
    C
    |
    D
    |
    E                            E6   x         x
    |
    F
    |
    G
    |
    H
    |
    I

    ↑
  branch

Filling

Given the results of a bisection, assumptions 1 and 2 allow us to fill in big chunks of the merge diagram.  In particular, since merge "E6" was successful, all of the merges in the rectangle above and to the left of "E6" are also successful.  And since "E7" conflicts, all of the merges in the rectangle below and to the right of "E7" are also conflicts:

o - 0 - 1  - 2  - 3  - 4  - 5  - 6  - 7  - 8  - 9  - 10  - 11    ← master
    |   |    |    |    |    |    |
    A - A1 - A2 - A3 - A4 - A5 - A6
    |   |    |    |    |    |    |
    B - B1 - B2 - B3 - B4 - B5 - B6
    |   |    |    |    |    |    |
    C - C1 - C2 - C3 - C4 - C5 - C6
    |   |    |    |    |    |    |
    D - D1 - D2 - D4 - D4 - D5 - D6
    |   |    |    |    |    |    |
    E - E1 - E2 - E3 - E4 - E5 - E6   x    x    x     x     x
    |
    F                                 x    x    x     x     x
    |
    G                                 x    x    x     x     x
    |
    H                                 x    x    x     x     x
    |
    I                                 x    x    x     x     x

    ↑
  branch

The combination of bisection and filling allows the merge frontier to be computed very efficiently.

The zig-zag method

There are many ways to map out the frontier using bisection and filling.  The one that I use I call the zig-zag method, because it starts at the bottom-left and zig-zags up the frontier until it reaches the top-right.  In pseudo-code, it is roughly:

i1 = 0
i2 = number of commits on branch
while True:
    i1 = index of first failing merge in row i2-1
    mark block above and to the left of (i1,i2) as "successful"
    if i1 == number of commits on master:
        break

    i2 = index of first failing merge in column i1
    mark block below and to the right of (i1,i2) as "conflict"
    if i2 == 0:
        break

Whenever the "index of first failing merge" is needed for a row or a column, it is found by bisection.  For our example, the progression looks like this (with the position (i1,i2) marked as *):

o - 0 - 1  - 2  - 3  - 4  - 5  - 6  - 7  - 8  - 9  - 10  - 11    ← master
    |   |
    A - A1
    |   |
    B - B1
    |   |
    C - C1
    |   |
    D - D1
    |   |
    E - E1
    |   |
    F - F1
    |   |
    G - G1
    |   |
    H - H1
    |   |
    I - I1   *

    ↑
  branch


o - 0 - 1  - 2  - 3  - 4  - 5  - 6  - 7  - 8  - 9  - 10  - 11    ← master
    |   |
    A - A1
    |   |
    B - B1
    |   |
    C - C1
    |   |
    D - D1
    |   |
    E - E1
    |   |
    F - F1   *    x    x    x    x    x    x    x     x     x
    |   |
    G - G1   x    x    x    x    x    x    x    x     x     x
    |   |
    H - H1   x    x    x    x    x    x    x    x     x     x
    |   |
    I - I1   x    x    x    x    x    x    x    x     x     x

    ↑
  branch


o - 0 - 1  - 2  - 3  - 4  - 5  - 6  - 7  - 8  - 9  - 10  - 11    ← master
    |   |    |    |    |    |    |
    A - A1 - A2 - A3 - A4 - A5 - A6
    |   |    |    |    |    |    |
    B - B1 - B2 - B3 - B4 - B5 - B6
    |   |    |    |    |    |    |
    C - C1 - C2 - C3 - C4 - C5 - C6
    |   |    |    |    |    |    |
    D - D1 - D2 - D4 - D4 - D5 - D6
    |   |    |    |    |    |    |
    E - E1 - E2 - E3 - E4 - E5 - E6   *
    |   |
    F - F1   x    x    x    x    x    x    x    x     x     x
    |   |
    G - G1   x    x    x    x    x    x    x    x     x     x
    |   |
    H - H1   x    x    x    x    x    x    x    x     x     x
    |   |
    I - I1   x    x    x    x    x    x    x    x     x     x

    ↑
  branch


o - 0 - 1  - 2  - 3  - 4  - 5  - 6  - 7  - 8  - 9  - 10  - 11    ← master
    |   |    |    |    |    |    |
    A - A1 - A2 - A3 - A4 - A5 - A6
    |   |    |    |    |    |    |
    B - B1 - B2 - B3 - B4 - B5 - B6
    |   |    |    |    |    |    |
    C - C1 - C2 - C3 - C4 - C5 - C6   *    x    x     x     x
    |   |    |    |    |    |    |
    D - D1 - D2 - D4 - D4 - D5 - D6   x    x    x     x     x
    |   |    |    |    |    |    |
    E - E1 - E2 - E3 - E4 - E5 - E6   x    x    x     x     x
    |   |
    F - F1   x    x    x    x    x    x    x    x     x     x
    |   |
    G - G1   x    x    x    x    x    x    x    x     x     x
    |   |
    H - H1   x    x    x    x    x    x    x    x     x     x
    |   |
    I - I1   x    x    x    x    x    x    x    x     x     x

    ↑
  branch


o - 0 - 1  - 2  - 3  - 4  - 5  - 6  - 7  - 8  - 9  - 10  - 11    ← master
    |   |    |    |    |    |    |    |    |    |     |     |
    A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8
    |   |    |    |    |    |    |    |    |
    B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8   *
    |   |    |    |    |    |    |
    C - C1 - C2 - C3 - C4 - C5 - C6   x    x    x     x     x
    |   |    |    |    |    |    |
    D - D1 - D2 - D4 - D4 - D5 - D6   x    x    x     x     x
    |   |    |    |    |    |    |
    E - E1 - E2 - E3 - E4 - E5 - E6   x    x    x     x     x
    |   |
    F - F1   x    x    x    x    x    x    x    x     x     x
    |   |
    G - G1   x    x    x    x    x    x    x    x     x     x
    |   |
    H - H1   x    x    x    x    x    x    x    x     x     x
    |   |
    I - I1   x    x    x    x    x    x    x    x     x     x

    ↑
  branch


o - 0 - 1  - 2  - 3  - 4  - 5  - 6  - 7  - 8  - 9  - 10  - 11    ← master
    |   |    |    |    |    |    |    |    |    |     |     |
    A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8
    |   |    |    |    |    |    |    |    |
    B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8   *     x     x
    |   |    |    |    |    |    |
    C - C1 - C2 - C3 - C4 - C5 - C6   x    x    x     x     x
    |   |    |    |    |    |    |
    D - D1 - D2 - D4 - D4 - D5 - D6   x    x    x     x     x
    |   |    |    |    |    |    |
    E - E1 - E2 - E3 - E4 - E5 - E6   x    x    x     x     x
    |   |
    F - F1   x    x    x    x    x    x    x    x     x     x
    |   |
    G - G1   x    x    x    x    x    x    x    x     x     x
    |   |
    H - H1   x    x    x    x    x    x    x    x     x     x
    |   |
    I - I1   x    x    x    x    x    x    x    x     x     x

    ↑
  branch


o - 0 - 1  - 2  - 3  - 4  - 5  - 6  - 7  - 8  - 9  - 10  - 11    ← master
    |   |    |    |    |    |    |    |    |    |     |     |
    A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8 - A9 - A10 - A11   *
    |   |    |    |    |    |    |    |    |
    B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8   x     x     x
    |   |    |    |    |    |    |
    C - C1 - C2 - C3 - C4 - C5 - C6   x    x    x     x     x
    |   |    |    |    |    |    |
    D - D1 - D2 - D4 - D4 - D5 - D6   x    x    x     x     x
    |   |    |    |    |    |    |
    E - E1 - E2 - E3 - E4 - E5 - E6   x    x    x     x     x
    |   |
    F - F1   x    x    x    x    x    x    x    x     x     x
    |   |
    G - G1   x    x    x    x    x    x    x    x     x     x
    |   |
    H - H1   x    x    x    x    x    x    x    x     x     x
    |   |
    I - I1   x    x    x    x    x    x    x    x     x     x

    ↑
  branch

Using this algorithm, the whole diagram can be computed by doing only O(B * lg max(M,N)) test merges, where B is the number of blocks in the resulting diagram and M and N are the number of commits on master and branch, respectively.  In the worst case, B = min(M,N) and the algorithm scales like O(min(M,N) * lg max(M,N)).

I have written an example program, git-mergemate, that computes merge diagrams using the algorithm described here.  To use it, put it in your $PATH, switch to your git project working directory (which must be clean), and type

git-mergemate diagram master...branch >diagram.ppm

The output is a PPM file in which successful merges are displayed as green pixels and conflicting merges are displayed as red pixels. Pixels representing merges that were actually tested are displayed as bright red/green; the other pixels are inferred using the assumptions above.

My next article will show some examples of real-world merge diagrams.

Some technical details

  1. Each of the bisect steps only needs to search the part of the row/column that hasn't already been filled in.
  2. If the merges are done using git, it is important to do them with git's "rerere" feature turned off.  Otherwise the results of test merges are not deterministic because they depend on what information "rerere" has already gathered.

The conflict frontier of a nightmare merge

If you use feature branches in git, you know the importance of getting features (or featurettes) merged back to master as soon as possible. Otherwise, it is all too likely that the branch and master will grow so far apart that merging them becomes a nightmare.  You probably know that sick feeling in the pit of your stomach when you attempt a merge, only to be confronted with dozens or even hundreds of conflicts.  How can you possibly understand all of the conflicts, let alone disentangle them?  Sometimes it seems like there is nothing to do but abandon the branch entirely and start over.

Exactly this situation happened recently at $WORK, and it got me thinking about what can be done to attack the nightmare merge.

Explode the problem

The first step to solving a problem is to understand it, and (for me at least) the first step to understanding a problem is to visualize it somehow.  What does a nightmare merge look like?

Specifically, one doesn't learn very much from the fact that two large and widely-diverged branches don't merge successfully.  Let's explode the merge into many tiny, simpler merges to try to get a more intuitive idea of the situation.

Assume, for the sake of simplicity, that both master and the feature branch have developed linearly since they bifurcated [1]:

o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11    ← master
     \
      A - B - C - D - E - F - G - H - I                ← branch

Now, the fact that master and branch cannot be merged means that somewhere among the commits on master there are one or more changes that conflict with one or more changes somewhere on branch.  But the majority of the changes probably do not conflict.  Let's find the problematic pairs so that we can focus our attention on them.

We start by drawing the above diagram a little bit differently, with the branch drawn downwards at a 90 degree angle to master:

o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11    ← master
    |
    A
    |
    B
    |
    C
    |
    D
    |
    E
    |
    F
    |
    G
    |
    H
    |
    I

    ↑
  branch

Now let's divide and conquer by merging individual pairs of commits, one at a time.  The first merge is between the first post-bifurcation commits on branch and master, "1" and "A":

o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11    ← master
    |   |
    A - A1
    |
    B
    |
    C
    |
    D
    |
    E
    |
    F
    |
    G
    |
    H
    |
    I

    ↑
  branch

If we're lucky, the merge will go through without conflicts.  So let's merge some more branch commits with the first commit from master:

o - 0 - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11    ← master
    |   |
    A - A1
    |   |
    B - B1
    |   |
    C - C1
    |   |
    D - D1
    |   |
    E - E1
    |   |
    F - F1
    |   |
    G - G1
    |   |
    H - H1
    |   |
    I - I1

    ↑
  branch

Woohoo!  Nothing on the branch conflicted with commit 1 from master. This is a little bit of progress.  For example, it means that we could probably cleanly rebase the branch to commit "1" (though that is not recommended).

Aside:

Please note that we haven't actually defined what it means for a merge to be "successful".  The simplest approach is simply to attempt the merge in git and see if it reports a conflict.  This is the approach that I will use.  But please note that even a merge that is textually successful is not necessarily correct: it might not build, or the merge might have introduced bugs, etc. It would be possible to define success more strictly, using criteria like

  • The merged source tree compiles without errors
  • The test suite runs correctly
  • The merged code passes manual inspection

Blocked!

Now let's start successively merging the first column of merge commits with commit 2 from master:

o - 0 - 1  - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11    ← master
    |   |    |
    A - A1 - A2
    |   |    |
    B - B1 - B2
    |   |    |
    C - C1 - C2
    |   |    |
    D - D1 - D2
    |   |    |
    E - E1 - E2
    |   |    |
    F - F1 - x
    |   |
    G - G1
    |   |
    H - H1
    |   |
    I - I1

    ↑
  branch

This time, we hit a problem: there was a conflict (marked with "x") when trying to merge commit "F1" with commit "E2".  What does this mean?

Commit "E1" essentially contains all of the changes from commits "A"-"E" on the branch plus the change from commit "1" on master.  This merge was successful.  If we denote the change introduced by commit "X" as "∆X", we can write (schematically):

E1 = 0 + ∆1      + ∆A + ∆B + ∆C + ∆D + ∆E           (successful)

"E2" contains everything in "E1", plus additionally it contains the change that was originally made in master commit "2".  Similarly, "F1" contains all of the changes from "E1", plus additionally the change that was originally made on branch commit "F".  Both merges "E2" and "F1" were also successful:

E2 = 0 + ∆1 + ∆2 + ∆A + ∆B + ∆C + ∆D + ∆E           (successful)
F1 = 0 + ∆1      + ∆A + ∆B + ∆C + ∆D + ∆E + ∆F      (successful)

But there was a conflict when merging "E2" and "F1":

F2 = 0 + ∆1 + ∆2 + ∆A + ∆B + ∆C + ∆D + ∆E + ∆F      (conflict)

How do we understand this conflict?  If we compare "E2", "F1", and "F2", we see that "F2" was the first time that the changes introduced in commit "2" came together with the changes introduced in commit "F". So the fact that "F2" conflicts indicates that a change made in commit "2" conflicts with a change made in commit "F".

Conflict breeds more conflict

Once there is a merge conflict, it blocks our progress.  If merge "F2" fails, then there is no way that we can merge "F2" and "G1" to create "G2" [2].  Nor is it possible to merge "F2" with "E3" to create "F3". Continuing this line of reasoning, we conclude that the one failed merge at "F2" implies that all of the merges in the rectangle below and to the right of it will also fail.  So let's add that information to the diagram:

o - 0 - 1  - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11    ← master
    |   |    |
    A - A1 - A2
    |   |    |
    B - B1 - B2
    |   |    |
    C - C1 - C2
    |   |    |
    D - D1 - D2
    |   |    |
    E - E1 - E2
    |   |
    F - F1   x   x   x   x   x   x   x   x   x   x
    |   |
    G - G1   x   x   x   x   x   x   x   x   x   x
    |   |
    H - H1   x   x   x   x   x   x   x   x   x   x
    |   |
    I - I1   x   x   x   x   x   x   x   x   x   x

    ↑
  branch

Filling the rest of the diagram

However, that doesn't mean that we are completely done.  It is quite possible that a merge of "3" and "A2" will still be successful, and in general we can keep filling in the table in other directions until they are blocked by other merge conflicts.  The resulting diagram might look like this:

o - 0 - 1  - 2  - 3  - 4  - 5  - 6  - 7  - 8  - 9  - 10  - 11    ← master
    |   |    |    |    |    |    |    |    |    |     |     |
    A - A1 - A2 - A3 - A4 - A5 - A6 - A7 - A8 - A9 - A10 - A11
    |   |    |    |    |    |    |    |    |
    B - B1 - B2 - B3 - B4 - B5 - B6 - B7 - B8   x     x     x
    |   |    |    |    |    |    |
    C - C1 - C2 - C3 - C4 - C5 - C6   x    x    x     x     x
    |   |    |    |    |    |    |
    D - D1 - D2 - D4 - D4 - D5 - D6   x    x    x     x     x
    |   |    |    |    |    |    |
    E - E1 - E2 - E3 - E4 - E5 - E6   x    x    x     x     x
    |   |
    F - F1   x    x    x    x    x    x    x    x     x     x
    |   |
    G - G1   x    x    x    x    x    x    x    x     x     x
    |   |
    H - H1   x    x    x    x    x    x    x    x     x     x
    |   |
    I - I1   x    x    x    x    x    x    x    x     x     x

    ↑
  branch

The conflict frontier

What does a diagram like this tell us?

Even when it is impossible to merge a branch back to master directly, it is usually possible to merge various parts of the branch with parts of master.  This is logical; not every change on the branch necessarily conflicts with every change on master.

But there is a "conflict frontier" between the merges that are successful and the merges that conflict.  This frontier has a very particular shape: the conflicted region is in the bottom-right corner of the diagram, and it consists of rectangles jutting towards the upper-left.

The apexes of these rectangles (i.e., "F2", "C7", and "B9" in our example) indicate the earliest pairs of changes, one on each branch, that conflict with each other (i.e., "F" vs. "2", "C" vs. "7", and "B" vs. "9").

In future articles I will present an algorithm for efficiently mapping out the conflict frontier, show some examples of real-world merge diagrams, and discuss how to use incremental merging to attack the nightmare merge.

Notes:

[1]

There are two obvious ways to generalize this approach when the histories of master and branch are not linear:

  1. Merge the individual commits anyway, preserving the topology of each branch as you go.  I think this should work, though I haven't actually tried it.  But in any case, it would be difficult to visualize the result.
  2. Apply the described procedure to only the main backbone of commits on each branch, namely the commits along the first-parent chain from each branch tip to the branch point. Using git, this would correspond to the output of git log --first-parent.  This is the approach that I will take in future articles.
[2]Strictly speaking, even though merge "F2" failed, it might still be possible to merge "G" and "2".  For example, commit "G" might have reverted commit "F", meaning that "G2" is textually identical to the successful merge "E2".  But in the usual case, the failure of one merge means that all merges below and to the right of it will fail, too.

Thursday, November 8, 2012

Compressing permuted data

Mark Nelson posed a challenge:
[...] Challenge 2 is more along the lines of troll bait, because it is patently impossible: create a system to compress and then decompress any file of size 415,241 bytes. In other words, create a compressed file that is smaller than the input, then use only that compressed file to restore the original data.
Unlike Challenge 1, there are no size limitations on Challenge 2. Your compressor and decompressor can be as large as you like. [...]
To keep it simple, the files will be simple permutations of the million digit file – scrambled with an encryption system, or perhaps a random number generator, or whatever. The point is, they should have all the same numeric characteristics as the original file, just organized slightly differently.
He also supplied a file AMillionRandomDigits.bin, which contains 415,241 random bytes.
Of course it is well-known that it is impossible to write a compressor that can compress every possible data file.  But the last paragraph states that the files to be compressed "will be simple permutations of the million digit file" [1].  As I will show below, it is possible to compress files that are bytewise permutations of a known file, albeit by only a small amount.

The key point is that the number of permutations of AMillionRandomDigits.bin is considerably smaller than the number of files of length 415241. In fact, it is possible to compute the numbers exactly:

Number of files of length N = 256^N
Number of permutations of a file of length N = N! / (n_0! n_1! ... n_255!)

where n_b is the number of times that the byte value b appears in the original file.
The minimum size, in bits, to which a file from a particular ensemble can be compressed is given by the logarithm base 2 of the number of possible files. Evaluating the above formulas given the byte frequencies in AMillionRandomDigits.bin, we find that an optimal compressor could compress any of these files by exactly 235 bytes or 0.06%. So a tiny extra crumb of information can be leveraged to effect a tiny bit of compression.
I was initially surprised how little compression is made possible by knowledge of character frequency counts, though in retrospect it is pretty obvious:

N = 415241
Each byte value appears approximately N/256 = 1622 times
The variation in byte frequencies is about ±sqrt(1622) ≈ ±40
That fact that byte b appears n_b times therefore gives you (very roughly)
    lg(80) or 6.3 bits of information.
Since we have independent counts for 255 bytes, that gives about
    6.3 bits * 255 ≈ 200 bytes of information

Thus this simple estimate suggests that the file can be compressed by about 200 bytes (not far off of the correct value, 235, as computed above).

Can we actually write a compressor/decompressor for such files?  In fact, it is easy to compress the file by one byte.  The compressor is trivial: simply drop the last byte from the file. To decompress,
  1. Compute the byte frequencies in AMillionRandomDigits.bin
  2. Compute the byte frequencies in the compressed file
  3. Determine which byte value appears fewer times than expected in the compressed file
  4. Append that byte to the end of the compressed file to produce the original
Implementing optimal compression for such a file is left as an exercise for the reader.


[1] After I submitted my solution, Mark claimed that what he meant by "permutations" was not the mathematical meaning of the word, but rather...well, I don't really know; if it's not meant precisely then the whole paragraph is kindof meaningless.