Git is Inconsistent (r6.ca)
157 points by sheffield on April 17, 2011 | 79 comments



Here's the short version:

  I am the original sentence.
Alice commits a change in her repo:

  I am a different sentence.
Bob commits a change in his repo:

  I am the original sentence.
  I am the original sentence.
Now Alice pulls Bob's commit. What should happen?

The argument is that in certain cases it can be known which of Bob's 2 sentences is the original and which is the copy (due to context provided by an intermediate commit) and that therefore a correct VCS will figure out that the original is on the bottom:

  I am the original sentence.
  I am a different sentence.
But git doesn't look at history so will always produce:

  I am a different sentence.
  I am the original sentence.
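
If you want to reproduce the short version in git, here's a minimal sketch (the file and commit names are mine; which way the auto-merge resolves depends on how diff anchors the duplicated line, but this is the typical result):

  git init demo && cd demo
  echo "I am the original sentence." > file
  git add file && git commit -m "base"
  git branch bob                          # Bob branches from the base
  echo "I am a different sentence." > file
  git commit -am "alice: reword"          # Alice's change on master
  git checkout bob
  printf '%s\n%s\n' "I am the original sentence." \
                    "I am the original sentence." > file
  git commit -am "bob: duplicate the line"
  git checkout master && git merge bob    # typically auto-merges as shown above
  cat file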
I don't care. If you force me to care then I actually prefer git's behavior. Git is consistent: a merge will always produce the same result for the same files. I don't want history to matter.

The problem is not actually solvable. So git doesn't try to solve it. I think that's why it's called "the stupid content tracker."

EDIT: Is there anything worse than "smart" features that only work, say, 80% of the time? The closer they get to 100% the worse it gets, because then you start relying on them and they break right when you stop paying attention.


> Git is consistent: a merge will always produce the same result for the same files

I thought the point was that if you pull the exact same commits in a different order, the merge will produce a different result for the same files, meaning that in git the history does matter. Whereas darcs/etc. will always produce the same result, such that history does not matter?


> pull the exact same commits in different order

Sort of. The OP doesn't write clearly. He's also confused about how git works. What he means is this:

Say Bob has 2 commits (B1-B2) and Alice has 1 (A1).

Scenario 1: Alice merges each of Bob's commits in sequence (i.e. she replays his commit history onto her repo: A1-B1-B2).

Scenario 2: Alice merges only B2 (A1-B2).

The point is that, with git, Alice's repo will be different in each scenario. Because in scenario 2 git doesn't examine commit B1 and use that info to try and figure out what the content in commit B2 "means".

With darcs, on the other hand, her scenario 1 repo will be identical to her scenario 2 repo.

The flip side is that in scenario 2 git will always produce the same result for the same B2, because B1 is irrelevant. With darcs a change in B1 will change the result.

NOTE: "git pull --rebase" actually does "replay commit history" instead of "merge" when pulling code into your repo (result: B1-B2-A1). I use it as my default. The outcome is the same as darcs'; the difference is that everything is explicit.
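
In practice that looks something like this (a sketch; the remote and branch names are placeholders):

  # in Alice's repo, with local commit A1 not yet pushed
  git pull --rebase origin master
  # fetches Bob's B1-B2, then replays A1 on top of them: ...-B1-B2-A1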


That's what I don't get.

I don't understand where or how you could encounter a circumstance where this would matter. This complaint seems to be an abstract theoretical point (maybe to support git alternatives? dunno) that even esoteric usage of a DVCS would never come across.

I dunno, maybe I'm not being creative enough in my use of histories.

EDIT: Okay this explains everything in a considerably more concise fashion than the article does: http://news.ycombinator.com/item?id=2456529


It would matter in this situation:

In the beginning:

  function A(){
    return 1;
  }
Now commit this in one branch:

  function B(){
    return 1;
  }
  function A(){
    return 1;
  }
then this:

  function A(){
    return 1;
  }
  function B(){
    return 1;
  }
  function A(){
    return 1;
  }
And then this in another branch off the base:

  function A(){
    return 2;
  }
Now merge the two end points. Which is correct? This, assuming a purely line-based diff:

  function A(){
    return 2;
  }
  function B(){
    return 1;
  }
  function A(){
    return 1;
  }
or this, assuming knowledge of the history of events?

  function A(){
    return 1;
  }
  function B(){
    return 1;
  }
  function A(){
    return 2;
  }
In JavaScript, where such code is legal and the last definition wins, `A()` now returns 1 or 2 depending on which merge you got.

In Git, or by applying patches manually, it depends on the order in which you merge. If you merge the `B()A()` branch with the `return 2` branch and then the `A()B()A()` one, you'll get the second result. But if you merge the `A()B()A()` branch directly with the `return 2` branch, you'll get the first one. The same set of changes produces different outcomes.

In Darcs, the history between `A()`, `B()A()`, and `A()B()A()` is checked, and it's seen that the second `A()` is the "original" one, so the `return 2` is applied to that one.

Which means that you won't necessarily get the same behavior merging two Darcs patches as you would merging them within the repository, where there is a history. Git behaves exactly as if you were dealing with patches. I side with Git on this, personally, but it's a valid point - you have history, why not use it?
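
You can reproduce the order-dependence with something like this (a sketch; the branch names are mine: `steps` holds the `B()A()` then `A()B()A()` commits, `fix` holds the `return 2` change, both branched off the base):

  git branch fix2 fix    # a second copy of the "return 2" branch
  git checkout fix
  git merge steps~1      # first merge the intermediate B()A() commit
  git merge steps        # then merge the A()B()A() endpoint
  git checkout fix2
  git merge steps        # or: merge the endpoint directly
  git diff fix fix2      # the two auto-merged results can differ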


You know, I suspect that in many production cases, neither merge is "correct". The example involves a lot of code duplication, and a change to the block of code which was duplicated.

The probable case is something like:

  function foo(){
    do_something_complex_but_not_correct();
  }
with one person making the change to:

  function foo(){
    something_else();
    do_something_complex_but_not_correct();
  }
and then:

  function foo(){
    do_something_complex_but_not_correct();
    something_else();
    do_something_complex_but_not_correct();
  }
in the stated two-step change, while another author makes the change to:

  function foo(){
    do_something_complex_and_also_correct();
  }
The correct "merge" is going to be to apply the second change to both blocks of code, not just the first or the second:

  function foo(){
    do_something_complex_and_also_correct();
    something_else();
    do_something_complex_and_also_correct();
  }


Which is why I side with explicit, patch-like behavior. Interpreting a `move-and-copy` as a `move` when there's a chunk of duplicate data that could mess things up means it's essentially doing a primitive semantic analysis of what you meant to do. It may be correct more of the time, but it can't be correct all of the time.

What I "meant to do" could have been as you stated, where both should have changed. Or I could have copied the internals of a function to a new one, and made minor changes around it, and actually do wish to use that new copy as the official version. There is no way to 100% accurately detect such intent without being explicit about it, so I'd prefer something dumb and therefore extremely predictable.


More precisely, I believe, it's the history of the merges that counts, not the history of the files per their original edits.


Commits are not commutative in the general case. Darcs's algorithm is commutative for the merged commits, and git's is not, in this example.

The git people are arguing that the speed lost by gaining this commutative property is just not worth it. I agree.


Based on the article and the Reddit discussion, that is not correct. It's more like this:

Scenario #1:

Alice and Bob both make changes. Alice pulls Bob's change and merges it. Bob makes a second change. Alice pulls the second change and merges it.

Scenario #2:

Alice and Bob both make changes. Bob makes a second change. Alice pulls Bob's changes and merges.

The final result, which is Alice's change merged with Bob's two changes, ends up different in the two cases, and there were no merge conflicts.


No, no, no.

History is EVERYTHING to a VCS. You ALWAYS want exact information of what changed at what time. This lets you do all sorts of cool things like examine the provenance of a file in detail, integrate a similar change across two different branches whose code may have diverged, etc.

Meticulous tracking of history as well as efficient handling of large binary blobs are why the pros almost always rely on Perforce for large projects.


As far as I know, the history in git doesn't change unless you explicitly ask it to (rebase). So you should still always be able to tell exactly what changed and at what time it was changed. Perhaps git doesn't employ this information to the liking of others, but it should all be there.


Joel Spolsky describes Mercurial as storing lists of changes, rather than a series of file snapshots.

"And so, when we want to merge our code together, Mercurial actually has a whole lot more information: it knows what each of us changed and can reapply those changes, rather than just looking at the final product and trying to guess how to put it together.

"For example, if I change a function a little bit, and then move it somewhere else, Subversion doesn’t really remember those steps, so when it comes time to merge, it might think that a new function just showed up out of the blue. Whereas Mercurial will remember those things separately: function changed, function moved, which means that if you also changed that function a little bit, it is much more likely that Mercurial will successfully merge our changes."

http://hginit.com/00.html

I'd assumed git and mercurial worked the same way.


The short of it is that Joel is wrong. Git and Mercurial use similar data structures and neither of them store changes in the way that Darcs stores changes. Maybe he knows that full well but is telling a white lie to get a teaching point across.

If you make a change to a file in Git and commit it, the new version will store the full updated contents of that file (delta compression is an orthogonal issue). Indeed, my use of the word "version" is revealing. That concept is secondary in Darcs; changes are what have primary ontological status.
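
You can check this with git's plumbing commands; a quick sketch (the blob SHA is whatever the tree listing shows):

  git cat-file -p HEAD^{tree}   # each entry points at a blob
  git cat-file -p <blob-sha>    # prints the file's complete contents, not a delta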


Joel is half-right and half-wrong. Mercurial stores its version history as a series of deltas, yes. Git stores its version history as a series of snapshots. (Git does do delta compression, but the delta compression is done independently of the version history, which is why git can be highly efficient at storing its complete version history in its repositories.) This doesn't matter, though, since you can get from snapshots to deltas and vice versa very easily; the two systems are dual to each other. In that way, he is also wrong --- the reason why git and mercurial are smarter than svn is not how they store their commits, since that really is an implementation detail.

At least for git, it will start by doing a 3-way merge, and if that fails, only then will it try to resolve the merge conflict by looking at the intermediate history. This is much faster, and for Linus, who wants to encourage lots of branching and merging, merge speed is highly important. This is what makes git fundamentally better than svn or cvs: the fact that it can get many more merge cases right, and that it can do this quickly and painlessly. So the darcs folks who say that git only does 3-way merges are incorrect; git can do much more sophisticated things than just 3-way merges. However, it only pulls out these more sophisticated weapons when the simple approach doesn't work (and 95+% of the time, the simple approach works just great).

What Darcs did is focus on "get many, many, MANY more merge cases right", but it completely ignored the "quickly" part of the equation. That's partially because it's amazingly complicated. Just take a look at the Darcs "Theory of Patches", with its obsessive fixation on being able to determine whether or not different patches are commutative, etc., and you get a very strong hint of its complexity right here: http://en.wikibooks.org/wiki/Understanding_Darcs/Patch_theor...

The question is whether this complexity is necessary or not. It certainly does slow things down. And fundamentally, that's the question: is it worth it to slow down nearly every single SCM operation just so that a few corner cases can be handled automatically, instead of requiring minimal human intervention? Since people of good will can disagree on this, the controversy certainly continues to exist. But I think a very large number of people are quite happy with the engineering tradeoff made by systems such as Git and Mercurial.


Great summary!

I stopped using Darcs a few years ago, but I hear the current generation has at least resolved the notorious exponential-time slowdowns.

Git's speed is definitely a big selling point. More than that, the ecosystem and services like GitHub are what really sold me on it versus alternatives. But Mercurial has a lot to offer and its simpler user interface, better Windows support and extensions like BFiles make it a much better fit for certain use cases.

I shouldn't have been so hasty to say that Mercurial doesn't store changes. But I'd argue, and you seem to agree, that Mercurial's revlog does not reflect a difference from Git in the basic philosophy of merging and the status and role of versions. In both cases you're basically dealing with genealogically annotated purely functional trees. By comparison, Darcs's theory of patches represents a radical departure. At the very least I'm happy that someone is trying to think deep and different thoughts in this area.


No, you are totally wrong. Did you even try this in Git? Any sensible VC system will give you a conflict here. The article discusses auto-merge behaviour. You ABSOLUTELY can get auto-merge to work 100% of the time. When it doesn't you get a conflict that you need to manually resolve. BitKeeper does get this right (disclaimer: I am one of the developers of BitKeeper).


What a powerful insight. Only now do I see how truly wrong I was. I don't know how I could have been so blind.


This is not an accurate summary. The claim is that a "correct" VCS will detect a move, not a copy, which is an entirely different beast - if it were a copy, it might be correct to apply the patch to both lines in some situations.


I've contributed a tiny amount to git (the high-level "git mergetool") so I can't speak for all of the git developers, but I've spent enough time hanging around them to say that the general feeling is that git's algorithm, which is "3-way merge, and then look at the intervening commits to fix any merge conflicts", is good enough.

You can always try to spend more time using more data, or deducing more semantic information, but past a certain point, it's what Linus Torvalds has called "mental masturbation".

For example, you could try to create an algorithm that notices that in branch A a function has been renamed, and in branch B, a call to that function was introduced, and when you merge A and B, it will also automatically rename the function invocation that was added in branch B. That might be closer to "doing the right thing". But does it matter? In practice, a quick trial compile of the sources before you finalize the merge will solve the problem, and that way you don't have to start adding language-specific semantic parsers for C++, Java, etc. So just because something could be done to make merges smarter doesn't mean that it should be done.

It's a similar case going on here. Yes, if you prepend and append identical text, a 3-way merge can get confused. And since git doesn't invoke its extra resolution magic unless the merge fails, the "wrong" result, at least according to the darcs folks, can happen. But the reason why git has chosen this result is that Linus wanted merges to be fast. If you have to examine every single intermediate node to figure out what might be going on, merges would become much slower, since in real life there will be many, many more intermediate nodes that darcs would have to analyze. Given that this situation doesn't happen much in real life (notwithstanding SCM geeks who spend all day dreaming up artificial merge scenarios), it's considered a worthwhile tradeoff.


Good point, and another argument for maintaining a reasonable test coverage. I'd even argue that a merge strategy that is too clever (like the one you described) could be more risky than a dumb one. It could lead to a resolution that is valid from a compiler standpoint, but wrong semantically, which makes it even harder to discover.


"Make things as simple as possible; no simpler" ~ Albert Einstein.


Why are you sure that this ambiguity can't be efficiently detected? If it can be, it's worth doing.


To quote Johannes Schindelin [1] :

  This all just proves again that there can be no perfect merge strategy;
  you'll always have to verify that the right thing was done.
[1] - http://thread.gmane.org/gmane.comp.version-control.git/10574...


It's obvious that there is no perfect merge strategy. There will always be ambiguous cases, cases in which the merge algorithm doesn't have enough information to make an informed decision, or cases in which there are changes that affect lines not caught by doing line-by-line diffs. I think that the point that Zooko and Russell O'Connor are making is that there are cases in which the merge algorithm does have available to it the information necessary to make a better decision (that is, the entire history of changes, rather than just the two changes being merged and their common ancestor), but in Git, that information isn't being taken into account. While you are never going to have a perfect merge strategy, the argument is that you can have one that is better.

Some people, however, feel that the Git algorithm is good enough, and doing it the Darcs way would be slower without much benefit other than for fairly artificial examples (you have to be doing something where you move a block of code, and then re-introduce that same block back in the original place on one side of the merge, while patching that block on the other side of the merge). Personally, I've found Git's merge strategy adequate for everything I've used it for. Git has support for multiple merge strategies, so if someone wanted to implement a better but slower one as an opt-in, they could do so.
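
For reference, strategies are already selectable at merge time, so a slower, smarter one could plug in alongside the existing ones (`topic` is a placeholder branch name):

  git merge -s resolve topic      # plain 3-way merge against the common ancestor
  git merge -s recursive topic    # the default recursive 3-way merge
  git merge -s ours topic         # keep our tree; only record the merge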


Amen. There's no way I'd ever do this in a real code base without checking that the result is what I intended.


Yes, I also always look at diffs after a merge, and also before I commit. Several times I've caught changes that I really didn't want going back to the branch.


Is there any reason to assume that merges should be associative? Hell, of the four normed division algebras, only three are associative; just because you can say "operations on octonions should be associative" doesn't mean that you can necessarily create a system of octonions where it's true.

For what it's worth, "git pull --rebase" does enforce a specific order to changes (local changes always happen after remote changes), which will produce the same results regardless of when user Bob pulls user Charlie's changes: regardless of whether Bob pulls change c1 after committing both b1 and b2, or after committing b1 and before committing b2, the final commit order will always be "a, c1, b1, b2".

Of course, if Bob commits and pushes b1 before Charlie commits and pushes c1, the final commit order will be "a, b1, c1, b2", but how could it ever be otherwise?


There are ways of making a DVCS that allow all merges to be associative, and all patches commutative except when there's a causal dependency between them, e.g. if patch A creates a file, and patch B edits that file, then they cannot commute. I believe darcs makes these guarantees, and making a correct implementation is relatively straightforward. (Making it fast is more complicated, but definitely doable.)

Ultimately, though, what you really want is for the VCS to just do what you mean. That's a lot trickier than providing mathematical guarantees about patch reordering and convergence.


Not to be grumpy about it, but git's shortcomings are well-known and most people don't run into them on a daily basis.

Some DVCS, like Darcs, might behave better, but they all seem almost comically slow even for medium-sized repos. If I have to sacrifice git's speed for certain types of correctness (that don't trouble me on a daily basis), I will be VERY reluctant to make that choice.


> There are still some people who think nothing is wrong with git; that it is okay for the result of a merge to depend on how things are merged rather than on only what is merged; that it is okay for two git repositories that pull the same patches to have different contents depending on how they pulled those patches. I don’t know what to say to those people. Such a view seems like insanity to me.

Git merges files, not file-histories. Git's behavior is simple, clear, and easy to understand.

I can see why you might expect merges to be transitive like this (it would be an elegant property, if it were true), but why does it matter to you? In what way do you use merges that could rely on this expectation?


Here's a quote from the article explaining what would rely on this expectation:

> There are still some people who think nothing is wrong with git; that it is okay for the result of a merge to depend on how things are merged rather than on only what is merged; that it is okay for two git repositories that pull the same patches to have different contents depending on how they pulled those patches. I don’t know what to say to those people. Such a view seems like insanity to me.


I think I kind of know what the author was getting at, but I'm not sure, and ending with the moral equivalent of "If you disagree with me, I guess you're just stupid" was a bit disappointing.

I think the idea is that the potential problems with this could emerge if you have two people simultaneously doing somewhat larger, complicated merges that have this core problem perhaps more than once. That may be true, but the probability of this occurring is well below that of plain old-fashioned human screwups, and the solution to both ("laborious history comparison, examination, and a reset --hard to a hash by somebody") is the same. I really don't see how fixing this would solve any real-world problem.


I thought the author was trying to get at the old Babbage quote on confusion of ideas.

FWIW, I use git-svn to handle complex merges in svn because git has a better merge algorithm. While this particular situation doesn't affect that use case - I think it could, but it should be rare with (svn) branch discipline - the fact that it might is something to keep in mind.


That quote by the author says 'I expect this behavior'. It does not give some use-case that would rely on it, aside from the use case of 'I use git, and incorrectly assume that merges are transitive.'

More specifically, if they pulled the same patches, the outcome would be identical. What he wants to be able to do is pull the same history by pulling different patches in that history. A patch is the diff between two repository states, and that's all it is. Sadly, diffs are intransitive.


Then you're no better off using a DVCS than you are using diff and patch. Not even git (which is as dumb as it gets) is _that_ dumb.


There are so many theories of "source control" that none of them are "simple clear and easy". They take study, and if you start from a different place, a new paradigm will be hard to learn and internalize.

That said: An elegant property? Are you kidding? That is intrinsic to most tools that dare call themselves "source control". Git requires extraordinary explanation if it behaves in an extraordinary fashion.


>That is intrinsic to most tools that dare call themselves "source control".

Bullshit. 'Merge' is one of the most complicated operations in every versioning system. I'm pretty confident svn is 'inconsistent'. Or is that too niche?


Merge is a tool. The issue is, can you reproduce source accurately, from a variety of starting points, and be sure you have some canonical thing (release x.y).

Do I understand you right? This is not the expectation for a source control tool?


You can reproduce any repository state that you (or anyone else) have stored. You can do it simply, reliably, and quickly. That is not in doubt.

Merge is not a tool for reproducing a canonical state, it's a tool for combining two or more of them, an entirely different topic.

Any other straw men you'd like to hold up real fast?


Got it, thanks.


There are two things most commenters in this thread have missed:

1) The article talks about auto-merges. If the code is "too close" by some definition of close, you get a conflict that needs to be manually merged. The article does NOT talk about manual merges.

2) The article is titled "Git is Inconsistent", it doesn't claim Git is WRONG, it claims it is INCONSISTENT. It does different things depending on how you merge and when.

I think consistency in a DVCS is a desirable goal. It should not matter whether you pull A then B, or pull B then A, or whether given a series of commits, you pull after each one, or just once at the end. The end result should be the same.

That it is a rare occurrence only makes it worse. You will mostly trust the auto-merge algorithm until you hit the corner case and it will be very expensive in terms of time/money to fix the mistake.

Git's brilliance/stupidity is precisely that it only tracks contents, so although it could get the right answer, doing so would be very expensive.


> The article is titled "Git is Inconsistent", it doesn't claim Git is WRONG, it claims it is INCONSISTENT.

Ok. The claim that git is inconsistent is wrong. From OP:

> The problem with git’s merging is that it doesn’t satisfy the “merge associativity law” which states that merging change A into a branch followed by merging change B into the branch gives the same results as merging both changes in together in one merge.

There is no such concept in git as "merging both changes in together in one merge".

> I have modified a shell script written by Simon Marlow that illustrates, using git, how merging two patches separately can give different results than merging two patches together.

The shell script doesn't do what is claimed. It can't because git has no facility for "merging two patches together". Git can only do 2 things with patches:

1. generate a patch

2. apply a patch

But! git has a function which is equivalent to combining 2 patches in a single merge:

  git pull --rebase

The shell script does not use this command. It first applies 2 patches separately. It then applies 1 patch separately.
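
For reference, the two patch operations mentioned above are roughly these commands (filenames and branch names hypothetical):

  git format-patch master..topic             # 1. generate a patch per commit
  git am 0001-first.patch 0002-second.patch  # 2. apply them one at a time, in order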

> There are still some people who think nothing is wrong with git; that it is okay for the result of a merge to depend on how things are merged rather than on only what is merged; that it is okay for two git repositories that pull the same patches to have different contents depending on how they pulled those patches. I don’t know what to say to those people.

This is just incoherent. I have no idea what to say in response because I have no idea what the intended meaning is.


If you never merge, but only use "git pull --rebase", you will have a straight line history and thus lose all of the "distributed" nature of the history. That's fine, but limiting. Any system that allows distributed development has to deal with parallel work that gets merged in stages. Otherwise you are no better than diff/patch. (FWIW, rebase merges before rebasing, so it is also vulnerable to this problem: rebasing just A, then rebasing B, is NOT the same as rebasing A + B.)

See: http://pastebin.com/SxmwpFkY


OP is saying something like "when I cook things with my freezer they don't get hot." It's that nonsensical.

Git can't do (at all) what he wants to accuse it of doing wrong (because it has nothing to do with what git does). So I'm just pointing out the closest approximation to what he's aiming at is to use pull --rebase.

Personally I like to have a straight line history as a default and only merge when required. Rather than always merge by default.

Edit: Ok, I'm not sure I understand the point of the pastebin. Maybe. If you want the lower C to become X you need to git checkout master and then git rebase c. Not the other way around. Is that it?


> OP is saying something like "when I cook things with my freezer they don't get hot." It's that non-sensical.

No, OP is saying "when I cook my food in the microwave for 3 minutes, I get it to a very different temperature than if I cook it for 1.5 minutes first and then another 1.5 minutes"


Super-simple-summary:

Git doesn't use history to determine merge behavior (edit: in this circumstance). Git behaves like applying patches. Darcs uses the history to make "intelligent" patches.

It's a matter of taste. If you look at Git as having a history, and therefore as something that should use the history, yes, it's incorrect. But if you look at it as a patch manager, it's behaving as it should, and Darcs is frighteningly unpredictable - the line numbers in a patch might not match the lines it actually modifies.

I side with Git on this. I can generate patches from Git that will work anywhere, and use them 100% identically within Git as manually applying them. The same cannot be said for Darcs.


Of course Git uses history. It doesn't _have_ to, but it does. As a matter of fact, as soon as you use diff3, you are using history (that's where the GCA, the greatest common ancestor, comes from).
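
That is, a sketch:

  # diff3-style merges take their base version from exactly this commit:
  git merge-base branchA branchB   # prints the greatest common ancestor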


Do you know which situations it does use it in, similar to this setup? Apparently not for moves; any other potential gotchas? I prefer patch-like behavior, because it can be predicted by looking at the patch.


After reading this it strikes me that git is imperative--it stores files as they were when you checked them in and merges what you tell it in the order you tell it.

Darcs, however, is more declarative--it stores patches. And not just patches but patches with dependencies. This set of patches describes how the current state of the repository is constructed. So when you merge you're really just adding new patches to the repo and it knows exactly what to do to make it work.

The interesting thing is that git has all the information there... It could go through the relevant history, diff everything and put the resulting patches in a darcs-like data structure and then commute patches with darcs' patch theory.

But in the end I'm not sure I'm ready to call darcs' style right and git's wrong. Both of them have fairly easy-to-understand object models, and they both have merges that act in accordance with the internal philosophies of those object models.


I agree with you completely, but I want to know: how can this be fixed in git? Surely there has to be something about the merging algorithm that can be changed to fix this, and if that's the case we can just patch it and move on.

What is the specific problem with the algorithm that causes this?


I assume this will reduce the quality of the merge algorithm from a stand-alone point of view, which is presumably a very hard sell.


I don't know that this is true for sure; perhaps introducing the patch would increase its quality. If someone offered such a patch we could discuss it; instead the article only shows the broken test case. It's almost a darcs plug without any reasoning.


See the link below posted by tonfa; seems this patch isn't worth it anyway :)


I wonder how mercurial compares in this aspect. Also, I'll keep using git because for sure, it's a helluva lot better than SVN or CVS (which my company was using when I got there).


Same as git, and you'll probably get the same reactions.

"""

In other words, we're already at the point of significantly diminished, possibly negative returns on effort. The last few percent will always require some level of human-equivalent intelligence. I think effort here is much better spent elsewhere, like researching general AI or playing on waterslides.

""" http://thread.gmane.org/gmane.comp.version-control.mercurial...


Thanks so much for this link, this is exactly the kind of analysis I was hoping for. Clearly this is all a bit of FUD, and darcs, which gets this right, is trying too hard. I wonder how fast the general merge algo that darcs is using to get this right is? <trollface>


Matt's point is that while some algorithms will fix this particular case, you can still come up with a different edge case which makes it break. The whole "perfect merge tool" idea was very popular five years ago (during git's and mercurial's infancy), but it didn't lead anywhere.

Simple merge strategies are "good enough" in practice.


Matt's point is that they've chosen a system that makes it really hard to get that last 10%.

"We have tried to draw spirals using cartesian coordinates, what we have gets us 90% there, but there are infinities and edge cases involved in getting a perfect spiral. The equations describing them would get so complicated it's just not worth it."

What we have in BitKeeper is the equivalent of polar coordinates... it makes drawing spirals much, much easier ;)


Do you have a page describing how that would differ? The bk website seems awfully outdated: there's no mention of the existence of other DVCSes, there's a quote from MySQL being happy with bk -- they switched to bazaar two years ago -- etc.

It would be nice if you could give some examples where bk gets the merge right while git doesn't.


Yeah, the website is awfully outdated and information-free. BitMover is working on it.

One example that bk gets right and git doesn't is precisely the one explained in the article.


I'd appreciate an explanation of your approach too; it would be great to know how we could "adjust our coordinates" to take care of this issue without losing speed, while gaining accuracy.

I suspect what you'll find is that changing the base in this way, while fixing this problem, would introduce other problems that occur much more regularly; but I hope I'm wrong.


You're wrong, but unfortunately I'm not at liberty to describe how bk does merges... it's part of the "secret sauce". i am truley sorry for your lots...


hehe, fair play. <gets out the reverse engineering kit> :) just kidding.


The least we can do is detect a situation like this and make it a conflict.



The article mentions that some systems do have the associativity property--that is, extra rungs in the merge ladder do not affect the result.

I can see how that can be achieved in the case of fully automatic merges. When merging B2 into C1+B1, you'd effectively un-merge C1+B1, merge B1 and B2, and then merge C1 and B1+B2.

But how would that work if C1+B1 had a conflict that had to be manually resolved? Assuming merging B1+B2 into C1 has the same problem (a fair assumption), will I have to do the same manual fixes again?

Or are they smart enough to look at the failed automatic C1+B1 merge, and generate a patch to that from the manual fixes I did, and then try to use those to resolve the merge of C1 and B1+B2?

I suspect there will be cases where this is just not going to work well.


Off topic, but the link to the shell script and the images in the article use Data URIs, which I didn't know existed: http://en.wikipedia.org/wiki/Data_URI_scheme


Does anyone know how bazaar would handle this?


Just try it or check the source. If they use patience or some kind of cdv merge, I expect they would get the same merge in both directions.


Hmm. I was hoping the article discussed git's mind-bogglingly horrible user interface.

Can't have everything, I guess.


git's UI is great, as long as you understand how it works.

The good thing is: "how it works" is really simple.

You should treat it like a language (just like all system/unix tools), not an "app".


I think git is one of the best tools we have, but its UI is really bad:

checkout and reset do completely different things when given files or when not given files.

reset on files should really have been called unadd. reset on refspecs should really have been jumpto, moveto or something else indicative that the current branch ptr is moved to a new refspec. --soft and friends could have been --no-update-index or --no-update-files.

checkout on files should really have been called overwrite. checkout on branch names should have probably been switch, setcurrentbranch or a name indicative that the current branch is being changed.

pull and push are symmetric names for asymmetric behavior. pull could have been a flag for merge (-f meaning fetch first).

reset --hard was for a long time the only way to move a branch ptr to a new position along with the files, but it has the potentially unintended consequence of also irreversibly deleting working tree changes. If you use it to delete, that's fine, but since you had to use it to move the branch ptr, it is simply wrong to have irreversible damage as a side effect. Especially in an RCS which is used by many as the fail-safe against their own user mistakes.

There's no easy way to see which branches are tracking what. And until recently it was a big PITA to even make the current branch track a remote branch.

Deleting remote branches has awkward syntax (pushing an empty string to a branch name) and then you have to use a specialized command (remote prune) if you want the deletion to be propagated to other repositories.

Another annoyance: Git doesn't let you push a detached head to a new remote branch, so you have to create a temp branch ptr to the detached head position and later delete it.

Git also doesn't have good support for versioned sub-projects. submodule is sub-par, and requires a multitude of extra commands even in the cases that should have been seamless.
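
For the record, the incantations being complained about look roughly like this (branch names are placeholders):

  git push origin :stale-branch      # delete a remote branch by pushing "nothing"
  git remote prune origin            # propagate such deletions to another clone
  git branch temp                    # pushing a detached HEAD: make a temp branch,
  git push origin temp:new-branch    #   push it,
  git branch -d temp                 #   then delete the temporary pointer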


"checkout and reset do completely different things when given files or when not given files. reset on files should really have been called unadd. reset on refspecs should really have been jumpto, moveto or something else indicative that the current branch ptr is moved to a new refspec. --soft and friends could have been --no-update-index or --no-update-files."

I can understand your confusion, given the seemingly separate use cases for reset, but in fact, it makes perfect sense. Reset always does what it says it does. Let's break it down:

git reset --mixed <commit> will make your current HEAD point to <commit>, reset the index to <commit>, and leave your working tree alone. This is useful for "uncommitting" the last commit, e.g. so you can split it up into smaller commits. Example:

  git commit -am "lots of changes"
  # realize you should really do better
  git reset --mixed HEAD~1
  git add myfile.py
  git commit -m "implemented feature x"
  git add yourfile.py
  git commit -m "bugfix #3182"
Handy. Now let's look at the "unadd" scenario:

  git add dontstage.py
  git reset HEAD dontstage.py   # same as: git reset --mixed HEAD dontstage.py (--mixed is the default)
git doesn't touch your commits, since you are resetting to HEAD, where you already are. Git does reset the index to HEAD, which is before you added dontstage.py. If you had other changes that you added, it won't reset those, since you provided the limiter of dontstage.py. Git does not touch your working tree, so dontstage.py stays modified. The end result? Your working tree, index, and commits look exactly like before you ran git add dontstage.py.
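
A compact summary of the three modes (a sketch; <commit> is whatever you're resetting to):

  git reset --soft  <commit>   # move the branch pointer only
  git reset --mixed <commit>   # ...and also reset the index (the default)
  git reset --hard  <commit>   # ...and also the index and the working tree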

Now, if someone (e.g. easy git: http://people.gnome.org/~newren/eg/) wants to map git reset HEAD to an "unadd" command, that's fine by me. I'm speculating here, but I imagine that the Linus/git dev point of view is: why call it anything other than exactly what it is? It's just nice and elegant that it happens to suffice for multiple use cases.

The more you get into git, the more you start to realize why some of the commands that seemed arcane in the beginning are simple and elegantly named.


Even after your explanation, the names "reset" and "--mixed" make no sense to me. "reset" is not indicative of what's being reset. "--mixed" is almost meaningless. "--soft" and "--hard" are also mostly meaningless.

I'm OK with having a low-level primitive like "reset" that doesn't have a simple meaning so cannot have a meaningful name. But then, it should be wrapped with meaningful commands such as "moveto" with flags to avoid touching index or working tree, and "unadd" on top of "reset". Then, I don't think anyone would ever use reset directly, so it would probably be phased out :-)


Sounds like you should be using Mercurial. The only thing it really lacks is the ability to change history, but this is more of a feature than a bug.


Mercurial has the same ability to change history as git, it just doesn't make it the default workflow. The mq extension (Mercurial Queues) lets you pull existing changesets into a temporary queue, fold them, reorder them, etc. The rebase extension lets you rebase and the histedit extension lets you edit history in slightly different ways.

All of these ship with Mercurial, but are turned off by default. Enabling them is just a matter of adding

    [extensions]
    mq = 
to your .hgrc file.
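
Presumably the other extensions mentioned above are enabled the same way (a sketch, assuming they're available in your Mercurial install):

    [extensions]
    mq =
    rebase =
    histedit =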


None of that says "really bad UI". Quirky, sure. Not as straightforward as others. Could definitely be improved. But not "really bad".

Though I will add that the index is a horribly named concept and it really bugs me that different commands use different names for it ("--cached", but sometimes "--index"). They need to rename it to "staging" and change all the command line options to --staging (keeping the old ones as hidden backward-compatible options, of course... "diff --cached" is ingrained in my memory at this point). I think that would make things more consistent and clear.
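
For example, the same concept already goes by several names across commands (the filenames here are hypothetical; the flags are real):

  git diff --cached            # diff the index against HEAD
  git rm --cached file.txt     # remove from the index only
  git apply --index fix.patch  # apply to the index and the working tree
  git diff --staged            # newer synonym, hinting at the better name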


> None of that says "really bad UI". Quirky, sure. Not as straightforward as others.

I wouldn't call strychnine a poison. It's just a quirky food additive.



