Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


cherry pick - What does "changes introduced by a commit" mean in Git?

Everywhere I see this: "...cherry-pick applies changes introduced by a commit..."

I did this: created this file in master:

** File 1 **

Content

** Footer **

then branched out to branch2 and committed a change:

** File 1 **

Content
Edit 1

** Footer **

and then another one:

** File 1 **

Content
Edit 2
Edit 1

** Footer **

Now I went back to master and tried to cherry-pick the latest commit from branch2. I expected that only 'Edit 2' would get imported, since isn't that the change introduced by that commit, compared to the previous one?

What I got instead is the following merge conflict:

** File 1 **

Content
<<<<<<< HEAD
=======
Edit 2
Edit 1
>>>>>>> b634e53...
** Footer **

Now my obvious question is: what is it that I misunderstand about how cherry-pick works, and concretely, why is there a merge conflict here, when the same thing would be a fast-forward with git merge?

IMPORTANT NOTICE: This is NOT meant as a question about merge conflicts; what I'm interested in is what cherry-pick is actually doing here. And I'm not asking out of curiosity, but because I'm running into trouble using Git at my job.


1 Reply


As several people have noted in comments (and made links to other questions), git cherry-pick actually does a three-way merge. "How do cherry-pick and revert work?" describes this, but more in terms of content than mechanism.

I describe the source of a particular set of merge conflicts in "Why do I get this merge conflict with git rebase interactive?", along with a general outline of cherry-pick and revert, but I think it's a good idea to step back and ask the mechanism question you did. I would re-frame it a bit, though, as these three questions:

  • Is a commit really a snapshot?
  • If a commit is a snapshot, how does git show or git log -p show it as a change?
  • If a commit is a snapshot, how can git cherry-pick or git revert work?

Answering the last requires first answering one more question:

  • How does Git perform git merge?

So, let's take these four questions, in the correct order. This is going to be rather long, and if you like, you can jump straight to the last section—but note that it builds on the third section, which builds on the second, which builds on the first.
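As a preview of where this ends up, here is a rough Python sketch of the three-way comparison applied to the file from the question. It assumes the usual cherry-pick setup (the merge base is the picked commit's parent, "ours" is the current commit on master, "theirs" is the picked commit), models the file as just its content lines, and uses a simplified "regions overlap or touch" test in place of Git's real merge code. The function names are my own.

```python
import difflib
from typing import List, Tuple

Region = Tuple[int, int]  # half-open range of line indexes in the base version


def changed_regions(base: List[str], other: List[str]) -> List[Region]:
    """Ranges of base lines that `other` replaced, deleted, or inserted at."""
    matcher = difflib.SequenceMatcher(None, base, other, autojunk=False)
    return [(i1, i2) for tag, i1, i2, _, _ in matcher.get_opcodes() if tag != "equal"]


def regions_collide(ours: List[Region], theirs: List[Region]) -> bool:
    """True if any two regions overlap or touch (a simplification, not Git's exact rule)."""
    return any(a1 <= b2 and b1 <= a2 for a1, a2 in ours for b1, b2 in theirs)


base = ["Content", "Edit 1"]              # parent of the picked commit
ours = ["Content"]                        # master (HEAD)
theirs = ["Content", "Edit 2", "Edit 1"]  # the picked commit on branch2

if __name__ == "__main__":
    print(changed_regions(base, ours))    # what master did relative to the base
    print(changed_regions(base, theirs))  # what the picked commit did relative to the base
    print(regions_collide(changed_regions(base, ours), changed_regions(base, theirs)))
```

Relative to the base, master deleted "Edit 1" while the picked commit inserted "Edit 2" immediately before it. Both changes land on the same spot, so the sketch reports a collision, which matches the conflict shown in the question.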

Is a commit really a snapshot?

Yes—though, technically, a commit refers to a snapshot, rather than being one. This is pretty simple and straightforward. To use Git, we generally start out by running git clone, which gets us a new repository. Occasionally, we start out by making an empty directory and using git init to create an empty repository. Either way, though, we now have three entities:

  1. The repository itself, which is a big database of objects, plus a smaller database of name to hash ID mappings (for, e.g., branch names), plus lots of other mini-databases implemented as single files (e.g., one per reflog).

  2. Something Git calls the index, or the staging area, or sometimes the cache. What it gets called depends on who does the calling. The index is essentially where you have Git build the next commit you will make, though it takes on an expanded role during merges.

  3. The work-tree, which is where you can actually see files and work on / with them.

The object database holds four types of objects, which Git calls commits, trees, blobs, and annotated tags. Trees and blobs are mostly implementation detail, and we can ignore annotated tags here: the main function of this big database, for our purposes, is to hold all our commits. These commits then refer to the trees and blobs that hold the files. In the end, it's actually the combination of trees-plus-blobs that is the snapshot. Still, every commit has exactly one tree, and that tree is what gets us the rest of the way to the snapshot, so except for lots of devilish implementation details, the commit itself might as well be a snapshot.
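To make "blob" a little less abstract, here is a minimal Python sketch of how a blob's hash ID is derived from a file's content: it is the SHA-1 of a short header followed by the bytes. The function name git_blob_id is my own, and the sketch only covers blobs in a SHA-1 repository (not trees, commits, or SHA-256 repositories).

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob holding `content`."""
    # The header is the object type, a space, the size in bytes, and a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    print(git_blob_id(b""))                  # ID of the empty blob
    print(git_blob_id(b"Content\nEdit 1\n"))
```

Because the ID depends only on the content, two files with identical bytes share a single blob, which is part of why keeping a full snapshot per commit is cheaper than it sounds.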

How we use the index to make new snapshots

We won't go too deep into the weeds yet, but we will say that the index works by holding a compressed, Git-ified, mostly-frozen copy of every file. Technically, it holds a reference to the actually-frozen copy, stored as a blob. That is, if you start by doing git clone url, Git has run git checkout branch as the last step of the clone. This checkout filled in the index from the commit at the tip of branch, so that the index has a copy of every file in that commit.

Indeed, most[1] git checkout operations fill in both the index and the work-tree from a commit. This lets you see, and use, all of your files in the work-tree, but the work-tree copies aren't the ones that are actually in the commit. What's in the commit is (are?) frozen, compressed, Git-ified, can-never-be-changed blob snapshots of all of those files. This keeps those versions of those files forever (or for as long as the commit itself exists) and is great for archival, but useless for doing any actual work. That's why Git de-Git-ifies the files into the work-tree.

Git could stop here, with just commits and work-trees. Mercurial—which is in many ways like Git—does stop here: your work-tree is your proposed next commit. You just change stuff in your work-tree and then run hg commit and it makes the new commit from your work-tree. This has the obvious advantage that there's no pesky index making trouble. But it also has some drawbacks, including being inherently slower than Git's method. In any case, what Git does is to start with the previous commit's information saved in the index, ready to be committed again.

Then, each time you run git add, Git compresses and Git-ifies the file you add, and updates the index now. If you change just a few files, and then git add just those few files, Git only has to update a few index entries. So this means that at all times the index has the next snapshot inside it, in the special Git-only compressed and ready-to-freeze form.

This in turn means that git commit simply needs to freeze the index contents. Technically, it turns the index into a new tree, ready for the new commit. In a few cases, such as after some reverts, or for a git commit --allow-empty, the new tree will actually be the same tree as some previous commit, but you don't need to know or care about this.

At this point, Git collects your log message and the other metadata that goes into each commit. It adds the current time as the time-stamp—this helps make sure that each commit is totally unique, as well as being generally useful. It uses the current commit as the new commit's parent hash ID, uses the tree hash ID produced by saving the index, and writes out the new commit object, which gets a new and unique commit hash ID. The new commit therefore contains the actual hash ID of whatever commit you had checked out earlier.

Last, Git writes the new commit's hash ID into the current branch name, so that the branch name now refers to the new commit, rather than to the new commit's parent, as it used to. That is, whatever commit was the tip of the branch, that commit is now one step behind the tip of the branch. The new tip is the commit you just made.
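The sequence just described (update index entries, freeze the index into a tree, record the parent, move the branch name) can be sketched as a toy model in Python. Everything here, including the ToyRepo and Commit names and the way the hash ID is computed, is invented for illustration and is not Git's real storage format.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    tree: Tuple[Tuple[str, str], ...]  # frozen snapshot: sorted (path, content) pairs
    parent: Optional[str]              # hash ID of the parent commit, if any
    message: str
    timestamp: float


class ToyRepo:
    """Toy model of index -> commit -> branch update (not Git's real format)."""

    def __init__(self) -> None:
        self.objects: Dict[str, Commit] = {}
        self.index: Dict[str, str] = {}  # path -> content of the proposed next commit
        self.branches: Dict[str, Optional[str]] = {"master": None}
        self.head = "master"

    def add(self, path: str, content: str) -> None:
        # Like git add: only this one index entry is updated.
        self.index[path] = content

    def commit(self, message: str) -> str:
        tree = tuple(sorted(self.index.items()))  # freeze the index contents
        parent = self.branches[self.head]         # the current commit becomes the parent
        new = Commit(tree, parent, message, time.time())
        commit_id = hashlib.sha1(repr(new).encode("utf-8")).hexdigest()
        self.objects[commit_id] = new
        self.branches[self.head] = commit_id      # the branch name now names the new tip
        return commit_id
```

Note that the index is not cleared after commit: it still holds the snapshot that was just committed, ready to be committed again.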


[1] You can use git checkout commit -- path to extract one particular file from one particular commit. This still copies the file into the index first, so that's not really an exception. However, you can also use git checkout to copy files just from the index, to the work-tree, and you can use git checkout -p to selectively, interactively patch files, for instance. Each of these variants has its own special set of rules as to what it does with index and/or work-tree.

Since Git builds new commits from the index, it may be wise—albeit painful—to re-check the documentation often. Fortunately, git status tells you a lot about what's in the index now—by comparing the current commit vs the index, then comparing the index vs the work-tree, and for each such comparison, telling you what's different. So a lot of the time, you don't have to carry around, in your head, all the wildly varying details of each Git command's effect on index and/or work-tree: you can just run the command, and use git status later.
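The two comparisons that git status makes can be sketched with plain dictionaries. This is a simplified model under my own assumptions: each snapshot is a mapping from path to content, the function names are invented, and untracked files, renames, and file modes are ignored.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[str, str]  # path -> content


def changed_paths(old: Snapshot, new: Snapshot) -> List[str]:
    """Paths that were added, removed, or modified between two snapshots."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


def toy_status(head: Snapshot, index: Snapshot, worktree: Snapshot) -> Tuple[List[str], List[str]]:
    staged = changed_paths(head, index)        # current commit vs index
    unstaged = changed_paths(index, worktree)  # index vs work-tree
    return staged, unstaged


if __name__ == "__main__":
    head = {"a.txt": "1", "b.txt": "1"}
    index = {"a.txt": "2", "b.txt": "1"}     # a.txt was edited and added
    worktree = {"a.txt": "2", "b.txt": "3"}  # b.txt was edited but not added
    print(toy_status(head, index, worktree))
```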


How does git show or git log -p show a commit as a change?

Each commit contains the raw hash ID of its parent commit, which in turn means that we can always start at the last commit of some string of commits, and work backwards to find all the previous commits:

... <-F <-G <-H   <--master

We only need to have a way to find the last commit. That way is: the branch name, such as master here, identifies the last commit. If that last commit's hash ID is H, Git finds commit H in the object database. H stores G's hash ID, from which Git finds G, which stores F's hash ID, from which Git finds F, and so on.
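Here is that backwards walk as a short Python sketch. The one-letter "hash IDs" match the drawing above; for the sake of a finite example, I pretend F has no parent, although the drawing shows earlier commits before it.

```python
from typing import Dict, Iterator, Optional

# Each commit stores its parent's hash ID (toy one-letter IDs; F is treated as a root).
parents: Dict[str, Optional[str]] = {"F": None, "G": "F", "H": "G"}

# A branch name stores only the hash ID of the last commit.
branches: Dict[str, str] = {"master": "H"}


def history(branch: str) -> Iterator[str]:
    """Yield commits from the branch tip backwards, following parent links."""
    commit: Optional[str] = branches[branch]
    while commit is not None:
        yield commit
        commit = parents[commit]


if __name__ == "__main__":
    print(list(history("master")))
```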

This is also the guiding principle behind showing a commit as a patch. We have Git look at the commit itself, find its parent, and extract that commit's snapshot. Then we have Git extract the commit's snapshot too. Now we have two snapshots, and now we can compare them—subtract the earlier one from the later one, as it were. Whatever is different, that must be what changed in that snapshot.
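As a sketch of this "subtract the earlier snapshot from the later one" idea, here are the two branch2 snapshots from the question compared using Python's difflib. The file is modelled as just its content lines, and both the file name file1 and the function name show_as_patch are placeholders of mine. Git's own diff code is a different implementation, so real output can differ in details.

```python
import difflib
from typing import List


def show_as_patch(parent_lines: List[str], commit_lines: List[str]) -> List[str]:
    """Compare a commit's snapshot of one file against its parent's snapshot."""
    return list(
        difflib.unified_diff(
            parent_lines, commit_lines,
            fromfile="a/file1", tofile="b/file1", lineterm="",
        )
    )


parent = ["Content", "Edit 1"]            # first commit on branch2
commit = ["Content", "Edit 2", "Edit 1"]  # second commit on branch2

if __name__ == "__main__":
    print("\n".join(show_as_patch(parent, commit)))
```

Seen this way, the second commit does introduce only the "Edit 2" line, but only relative to its own parent, which already contains "Edit 1".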

Note that this only works for non-merge commits. When we have Git build a merge commit, we have Git store not one but two parent hash IDs. For instance, after running git merge feature while on master, we may have:

       G--H--I
      /       \
...--F         M   <-- master (HEAD)
      \       /
       J--K--L   <-- feature

Commit M has two parents: its first parent is I, which was the tip commit on master just a moment ago. Its second parent is L, which is still the tip commit on feature. It's hard—well, impossible, really—to present commit M as a simple change from either I or L, and by default, git log simply doesn't bother to show any changes here!

(You can tell both git log and git show to, in effect, split the merge: to show a diff from I to M, and then to show a second, separate diff from L to M, using git log -m -p or git show -m. The git show command produces, by default, what Git calls a combined diff, which is kind of weird and special: it's made by, in effect, running both diffs as for -m, then ignoring most of what they say and showing you only some of those changes that come from both commits. This relates pretty strongly to how merges work: the idea is to show the parts that might have had merge conflicts.)

This leads us to our embedded question, which we need to cover before we get to cherry-pick and revert. We need to talk about the mechanics of git merge, i.e., how we got a snapshot for commit M in the first place.

How does Git perform git merge?

Let's start by noting that the point of a merge—well, of most merges, anyway—is to combine work. When we did git checkout master and then git merge feature, we meant: I did some work on master. Someone else did some work on feature. I'd like to combine the work they did with the work I did. There is a process for doing this combining, and then a simpler process for saving the result.

Thus, there are two parts to a true merge that results in a commit like M above. The first part is what I like t

