As several people have noted in comments (and made links to other questions), git cherry-pick actually does a three-way merge. "How do cherry-pick and revert work?" describes this, but more in terms of content than mechanism.
I describe the source of a particular set of merge conflicts in "Why do I get this merge conflict with git rebase interactive?", along with a general outline of cherry-pick and revert, but I think it's a good idea to step back and ask the mechanism question you did. I would re-frame it a bit, though, as these three questions:
- Is a commit really a snapshot?
- If a commit is a snapshot, how does git show or git log -p show it as a change?
- If a commit is a snapshot, how can git cherry-pick or git revert work?
Answering the last requires first answering one more question:
- How does Git perform git merge?
So, let's take these four questions, in the correct order. This is going to be rather long, and if you like, you can jump straight to the last section—but note that it builds on the third section, which builds on the second, which builds on the first.
Is a commit really a snapshot?
Yes, though technically a commit refers to a snapshot, rather than being one. This is pretty simple and straightforward. To use Git, we generally start out by running git clone, which gets us a new repository. Occasionally, we start out by making an empty directory and using git init to create an empty repository. Either way, though, we now have three entities:
- The repository itself, which is a big database of objects, plus a smaller database of name to hash ID mappings (for, e.g., branch names), plus lots of other mini-databases implemented as single files (e.g., one per reflog).
- Something Git calls the index, or the staging area, or sometimes the cache. What it gets called depends on who does the calling. The index is essentially where you have Git build the next commit you will make, though it takes on an expanded role during merges.
- The work-tree, which is where you can actually see files and work on / with them.
The object database holds four types of objects, which Git calls commits, trees, blobs, and annotated tags. Trees and blobs are mostly implementation detail, and we can ignore annotated tags here: the main function of this big database, for our purposes, is to hold all our commits. These commits then refer to the trees and blobs that hold the files. In the end, it's actually the combination of trees-plus-blobs that is the snapshot. Still, every commit has exactly one tree, and that tree is what gets us the rest of the way to the snapshot, so except for lots of devilish implementation details, the commit itself might as well be a snapshot.
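The commit-to-tree-to-blob chain can be sketched with a toy content-addressed store. This is a simplified model, not Git's on-disk format: the names store, put, put_json, and snapshot_of are made up for illustration, the tree is flat (real trees nest per directory), and trees and commits are encoded as JSON purely for convenience. Only the blob IDs follow Git's real scheme, which hashes a "blob <size>" header, a NUL byte, and the contents with SHA-1.

```python
import hashlib
import json

store = {}  # hypothetical object database: hash ID -> (type, payload bytes)

def put(obj_type, payload):
    """Store an object under the SHA-1 of its type/size header plus payload."""
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    oid = hashlib.sha1(header + payload).hexdigest()
    store[oid] = (obj_type, payload)
    return oid

def put_json(obj_type, value):
    # Toy encoding: real Git trees and commits use their own formats, not JSON.
    return put(obj_type, json.dumps(value, sort_keys=True).encode())

def snapshot_of(commit_id):
    """Follow commit -> tree -> blobs and return {path: file contents}."""
    tree_id = json.loads(store[commit_id][1])["tree"]
    tree = json.loads(store[tree_id][1])
    return {path: store[blob_id][1] for path, blob_id in tree.items()}

readme = put("blob", b"hello\n")
empty = put("blob", b"")
tree = put_json("tree", {"README": readme, "empty.txt": empty})
commit = put_json("commit", {"tree": tree, "parent": None, "message": "initial"})

print(snapshot_of(commit))
```

The commit itself holds only a tree ID and metadata; the file contents are reached by following the IDs, which is why the commit "refers to" the snapshot rather than being one.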
How we use the index to make new snapshots
We won't go too deep into the weeds yet, but we will say that the index works by holding a compressed, Git-ified, mostly-frozen copy of every file. Technically, it holds a reference to the actually-frozen copy, stored as a blob. That is, if you start by doing git clone url, Git has run git checkout branch as the last step of the clone. This checkout filled in the index from the commit at the tip of branch, so that the index has a copy of every file in that commit.
Indeed, most[1] git checkout operations fill in both the index and the work-tree from a commit. This lets you see, and use, all of your files in the work-tree, but the work-tree copies aren't the ones that are actually in the commit. What's in the commit is (are?) frozen, compressed, Git-ified, can-never-be-changed blob snapshots of all of those files. This keeps those versions of those files forever (or for as long as the commit itself exists) and is great for archival, but useless for doing any actual work. That's why Git de-Git-ifies the files into the work-tree.
Git could stop here, with just commits and work-trees. Mercurial, which is in many ways like Git, does stop here: your work-tree is your proposed next commit. You just change stuff in your work-tree and then run hg commit, and it makes the new commit from your work-tree. This has the obvious advantage that there's no pesky index making trouble. But it also has some drawbacks, including being inherently slower than Git's method. In any case, what Git does is to start with the previous commit's information saved in the index, ready to be committed again.
Then, each time you run git add, Git compresses and Git-ifies the file you add, and updates the index now. If you change just a few files, and then git add just those few files, Git only has to update a few index entries. So this means that at all times the index has the next snapshot inside it, in the special Git-only compressed and ready-to-freeze form.
This in turn means that git commit simply needs to freeze the index contents. Technically, it turns the index into a new tree, ready for the new commit. In a few cases, such as after some reverts, or for a git commit --allow-empty, the new tree will actually be the same tree as some previous commit, but you don't need to know or care about this.
At this point, Git collects your log message and the other metadata that goes into each commit. It adds the current time as the time-stamp—this helps make sure that each commit is totally unique, as well as being generally useful. It uses the current commit as the new commit's parent hash ID, uses the tree hash ID produced by saving the index, and writes out the new commit object, which gets a new and unique commit hash ID. The new commit therefore contains the actual hash ID of whatever commit you had checked out earlier.
Last, Git writes the new commit's hash ID into the current branch name, so that the branch name now refers to the new commit, rather than to the new commit's parent, as it used to. That is, whatever commit was the tip of the branch is now one step behind the tip of the branch. The new tip is the commit you just made.
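The add-then-commit sequence described above can be sketched as a toy model. This is only an illustration under simplifying assumptions: the dictionaries blobs, commits, branches, and index, and the functions add and commit, are hypothetical stand-ins, and the hashing does not match Git's real object format.

```python
import hashlib
import time

blobs = {}                    # hypothetical blob store: blob ID -> frozen contents
commits = {}                  # commit ID -> commit record
branches = {"master": None}   # branch name -> tip commit ID
head = "master"               # name of the current branch
index = {}                    # path -> blob ID: the proposed next snapshot

def add(path, contents):
    """Like git add: freeze one file's contents and update one index entry."""
    blob_id = hashlib.sha1(contents).hexdigest()
    blobs[blob_id] = contents
    index[path] = blob_id

def commit(message):
    """Like git commit: freeze the index, record the parent, move the branch."""
    record = {
        "tree": dict(index),        # frozen copy of the index
        "parent": branches[head],   # the commit that was the branch tip
        "message": message,
        "time": time.time(),        # the time-stamp helps keep IDs unique
    }
    commit_id = hashlib.sha1(repr(sorted(record.items())).encode()).hexdigest()
    commits[commit_id] = record
    branches[head] = commit_id      # the branch name now names the new commit
    return commit_id

add("a.txt", b"one\n")
add("b.txt", b"two\n")
first = commit("initial")

add("a.txt", b"one, edited\n")      # only this one index entry changes
second = commit("edit a.txt")

print(commits[second]["parent"] == first, branches["master"] == second)
```

Note that the second commit still carries an entry for b.txt even though it was never re-added: the index held it the whole time, so the new snapshot is complete.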
[1] You can use git checkout commit -- path to extract one particular file from one particular commit. This still copies the file into the index first, so that's not really an exception. However, you can also use git checkout to copy files just from the index, to the work-tree, and you can use git checkout -p to selectively, interactively patch files, for instance. Each of these variants has its own special set of rules as to what it does with index and/or work-tree.
Since Git builds new commits from the index, it may be wise (albeit painful) to re-check the documentation often. Fortunately, git status tells you a lot about what's in the index now, by comparing the current commit vs the index, then comparing the index vs the work-tree, and for each such comparison, telling you what's different. So a lot of the time, you don't have to carry around, in your head, all the wildly varying details of each Git command's effect on index and/or work-tree: you can just run the command, and use git status later.
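The two comparisons that git status makes can be sketched roughly as follows. This is a toy model, not how Git implements it: compare and status are hypothetical names, and each "snapshot" is just a dictionary from path to contents.

```python
def compare(old, new):
    """Return {path: 'added' | 'deleted' | 'modified'} between two snapshots."""
    changes = {}
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            changes[path] = "added"
        elif path not in new:
            changes[path] = "deleted"
        elif old[path] != new[path]:
            changes[path] = "modified"
    return changes

def status(head_commit, index, work_tree):
    """First compare commit vs index, then index vs work-tree."""
    staged = compare(head_commit, index)
    # A file in the work-tree but not in the index is reported as untracked.
    unstaged = {
        path: ("untracked" if kind == "added" else kind)
        for path, kind in compare(index, work_tree).items()
    }
    return {"staged": staged, "not staged": unstaged}

head_commit = {"a.txt": "one\n", "b.txt": "two\n"}
index = {"a.txt": "one, edited\n", "b.txt": "two\n"}   # a.txt has been added
work_tree = {"a.txt": "one, edited\n", "b.txt": "two!\n", "c.txt": "new\n"}

result = status(head_commit, index, work_tree)
print(result)
```

The first comparison describes what the next commit would change; the second describes what it would leave out.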
How does git show or git log -p show a commit as a change?
Each commit contains the raw hash ID of its parent commit, which in turn means that we can always start at the last commit of some string of commits, and work backwards to find all the previous commits:
... <-F <-G <-H <--master
We only need to have a way to find the last commit. That way is: the branch name, such as master here, identifies the last commit. If that last commit's hash ID is H, Git finds commit H in the object database. H stores G's hash ID, from which Git finds G, which stores F's hash ID, from which Git finds F, and so on.
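The backwards walk can be sketched in a few lines. This is a toy model: single letters stand in for real hash IDs, the history function is a hypothetical name, and F is treated as the first commit even though the drawing shows earlier ones.

```python
commits = {
    "F": {"parent": None},   # the drawing has earlier commits; the toy stops here
    "G": {"parent": "F"},
    "H": {"parent": "G"},
}
branches = {"master": "H"}   # the branch name identifies the last commit

def history(branch):
    """Start at the branch tip and follow parent IDs backwards."""
    commit_id = branches[branch]
    while commit_id is not None:
        yield commit_id
        commit_id = commits[commit_id]["parent"]

print(list(history("master")))
```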
This is also the guiding principle behind showing a commit as a patch. We have Git look at the commit itself, find its parent, and extract that commit's snapshot. Then we have Git extract the commit's snapshot too. Now we have two snapshots, and now we can compare them—subtract the earlier one from the later one, as it were. Whatever is different, that must be what changed in that snapshot.
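As a rough sketch of "subtract the earlier snapshot from the later one", the following toy compares a commit's files against its parent's files using Python's difflib. The show function and the commits dictionary are hypothetical stand-ins; Git's own diff engine differs in many ways (rename detection, binary files, and so on).

```python
import difflib

commits = {
    "G": {"parent": None, "files": {"a.txt": "one\n", "b.txt": "two\n"}},
    "H": {"parent": "G", "files": {"a.txt": "one, edited\n", "b.txt": "two\n"}},
}

def show(commits, commit_id):
    """Diff the parent's snapshot against this commit's snapshot, file by file."""
    commit = commits[commit_id]
    new = commit["files"]
    old = commits[commit["parent"]]["files"] if commit["parent"] else {}
    out = []
    for path in sorted(old.keys() | new.keys()):
        before = old.get(path, "").splitlines(keepends=True)
        after = new.get(path, "").splitlines(keepends=True)
        # unified_diff yields nothing when the two versions are identical
        out.extend(difflib.unified_diff(before, after, f"a/{path}", f"b/{path}"))
    return "".join(out)

patch = show(commits, "H")
print(patch)
```

Both commits hold complete snapshots; the "change" exists only as the result of comparing them.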
Note that this only works for non-merge commits. When we have Git build a merge commit, we have Git store not one but two parent hash IDs. For instance, after running git merge feature while on master, we may have:
       G--H--I
      /       \
...--F         M   <-- master (HEAD)
      \       /
       J--K--L   <-- feature
Commit M has two parents: its first parent is I, which was the tip commit on master just a moment ago. Its second parent is L, which is still the tip commit on feature. It's hard (well, impossible, really) to present commit M as a simple change from either I or L, and by default, git log simply doesn't bother to show any changes here!
(You can tell both git log and git show to, in effect, split the merge: to show a diff from I to M, and then to show a second, separate diff from L to M, using git log -m -p or git show -m. The git show command produces, by default, what Git calls a combined diff, which is kind of weird and special: it's made by, in effect, running both diffs as for -m, then ignoring most of what they say and showing you only some of those changes that come from both commits. This relates pretty strongly to how merges work: the idea is to show the parts that might have had merge conflicts.)
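A very rough, file-level sketch of the difference between the split (-m) view and the combined view follows. It is an assumption-laden toy: changed_paths, split_diffs, and combined_paths are hypothetical names, and it only models which files get listed (those that differ from every parent), not the line-level filtering that a real combined diff also does.

```python
def changed_paths(old, new):
    """Paths whose contents differ between two snapshots."""
    return {p for p in old.keys() | new.keys() if old.get(p) != new.get(p)}

def split_diffs(parents, merge):
    """Like -m: one separate comparison per parent."""
    return [changed_paths(parent, merge) for parent in parents]

def combined_paths(parents, merge):
    """File-level filter only: keep paths that differ from every parent."""
    return set.intersection(*split_diffs(parents, merge))

# Toy snapshots for commits I, L, and M from the drawing above.
I = {"a.txt": "ours\n", "b.txt": "base\n", "c.txt": "x\ny-ours\n"}
L = {"a.txt": "base\n", "b.txt": "theirs\n", "c.txt": "x-theirs\ny\n"}
M = {"a.txt": "ours\n", "b.txt": "theirs\n", "c.txt": "x-theirs\ny-ours\n"}

print(split_diffs([I, L], M))
print(combined_paths([I, L], M))
```

Here a.txt and b.txt each match one parent exactly, so only c.txt, which both sides touched, survives the filter. That is the kind of file where a conflict could have occurred.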
This leads us to our embedded question, which we need to cover before we get to cherry-pick and revert. We need to talk about the mechanics of git merge, i.e., how we got a snapshot for commit M in the first place.
How does Git perform git merge?
Let's start by noting that the point of a merge (well, of most merges, anyway) is to combine work. When we did git checkout master and then git merge feature, we meant: "I did some work on master. Someone else did some work on feature. I'd like to combine the work they did with the work I did." There is a process for doing this combining, and then a simpler process for saving the result.
Thus, there are two parts to a true merge that results in a commit like M above. The first part is what I like t