Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


git push - Could Git Smart API thin pack calculation ever consider reusing common sub-trees?

Q: When git pushes refs that share no history with the remote over the Smart Protocol, can it take into account root trees or sub-trees that local and origin already have in common when building the thin pack to send?

tl;dr

Consider this (uncommon) situation when working with and pushing to a remote Git repository.

  • I have a local repository where the local master points to a tree with 1110 descendant sub-trees a[0-9]/b[0-9]/c[0-9] (10 + 100 + 1000).
  • Remote origin/master is current with the local master commit, i.e. the histories are identical. The remote is reached over the ssh protocol.
  • For whatever reason, I create a local branch squashed and point it at a new, single root commit with the same content/tree as master. This can be done with git commit-tree. The branch therefore has one commit and no commits in common with master, but its root tree hash is identical: it points to the same tree object as master and origin/master. That this is a single squashed commit is not important for the discussion; any history rewritten back to the root commit, with no common history, will do.
  • git push origin HEAD # push squashed
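
The setup in the list above can be reproduced in a throwaway repository. The following is a minimal sketch, not the asker's actual commands: the directory layout is a tiny stand-in for a[0-9]/b[0-9]/c[0-9], and the identity values, temp directory and commit messages are placeholders chosen for illustration. It stops short of the push itself and only shows that squashed shares its root tree with master while sharing no commits.

```shell
#!/bin/sh
# Sketch: build the "same tree, unrelated history" situation locally.
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git symbolic-ref HEAD refs/heads/master   # pin the branch name used in the question

# Placeholder identity for the throwaway commits.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Tiny stand-in for the a[0-9]/b[0-9]/c[0-9] layout, with two commits of history.
mkdir -p a0/b0/c0 a1/b0/c0
echo one > a0/b0/c0/file
echo two > a1/b0/c0/file
git add .
git commit -q -m "initial"
echo three > a0/file
git add .
git commit -q -m "second"

# Create a parentless commit that reuses master's root tree, then name it.
master_tree=$(git rev-parse 'master^{tree}')
squashed_commit=$(git commit-tree -m "squashed" "$master_tree")
git branch squashed "$squashed_commit"
squashed_tree=$(git rev-parse 'squashed^{tree}')

echo "master tree:   $master_tree"
echo "squashed tree: $squashed_tree"

# merge-base exits non-zero when the two histories share no commit.
if git merge-base master squashed >/dev/null; then
    shared_history=yes
else
    shared_history=no
fi
echo "shared history: $shared_history"
```

The two tree IDs printed are the same object, which is the property the question hopes push could exploit.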

From observing the performance of this on a large repository, and the number of objects sent, I suspect that push, send-pack, receive-pack and the associated thin-pack negotiation over the Smart Protocol do something like:

  • Confirms that the commit being pushed (squashed) has no common history with any commit origin currently has.
  • Is oblivious to the fact that squashed points to a tree that is not only in origin, but is the tree for a current HEAD ref.
  • Packs and sends everything.

In this case the trees are identical. If a subsequent change is made in squashed (either an additional commit, or a new squash) that changes a file in a0, 2 trees (/ and a0) would have changed, and the other 1109 would be unchanged. Because the root tree has changed, a next-level search would be required to see whether it is worth searching for further common sub-trees. This might require a heuristic, as without comparing all sub-trees down to the leaves, it is not possible to infer the number of descendant trees in common from the trees at any particular depth.
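
The "2 trees changed, the rest unchanged" arithmetic can be checked on a scaled-down layout. This is a sketch under assumptions: it uses 3 x 2 x 2 directories (21 sub-trees plus the root) rather than the 1110 in the question, and the file names and temp paths are made up for the demonstration. It compares the tree entries reported by git ls-tree before and after adding one file under a0.

```shell
#!/bin/sh
# Sketch: count how many tree objects change when one file under a0 is added.
set -e
LC_ALL=C; export LC_ALL                   # keep sort and comm collation consistent
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Scaled-down stand-in for a[0-9]/b[0-9]/c[0-9]: 3 + 6 + 12 = 21 sub-trees.
for a in 0 1 2; do
  for b in 0 1; do
    for c in 0 1; do
      mkdir -p "a$a/b$b/c$c"
      echo "$a$b$c" > "a$a/b$b/c$c/file"
    done
  done
done
git add .
git commit -q -m "base"
# -r -d lists every sub-tree (mode, type, object ID, path); the root itself is not listed.
git ls-tree -r -d HEAD | sort > "$work/before"

echo change > a0/note
git add .
git commit -q -m "touch a0"
git ls-tree -r -d HEAD | sort > "$work/after"

total_subtrees=$(wc -l < "$work/after" | tr -d ' ')
changed_subtrees=$(comm -13 "$work/before" "$work/after" | wc -l | tr -d ' ')
unchanged_subtrees=$(comm -12 "$work/before" "$work/after" | wc -l | tr -d ' ')
old_root=$(git rev-parse 'HEAD~1^{tree}')
new_root=$(git rev-parse 'HEAD^{tree}')

echo "sub-trees: $total_subtrees, changed: $changed_subtrees, unchanged: $unchanged_subtrees"
if [ "$old_root" != "$new_root" ]; then
    echo "root tree changed as well"
fi
```

Only a0 and the root get new object IDs; every other sub-tree keeps its ID, so a tree-aware comparison would find them by ID alone.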

Of course if there are multiple commits in the nothing-in-common history being pushed, this negotiation would need to be repeated for each commit.

Does it sound reasonable that the Smart API could consider already-held common sub-trees, or at the very least, the root-tree, as it considers each commit? Or should Git already be doing this and there is something wrong with my client or server?

git version 2.8.2


1 Reply


Checking git's source and trying it with git daemon and GIT_TRACE_PACKET confirms that you're correct about what it's doing: git negotiates at the commit level only. If the history isn't shared, git won't detect the shared content.

"Does it sound reasonable that the Smart API could consider already-held common sub-trees, or at the very least, the root-tree, as it considers each commit?"

If the already-held common subtrees can't be identified by already-held common commits, then to identify those subtrees it'd have to send their ids.

The thing is, for anything short of a complete readout, I can construct a plausible-sounding corner case that sends an arbitrarily large amount of redundant data -- but sending every existing subtree id every time to avoid that possibility is clearly a huge loss. Don't forget that round-trip latency is horrendously expensive. So at what point does the added negotiation overhead, taken in aggregate across all fetches, cost more time than it saves? If you're going to argue that some particular alternate method would save time overall, you're going to have to show up with hard data on actual production traffic.

Also remember that you can construct packs yourself. It's not hard: you feed object IDs to git pack-objects pack and drop the output into .git/objects/pack. Congratulations, you've just fetched exactly those objects into that repo.
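
A minimal sketch of that manual approach, assuming two local throwaway repositories (the src and dst names, the file content and the temp directory are hypothetical). Here the object list comes from git rev-list --objects, which is just one way to produce IDs, and the pack is written straight into the destination's .git/objects/pack rather than copied afterwards.

```shell
#!/bin/sh
# Sketch: hand-build a pack in one repo and place it in another.
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
git init -q "$work/src"
git init -q "$work/dst"

cd "$work/src"
mkdir -p a0/b0/c0
echo hello > a0/b0/c0/file
git add .
git commit -q -m "content"
commit=$(git rev-parse HEAD)

# rev-list --objects prints the commit, trees and blobs reachable from HEAD.
# pack-objects reads those IDs on stdin, writes <base>-<hash>.pack and .idx
# for the given base name, and prints <hash> on stdout.
pack_hash=$(git rev-list --objects HEAD |
    git pack-objects -q "$work/dst/.git/objects/pack/pack")

# The destination now holds the objects, although no ref points at them yet.
cd "$work/dst"
imported_type=$(git cat-file -t "$commit")
echo "pack $pack_hash delivered a $imported_type object: $commit"
```

To send only what the other side lacks, the list fed to pack-objects would be narrowed to the missing object IDs instead of everything reachable from HEAD.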

