What's git's heuristic for assigning content modifications to file paths?

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)


Short version:

Short of poring over git's source code, where can I find a full description of the heuristics that git uses to associate chunks of content with specific tracked pathnames?

Detailed version:

In the (Unix) shell demo interaction below, two files, a and b, are committed, then they are modified so as to (effectively) transfer most of a's content to b, and finally the two files are committed once more.

The key thing to look for is that the output of the second git commit ends with the line

rename a => b (99%)

even though no renaming of files (in the usual sense) ever took place (!?!).

Before showing the demo, this brief description may make it easier to follow.

The contents of the files a and b are generated by combining the contents of the three auxiliary files, ../A, ../B, and ../C. Symbolically, the states of a and b could be represented as

../A + ../C -> a
../B        -> b

right before the first commit, and

../A        -> a
../B + ../C -> b

right before the second one.

OK, here's the demo.

First, we display the contents of auxiliary files ../A, ../B, and ../C:

head ../A ../B ../C
# ==> ../A <==
# ...
# 
# ==> ../B <==
# ###
# 
# ==> ../C <==
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================

(Lines beginning with # correspond to output to the terminal; the actual output lines do not have the leading #.)

Next, we create files a and b, display their contents, and commit them:

cat ../A ../C > a
cat ../B      > b
head a b
# ==> a <==
# ...
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# 
# ==> b <==
# ###

git add a b
git commit --allow-empty-message -m ''
# [master (root-commit) 3576df7] 
#  2 files changed, 8 insertions(+)
#  create mode 100644 a
#  create mode 100644 b

Next, we modify files a and b, and display their new contents:

cat ../A      > a
cat ../B ../C > b
head a b
# ==> a <==
# ...
#
# ==> b <==
# ###
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================

Finally, we commit the modified a and b; note the output of git commit:

git add a b
git commit --allow-empty-message -m ''
# [master 25b806f] 
#  2 files changed, 2 insertions(+), 8 deletions(-)
#  rewrite a (99%)
#  rename a => b (99%)

I rationalize this behavior as follows.

As I understand it, git treats directory structure info (such as the pathnames of the files it's tracking) as secondary information (metadata, if you will) to be associated with the primary information it tracks, namely various chunks of content.

Since both the contents and the names (including pathnames) of files may change between commits, git must use heuristics to associate pathnames with chunks of content. But heuristics, by their very nature, are not guaranteed to work 100% of the time. A failure of such heuristics here takes the form of a history that does not faithfully represent what actually happened (e.g. it reports a file renaming even though no file was renamed, in the usual sense).

A further confirmation of this interpretation (namely, that some heuristics are at play) is that, AFAICT, if the size of the transferred chunk is not sufficiently large, the output of git commit will not include the rewrite/rename lines. (I include a demonstration of this case at the end of this post, FWIW.)

My question is this: short of poring over git's source code, where can I find a full description of the heuristics that git uses to associate chunks of content with specific tracked pathnames?

This second demo is identical to the first one in every way, except that the auxiliary file ../C is one line shorter than before.

head ../A ../B ../C
# ==> ../A <==
# ...
# 
# ==> ../B <==
# ###
# 
# ==> ../C <==
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================

cat ../A ../C > a
cat ../B      > b
head a b
# ==> a <==
# ...
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# 
# ==> b <==
# ###

git add .
git commit -a --allow-empty-message -m ''
# [master (root-commit) a06a689] 
#  2 files changed, 7 insertions(+)
#  create mode 100644 a
#  create mode 100644 b

cat ../A      > a
cat ../B ../C > b
head a b
# ==> a <==
# ...
# 
# ==> b <==
# ###
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================

git add .
git commit -a --allow-empty-message -m ''
# [master 87415a1] 
#  2 files changed, 5 insertions(+), 5 deletions(-)


1 Reply

深蓝 · Answer 1 · 2021-10-16T23:34:36+0000

As you noticed, Git performs rename detection using a heuristic, rather than being told that a rename occurred. The git mv command, in fact, simply stages an add on the new file path and a remove of the old file path. Thus, rename detection is performed by comparing the contents of added files to the previously committed contents of deleted files.
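A minimal sketch to check the git mv claim, assuming a throwaway repository in a temporary directory. The file names old.txt and new.txt and the helper make_repo are made up for illustration. It stages the same move twice, once with git mv and once by hand, and compares what git status reports:

```shell
# Hypothetical helper: create a scratch repository with one committed file
# and print its path.
make_repo() {
  dir=$(mktemp -d)
  git init -q "$dir"
  printf 'one\ntwo\nthree\n' > "$dir/old.txt"
  git -C "$dir" add old.txt
  git -C "$dir" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m 'add old.txt'
  echo "$dir"
}

# Variant 1: let git perform the move.
r1=$(make_repo)
git -C "$r1" mv old.txt new.txt
via_git_mv=$(git -C "$r1" status --porcelain)

# Variant 2: move the file by hand, then stage the removal and the addition.
r2=$(make_repo)
mv "$r2/old.txt" "$r2/new.txt"
git -C "$r2" rm -q --cached old.txt
git -C "$r2" add new.txt
via_manual=$(git -C "$r2" status --porcelain)

echo "git mv : $via_git_mv"
echo "manual : $via_manual"
```

Both variants should print the same status line, which is consistent with the rename being inferred from the staged contents rather than recorded by git mv.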

First, candidates are collected. Any new files are possible rename targets and any deleted files are possible rename sources. In addition, rewriting changes are broken up: a file that is more than 50% different from its previous revision is treated as both a possible rename source and a possible rename target.

Next, identical renames are detected. If you rename a file without making any changes, then the file will hash identically. These can be detected just by comparing the hashes in the index, without reading the file contents, so removing them from the candidate list reduces the number of comparisons that need to be performed.
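As a small illustration of the identical-hash case, the following sketch (file names are hypothetical) renames a file without touching its contents, prints the blob ID recorded for the old path in HEAD and for the new path in the index, and asks git diff for exact renames only via -M100%:

```shell
repo=$(mktemp -d)
git init -q "$repo"
printf 'alpha\nbeta\ngamma\n' > "$repo/old.txt"
git -C "$repo" add old.txt
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
  commit -q -m 'add old.txt'

# Rename without modifying the contents.
git -C "$repo" mv old.txt new.txt

# Blob ID of the old path in the last commit and of the new path in the index.
old_blob=$(git -C "$repo" rev-parse HEAD:old.txt)
new_blob=$(git -C "$repo" rev-parse :new.txt)

# -M100% restricts detection to exact (byte-identical) renames.
exact=$(git -C "$repo" diff --cached -M100% --name-status)

echo "old blob: $old_blob"
echo "new blob: $new_blob"
echo "$exact"
```

The two blob IDs come out equal, and the name-status line reports the pair with a similarity of 100 (R100).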

Finally, the similarity comparison is performed. Each line in each candidate file is hashed and collected in a sorted list. Long lines are split into chunks of at most 64 bytes. Whitespace-only lines may be stripped on the assumption that they don't contribute greatly to the similarity matching. The line hashes from each candidate source are compared to the line hashes from each candidate target. If two lists are similar enough (at least 50% by default in git, adjustable with the -M option), they are deemed a rename.
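To see the similarity threshold at work, here is a hedged sketch with made-up file names and contents: a 20-line file is moved and its first 4 lines are replaced, so roughly three quarters of the content survives. The same staged change is then examined with a loose and a strict -M threshold.

```shell
repo=$(mktemp -d)
git init -q "$repo"
for i in $(seq 1 20); do echo "original line number $i"; done > "$repo/old.txt"
git -C "$repo" add old.txt
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
  commit -q -m 'add old.txt'

# New file: 4 replaced lines followed by the last 16 original lines.
{
  for i in 1 2 3 4; do echo "a completely different line $i"; done
  tail -n 16 "$repo/old.txt"
} > "$repo/new.txt"
git -C "$repo" rm -q old.txt
git -C "$repo" add new.txt

# Same staged change, two different similarity thresholds.
loose=$(git -C "$repo" diff --cached -M50% --name-status)
strict=$(git -C "$repo" diff --cached -M95% --name-status)

echo "threshold 50%:"; echo "$loose"
echo "threshold 95%:"; echo "$strict"
```

With the loose threshold the pair should be reported as a rename (an R line with a similarity score); with the strict one the same change shows up as a separate delete and add. Nothing in the repository differs between the two runs, only the threshold passed to the diff.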
