git log - How to improve git log performance?

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am trying to extract git logs from a few repositories like this:

git log --pretty=format:%H%ae%an%at%s --numstat

For larger repositories (like rails/rails) it takes a solid 35+ seconds to generate the log.

Is there a way to improve this performance?

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:05:46+0000

TL;DR, as mentioned at Git Merge 2019:

git config --global core.commitGraph true
git config --global gc.writeCommitGraph true
cd /path/to/repo
git commit-graph write

Actually, the first two config settings are not needed with Git 2.24+ (Q4 2019): they are true by default.
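Before rewriting the graph, it can help to check whether one is already present. The following is a minimal sketch, assuming the standard on-disk locations (a single `objects/info/commit-graph` file, or a split chain under `objects/info/commit-graphs/`); the function name `has_commit_graph` is hypothetical and not part of Git.

```shell
# has_commit_graph GIT_DIR: succeed if GIT_DIR holds a commit-graph file
# or a split commit-graph chain. (Hypothetical helper, not a Git command.)
has_commit_graph() {
  [ -f "$1/objects/info/commit-graph" ] ||
    [ -f "$1/objects/info/commit-graphs/commit-graph-chain" ]
}

# Usage, inside a repository (assumes the main worktree, where the
# object store lives directly under the directory printed by --git-dir):
#   has_commit_graph "$(git rev-parse --git-dir)" || git commit-graph write
```

Running `git commit-graph verify` afterwards checks the written file for consistency.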

As T4cC0re mentions in the comments:

If you are on Git 2.29 or above, you should instead run:
git commit-graph write --reachable --changed-paths
This will pre-compute file paths, so that git log commands that are scoped to files also benefit from this cache.
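For a script that runs against several Git installations, the choice between the two invocations can be automated. This is a sketch only: `version_ge` and `commit_graph_flags` are hypothetical helper names, the 2.29 threshold is taken from the comment quoted above, and the version parsing assumes the first three dotted fields are plain numbers.

```shell
# version_ge A B: succeed when dotted version A is >= B.
# Only the first three fields are compared; missing fields count as 0.
version_ge() {
  IFS=. read -r a1 a2 a3 _rest <<EOF
$1
EOF
  IFS=. read -r b1 b2 b3 _rest <<EOF
$2
EOF
  for pair in "${a1:-0}:${b1:-0}" "${a2:-0}:${b2:-0}" "${a3:-0}:${b3:-0}"; do
    x=${pair%%:*}
    y=${pair##*:}
    [ "$x" -gt "$y" ] && return 0
    [ "$x" -lt "$y" ] && return 1
  done
  return 0
}

# commit_graph_flags VERSION: print the flags to pass to
# `git commit-graph write`; empty output means the plain TL;DR form.
commit_graph_flags() {
  if version_ge "$1" 2.29; then
    echo "--reachable --changed-paths"
  fi
}

# Usage, inside a repository:
#   ver=$(git --version | awk '{print $3}')
#   git commit-graph write $(commit_graph_flags "$ver")
```

The command substitution in the usage line is left unquoted on purpose so that the two flags are passed as separate arguments.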

Git 2.18 (Q2 2018) improves git log performance:

See commit 902f5a2 (24 Mar 2018) by René Scharfe (rscharfe).
See commit 0aaf05b, commit 3d475f4 (22 Mar 2018) by Derrick Stolee (derrickstolee).
See commit 626fd98 (22 Mar 2018) by brian m. carlson (bk2204).
(Merged by Junio C Hamano -- gitster -- in commit 51f813c, 10 Apr 2018)

sha1_name: use bsearch_pack() for abbreviations

When computing abbreviation lengths for an object ID against a single packfile, the method find_abbrev_len_for_pack() currently implements binary search.
This is one of several implementations.
One issue with this implementation is that it ignores the fanout table in the pack-index.

Translate this binary search to use the existing bsearch_pack() method that correctly uses a fanout table.

Due to the use of the fanout table, the abbreviation computation is slightly faster than before.

For a fully-repacked copy of the Linux repo, the following 'git log' commands improved:
* git log --oneline --parents --raw
  Before: 59.2s
  After:  56.9s
  Rel %:  -3.8%

* git log --oneline --parents
  Before: 6.48s
  After:  5.91s
  Rel %: -8.9%

The same Git 2.18 adds a commit graph: it precomputes and stores the information necessary for ancestry traversal in a separate file to optimize graph walking.

See commit 7547b95, commit 3d5df01, commit 049d51a, commit 177722b, commit 4f2542b, commit 1b70dfd, commit 2a2e32b (10 Apr 2018), and commit f237c8b, commit 08fd81c, commit 4ce58ee, commit ae30d7b, commit b84f767, commit cfe8321, commit f2af9f5 (02 Apr 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit b10edb2, 08 May 2018)

commit: integrate commit graph with commit parsing

Teach Git to inspect a commit graph file to supply the contents of a struct commit when calling parse_commit_gently().
This implementation satisfies all post-conditions on the struct commit, including loading parents, the root tree, and the commit date.

If core.commitGraph is false, then do not check graph files.

In test script t5318-commit-graph.sh, add output-matching conditions on read-only graph operations.

By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks.

Here are some performance results for a copy of the Linux repository where 'master' has 678,653 reachable commits and is behind 'origin/master' by 59,929 commits.
| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  8.31s |  0.94s | -88%  |
| branch -vv                       |  1.02s |  0.14s | -86%  |
| rev-list --all                   |  5.89s |  1.07s | -81%  |
| rev-list --all --objects         | 66.15s | 58.45s | -11%  |
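
Because `core.commitGraph=false` makes Git skip the graph file, a similar before/after comparison can be run on your own repository by toggling that setting for a single command with `git -c`. The sketch below assumes a commit-graph file has already been written and that GNU `date` is available (for `%N`); `rel_percent` and `time_walk` are hypothetical helper names. The helper rounds to one decimal, so its output can differ slightly from the figures quoted in the table.

```shell
# rel_percent BEFORE AFTER: relative change in percent, one decimal place.
# Prints "n/a" when BEFORE is zero to avoid a division by zero.
rel_percent() {
  awk -v b="$1" -v a="$2" 'BEGIN {
    if (b == 0) { print "n/a"; exit }
    printf "%.1f\n", (a - b) / b * 100
  }'
}

# time_walk VALUE: wall-clock seconds for a 1000-commit walk with
# core.commitGraph forced to VALUE (true or false) for this one command.
# Assumes GNU date, whose %N prints nanoseconds.
time_walk() {
  start=$(date +%s.%N)
  git -c core.commitGraph="$1" log --oneline --topo-order -1000 >/dev/null
  end=$(date +%s.%N)
  awk -v s="$start" -v e="$end" 'BEGIN { printf "%.2f\n", e - s }'
}

# Usage, inside a repository that already has a commit-graph file:
#   before=$(time_walk false); after=$(time_walk true)
#   echo "before=${before}s after=${after}s rel=$(rel_percent "$before" "$after")%"
```

A single run is noisy (disk cache, background load), so repeating each measurement a few times gives a fairer picture.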

To learn more about the commit graph, see "How does 'git log --graph' work?".

The same Git 2.18 (Q2 2018) adds lazy loading of trees.

The code has been taught to use the duplicated information stored in the commit-graph file to learn the tree object name for a commit to avoid opening and parsing the commit object when it makes sense to do so.

See commit 279ffad (30 Apr 2018) by SZEDER Gábor (szeder).
See commit 7b8a21d, commit 2e27bd7, commit 5bb03de, commit 891435d (06 Apr 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit c89b6e1, 23 May 2018)

commit-graph: lazy-load trees for commits

The commit-graph file provides quick access to commit data, including the OID of the root tree for each commit in the graph. When performing a deep commit-graph walk, we may not need to load most of the trees for these commits.

Delay loading the tree object for a commit loaded from the graph until requested via get_commit_tree().
Do not lazy-load trees for commits not in the graph, since that requires duplicate parsing and the relative performance improvement when trees are not needed is small.

On the Linux repository, performance tests were run for the following command:
git log --graph --oneline -1000

Before: 0.92s
After:  0.66s
Rel %: -28.3%

Git 2.21 (Q1 2019) adds a loose object cache.

See commit 8be88db (07 Jan 2019), and commit 4cea1ce, commit d4e19e5, commit 0000d65 (06 Jan 2019) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit eb8638a, 18 Jan 2019)

object-store: use one oid_array per subdirectory for loose cache

The loose objects cache is filled one subdirectory at a time as needed.
It is stored in an oid_array, which has to be resorted after each add operation.
So when querying a wide range of objects, the partially filled array needs to be resorted up to 255 times, which takes over 100 times longer than sorting once.

Use one oid_array for each subdirectory.
This ensures that entries have to only be sorted a single time. It also avoids eight binary search steps for each cache lookup as a small bonus.
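
The subdirectories in question are the two-hex-digit directories under `.git/objects`, one per possible first byte of an object ID. To see how many loose objects each one holds in your own repository, a rough sketch follows; `loose_object_counts` is a hypothetical helper name, and it simply counts files per directory rather than asking Git.

```shell
# loose_object_counts OBJECTS_DIR: print "<subdir> <count>" for every
# two-hex-digit subdirectory (other entries such as info/ and pack/
# do not match the pattern and are skipped).
loose_object_counts() {
  for dir in "$1"/[0-9a-f][0-9a-f]; do
    [ -d "$dir" ] || continue
    count=$(find "$dir" -type f | wc -l)
    # $((count)) strips the padding some wc implementations add.
    echo "${dir##*/} $((count))"
  done
}

# Usage, from the top of a repository:
#   loose_object_counts .git/objects
```

Running `git gc` packs loose objects, which shrinks these directories.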

The cache is used for collision checks for the log placeholders %h, %t and %p, and we can see the change speeding them up in a repository with ca. 100 objects per subdirectory:
$ git count-objects
  26733 objects, 68808 kilobytes

Test                        HEAD^
