
hash - Git is moving to the new hashing algorithm SHA-256, but why did the Git community settle on SHA-256?

I just learned from this HN post that Git is moving to a new hashing algorithm (from SHA-1 to SHA-256).

I wanted to know what makes SHA-256 the best fit for Git's use case. Are there strong technical reasons, or is it possible that SHA-256's popularity was a strong factor? (I am guessing.) Looking at https://en.wikipedia.org/wiki/Comparison_of_cryptographic_hash_functions I see there are many modern and older alternatives, some of them about as performant (if not more so) and stronger than SHA-256 (for example https://crypto.stackexchange.com/q/26336).


1 Reply

I presented that move in "Why doesn't Git use more modern SHA?" in Aug. 2018.

The reasons were discussed here by Brian M. Carlson:

I've implemented and tested the following algorithms, all of which are 256-bit (in alphabetical order):

  • BLAKE2b (libb2)
  • BLAKE2bp (libb2)
  • KangarooTwelve (imported from the Keccak Code Package)
  • SHA-256 (OpenSSL)
  • SHA-512/256 (OpenSSL)
  • SHA3-256 (OpenSSL)
  • SHAKE128 (OpenSSL)

I also rejected some other candidates.
I couldn't find any reference or implementation of SHA256×16, so I didn't implement it.
I didn't consider SHAKE256 because it is nearly identical to SHA3-256 in almost all characteristics (including performance).
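
For a rough feel of these candidates, here is a minimal sketch using Python's hashlib as a stand-in for the libraries named above (my assumption; the actual tests used libb2, OpenSSL, and the Keccak Code Package directly). KangarooTwelve and BLAKE2bp have no hashlib binding, so they are omitted:

    import hashlib

    data = b"hello, git"

    digests = {
        "SHA-256":     hashlib.sha256(data).hexdigest(),
        "SHA3-256":    hashlib.sha3_256(data).hexdigest(),
        "BLAKE2b-256": hashlib.blake2b(data, digest_size=32).hexdigest(),
        # SHAKE128 is an extendable-output function: request 32 bytes.
        "SHAKE128":    hashlib.shake_128(data).hexdigest(32),
    }
    # SHA-512/256 is only exposed when the linked OpenSSL provides it.
    if "sha512_256" in hashlib.algorithms_available:
        digests["SHA-512/256"] = hashlib.new("sha512_256", data).hexdigest()

    for name, hexdigest in digests.items():
        print(f"{name:12} {hexdigest}")

All of these produce a 256-bit digest, which is the property the candidates were selected for.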

SHA-256 and SHA-512/256

These are the 32-bit and 64-bit members of the SHA-2 family with a 256-bit output: SHA-256 operates on 32-bit words, while SHA-512/256 operates on 64-bit words.

I noted the following benefits:

  • Both algorithms are well known and heavily analyzed.
  • Both algorithms provide 256-bit preimage resistance.
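
As a side note on why SHA-512/256 counts as its own algorithm: it is not simply SHA-512 truncated to 256 bits; FIPS 180-4 gives it different initialization vectors. A small check in Python (sha512_256 availability depends on the linked OpenSSL, hence the guard):

    import hashlib

    data = b"abc"
    if "sha512_256" in hashlib.algorithms_available:
        proper    = hashlib.new("sha512_256", data).hexdigest()
        truncated = hashlib.sha512(data).hexdigest()[:64]
        # Different initialization vectors, so these do NOT match.
        assert proper != truncated
        print(proper, truncated, sep="\n")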

Summary

The algorithms with the greatest implementation availability are SHA-256, SHA3-256, BLAKE2b, and SHAKE128.

In terms of command-line availability, BLAKE2b, SHA-256, SHA-512/256, and SHA3-256 should be available in the near future on a reasonably small Debian, Ubuntu, or Fedora install.

As far as security, the most conservative choices appear to be SHA-256, SHA-512/256, and SHA3-256.

The performance winners are BLAKE2b unaccelerated and SHA-256 accelerated.
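
Those performance claims are easy to spot-check. A rough benchmark sketch in Python (numbers will vary with your CPU and with whether your OpenSSL build uses SHA-NI instructions):

    import hashlib
    import timeit

    payload = b"\x00" * (1 << 20)  # hash 1 MiB per call

    def bench(name, fn, repeats=50):
        seconds = timeit.timeit(lambda: fn(payload), number=repeats)
        print(f"{name:10} {repeats / seconds:8.1f} MiB/s")

    bench("SHA-256",  lambda d: hashlib.sha256(d).digest())
    bench("BLAKE2b",  lambda d: hashlib.blake2b(d, digest_size=32).digest())
    bench("SHA3-256", lambda d: hashlib.sha3_256(d).digest())

On a CPU without SHA extensions, BLAKE2b typically comes out ahead; with hardware acceleration, SHA-256 usually wins, matching the summary above.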

The suggested conclusion was based on:

Popularity

Other things being equal we should be biased towards whatever's in the widest use & recommended for new projects.

Hardware acceleration

The only widely deployed HW acceleration is for SHA-1 and for SHA-256 (from the SHA-2 family), but notably nothing from the newer SHA-3 family (released in 2015).

Age

Similar to "popularity" it seems better to bias things towards a hash that's been out there for a while, i.e. it would be too early to pick SHA-3.

The hash transitioning plan, once implemented, also makes it easier to switch to something else in the future. So we shouldn't be in a rush to pick some newer hash out of fear that we'll need to keep it forever; we can always do another transition in another 10-15 years.

Result: commit 0ed8d8d, Git v2.19.0-rc0, Aug 4, 2018.

SHA-256 has a number of advantages:

  • It has been around for a while, is widely used, and is supported by just about every single crypto library (OpenSSL, mbedTLS, CryptoNG, SecureTransport, etc).

  • When you compare against SHA1DC, most vectorized SHA-256 implementations are indeed faster, even without acceleration.

  • If we're doing signatures with OpenPGP (or even, I suppose, CMS), we're going to be using SHA-2 anyway, so it doesn't make sense to have our security depend on two separate algorithms, either of which alone could break the security, when we could just depend on one.

So SHA-256 it is.

The idea remains: any notion of SHA-1 is being removed from the Git codebase and replaced by a generic "hash" variable.
Tomorrow, that hash will be SHA-2, but the code will support other hashes in the future.
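
In Git's C code that generic variable is the struct git_hash_algo abstraction; as a loose, hypothetical Python analogue of the idea (not Git's actual API):

    import hashlib
    from dataclasses import dataclass
    from typing import Callable

    @dataclass(frozen=True)
    class HashAlgo:
        name: str
        raw_size: int   # digest size in bytes
        hex_size: int   # digest size in hex characters
        new: Callable   # constructor, e.g. hashlib.sha1

    SHA1   = HashAlgo("sha1",   20, 40, hashlib.sha1)
    SHA256 = HashAlgo("sha256", 32, 64, hashlib.sha256)

    def object_id(algo: HashAlgo, payload: bytes) -> str:
        # Callers never name a specific hash; swapping algorithms is
        # a one-line change at the call site.
        return algo.new(payload).hexdigest()

    print(object_id(SHA1,   b"same payload"))  # 40 hex chars
    print(object_id(SHA256, b"same payload"))  # 64 hex chars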

As Linus Torvalds delicately puts it (emphasis mine):

Honestly, the number of particles in the observable universe is on the order of 2**256. It's a really really big number.

Don't make the code base more complex than it needs to be.
Make an informed technical decision, and say "256 bits is a lot".

The difference between engineering and theory is that engineering makes trade-offs.
Good software is well engineered, not theorized.

Also, I would suggest that git default to "abbrev-commit=40", so that nobody actually sees the new bits by default.
So the perl scripts etc that use "[0-9a-f]{40}" as a hash pattern would just silently continue to work.

Because backwards compatibility is important (*)

(*) And 2**160 is still a big big number, and hasn't really been a practical problem, and SHA1DC is likely a good hash for the next decade or longer.

(SHA1DC, for "Detect Collision", is a hardened SHA-1 implementation that detects collision attempts; it was discussed in early 2017, after the SHAttered collision attack (shattered.io): see commit 28dc98e, Git v2.13.0-rc0, March 2017, from Jeff King, and "Hash collision in git".)
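
For scale, the numbers Linus is gesturing at, plus the generic birthday bound (a collision in an n-bit hash costs about 2**(n/2) work):

    print(2**160)  # SHA-1's output space, ~1.5e48
    print(2**256)  # SHA-256's output space, ~1.2e77

    # Brute-force collision cost (birthday bound):
    print(f"SHA-1:   ~2**{160 // 2} operations")
    print(f"SHA-256: ~2**{256 // 2} operations")

SHAttered beat SHA-1's 2**80 birthday bound by a wide margin (the attack cost roughly 2**63 SHA-1 computations); no comparable shortcut is known for SHA-256.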


See more in Documentation/technical/hash-function-transition.txt

The transition to SHA-256 can be done one local repository at a time.

a. Requiring no action by any other party.
b. A SHA-256 repository can communicate with SHA-1 Git servers (push/fetch).
c. Users can use SHA-1 and SHA-256 identifiers for objects interchangeably (see "Object names on the command line", below).
d. New signed objects make use of a stronger hash function than SHA-1 for their security guarantees.
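
Point (c) is easy to illustrate: the same content gets a different, differently sized object name under each algorithm. A minimal sketch of how Git names a blob ("blob <size>\0" followed by the content, then hashed):

    import hashlib

    def blob_object_id(content: bytes, algo: str) -> str:
        # Git hashes the header "blob <size>\0" plus the raw content.
        data = b"blob %d\x00" % len(content) + content
        return hashlib.new(algo, data).hexdigest()

    content = b"hello world\n"
    print(blob_object_id(content, "sha1"))    # 40 hex chars, SHA-1 repos
    print(blob_object_id(content, "sha256"))  # 64 hex chars, SHA-256 repos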


That transition is facilitated with Git 2.27 (Q2 2020) and its git fast-import --rewrite-submodules-from/to=<name>:<file> options.

See commit 1bdca81, commit d9db599, commit 11d8ef3, commit abe0cc5, commit ddddf8d, commit 42d4e1d, commit e02a714, commit efa7ae3, commit 3c9331a, commit 8b8f718, commit cfe3917, commit bf154a8, commit 8dca7f3, commit 6946e52, commit 8bd5a29, commit 1f5f8f3, commit 192b517, commit 9412759, commit 61e2a70, commit dadacf1, commit 768e30e, commit 2078991 (22 Feb 2020) by brian m. carlson (bk2204).
(Merged by Junio C Hamano -- gitster -- in commit f8cb64e, 27 Mar 2020)

fast-import: add options for rewriting submodules

Signed-off-by: brian m. carlson

When converting a repository using submodules from one hash algorithm to another, it is necessary to rewrite the submodules from the old algorithm to the new algorithm, since only references to submodules, not their contents, are written to the fast-export stream.
Without rewriting the submodules, fast-import fails with an "Invalid dataref" error when encountering a submodule in another algorithm.

Add a pair of options, --rewrite-submodules-from and --rewrite-submodules-to, that take a list of marks produced by fast-export and fast-import, respectively, when processing the submodule.
Use these marks to map the submodule commits from the old algorithm to the new algorithm.

We read marks into two corresponding struct mark_set objects and then perform a mapping from the old to the new using a hash table. This lets us reuse the same mark parsing code that is used elsewhere and allows us to efficiently read and match marks based on their ID, since mark files need not be sorted.

Note that because we're using a khash table for the object IDs, and this table copies values of struct object_id instead of taking references to them, it's necessary to zero the struct object_id values that we use to insert and look up in the table. Otherwise, we would end up with SHA-1 values that don't match because of whatever stack garbage might be left in the unused area.
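
To make the mark mapping concrete, here is an illustrative Python sketch (not Git's C implementation) of joining two mark files on mark ID; mark files contain lines like ":1 <oid>" and need not be sorted:

    def parse_marks(path: str) -> dict[int, str]:
        # Each line pairs a mark (":<n>") with an object ID.
        marks = {}
        with open(path) as fh:
            for line in fh:
                mark, oid = line.split()
                marks[int(mark.lstrip(":"))] = oid
        return marks

    def build_oid_map(from_file: str, to_file: str) -> dict[str, str]:
        old = parse_marks(from_file)  # mark -> old-algorithm (SHA-1) OID
        new = parse_marks(to_file)    # mark -> new-algorithm (SHA-256) OID
        return {old[m]: new[m] for m in old if m in new}

    # Hypothetical file names, for illustration only:
    # oid_map = build_oid_map("sub.sha1-marks", "sub.sha256-marks")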

The git fast-import documentation now includes:

Submodule Rewriting

--rewrite-submodules-from=<name>:<file>
--rewrite-submodules-to=<name>:<file>

Rewrite the object IDs for the submodule specified by <name> from the values used in the "from" <file> to those used in the "to" <file>. The "from" marks should have been created by git fast-export, and the "to" marks should have been created by git fast-import when importing that same submodule.
