
command line interface - How to split a CSV or JSON file for optimal Snowflake ingestion?

Snowflake recommends splitting large files before ingesting:

To optimize the number of parallel operations for a load, we recommend aiming to produce data files roughly 100-250 MB (or larger) in size compressed. https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html

What's the best way to split my large files, and compress them?



1 Reply


This is the best command line sequence I could come up with:

cat bigfile.json  | split -C 1000000000 -d -a4 - output_prefix --filter='gzip > $FILE.gz'

Replace the first step with anything that writes JSON or CSV to stdout, depending on the source file: for a plain file, cat will do; for a .gz, gzcat; for a .zst, unzstd --long=31 -c file.zst; and so on. A few variants are sketched below.
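For example (a minimal sketch; the file names are illustrative, and gzcat is typically named zcat on Linux), any of these can serve as the first step, each piped into the same split command:

cat bigfile.json                        # plain text file
gzcat bigfile.json.gz                   # gzip-compressed (zcat on most Linux systems)
unzstd --long=31 -c bigfile.json.zst    # zstandard-compressed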

Then split:

  • -C 1000000000 caps each output file at roughly 1 GB, breaking only at line boundaries so rows stay intact.
  • -d gives a numeric suffix to each file (I prefer this to the default letters).
  • -a4 makes the numeric suffix 4 digits long (instead of only 2).
  • - reads the output of the previous cat in the pipeline.
  • output_prefix is the base name for all output files.
  • --filter='gzip > $FILE.gz' compresses each ~1 GB chunk on the fly with gzip, so each final file ends up around 100 MB (example output names follow this list).
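With those flags, split names its chunks output_prefix0000, output_prefix0001, and so on, and the filter appends .gz, so an input of a few gigabytes would produce files like (illustrative listing):

output_prefix0000.gz
output_prefix0001.gz
output_prefix0002.gz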

Snowflake can ingest .gz files directly, so this final compression step also makes it cheaper to move the files over the network. A sketch of the load commands follows.
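As a rough sketch of the load side (the stage @json_stage, table my_table, and local path are placeholders I'm assuming, not part of the original answer), the gzipped parts could then be staged and copied in from the SnowSQL client:

-- upload the pre-compressed chunks to an internal stage (no re-compression needed)
PUT file:///tmp/output_prefix*.gz @json_stage AUTO_COMPRESS=FALSE;

-- load them; COMPRESSION defaults to AUTO, so gzip is detected from the .gz extension
-- (my_table is assumed to have a single VARIANT column for the JSON case)
COPY INTO my_table FROM @json_stage FILE_FORMAT = (TYPE = 'JSON');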

