This is the best command line sequence I could come up with:
```bash
cat bigfile.json | split -C 1000000000 -d -a4 - output_prefix --filter='gzip > $FILE.gz'
```
Replace the first step with anything that outputs JSON or CSV to stdout, depending on the source file: if it's a plain file, `cat` will do; if it's a `.gz`, then `gzcat`; if it's a `.zst`, then `unzstd --long=31 -c file.zst`; etc.
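For example, if the source is a zstd-compressed JSON file, the same pipeline becomes (the file name here is just a placeholder):

```bash
# Same split, but reading from a zstd-compressed source
# (bigfile.json.zst is a placeholder name)
unzstd --long=31 -c bigfile.json.zst \
  | split -C 1000000000 -d -a4 - output_prefix --filter='gzip > $FILE.gz'
```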
Then `split`:

- `-C 1000000000` creates files of up to 1GB, but splits only at line ends to preserve row integrity.
- `-d` gives a numeric suffix to each file (I prefer this to the default letters).
- `-a4` makes the numeric suffix 4 characters long (instead of only 2).
- `-` reads the output of the previous `cat` in the pipeline.
- `output_prefix` is the base name for all output files.
- `--filter='gzip > $FILE.gz'` compresses the 1GB files on the fly with gzip, so each final file ends up around 100MB.
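With `-d -a4` the pieces come out numbered `output_prefix0000.gz`, `output_prefix0001.gz`, and so on. As a quick sanity check (a sketch, assuming the source was the plain `bigfile.json` from above), the total row count of the pieces should match the original:

```bash
# List the generated pieces
ls output_prefix*.gz

# Row counts should match, since split only cuts at line ends
gzcat output_prefix*.gz | wc -l   # use zcat on Linux
wc -l < bigfile.json
```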
Snowflake can ingest `.gz` files, so this final compression step will help us move the files around the network.
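For context, here is a minimal sketch of the load into Snowflake, assuming a configured SnowSQL connection, an internal stage `@my_stage`, and a table `my_table` with a single VARIANT column (none of these names come from the original setup):

```bash
# Upload the gzipped pieces to an internal stage
snowsql -q "PUT file://$(pwd)/output_prefix*.gz @my_stage"

# Load them into a table with one VARIANT column
snowsql -q "COPY INTO my_table FROM @my_stage FILE_FORMAT = (TYPE = 'JSON')"
```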