An approach I use with my large XML files - 130GB or bigger - is to load the whole file into a temporary unlogged table and, from there, extract the content I want. Unlogged tables are not crash-safe, but they are much faster than logged ones, which suits the purpose of a temporary table perfectly ;-)
Considering the following table ..
CREATE UNLOGGED TABLE tmp (raw TEXT);
.. you can import this 1GB file using a single psql line from your console (unix):
$ cat 1gb_file.txt | psql -d db -c "COPY tmp FROM STDIN"
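If the file sits on the same machine as your psql client, the \copy meta-command does the same job without the cat (same placeholder file name as above):

$ psql -d db -c "\copy tmp FROM '1gb_file.txt'"

Keep in mind that COPY's default text format treats tab and backslash characters specially, so lines containing them may need escaping or a switch to CSV format with uncommon delimiter/quote characters.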
After that, all you need to do is apply your logic to query and extract the information you want. Depending on the size of your table, you can create a second table from a SELECT, e.g.:
CREATE TABLE t AS
SELECT
    trim((string_to_array(raw, ','))[1]) AS operation,
    trim((string_to_array(raw, ','))[2])::timestamp AS tmst,
    trim((string_to_array(raw, ','))[3]) AS txt
FROM tmp
WHERE raw LIKE '%DEBUG%' AND
      raw LIKE '%ghtorrent-40%' AND
      raw LIKE '%Repo EFForg/https-everywhere exists%';
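Side note: the query above splits every line three times. If that turns out to be slow on a huge tmp table, you can split each line only once by moving the string_to_array call into the FROM clause (just a sketch of the same query, restructured):

CREATE TABLE t AS
SELECT
    trim(a.parts[1]) AS operation,
    trim(a.parts[2])::timestamp AS tmst,
    trim(a.parts[3]) AS txt
FROM tmp
CROSS JOIN LATERAL string_to_array(raw, ',') AS a(parts)
WHERE raw LIKE '%DEBUG%' AND
      raw LIKE '%ghtorrent-40%' AND
      raw LIKE '%Repo EFForg/https-everywhere exists%';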
Adjust the string_to_array function and the WHERE clause to your logic! Optionally, you can replace these multiple LIKE operations with a single SIMILAR TO.
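For example, the WHERE clause could become (a sketch; note that unlike the three ANDed LIKEs, this assumes the substrings appear in exactly this order):

WHERE raw SIMILAR TO '%DEBUG%ghtorrent-40%Repo EFForg/https-everywhere exists%'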
.. and your data would be ready to be played with:
SELECT * FROM t;
 operation |        tmst         |                                txt
-----------+---------------------+-------------------------------------------------------------------
 DEBUG     | 2017-03-23 10:02:27 | ghtorrent-40 -- ghtorrent.rb:Repo EFForg/https-everywhere exists
(1 row)
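From here it is a normal table, so any follow-up query works; for instance (purely illustrative, assuming you want to count those lines per day):

SELECT tmst::date AS day, count(*) AS hits
FROM t
GROUP BY 1
ORDER BY 1;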
Once your data is extracted, you can DROP TABLE tmp; to free some disk space ;)
Further reading in the PostgreSQL docs: COPY, array functions and pattern matching.