0 votes
140 views
in Technique by (71.8m points)

Minimising Sqlite db size when indexing 2M files with long paths

I need to index around 2 million files on each of several Linux systems, and I am worried that the naive way to do this might create an unnecessarily large database file because of the longish path names (i.e. the text of an average path, perhaps /home/user/Android/gradle/blah/blah/blah/filename, times 2 million).

Assuming I put the filename in a column of its own and the path in a separate column, so that identical text (i.e. the full path) is repeated frequently in the table, will SQLite automatically store the full text once and just use a pointer to it in each row? If not, is there a way I can instruct it to do this behind the scenes without having to code it myself? Coding it will be a PITA (I'm quite rusty with SQL at the moment), and if I put the path text in a separate table I'm wondering whether it will slow everything down at run time too.

I will probably use Perl for this. The intention is to find replicated data across machines with slow interconnections, so the plan is: index the files, make hashes of them all, transfer the db files, and test against the other machines.
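
For concreteness, here is a rough sketch of the naive single-table layout and duplicate check I have in mind (column names are just placeholders, nothing is settled yet):

CREATE TABLE files (
    path TEXT NOT NULL,  -- full directory path, repeated for every file in the same directory
    name TEXT NOT NULL,  -- file name only
    hash TEXT            -- e.g. a SHA-256 of the file contents, filled in later
);

-- After pulling in the db files from the other machines, something like this
-- should flag contents that exist in more than one place:
SELECT hash, COUNT(*) AS copies
FROM files
GROUP BY hash
HAVING COUNT(*) > 1;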

TIA, Pete


1 Reply

0 votes
by (71.8m points)

Here is a very basic schema and corresponding query for storing file paths. Similar queries could be crafted to, e.g., get all the files in a particular folder or get a relative path of a file (a small example of the folder query is included after the main query below). Other metadata can be added to either table or to an auxiliary table. I am not claiming efficiency for any particular purpose other than that this schema avoids storing redundant path strings.

First the schema and sample data. Notice the recursive foreign-key relationship on the folders table, referring to itself:

BEGIN TRANSACTION;

DROP TABLE IF EXISTS files;
DROP TABLE IF EXISTS folders;

CREATE TEMP TABLE folders (
       id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, 
       parent_id INTEGER,
       name TEXT NOT NULL,
       --! UNIQUE(parent_id, name), --* Changed to unique index definition
       FOREIGN KEY (parent_id) REFERENCES folders (id)
);
--* Multiple identical null values were allowed with a unique constraint defined on the table.
--* Instead define a unique index that explicitly converts null values to an effective id value
CREATE UNIQUE INDEX folders_unique_root ON folders
       (coalesce(parent_id, -1), name);

CREATE TEMP TABLE files (
       id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, 
       folder_id INTEGER,
       name TEXT NOT NULL,
       UNIQUE(folder_id, name),
       FOREIGN KEY (folder_id) REFERENCES folders (id)
);

INSERT INTO folders
       (id, parent_id, name)
VALUES
       --* Although id is AUTOINCREMENT, explicitly assign here to make hierarchical relationships obvious
       (1, null, 'home'),
       (2,    1, 'dev'),
       (3,    2, 'SO'),
       (4,    1, 'docs'),
       (5,    1, 'sys');

INSERT INTO files
       (folder_id, name)
VALUES
       (1, 'readme.txt'),
       (3, 'recursive.sql'),
       (3, 'foobar'),
       (4, 'homework.txt');

COMMIT;
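
A small aside if you rely on those FOREIGN KEY constraints for integrity: SQLite does not enforce foreign keys unless the pragma below has been enabled on the connection, so issue it (outside of any transaction) before running the block above:

PRAGMA foreign_keys = ON; -- off by default; must be set per connection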

Now a query for recovering the full paths of the files. I'll add a few comments, but for the details of each piece I defer to the official docs for the WITH clause and window functions:

WITH RECURSIVE
     file_path_levels AS (
         --* Prime the recursion with file rows
         SELECT id AS file_id,
                1 AS level,
                folder_id,
                name
         FROM files
         --WHERE name == 'foobar' --For selecting particular file
         UNION ALL
         --* Continue recursion by joining to next folder in relationship
         SELECT file_id,
                level + 1,
                folders.parent_id,
                folders.name
         FROM file_path_levels  --* Refer to own name to cause recursion
              LEFT JOIN folders --* Get row of parent folder
                   ON file_path_levels.folder_id = folders.id
         WHERE folders.id is not null --Necessary condition to terminate recursion
     ),
     file_path_steps AS (
         --* Now concatenate folders into path string
         SELECT file_id,
                level,
                '/' || group_concat(name, '/') 
                    OVER (PARTITION BY file_id 
                          ORDER BY level DESC)
                    AS full_path
         FROM file_path_levels
     )
SELECT file_id, full_path
FROM file_path_steps
WHERE level == 1;

This produces

file_id   full_path
1         /home/readme.txt
2         /home/dev/SO/recursive.sql
3         /home/dev/SO/foobar
4         /home/docs/homework.txt
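
As mentioned at the top of this answer, getting all the files in a particular folder needs no recursion at all. A minimal sketch against the same sample data (looking up the folder by name here; 'SO' is just the sample folder):

-- List the files directly inside the folder named 'SO'
SELECT files.id, files.name
FROM files
     JOIN folders ON files.folder_id = folders.id
WHERE folders.name = 'SO';

With the sample data above this returns recursive.sql and foobar.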

It's possible to inspect the intermediate results by replacing the final SELECT with a query against one of the named CTEs, optionally removing WHERE conditions to see what each step produces. This helps in learning what each part does. For example, try this after the WITH clause:

SELECT *
FROM file_path_levels
ORDER BY file_id, level;
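
Finally, a note on populating the schema from a script (the question mentions Perl): the application would split each path into components and upsert one folder level at a time, carrying the returned id forward as the next parent_id. A rough, untested sketch of the statements the script would run (the ? placeholders are bound by the application):

-- Run once per path component, outermost folder first.
-- INSERT OR IGNORE relies on the unique index defined above.
INSERT OR IGNORE INTO folders (parent_id, name) VALUES (?, ?);

-- Fetch the id of that component to use as parent_id for the next level.
SELECT id FROM folders
WHERE coalesce(parent_id, -1) = coalesce(?, -1) AND name = ?;

-- After the last component, attach the file itself.
INSERT OR IGNORE INTO files (folder_id, name) VALUES (?, ?);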
