Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


maven - Import billions of nodes and relationships to Neo4j using Batch Import on Windows

I want to insert a few billion nodes and relationships into Neo4j. Using "LOAD CSV" gets cancelled by the browser (Chrome) after 30 minutes because working memory is overloaded, even though I have 16 GB of RAM.

Large datasets can apparently be imported into Neo4j using the Batch Importer (Documentation & Download, Explanation for Linux).

To simply use it (no source/git/maven required):

1. download 2.2 zip
2. unzip
3. run import.sh test.db nodes.csv rels.csv (on Windows: import.bat)
4. after the import, point your /path/to/neo4j/conf/neo4j-server.properties to this test.db directory, or copy the data over to your server: cp -r test.db/* /path/to/neo4j/data/graph.db/

You provide one tab-separated CSV file for nodes and one for relationships (optionally more for indexes).

I am struggling to use the plugin on Windows. In the Linux video by Rik Van Bruggen (link above), he mentions "installation of the batch importer".

I unzipped the "download 2.2 zip" file. My CSVs are in another folder. How do I use the "import.bat" command mentioned in the documentation on Windows? In cmd, the command can't be found...



1 Reply


Before using the tool on gigantic datasets, I can suggest a few things I just learned while importing millions of nodes in a few minutes (Neo4j Community Edition for Windows).

Regarding Neo4j import tips:

  • Don't use the web interface to import datasets this big; memory overload is inevitable.

  • Instead, use a programming language to interact with Neo4j (I recently used the official Python module, which is simple to learn, but you can do the same with good old Java).

  • Before running LOAD CSV, remember to add the USING PERIODIC COMMIT instruction so that large datasets are committed in batches instead of in one huge transaction.

  • Before importing relationships from CSV, remember to run CREATE CONSTRAINT ON <...> ASSERT <...> IS UNIQUE for the key properties of your labels. The constraint also creates an index on that property, which has a huge impact on relationship creation.

  • When creating relationships, use MATCH(...) to look up the existing nodes instead of CREATE(...). This avoids duplicates.
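
The import tips above can be sketched in Python roughly as follows. This is a minimal illustration, not the setup I actually ran: the label Person, the key property id, the relationship type KNOWS, the CSV file names and the column names from_id/to_id are all hypothetical, and the batch size of 10000 is just a starting guess. The Cypher uses the Neo4j 3.x syntax discussed in this answer; newer versions may differ. MERGE is used for the relationship itself as one way to keep relationships from being duplicated.

```python
def constraint_query(label, key):
    """Uniqueness constraint on the key property (this also creates an index)."""
    return f"CREATE CONSTRAINT ON (n:{label}) ASSERT n.{key} IS UNIQUE"


def load_nodes_query(csv_url, label, key, batch_size=10000):
    """LOAD CSV for nodes, committed every batch_size rows."""
    return (
        f"USING PERIODIC COMMIT {batch_size} "
        f"LOAD CSV WITH HEADERS FROM '{csv_url}' AS row "
        f"CREATE (:{label} {{{key}: row.{key}}})"
    )


def load_rels_query(csv_url, label, key, rel_type, batch_size=10000):
    """LOAD CSV for relationships: MATCH the existing nodes, never re-create them."""
    return (
        f"USING PERIODIC COMMIT {batch_size} "
        f"LOAD CSV WITH HEADERS FROM '{csv_url}' AS row "
        f"MATCH (a:{label} {{{key}: row.from_id}}) "  # hypothetical column name
        f"MATCH (b:{label} {{{key}: row.to_id}}) "    # hypothetical column name
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


def run_import(uri, user, password, queries):
    """Run each query in its own auto-commit transaction."""
    # Imported here so the query builders above work without the driver installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            for query in queries:
                # USING PERIODIC COMMIT only works in auto-commit transactions,
                # which is what session.run() uses.
                session.run(query).consume()
    finally:
        driver.close()


if __name__ == "__main__":
    # Order matters: constraint first, then nodes, then relationships.
    queries = [
        constraint_query("Person", "id"),
        load_nodes_query("file:///nodes.csv", "Person", "id"),
        load_rels_query("file:///rels.csv", "Person", "id", "KNOWS"),
    ]
    for query in queries:
        print(query)
    # Assumed connection details; uncomment once they match your server.
    # run_import("bolt://localhost:7687", "neo4j", "password", queries)
```

Run as written, the script only prints the three statements, so you can check them before pointing run_import at a real database.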

Regarding Neo4j performance:

  • First of all, read the official Neo4j page on performance tuning: https://neo4j.com/docs/operations-manual/current/performance/

  • Set a proper memory configuration for your Windows machine: if necessary, manually configure the dbms.memory.pagecache.size parameter (in the neo4j.conf file).

  • Remember: the Java Virtual Machine is not a black box; you can tune its performance specifically for your application by editing the neo4j-community.vmoptions file. For example, you can set the maximum heap size for the JVM (-Xmx parameter), and you can set the -XX:+UseG1GC parameter to use the G1 Garbage Collector (high performance, suggested by Oracle for production environments) (https://docs.oracle.com/cd/E40972_01/doc.70/e40973/cnf_jvmgc.htm#autoId0)

Here are the custom lines from my neo4j.conf (just for reference; beware, this may be the wrong setup for your application):

dbms.memory.pagecache.size=3g
dbms.jvm.additional=-XX:+UseG1GC
dbms.jvm.additional=-XX:-OmitStackTraceInFastThrow
dbms.jvm.additional=-XX:+AlwaysPreTouch
dbms.jvm.additional=-XX:+UnlockExperimentalVMOptions
dbms.jvm.additional=-XX:+TrustFinalNonStaticFields
dbms.jvm.additional=-XX:+DisableExplicitGC

And my neo4j-community.vmoptions custom lines (again, just for reference):

-Xmx1024m
-XX:+UseG1GC
-XX:-OmitStackTraceInFastThrow
-XX:+AlwaysPreTouch
-XX:+UnlockExperimentalVMOptions
-XX:+TrustFinalNonStaticFields
-XX:+DisableExplicitGC
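
As a rough sanity check on the two listings above, the page cache (which lives outside the JVM heap) and the maximum heap can be added up and compared with the machine's RAM. The sketch below is only arithmetic: the helper names parse_size and memory_budget are made up for illustration, and the 8g total is the RAM of the test notebook described in this answer.

```python
import re

# Binary multipliers for the k/m/g size suffixes used in the listings.
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Convert a size such as '3g' or '1024m' into bytes."""
    match = re.fullmatch(r"(\d+)([kmg]?)", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS.get(unit, 1)


def memory_budget(page_cache, max_heap, total_ram):
    """Return (bytes claimed by Neo4j, bytes left for the OS and everything else)."""
    claimed = parse_size(page_cache) + parse_size(max_heap)
    return claimed, parse_size(total_ram) - claimed


if __name__ == "__main__":
    # dbms.memory.pagecache.size=3g, -Xmx1024m, 8 GB notebook
    claimed, left = memory_budget("3g", "1024m", "8g")
    gib = 1024**3
    print(f"Neo4j: {claimed / gib:.1f} GiB, left for the OS: {left / gib:.1f} GiB")
    # prints "Neo4j: 4.0 GiB, left for the OS: 4.0 GiB"
```

If the second number comes out small or negative, the page cache or heap is probably set too high for the machine.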

My test machine is a weak notebook with a dual-core Core i3, 8 GB of RAM, Windows 10 and Neo4j 3.2.1 Community Edition.

I can import 7 million nodes in less than 3 minutes and 3.5 million relationships in less than 5 minutes (no recursive relationships).

On a more capable machine, with a specifically crafted setup, Neo4j can do WAY better than this. Hope it helps.

