In a graph with 165k nodes and 266k relationships I'd like to run the following Cypher query:
START n=node:NodeIds('id:firstId'), t=node:NodeIds('id:secondId')
MATCH (n)-[:RELATIONSHIP_TYPE*1..3]-(t)
RETURN count(*)
where firstId
and secondId
is a valid entry for the NodeIds Lucene index.
The query takes about 4 seconds to execute from the Neo4j console and I'd like to understand why is it so slow and how it could be made faster.
The index lookup from this takes about 40ms (i.e. a query just returning the two nodes takes that much) so that can't be the issue.
I run Neo4j on a Windows 8 machine with the default settings by starting from Neo4j.bat. I don't think hardware can be an issue as the query only causes a short 10% CPU spike and a barely visible spike in disk usage.
BTW the first node has a degree of 40, the second 2 and the result is 1.
Any help would be appreciated.
Edit 1, memory config:
I was running Neo4j with OOTB config by starting from Neo4j.bat with the following defaults regarding memory (if I'm not mistaken and those are the only memory-related configs):
wrapper.java.initmemory=16
wrapper.java.maxmemory=64
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=50M
neostore.propertystore.db.mapped_memory=90M
neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
Shooting one into the dark I raised these values to the following:
wrapper.java.initmemory=128
wrapper.java.maxmemory=1024
neostore.nodestore.db.mapped_memory=225M
neostore.relationshipstore.db.mapped_memory=250M
neostore.propertystore.db.mapped_memory=290M
neostore.propertystore.db.strings.mapped_memory=330M
neostore.propertystore.db.arrays.mapped_memory=330M
This indeed increased Neo4j memory usage (I mean the memory usage of the java.exe instance running Neo4j) without a good increase in performance (the query takes roughly the same time, with probably a 2-300ms increase occasionally). There are GBs of RAM free so there's no hardware constraint.
Edit 2, profiler data:
Running the profiler for the query in question yields the following results:
neo4j-sh (0)$ profile START n=node:NodeIds('id:4000'), t=node:NodeIds('id:64599') MATCH path = (n)-[:ASSOCIATIVY_CONNECTION*1..3]-(t) RETURN count(*);
==> +----------+
==> | count(*) |
==> +----------+
==> | 1 |
==> +----------+
==> 1 row
==> 0 ms
==>
==> ColumnFilter(symKeys=[" INTERNAL_AGGREGATE-939275295"], returnItemNames=["count(*)"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=[], aggregates=["( INTERNAL_AGGREGATE-939275295,CountStar)"], _rows=1, _db_hits=0)
==> ExtractPath(name="path", patterns=[" UNNAMED3=n-[:ASSOCIATIVY_CONNECTION*1..3]-t"], _rows=1, _db_hits=0)
==> PatternMatch(g="(n)-[' UNNAMED3']-(t)", _rows=1, _db_hits=0)
==> Nodes(name="t", _rows=1, _db_hits=1)
==> Nodes(name="n", _rows=1, _db_hits=1)
==> ParameterPipe(_rows=1, _db_hits=0)
It says 0ms but I don't know what that is supposed to mean: the result is returned after multiple seconds and the same query executed in the Data Browser's console takes about 3,5s (this is what it displays) and roughly the same amount of time fetched through the RESTful endpoint.
Edit 3, the real data set:
Enough with the theory :-), this is the data set what I'm really talking about: http://associativy.com/Media/Default/Associativy/Wiki.zip It's a graph generated by using the interlinks between Wikipedia articles, created from Wikipedia dump files. It's just the beginning.
The real query I'm trying to run is actually the following one, returning the nodes building up the paths between two nodes:
START n=node:NodeIds('id:4000'), t=node:NodeIds('id:64599') MATCH path = (n)-[:ASSOCIATIVY_CONNECTION*1..3]-(t) RETURN nodes(path) AS Nodes
I showed the count query because I wanted the simplest query that shows the symptoms.
Edit 4:
I opened another question specifically for the path-returning query.
See Question&Answers more detail:
os