Discussion:
[arangodb-google] Request database optimisation
Killian Janod
2018-11-09 15:04:30 UTC
Permalink
Dear arango community,

I need a little help =) I'll try to make this as simple as possible. I
have a not that big database ( 25G on drive ) with two main collections
(Let's call them A and B ). A contain 31,025,557 doc(s) and B 20,785,230.
There is many indexes on both ( with high selectivity ). Data in A and B
are semantically linked. They described the same events.
Arango is in version 3.9.19 with rocksDB engine and default configuration (
installed with docker).

I'm trying to find every document b in B that have a least one counter part
a in A. Then return b and it's most recent a counter part.

My problem is that i made two AQL queries that does the job but use a
huge amount of RAM >60G.

Both request can be found here :
https://gist.github.com/Waateur/6c8c0b2c40d0dfa08cecfe780275cd9f

The first one use join and the second one use Graph.

I pretty sure I'm missing something logical. There must be something wrong
with my queries.
Can someone help me reduce these memory consumption ? ( I know time/memory
is a trade off, spending more time is okay in my workflow )

Any advice/criticisms are welcome.
Sincerely,

Killian Janod
--
You received this message because you are subscribed to the Google Groups "ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to arangodb+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Killian Janod
2018-11-16 10:02:40 UTC
Permalink
Hi Simran, Thank's a lot for answering my questions.
I'm sorry I'have been slow to answer back.

There was a copy/paste mistake in the Gist. That's why there was an
undefined gfi_s. The Gist has been update to reflect better the query.
https://gist.github.com/Waateur/6c8c0b2c40d0dfa08cecfe780275cd9f

The collect bellow usually return under 5 results
COLLECT code=b.code, category_id=b.category_id INTO code_group

You are right about this filter FILTER LENGTH(a) == 1 is used to remove b
documents that have no match in a.

About the data distribution this is what i have in mind
A contain 31,025,557 docs and B 20,785,230 doc.
90% of A is link to one or more doc in B
60% of A is link to more than one B. ( mostly 3 and almost never over 5 )
20% of B is link to more than A. ( usually 2 or 3 )
These kind of duplicate are differentiated by the delivery string which is
a "date of upload" information.

As for example of documents, they were added to the gist.
I don't know if any other information can be useful.


Best,
Killian
Hi Killian,
it would be helpful to see your exact index definitions, some example
documents and to know a bit about the data distribution.
How many documents end up in one group here on average?
COLLECT code=b.code, category_id=b.category_id INTO code_group
What do you actually want to return here? "gfi_s" is not defined in the
FOR a_tmp IN A
...
RETURN gfi_s
What is this for? The sub-query is limited to one result anyway. Is this
for the case of no match or a missing attribute (gfi_s, see above)?
FILTER LENGTH(a) == 1
Best, Simran
--
You received this message because you are subscribed to the Google Groups "ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an
For more options, visit https://groups.google.com/d/optout.
--
Killian Janod
Datascientist @ iSmart / Kware
***@kware.fr
***@ismart.fr <***@kware.fr>
+33 (0) 6 61 33 34 76
--
You received this message because you are subscribed to the Google Groups "ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to arangodb+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Loading...