perf: Neo4j csv publisher using apoc library for performance improvements - CLEAN #1877
Looks great! Once this is merged, I'll deploy into our dev environment and start testing other features out.
AFAIK this doesn't support multiple programmatic descriptions being added, so I'll work on that PR post-merge.
Excellent stuff!
Looking forward to having this PR merged! Would like to use this in the official build 🚀
Ahh looks like
(https://github.com/amundsen-io/amundsen/pull/1877/checks?check_run_id=6632089423) |
Signed-off-by: Zac Ruiz <zac@salt.io>
c51cb10 to e62874d
I tested this in my environment and found that the data is not actually ingested into Neo4j. The good news is that, after a discussion and pair debugging with @zacr, we were able to locate the problem! The root cause is related to the Neo4j version.
Basically, there are three main problems.
Publisher not showing Neo4j commit error: we will get this error message while running the query in the Neo4j browser.
apoc.merge.node spec changed.
Ahh yep, that would make sense with the |
Thanks so much @chonyy. I have an update in progress that addresses all three, hope to have it committed in the next day or so.
Hey @zacr, FYI, I tested your code in neo4j After I changed all the keys to uppercase, it works! Just want to let you know that case sensitivity also seems to be an issue in your provided environment.
Awesome @chonyy. Agreed on the upper case. Can we assume the headers of the CSV files will always be upper case?
Also, I am out of town this week, but here is the error handling code that goes right after session.run() in _execute_statement()
I was thinking of adding a flag to turn this checking on or off.
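The error handling code itself did not survive in this thread. As a rough sketch of the idea only, and not the code that was posted, the check could inspect the summary row that `apoc.periodic.iterate` returns and raise when it reports failures. The names `PublishError`, `check_iterate_result`, and `fail_on_error` are hypothetical, and the sketch assumes the row has already been converted to a dict.

```python
from typing import Any, Mapping


class PublishError(RuntimeError):
    """Hypothetical exception raised when a batch reports failed operations."""


def check_iterate_result(summary: Mapping[str, Any], fail_on_error: bool = True) -> int:
    """Return the number of failed operations, raising when the flag is on.

    `summary` is assumed to be the single row returned by
    apoc.periodic.iterate, converted to a dict.
    """
    failed = int(summary.get("failedOperations") or 0)
    if failed and fail_on_error:
        # Surface the APOC error messages to the caller instead of dropping them.
        raise PublishError(
            f"{failed} operation(s) failed: {summary.get('errorMessages')}"
        )
    return failed
```

Inside `_execute_statement()` this would presumably be called on the result of `session.run()`, with `fail_on_error` playing the role of the on/off flag mentioned above.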
Any updates on this PR? Would be great to see it merged!
I'm thinking that maybe we don't even have to spend time supporting Neo4j What I can help with here is adding the fix, plus the extra part that zacr mentioned, in order to surface the error to the caller. But we still need a committer's help to look into this and approve the CI pipeline run 🥲
Hi, I just checked your code and it generally looks good to me. I have just a couple of nitpicks, plus one comment about the index creation, which is a bug (so the performance improvement could be greater than your initial estimate).
One important thing that seems to be missing is support for a database other than the default one. Neo4j has supported multi-tenancy since version 4.x, so it would be good to have it in there.
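A minimal sketch of what non-default database support might look like, assuming the official Neo4j Python driver (4.0 or later), whose `driver.session()` accepts a `database` keyword. The helper name `open_session` and the idea of feeding it from a publisher config value are assumptions for illustration, not part of this PR.

```python
from typing import Any, Optional


def open_session(driver: Any, database: Optional[str] = None) -> Any:
    """Open a session on the named database, or on the server default.

    `driver` is assumed to be a neo4j.Driver; `database` would come from a
    hypothetical publisher config entry.
    """
    if database:
        # Route all statements in this session to the named database.
        return driver.session(database=database)
    # No name configured: fall back to the server's default database.
    return driver.session()
```

Leaving the argument out when no database is configured keeps the call compatible with setups that only ever use the default database.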
CALL apoc.periodic.iterate(
'UNWIND $rows AS row RETURN row',
'CALL apoc.merge.node([row.label], {key:row.key}, row, row) YIELD node RETURN COUNT(*);',
{batchSize: $batch_size, iterateList: True, parallel: False, params: { rows: $batch, tag: $publish_tag }}
- {batchSize: $batch_size, iterateList: True, parallel: False, params: { rows: $batch, tag: $publish_tag }}
+ {batchSize: $batch_size, parallel: False, params: { rows: $batch, tag: $publish_tag }}
I would like to remove iterateList, as it's deprecated, and use the default batchMode, which is BATCH and has the same behaviour.
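To illustrate the suggestion, here is a sketch of the node merge statement with `iterateList` dropped, relying on the reviewer's statement that the default `batchMode` behaves the same way. The constant and function names are hypothetical; the Cypher text follows the snippet quoted in the diff.

```python
from typing import Any, Dict, List, Tuple

# Same statement as in the diff, minus the deprecated iterateList option.
NODE_MERGE_STATEMENT = """
CALL apoc.periodic.iterate(
  'UNWIND $rows AS row RETURN row',
  'CALL apoc.merge.node([row.label], {key: row.key}, row, row) YIELD node RETURN COUNT(*);',
  {batchSize: $batch_size, parallel: False, params: {rows: $batch, tag: $publish_tag}}
)
"""


def build_node_merge_call(
    batch: List[Dict[str, Any]], batch_size: int, publish_tag: str
) -> Tuple[str, Dict[str, Any]]:
    """Return the statement and the query parameters it expects."""
    params = {"batch": batch, "batch_size": batch_size, "publish_tag": publish_tag}
    return NODE_MERGE_STATEMENT, params
```

The returned pair could then be passed to the driver as `session.run(statement, params)`.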
return """
CALL apoc.periodic.iterate('UNWIND $rows AS row RETURN row', '
CALL apoc.merge.node([row.label], {key:row.key}, row, {published_tag:$tag,publisher_last_updated_epoch_ms:timestamp()}) YIELD node RETURN COUNT(*);
', {batchSize: $batch_size, iterateList: True, parallel: False, params: { rows: $batch, tag: $publish_tag }})
- ', {batchSize: $batch_size, iterateList: True, parallel: False, params: { rows: $batch, tag: $publish_tag }})
+ ', {batchSize: $batch_size, parallel: False, params: { rows: $batch, tag: $publish_tag }})
Like above.
return """
CALL apoc.periodic.iterate('UNWIND $rows AS row RETURN row', '
CALL apoc.merge.node([row.label], {key:row.key}, row, row) YIELD node RETURN COUNT(*);
', {batchSize: $batch_size, iterateList: True, parallel: False, params: { rows: $batch }})
- ', {batchSize: $batch_size, iterateList: True, parallel: False, params: { rows: $batch }})
+ ', {batchSize: $batch_size, parallel: False, params: { rows: $batch }})
Like above.
# CALL apoc.schema.assert(null, data, dropExisting: false) YIELD label, key, keys, unique, action
# """
stmt = """
CREATE CONSTRAINT ON (node:label) ASSERT node.key IS UNIQUE
The label is fixed: you cannot pass it as a parameter as you do here. You have to format it into the string upfront, otherwise it won't work.
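One way to do that formatting, sketched under assumptions: the helper name and the label validation are hypothetical additions, and the constraint syntax is kept exactly as quoted in the diff, which may need adjusting for other Neo4j versions.

```python
import re

# Hypothetical guard: only plain identifiers are accepted, because the label
# is interpolated into the query text rather than sent as a parameter.
_LABEL_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_unique_key_constraint(label: str) -> str:
    """Return a constraint statement with the label formatted into the text."""
    if not _LABEL_PATTERN.match(label):
        raise ValueError(f"Unsafe or invalid node label: {label!r}")
    return f"CREATE CONSTRAINT ON (node:{label}) ASSERT node.key IS UNIQUE"
```

The publisher would then run one such statement per distinct label found in the node files, instead of a single statement with the literal label `label`.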
/rebase |
Hello, I know it has been a while since there has been activity on this PR, but I wanted to follow up for those who were interested in a faster publisher. A few months ago I worked on one myself in this PR using the |
Closing as done.
Summary of Changes
New Neo4JCsvPublisher implementation using the APOC library for a 5x performance improvement. This is a fresh PR because the original one wasn't created on a clean branch.
Tests
I didn't see any automated tests for Neo4JCsvPublisher. This PR was tested manually by several community members.
Documentation
CheckList
Make sure you have checked all steps below to ensure a timely review.