Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No saved queries are shown if SQLGlot is unable to parse any of the saved queries #27581

Closed
3 tasks
john-bodley opened this issue Mar 19, 2024 · 4 comments
Closed
3 tasks
Assignees

Comments

@john-bodley
Copy link
Member

john-bodley commented Mar 19, 2024

Bug description

@betodealmeida and @michael-s-molina I've found another wrinkle with the SQLGlot integration which was introduced in #26476. The /savedqueryview/list/ which displays a table of your saved queries includes a column named "Tables" which uses the ParsedQuery class to extract the tables names from every saved query.

These saved queries may contain Jinja templating—including the latest[_sub]_partition] macros resulting in SQL statements which SQLGlot is unable to parse—especially if the is no proper escaping. The resulting UX is users see this error and no saved queries are listed.

The long-term fix (as discussed previously) of having a multi-phase approach to extracting the table names without having to render aspects of the Jinja templates should resolve this, however I'm not entirely sure what the short-term solution should be.

Some possible solutions could be:

  1. Rendering the entire query—irrespective of how inefficient it may be and/or the unnecessary load/burden on the analytical engines, which need to execute the latest[_sub]_partition] Jinja macros.
  2. Simply removing the "Tables" column, though, I gather, this would be deemed a breaking change. I'm not entirely sure how valuable this is to users.
  3. Reverting back to using sqlparse and only introduce SQLGlot when i) the various workflows are hardened, and ii) [SIP-117] Improve SQL parsing is fully implemented. For context we've had about half a dozen reported regressions with the SQLGlot integration and thus we might not quite be ready for its widespread release.

How to reproduce the bug

  1. Create a saved query comprising of a SQL statement which can be parsed by SQLGlot, i.e., SELECT 1.
  2. Create a saved query comprising of a SQL-esque statement which cannot be parsed by SQLGlot, i.e., SELECT '{{ trino.latest_partition('foo') }}'.
  3. Go to either:
  • /superset/welcome/ and see the There was an issue fetching your saved queries: [object Response] error.
  • /savedqueryview/list/ and see the An error occurred while fetching Saved queriess: Unable to parse SQL ... error and observe that no saved queries are listed.

Screenshots/recordings

No response

Superset version

3.1.1

Python version

3.9

Node version

16

Browser

Chrome

Additional context

No response

Checklist

  • I have searched Superset docs and Slack and didn't find a solution to my problem.
  • I have searched the GitHub issue tracker and didn't find a similar bug report.
  • I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.
@betodealmeida
Copy link
Member

betodealmeida commented Mar 19, 2024

I'm not a fan of (3). I know we've had regressions with the introduction of #26476, but we also had multiple CVEs caused by sqlparse being very lenient. These regressions we're seeing after #26476 are also a result of sqlparse being so accepting, to the that point that we never noticed that we're passing Jinja templates to a SQL parser, instead of valid SQL. IMHO these are not regressions, but subtle bugs and potential security issues that we're only finding now because we switched to a stricter parser.

(2) might be OK, since it's the result of fixing a CVE. But let's try to fix this before going down that route.

One thing we could do is add a referenced_tables column to the saved_query table, which could be populated when the query is first run. We would probably have to leave it empty for existing queries when the migration runs, since I know AirBnB has saved queries going back a decade.

@john-bodley what happened to your PR that traversed the Jinja nodes? It seems like we need a helper function extract_tables_from_jinja_sql, which uses the node traversal to fetch tables from macros, and then the SQL parser to extract tables from a custom-rendered version of the query.

@betodealmeida
Copy link
Member

@john-bodley to be clear, my suggestion is to do something like this (pseudocode):

def get_referenced_tables(template: str) -> set[Table]:
    tables = get_tables_from_jinja_macros(template)
    sql = render_jinja(template, mode="dummy")  # do not run DB queries to fetch latest partitions
    return tables & ParsedQuery(sql).tables

@john-bodley
Copy link
Member Author

john-bodley commented Mar 20, 2024

@betodealmeida sorry I scrubbed my previous comment after realizing what you meant in terms of a helper function. I was thinking along the same lines as your pseudo code.

I'll work on a fix.

@michael-s-molina
Copy link
Member

Fixed by #27470

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants