Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggressively cache entry points in process #29625

Merged
merged 2 commits into from
Feb 25, 2023

Conversation

uranusjr
Copy link
Member

importlib.metadata.distributions() reads information from the actual installations, which is a lot of IO that we can avoid by caching.

The benefit of this depends on how many packages you have in your installation. It’s nearly zero with a bare Airflow installation, and I observed a ~7% save (17s to 16s) for the webserver to finish init (launch until when_ready is emitted) in a setup with all official providers installed.

The downside is we are now persisting a lot of small objects in memory. I wonder whether there’s a good time we can purge those.

@uranusjr uranusjr force-pushed the cache-importlib-metadata branch 2 times, most recently from 55945e1 to b83d310 Compare February 20, 2023 10:35
loaded: set[str] = set()
mapping: dict[str, list[EPnD]] = collections.defaultdict(list)
for dist in metadata.distributions():
try:
key = canonicalize_name(dist.metadata["Name"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was profling memory usage (not speed), and this seemed to be a cause of a lot of bloat -- and it's suprisingly expensive to get the dist name (involves parsing a lot of files for every dist in airflow).

I think there are three options here:

  1. Only compute this key if the entrypoint group matches (this limits the expensive operation to just dists we actually care about, instead of all
  2. Use the path component to be the cache key (see below)
  3. Cache based on the entrypoint classpath
  4. Don't cache at all. This only catches a case when you have multiple copies of the same dist in multiple cases (which is rare to hit outside of being an Airflow developer anyway).

On point 2, it doesn't seem possible to do this using public methods, but this:

In [14]: d._path
Out[14]: PosixPath('/home/ash/airflow/.venv/lib/python3.11/site-packages/greenlet-2.0.2.dist-info')

In [15]: d._path.stem
Out[15]: 'greenlet-2.0.2'

Doing this might break for less common ways of shipping dists though, so it's likely not a good option, not without a fallback anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the callers and currently this is used by loading airflow.plugins and apache_airflow_provider. Both of these implement their own deduplication logic, so I think it is safe to remove this entirely. Although this would not actually help the latter case, which still accesses metadata anyway…

@uranusjr uranusjr requested a review from ashb February 21, 2023 05:34
@uranusjr uranusjr force-pushed the cache-importlib-metadata branch from 449de1f to 7ad6b88 Compare February 21, 2023 17:08
importlib.metadata.distributions() reads information from the actual
installations, which is a lot of IO that we can avoid by caching.
@uranusjr uranusjr force-pushed the cache-importlib-metadata branch from 7ad6b88 to ec0fa55 Compare February 22, 2023 05:29
@uranusjr uranusjr marked this pull request as ready for review February 22, 2023 05:29
@uranusjr uranusjr merged commit 9f51845 into apache:main Feb 25, 2023
@uranusjr uranusjr deleted the cache-importlib-metadata branch February 25, 2023 05:36
@pierrejeambrun pierrejeambrun added this to the Airflow 2.5.2 milestone Feb 27, 2023
@pierrejeambrun pierrejeambrun added the type:misc/internal Changelog: Misc changes that should appear in change log label Feb 27, 2023
pierrejeambrun pushed a commit that referenced this pull request Mar 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:misc/internal Changelog: Misc changes that should appear in change log
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants