Aggressively cache entry points in process #29625
Conversation
Force-pushed from 55945e1 to b83d310.
airflow/utils/entry_points.py (Outdated)
```python
loaded: set[str] = set()
mapping: dict[str, list[EPnD]] = collections.defaultdict(list)
for dist in metadata.distributions():
    try:
        key = canonicalize_name(dist.metadata["Name"])
```
I was profiling memory usage (not speed), and this seemed to be a cause of a lot of bloat -- and it's surprisingly expensive to get the dist name (it involves parsing a lot of files for every dist in Airflow).
I think there are a few options here:
- Only compute this key if the entrypoint group matches (this limits the expensive operation to just the dists we actually care about, instead of all of them; see the sketch after this list)
- Use the path component as the cache key (see below)
- Cache based on the entrypoint classpath
- Don't cache at all. This only catches the case where you have multiple copies of the same dist in multiple places (which is rare to hit outside of being an Airflow developer anyway).
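A minimal sketch of option 1, reordering the loop above so the expensive name parse happens only for matching dists (the `GROUPS` constant is hypothetical, and the inline `(EntryPoint, Distribution)` tuples stand in for `EPnD`; this is not the PR's actual code):

```python
import collections
from importlib import metadata

from packaging.utils import canonicalize_name

# Hypothetical: the entry point groups Airflow actually consumes.
GROUPS = {"apache_airflow_provider", "airflow.plugins"}

mapping: dict[str, list[tuple[metadata.EntryPoint, metadata.Distribution]]] = collections.defaultdict(list)
loaded: set[str] = set()
for dist in metadata.distributions():
    # Filter on the group first; this only parses entry_points.txt,
    # not the much larger METADATA file.
    eps = [ep for ep in dist.entry_points if ep.group in GROUPS]
    if not eps:
        continue
    # Only now pay for the expensive dist.metadata["Name"] parse.
    key = canonicalize_name(dist.metadata["Name"])
    if key in loaded:
        continue  # skip duplicate copies of the same dist
    loaded.add(key)
    for ep in eps:
        mapping[ep.group].append((ep, dist))
```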
On point 2, it doesn't seem possible to do this using public methods, but this works:

```python
In [14]: d._path
Out[14]: PosixPath('/home/ash/airflow/.venv/lib/python3.11/site-packages/greenlet-2.0.2.dist-info')

In [15]: d._path.stem
Out[15]: 'greenlet-2.0.2'
```

Doing this might break for less common ways of shipping dists though, so it's likely not a good option without a fallback.
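If option 2 were pursued despite that, the fallback could look something like this (a sketch; `dist_cache_key` is a hypothetical helper, and `_path` is a private attribute that is not guaranteed to exist on every Distribution implementation):

```python
from importlib import metadata

from packaging.utils import canonicalize_name

def dist_cache_key(dist: metadata.Distribution) -> str:
    # _path exists on the path-based Distribution implementation only.
    path = getattr(dist, "_path", None)
    if path is not None:
        return path.stem  # e.g. 'greenlet-2.0.2'
    # Fall back to the expensive METADATA parse for other dist types.
    return canonicalize_name(dist.metadata["Name"])
```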
I checked the callers, and currently this is used when loading `airflow.plugins` and `apache_airflow_provider`. Both of these implement their own deduplication logic, so I think it is safe to remove this entirely. Although this would not actually help the latter case, which still accesses `metadata` anyway…
Force-pushed from 449de1f to 7ad6b88.
importlib.metadata.distributions() reads information from the actual installations, which is a lot of IO that we can avoid by caching.
Force-pushed from 7ad6b88 to ec0fa55.
(cherry picked from commit 9f51845)
`importlib.metadata.distributions()` reads information from the actual installations, which is a lot of I/O that we can avoid by caching.

The benefit of this depends on how many packages you have in your installation. It's nearly zero with a bare Airflow installation, and I observed a ~7% saving (17s to 16s) for the webserver to finish init (launch until `when_ready` is emitted) in a setup with all official providers installed.

The downside is we are now persisting a lot of small objects in memory. I wonder whether there's a good time we can purge those.
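For reference, a minimal sketch of the in-process caching being described (the PR's final implementation may differ; `entry_points_with_dist` mirrors the public function in `airflow/utils/entry_points.py`, and the `lru_cache` purge hook is one possible answer to the question above):

```python
import collections
import functools
from importlib import metadata

EPnD = tuple[metadata.EntryPoint, metadata.Distribution]

@functools.lru_cache(maxsize=None)
def _get_grouped_entry_points() -> dict[str, list[EPnD]]:
    # One full scan of installed distributions per process; every later
    # lookup is served from memory instead of re-reading dist metadata.
    mapping: dict[str, list[EPnD]] = collections.defaultdict(list)
    for dist in metadata.distributions():
        for ep in dist.entry_points:
            mapping[ep.group].append((ep, dist))
    return mapping

def entry_points_with_dist(group: str):
    """Yield (EntryPoint, Distribution) pairs for one group, from the cache."""
    yield from _get_grouped_entry_points()[group]

# Possible purge hook if the retained objects become a memory concern:
# _get_grouped_entry_points.cache_clear()
```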