Aggressively cache entry points in process #29625
Conversation
Force-pushed from 55945e1 to b83d310.
airflow/utils/entry_points.py (Outdated)
```python
loaded: set[str] = set()
mapping: dict[str, list[EPnD]] = collections.defaultdict(list)
for dist in metadata.distributions():
    try:
        key = canonicalize_name(dist.metadata["Name"])
```
I was profiling memory usage (not speed), and this seemed to be a cause of a lot of bloat -- and it's surprisingly expensive to get the dist name (it involves parsing a lot of files for every dist in Airflow).
I think there are a few options here:
- Only compute this key if the entrypoint group matches (this limits the expensive operation to just the dists we actually care about, instead of all of them; see the sketch after this list)
- Use the path component as the cache key (see below)
- Cache based on the entrypoint classpath
- Don't cache at all. This only catches the case where you have multiple copies of the same dist in multiple places (which is rare to hit outside of being an Airflow developer anyway).
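A minimal sketch of option 1, reordering the loop above so the expensive name parse happens only for matching dists (the `GROUPS` constant is hypothetical, and the inline `(EntryPoint, Distribution)` tuples stand in for `EPnD`; this is not the PR's actual code):

```python
import collections
from importlib import metadata

from packaging.utils import canonicalize_name

# Hypothetical: the entry point groups Airflow actually consumes.
GROUPS = {"apache_airflow_provider", "airflow.plugins"}

mapping: dict[str, list[tuple[metadata.EntryPoint, metadata.Distribution]]] = collections.defaultdict(list)
loaded: set[str] = set()
for dist in metadata.distributions():
    # Filter on the group first; this only parses entry_points.txt,
    # not the much larger METADATA file.
    eps = [ep for ep in dist.entry_points if ep.group in GROUPS]
    if not eps:
        continue
    # Only now pay for the expensive dist.metadata["Name"] parse.
    key = canonicalize_name(dist.metadata["Name"])
    if key in loaded:
        continue  # skip duplicate copies of the same dist
    loaded.add(key)
    for ep in eps:
        mapping[ep.group].append((ep, dist))
```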
On point 2, it doesn't seem possible to do this using public methods, but this works:

```python
In [14]: d._path
Out[14]: PosixPath('/home/ash/airflow/.venv/lib/python3.11/site-packages/greenlet-2.0.2.dist-info')

In [15]: d._path.stem
Out[15]: 'greenlet-2.0.2'
```

Doing this might break for less common ways of shipping dists though, so it's likely not a good option without a fallback.
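If option 2 were pursued despite that, the fallback could look something like this (a sketch; `dist_cache_key` is a hypothetical helper, and `_path` is a private attribute that is not guaranteed to exist on every Distribution implementation):

```python
from importlib import metadata

from packaging.utils import canonicalize_name

def dist_cache_key(dist: metadata.Distribution) -> str:
    # _path exists on the path-based Distribution implementation only.
    path = getattr(dist, "_path", None)
    if path is not None:
        return path.stem  # e.g. 'greenlet-2.0.2'
    # Fall back to the expensive METADATA parse for other dist types.
    return canonicalize_name(dist.metadata["Name"])
```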
I checked the callers, and currently this is used when loading `airflow.plugins` and `apache_airflow_provider`. Both of these implement their own deduplication logic, so I think it is safe to remove this entirely. Although this would not actually help the latter case, which still accesses `metadata` anyway…
Force-pushed from 449de1f to 7ad6b88.
importlib.metadata.distributions() reads information from the actual installations, which is a lot of IO that we can avoid by caching.
Force-pushed from 7ad6b88 to ec0fa55.
(cherry picked from commit 9f51845)
`importlib.metadata.distributions()` reads information from the actual installations, which is a lot of I/O that we can avoid by caching.

The benefit of this depends on how many packages you have in your installation. It's nearly zero with a bare Airflow installation, and I observed a ~7% saving (17s to 16s) for the webserver to finish init (launch until `when_ready` is emitted) in a setup with all official providers installed.

The downside is we are now persisting a lot of small objects in memory. I wonder whether there's a good time we can purge those.
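For reference, a minimal sketch of the in-process caching being described (the PR's final implementation may differ; `entry_points_with_dist` mirrors the public function in `airflow/utils/entry_points.py`, and the `lru_cache` purge hook is one possible answer to the question above):

```python
import collections
import functools
from importlib import metadata

EPnD = tuple[metadata.EntryPoint, metadata.Distribution]

@functools.lru_cache(maxsize=None)
def _get_grouped_entry_points() -> dict[str, list[EPnD]]:
    # One full scan of installed distributions per process; every later
    # lookup is served from memory instead of re-reading dist metadata.
    mapping: dict[str, list[EPnD]] = collections.defaultdict(list)
    for dist in metadata.distributions():
        for ep in dist.entry_points:
            mapping[ep.group].append((ep, dist))
    return mapping

def entry_points_with_dist(group: str):
    """Yield (EntryPoint, Distribution) pairs for one group, from the cache."""
    yield from _get_grouped_entry_points()[group]

# Possible purge hook if the retained objects become a memory concern:
# _get_grouped_entry_points.cache_clear()
```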