Moving on with Optimized Privilege Evaluation for 3.0 release #5181

nibix · 2025-03-17T08:07:52Z

I am filing this issue because we still need a decision on how to move on with the Optimized Privilege Evaluation (#4380) for 3.0 release.

See the PR discussion at #4898 for further context.

Personally, I would strongly recommend not to have it wrapped in a feature flag in the 3.0 release.

However: If we go this route, we also need to adjust the a property which controls if compat DLS/FLS headers are sent to other nodes. This is important for mixed clusters. See the note at the top of #4898 and in this comment: #4898 (review)

If we do not go this route, we urgently need to build alternative ways to handle this.

So, how shall we move on here?

See also #5013

@cwperks @krishna-ggk @kumargu @shikharj05

cwperks · 2025-03-17T15:06:49Z

[Triage] Thank you for filing this issue. Also tagging the repo maintainers for input @peternied @willyborankin @reta @DarshitChanpura @derek-ho @RyanL1997

peternied · 2025-03-17T16:33:13Z

@nibix Thanks for writing this up. What is the risk to OpenSearch customers for one path vs the other? At its face, this seems like this is the time to making a breaking API on the eve of v3.0.0.

kumargu · 2025-03-17T17:55:57Z

@nibix Unfortunately, AWS hasn't yet release 2.19 version for its managed service customers, so AWS don't actually have enough data to make the decision that the change hasn't had regressions in real production env (that was major motovation of the flag). Carrying the change in 3.0 without the flag cannot be considered safe.

Initially, I was alinged to get rid of the flag in 3.0, for the complexity you mentioned. But unfortunately due to nature of the delay, I would vote to keep the flag in 3.0 (hence 3.x)

nibix · 2025-03-18T04:23:19Z

@peternied

Thanks for writing this up. What is the risk to OpenSearch customers for one path vs the other? At its face, this seems like this is the time to making a breaking API on the eve of v3.0.0.

We actually have no significant breaking change that is visible to the user in this. The only user visible breaking change is the discontinuation of support of an undocumented config option described in #4495

nibix · 2025-03-18T04:45:28Z

@kumargu

While I understand your assessment of the risks, I need to object to the absolute way you are putting this:

Carrying the change in 3.0 without the flag cannot be considered safe.

Yes, there are risks. Like many changes to the security plugin have. Yes, this is a big change that is bigger than many changes to the security plugin. Still, many small changes have also the potential to carry big risks, even if they seem trivial on the first sight. In the end, we are talking about a plugin which creates security critical code. Must we then consider most of the changes to it as "not safe" then?

Please do not understand me wrong, I am understanding about your concerns and open to discuss options.

nibix · 2025-03-18T05:00:05Z

@peternied

What is the risk to OpenSearch customers for one path vs the other?

So, the two paths are these:

Discontinue the sending of DLS/FLS headers by default (for performance reasons). Compatibility code will check the version of the peer node. If the peer node is still on an OpenSearch version that does require the presence of DLS/FLS headers and the sent message requires it, the compatibility code will add the DLS/FLS headers to the message.
Have a feature flag which decides which implementation is used (So, this is not about whether DLS/FLS headers shall be sent or not, but rather which implementation of PrivilegeEvaluator, DlsFlsValve, and all other DLS/FLS supporting code is used). This brings the challenge how to toggle this feature flag in a way that there is no time window where nodes disagree about the implementation. If there would be a time window where nodes disagree about the implementation, then data leaks could occur because DLS/FLS is ineffective. This challenge is yet unsolved. Option 1 has the nice property that this can be told from the node version stored in the Connection object. Option 2 does no longer have that property that could help us in this regard, as the version cannot be longer used to tell what implementation a node is running. Another dimension to this is that the code base will need to carry two large implementations of privilege evaluation and DLS/FLS (estimate 8000 LOC per implementation). There must be scrutiny that any change to that functionality is applied to both implementations to address the risk that contributors are only aware of one implementation. Additional side effects are: higher maintenance cost of the code and higher turnaround times due to increased CI run time (as both implementations need to be tested).

Common to both paths is the risk of unknowns.

Option 1 is the originally designed method, which is actually already merged to the main branch. The only necessary change is the adaption of the release version of the feature.

Option 2 was requested from AWS managed service side to make roll backs easier in case of unforeseen issues.

kumargu · 2025-03-18T06:27:37Z

Must we then consider most of the changes to it as "not safe" then?

No.

The concern with this change is purely around the "very" large nature of it. We cannot introduce a safety flag for every other change, it will make the software complex. Safety around most of the other changes are covered by unit, functional, integration tests. Not saying we don't have sufficient coverage--but the larger the nature of change, harder its for humans to reason if all aspects of it are test covered. Just to give you additional context: this change has been bit delayed in AWS 2.19 because of a push-back to do an security-penetration-testing.

I think we all agreed why we needed the flag and that's why you did all the extra work to support the flag. My point here is: we haven't been able to exercise the code in production, so how do we decide to remove the flag?

kumargu · 2025-03-18T06:32:55Z

@nibix given the situation, I would like to hear your point of view on preserving the flag. We can then take votes from others and talk about the path forward.

nibix · 2025-03-18T07:37:29Z

@kumargu

Of course, the situation is complex. In a classic setting, having the opportunity to roll out a big change at a major version release which goes through an extended alpha/beta/GA phase is of course a great opportunity for quality assurance of changes, as the major release with alpha/beta phase carries the message "this brings big changes, please everyone invest great scrutiny to verify it".

Given the current situation, I don't think that we have a precise decision yet on how to move on. It's not just about preserving the flag, but also about handling the DLS/FLS changes which are at the moment not part of the feature flag protected version of the code. We had the decision "let's collect data and based on that we decide on how to move on ".

I'll just go through some options we can decide about

revert the PR from the main branch and forward port the commit with the feature flag from the 2.x branch. This version won't have any DLS/FLS improvements. Thus, we would need to release the DLS/FLS improvements with a later minor version, missing the additional testing visibility of the major release.
revert the PR from the main branch, forward port the commit from the 2.x branch and add on top the DLS/FLS support. This would give us the testing opportunity but we'd need to rush to get a suitable concept and an implementation ready for 3.0.
utilize the alpha beta phase to collect data for the present implementation.

Certainly there are more options.

kumargu · 2025-03-18T08:34:00Z

thanks @nibix for the proposing the options.

I was also thinking of option 1 as the path forward (i.e bringing DLS/FLS in a later minor version). This is the simplest and safest (with the timelines of 3.0-beta).

The perf improvements we have seen just from the current optimisations in 2.x is already huge and a lot of big AWS customers are waiting for this change to be generally available. The point I am trying to make is: we might be good without the DLS/FLS changes for now. DLS/FLS in 3.1 will bring more improvements.

(the green lines are CPU utilization without the optimizations while the red line shows the CPU utilization with optimizations)

Regarding option 3:

utilize the alpha beta phase to collect data for the present implementation.

At least I think its really hard to collect data from the community. I'd be frank here: the change is so effective but I haven't heard anyone giving us feedback on the effectiveness. I am pretty sure we will get a lot of good data points when this change is available in AWS (or other cloud providers).

cwperks · 2025-03-18T14:41:04Z

The point I am trying to make is: we might be good without the DLS/FLS changes for now. DLS/FLS in 3.1 will bring more improvements.

Just wanted to bring up that there have been a number of instances where this change would help and its been hit both in clusters managed by AWS or users running self-managed clusters.

Prior to Optimized Privilege Evaluation, I tried a separate approach to optimize that reduced the data structure that was serialized in the ThreadContext headers but it was closed in favor of OPE: #4706

kumargu · 2025-03-18T16:10:14Z

Agreed that the DLS/FLS changes are super important too and holding them off unmerged is also painful. But again, I think it might be the simplest path forward.

Open for Option 2, if there are strong concerns of holding off the DLS/FLS changes. Note 3.0 Beta is going to be pushed for 2 weeks further opensearch-project/.github#300, so if we think we can make the changes with enough testing, I am open for Option 2.

nibix · 2025-03-19T11:59:53Z

@kumargu @kkhatua @cwperks @peternied @willyborankin @reta @DarshitChanpura @derek-ho @RyanL1997

As we already missed the alpha, we really need to expedite the decision making process.

cwperks · 2025-03-19T13:31:59Z

I know we had discussed letting this bake in the 2.x line for up to 3 minor versions before removing the flag and so far we've only had a single 2.x version that uses the pre-computed data structures for privilege evaluation. I would support introducing the feature flag for 3.0 and removing in one of the first few 3.x releases.

kumargu · 2025-03-19T14:20:36Z

+1 on @cwperks proposal. Also keeping things simple and less hectic, lets introduce DLS/FLS changes in 3.1.

let's try to close it this week.

willyborankin · 2025-03-19T15:54:27Z

I personally believe that introducing a flag could break backward compatibility in the 3.x series. This might lead to a situation similar to what happened with ES in the past (I'm unsure how it works now), where a breaking or undocumented change was introduced, only to be planned for removal in a future minor version. In other words, I’m in favor of keeping the current approach without introducing a feature flag.

nibix · 2025-03-20T14:20:26Z

Looking at #5193 one thing just crossed my mind, which I should point out:

For porting the feature flag to 3.x, we must not only pick the commit from #4998, but we also must restore all V6 config classes which were already removed from the main branch: #4546 . This will probably also require that #5193 is rolled back again.

kumargu · 2025-03-20T14:40:41Z

thanks @nibix for pointing that out. There seems to be a lot of friction involved to get the flag in-3.0. I fear we might introduce regression just managing commits.

Looking at the comment stream it seems @willyborankin @peternied @nibix upvotes for removing the flag in 3.0. While @kumargu @cwperks suggests to continue the flag usage in 3.0.

Going with the quorum and urgency of decision making: Let's go ahead with the removal of the flag in 3.0. Let's conclude on this first before we talk about DLS/FLS changes.

Given Opensearch 3.0 release is out to be in May first week, the 2.19 in AWS managed service will have sufficient time (probably 2 months) to get the current change in 2.19 exercised in production. Any bug fixes identified during this time will have to be backported to 3.0.

(please thumbs up if y'all are aligned and let's try to close this today)

kumargu · 2025-03-21T13:34:31Z

@nibix are you good with the above path-forward?

nibix · 2025-03-21T16:35:42Z

@kumargu

@nibix are you good with the above path-forward?

absolutely. i will file a PR to make the adjustment to the release version mentioned above beginning of next week.

kumargu · 2025-03-23T10:25:48Z

On DLS/FLS changes: I am still conservative and want to hold it until 3.1. My mental model is to get the Privileged Evaluation optimizations in a good clean state in 3.0 (no ambiguities) and then roll the DLS/FLS changes. We can make the DLS/FLS changes without a feature flag, because the surface of changes would be reduced.

github-actions bot added the untriaged Require the attention of the repository maintainers and may need to be prioritized label Mar 17, 2025

cwperks added triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable. and removed untriaged Require the attention of the repository maintainers and may need to be prioritized labels Mar 17, 2025

nibix mentioned this issue Mar 18, 2025

Fix logic for determining if a role has no DLS/FLS/FieldMasking restrictions #5184

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Moving on with Optimized Privilege Evaluation for 3.0 release #5181

Moving on with Optimized Privilege Evaluation for 3.0 release #5181

nibix commented Mar 17, 2025 •

edited

Loading

cwperks commented Mar 17, 2025

peternied commented Mar 17, 2025

kumargu commented Mar 17, 2025

nibix commented Mar 18, 2025

nibix commented Mar 18, 2025

nibix commented Mar 18, 2025 •

edited

Loading

kumargu commented Mar 18, 2025

kumargu commented Mar 18, 2025

nibix commented Mar 18, 2025 •

edited

Loading

kumargu commented Mar 18, 2025

cwperks commented Mar 18, 2025

kumargu commented Mar 18, 2025

nibix commented Mar 19, 2025

cwperks commented Mar 19, 2025

kumargu commented Mar 19, 2025

willyborankin commented Mar 19, 2025 •

edited

Loading

nibix commented Mar 20, 2025

kumargu commented Mar 20, 2025

kumargu commented Mar 21, 2025

nibix commented Mar 21, 2025

kumargu commented Mar 23, 2025

Moving on with Optimized Privilege Evaluation for 3.0 release #5181

Moving on with Optimized Privilege Evaluation for 3.0 release #5181

Comments

nibix commented Mar 17, 2025 • edited Loading

cwperks commented Mar 17, 2025

peternied commented Mar 17, 2025

kumargu commented Mar 17, 2025

nibix commented Mar 18, 2025

nibix commented Mar 18, 2025

nibix commented Mar 18, 2025 • edited Loading

kumargu commented Mar 18, 2025

kumargu commented Mar 18, 2025

nibix commented Mar 18, 2025 • edited Loading

kumargu commented Mar 18, 2025

cwperks commented Mar 18, 2025

kumargu commented Mar 18, 2025

nibix commented Mar 19, 2025

cwperks commented Mar 19, 2025

kumargu commented Mar 19, 2025

willyborankin commented Mar 19, 2025 • edited Loading

nibix commented Mar 20, 2025

kumargu commented Mar 20, 2025

Going with the quorum and urgency of decision making: Let's go ahead with the removal of the flag in 3.0. Let's conclude on this first before we talk about DLS/FLS changes.

kumargu commented Mar 21, 2025

nibix commented Mar 21, 2025

kumargu commented Mar 23, 2025

nibix commented Mar 17, 2025 •

edited

Loading

nibix commented Mar 18, 2025 •

edited

Loading

nibix commented Mar 18, 2025 •

edited

Loading

willyborankin commented Mar 19, 2025 •

edited

Loading