Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an aggregator for IPv4 and IPv6 subnets #82410

Merged

Conversation

salvatore-campagna
Copy link
Contributor

@salvatore-campagna salvatore-campagna commented Jan 11, 2022

This PR addresses a request for an IP address subnet aggregator (see #57964).

Parameters accepted by the aggregator include:

  • prefix_length (integer, required): defines the network size of the subnet mask
  • is_ipv6 (boolean, optional, default: false): defines whether the prefix applies to IPv6 (true) or IPv4 (false) IP addresses
  • min_doc_count (integer, optional, default: 1): defines the minimum number of documents for a bucket to be returned in the results
  • append_prefix_length (boolean, optional, default: false): defines if the prefix length is appended to the IP address key when returning results
  • keyed (boolean, optional, default: false): defines whether the result is returned keyed or as an array of buckets

Each bucket returned by the aggregator represents a different subnet. IPv4 subnets also include a netmask field set to the subnet mask value (i.e. "255.255.0.0" for a /16 subnet).

Related to: #57964 and elastic/kibana#68424

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jan 11, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@@ -65,6 +65,7 @@
"geotile_grid",
"global",
"histogram",
"ip_prefix",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@salvatore-campagna
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/part-1

@salvatore-campagna
Copy link
Contributor Author

I am aware I need to add documentation about this new aggregation...I am just waiting for some feedback about the interface.

@flash1293
Copy link
Contributor

The API makes a lot of sense to me - this would be a very good fit for how Kibana does things today 👍

@salvatore-campagna
Copy link
Contributor Author

salvatore-campagna commented Jan 11, 2022

This is a sample request including an aggregation on an IPv4 field and an aggregation on an IPv6 field:

curl -X GET "localhost:9200/network/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "/16": {
      "ip_prefix": {
        "field": "ipv4",
        "prefix_len": 16,
        "is_ipv6": false,
        "min_doc_count": 1,
        "append_prefix_len": false,
        "keyed": false
      }
    },
    "/64": {
      "ip_prefix": {
        "field": "ipv6",
        "prefix_len": 64,
        "is_ipv6": true,
        "min_doc_count": 1,
        "append_prefix_len": false,
        "keyed": false
      }
    }
  }
}'

And this is the response:

{
  "took" : 29,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "/16" : {
      "buckets" : [
        {
          "key" : "192.168.0.0",
          "netmask" : "255.255.0.0",
          "doc_count" : 4,
          "is_ipv6" : false,
          "prefix_len" : 16
        },
        {
          "key" : "192.169.0.0",
          "netmask" : "255.255.0.0",
          "doc_count" : 1,
          "is_ipv6" : false,
          "prefix_len" : 16
        }
      ]
    },
    "/64" : {
      "buckets" : [
        {
          "key" : "2001:db8:85a3::",
          "doc_count" : 1,
          "is_ipv6" : true,
          "prefix_len" : 64
        },
        {
          "key" : "2001:db8:85ab::",
          "doc_count" : 1,
          "is_ipv6" : true,
          "prefix_len" : 64
        },
        {
          "key" : "2001:dc4:11b2:7::",
          "doc_count" : 1,
          "is_ipv6" : true,
          "prefix_len" : 64
        },
        {
          "key" : "2001:dc4:11b3::",
          "doc_count" : 1,
          "is_ipv6" : true,
          "prefix_len" : 64
        },
        {
          "key" : "2004:db3:85f4:10::",
          "doc_count" : 1,
          "is_ipv6" : true,
          "prefix_len" : 64
        }
      ]
    }
  }
}

@salvatore-campagna
Copy link
Contributor Author

NOTE: I put the code under org.elasticsearch.search.aggregations.bucket.range which might not be ideal since this is not really a range aggregation (like other range aggregations). Initially this was done supporting multiple ranges and I just left the code there. Looking at other packages I don't see a clear candidate for this...

@not-napoleon
Copy link
Member

You can create a new package under o.e.s.aggregations.bucket for this agg. That's a pretty standard pattern for us.

One thing we should probably test - what happens if we get mixed IPv4 and IPv6 addresses in the same run? That doesn't seem like something the user would do on purpose, but for example if they're using an index pattern, it's not hard to imagine that the same field might have changed from IPv4 in one index to IPv6 in another. We should do something sensible in that case, even if it's just failing the shard if the stored data doesn't match what the aggregation expected.

@salvatore-campagna
Copy link
Contributor Author

salvatore-campagna commented Jan 12, 2022

I am adding a test were I index IPv6 addresses and then try to aggregate with is_ipv6 = false and viceversa, index documents with IPv4 addresses and try to aggregate with is_ipv6 = true. I think here the result is unpredictable (or at least implementation dependent). The issue is that all IP addresses are encoded as 16-bytes arrays. Then the mask is applied "shifting" the actual mask by 12 bytes if the address is IPv4. I have been thinking of a clever way to handle this so that we don't need to know if the address is IPv4 or IPv6 but could not come up with something good.

@salvatore-campagna
Copy link
Contributor Author

salvatore-campagna commented Jan 12, 2022

I was wondering if we should, in some way, encode the IP version in the IpFieldMapper together with the actual IP address (without having to distinguish two different types, IPv4 and IPv6, but still keeping just IP as the data type). I am not sure this is doable. I think this is not worth doing.

@salvatore-campagna
Copy link
Contributor Author

@elasticmachine update branch

@salvatore-campagna
Copy link
Contributor Author

@elasticmachine update branch

@salvatore-campagna
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

@salvatore-campagna
Copy link
Contributor Author

@elasticmachine update branch

@salvatore-campagna
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/part-2

Copy link
Member

@not-napoleon not-napoleon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things that I'd still like to fix before merging. Looking good overall!


@Override
protected LeafBucketCollector getLeafCollector(LeafReaderContext ctx, LeafBucketCollector sub) throws IOException {
return config.getValuesSource() == null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config.getValuesSource() should never return null. You want to check config.hasValues(). And you shouldn't even need to do that - ValuesSourceAggregatorFactory#createInternal() will check if there are values to aggregate, and call createUnmapped() if there aren't, which in turn can build a no-op aggregator.

// offset (12) to apply the subnet to the last 4 bytes (byes 12, 13, 14, 15)
// if the subnet mask is just a 4-bytes subnet mask.
int offset = subnetMask.length == 4 ? 12 : 0;
byte[] subnet = new byte[subnetMask.length];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think we should be allocating short lived objects here. I think you can just pass in the BytesRef and mutate it in place.

@elasticsearchmachine
Copy link
Collaborator

Hi @salvatore-campagna, I've created a changelog YAML for you.

@salvatore-campagna
Copy link
Contributor Author

@elasticmachine update branch

@salvatore-campagna salvatore-campagna self-assigned this Jan 26, 2022
@salvatore-campagna salvatore-campagna requested review from not-napoleon and removed request for nik9000 January 26, 2022 17:06
Copy link
Member

@not-napoleon not-napoleon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@salvatore-campagna salvatore-campagna merged commit 9de75c2 into elastic:master Jan 28, 2022
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jan 31, 2022
* upstream/master: (100 commits)
  Avoid duplicate _type fields in v7 compat layer (elastic#83239)
  Bump bundled JDK to 17.0.2+8 (elastic#83243)
  [DOCS] Correct header syntax (elastic#83275)
  Add unit tests for indices.recovery.max_bytes_per_sec default values (elastic#83261)
  [DOCS] Add note that write indices are not replicated (elastic#82997)
  Add notes on indexing to kNN search guide (elastic#83188)
  Fix get-snapshot-api :docs:integTest (elastic#83273)
  FilterPathBasedFilter support match fieldname with dot (elastic#83178)
  Fix compilation issues in example-plugins (elastic#83258)
  fix ClusterStateListener javadoc (elastic#83246)
  Speed up Building Indices Lookup in Metadata (elastic#83241)
  Mute whole suite for elastic#82502 (elastic#83252)
  Make PeerFinder log messages happier (elastic#83222)
  [Docs] Add supported _terms_enum field types (elastic#83244)
  Add an aggregator for IPv4 and IPv6 subnets (elastic#82410)
  [CI] Fix 70_time_series/default sort yaml test failures (elastic#83217)
  Update test-failure Issue Template to include "needs:triage" label elastic#83226
  Add an index->step cache to the PolicyStepsRegistry (elastic#82316)
  Improve support for joda datetime to java datetime transition in Painless (elastic#83099)
  Fix joda migration for week based methods in Painless (elastic#83232)
  ...

# Conflicts:
#	x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/v2/TransportRollupAction.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants