
Inconsistent Deduplication Behavior Between v1.2.0 and v1.3.0 #15163

Open
ShivamS136 opened this issue Mar 2, 2025 · 1 comment
Labels
bug, dedup

Comments

@ShivamS136

Issue Description

There appears to be a significant difference in deduplication behavior between Pinot v1.2.0 and v1.3.0. The behavior change affects how records are deduplicated based on the dedupTimeColumn and metadataTTL settings.

Environment

  • Affected Pinot Versions:
    • v1.3.0 (new behavior)
    • v1.2.0 (previous behavior)

Deduplication Behavior Differences

In v1.3.0:

  • A primary key is only tracked for dedup if at least one inserted record's dedupTimeColumn value is within metadataTTL of the current time
  • If a record inside the TTL window is inserted, deduplication works for that primary key
  • Records outside the TTL window are inserted even when the data is identical, producing potential duplicates
  • Once a record inside the TTL window is seen, the primary key is registered and all subsequent records with the same primary key value get deduped

In v1.2.0:

  • The dedupTimeColumn has no effect on deduplication
  • Every record inserted into Pinot registers its primary key, irrespective of the time column value
  • Subsequent records with the same primary key value get deduped consistently
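
To make the contrast concrete, here is a minimal Python sketch of the two behaviors as described above. It is an illustrative model only, not Pinot's implementation; the `ingest` function, the `(primary_key, dedup_time_ms)` tuple representation, and the convention that `metadata_ttl_ms == 0` means "TTL disabled" are all assumptions of the sketch.

```python
# Hypothetical model of the reported dedup difference; NOT Pinot's code.
import time

def ingest(records, metadata_ttl_ms, now_ms=None):
    """Return the records that survive dedup on the primary key.

    metadata_ttl_ms == 0 models the v1.2.0 behavior (no TTL: every
    primary key is tracked forever). A positive TTL models the reported
    v1.3.0 behavior: a primary key is only tracked when the record's
    dedup-time value is within the TTL window of "now", so records older
    than the TTL can be inserted even when they share a primary key.
    """
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    seen = set()
    kept = []
    for pk, dedup_time_ms in records:
        if pk in seen:
            continue  # deduped: primary key already tracked
        kept.append((pk, dedup_time_ms))
        if metadata_ttl_ms == 0 or now_ms - dedup_time_ms <= metadata_ttl_ms:
            seen.add(pk)  # track this key for future dedup
    return kept

now = 10_000_000
old_record = ("pk1", now - 700_000)   # older than a 600 s TTL
new_record = ("pk1", now - 100)       # within the TTL

# v1.2.0-style (TTL disabled): the second pk1 record is deduped.
assert ingest([old_record, old_record], 0, now) == [old_record]
# v1.3.0-style: both out-of-TTL pk1 records are kept (duplicates).
assert ingest([old_record, old_record], 600_000, now) == [old_record, old_record]
# Once a within-TTL record is seen, later duplicates are dropped.
assert ingest([new_record, old_record], 600_000, now) == [new_record]
```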

Expected Behavior

Deduplication should work consistently across versions and should properly deduplicate records based on the primary key, regardless of the time column values.

Table Configuration

Table Schema
{
	"schemaName": "leaderboard_entries",
	"dimensionFieldSpecs": [
		{
			"name": "leaderboard_id",
			"dataType": "LONG"
		},
		{
			"name": "participant_id",
			"dataType": "STRING"
		},
		{
			"name": "attempt_number",
			"dataType": "INT",
			"defaultNullValue": 1
		},
		{
			"name": "entry_meta",
			"dataType": "JSON",
			"defaultNullValue": "{}"
		}
	],
	"metricFieldSpecs": [
		{
			"name": "score",
			"dataType": "INT",
			"defaultNullValue": 0
		}
	],
	"dateTimeFieldSpecs": [
		{
			"name": "insertion_time",
			"dataType": "LONG",
			"format": "1:MILLISECONDS:EPOCH",
			"granularity": "1:MILLISECONDS"
		},
		{
			"name": "attempt_time",
			"dataType": "LONG",
			"format": "1:MILLISECONDS:EPOCH",
			"granularity": "1:MILLISECONDS"
		}
	],
	"primaryKeyColumns": ["leaderboard_id", "participant_id", "attempt_number"]
}
Table Config
{
	"tableName": "leaderboard_entries",
	"tableType": "REALTIME",
	"segmentsConfig": {
		"timeColumnName": "insertion_time",
		"replication": "2",
		"retentionTimeUnit": "DAYS",
		"retentionTimeValue": "90",
		"timeType": "MILLISECONDS"
	},
	"query": {
		"timeoutMs": "5000"
	},
	"tenants": {},
	"tableIndexConfig": {
		"sortedColumn": ["score"]
	},
	"fieldConfigList": [
		{
			"name": "leaderboard_id",
			"indexes": {
				"inverted": {}
			}
		},
		{
			"name": "participant_id",
			"indexes": {
				"bloom": {}
			}
		}
	],
	"ingestionConfig": {
		"streamIngestionConfig": {
			"streamConfigMaps": [
				{
					"streamType": "kafka",
					"stream.kafka.consumer.type": "lowlevel",
					"stream.kafka.topic.name": "leaderboard-entry",
					"stream.kafka.broker.list": "kafka:9092",
					"stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
					"stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka30.KafkaConsumerFactory",
					"stream.kafka.consumer.prop.auto.offset.reset": "smallest",
					"stream.kafka.consumer.prop.format": "JSON",
					"realtime.segment.flush.threshold.time": "4h",
					"realtime.segment.flush.threshold.rows": "0",
					"realtime.segment.flush.threshold.segment.rows": "0",
					"realtime.segment.flush.threshold.segment.size": "20M"
				}
			]
		}
	},
	"metadata": {
		"customConfigs": {}
	},
	"routing": {
		"instanceSelectorType": "strictReplicaGroup"
	},
	"dedupConfig": {
		"dedupEnabled": true,
		"hashFunction": "NONE",
		"dedupTimeColumn": "insertion_time",
		"metadataTTL": 600000,
		"enablePreload": true
	}
}

Observations

When using v1.2.0, the following warning appears during table addition, suggesting that the dedupTimeColumn and metadataTTL properties might not be recognized or used in this version:

{
  "unrecognizedProperties": {
    "/dedupConfig/dedupTimeColumn": "insertion_time",
    "/dedupConfig/metadataTTL": 600000
  },
  "status": "Table leaderboard_entries_REALTIME successfully added"
}

Impact

This behavior change can lead to:

  1. Unexpected duplicates in v1.3.0 when records are outside the TTL window
  2. Inconsistent deduplication behavior when migrating from v1.2.0 to v1.3.0
  3. Potential data integrity issues if applications rely on the previous deduplication behavior

Proposed Solution

Either:

  1. Restore the v1.2.0 behavior where deduplication works consistently regardless of time column values, or
  2. Clearly document this behavior change and provide configuration options to maintain backward compatibility

Additional Information

Related Slack thread with more info: https://apache-pinot.slack.com/archives/C011C9JHN7R/p1740757158048619

klsince (Contributor) commented Mar 3, 2025

The old version didn't have the TTL mechanism, so those two were unrecognized configs:

"unrecognizedProperties": {
    "/dedupConfig/dedupTimeColumn": "insertion_time",
    "/dedupConfig/metadataTTL": 600000
}

Could you try setting the TTL to 0 to disable the TTL mechanism, and see if the dedup behavior becomes the same as in the old version? Thanks!
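
For reference, the suggested change would presumably look like the following in the table config; this is a sketch of my reading of the suggestion (keeping dedupTimeColumn in place, which should be harmless once the TTL is 0):

```json
"dedupConfig": {
	"dedupEnabled": true,
	"hashFunction": "NONE",
	"dedupTimeColumn": "insertion_time",
	"metadataTTL": 0,
	"enablePreload": true
}
```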

@Jackie-Jiang added the bug and dedup labels on Mar 7, 2025