[Managed Iceberg] unbounded source #33504

Merged
ahmedabu98 merged 49 commits into apache:master from iceberg_streaming_source on Mar 20, 2025

ahmedabu98
Contributor

@ahmedabu98 ahmedabu98 commented Jan 6, 2025

Unbounded (streaming) source for Managed Iceberg.

See the design doc for a high-level overview: https://s.apache.org/beam-iceberg-incremental-source

Fixes #33092

@ahmedabu98 ahmedabu98 marked this pull request as draft January 6, 2025 18:16
…erg_streaming_source
@ahmedabu98 ahmedabu98 marked this pull request as ready for review January 30, 2025 21:09
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@ahmedabu98
Contributor Author

R: @kennknowles
R: @regadas

Can y'all take a look? I still have to write some tests, but it's at a good spot for a first round of reviews. I ran a bunch of pipelines (w/Legacy DataflowRunner) at different scales and the throughput/scalability looks good.

Contributor

github-actions bot commented Feb 3, 2025

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

Member

@kennknowles kennknowles left a comment

Overall, I think all the pieces are in the right place. Just a question about why an SDF is the way it is and a couple code-level comments.

This seems like something you want to test a lot of different ways before it gets into a release. Maybe get another set of eyes like @chamikaramj or @Abacn too. But I'm approving and leaving to your judgment.

Member

@kennknowles kennknowles left a comment

Wait, actually, I forgot that I want to have the discussion about the high-level toggle between the incremental scan source and the bounded source.

@github-actions github-actions bot added the build label Feb 13, 2025
@ahmedabu98
Contributor Author

@chamikaramj this is ready for another review

Contributor

@chamikaramj chamikaramj left a comment

Thanks. LGTM.

…remove window step; add --streaming=true validation; add IO links to Managed java doc

codecov bot commented Mar 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 56.32%. Comparing base (c1d0fa4) to head (fce87dc).
Report is 17 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #33504      +/-   ##
============================================
+ Coverage     56.28%   56.32%   +0.03%     
  Complexity     3286     3286              
============================================
  Files          1166     1172       +6     
  Lines        178704   178936     +232     
  Branches       3398     3398              
============================================
+ Hits         100591   100786     +195     
- Misses        74860    74897      +37     
  Partials       3253     3253              
Flag Coverage Δ
python 81.29% <ø> (+<0.01%) ⬆️


…Getsize; add resilience to watch snapshots transform
@chamikaramj
Contributor

LGTM. Thanks!

/** Helper class for source operations. */
public class ReadUtils {
  // default is 8MB. keep this low to avoid overwhelming memory
  static final int MAX_FILE_BUFFER_SIZE = 1 << 18; // 256KB
Contributor

Might want this to be configurable, or a pipeline option.

Contributor Author

I'd rather we wait until there's a need to expose it
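
For readers following this thread, here is a minimal sketch of the arithmetic behind the constant, plus one shape an override could take if a need to expose it ever came up. Everything in it is an assumption for illustration: `BufferSizeSketch`, `DEFAULT_MAX_FILE_BUFFER_SIZE`, `EIGHT_MB`, and `resolveBufferSize` are hypothetical names, not part of this PR or of Beam, and the PR itself keeps the value as a fixed constant.

```java
/** Illustrative sketch only; none of these names come from the PR or from Beam. */
public class BufferSizeSketch {
  // Same expression as the constant under review: 1 << 18 = 262,144 bytes (256 KB).
  static final int DEFAULT_MAX_FILE_BUFFER_SIZE = 1 << 18;

  // The 8 MB default mentioned in the source comment, for comparison.
  static final int EIGHT_MB = 8 << 20;

  /**
   * Hypothetical override hook: use the caller's value when one is given,
   * otherwise fall back to the 256 KB default.
   */
  static int resolveBufferSize(Integer requestedBytes) {
    if (requestedBytes == null) {
      return DEFAULT_MAX_FILE_BUFFER_SIZE;
    }
    if (requestedBytes <= 0) {
      throw new IllegalArgumentException("buffer size must be positive: " + requestedBytes);
    }
    return requestedBytes;
  }

  public static void main(String[] args) {
    System.out.println(DEFAULT_MAX_FILE_BUFFER_SIZE);            // 262144
    System.out.println(EIGHT_MB / DEFAULT_MAX_FILE_BUFFER_SIZE); // 32
    System.out.println(resolveBufferSize(null));                 // 262144
    System.out.println(resolveBufferSize(1 << 20));              // 1048576
  }
}
```

The cap is 32 times smaller than the 8MB default cited in the code comment, which is the memory saving the comment refers to. A nullable argument keeps the fixed default as the common path, so nothing changes for callers until an option is actually exposed.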

Contributor

@scwhittle scwhittle left a comment

Thanks!

@ahmedabu98 ahmedabu98 force-pushed the iceberg_streaming_source branch from 1f47712 to 1c0f7d7 Compare March 20, 2025 12:18
@ahmedabu98
Contributor Author

Thanks for the review @scwhittle! If all else is good, I'll merge it when tests pass

@github-actions github-actions bot removed the build label Mar 20, 2025
@github-actions github-actions bot added the build label Mar 20, 2025
@ahmedabu98 ahmedabu98 merged commit 332fe03 into apache:master Mar 20, 2025
110 checks passed
Development

Successfully merging this pull request may close these issues.

[Feature Request]: {Managed IO Iceberg} - Allow users to run streaming reads
4 participants