[DOP-20958] Consume events in batches #92
Merged
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##           develop      #92     +/-   ##
===========================================
+ Coverage    91.69%   91.86%   +0.16%
===========================================
  Files          151      152       +1
  Lines         3071     3256     +185
  Branches       217      235      +18
===========================================
+ Hits          2816     2991     +175
- Misses         198      202       +4
- Partials        57       63       +6
dolfinus force-pushed the feature/DOP-20958 branch from b0b438f to 7acc27b (October 25, 2024 16:09)
dolfinus changed the title from "[DOP-20958] Convert consumer to batching" to "[DOP-20958] Consume events in batches" (Oct 25, 2024)
dolfinus force-pushed the feature/DOP-20958 branch from 7acc27b to 81a2db4 (October 28, 2024 13:12)
dolfinus force-pushed the feature/DOP-20958 branch from 81a2db4 to 6197431 (October 28, 2024 13:17)
dolfinus force-pushed the feature/DOP-20958 branch from 6197431 to 22e47f6 (October 28, 2024 13:20)
dolfinus force-pushed the feature/DOP-20958 branch from 22e47f6 to 12564d4 (October 28, 2024 13:28)
dolfinus force-pushed the feature/DOP-20958 branch from 12564d4 to 887a7a5 (October 28, 2024 13:50)
dolfinus force-pushed the feature/DOP-20958 branch from 887a7a5 to e6bf65e (October 29, 2024 07:59)
dolfinus force-pushed the feature/DOP-20958 branch from e6bf65e to 8f90ea5 (October 29, 2024 08:05)
dolfinus force-pushed the feature/DOP-20958 branch from 8f90ea5 to 6de41ef (October 29, 2024 10:35)
TiGrib approved these changes (Nov 1, 2024)
Change Summary
Make the consumer fetch a list of OpenLineage events instead of fetching them one by one:
- Each DTO gets a `unique_key` and a `merge` method. This allows tracking multiple versions of the same DTO and then merging them into one DTO that has all the non-null fields. For example, all RunDTOs with the same id are merged into one RunDTO with non-null created_at and ended_at, and so on.
- A `BatchExtractResult` class tracks each type of DTO, merges old items with new ones, and then allows iterating over the merged items.
- `extract_batch` iterates over a list of OpenLineage events (e.g. a sequence of STARTED+RUNNING+SUCCESS) and merges them into a final one (SUCCESS with started_at + ended_at + ...).
- The consumer gets a `BatchExtractResult` and creates all the extracted items in the database. All cross-DTO links are resolved by `BatchExtractResult`. For example, when `dataset_dto.id = ...` is populated from the database model, all InputDTOs and OutputDTOs with the same dataset get the same dataset id, which simplifies the consumer logic.

This greatly reduces the number of IO operations: instead of a sequence of INSERT+UPDATE+UPDATE+UPDATE+... statements, only one INSERT/UPDATE is performed per unique object.
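The merge-by-key idea described above can be illustrated with a minimal sketch. It is not the code from this PR: the simplified `RunDTO` fields, the `BatchResult` class and its `add_run` / `runs` methods are hypothetical names chosen for the example, and only the `unique_key` and `merge` names come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields
from datetime import datetime
from typing import Optional


@dataclass
class RunDTO:
    # Hypothetical, simplified DTO; the real class is assumed to have more fields.
    id: str
    started_at: Optional[datetime] = None
    ended_at: Optional[datetime] = None

    @property
    def unique_key(self) -> tuple:
        # Two DTOs with the same key describe the same object.
        return (self.id,)

    def merge(self, new: RunDTO) -> RunDTO:
        # Take every non-null field from the newer DTO, keep the old value otherwise.
        # Merging in place keeps object identity, so other DTOs that reference
        # this instance see the merged values as well.
        for f in fields(self):
            new_value = getattr(new, f.name)
            if new_value is not None:
                setattr(self, f.name, new_value)
        return self


@dataclass
class BatchResult:
    # Hypothetical stand-in for BatchExtractResult, tracking only runs.
    _runs: dict[tuple, RunDTO] = field(default_factory=dict)

    def add_run(self, run: RunDTO) -> RunDTO:
        existing = self._runs.get(run.unique_key)
        if existing is None:
            self._runs[run.unique_key] = run
            return run
        return existing.merge(run)

    def runs(self) -> list[RunDTO]:
        return list(self._runs.values())


# Three events for the same run (e.g. STARTED, RUNNING, SUCCESS) collapse into one DTO.
batch = BatchResult()
batch.add_run(RunDTO(id="run-1", started_at=datetime(2024, 10, 25, 16, 0)))
batch.add_run(RunDTO(id="run-1"))
batch.add_run(RunDTO(id="run-1", ended_at=datetime(2024, 10, 25, 16, 5)))
merged = batch.runs()
```

With this shape, the consumer would write `merged` to the database once per unique run instead of once per event, which is where the reduction in statements comes from.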
Benchmarks were run in the same environment (4 FastStream workers, 4 Kafka partitions) and on the same amount of data (239k events, 632MB with ZSTD compression).
Before this change:
After this change:
In short: 10x RPS, 86% fewer Postgres requests, and 47% less IO.
Related issue number
Checklist
- docs/changelog/next_release/<pull request or issue id>.<change type>.rst file added describing the change (see CONTRIBUTING.rst for details).