More profile metrics for Iceberg, S3 and Azure #1123
base: antalya-25.8
Conversation
```cpp
ProfileEvents::increment(ProfileEvents::AzureListObjects);
if (client->IsClientForDisk())
    ProfileEvents::increment(ProfileEvents::DiskAzureListObjects);
ProfileEventTimeIncrement<Microseconds> watch(ProfileEvents::AzureListObjectsMicroseconds);
```
Perhaps putting it in a scope would be more accurate and safer (e.g., future changes to this method that introduce slow operations could lead to wrong values)?

Example:

```cpp
ListBlobsPagedResponse blob_list_response;
{
    ProfileEventTimeIncrement<Microseconds> watch(ProfileEvents::AzureListObjectsMicroseconds);
    blob_list_response = client->ListBlobs(options);
}
```

The same comment applies to the other list-object operations.

In any case, I see that it is already implemented without this "protection", so it is not a must 👍
You are right: I want to measure the time spent on S3/Azure communication, not on parsing the response.
```cpp
    return removeEscapedSlashes(oss.str());
}

void insertRowToLogTable(
```
Do you still need the "old" overload? If so, add a comment explaining the difference between the two overloads.
And by the way, is dumpMetadataObjectToString really so expensive that you need to optimize it in a few code paths?
In other places insertRowToLogTable is called with strings obtained from other sources, e.g. https://github.com/Altinity/ClickHouse/blob/antalya-25.8/src/Storages/ObjectStorage/DataLakes/Iceberg/AvroForIcebergDeserializer.cpp#L119.
That is also a JSON string, but it takes a complex procedure to produce.
In my test on my desktop with ~7k Parquet files, metadata.json is about 5 MB in size.
The query `select count() from iceberg.table` does not read data; it gets the count from metadata only.
Its time dropped from 3.5 seconds to 0.3 seconds.
JSON parsing is fast, but serialization back to a string isn't.
Ok. What if you unified the interface?

```cpp
void insertRowToLogTable(
    ...
    const std::function<const std::string &()> & metadata_string_resolver)
{
    ...
    Context::getGlobalContextInstance()->getIcebergMetadataLog()->add(
        ...
        .metadata_content = metadata_string_resolver();
}
```

The above would potentially allow a single method, but I am not sure it is a good idea. Adding a comment is enough, I guess.
But getting the JSON only after the condition check makes sense in all cases. Changed it everywhere.
arthurpassos left a comment:
Looking good 👍
```cpp
void insertRowToLogTable(
    const ContextPtr & local_context,
    String row,
```
That needs a comment explaining WHY we need not a value but a function here. Do I get it right that the major reason is to avoid skewing the time measurements by building the metadata content?
To build the row string only when it is actually required. Serialization from a JSON structure back to a string can take a lot of time: in my test, `select count() from iceberg.table` went from 3.5 seconds down to 0.2 with this optimization alone (several thousand Parquet files, metadata.json about 5 MB).
Enmk left a comment:
Minor changes required
Refactor insertRowToLogTable to use a get_row function for lazy evaluation and improve the exit logic.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
More metrics for Iceberg, S3 and Azure
Documentation entry for user-facing changes
New Iceberg profile metrics:

- IcebergAvroFileParsing - counter of parsed Avro metadata files
- IcebergAvroFileParsingMicroseconds - time spent on parsing Avro metadata files
- IcebergJsonFileParsing - counter of parsed JSON metadata files
- IcebergJsonFileParsingMicroseconds - time spent on parsing JSON metadata files

New S3 profile metrics:

- S3ListObjectsMicroseconds - time spent on ListObjects requests
- S3HeadObjectMicroseconds - time spent on HeadObject requests

New Azure profile metric:

- AzureListObjectsMicroseconds - time spent on ListObjects requests

Small optimization: dumpMetadataObjectToString used to be called always, including the case when insertRowToLogTable dumps nothing and exits early after the iceberg_metadata_log_level setting check. Now serialization happens only when required.

CI/CD Options
Exclude tests:
Regression jobs to run: