Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why Latin1 for report tables? #28

Open
gggeek opened this issue Dec 4, 2022 · 12 comments · May be fixed by #30
Open

Why Latin1 for report tables? #28

gggeek opened this issue Dec 4, 2022 · 12 comments · May be fixed by #30

Comments

@gggeek
Copy link

gggeek commented Dec 4, 2022

Is it ok to use eg. utf8 charset when defining the reports tables? It would seem to make sense, given the script/hostname data present in them

@anton-povarov
Copy link
Contributor

Shouldn't be an issue to use whatever charset you prefer. Have you tried?
The engine itself does not interpret strings, just stores them as a bunch of bytes.

@gggeek
Copy link
Author

gggeek commented Dec 7, 2022

Thx for the info. I will give it a try and come back with the results

@anton-povarov
Copy link
Contributor

anton-povarov commented Dec 7, 2022

One potential issue i can think of is protobuf-c implementation on the client side.
In case you use it from php - it might store them as c-strings (i.e. terminating null byte) - shouldn't be a problem for utf8, but maybe other charsets.
Engine itself, parses protobuf as data+length, so null bytes should not be a problem there.

In any case - curious to see if it works for you, if it doesn't - we'll investigate.

@gggeek
Copy link
Author

gggeek commented Dec 10, 2022

I did test this locally with "simple" utf8 script such as 'κόσμε', and it works.

The Pinba2 server in use was the official container from this project. The client-side were the php pinba extension (ver. 1.1.1) from Ubuntu Focal, and one pure-php protobuf implementation.

I created the reports table as

CREATE TABLE `report_by_script_name` (
    `script` varchar(64) NOT NULL,
    ...
) ENGINE=PINBA DEFAULT CHARSET=utf8 COMMENT='v2/request/60/~script/no_percentiles/no_filters';

And I had to add in the code querying its data: $db->set_charset('utf8');

@gggeek
Copy link
Author

gggeek commented Dec 10, 2022

ps: about NULL chars: indeed, if I add chr(0) in the middle of the script_name, the corresponding line does not show up any more in the db after calling flush on the client side. I did not investigate the raw protobuf packet to find out what was being sent (or not) in that case...

@anton-povarov
Copy link
Contributor

Ok, great to know.
I'll keep this one open for a bit longer, please let me know if you get to check out the null chars - is it a client php extension issue or an engine issue.

@gggeek
Copy link
Author

gggeek commented Dec 12, 2022

WDYT about changing the sql files which are part of this repo? At the very least we could add a comment in them mentioning Latin1 is not a requirement...

@anton-povarov
Copy link
Contributor

Sure, happy to accept a PR with utf8 everywhere in sql files.

@gggeek
Copy link
Author

gggeek commented Dec 12, 2022

Sure, will do.

Btw, I checked the "raw message" being sent when there is a NUL byte in the middle of the php string used to indicate the script_name, using Wireshark as protobuf dissector:

  • both the php extension and my pure-php polyfill do send a message in that scenario
  • the packet sent by the php extension is dissected as having a string with a value made of all chars up to the null one, excluding it
  • the packet sent by the polyfill apparently sends "garbled" data

So it seems that the code dropping those packets might be on the server-side (I will test that again with Pinba2, last tests were ran against Pinba1)

@gggeek
Copy link
Author

gggeek commented Dec 13, 2022

if I add chr(0) in the middle of the script_name, the corresponding line does not show up any more in the db after calling flush on the client side

Further testing disproved the above.
I ran fresh tests with a NUL char in the middle of hostname, server_name and script_name, after having patched my own lib to work the same way as the php extension (ie. truncate strings at the first NUL char found), and both pinba1 and pinba2 engines are happy with what they get.
I wonder why I got no-data results in the first rounds of testing... probably because sending a single NUL char results in an empty string, which might not be considered a valid script_name/server_name/hostname.
In any case: sending of NUL chars does not seem to have much to do with the chosen character set encoding for retrieving the data once it's in the DB (at least as long we are talking about UTF8 vs Latin1 - they both use the same encoding for NUL).
I will send a PR as discussed above.

@gggeek gggeek linked a pull request Dec 13, 2022 that will close this issue
@anton-povarov
Copy link
Contributor

I kind of assumed that pinba and pinba2 engines should treat strings with null-bytes in the middle differently, attributing to slightly different ways they unpack incoming protobuf packets.
Pinba1 - truncating, pinba2 - preserving.
For uts8 this point is moot since there is you're "not supposed to" havre null-bytes inside strings.

Thank you for the PR's, have been very busy lately, will review asap.

@gggeek
Copy link
Author

gggeek commented Dec 14, 2022

I kind of assumed that pinba and pinba2 engines should treat strings with null-bytes in the middle differently, attributing to slightly different ways they unpack incoming protobuf packets.

This is possible.

What I did verify is that, for the pinba php extension, the truncation happens on the client side - possibly within code generated by the protoc compiler - so in those tests the NUL char was not sent to the engine at all.

As for being able to send a NUL char as part of the protobuffer packet: I did manage to do that by using my own php implementation of the client, but the results were mixed. I tried to decode the raw protobuf packet to check its contents: using protoc --decode_raw, the string was decoded correctly, and NUL displayed as \u0000; using wireshark the string was highlighted as being malformed. This prompted me to alter the behaviour of the php code to behave the same way as the php extension, as there seems to be little point in being able to send NUL chars if in any case all it would be helpful for is to expose weird corner cases in the receiver side. For all I know, it could even be used for some hacking attempt :-) I can not report reliably of behaviour differences between pinba1 and pinba2.

For uts8 this point is moot since there is you're "not supposed to" have null-bytes inside strings.

Not to be pedantic, but the NUL character is valid in utf8, just as it is valid in ascii. It is just C strings that have issues with that ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants