-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Can not parse row data with a “SHIFT_JIS” SAS file. #55
Comments
@printsev @TimurMikhaliov |
Hi @nixiejun -- would it be possible for you to share the source file? Then we would be able to take a look. Thanks. |
we cannot do much here unfortunately without the data set so I'm putting the "nodataset" label to the issue |
Here are some datasets for this. I haven't got time to look into the issue more closely, but hopefully this is helpful to others. |
This issue can be marked again as "nodataset". |
It is not correct to check compression method using String.contains() method over the header bytes converter to a multibyte string. Octets, related to different things, combined together and then converted to a multibyte string may form unpredictable characters. See: new String(new byte[]{-104, 'S', 'A', 'S', 'Y', 'Z', 'C', 'R', 'L'}, "Shift_JIS").contains("SASYZCRL"); //false for "牢ASYZCRL"
new String(new byte[]{-104, 'S', 'A', 'S', 'Y', 'Z', 'C', 'R', 'L'}, "UTF-8").contains("SASYZCRL"); //true for "�SASYZCRL"
new String(new byte[]{-104, 'S', 'A', 'S', 'Y', 'Z', 'C', 'R', 'L'}, "US-ASCII").contains("SASYZCRL"); //true for "�SASYZCRL" So, the compression method like any other fields should be extracted from an explicit position. |
It looks like `SasFileParser.FormatAndLabelSubheader.processSubheader` and and `SasFileConstants` files should use another offsets to calculate column format information correctly for x64 bitness. Before a fix: ```java long COLUMN_FORMAT_WIDTH_OFFSET = 8L; long COLUMN_FORMAT_PRECISION_OFFSET = 10L; subheaderOffset + COLUMN_FORMAT_WIDTH_OFFSET + intOrLongLength, subheaderOffset + COLUMN_FORMAT_PRECISION_OFFSET + intOrLongLength, ``` For 32-bit: COLUMN_FORMAT_WIDTH_OFFSET + intOrLongLength = 8 + 4 = 12 (correct) COLUMN_FORMAT_PRECISION_OFFSET + intOrLongLength = 10 + 4 = 14 (correct) For 64-bit: COLUMN_FORMAT_WIDTH_OFFSET + intOrLongLength = 8 + 8 = 16 (not correct) COLUMN_FORMAT_PRECISION_OFFSET + intOrLongLength = 10 + 8 = 18 (not correct) After the fix: ```java long COLUMN_FORMAT_WIDTH_OFFSET = 0L; long COLUMN_FORMAT_PRECISION_OFFSET = 2L; subheaderOffset + COLUMN_FORMAT_PRECISION_OFFSET + 3 * intOrLongLength, subheaderOffset + COLUMN_FORMAT_TEXT_SUBHEADER_INDEX_OFFSET + 3 * intOrLongLength, ``` For 32-bit: COLUMN_FORMAT_WIDTH_OFFSET + 3 * intOrLongLength = 0 + 3 * 4 = 12 (correct and the same as above) COLUMN_FORMAT_PRECISION_OFFSET + 3 * intOrLongLength = 2 + 3 * 4 = 14 (correct and the same as above) For 64-bit: COLUMN_FORMAT_WIDTH_OFFSET + 3 *intOrLongLength = 0 + 3 * 8 = 24 (correct) COLUMN_FORMAT_PRECISION_OFFSET + 3 * intOrLongLength = 2 + 3 * 8 = 26 (correct) So, this new calculation gives exactly the same offsets for 32-bit files and also fixes it for 64-bit files.
It used to be possible to misidentify compression method using the String.contains method over the whole subheader text (Imagine, you have a column name the same as one of compression method name within the file without compression at all. In this case compression type will be mistakenly detected). Also it was a problem with searching of compression name over the whole encoded subheader text containing multi-byte characters.
It used to be possible to misidentify compression method using the String.contains method over the whole subheader text (Imagine, you have a column name the same as one of compression method name within the file without compression at all. In this case compression type will be mistakenly detected). Also it was a problem with searching of compression name over the whole encoded subheader text containing multi-byte characters.
closed thanks to xantorohara |
Hi contributors:
I found my SAS file was not parsed correctly under "SHIFT_JIS" encoding.
This file seems has the compression method "SASYZCRL" however under this encoding the method name can not be recognized correctly. (The currengPageType is 0 so the subheader count is required not equal to 0.)

However , overriding "UTF-8" encoding to it I can import row data:
SasFileReaderImpl sasFileReader = new SasFileReaderImpl(fis, "UTF-8");
I found the compression method substring is correct.
The text was updated successfully, but these errors were encountered: