Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can not parse row data with a “SHIFT_JIS” SAS file. #55

Closed
nixiejun opened this issue Dec 3, 2019 · 7 comments
Closed

Can not parse row data with a “SHIFT_JIS” SAS file. #55

nixiejun opened this issue Dec 3, 2019 · 7 comments

Comments

@nixiejun
Copy link

nixiejun commented Dec 3, 2019

Hi contributors:
I found my SAS file was not parsed correctly under "SHIFT_JIS" encoding.

This file seems has the compression method "SASYZCRL" however under this encoding the method name can not be recognized correctly. (The currengPageType is 0 so the subheader count is required not equal to 0.)
image

However , overriding "UTF-8" encoding to it I can import row data:
SasFileReaderImpl sasFileReader = new SasFileReaderImpl(fis, "UTF-8");
I found the compression method substring is correct.

image

@nixiejun
Copy link
Author

@printsev @TimurMikhaliov
Could you please help to take a look? Thanks.

@nixiejun nixiejun changed the title Can not parse row data with a new created SAS file. Can not parse row data with a “SHIFT_JIS” SAS file. Mar 19, 2020
@printsev
Copy link
Contributor

Hi @nixiejun -- would it be possible for you to share the source file? Then we would be able to take a look. Thanks.

@printsev
Copy link
Contributor

we cannot do much here unfortunately without the data set so I'm putting the "nodataset" label to the issue

@FriedEgg
Copy link
Contributor

Here are some datasets for this. I haven't got time to look into the issue more closely, but hopefully this is helpful to others.

charset-sjis.zip

@xantorohara
Copy link
Contributor

This issue can be marked again as "nodataset".
Provided "charset-sjis.zip" contains a copy of existing file "https://github.com/epam/parso/blob/master/src/test/resources/sas7bdat/charset_sjis.sas7bdat", and Parso successfully reads it.
As I see, the problem reported by @nixiejun is in a combination of SASYZCRL compression and SHIFT_JIS encoding (looks like compression method name is broken in a first character after the parsing).

@xantorohara
Copy link
Contributor

It is not correct to check compression method using String.contains() method over the header bytes converter to a multibyte string. Octets, related to different things, combined together and then converted to a multibyte string may form unpredictable characters.

See:

new String(new byte[]{-104, 'S', 'A', 'S', 'Y', 'Z', 'C', 'R', 'L'}, "Shift_JIS").contains("SASYZCRL"); //false for "牢ASYZCRL"
new String(new byte[]{-104, 'S', 'A', 'S', 'Y', 'Z', 'C', 'R', 'L'}, "UTF-8").contains("SASYZCRL"); //true for "�SASYZCRL"
new String(new byte[]{-104, 'S', 'A', 'S', 'Y', 'Z', 'C', 'R', 'L'}, "US-ASCII").contains("SASYZCRL"); //true for "�SASYZCRL"

So, the compression method like any other fields should be extracted from an explicit position.

xantorohara added a commit to xantorohara/parso that referenced this issue Dec 3, 2020
It looks like `SasFileParser.FormatAndLabelSubheader.processSubheader` and and `SasFileConstants`
files should use another offsets to calculate column format information correctly for x64 bitness.

Before a fix:
```java
long COLUMN_FORMAT_WIDTH_OFFSET = 8L;
long COLUMN_FORMAT_PRECISION_OFFSET = 10L;

subheaderOffset + COLUMN_FORMAT_WIDTH_OFFSET + intOrLongLength,
subheaderOffset + COLUMN_FORMAT_PRECISION_OFFSET + intOrLongLength,
```

For 32-bit:
COLUMN_FORMAT_WIDTH_OFFSET + intOrLongLength = 8 + 4 = 12 (correct)
COLUMN_FORMAT_PRECISION_OFFSET + intOrLongLength = 10 + 4 = 14 (correct)

For 64-bit:
COLUMN_FORMAT_WIDTH_OFFSET + intOrLongLength = 8 + 8 = 16 (not correct)
COLUMN_FORMAT_PRECISION_OFFSET + intOrLongLength = 10 + 8 = 18 (not correct)

After the fix:
```java
long COLUMN_FORMAT_WIDTH_OFFSET = 0L;
long COLUMN_FORMAT_PRECISION_OFFSET = 2L;

subheaderOffset + COLUMN_FORMAT_PRECISION_OFFSET + 3 * intOrLongLength,
subheaderOffset + COLUMN_FORMAT_TEXT_SUBHEADER_INDEX_OFFSET + 3 * intOrLongLength,
```

For 32-bit:
COLUMN_FORMAT_WIDTH_OFFSET + 3 * intOrLongLength = 0 + 3 * 4 = 12 (correct and the same as above)
COLUMN_FORMAT_PRECISION_OFFSET + 3 * intOrLongLength = 2 + 3 * 4 = 14 (correct and the same as above)

For 64-bit:
COLUMN_FORMAT_WIDTH_OFFSET + 3 *intOrLongLength = 0 + 3 * 8 = 24 (correct)
COLUMN_FORMAT_PRECISION_OFFSET + 3 * intOrLongLength = 2 + 3 * 8 = 26 (correct)

So, this new calculation gives exactly the same offsets for 32-bit files and also fixes it for 64-bit files.
xantorohara added a commit to xantorohara/parso that referenced this issue Dec 3, 2020
It used to be possible to misidentify compression method using the String.contains method over the whole subheader text (Imagine, you have a column name the same as one of compression method name within the file without compression at all. In this case compression type will be mistakenly detected).
Also it was a problem with searching of compression name over the whole encoded subheader text containing multi-byte characters.
printsev pushed a commit that referenced this issue Dec 3, 2020
It used to be possible to misidentify compression method using the String.contains method over the whole subheader text (Imagine, you have a column name the same as one of compression method name within the file without compression at all. In this case compression type will be mistakenly detected).
Also it was a problem with searching of compression name over the whole encoded subheader text containing multi-byte characters.
@printsev
Copy link
Contributor

printsev commented Dec 3, 2020

closed thanks to xantorohara

@printsev printsev closed this as completed Dec 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants