Can not parse row data with a “SHIFT_JIS” SAS file. #55

nixiejun · 2019-12-03T09:10:04Z

Hi contributors:
I found my SAS file was not parsed correctly under "SHIFT_JIS" encoding.

This file seems has the compression method "SASYZCRL" however under this encoding the method name can not be recognized correctly. (The currengPageType is 0 so the subheader count is required not equal to 0.)

However , overriding "UTF-8" encoding to it I can import row data:
SasFileReaderImpl sasFileReader = new SasFileReaderImpl(fis, "UTF-8");
I found the compression method substring is correct.

nixiejun · 2020-03-19T08:16:04Z

@printsev @TimurMikhaliov
Could you please help to take a look? Thanks.

printsev · 2020-08-19T08:52:55Z

Hi @nixiejun -- would it be possible for you to share the source file? Then we would be able to take a look. Thanks.

printsev · 2020-11-28T05:16:29Z

we cannot do much here unfortunately without the data set so I'm putting the "nodataset" label to the issue

FriedEgg · 2020-11-30T20:06:04Z

Here are some datasets for this. I haven't got time to look into the issue more closely, but hopefully this is helpful to others.

charset-sjis.zip

xantorohara · 2020-12-01T21:08:19Z

This issue can be marked again as "nodataset".
Provided "charset-sjis.zip" contains a copy of existing file "https://github.com/epam/parso/blob/master/src/test/resources/sas7bdat/charset_sjis.sas7bdat", and Parso successfully reads it.
As I see, the problem reported by @nixiejun is in a combination of SASYZCRL compression and SHIFT_JIS encoding (looks like compression method name is broken in a first character after the parsing).

xantorohara · 2020-12-02T13:19:11Z

It is not correct to check compression method using String.contains() method over the header bytes converter to a multibyte string. Octets, related to different things, combined together and then converted to a multibyte string may form unpredictable characters.

See:

new String(new byte[]{-104, 'S', 'A', 'S', 'Y', 'Z', 'C', 'R', 'L'}, "Shift_JIS").contains("SASYZCRL"); //false for "牢ASYZCRL"
new String(new byte[]{-104, 'S', 'A', 'S', 'Y', 'Z', 'C', 'R', 'L'}, "UTF-8").contains("SASYZCRL"); //true for "�SASYZCRL"
new String(new byte[]{-104, 'S', 'A', 'S', 'Y', 'Z', 'C', 'R', 'L'}, "US-ASCII").contains("SASYZCRL"); //true for "�SASYZCRL"

So, the compression method like any other fields should be extracted from an explicit position.

It looks like `SasFileParser.FormatAndLabelSubheader.processSubheader` and and `SasFileConstants` files should use another offsets to calculate column format information correctly for x64 bitness. Before a fix: ```java long COLUMN_FORMAT_WIDTH_OFFSET = 8L; long COLUMN_FORMAT_PRECISION_OFFSET = 10L; subheaderOffset + COLUMN_FORMAT_WIDTH_OFFSET + intOrLongLength, subheaderOffset + COLUMN_FORMAT_PRECISION_OFFSET + intOrLongLength, ``` For 32-bit: COLUMN_FORMAT_WIDTH_OFFSET + intOrLongLength = 8 + 4 = 12 (correct) COLUMN_FORMAT_PRECISION_OFFSET + intOrLongLength = 10 + 4 = 14 (correct) For 64-bit: COLUMN_FORMAT_WIDTH_OFFSET + intOrLongLength = 8 + 8 = 16 (not correct) COLUMN_FORMAT_PRECISION_OFFSET + intOrLongLength = 10 + 8 = 18 (not correct) After the fix: ```java long COLUMN_FORMAT_WIDTH_OFFSET = 0L; long COLUMN_FORMAT_PRECISION_OFFSET = 2L; subheaderOffset + COLUMN_FORMAT_PRECISION_OFFSET + 3 * intOrLongLength, subheaderOffset + COLUMN_FORMAT_TEXT_SUBHEADER_INDEX_OFFSET + 3 * intOrLongLength, ``` For 32-bit: COLUMN_FORMAT_WIDTH_OFFSET + 3 * intOrLongLength = 0 + 3 * 4 = 12 (correct and the same as above) COLUMN_FORMAT_PRECISION_OFFSET + 3 * intOrLongLength = 2 + 3 * 4 = 14 (correct and the same as above) For 64-bit: COLUMN_FORMAT_WIDTH_OFFSET + 3 *intOrLongLength = 0 + 3 * 8 = 24 (correct) COLUMN_FORMAT_PRECISION_OFFSET + 3 * intOrLongLength = 2 + 3 * 8 = 26 (correct) So, this new calculation gives exactly the same offsets for 32-bit files and also fixes it for 64-bit files.

It used to be possible to misidentify compression method using the String.contains method over the whole subheader text (Imagine, you have a column name the same as one of compression method name within the file without compression at all. In this case compression type will be mistakenly detected). Also it was a problem with searching of compression name over the whole encoded subheader text containing multi-byte characters.

printsev · 2020-12-03T12:43:56Z

closed thanks to xantorohara

nixiejun changed the title ~~Can not parse row data with a new created SAS file.~~ Can not parse row data with a “SHIFT_JIS” SAS file. Mar 19, 2020

printsev added the nodataset label Nov 28, 2020

printsev added datasetexists and removed nodataset labels Dec 1, 2020

printsev closed this as completed Dec 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can not parse row data with a “SHIFT_JIS” SAS file. #55

Can not parse row data with a “SHIFT_JIS” SAS file. #55

nixiejun commented Dec 3, 2019 •

edited

Loading

nixiejun commented Mar 19, 2020

printsev commented Aug 19, 2020

printsev commented Nov 28, 2020

FriedEgg commented Nov 30, 2020

xantorohara commented Dec 1, 2020

xantorohara commented Dec 2, 2020

printsev commented Dec 3, 2020

Can not parse row data with a “SHIFT_JIS” SAS file. #55

Can not parse row data with a “SHIFT_JIS” SAS file. #55

Comments

nixiejun commented Dec 3, 2019 • edited Loading

nixiejun commented Mar 19, 2020

printsev commented Aug 19, 2020

printsev commented Nov 28, 2020

FriedEgg commented Nov 30, 2020

xantorohara commented Dec 1, 2020

xantorohara commented Dec 2, 2020

printsev commented Dec 3, 2020

nixiejun commented Dec 3, 2019 •

edited

Loading