Supporting linebreaks inside double quoted csv fields #4
Comments
Sorry for the late response.

Yes, working with CSVs containing commas and newlines inside a cell value is a pain... It does mean that CSV is not context free and the file line count doesn't equal the row count. It's also the kind of place where Rainbow CSV would be the most helpful. If my CSVs had no newlines or commas in records, I wouldn't have needed Rainbow CSV. It's these more complex files where I really wish it supported quoted values. Consider a setting?

It's a valid stance to say CSV has no definitive standard and you chose not to handle quoted values at all, but I believe that's less useful in practice. I would assume RFC 4180-ish parsing would be the more common expectation, but maybe I'm wrong?

Yes, RFC 4180's choice of line endings is bizarre and a sad concession for Excel compatibility. I believe any sane parser would handle any style of line endings. Ours does. Most open-source parsers that I have looked at do.

I've tried a number of popular CSV editors including Excel, Google Sheets, LibreOffice, CSVPad, and various less popular editors. All of them allow values containing commas and newlines inside a cell, and they quote cells that use these special characters. As is, Rainbow CSV's value is a bit limited for anyone working with CSVs containing arbitrary user data emitted by any popular editor.

My use case is translation tables, which often have text containing commas and newlines. We can't forbid commas, so we follow RFC 4180 with the exception of the line endings. I'm editing parser test suite data by hand and generally avoiding opening a more heavyweight editor for small table tweaks. Editing a non-trivial RFC 4180-ish CSV by hand with no visual assist can be challenging. Rainbow CSV would be all I'd need.

Comments on your numbered points:
For all the rest of your points... Those are all real problems. Ignoring them is an option, but I'm not sure it's the most appropriate one. They might just be intrinsic to the format... |
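To make the "line count doesn't equal the row count" point concrete, here is a minimal sketch using Python's standard `csv` module, which parses quoted fields in an RFC 4180-like way. The sample data is made up for illustration and is not taken from the thread.

```python
import csv
import io

# Hypothetical sample: the second record has a quoted field that
# contains both a comma and a newline.
sample = 'id,text\n1,"Hello, world\nsecond line"\n2,plain\n'

# Physical lines in the file.
line_count = len(sample.splitlines())

# Logical records as seen by a quote-aware parser.
# newline="" keeps embedded line breaks untouched.
rows = list(csv.reader(io.StringIO(sample, newline="")))

print("physical lines:", line_count)
print("logical rows:  ", len(rows))
print("second record: ", rows[1])
```

A highlighter that works strictly line by line sees the larger number of lines, while a quote-aware parser sees the smaller number of records, which is the mismatch discussed above.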
OK, you are right, it would be reasonable to support linebreaks in fields in a special mode.
This smaller regex, in turn, consists of 3 parts separated by OR
So it is easy to modify parts 1 and 3, so that a quoted field would match across linebreaks and "error" would match till the end of the file.

Note that the regular expressions are matched against only a single line of the document at a time. That means it is not possible to use a pattern that matches multiple lines. The reason for this is technical: being able to restart the parser at an arbitrary line and having to re-parse only the minimal number of lines affected by an edit. In most situations it is possible to use the begin/end model to overcome this limitation.

So it is currently not possible to use the modified regexp to match across multiple lines (unless VSCode would introduce a special "xlmatch" grammar keyword that would allow usage of

I will probably try to modify the rainbow grammar in Vim to support linebreaks in fields just to see whether it would work or not. And again, there is a fundamental difference between CSV that allows linebreaks and CSV that doesn't. |
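The single-line restriction can be illustrated outside of VSCode. The sketch below is not the TextMate engine; it is a small stand-in that scans one line at a time and carries a single piece of state ("are we inside an open quoted field?") from line to line, which is roughly what the begin/end model provides. The function name and sample lines are hypothetical, and the quote-parity shortcut assumes well-formed RFC 4180-style quoting.

```python
def starts_inside_quotes(lines, quote='"'):
    """For each line, report whether it begins inside an open quoted field.

    Assumes RFC 4180-style quoting: an escaped quote is written as two
    quote characters, so it does not change the parity of the count.
    """
    inside = False
    result = []
    for line in lines:
        result.append(inside)
        # An odd number of quote characters flips the state for the next line.
        if line.count(quote) % 2 == 1:
            inside = not inside
    return result


lines = ['a,"Hello', 'still the same cell",b', 'c,d']
print(starts_inside_quotes(lines))
```

A regex applied to the second line in isolation has no way of knowing that the line starts in the middle of a cell; only the carried state makes that visible.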
I made a proof of concept in Vim. When highlighting gets glitchy, this command fixes it:
|
Just want to comment that I would love to see this implemented in the VS Code version |
How is it going? I hope this will help. |
@ffxivvillein VSCode language grammars do not match across newlines, so even if we include newline in regexp it won't work. |
Based on the duplicate issues being opened, there really is a practical need to have this feature. |
@Rots, I understand that there is a practical need for this. Actually I've recently implemented an option for RBQL console to correctly handle multiline values. This can be useful for cases when it is not required to keep the original file i.e. it is acceptable to replace newlines with e.g. 4 spaces. In this case you can write a query like this (with Python backend): |
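The same newline-flattening idea can be sketched in plain Python for readers who are not using RBQL. This is not the RBQL query referred to above; it is an independent illustration built on the standard `csv` module, and the function name `flatten_multiline_fields` is made up for this example.

```python
import csv
import io


def flatten_multiline_fields(text, replacement="    "):
    """Re-emit CSV text with line breaks inside fields replaced.

    The original file is not preserved: embedded newlines become
    `replacement` (four spaces by default).
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(
            [field.replace("\r\n", replacement).replace("\n", replacement)
             for field in row]
        )
    return out.getvalue()


print(flatten_multiline_fields('id,text\n1,"first line\nsecond line"\n'))
```

After flattening, every record occupies exactly one physical line, so line-based tools work again, at the cost of losing the original line breaks.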
I implemented multiline fields support in Sublime Text version of "Rainbow CSV": https://packagecontrol.io/packages/rainbow_csv |
What about multi-line strings (marked by double-quotes) being correctly interpreted in other formats (e.g. programming languages) in VSCode? Is it done by something else rather than the syntax engine you mentioned? |
@JanisE Good question! Yes, it may seem that many other grammars in VSCode don't have any problems with multiline strings. But this is because they are using |
I came looking for exactly this. I was getting confused/frustrated that CSVLint and highlighting weren't working for our "valid" quoted multi-line values, especially given that I saw the option for it in the RBQL Console, which worked perfectly. I realize this is due to the technical limitations outlined above, but it's still really, really unfortunate. Our use case warrants being able to have multi-line values. |
@jjspace Thanks, I think it is possible to adjust CSVLint so that it can accept quoted multi-line strings, and I agree that this could be a useful improvement in some cases. |
I am exploring a potential path forward to support multiline highlighting using a combination of these two VSCode API features:
This approach looks promising, but there are some issues associated with it (including the dynamic nature of the highlighting and out-of-theme colors) so I am not sure if the end result would be decent enough to publish. |
I stumbled on this issue the other day, and decided to try my hand at it. I am not certain it can be extended to all valid CSV separators, but I found a way to achieve multiline quote support using nested "begin" and "end" matchers. I thought I'd share it in case it turns out to be useful. Basically, each pattern matches the entire rest of the row, but defines a nested pattern for the next color. Once you get to the last color, you use
{
"name": "csv syntax",
"scopeName": "text.csv",
"fileTypes": ["csv"],
"patterns": [
{
"name": "variable.other.rainbow1",
"begin": "^|\\,",
"end": "(?=\\n)",
"patterns": [
{ "include": "#quotedvalue" },
{
"name": "keyword.rainbow2",
"begin": "\\,",
"end": "(?=\\n)",
"patterns": [
{ "include": "#quotedvalue" },
{
"name": "entity.name.function.rainbow3",
"begin": "\\,",
"end": "(?=\\n)",
"patterns": [
{ "include": "#quotedvalue" },
{ "include": "$self" }
]
}
]
}
]
}
],
"repository": {
"quotedvalue": {
"begin": "\"",
"end": "\""
}
},
"uuid": "ca03e352-04ef-4340-9a6b-9b99aae1c418"
}
|
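As a rough model of what the nested grammar above achieves (this is a plain Python sketch, not the TextMate engine and not the extension's code), one can assign each character a color index that advances at every unquoted comma, wraps around after the last color, and resets at every unquoted newline. The function name and the three-color default are assumptions made for illustration.

```python
def color_indices(text, delim=",", quote='"', n_colors=3):
    """Return one color index per character (None for record-ending newlines).

    A delimiter takes the color of the column it opens, mirroring the
    grammar above where each nested scope begins at the comma.
    """
    colors = []
    column = 0
    in_quotes = False
    for ch in text:
        if ch == quote:
            in_quotes = not in_quotes
        elif not in_quotes and ch == delim:
            column += 1
        elif not in_quotes and ch == "\n":
            # End of a logical record: start again from the first color.
            colors.append(None)
            column = 0
            continue
        colors.append(column % n_colors)
    return colors


# The quoted field spans two physical lines but keeps a single color.
print(color_indices('a,"x\ny",b\nc'))
```

The newline inside the quoted field does not reset the column, which is the behavior the nested begin/end patterns are meant to produce.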
Thank you very much, @ajhyndman! |
I just published version 3.0.0, which supports multiline fields with some limitations and rough edges. The new RFC 4180-compatible dialect is implemented as a separate "syntax" called "dynamic CSV", meaning that highlighting is done through the VSCode syntax tokens mechanism instead of a prebuilt grammar. The correct multiline "dynamic CSV" dialect can trigger in 2 cases:
If everything goes well I will soon publish version 4.0.0 with the syntax proposed by @ajhyndman which will combine two CSV dialects into one. |
I've not been able to get this to work with the demo csv from the top of this thread
Running the command
Running the command |
@petervandivier I just tested and it worked for me. However, I discovered that when the filetype is already "Dynamic CSV" the highlighting doesn't change immediately, and I have to click on another tab and then back for the new highlighting to take effect. This has to be fixed. (I would also have to fix the amusing side effect, sorry. The separator should not be an empty string, of course.) I also noticed in your screenshot that the filetype is CSV, but it should be "Dynamic CSV". Are you sure that the comma character is selected when you are running the command? This is how it looks for me: |
I thought maybe I was having a bad interaction with another extension, but I've tried this in a VM with a fresh VS Code install and Rainbow CSV as the only extension, and the behavior still seems off. Notably(?), Dynamic CSV doesn't lint at all for me in either my normal install or the VM 😕 Sorry I can't figure it out. Hope the feedback is helpful. |
@petervandivier Could you please try it at http://vscode.dev and perhaps check if |
@petervandivier Actually I was just able to reproduce this. One of the problems is that the file is too short to be recognized as multiline CSV (if you make it at least 10 lines it will work differently; I might fix this later), so it gets assigned the standard csv dialect instead of rfc csv (dynamic csv) because it has a .csv extension. Now when you switch it to "Dynamic CSV" through the filetype selection menu (instead of using |
Confirm the large file works locally for me (although it was briefly auto-detected as Dynamic CSV before my markdown extension hijacked the language mode to markdown and I had to manually set it back). The large file also works on vscode.dev. The small file also works locally now on 3.1.0. Worth noting, though, that the comma character must be selected when invoking Thanks! 🥳 |
Update: I was also experimenting with the grammar created by @ajhyndman which I want to use as the default syntax for csv files. So far I discovered one issue - if a csv file (even a regular one, without multiline fields) contains a comment line with unbalanced double quotes (this probably doesn't happen very often, but still) it would mess up highlighting for the whole file, e.g.
Setting There are some workarounds for this, e.g. using |
No major issues have been reported with the new "dynamic csv" decoration-based highlighting since its launch, so we can assume that it works well enough. I am finally closing this issue. Further discussion of this topic can be held in new, more focused GitHub issues. Thanks, everyone! |
According to RFC 4180 the quoted entry spanning a couple lines should be a single cell (i.e. everything after "Hello" in this example should be in blue)