Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supporting linebreaks inside double quoted csv fields #4

Closed
jovannic opened this issue Feb 26, 2018 · 26 comments
Closed

Supporting linebreaks inside double quoted csv fields #4

jovannic opened this issue Feb 26, 2018 · 26 comments
Labels

Comments

@jovannic
Copy link

Key,Text
X,"Hello, mechatroner,
This, incorrectly interpreted as a new row with two columns"

image

According to RFC 4180 the quoted entry spanning a couple lines should be a single cell (i.e. everything after "Hello" in this example should be in blue)

@mechatroner

This comment was marked as outdated.

@jovannic
Copy link
Author

jovannic commented Mar 4, 2018

Sorry for the late response.

Yes, working with CSVs containing commas and newlines inside of a cell value is a pain...

It does mean that CSV is not context free and the file line count doesn't equal the row count. It's also the kind of place where Rainbow CSV would be the most helpful. If my CSVs had no newlines or commas in records I wouldn't have needed Rainbow CSV. It's these more complex files where I really wish it supported quoted values. Consider a setting?

It's a valid stance to say CSV has no definitive standard and you chose to not handle quoted values at all, but I believe that's less useful in practice. I would assume RFC 4180-ish parsing would be the more common expectation, but maybe I'm wrong?

Yes RFC 4180's choice of line endings is bizarre and a sad concession for Excel compatibility. I believe any sane parser would handle any style of line endings. Ours does. Most open-source parsers that I have looked at do.

I've tried a number of popular CSV editors including Excel, Google Sheets, Libre Office, CSVPad, and various less popular editors. All of them allow values containing commas and newlines inside of a cell and quote cells using these special characters.

As is Rainbow CSV's value is a bit limited for anyone working with CSVs containing arbitrary user data emitted by any popular editor. My use case is translation tables which often have text containing commas and newlines. We can't forbid commas, so we follow RFC 4180 with exception of the line endings. I'm editing parser test suite data by hand and generally avoiding opening a more heavyweight editor for small table tweaks. Editing a non-trivial RFC 4180-ish CSV by hand with no visual assist can be challenging. Rainbow CSV would be all I'd need.

Comments on your numbered points:

  1. Many "programmers" can't even handle fizzbuzz. I'm not sure if most programmers making a simplistic assumption is an argument for not supporting it? As tool/library writers our job is to think about those things for the people who won't. For any use case with user data you can't just say "no commas allowed." and in that case line.split(',') falls apart.

For all the rest of your points... Those are all real problems. Ignoring them is a an option but I'm not sure if it's the most appropriate one? They might just be intrinsic to the format...

@mechatroner
Copy link
Owner

OK, You are right, it would be reasonable to support linebreaks in fields in a special mode.
So I spent some time trying to implement a grammar that would do it.
I've started with the grammar that I use in Atom, it is similar to VSCode grammar (since both editors use TextMate-based grammars), but has some differences. So the main regex in Atom consists of 10 identical parts:

(?:((?:"(?:[^"]*"")*[^"]*"(?:,|$))|(?:[^",]*(?:,|$)))|(.+))?

This smaller regex, in turn, consists of 3 parts separated by OR | character:

  1. (?:"(?:[^"]*"")*[^"]*"(?:,|$)) - here we try to match double quoted csv field e.g. the second field in this line: 1234,"hello,""world""",1234
  2. (?:[^",]*(?:,|$)) - tries to match unquoted csv field
  3. (.+) - If neither of 1 and 2 has matched - that means that csv line is incorrectly formatted, so we produce "error" match until the end of the line.

So it is easy to modify part 1 and 3, so that quoted field would match across linebreaks and "error" would match till the end of the file.
part 1 would look like this (?:"(?:(?:[^"]|\n)*"")*(?:[^"]|\n)*"(?:,|$)) and part 3 like this: ((?:.|\n)+).
But this will not work, because Oniguruma/TextMate regexps will not match across linebreaks quote:

Note that the regular expressions are matched against only a single line of the document at a time. That means it is not possible to use a pattern that matches multiple lines. The reason for this is technical: being able to restart the parser at an arbitrary line and having to re-parse only the minimal number of lines affected by an edit. In most situations it is possible to use the begin/end model to overcome this limitation.

So it is currently not possible to use the modified regexp to match across multiple lines (unless VSCode would introduce a special "xlmatch" grammar keyword that would allow usage of \n in it). Someone could try to use the mentioned "begin/end" TextMate model, but I would be very suprised if it would work.

I will probably try to modify rainbow grammar in Vim to support linebreaks in fields just to see whether it would work or not.

And again there is a fundamental difference between csv which allow linebreaks and csv that doesn't.
Many algorithms (like syntax highlighting, column number query, random sampling) will work in O(N) on the former and in O(1) on the later.

@mechatroner mechatroner changed the title Commas or newlines in quoted entries are handled incorrectly Supporting linebreaks inside double quoted csv fields Mar 8, 2018
@mechatroner
Copy link
Owner

mechatroner commented Mar 10, 2018

I made a proof of concept in Vim

image

When highlighting gets glitchy this command fixes it:

:syntax sync fromstart

@krazyito65
Copy link

Jut want to comment I would love to see this implemented in the vs code verison

@ffxivvillein
Copy link

How is going on? I hope this will help.
jjuback/gc-excelviewer@30a4398

@mechatroner
Copy link
Owner

@ffxivvillein VSCode language grammars do not match across newlines, so even if we include newline in regexp it won't work.

@Rots
Copy link

Rots commented Feb 3, 2020

Based on the duplicate issues being opened, there really is a practical need to have this feature.
From my personal perspective, I'd prefer to have the special mode on by default (in our environment, we write and read CSV files programmatically with csv libraries, and in all of them multi-line values are treated as valid cells), but I'd be happy to manually switch to "the multi-line mode" explicitly, if necessary.

@mechatroner
Copy link
Owner

@Rots, I understand that there is a practical need for this. Actually I've recently implemented an option for RBQL console to correctly handle multiline values. This can be useful for cases when it is not required to keep the original file i.e. it is acceptable to replace newlines with e.g. 4 spaces. In this case you can write a query like this (with Python backend): select a1.replace('\n', ' '), a2.replace('\n', ' ') And get a new file with newlines replaced by the special sequence. Also "rfc_csv" syntax is now fully supported in Vim version of "Rainbow CSV".
Existing textmate syntax engine for VSCode does not support newlines in regular expression.
Currently VScode team is considering to adopt a new tokenization engine: microsoft/vscode#77140 I will leave a comment on that ticket to list the set of features that we might need to implement multi-line values highlighting.

@mechatroner
Copy link
Owner

I implemented multiline fields support in Sublime Text version of "Rainbow CSV": https://packagecontrol.io/packages/rainbow_csv
Still no solution in sight for VSCode.

@JanisE
Copy link

JanisE commented Sep 2, 2021

What about multi-line strings (marked by double-quotes) being correctly interpreted in other formats (e.g. programming languages) in VSCode? Is it done by something else rather than the syntax engine you mentioned?

@mechatroner
Copy link
Owner

@JanisE Good question! Yes, it may seem that many other grammars in VSCode don't have any problems with multiline strings. But this is because they are using begin/end matching which Rainbow CSV can't use. The 2 types of matching are described here: https://macromates.com/manual/en/language_grammars#language_rules Rainbow CSV has to rely only on the first matching type which is provided in a single match expression. We can't use begin/end matching because CSV-s are non-recursive.

@jjspace
Copy link

jjspace commented Nov 11, 2021

I came looking for exactly this. I was getting confused/frustrated that the CSVLint and highlighting wasn't working for our "valid" quoted multi-line values, especially given that I saw the option for it in the RBQL Console which worked perfectly. I realize this is due to the technical limitations outlined above but it's still really really unfortunate, our use case warrants being able to have multi-line values.
I guess a halfway ask is whether the CSVLint processor can accept quoted multi-line strings as valid even if the highlighting can't show it? I'd be perfectly fine with a setting to enable/disable as desired, same as in the RBQL Console.

@mechatroner
Copy link
Owner

@jjspace Thanks, I think it is possible to adjust CSVLint so that it can accept quoted multi-line strings, and I agree that this could be a useful improvement in some cases.

@mechatroner
Copy link
Owner

I am exploring a potential path forward to support multiline highlighting using a combination of these two VSCode API features:

  1. onDidChangeTextEditorVisibleRanges event which would allow dynamic highlighting of the currently visible file fragment.
  2. setDecorations method (or alternatively semantic fragment coloring - haven't tested this feature yet).

This approach looks promising, but there are some issues associated with it (including the dynamic nature of the highlighting and out-of-theme colors) so I am not sure if the end result would be decent enough to publish.

@ajhyndman
Copy link

ajhyndman commented Jul 21, 2022

I stumbled on this issue the other day, and decided to try my hand at it.

I am not certain it can be extended to all valid CSV separators, but I found a way to achieve multiline quote support using nested "begin" and "end" matchers. I thought I'd share it in case it turns out to be useful.

Basically, each pattern matches the entire rest of the row, but defines a nested pattern for the next color. Once you get to the last color, you use "$self" to loop back to the start of the grammar. I also use a lookahead matcher to define "end", so that a single newline character can end all active rainbow patterns.

{
  "name": "csv syntax",
  "scopeName": "text.csv",
  "fileTypes": ["csv"],
  "patterns": [
    {
      "name": "variable.other.rainbow1",
      "begin": "^|\\,",
      "end": "(?=\\n)",
      "patterns": [
        { "include": "#quotedvalue" },
        {
          "name": "keyword.rainbow2",
          "begin": "\\,",
          "end": "(?=\\n)",
          "patterns": [
            { "include": "#quotedvalue" },
            {
              "name": "entity.name.function.rainbow3",
              "begin": "\\,",
              "end": "(?=\\n)",
              "patterns": [
                { "include": "#quotedvalue" },
                { "include": "$self" }
              ]
            }
          ]
        }
      ]
    }
  ],
  "repository": {
    "quotedvalue": {
      "begin": "\"",
      "end": "\""
    }
  },
  "uuid": "ca03e352-04ef-4340-9a6b-9b99aae1c418"
}

@mechatroner
Copy link
Owner

Thank you very much, @ajhyndman!
Your syntax is very impressive, took me some time to understand how it works. I was absolutely sure that there is no way to implement something like this using TextMate grammar, turns out I was wrong about it.
I will definitely consider using your syntax as a backup plan or as an improvement for the solution that I am currently developing in the rfc branch - I recently discovered two other approaches using VSCode tokenization API, one of them being vscode.languages.registerDocumentRangeSemanticTokensProvider - it allows to highlight just the lines which are currently visible in the editor window + some margin lines around them. This allows to fully support RFC 4180 and also use dynamic highlighting for all possible separator strings (even multi-char separators) and also highlight comment lines. But the semantic tokenization method has some drawbacks too, soo it is great that we now have this alternative solution!
I hope to publish something within the next month.

@mechatroner
Copy link
Owner

I just published version 3.0.0 which supports multiline fields with some limitations and raw edges, the new RFC4180-compatible dialect is implemented a separate "syntax" which is called "dynamic CSV" - it means that highlighting is done through VSCode syntax tokens mechanism instead of a pre-build grammar. The correct multiline "dynamic CSV" dialect can trigger in 2 cases:

  1. Automatically, when Rainbow CSV detects that the newly opened CSV file contains multiline fields.
  2. Manually, you need to select separator (currently only , and ; are supported) and run Rainbow CSV: Set separator - Quoted Multiline Fields command.

If everything goes well I will soon publish version 4.0.0 with the syntax proposed by @ajhyndman which will combine two CSV dialects into one.
BTW version 3.0.0 has also some other features such as comment lines highlighting and support for infinite number of single-character and multicharacter separators - this is the reason why dynamic syntax highlighting is crucial and will remain as one of the highlighting mechanisms of the extension anyway. But pre-build grammars are still superior in many ways so I am planning to continue using them for popular dialects such as csv, tsv, pipe-separated files, and some others.

@petervandivier
Copy link

I've not been able to get this to work with the demo csv from the top of this thread

Key,Text
X,"Hello, mechatroner,
This, incorrectly interpreted as a new row with two columns"

image

Running the command Rainbow CSV: Set separator - Quoted Multiline Fields results in the following error

Only comma and semicolon separators are currently supported to use with multiline fields.

Running the command Rainbow CSV: Set separator - Basic appears to have no effect when a comma , is selected; although it does have an amusing side effect when no character is selected 😝

image

@mechatroner
Copy link
Owner

@petervandivier I just tested and it worked for me. Although I discovered that when the filetype is already "Dynamic CSV" the highlighting doesn't change immediately and I have to click on another tab and then back for the new highlighting to take effect, this has to be fixed ( I would also have to fix the amusing side effect, sorry. The separator should not be an empty string, of course). And I noticed on your screenshot that the filetype is CSV, but it should be "Dynamic CSV", are you sure that the comma character is selected when you are running the command? This is how it looks for me:
rainbow_screen

@petervandivier
Copy link

petervandivier commented Sep 14, 2022

I thought maybe I was having a bad interaction with another extension but I've tried this in a VM with a fresh VS Code install and Rainbow CSV as the only extension and the behavior still seems off ☹️ (also tried tab switching though that's not reflected in the screencaps)

image

image

Notably(?), Dynamic CSV doesn't lint at all for me in either my normal install or the VM 😕

Sorry I can't figure it out. Hope the feedback is helpful.

@mechatroner
Copy link
Owner

@petervandivier Could you please try it at http://vscode.dev please and perhaps check if Set separator -Basic works with some exotic separators like 0 (this would ensure that a non-builtin highlighting is used). Also, I just published version 3.1.0 (with some bug fixes mostly) which allows selecting a separator in the pop-up menu after Dynamic CSV filetype selection in the bottom right corner. Can I also ask you to rename the file (with non-csv extension) and perform the experiment again and maybe also just copy the file content into a new empty text file (File -> New Text File) and maybe also open a large text file with multiline fields e.g. this one: https://github.com/okfn/publicbodies/blob/main/data/br.csv (the test is important because it allows the autodetection to kick in, which would be the most frequent use case)?

@mechatroner
Copy link
Owner

@petervandivier Actually I was just able to reproduce this, one of the problems is that the file is too short (if you make it 10 lines at least is will work differently) for it to be recognized as multiline csv (I might fix this later) so it gets assigned the standard csv dialect instead of rfc csv (dynamic csv) because it has .csv extension. Now when you switch it to "Dynamic CSV" through the filetype selection menu (instead of using Rainbow CSV: Set separator - Quoted Multiline Fields, the old dialect (tuple separator, policy) gets carried over (the 3.0.0 had a bug/feature to prevent it from working with the standard quoted csv dialect). So there are some other issues here that I can fix e.g. not to carry over dialects which were assigned automatically, only carry over to dynamic csv manually selected dialects, so I will fix this. Plus finally enabling pre-build universal rfc-compatible csv grammar will fix this too. The bottom line is that we will get it right. And thanks for reporting this, I will do some adjustments in the next release.

@petervandivier
Copy link

Confirm the large file works locally for me (although it was briefly auto-detected as Dynamic CSV before my markdown extension highjacked the language mode to markdown and I had to manually set it back). The large file also works on vscode.dev

The small file also works locally now on 3.1.0 - worth noting though that the comma character must be selected when invoking Rainbow CSV: Set separator - Quoted Multiline Fields

image

Thanks! 🥳

@mechatroner
Copy link
Owner

Update:
I recently published version 3.3 added support for column alignment for CSV files with multiline fields - this provides feature parity between single-line and multiline csv files.

I was also experimenting with the grammar created by @ajhyndman which I want to use as the default syntax for csv files. So far I discovered one issue - if a csv file (even a regular one, without multiline fields) contains a comment line with unbalanced double quotes (this probably doesn't happen very often, but still) it would mess up highlighting for the whole file, e.g.

# This is a table with various brands, such as "Foo", "Bar, "Foobar"
Brand,Year
Foo,1997
Bar,2003
XYZ,1998
ETC,2003

Setting # as a comment prefix doesn't help here because semantic highlighting and grammar highlighting work on different levels.

There are some workarounds for this, e.g. using dynamic csv when the user chooses a comment prefix (instead of using dynamic csv for rfc files as it is currently implemented).

@mechatroner
Copy link
Owner

There were no major issues reported related to the new "dynamic csv" decoration-based highlighting since its launch, so we can assume that it works well enough. I am finally closing this issue. Further discussion of this topic can be held in the new more focused github issues. Thanks, everyone!

Repository owner locked as resolved and limited conversation to collaborators Sep 2, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

9 participants