Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String interpolation for Idris2 #555

Closed
3 tasks done
andrevidela opened this issue Aug 13, 2020 · 12 comments · Fixed by #2013
Closed
3 tasks done

String interpolation for Idris2 #555

andrevidela opened this issue Aug 13, 2020 · 12 comments · Fixed by #2013

Comments

@andrevidela
Copy link
Collaborator

andrevidela commented Aug 13, 2020

Current progress

  • Implemented syntax Implement interpolated string #1056
  • Implement automatic show inference
  • Implement runtime control using multiplicity annotations (see updated linearity section)
  • Update documentation and Changelog

Summary

String interpolation (Also known as variable interpolation, variable expansion, or variable substitution) is a common feature of modern programming languages that allows the programmer to write a string literal and insert executable code inline for it to be automatically inserted within the string by the compiler.

Motivation

String manipulation is ubiquitous in programming and Idris is no exception. Currently we use string concatenation for interleaving string literal with runtime strings, and while this practice has served us well, it has some obvious limitations in terms of readability. The following section will highlight examples where string concatenation falls short and how string interpolation improve the situation.

Proposed solution

We propose introducing this new syntax for string literals

"this is a normal string"
s"This is an {interpolated} string"
"normal strings are {raw} strings"
s"interpolated string \{need\} to escape curly braces if you want to show them."

Expressions in curly braces must return a type that implements Show.

Expressions

Expressions within brackets can be arbitrary complex as long as they return a type that implements Show :

s"This is a string with a complex value: {replace (plusCommutative n m) $ map isEven record.field}"

Strings

Strings aren't "shown" with show

s"this shows a string : {"hello"}"
-- outputs: this shows a string: hello

s"this shows a string : {show "hello"}"
-- outputs: this shows a string: "hello"

2. Technical details

String interpolation has 2 goals:

  • Improve readability
  • Inform the compiler of the user’s intention about string manipulation

While the first one can only be subjectively evaluated, the second one has a more concrete consequence for the compiler. Take this example

let name = "Susan"
    greeting = s"hello {name}" in 
    putStrLn greeting

In it, the compiler can see that the variable name comes from a string literal. From which the compiler can perform the substitution at compile time rather than at runtime, outputting the code (given some inlining)

putStrLn "hello Susan"

instead of

putStrLn ("hello " <+> "Susan")

In other words, the compiler can avoid runtime concatenation operation and perform those text replacement at compile-time.

Edit: It turns out this does not work, the 0 linearity does not allow sharing a compile-time value with runtime value. When linearity 0 is used the value is always removed from the runtime, that means it cannot typecheck with ++ even if it's inlined. Of course this restriction could be lifted but it simply means the semantics of 0 are wrong when used with string interpolation. The real solution is to introduce a new grade/multiplicity that track compile-time availability (or staging) and informs both the compiler and the user that a value is available for inlining, partial evaluation and sometimes erasure.

A taste of linearity

Interestingly enough, since this substitution is performed at compile time it is semantically equivalent to a function call with linearity 0, that is, the following should compile:

let 0 name = "Susan"
    1 greeting = s"hello {name}" in
    putStrLn greeting

This observation leads us to believe that String Interpolation could make use of linearity (And linearity inference) to decide how to generate the correct output code. In addition, this allows the programmer to predict when the compiler will be able to perform such optimisation, and when it cannot. In the next example we can be assured the string concatenation will happen because there is no way to annotate name with linearity 0

main : IO ()
main = do name <- readLine
          putStrLn s"hello {name}"

In conclusion, erased terms are concatenated at compile time and linear and unrestricted terms are compiled into classical runtime concatenation.

Edit: Turns out this is not possible with the current semiring 0, 1, ω. The reason is that we cannot share a multiplicity 0 value between runtime and compile-time values. we could hack this in but it would plainly be a violation of the guarantee of the type system that values with multiplicity 0 are always erased.

%stringlit

Idris2 has a feature allowing string literals to be overloaded with additional meaning provided by the FromString interface.

interface FromString ty where
  fromString : String -> ty

This feature is particularly useful when designing an embedded DSL. For example

data URLPath : Type where
    
    Component : String -> URLPath
    (/) : URLPath -> URLPath -> URLPath
    END : URLPath

path : URLPath
path = "hello" / "world" / END

The path is equivalent to fromString "hello" / fromString "world" / END, or Component "hello" / Component "world" / END both of which are much noisier.

The rule for fromString is simply that every string literal is replaced by a call to fromString using the literal as argument. However this rule breaks down if we try to apply it to interpolated strings. Take this example:

let v : MyString = s"hello {"world"}" in ?rest

Should it be converted to

let v : MyString = fromString "hello " <+> fromString "world" in ?rest

(In this example we assume that string interpolation is extended to work with types which have a FromString instance and a Monoid instance.)

or

let v : MyString = fromString "hello world" in ?rest

Additionally, the linearity intuition breaks down too because though both translations are valid, the first one assume a non-erased linearity and the second one makes uses of the erased linearity and there is no way to predict which one it will be.

For those reasons, making stringlit compatible with string interpolation is a non-goal.

Alternatives considered

This section is going to look like a FAQ in order to showcase the result of our research.

Why can’t you just use string concatenation?

We can, and we’ve been doing so for years, but String manipulation is a cornerstone of software engineering. Humans communicate with text, display text, encode data in text and even replace type systems with text (see Stringly typed programs). The ubiquity of strings manifests itself in all sorts of ways. Take the following programs that calls a shell script

system $ idris ++ " " ++ file ++ " -o " ++ name

or this show implementation

    show (FunDecl fc n args b) = "\n\n" ++ "(FunDecl (" ++ show fc ++ ") " ++ show n ++ " " ++ show args ++ " " ++ show b ++ ")"

They work but since Strings are so ubiquitous we think they should be:

  • Easy to read
  • Easy to write
  • Performant to use/works well with linearity

String concatenation fails in those three aspects. As an example, the proposed syntax changes those two examples into.

system s"idris {file} -o {name}"
    show (FunDecl fc n args b) = s"\n\n(FunDecl ({fc}) {n} {args} {b})"

We already have dependent types, why not use printf like we should?

Type driven development in Idris has a wonderful example of implementing a type-safe printf function leveraging dependent types. It has the following type:

printf : (fmt : String) -> PrintfType (toFormat (unpack fmt))

and is used like this:

printf "hello %s" "Susan"

with printf "%s %s %s" having the type String -> String -> String -> String.

While this is a great exercise in dependently typed golfing, this also fails our 3 metrics, though not as badly as concatenation:

  • It’s hard to read because we have to count and tell the difference between the format string and the argument values. The human brain cannot transparently replace occurrences of %s with the correct corresponding argument (how is it supposed to know which element it is in the argument list? One would need to count to figure it out)
  • It’s not incredibly hard to write, but knowing which argument corresponds to which position isn’t as easy, specially when you get to more than 3 arguments.
  • It still performs runtime concatenation and cannot take erased arguments.

We already have Data.String.Interpolation in contrib

Again, lets reuse both our examples and our 3 criteria:

system $ F[idris, " ", file, " -o ", name]
    show (FunDecl fc n args b) = F["\n\n", "(FunDecl (", show fc, ") ", show n, " ", show args, " ", show b, ")"]

As we can see, the examples are shorter and more readable, this is definitely an improvement over string concatenation, however:

  • typing “F” “[“ “]” instead of “”…”” isn’t actually easier
  • this only replaces ++ by , which is welcome, but not a sufficient improvement compared to string interpolation
  • This still performs runtime concatenation and doesn’t work with erased arguments
  • This function is limited to String instead of Show a => a
    While some of those aspects could be improved (notably the last one), the best this solution could ever achieve is:
    show (FunDecl fc n args b) = F["\n\n", "(FunDecl (", fc, ") ", n, " ", args, " ", b, ")"]

which is still an unacceptable amount of noise required in order to separate terms by single spaces.

I don’t like your choice, what about *insert other language’s* string interpolation syntax

Let’s explore a few of them.

Scala

val name = "Susan"
println(s"Hello, ${name}")
  • s prefix
  • Dollar sign + curly braces
  • Curly brace optional when used with single identifier: $name

C#

var name = "Susan"
Console.WriteLine($"Hello {name}");
  • $ prefix
  • Curly braces

Swift

let name = "Susan"
print("hello \(name)")
  • No prefix
  • Escaped left parenthesis

Javascript

var name = "Susan"
console.log(`hello ${name}`)
  • Backticks
  • Dollar sign + curly braces

Python

name = "Susan"
print(f"Hello {name}")
  • f prefix
  • Curly braces

Conclusion

We needed a prefix in order to avoid clashing semantics between raw strings and interpolated strings. Given the existing examples our choices are either s, f or $. The f prefix is unsuitable since it refers to “format” but our string interpolation proposal does not include any feature wrt to formatting (Scala has a special prefix f to distinguish between formatted strings and interpolated ones). The $ would be fine, however, $ already has meaning in Idris as the function application operator. If we were to use it, those two functions would do different things despite looking very similar

trim $ "hello {name}  "
trim $"hello {name}   "

Which leaves us with the s prefix, it would stand for “string” which isn’t too outlandish.

As for bracketing the expressions, almost all languages use curly braces. Some of them prefix the curly braces by a $ we would rather avoid it for the same reason we do not use it as prefix and because it makes the expression harder to type (an additional shift+4 to type on US keyboards). We feel like curly braces are the right balance of unlikeliness to appear in strings displayed to the user and ease to type. Indeed, parenthesis are ubiquitous, and square brackets are often used for list bounds. Curly braces find themselves for JSON values and Object-oriented records, but those rarely appear in typical idris code, and when they do, the string is machine generated rather than typed by a human.

It is worth noting that the curly braces are already used in Idris for records and implicit function parameters. However, those three uses are unlikely to clash in an unreadable manner, even the worst case scenario:

nightmare : {val : Rec} -> {0 v : record { field = s"hello {record { foo = bar } rcd}" } val } -> Type

Isn’t hard to read because of the interpolated string but because of the record syntax.

Thanks to @ShinKage who co-authored this proposal, we’re ready to answer your questions.

@gallais
Copy link
Member

gallais commented Aug 13, 2020

👍

One thing I have not seen answered in this proposal:

The proper implementation of nested comments ignores closing tokens for
comments contained in a string literal. This allows you to make sure that
commenting out any well-formed program yields a well-formed program.

E.g. this is a valid single multi-line comment:

{-

myString : String
myString = "It's a trap! -}"

-}

At the moment we can parse nested comments by only looking for -} and
" and switching between parsing nested comments & parsing string literals.

With your proposal allowing users to embed arbitrary code inside these
{...} we may need to parse arbitrary Idris expressions just to detect
when an interpolated string has ended (cf. your record-in-splice example
where a closing brace is not the end of the splice).

What about only allowing identifiers in these splices? Would forcing users
to use let be such an annoyance?

@gallais gallais added the syntax label Aug 13, 2020
@ShinKage
Copy link
Contributor

I think arbitrary expression should be allowed. As for regular string literals inside a block comment, we generate a single token for the comment with a simple string, thus when we find a " we just go to another state in the abstract lexer automata that ignores -} until we find another " and any other character is just added to the comment string without further analysis. The stringLit combinator used for string literal parsing, does just that, ignores everything that is not a ". There is no reason we could not give the same treatment for interpolated literals.

The only additional problem, with interpolated strings and not regular string literals, is nested strings (as in f"foo {"bar"}"). However, nested enclosed content is already dealt by nested comments, so we could adopt the same solution for nested strings.

If these changes to the lexer results in performance hits the issue pertains the lexing phase, that even if our implementation is not optimized, shoud still be asymptotically linear.

@gallais
Copy link
Member

gallais commented Aug 13, 2020

However, nested enclosed content is already dealt by nested comments,

Note that for nested comments the opening & closing tokens are distinct.

@melted
Copy link
Collaborator

melted commented Aug 13, 2020

Nicely worked out proposal!

@andylokandy
Copy link
Contributor

andylokandy commented Feb 2, 2021

@andrevidela Did you figure out how to lex the InterpolatedMiddle yet? I'm gonna to give modal lexer a try to see if it helps.

@andrevidela
Copy link
Collaborator Author

I was going to work on that today actually.

'the plan is to look at { when parsing strings and then count the number of opening braces and closing braces (in the same way we count opening comment tokens), when all the braces are closed (and the count drops to 0) it means we are done with the interpolated part and we art starting to parse InterpolatedMiddle until we reach another {.

You can give it a go since I won't be able to take a look until this evening

@ohad
Copy link
Collaborator

ohad commented Feb 2, 2021

Probably needs a little bit more care, since the spliced code might contain nested comments, and those might contain }. There was some previous discussion about it earlier in this page (here).

@andylokandy
Copy link
Contributor

andylokandy commented Feb 2, 2021

I've written a draft for the modal lexer (gist). I've made it compile but not tested yet. However, it's too late here so I've to go to bed now. Hope it helps your hacking tonight @andrevidela.

@danielkroeni
Copy link
Contributor

danielkroeni commented Jul 21, 2021

Hi, thanks for this nice feature. I found the following surprising behaviour:

> idris2       
     ____    __     _         ___                                          
    /  _/___/ /____(_)____   |__ \                                         
    / // __  / ___/ / ___/   __/ /     Version 0.4.0-5f3480120
  _/ // /_/ / /  / (__  )   / __/      https://www.idris-lang.org          
 /___/\__,_/_/  /_/____/   /____/      Type :? for help                    

Welcome to Idris 2.  Enjoy yourself!
Main> "\{ if True then "a" else "b" }"
Couldn't parse any alternatives:
1: Expected 'case', 'if', 'do', application or operator expression.
(Interactive):1:5--1:7
 1 | "\{ if True then "a" else "b" }"
         ^^
... (43 others)

Main> "\{ (if True then "a" else "b") }"
"a"

The error says that an 'if' would be expected, but at the same time it reports the 'if' as an error.

Edit:
As discussed with Andre I opened a new issue: #1767

@mixphix
Copy link

mixphix commented Sep 19, 2021

Coming to Idris from Haskell, I’m very excited for this proposal! The current state of affairs requires that the argument to an interpolation be of type String, forcing users to state a conversion function in each place they’d like to interpolate some other Idris value. This brings back some of the encumbrance this was intended to reduce.

One of my personal favourite interpolation libraries in Haskell is called shakespeare, and the way that it handles Haskell value interpolation in text is through quasiquotes and the #{…} syntax. However, the way it works is slightly more sophisticated, in that the interpolation accepts any value of type (ToText a) => a; that is, as long as it has an instance of the ToText class. The interpolation calls its only method, toText :: a -> Text on the passed value.

I think a similar mechanism could work using Idris’ interfaces. However, I strongly recommend the creation of a new interface, for example StringInterpolate, rather than using Show. In Haskell’s case, Show is a derivable class, which in many cases will simply print out the representation of the value as it would be written in code, for use in debugging. However, many library maintainers choose to write their own instances of Show, which may or may not have custom formatting or pretty-printed parts. This is especially annoying when displaying data with multiple types whose Show instances behave differently. To avoid the spread of this duplicitous tendency, I suggest using an interface specifically geared toward interpolation so there can be no confusion as to what it “should” or “should not” be used for.

@IFcoltransG
Copy link

I'd like to see a new interface (not Show) used for interpolation. My only suggestion is that the solution should be future proof, for if we later decide to add formatting parameters. For example, Rust passes a formatter struct to the fmt function, which it uses for converting to a string. The struct holds data from the format string, like precision, radix, whether to pad with spaces, and can be safely ignored if the implementor doesn't need those arguments. Python defers to a third method for converting to strings, .__format__(), which has a default implementation.

@andylokandy
Copy link
Contributor

andylokandy commented Sep 22, 2021

@bi-functor @IFcoltransG I contributed to the interpolation string for Idris and I'm also a daily developer in Rust, I'd like to share my point of view on the Show interface on interpolation strings. First of all, Show is more like the Debug in Rust which should be derivable and should print the structure of the value. And besides that, Pretty is like the Display and has provided more powerful tools for formatting. Back to the interpolation string, Pretty is out of the game since we only desire a String rather than a Doc, and also Show is not a good choice for it because Show will always add quotes on Strings. So as a result, I chose to get rid of any auto conversion/formatting by interfaces but left it for programmers to convert the value into strings. There are some upsides to this approach: 1. you can specify any conversion implementation (eg. Show, custom function, named Show implementation) 2. you (or std) can define functions like hex, decimal, bin or more for formatting value into a string so that we don't have to implement special syntax in format strings like Rust.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants