Against Escape Characters

Author: Nox


Suppose I want to display some html inside a web page. Here is what I want to display:

<p> 1 & 2 </p>

But when you view the html source of this page, the above code block looks like this:

&lt;p&gt; 1 &amp; 2 &lt;/p&gt;

And now this second code block looks like this:

&amp;lt;p&amp;gt; 1 &amp;amp; 2 &amp;lt;/p&amp;gt;

It's quickly devolved into a mess.
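The same progression can be reproduced with a short sketch. It uses Python's standard `html.escape`, which is my choice for illustration and not necessarily what generated this page; each call stands for one more level of "source of the source".

```python
import html

snippet = "<p> 1 & 2 </p>"

# Each level of "viewing the source" applies HTML escaping one more time.
once = html.escape(snippet)
twice = html.escape(once)

print(snippet)
print(once)
print(twice)
```

Every `&` introduced by one level becomes `&amp;` at the next, which is why the text grows with each level.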

Language in a Language

Similarly, observe this code:

print("Hello, world!")

What if I wanted to print the above code? I could do so like this:

print("print(\"Hello, world!\")")

And now if I want to print that code block, it would look like this:

print("print(\"print(\\\"Hello, world!\\\")\")")

The deeper I go, the more escaping I have to do, and the less readable it is.
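To make the growth concrete, here is a minimal sketch that does the wrapping mechanically. The helper name `wrap_in_print` is hypothetical, and it assumes the only characters that need escaping are backslashes and double quotes.

```python
def wrap_in_print(code):
    """Return source text that prints `code`, escaping it as needed."""
    # Backslashes are doubled first, then double quotes are escaped.
    escaped = code.replace("\\", "\\\\").replace('"', '\\"')
    return f'print("{escaped}")'

level = 'print("Hello, world!")'
for depth in range(3):
    # Show the nesting depth, the number of backslashes, and the code itself.
    print(depth, level.count("\\"), level)
    level = wrap_in_print(level)
```

Each extra level escapes the backslashes added by the previous one, so the count more than doubles every time.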

I acknowledge that this last example is contrived. How often do you want three levels of languages within languages within languages? But note that there is one very common case where you want a language inside another language: regular expressions.

If you've ever programmed in a language where regular expressions don't have special syntax, such as Java or Emacs Lisp, you've probably encountered regexes that look like this:

"#\\({[^}\n\\]*\\(\\\\.[^}\n\\]*\\)*}\\|\\(?:\\$\\|@\\|@@\\)\\(\\w\\|_\\)+\\|\\$[^a-zA-Z \n]\\)"

(This is a real regex, copied from ruby-mode in Emacs. The regex is licensed GPLv3+.)

How many backslashes are in there? And this is just one language inside one other language!

Delim System

Here is a proposal for a few syntax rules that would avoid this issue.

We define three pairs of delimiters: ( is matched with ), [ is matched with ], and { is matched with }.

Every delimiter must always be properly matched, and cannot be escaped. The only exception is when it is in a raw delimiter. Raw delimiters look like this:

`(A raw delimiter :). It can contain unmatched parens.)`
`pat(This one has a pattern. It can contain )` safely. It ends here->)pat`

A raw delimiter starts with a backtick, followed by an optional pattern and then an opening delimiter. It is terminated by the matching closing delimiter, followed by the same pattern and a final backtick. It can contain anything except its own terminating sequence.

And that's all the rules!
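The rules are small enough to sketch as a checker. This is only an illustration under my own assumptions, not a reference implementation: the names `read_raw` and `is_well_formed` are hypothetical, the pattern is taken to be everything between the backtick and the first opening delimiter, and any backtick outside a raw delimiter is treated as the start of one.

```python
OPENERS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPENERS.values())

def read_raw(text, start):
    """Read the raw delimiter whose opening backtick is at text[start].

    Returns (content, index just past the closing backtick).
    """
    i = start + 1
    # Assumption: the pattern runs up to the first opening delimiter.
    while i < len(text) and text[i] not in OPENERS:
        if text[i] == "`" or text[i] in CLOSERS:
            raise ValueError("invalid character in raw delimiter pattern")
        i += 1
    if i >= len(text):
        raise ValueError("raw delimiter has no opening delimiter")
    pattern = text[start + 1:i]
    # Terminating sequence: closing delimiter, same pattern, backtick.
    terminator = OPENERS[text[i]] + pattern + "`"
    end = text.find(terminator, i + 1)
    if end == -1:
        raise ValueError("unterminated raw delimiter")
    return text[i + 1:end], end + len(terminator)

def is_well_formed(text):
    """Check that every delimiter outside raw delimiters is matched."""
    stack = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "`":
            try:
                _, i = read_raw(text, i)  # skip the raw content entirely
            except ValueError:
                return False
            continue
        if ch in OPENERS:
            stack.append(OPENERS[ch])
        elif ch in CLOSERS:
            if not stack or stack.pop() != ch:
                return False
        i += 1
    return not stack

example = "`pat(This one has a pattern. It can contain )` safely. It ends here->)pat`"
print(read_raw(example, 0)[0])
print(is_well_formed(example))
```

Note that `read_raw` never inspects the content it skips; it only searches for the terminating sequence, which is what lets a raw delimiter hold unmatched parens.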

Languages that follow these rules can be nested arbitrarily inside each other without using any escape characters. In fact, they can be nested inside each other without even using raw delimiters! You only need to use raw delimiters when you want to embed languages that don't follow these rules.

To show the usefulness of this, let's rewrite the original html example, in a world where html had been designed to follow these rules. This version of html would have <raw></raw> tags, written (raw ), to contain raw text. To be fair, I'll replace the & in the text with a smiley. So what we want to display is this pseudo-html code:

`(p 1 :) 2)`

When viewing the source of that code block, it would look like this:

(raw `(p 1 :) 2)`)

And viewing the source of this second code block:

(raw (raw `(p 1 :) 2)`))

This isn't perfect, but it's still much more readable than the html, which I'll remind you looked like this:

&amp;lt;p&amp;gt; 1 &amp;amp; 2 &amp;lt;/p&amp;gt;
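The difference between the two approaches can be sketched in a few lines. The `wrap_raw` helper is hypothetical and simply mirrors the `(raw )` element of the pseudo-html above; `html.escape` from Python's standard library stands in for ordinary html escaping.

```python
import html

def wrap_raw(source):
    # Hypothetical (raw ...) element: the inner text is left untouched.
    return "(raw " + source + ")"

delim_snippet = "`(p 1 :) 2)`"
html_snippet = "<p> 1 & 2 </p>"

delim_level, html_level = delim_snippet, html_snippet
for depth in range(1, 3):
    delim_level = wrap_raw(delim_level)
    html_level = html.escape(html_level)
    # Is the original snippet still present verbatim at this depth?
    print(depth, delim_snippet in delim_level, html_snippet in html_level)
    print("  ", delim_level)
    print("  ", html_level)
```

Wrapping only adds text around the snippet, so the original survives verbatim at every depth, whereas escaping rewrites the snippet itself each time.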

Conclusion

I'm not against all escape characters. For example, being able to use \n for newlines inside of strings is useful. I suppose what I'm against is using escape characters to escape the delimiters, as well as languages that don't provide any syntax for raw strings.

The delim system would be especially useful in a language like Racket, which is designed for creating DSLs. It would still be useful in most other languages. At the very least, even if a language rejects the full delim system, it should have a syntax for raw strings, if only to avoid problems like too many backslashes in regexes.
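As one existing example of raw string syntax, here is a small sketch in Python, which has `r"..."` literals. The regex is my own toy example: it matches a literal backslash followed by one or more word characters.

```python
import re

# The regex as the regex engine sees it is: \\ (a literal backslash) then \w+
plain = "\\\\\\w+"  # ordinary string literal: every backslash is doubled
raw = r"\\\w+"      # raw string literal: written exactly as the regex reads

# Both literals produce the same five-character pattern.
print(plain == raw)

# "\\name" is the five-character string \name.
print(re.fullmatch(raw, "\\name") is not None)
```

The raw literal needs half as many backslashes, and the saving grows with the complexity of the regex.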

Meta

Date: 2022-10-22

Tags: post | syntax | programming | programming_languages