Problem of escaping in programming languages.

In almost any kind of textual data format, you'll have to implement escaping of some sort. For example:

  1. In programming languages, string literals are delimited by a quote or double-quote (' or ").
  2. Variants of SGML (including XML and HTML) use <, >, and / extensively to define markup.

In the above formats, ', ", <, > and / all have a special meaning. However, we may still want to represent them as ordinary data. For example, how would we distinguish a quote that delimits a string literal ("foo") from a quote inside a string literal ("& and " are special characters")?

To solve this problem, we introduce the concept of escaping. Characters that have a special meaning within the data format are represented using a sequence of two or more characters. In many programming languages, \" is used to represent a double quote inside a string literal.

To explain this in a more complicated way: this problem arises because we're trying to represent more information than the bytes can hold. A document of N bytes can represent 256^N different values. However, by defining a data format on top of a sequence of bytes, we're adding more semantics on top of it. If we still want to represent all 256^N values alongside the additional semantics, we need to use more bytes to accommodate them.
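The counting argument can be checked on a tiny case. This is only a sketch, assuming a toy format in which a document is N bytes long and a single byte value (say, the double quote) is reserved as a delimiter; the variable names are made up for illustration.

```python
N = 2  # document length in bytes, kept tiny for illustration

total = 256 ** N          # every possible N-byte sequence
without_quote = 255 ** N  # sequences that never use the reserved byte

print(total)                  # 65536
print(without_quote)          # 65025
print(total - without_quote)  # 511
```

The 511 values that contain the reserved byte cannot be written directly in N bytes any more, so representing them requires extra bytes, which is exactly what an escape sequence spends.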

There are a few problems with escaping, though.

Escaping can introduce invalid or ambiguous input to data formats

There is no such thing as invalid text, because every one of the 256^N byte sequences is a valid string. However, as soon as you introduce escaping, you're also introducing the possibility of input that is invalid or ambiguous.

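In many languages, \t is a tab, \n is a newline, \\ is a backslash. What about \a, \b, or \c? Some of these are defined in some languages (\a and \b in C and Python, \b in Java) and some are not. If no such escape sequence exists, what exactly should happen?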

  • Input is rejected (Java)
  • Input is interpreted as is, as if \ is not an escape character (Python)
  • Backslash is ignored (C, as implemented by common compilers such as GCC, which also emit a warning)

strlen("\k") is 1 in C, len("\k") is 2 in Python, and "\k".length() doesn't even compile in Java. This is problematic because very few people remember every escape sequence, let alone how the rules vary between programming languages.
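The three behaviours can be put side by side in one small decoder. This is a minimal sketch, not how any of these languages is actually implemented: the unescape function, its on_unknown parameter, and the KNOWN table are all hypothetical, and the policies "reject", "keep", and "drop" only loosely mirror Java, Python, and C respectively.

```python
# Escape sequences this toy format understands (an assumption for the sketch).
KNOWN = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}

def unescape(text: str, on_unknown: str = "reject") -> str:
    """Decode backslash escapes; on_unknown is 'reject', 'keep', or 'drop'."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling backslash at end of input")
        nxt = text[i + 1]
        if nxt in KNOWN:
            out.append(KNOWN[nxt])
        elif on_unknown == "reject":   # refuse the input outright
            raise ValueError(f"unknown escape sequence \\{nxt}")
        elif on_unknown == "keep":     # treat the backslash as ordinary text
            out.append("\\" + nxt)
        elif on_unknown == "drop":     # silently discard the backslash
            out.append(nxt)
        else:
            raise ValueError(f"unknown policy {on_unknown!r}")
        i += 2
    return "".join(out)

raw = r"\k"  # two characters: a backslash followed by k
print(len(unescape(raw, "keep")))  # 2
print(len(unescape(raw, "drop")))  # 1
```

The same two-character input decodes to different lengths depending on the policy, and under "reject" it does not decode at all.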

Introducing escaping in a format introduces more escaping

To have a double quote in a string literal, one needs to type \". Now we can have double quotes in string literals, but how about a backslash? Since the backslash now carries additional meaning (it initiates an escape sequence), a literal backslash must itself be escaped. The lesson here is that any character that initiates an escape sequence also needs to be escaped. The same is true of & in XML, where a literal & must be written as &amp;.
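One practical consequence is that the order of replacements matters when writing an escaper by hand. The sketch below uses two hypothetical helpers, escape_quoted and escape_wrong_order, for a toy double-quoted literal, plus html.escape from the Python standard library to show the same effect with &.

```python
import html

def escape_quoted(text: str) -> str:
    """Escape text for a double-quoted literal (hypothetical helper)."""
    # Escape the backslash first; otherwise the backslashes added for the
    # quotes in the second step would be doubled as well.
    return text.replace("\\", "\\\\").replace('"', '\\"')

def escape_wrong_order(text: str) -> str:
    """Same replacements in the wrong order, to show the failure."""
    return text.replace('"', '\\"').replace("\\", "\\\\")

sample = 'say "hi"'
print(escape_quoted(sample))       # say \"hi\"
print(escape_wrong_order(sample))  # say \\"hi\\"

# The same rule in XML/HTML: & starts an entity, so & itself gets escaped.
print(html.escape("&amp;", quote=False))  # &amp;amp;
```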

If you have data embedded in other data, you may need to escape multiple times

No, we're not talking about a CSV file with XML documents in the cells, where the XML documents contain Java code. Embedding one format inside another sounds like an anti-pattern, but it occurs quite commonly in real life.

  • Regular expressions in Java. As in other languages, the backslash initiates an escape sequence in a string literal. However, the backslash also initiates an escape sequence within a regular expression. Thus, to write a regular expression that matches a single backslash, you need four backslashes (!!!!) in the string literal.
  • Embedded JavaScript in HTML. JavaScript code in <script> tags needs to be aware of HTML escaping. This question on Stack Overflow (http://stackoverflow.com/questions/4176511/embedding-json-objects-in-script-tags) gives a detailed discussion.
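Both cases can be sketched in a few lines. The example is in Python rather than Java, on the assumption that the point carries over, since Python's ordinary string literals and its re module treat the backslash the same way here. The json_for_script helper is hypothetical and covers only the "<" character, so treat it as one common mitigation rather than a complete answer.

```python
import json
import re

# Four backslashes in the source become two characters in the string,
# which the regex engine reads as "match one literal backslash".
pattern = "\\\\"
assert len(pattern) == 2
assert re.fullmatch(pattern, "\\") is not None

# A raw string literal removes one of the two layers of escaping.
assert r"\\" == pattern

def json_for_script(value) -> str:
    """Serialize value as JSON for use inside <script> (hypothetical helper)."""
    # json.dumps only escapes for JSON. Replacing "<" with the JSON escape
    # \u003c keeps a sequence like </script> from ending the element early.
    return json.dumps(value).replace("<", "\\u003c")

payload = "</script>"
embedded = json_for_script(payload)
assert "</script>" not in embedded
assert json.loads(embedded) == payload  # the data survives the round trip
```

Each layer (string literal, regex, JSON, HTML) has its own escaping rules, and the value has to be escaped once for every layer it passes through.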