Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
178 views
in Technique[技术] by (71.8m points)

javascript - How do I escape quotes in HTML attribute values?

I'm building up a row to insert in a table using jQuery by creating a html string, i.e.

var row = "";
row += "<tr>";
row += "<td>Name</td>";
row += "<td><input value='"+data.name+"'/></td>";
row += "</tr>";

data.name is a string returned from an ajax call which could contain any characters. If it contains a single quote, ', it will break the HTML by defining the end of the attribute value.

How can I ensure that the string is rendered correctly in the browser?

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Actually you may need one of these two functions (this depends on the context of use). These functions handle all kind of string quotes, and also protect from the HTML/XML syntax.

1. The quoteattr() function for embeding text into HTML/XML:

The quoteattr() function is used in a context, where the result will not be evaluated by javascript but must be interpreted by an XML or HTML parser, and it must absolutely avoid breaking the syntax of an element attribute.

Newlines are natively preserved if generating the content of a text elements. However, if you're generating the value of an attribute this assigned value will be normalized by the DOM as soon as it will be set, so all whitespaces (SPACE, TAB, CR, LF) will be compressed, stripping leading and trailing whitespaces and reducing all middle sequences of whitespaces into a single SPACE.

But there's an exception: the CR character will be preserved and not treated as whitespace, only if it is represented with a numeric character reference! The result will be valid for all element attributes, with the exception of attributes of type NMTOKEN or ID, or NMTOKENS: the presence of the referenced CR will make the assigned value invalid for those attributes (for example the id="..." attribute of HTML elements): this value being invalid, will be ignored by the DOM. But in other attributes (of type CDATA), all CR characters represented by a numeric character reference will be preserved and not normalized. Note that this trick will not work to preserve other whitespaces (SPACE, TAB, LF), even if they are represented by NCR, because the normalization of all whitespaces (with the exception of the NCR to CR) is mandatory in all attributes.

Note that this function itself does not perform any HTML/XML normalization of whitespaces, so it remains safe when generating the content of a text element (don't pass the second preserveCR parameter for such case).

So if you pass an optional second parameter (whose default will be treated as if it was false) and if that parameter evaluates as true, newlines will be preserved using this NCR, when you want to generate a literal attribute value, and this attribute is of type CDATA (for example a title="..." attribute) and not of type ID, IDLIST, NMTOKEN or NMTOKENS (for example an id="..." attribute).

function quoteattr(s, preserveCR) {
    preserveCR = preserveCR ? '&#13;' : '
';
    return ('' + s) /* Forces the conversion to string. */
        .replace(/&/g, '&amp;') /* This MUST be the 1st replacement. */
        .replace(/'/g, '&apos;') /* The 4 other predefined entities, required. */
        .replace(/"/g, '&quot;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        /*
        You may add other replacements here for HTML only 
        (but it's not necessary).
        Or for XML, only if the named entities are defined in its DTD.
        */ 
        .replace(/
/g, preserveCR) /* Must be before the next replacement. */
        .replace(/[
]/g, preserveCR);
        ;
}

Warning! This function still does not check the source string (which is just, in Javascript, an unrestricted stream of 16-bit code units) for its validity in a file that must be a valid plain text source and also as valid source for an HTML/XML document.

  • It should be updated to detect and reject (by an exception):
    • any code units representing code points assigned to non-characters (like uFFFE and uFFFF): this is an Unicode requirement only for valid plain-texts;
    • any surrogate code units which are incorrectly paired to form a valid pair for an UTF-16-encoded code point: this is an Unicode requirement for valid plain-texts;
    • any valid pair of surrogate code units representing a valid Unicode code point in supplementary planes, but which is assigned to non-characters (like U+10FFFE or U+10FFFF): this is an Unicode requirement only for valid plain-texts;
    • most C0 and C1 controls (in the ranges u0000..u1F and u007F..u009F with the exception of TAB and newline controls): this is not an Unicode requirement but an additional requirement for valid HTML/XML.
  • Despite of this limitation, the code above is almost what you'll want to do. Normally. Modern javascript engine should provide this function natively in the default system object, but in most cases, it does not completely ensure the strict plain-text validity, not the HTML/XML validity. But the HTML/XML document object from which your Javascript code will be called, should redefine this native function.
  • This limitation is usually not a problem in most cases, because the source string are the result of computing from sources strings coming from the HTML/XML DOM.
  • But this may fail if the javascript extract substrings and break pairs of surrogates, or if it generates text from computed numeric sources (converting any 16-bit code value into a string containing that one-code unit, and appending those short strings, or inserting these short strings via replacement operations): if you try to insert the encoded string into a HTML/XML DOM text element or in an HTML/XML attribute value or element name, the DOM will itself reject this insertion and will throw an exception; if your javascript inserts the resulting string in a local binary file or sends it via a binary network socket, there will be no exception thrown for this emission. Such non-plain text strings would also be the result of reading from a binary file (such as an PNG, GIF or JPEG image file) or from your javascript reading from a binary-safe network socket (such that the IO stream passes 16-bit code units rather than just 8-bit units: most binary I/O streams are byte-based anyway, and text I/O streams need that you specify a charset to decode files into plain-text, so that invalid encodings found in the text stream will throw an I/O exception in your script).

Note that this function, the way it is implemented (if it is augmented to correct the limitations noted in the warning above), can be safely used as well to quote also the content of a literal text element in HTML/XML (to avoid leaving some interpretable HTML/XML elements from the source string value), not just the content of a literal attribute value ! So it should be better named quoteml(); the name quoteattr() is kept only by tradition.

This is the case in your example:

data.value = "It's just a "sample" <test>.
Try & see yourself!";
var row = '';
row += '<tr>';
row += '<td>Name</td>';
row += '<td><input value="' + quoteattr(data.value) + '" /></td>';
row += '</tr>';

Alternative to quoteattr(), using only the DOM API:

The alternative, if the HTML code you generate will be part of the current HTML document, is to create each HTML element individually, using the DOM methods of the document, such that you can set its attribute values directly through the DOM API, instead of inserting the full HTML content using the innerHTML property of a single element :

data.value = "It's just a "sample" <test>.
Try & see yourself!";
var row = document.createElement('tr');
var cell = document.createElement('td');
cell.innerText = 'Name';
row.appendChild(cell);
cell = document.createElement('td');
var input = document.createElement('input');
input.setAttribute('value', data.value);
cell.appendChild(input);
tr.appendChild(cell);
/*
The HTML code is generated automatically and is now accessible in the
row.innerHTML property, which you are not required to insert in the
current document.

But you can continue by appending tr into a 'tbody' element object, and then
insert this into a new 'table' element object, which ou can append or insert
as a child of a DOM object of your document.
*/

Note that this alternative does not attempt to preserve newlines present in the data.value, becase you're generating the content of a text element, not an attribute value here. If you really want to generate an attribute value preserving newlines using &#13;, see the start of section 1, and the code within quoteattr() above.

2. The escape() function for embedding into a javascript/JSON literal string:

In other cases, you'll use the escape() function below when the intent is to quote a string that will be part of a generated javascript code fragment, that you also want to be preserved (that may optionally also be first parsed by an HTML/XML parser in which a larger javascript code could be inserted):

function escape(s) {
    return ('' + s) /* Forces the conversion to string. */
        .replace(/\/g, '\\') /* This MUST be the 1st replacement. */
        .replace(//g, '\t') /* These 2 replacements protect whitespaces. */
        .replace(/
/g, '\n')
        .replace(/u00A0/g, '\u00A0') /* Useful but not absolutely necessary. */
        .replace(/&/g, '\x26') /* These 5 replacements protect from HTML/XML. */
        .replace(/'/g, '\x27')
        .replace(/"/g, '\x22')
        .replace(/</g, '\x3C')
        .replace(/>/g, '\x3E')
        ;
}

Warning! This source code does not check for the validity of the encoded document as a valid plain-text document. However it should never raise an exception (except for out of memory condition): Javascript/JSON source strings are just unrestricted streams of 16-bit code units and do not need to be valid plain-text or are not restricted by HTML/XML document syntax. This means that the code is incomplete, and should also replace:

  • all other code units representing C0 and C1 controls (with the exception of TAB and LF, handled above, but that may be left intact without substituting them) using the xNN notation;
  • all code units that are assigned to non-characters in Unicode, which should be replaced using the uNNNN notation (for example uFFFE or uFFFF);
  • all code units usable as Unicode surrogates in the range uD800..DFFF, like this:
    • if they are not correctly paired into a valid UTF-16 pair representing a valid Unicode code point in the full range U+0000..U+10FFFF, these surrogate code units should be individually replaced using the notation uDNNN;
    • else if if the code point that the code unit pair represents is not valid in Unicode plain-text, because the code point is assigned to a non-character, the two code points should be replaced using the notation U00NNNNNN;
  • finally, if the code point represented by the code unit (or the pair of code units representing a code point in a supplementary plane), independantly of if that code point is assigned or reserved/unassigned, is also invalid in HTML/XML source documents (see their specification), the code point should be replaced using the uNNNN notation (if the code point is in the BMP) or the u00NNNNNN (if the code point is in a supplementary plane) ;

Note also that the 5 last replacements are not really necessary. But it you don't include them, you'll sometimes need to use the <![CDATA[ ... ]]> compatibility "hack" in some cases, such as further including the generated javascript in HTML or XML (see the example below where this "hack" is used in a <script>...</script> HTML element).

The escape() function has the advantage that it does not insert any HTML/X


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...