To Parse or Not to Parse: CDATA Explained

To parse, or not to parse: that is the question when your XML element values contain characters that an XML parser would otherwise interpret as markup. Fortunately, we can wrap potentially troublesome text in a special XML construct to solve this problem.

By default, the text in an XML document is treated as PCDATA, or parsed character data. When interpreting an XML file, a parser builds nodes by looking for an <element> tag to start an element and a matching </element> tag to end it. In short, the parser looks for the less-than sign (<) and parses the text following it as either an element name with attributes or the close of an element. Once a start or end tag finishes, signified by the greater-than sign (>), the parser goes on looking for the next element and, if it is inside an existing element, for that element's end tag.
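To make this concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The article does not name a parser, so this choice and the sample documents are assumptions for illustration only. The first string parses cleanly, while the second fails because the parser reads the < in "1 < 2" as the start of a tag.

```python
import xml.etree.ElementTree as ET

# Well-formed: every "<" opens a tag and every ">" closes one.
good = "<note><to>Reader</to><body>Hello</body></note>"
root = ET.fromstring(good)
child_tags = [child.tag for child in root]
print(child_tags)

# Not well-formed: the "<" in "1 < 2" is taken as the start of a tag.
bad = "<note><body>1 < 2</body></note>"
try:
    ET.fromstring(bad)
    bad_parsed = True
except ET.ParseError as err:
    # The parser gives up and reports where the markup went wrong.
    bad_parsed = False
    print("parse failed:", err)
```

Other parsers word the error differently, but a conforming XML parser should reject the second document.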

Because the parser scans all the text of an XML document this way, some data cannot be stored in XML as-is. HTML and JavaScript are often the culprits, as HTML relies on the same <element></element> structure, and JavaScript often uses the less-than and greater-than signs, as well as other characters that are special in XML, such as the ampersand (&). However, we can still store these and other types of data in an XML document. The solution is to wrap these potentially XML-breaking blocks in a CDATA section. CDATA stands for character data, which differs from PCDATA in that it is missing the ‘P’, i.e. parsed. Wrapping your data in <![CDATA[ data ]]> tells an XML parser to treat everything inside as literal text rather than markup, so any special characters or syntax inside are ignored. The only sequence that cannot appear inside is ]]>, since it marks the end of the section.
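As a rough sketch of the CDATA approach, again assuming Python's standard xml.etree.ElementTree as the parser, the snippet below stores a line of JavaScript in an element and reads it back unchanged. The wrap_cdata helper and the <page> and <script> element names are hypothetical, made up for this example. Because a CDATA section ends at the first ]]>, the helper splits any ]]> in the data across two adjacent sections.

```python
import xml.etree.ElementTree as ET

def wrap_cdata(text):
    """Hypothetical helper: wrap text in a CDATA section.

    A CDATA section ends at the first "]]>", so any "]]>" in the data
    is split across two adjacent sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# JavaScript containing "<", ">", and "&", all special in XML.
script = 'if (a < b && b > 0) { document.write("<b>ok</b>"); }'
doc = "<page><script>" + wrap_cdata(script) + "</script></page>"

root = ET.fromstring(doc)
recovered = root.find("script").text
print(recovered == script)  # True: the text comes back untouched
```

The parser hands back the section's contents as ordinary element text, so code reading the document does not need to know whether CDATA was used.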

Now the question of to parse or not to parse is in your control. Choose wisely!

Michael Marr
About Michael Marr
Michael Marr is a staff writer for WebProNews
