How do you keep .NET XML parsers from expanding parameter entities in XML?
When I try to parse the XML below (with the code below), I keep getting <sgml>&question;&signature;</sgml>
expanded to
<sgml>Why couldn’t I publish my books directly in standard SGML? — William Shakespeare.</sgml>
OR
<sgml></sgml>
Since I am working on an XML 3-way merging algorithm, I would like to retrieve the unexpanded
<sgml>&question;&signature;</sgml>
I have tried:
- Parsing the XML normally (this results in the expanded sgml tag)
- Removing the DOCTYPE from the beginning of the XML (this results in an empty sgml tag)
- Various XmlReader DTD settings
I have the following XML file:
<!DOCTYPE sgml [
<!ELEMENT sgml ANY>
<!ENTITY std "standard SGML">
<!ENTITY signature " — &author;.">
<!ENTITY question "Why couldn’t I publish my books directly in &std;?">
<!ENTITY author "William Shakespeare">
]>
<sgml>&question;&signature;</sgml>
Here is the code I have tried (several attempts):
using System; // needed for Console
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Reflection;

class Program
{
    static void Main(string[] args)
    {
        string xml = @"C:\src\Apps\Wit\MergingAlgorithmTest\MergingAlgorithmTest\Tests\XMLMerge-DocTypeExpansion\DocTypeExpansion.0.xml";
        var xmlSettingsIgnore = new XmlReaderSettings
        {
            CheckCharacters = false,
            DtdProcessing = DtdProcessing.Ignore
        };
        var xmlSettingsParse = new XmlReaderSettings
        {
            CheckCharacters = false,
            DtdProcessing = DtdProcessing.Parse
        };
        using (var fs = File.Open(xml, FileMode.Open, FileAccess.Read))
        {
            // Attempt 1: ignore the DTD
            using (var xmlReaderIgnore = XmlReader.Create(fs, xmlSettingsIgnore))
            {
                // Prevents Exception "Reference to undeclared entity 'question'"
                PropertyInfo propertyInfo = xmlReaderIgnore.GetType().GetProperty("DisableUndeclaredEntityCheck", BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
                propertyInfo.SetValue(xmlReaderIgnore, true, null);
                var doc = XDocument.Load(xmlReaderIgnore);
                Console.WriteLine(doc.Root.ToString()); // outputs <sgml></sgml> not <sgml>&question;&signature;</sgml>
            } // using xml ignore
            fs.Position = 0;
            // Attempt 2: parse the DTD
            using (var xmlReaderParse = XmlReader.Create(fs, xmlSettingsParse))
            {
                var doc = XDocument.Load(xmlReaderParse);
                Console.WriteLine(doc.Root.ToString()); // outputs <sgml>Why couldn't I publish my books directly in standard SGML? - William Shakespeare.</sgml> not <sgml>&question;&signature;</sgml>
            }
            fs.Position = 0;
            // Attempt 3: skip the DOCTYPE lines and parse only the root element
            string parseXmlString = String.Empty;
            using (StreamReader sr = new StreamReader(fs))
            {
                for (int i = 0; i < 7; ++i) // Skip DocType
                    sr.ReadLine();
                parseXmlString = sr.ReadLine();
            }
            using (XmlReader xmlReaderSkip = XmlReader.Create(new StringReader(parseXmlString), xmlSettingsParse))
            {
                // Prevents Exception "Reference to undeclared entity 'question'"
                PropertyInfo propertyInfo = xmlReaderSkip.GetType().GetProperty("DisableUndeclaredEntityCheck", BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
                propertyInfo.SetValue(xmlReaderSkip, true, null);
                var doc2 = XDocument.Load(xmlReaderSkip); // Empty sgml tag
            }
        } // using FileStream
    }
}
Solution 1:[1]
LINQ to XML does not support modeling of entity references; they are automatically expanded to their values (source 1, source 2). There simply is no subclass of XObject defined for a general entity reference.
However, assuming your XML is valid (i.e. the entity references exist in the DTD, which they do in your example), you can use the old XML Document Object Model to parse your XML and insert XmlEntityReference nodes into your XML DOM tree, rather than expanding the entity references into plain text:
// Requires: using System.Diagnostics; using System.IO; using System.Xml;
using (var sr = new StreamReader(xml))
using (var xtr = new XmlTextReader(sr))
{
    // Expands character entities and returns general entities as System.Xml.XmlNodeType.EntityReference
    xtr.EntityHandling = EntityHandling.ExpandCharEntities;
    var oldDoc = new XmlDocument();
    oldDoc.Load(xtr);
    Debug.WriteLine(oldDoc.DocumentElement.OuterXml); // Outputs <sgml>&question;&signature;</sgml>
    // Verify that the entity references are still there - no assert
    Debug.Assert(oldDoc.DocumentElement.OuterXml.Contains("&question;"));
    Debug.Assert(oldDoc.DocumentElement.OuterXml.Contains("&signature;"));
}
The ChildNodes of each XmlEntityReference will have the text value of the general entity. If a general entity refers to other general entities, as one does in your case, the corresponding inner XmlEntityReference will be nested in the ChildNodes of the outer. You can then compare the old and new XML using the old XmlDocument API.
Note that you also need to use the old XmlTextReader and set EntityHandling = EntityHandling.ExpandCharEntities.
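To make the nesting described above concrete, here is a minimal sketch that loads a document the same way and then walks the DOM, collecting the name of every XmlEntityReference it finds, including those nested under another reference. The class and method names (EntityReferenceDemo, LoadPreservingEntities, EntityReferenceNames) are hypothetical and not part of the original answer. Setting DtdProcessing explicitly is an assumption made for clarity, and the sample DTD is a simplified ASCII version of the one in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class EntityReferenceDemo
{
    // Simplified ASCII version of the question's document (illustrative only).
    const string Sample = @"<!DOCTYPE sgml [
<!ELEMENT sgml ANY>
<!ENTITY std ""standard SGML"">
<!ENTITY signature "" - &author;."">
<!ENTITY question ""Why not publish directly in &std;?"">
<!ENTITY author ""William Shakespeare"">
]>
<sgml>&question;&signature;</sgml>";

    // Hypothetical helper: load XML text into an XmlDocument while keeping
    // general entity references as XmlEntityReference nodes.
    public static XmlDocument LoadPreservingEntities(string xmlText)
    {
        using (var sr = new StringReader(xmlText))
        using (var xtr = new XmlTextReader(sr))
        {
            // Assumption: set explicitly so the internal DTD subset is read.
            xtr.DtdProcessing = DtdProcessing.Parse;
            // Report general entities as EntityReference nodes instead of expanding them.
            xtr.EntityHandling = EntityHandling.ExpandCharEntities;
            var doc = new XmlDocument();
            doc.Load(xtr);
            return doc;
        }
    }

    // Hypothetical helper: depth-first walk that yields the name of each
    // entity reference below the given node, descending into nested references.
    public static IEnumerable<string> EntityReferenceNames(XmlNode node)
    {
        foreach (XmlNode child in node.ChildNodes)
        {
            if (child.NodeType == XmlNodeType.EntityReference)
                yield return child.Name;
            foreach (string nested in EntityReferenceNames(child))
                yield return nested;
        }
    }

    static void Main()
    {
        XmlDocument doc = LoadPreservingEntities(Sample);
        Console.WriteLine(doc.DocumentElement.OuterXml);
        foreach (string name in EntityReferenceNames(doc.DocumentElement))
            Console.WriteLine(name);
    }
}
```

For a 3-way merge, a walk like this could serve as a starting point for comparing documents at the level of entity references rather than expanded text. Whether comparing names alone is sufficient depends on how the merge treats changes to the DTD itself.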
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | reduckted |