August 1st 2010
String.Trim treats Unicode whitespace differently in .NET 4
Recently, we upgraded all of our projects to .NET 4. It was mostly painless, but we did run into one problem: a call to XmlDocument.LoadXml that used to work started throwing an exception:
System.Xml.XmlException: Data at the root level is invalid. Line 1, position 1.
   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
   at System.Xml.XmlTextReaderImpl.ParseRootLevelWhitespace()
   at System.Xml.XmlTextReaderImpl.ParseDocumentContent()
   at System.Xml.XmlLoader.Load(XmlDocument doc, XmlReader reader, Boolean preserveWhitespace)
   at System.Xml.XmlDocument.Load(XmlReader reader)
   at System.Xml.XmlDocument.LoadXml(String xml)
Our code was exactly the same; it was just targeting .NET 4 instead of the .NET 3.5 it had previously been running against. So, what do you do with code that works in .NET 3.5 but not .NET 4?
I needed a quick way to compare .NET 3.5 behavior to .NET 4. The trick I discovered was to use Snippet Compiler for quick tests of .NET 3.5 functionality. Unfortunately, Snippet Compiler does not support .NET 4, so I used LINQPad for that side. Running the two tools side by side let me compare the behavior of the two frameworks very quickly and see how they differed.
The particular piece of code that was causing problems was pulling a Unicode XML block from SQL Service Broker and then using LoadXml to load the XML into the XmlDocument.
So I started by running a snippet of code that mimics the problem in each tool:
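// UTF-16 little-endian bytes: the BOM (0xFF 0xFE) followed by "<r/>"
byte[] buffer = new byte[]{255,254,60,0,114,0,47,0,62,0};
string str = System.Text.Encoding.Unicode.GetString(buffer);
// Trim is expected to strip the leading BOM character (U+FEFF)
str = str.Trim();
System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
// Throws XmlException on .NET 4 because the BOM is still in the string
doc.LoadXml(str);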
This succeeded in Snippet Compiler against .NET 3.5 and threw the exception in LINQPad against .NET 4.
The call to String.Trim is there to remove the Unicode BOM (byte order mark) from the decoded string. The BOM at the start of a Unicode string specifies the byte order. See Wikipedia for more information.
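To make the BOM concrete, here is a minimal sketch that checks the sample bytes against the encoding's preamble and shows that decoding keeps the BOM as a character. The class and member names (BomDemo, SampleBytes, StartsWithPreamble, Decode) are made up for illustration; only Encoding.Unicode, GetPreamble and GetString come from the framework.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // The same bytes used in this post: BOM followed by "<r/>" in UTF-16 little-endian.
    public static readonly byte[] SampleBytes = { 255, 254, 60, 0, 114, 0, 47, 0, 62, 0 };

    // True when the buffer begins with the preamble (BOM bytes) of the given encoding.
    public static bool StartsWithPreamble(byte[] bytes, Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        if (preamble.Length == 0 || bytes.Length < preamble.Length) return false;
        for (int i = 0; i < preamble.Length; i++)
        {
            if (bytes[i] != preamble[i]) return false;
        }
        return true;
    }

    // GetString decodes every byte it is given, so the BOM survives as U+FEFF.
    public static string Decode()
    {
        return Encoding.Unicode.GetString(SampleBytes);
    }
}
```

Calling BomDemo.Decode() should give a five-character string whose first character is U+FEFF, which is the character the code then relies on String.Trim to remove.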
So let’s see what the strings look like before and after the call to String.Trim:
byte[] buffer = new byte[]{255,254,60,0,114,0,47,0,62,0};
string str = System.Text.Encoding.Unicode.GetString(buffer);
// Dump each character before trimming: glyph, hex code point, decimal value
foreach (char c in str)
{
    Console.WriteLine("{0} U+{1:x4} {2}", c, (int)c, (int)c);
}
str = str.Trim();
Console.WriteLine("================================");
// Dump each character again after trimming
foreach (char c in str)
{
    Console.WriteLine("{0} U+{1:x4} {2}", c, (int)c, (int)c);
}
Aha! The Unicode BOM is still there in .NET 4, and the XML parser explodes when it hits it. In .NET 3.5, however, the BOM is removed. But why?
It turns out that the behavior of String.Trim has actually changed in .NET 4. To quote MSDN:
Notes to Callers
The .NET Framework 3.5 SP1 and earlier versions maintains an internal list of white-space characters that this method trims if trimChars is null or an empty array. Starting with the .NET Framework 4, if trimChars is null or an empty array, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).
So the Unicode BOM is removed in .NET 3.5 using String.Trim but it is not removed in .NET 4 using the exact same call.
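The MSDN note can be checked directly on whichever runtime you have at hand. The following is a small sketch, not part of the original investigation; WhitespaceProbe, IsTrimmed and Report are hypothetical names, and the results depend on the framework version the code runs on, so no output is shown.

```csharp
using System;

public static class WhitespaceProbe
{
    // Returns true when Trim() with no arguments removes the given character
    // from the front of a string on the runtime this code is executing on.
    public static bool IsTrimmed(char c)
    {
        string sample = c + "<r/>";
        return sample.Trim().Length < sample.Length;
    }

    // Prints, for each character named in the MSDN note, whether the runtime
    // classifies it as white space and whether Trim() strips it.
    public static void Report()
    {
        char[] candidates = { '\u200B', '\uFEFF', '\u180E', '\u202F', '\u205F' };
        foreach (char c in candidates)
        {
            Console.WriteLine("U+{0:X4} IsWhiteSpace={1} Trimmed={2}",
                (int)c, char.IsWhiteSpace(c), IsTrimmed(c));
        }
    }
}
```

Running WhitespaceProbe.Report() in both Snippet Compiler and LINQPad gives a side-by-side view of exactly which characters changed between the two frameworks.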
Therefore, we need a new Trim method that mimics the .NET 3.5 behavior in .NET 4. This is a perfect opportunity for an extension method, since we want to add new functionality to the core String type. Ideally, we would override String.Trim with our extension method to restore the .NET 3.5 behavior. Unfortunately, or fortunately depending upon your perspective, an extension method cannot override a method the type already defines: the compiler always prefers the instance method, so you can only add new functionality. So we make a new extension method:
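// Extension methods must be declared in a non-generic static class
public static class StringExtensions
{
    public static string TrimWithUnicodeWhitespace(this string stringToTrim)
    {
        return stringToTrim.Trim().Trim(new char[] { '\uFEFF', '\u200B' });
    }
}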
And then replace all calls to String.Trim with TrimWithUnicodeWhitespace! Now the Unicode BOM is removed, just as the code expected String.Trim to do.
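One caveat: the extension method above applies its two trims in sequence, so a string where ordinary white space and a BOM are interleaved (for example a space, then U+FEFF, then another space) would keep the inner space. That was not a problem for our Service Broker data, but if it matters to you, a single pass with a combined test handles it. The sketch below is one possible way to write that; LegacyTrimExtensions, IsTrimmable and TrimLegacy are names I made up. It is also not an exact copy of the .NET 3.5 list, since it trims everything Char.IsWhiteSpace accepts on the current runtime plus the two characters that .NET 4 dropped.

```csharp
using System;

public static class LegacyTrimExtensions
{
    // Anything the current runtime treats as white space, plus the two
    // characters that Trim() stopped removing in .NET 4.
    private static bool IsTrimmable(char c)
    {
        return char.IsWhiteSpace(c) || c == '\uFEFF' || c == '\u200B';
    }

    // Trims from both ends in a single pass so interleaved characters are all removed.
    public static string TrimLegacy(this string value)
    {
        if (value == null) throw new ArgumentNullException("value");

        int start = 0;
        int end = value.Length - 1;
        while (start <= end && IsTrimmable(value[start])) start++;
        while (end >= start && IsTrimmable(value[end])) end--;
        return value.Substring(start, end - start + 1);
    }
}
```

Usage is the same as before: call str.TrimLegacy() wherever the code previously called str.Trim().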
Lessons learned:
- Don’t assume core functionality does not change when upgrading .NET frameworks
- Easily compare behavior of .NET 3.5 versus .NET 4 using Snippet Compiler and LINQPad
- You cannot override functionality using extension methods
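The last lesson is easy to verify for yourself. In this sketch (ShadowingDemo and CallTrim are hypothetical names), an extension method is declared with the same name and signature as String.Trim, yet the normal call syntax still binds to the built-in instance method:

```csharp
using System;

public static class ShadowingDemo
{
    // An extension with the same name and signature as String.Trim().
    // It compiles, but instance-method syntax never selects it.
    public static string Trim(this string s)
    {
        return "extension called";
    }

    // s.Trim() binds to the instance method String.Trim(), not the extension above.
    public static string CallTrim(string s)
    {
        return s.Trim();
    }
}
```

The extension can only be reached by calling it as an ordinary static method, ShadowingDemo.Trim(value), which is why the fix had to use a new method name.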