Wednesday, February 27, 2008

Detecting Duplicate XML Data in SQL Server

I've been working quite a bit with XML in SQL Server lately (I'll try to do a post on some XQuery stuff at some point), and I needed to check XML data that I'm pulling off disk against a table in SQL Server to see whether it duplicates data already in the database.


The problem I ran into is that SQL Server "collapses" empty XML nodes when you insert data as XML (e.g. <myXmlNode></myXmlNode> is turned into <myXmlNode/> by SQL Server), so if the XML you're checking against hasn't gone through this collapsing process, you won't find duplicates accurately.
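
A quick way to see the collapsing for yourself is to round-trip a string through the xml type. This is only a minimal sketch; the variable name and the sample markup are made up for illustration and aren't from my actual data:

```sql
-- Hypothetical sample: an empty element written with separate open and close tags
DECLARE @raw nvarchar(max);
SET @raw = N'<root><myXmlNode></myXmlNode></root>';

-- Convert to xml and back to text to see how SQL Server represents it
SELECT @raw AS originalText,
       CONVERT(nvarchar(max), CONVERT(xml, @raw)) AS roundTrippedText;
```

The second column should come back with the empty element collapsed, so the two strings no longer match even though they describe the same XML.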


The solution turned out to be pretty simple and was suggested to me by a co-worker. First, you can't compare XML to XML directly in a query because SQL Server doesn't allow the = operator on the xml data type. Given the issue outlined above, you also can't just convert the XML in the database to nvarchar(max) and compare it with the raw XML string from disk, because the string from disk hasn't had its empty nodes collapsed.


The trick is to use SQL Server's CONVERT() function and convert the XML from disk to SQL Server XML within a query, and then compare the result of that with the data already in the database:



DECLARE @xmlToCheck xml
SELECT @xmlToCheck = CONVERT(xml, '#theXmlFromDisk#')
SELECT COUNT(id) AS dupeCount
FROM xmlTable
WHERE CONVERT(nvarchar(max), xmlColumn) = CONVERT(nvarchar(max), @xmlToCheck)



If dupeCount comes back greater than 0, then you have a dupe on your hands. Hope that helps others since I spent more time than I had hoped wrangling with this issue.
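
If you want to try the check without touching a real table, something like the following should work as a rough, self-contained sketch. The table variable, its columns, and the sample XML are all stand-ins I've invented for illustration, not my real schema:

```sql
-- Hypothetical stand-in for the real table
DECLARE @xmlTable TABLE (id int IDENTITY(1,1), xmlColumn xml);

-- The empty <note> element gets collapsed when stored as xml
INSERT INTO @xmlTable (xmlColumn)
VALUES (N'<order><note></note></order>');

-- The "XML from disk", still in its uncollapsed form, converted to xml first
DECLARE @xmlToCheck xml;
SET @xmlToCheck = CONVERT(xml, N'<order><note></note></order>');

-- Both sides have now been through the xml type, so the strings line up
SELECT COUNT(id) AS dupeCount
FROM @xmlTable
WHERE CONVERT(nvarchar(max), xmlColumn) = CONVERT(nvarchar(max), @xmlToCheck);
```

Because both values have been through the xml type before being turned back into nvarchar(max), the comparison should count the stored row as a dupe.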
