Storing XML in SQL Server

Usually, the design of our line-of-business applications contains lots and lots of entities that we abstract as classes in our code to represent the various real-world objects with which the application needs to deal. For example, the following fragment of code represents a hypothetical class “Employee”, which encapsulates various pieces of data related to the employees in the company:

public class Employee
{
    public int EmpoyeeId { get; set; }
    public string Name { get; set; }
    public int DepartmentID { get; set; }
    public decimal Salary { get; set; }
    public DateTime HireDate { get; set; }

    public override string ToString()
    {
        StringBuilder sb = new StringBuilder("{");
        foreach (PropertyInfo pi in typeof(Employee).GetProperties())
        {
            sb.Append(" " + pi.Name + "="
              + pi.GetValue(this, null) + ";");
        }
        sb.Remove(sb.Length - 1, 1);
        return sb.ToString() + " }";
    }
}

The application will instantiate this class, populate the various fields, pass it as an argument to several methods, and perform different operations that access the values of its properties.
At some point, it will be necessary to save these entities into permanent storage so that they can later be retrieved. The most common approach uses a database engine, such as SQL Server. Traditionally, a table is created in the database, and it is defined in such a way that the fields in the table match the properties in the class (or at least, those properties that need to be saved). For instance, our Employee class might be saved into a table defined in this way:

CREATE TABLE Employees (
   EmployeeId INT PRIMARY KEY,
   Name VARCHAR(50),
   DepatmentID INT,
   Salary MONEY,
   HireDate DATETIME
)
GO

Mapping the tables in the database into the classes in memory can be a non-trivial task, because both environments can be designed in conformance to different paradigms. The database follows a relational model, while the classes are designed according to an object hierarchy. The difference between both architectures is sometimes referred to as an “impedance mismatch”. There are many existing tools whose purpose is to facilitate the connection between the classes and the database tables. This process is known as “Object Relational Mapping” (ORM). The different tools follow different strategies and have different capabilities. One example of such a tool is the Entity Framework that is included with Visual Studio starting with SP1 for Visual Studio 2008.

In this post, we are not going to concern ourselves with how to perform the Mapping or how to use existing tools for this purpose. The point is that, regardless of whether we use an ORM tool or we create from scratch all of the code that reads and writes the classes into the database, in the end we shall have a database with lots and lots of tables that in turn have lots and lots of fields to represent all of the various pieces of data for the many business entities that our application needs to handle. The database schema can become very hard to understand, and also very hard to maintain, because it will need to be changed every time that we modify an entity to contain a new or revised property that needs to be saved.

The trick that we are going to show here can simplify the maintenance of some of these tables. Instead of creating a column in the table for each of the individual properties that has to be saved, we are going to create a single field that will contain an XML block with all the values for the different properties that need to be stored. If the database server is SQL Server 2005 or later, the column can be declared as being of type XML. Otherwise, a simple LOB such as TEXT can suffice. For example, our Employees table can be shortened like this:

CREATE TABLE Employees (
   EmployeeId INT PRIMARY KEY,
   Details XML
)
GO

The Details field can contain information for the various fields of the class Employee, such as Name, DepartmentID, Salary, and so on. For example, a new record might be inserted into the table by means of a SQL query like the following:

INSERT INTO Employees
VALUES
	(1,
		'<?xml version="1.0" ?>
		<Employee>
			<Name>John Smith</Name>
			<DepartmentID>1</DepartmentID>
			<Salary>987.65</Salary>
		</Employee>')
GO

Of course, not every property of every entity is suitable for being stored in this way. For instance, if we frequently need to add the salaries of all employees, it would be best to have the Salary as a separate field in the table so that a “Select Sum(…)” could be applied to it. On the other hand, there are many entities that are mostly handled as a whole, so that the program creates an instance in memory and populates all of its fields from all of the fields in a record in the database, and conversely, when the instance is saved into the database all of the fields of the class get dumped into all of the fields in the database table. In such cases, we can simply serialize the class and store the resulting XML into the single field in the database table.

One advantage of operating in this way is that neither the database table nor the code that reads and writes the entity need to be modified if the class is ever changed to add new properties. Maintenance becomes easier, and there is a lower risk of losing synchronization between the version of the program code that we are running and the version of the database that it is accessing. Also, we do not need to write conversion code to migrate the data from the old database schema to the new modified tables that contain the changed fields.

The disadvantage is that accessing individual pieces of data stored inside the XML is more complex and less efficient than if those pieces were stored in separate fields. A possible compromise is to define a record that contains separate fields for frequently accessed properties and an XML field for “miscellaneous” data that most of the time is only read as a block into the business entities managed by the program.

Let’s write some code to illustrate this technique for handling data. Our first example shows a loop that inserts new Employees into the table:

private void cmdAddEmployees_Click(object sender, EventArgs e)
{
    using (SqlConnection cn = OpenConnection())
    {
        string query = "INSERT INTO Employees VALUES(@Id, @Xml)";
        SqlCommand cmd = new SqlCommand(query, cn);

        SqlParameter parmId =
            new SqlParameter("@Id", SqlDbType.Int);
        cmd.Parameters.Add(parmId);

        SqlParameter parmXml =
            new SqlParameter("@Xml", SqlDbType.Xml);
        cmd.Parameters.Add(parmXml);

        for (int i = 100; i <= 200; i++)
        {
            //Generate arbitrary sample data
            Employee emp = new Employee();
            emp.EmpoyeeId = i;
            emp.DepartmentID = i % 3 + 1;
            emp.HireDate = DateTime.Now.AddMonths(-i);
            emp.Salary = 1000 + i * 1.1m;
            emp.Name = "Employee " + i;

            //Serialize emp
            XmlSerializer xs = new XmlSerializer(typeof(Employee));
            StringBuilder sb = new StringBuilder();
            StringWriter sw = new StringWriter(sb);
            xs.Serialize(sw, emp);

            //Assign values to parameters
            parmId.Value = emp.EmpoyeeId;
            parmXml.Value = sb.ToString();
            //Execute sql query
            cmd.ExecuteNonQuery();
        }
    }
}

private SqlConnection OpenConnection()
{
    SqlConnection cn = new SqlConnection("Data Source=....");
    cn.Open();
    return cn;
}

This code contains a loop in which new objects of type Employee are created. Our example sets the various properties of each object to arbitrary values, but of course a real application would be storing valid data in these properties. Once an Employee is created, it is serialized by means of an XmlSerializer, and the resulting XML string is assigned to the XML field in the table by means of a parameter, just in the same way as we would save into the database any one of the usual types of character strings.

Note that the code that serializes and saves the data is not dependent on the internal structure of the Employee class. If we were to modify this class to add a new property, the preceding code would still save it to the database. Neither the code, nor the schema of the table in the database would need to be changed.

Now that the records are written into the database, we can retrieve an Employee from the table with some code like the following:

using (SqlConnection cn = OpenConnection())
{
    string query =
        @"SELECT Details FROM Employees
          WHERE EmployeeId=2";
    SqlCommand cmd = new SqlCommand(query, cn);
    SqlDataReader rdr = cmd.ExecuteReader();
    while (rdr.Read())
    {
        //Retrieve the XML from the database field
        string xml = rdr[0].ToString();

        //Deserialize into Employee class
        XmlSerializer xs = new XmlSerializer(typeof(Employee));
        StringReader sr = new StringReader(xml);
        Employee emp = (Employee)xs.Deserialize(sr);

        //Display data
        MessageBox.Show(emp.ToString());
    }
    rdr.Close();
}

The preceding code selects a row by its ID, which is stored in a separate field outside the XML. Since this is the primary key, and it is therefore indexed, retrieving an Employee in this way is very fast. The XML is retrieved as a string and it is deserialized into a new Employee. To demonstrate that all the values have been retrieved, the code uses a message box to display the contents of the object.

Once again, notice that nothing in this code is dependent on the internal structure of the Employee class, and it would still work if new properties were added to the class.
Although the preceding code retrieved a record based on a field that was stored outside the XML, it is possible to filter on a field that is stored inside the XML. For instance, the following code retrieves one or more records based on the Name element in the XML:

using (SqlConnection cn = OpenConnection())
{
    string query =
        @"SELECT EmpoyeeId, Details
          FROM Employees
          WHERE Details.value('(/Employee/Name)[1]', 'varchar(50)')
            = 'John Smith'";
    SqlCommand cmd = new SqlCommand(query, cn);
    SqlDataReader rdr = cmd.ExecuteReader();
    while (rdr.Read())
    {
        //Retrieve the XML from the database field
        string xml = rdr["Details"].ToString();

        //Deserialize into Employee class
        XmlSerializer xs = new XmlSerializer(typeof(Employee));
        StringReader sr = new StringReader(xml);
        Employee emp = (Employee)xs.Deserialize(sr);

        //Display data
        MessageBox.Show(emp.ToString());
    }
    rdr.Close();
}

This is almost identical to the previous code that retrieved records based on the ID field. The difference is that we are now using the .value method of the XML data type in SQL Server to extract one of the values inside the XML block. The value is then compared with a string to filter the rows that are returned. For simplicity, this code contains a hardcoded value to search (“John Smith”), but of course the query can be parameterized to receive the search argument at runtime.

The .value method receives an XPATH query such as ‘(/Employee/Name)[1]‘ to determine the value that it needs to retrieve. This is a powerful language, and it allows many complex searches to be performed inside the XML.

Let’s see another example. The following query searches for employees that were hired before 1999:

SELECT EmployeeId, Details
  FROM Employees
  WHERE Details.value('(/Employee/HireDate)[1]', 'DateTime')
      < '01/01/1999'
GO

The XML data type in SQL Server also provides additional methods for querying the XML data. For example, the following query uses the .exist method to select all the employees for which the HireDate element has been provided. This would be equivalent to searching for “…where HireDate is not null” if HireDate were stored in a separate field in the table.

SELECT EmployeeId, Details
  FROM Employees
  WHERE Details.exist('Employee/HireDate') = 1
GO

Resolving queries like the preceding would normally require that SQL Server examine the XML field in all of the records in the table to determine those that satisfy the condition. In order to speed these operations you can create XML indexes. An XML field in a table in SQL server can have one primary XML index, and a maximum of three secondary indexes (“FOR PATH”, “FOR PROPERTY” and “FOR VALUE” depending on the type of queries that they are designed to optimize). The following SQL queries provide an example of how those indexes might be created.

CREATE PRIMARY XML INDEX IdxEmployeesDetailsPrimary
	ON Employees(Details)
GO

CREATE XML INDEX IdxEmployeesDetailsPath
	ON Employees(Details)
	USING XML INDEX IdxEmployeesDetailsPrimary
	FOR PATH
GO

The XML indexes are explained in the SQL Server documentation, as well as the various methods for querying and modifying the XML fields. There are also facilities for assigning an XML schema to an XML field so that it verifies the integrity of data assigned to it.

From the point of view of our client code, the interesting thing is that we are using serialization to save and retrieve our objects, so the code does not need to be modified when our classes are extended.

This technique should not be arbitrarily applied to just any table or any fields, since individual access to the data stored as XML will usually be more difficult and inefficient than if it were stored in individual fields in the table records. However, under the right circumstances, it can simplify greatly the data access code and the maintenance of the database schema.

Posts in this series:

Introduction to Serialization in the .Net Framework

Using serialization to save the state of the form

Cloning objects by means of serialization

Using XML serialization to read and write XML files that conform to a schema

Storing XML in SQL Server

Alberto Poblacion

C# MVP

Alberto has been working for the computer industry for more than 27 years, most of them as a software developer. He has held just about every possible position in software development up to and including being the CSA for a small ISV. Currently he works as a freelance developer and consultant, as well as providing training as a Microsoft Certified Trainer, mostly on the field of software development under the .NET platform. He's a Microsoft MVP in C#.

No comments yet.

Leave a Reply