The Difference Between Content in a Database and Content as a Database

A great deal of content lives in databases these days, but there is a world of difference between storing content in a database and treating content as a database. The essence of the difference is this: Content in a database is an object that can be retrieved from its indexed location, like locating a dining room chair in an Ikea warehouse. Content as a database is a record that can be examined and presented from different angles based on different properties, which can be selected based on any of these properties, and which can be related to other records based on common properties.

A physical metaphor is more difficult here, because this is precisely the kind of thing that is really hard to do in the physical world. In the physical world you can organize your bookshelf by title or by author or by publisher or by size or by color or by height, but whatever you choose, you a preferring one property over all the others. In a database, all properties are equal and you can create a report on your bookshelf that is organized by title or by author or by publisher or by size or by color or by height or by any other property that is separately addressable in the database.

That's the key: any property that is separately addressable. The difference between content in a database and content as a database is that content in a database does not make its properties separately available and content as a database does. There are two components to the move from content in a database to content as a database. The first is the storage method, and the second is the structure of the content itself.

Is the content accessible?

It is important to understand that an XML document — any XML document — is a database. That is, it is an ordered collection of data from which individual fields can be retrieved by a query. We can see this in the following XHTML document:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Rio Lobo Review</title>
  </head>
  <body>
    <h1>Rio Lobo Review</h1>
    <p><a href="http://en.wikipedia.org/wiki/Howard_Hawks">Hawkes'</a> final film is a
    lighthearted Western in the <a href="http://www.imdb.com/title/tt0053221/">Rio
    Bravo</a> mold, with <a href="http://www.johnwayne.com/">the Duke</a> as an ex-Union
    colonel out to settle some old scores.</p>
  </body>
</html>

We can write a query against this document to retrieve a list of a elements. Here is that query written as an XPath expression:

/html/body/p/a

If executed by an XPath processor (usually found inside something else, like an XSLT interpreter), it would return the following:

<a href="http://en.wikipedia.org/wiki/Howard_Hawks">Hawkes'</a>
<a href="http://www.imdb.com/title/tt0053221/">Rio Bravo</a>
<a href="http://www.johnwayne.com/">the Duke</a>

So, this XHTML document is a database that we can query. But what happens when we put that document into a larger database? That depends on which database we put it into. If we were to put it into an ordinary relational database as a text field inside a record (the way this post is stored by WordPress as a field in a MySQL database), could we then run the same query over multiple documents in the database? No. The database structure that is our XHTML document is not accessible to the query engine of our relational database. Our document has become just another record in that database. It is in the database, but we can no longer treat is as a database unless we pull it out of the relational database again.

Other storage options can eliminate this problem. Using an XML database management system like MarkLogic or eXist will mean that our document can be queried as a database while it is still in the database.

Another approach commonly used by content management systems is to pull certain fields out of an XML document as it is being stored and to cache their values in the relational database table where the database management system can query them with SQL. So, for instance, a CMS might read our XHTML document, pull out the content of /html/head/title and store it in a separate title field in the relational table (thus perhaps saving the author from having to create a metadata record for every document every time they stored it). You would then be able to query the CMS for a document with that title. But breaking down the full structured of a hierarchical XML document into relational tables is difficult, and the resulting structures are so complicated that it is difficult to write an effective query on them.

This then, is the first issue: if you want to treat your content as a database, it needs to be somewhere where you can query it as a database. Sometimes it may actually be easier to do this if your XML documents are sitting in a file system than if they are in a big CMS system that does not support querying them in this way, so this is an important question to ask before you write a check to a CMS vendor.

Is the content's structure useful?

The second issue is this: is your document a useful database? Every XML document is a database, but not every XML document is a useful database. Our XHTML document is not a particularly useful database. We can query it for a list of a elements, but that really does not make our lives or our readers lives better in any spectacular way. XHTML is meant for formatting content in a browser, and that is pretty much where its usefulness stops.

Let's see if we can make the same content into a more useful database using (abbreviated) SPFE markup.

<movie-review>
  <head>
    <base-uri>http://example.com/movies/reviews</base-uri>
    <name>rio-lobo</name>
    <title>Rio Lobo Review</title>
    <index>
      <entry>
        <type>movie</type>
        <term>Rio Lobo</term>
      </entry>
    </index>
  </head>
  <body>
  <title>Rio Lobo Review</title>
  <p><director name="Howard Hawkes">Hawkes'</director> final film is a
lighthearted Western in the <movie>Rio Bravo</movie> mold,
with <actor name="John Wayne">the Duke</actor> as an ex-Union
colonel out to settle some old scores.</p>
  </body>
</movie-review>

This is the same content, but it is now a much more useful database. For instance, we can do something more useful than making a list of links. We could make a list of all the movie reviews that mention John Wayne by running a query like this against all our movie-review documents:

/movie-review[body/p/actor/name='John Wayne']/title

But not only can we use the actor elements to identify topics that mention a particular actor, we can also use them to generate links from the actor element to topics about that actor. Suppose we had a topic on John Wayne in our collection.

<actor-bio>
  <head>
    <base-uri>http://example.com/movies/actors/bios</base-uri>
    <name>john-wayne</name>
    <title>John Wayne Bio</title>
    <index>
      <entry>
        <type>actor</type>
        <term>John Wayne</term>
      </entry>
    </index>
  </head>
  <body>
  <title>John Wayne</title>
  <p><actor name="John Wayne">the Duke</actor> starred in many westerns,
   including <director name="Howard Hawkes">Hawkes'</director>
   <movie>Rio Lobo</movie>.</p>
  </body>
</actor-bio>

When the build encounters the markup <actor name="John Wayne">the Duke</actor> in the Rio Lobo review topic, it can use a query to find if there is an actor-bio topic for this actor:

/actor-bio/head[/index/entry/type='actor'][term='John Wayne']/name

These examples only scratch the surface, but they illustrate the kinds of possibilities that you open up when you start to treat you content as a database, rather than simply storing it in a database.

Tags: , , , , , , , ,

{ 2 comments to read ... please submit one more! }

  1. Kel Mohror

    I have “bought-in,” Mark. Do I just create all my content coded as xml? Or can XHTML be applied? Is there a more efficient method for developing “useful Content-as-a-DataBase”? CaaDB has “breakthrough” utility that can be the foundation for a micro-enterprise.

    • Hi Kel,

      It’s not really about encoding so much as design and how you use metadata. Creating content as a database is all about attaching the right metadata to the content at the right level of granularity to support the queries you want to make. But it is also about designing an writing the content in a consistent manner so that when the database is queried, the result is correct and reliable.

      XML will certainly let you attach metadata to content, and guide the author to create consistent reliable content, but not all XML encodings put the right metadata in the right place for every business process.

      Other forms of encoding will let you do it too. XHTML by itself probably doesn’t do it, but XHTML fragments stored in a relational database table might. This is a technique used in many Web CMS. HTML5 goes a lot further than XHTML in its support for embedded metadata, though its structure is much looser than you would normally want for a database. I would generally regard all forms of HTML strictly as output formats.

{ 0 Pingbacks/Trackbacks }