I do a lot of content management system migration projects. As the content person on the team, my role is to shepherd the content through the process with the goal of it coming through if not improved, at least not degraded. Often that means arguing against automation. Here’s why.
Migration: It Isn’t Just for Developers
CMS projects are often scoped and staffed as technical projects that can be handled by developers creating automated solutions to move data from one system to another. After all, an enterprise CMS is essentially a database. And so it should be possible to approach it as though it were a database-to-database migration: map the data fields to one another, do some data normalization, run a script, test and refine the script, and you’re done. That’s all well and good if you’re working with highly structured data. Structured data has characteristics that can be programmed against.
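To make the "map, normalize, run" idea concrete, here is a minimal sketch of a scripted migration for structured records. The field names, the price format, and the sample row are all hypothetical, invented purely for illustration; a real project would take them from the actual source and target schemas.

```python
# Hypothetical field mapping: legacy column name -> new CMS field name.
FIELD_MAP = {
    "prod_name": "title",
    "prod_price": "price",
    "prod_sku": "sku",
}


def normalize(field, value):
    """Apply simple per-field cleanup before loading into the new system."""
    if isinstance(value, str):
        value = value.strip()
    if field == "price":
        # Assumption: legacy prices are stored as strings such as "$1,299.00".
        return float(str(value).replace("$", "").replace(",", ""))
    return value


def migrate_record(legacy):
    """Translate one legacy record into the new schema."""
    return {new: normalize(new, legacy[old]) for old, new in FIELD_MAP.items()}


# Sample data standing in for rows read from the old system.
legacy_rows = [
    {"prod_name": "  Desk Lamp ", "prod_price": "$1,299.00", "prod_sku": "DL-100"},
]
migrated = [migrate_record(row) for row in legacy_rows]
print(migrated)
```

This only works because every record has the same named fields in predictable formats, which is exactly the property that structured data offers.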
But web sites are made up of all kinds of data—some structured, like product data on an e-commerce site; some not, like articles, blog posts, or user-generated content like reviews. Karen McGrane refers to this problem as the war of the “blobs” (unstructured) vs. “chunks” (structured data).
These incredibly valuable but unstructured blobs of content present a few problems: they’re hard to find in the CMS (think, for example, of an author name that exists only as part of an article), they’re hard to manage, and they’re hard to migrate, particularly in an automated way. These blobs of content are also often the result of years of organic growth of a site, with multiple authors, multiple authoring environments, and no uniformity even within their lack of structure. Theoretically, it should be possible to program against a blob of well-formed HTML, but depending on all the aforementioned potentialities, that HTML may vary from one article to another even within a similar content set. Perhaps one author is more technically savvy than another and has done a little creative programming to make his or her article look a certain way. Another has used old-style markup or built the entire article as a table. Another has copied and pasted text in from Word and didn’t strip out the styles. Writing a script to handle every possible issue in legacy code, plus the cleanup when it doesn’t work perfectly, can often take as long as simply recreating the content from scratch, and recreating it can be done by considerably less expensive resources than your developers.
Even if automated migration were feasible, is it desirable? For one thing, you lose the opportunity to improve the content along the way. You seldom want to migrate a site as-is to as-is, particularly since the reason you’re migrating to begin with is that the old system isn’t supporting current needs. And there’s that issue with messy old code to deal with: migrating it without taking advantage of the chance to clean it is like moving into a new house and taking the contents of the old house without doing any organizing or winnowing of your stuff.
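One way to gauge how messy a set of blobs is before committing to a script is to scan the markup for the warning signs described above. The sketch below uses only Python’s standard-library HTML parser. The flag names, the list of "old-style" tags, and the Word-detection heuristics (class names beginning with "Mso", "mso-" style properties) are assumptions chosen for illustration, not a complete or authoritative rule set.

```python
from html.parser import HTMLParser


class BlobAuditor(HTMLParser):
    """Collects warning flags for markup that tends to break scripted migration."""

    # Presentational tags treated here as "old-style" markup (an assumption).
    OLD_STYLE_TAGS = {"font", "center"}

    def __init__(self):
        super().__init__()
        self.flags = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "table":
            # May be real tabular data or a layout table; needs a human look.
            self.flags.add("table")
        if tag in self.OLD_STYLE_TAGS:
            self.flags.add("old-style-markup")
        css_class = attributes.get("class") or ""
        style = attributes.get("style") or ""
        if css_class.startswith("Mso") or "mso-" in style:
            self.flags.add("word-paste")
        elif style:
            self.flags.add("inline-style")


def audit(html):
    """Return a sorted list of warning flags found in one blob of HTML."""
    auditor = BlobAuditor()
    auditor.feed(html)
    auditor.close()
    return sorted(auditor.flags)


print(audit('<p class="MsoNormal" style="mso-margin-top-alt:auto">Hello</p>'))
print(audit('<table><tr><td><font size="2">Article</font></td></tr></table>'))
print(audit("<p>Clean paragraph</p>"))
```

A scan like this does not fix anything. It only shows how many different kinds of mess a migration script would have to handle, which helps when weighing automation against manual recreation.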
Getting Structured
The hot topic in content strategy now is about semantic markup and how to structure content—which has benefits beyond migration, of course. See, for example, Sara Wachter-Boettcher’s book “Content Everywhere” or Cleve Gibbon’s excellent series on content modeling.
As a forward-looking practice, this is unquestionably the right approach, particularly when setting up a new content management system: develop a content model and plan carefully how content needs to be managed for reuse and multi-channel, multi-device publishing. But that doesn’t address the issue of how we get there from here.
Here, of course, being the legion of legacy websites that are full of unstructured blobs of content desperately in need of review, clean-up, and a strategy for reuse and retention. How do we tackle these metaphorical Augean stables?
Seldom do web teams and content strategists have the resources they need to carefully tend each and every piece of content. And not all of that content may even warrant that type of resource-intensive, high-touch effort. So we need to think about how to be smart and strategic about where to focus. Prior to any migration or cleanup effort, we should be determining the highest value content. Getting to that list may require juggling user needs (determined by metrics and customer feedback) and business needs (content that is critical for conversion or content that has to be retained for legal or regulatory reasons). Some content may be valuable for both user and business reasons but because of its ephemeral nature, such as some types of user-generated content, may not be worth devoting much time to.
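The juggling act described above can be roughed out as a simple triage rule applied to each inventoried page. Everything in this sketch is hypothetical: the field names, the traffic threshold, and the four outcome labels are illustrative assumptions, and a real team would set them from its own metrics and requirements.

```python
from dataclasses import dataclass


@dataclass
class ContentItem:
    url: str
    monthly_views: int         # user-need signal from analytics
    conversion_critical: bool  # business-need signal
    legal_retention: bool      # must be kept for legal or regulatory reasons
    ephemeral: bool            # e.g. short-lived user-generated content


HIGH_TRAFFIC = 1000  # hypothetical threshold for "high user value"


def triage(item):
    """Suggest how much effort a page deserves during migration."""
    if item.ephemeral:
        # Possibly valuable, but not worth devoting much time to.
        return "low-touch"
    if item.conversion_critical or item.monthly_views >= HIGH_TRAFFIC:
        return "high-touch"
    if item.legal_retention:
        return "retain-as-is"
    return "decide-in-audit"


pages = [
    ContentItem("/pricing", 5000, True, False, False),
    ContentItem("/reviews/123", 4000, False, False, True),
    ContentItem("/terms-2009", 12, False, True, False),
    ContentItem("/old-press-release", 3, False, False, False),
]
for page in pages:
    print(page.url, triage(page))
```

A rule this crude is a starting point for discussion, not a replacement for the qualitative audit. Its value is in forcing the team to state which signals count and in what order.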
Know What You Have to Know What to Do
Any migration project should begin with a content inventory. If you’ve gotten to the point of planning a migration without doing an inventory and audit, stop. Back up and take the time to tackle that all-important step. Before you can make any decisions about what you’re going to migrate and how, you need to know what you have, what format it’s in, and how much of it there is. Once you know what you have to work with, the next step is to do a qualitative audit to determine which content, as mentioned above, is worth the effort of migrating to begin with, let alone which warrants spending time and effort to structure properly.
Once you’ve determined what you’re migrating and your CMS has been designed to support content modeling (based on your business requirements for management and reuse), what can’t be automated still needs to be migrated. The reality is that this unstructured content will probably have to be migrated manually. See David Hobbs’ article on what content can be automated and what must be manual. The list for manual migration may be smaller than you think. And even if there is enough structure to enable automation, if the content set is small, it may still be a better investment of time and resources to recreate it manually.
Turning Challenge into Opportunity
If the content audit turned up other issues in your content, such as inconsistent branding or voice or other stylistic issues, take the opportunity to address those in the process of recreating the pages. After all, it’s likely that the most unstructured pages are also the ones most vulnerable to variation, as discussed above. This is why it’s critical that the project be scoped from the beginning to include enough time and resources for the content team.
Read more on the role of the content strategist in a CMS project.