Migrating static HTML pages to a Web CMS/Blog
If you have a number of old web sites or even a significant number of static HTML web pages then you may decide that moving them into a web based content management system (CMS) or even a blog might provide a benefit.
First, reasons NOT to migrate:
- not enough time, money, resources
- static pages actually ‘perform’ better (i.e. they will display faster since there are no, or fewer, dynamic components); in fact, many CMS solutions include ‘caching’, so after you import a page the CMS may simply generate a new static copy of it…
Some reasons to migrate pages:
- you have good/excellent content and you want to take advantage of new web-related presentation solutions
- you want to use one tool for your content editing, layout and design as well as some level of content management
- you want to maintain a consistent feature set and presentation across your domains
Ok - how to proceed?
Unfortunately, most web-based CMS tools** (that I have reviewed) offer limited, if any, static HTML page import/migration options. The problem is that these CMS solutions are really Presentation Management Solutions; in my view, a CMS should provide an easy/simple means to both import and export data/pages/images/whatever. Drupal, CMS Made Simple, WordPress and other solutions usually do offer some level of ‘page/node’ import, but the options are limited so it’s usually a page at a time; some automation is definitely needed. Note that there are quite a number of plugins for various types of data import, but I have not located a simple tool to handle importing multiple static HTML pages or even an entire tree of such pages.
When you start your hunt (run some search engine queries) for a solution you will (most likely) find a number of discussions suggesting:
- just use the presentation tools and create new pages and manually copy your content (that’s ok if you only have a handful of pages OR if you have LOTS of time…)
- hire a programmer to write a custom conversion tool
- just start over using the wonderful web-based CMS solution that you have chosen
Hmm, none of the preceding solutions work for me (I have hundreds of pages to ‘convert’.) If I can type data into these systems then surely I should be able to auto-input existing data. We are using databases, right? When you start digging you will most likely find that these solutions have abstracted data off into some far corner of the system (remember, it’s not about the data but about the presentation.) If you have not already engaged a consultant/programmer to ‘convert/import’ your existing pages then you might proceed in this fashion:
- do a test export from your CMS of choice
- analyze the format of the exported file
- create a tool to generate the same format of file from your existing, static pages
- generate converted files appropriate for import and bring them into your new CMS via its ‘import’ function/tool/module
Before you start you need to consider/evaluate your static HTML files:
- do they contain any formatting that you need to remove?
- do they contain other ‘ingredients’ that you want to remove? OR
- taking a different approach, what parts of these pages do you want to retain?
In my test case I have files that were created using Microsoft FrontPage. The pages utilize CSS and ‘shared borders’ as well as custom header and footer sections. I want to extract the page body as well as retain some head data (the <title> element, the keywords meta tag, etc.) Since these pages were created following a standard layout it is simple to extract the components that I want to retain. Ok, I have my raw input pages - how do I get them into the CMS?
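As a rough illustration of the extraction step, here is a minimal Python sketch that pulls the title, the keywords meta tag and the body markup out of one page using only the standard library. The class and function names (PageExtractor, extract_page) are made up for this example, and it assumes reasonably well-formed pages; HTML comments (which is where FrontPage keeps its ‘webbot’ instructions) are simply dropped.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the <title> text, the keywords meta tag and the raw <body> markup."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.title = ""
        self.keywords = ""
        self.body_parts = []
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs_dict = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs_dict.get("name") or "").lower() == "keywords":
            self.keywords = attrs_dict.get("content") or ""
        elif tag == "body":
            self._in_body = True
        elif self._in_body:
            # Re-emit the tag exactly as it appeared in the source.
            self.body_parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> or <meta ... />.
        if self._in_body:
            self.body_parts.append(self.get_starttag_text())
        else:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif self._in_body:
            self.body_parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body:
            self.body_parts.append(data)

    def handle_entityref(self, name):
        if self._in_body:
            self.body_parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._in_body:
            self.body_parts.append(f"&#{name};")


def extract_page(html_text):
    """Return a dict with the title, keywords and body markup of one page."""
    parser = PageExtractor()
    parser.feed(html_text)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.keywords,
        "body": "".join(parser.body_parts).strip(),
    }
```

Stripping the shared borders and the custom header/footer sections would be an extra step on top of this, and how to do it depends on how your own pages mark those regions.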
Using 20 input pages as a test I did have a small level of success trying CMS Made Simple–> Import_Content (a nice plugin!); with this module you can establish page relationships as well as apply some formatting on input, but, if you are importing multiple pages then they will all be tied to one page (imagine selecting a menu option and then getting a drop-down list that contains hundreds of items - not a solution for my project.) At this point, what is missing is some way to retain overall page structure/relationships (or create new relationships on the fly) while importing the static HTML pages.
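One way to recover page relationships, sketched below in Python, is to read them off the folder layout of the static site: each page is attached to the index page of its folder, and each folder index is attached to the index of the folder above it. This assumes that your folder tree mirrors the menu structure you want and that folder landing pages are named index.htm or index.html; both are assumptions about your site, and the function names are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed naming convention for folder landing pages.
INDEX_NAMES = {"index.htm", "index.html"}


def parent_of(path, known):
    """Return the index page that should act as the parent of `path`, or None."""
    p = PurePosixPath(path)
    # An index page hangs under the index of the folder above its own folder;
    # any other page hangs under the index of its own folder.
    folder = p.parent.parent if p.name in INDEX_NAMES else p.parent
    while True:
        for name in sorted(INDEX_NAMES):
            candidate = str(folder / name)
            if candidate in known and candidate != path:
                return candidate
        if folder == folder.parent:  # reached the top of the tree
            return None
        folder = folder.parent  # no index here, try one level up


def build_relationships(paths):
    """Map every page path (relative, '/'-separated) to its parent page path."""
    known = set(paths)
    return {path: parent_of(path, known) for path in sorted(known)}
```

The resulting mapping could then drive whatever the CMS uses to express hierarchy (a parent page, a menu parent, and so on) as each page is imported.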
Data Relationships (mapping static HTML into your CMS database)
For your CMS/Blogging solution of choice you will need to acquire some level of understanding of both the structure of your static HTML data and how your CMS stores its data; at that point you can develop a mapping strategy to move your static pages into the CMS system. It sure would be useful if the CMS included some documentation on the database structure; I did not locate any, so time for code digging. WAIT! I found one possibly useful post about how to approach automating import into Drupal. On his blog, Adam Smith proposes a two step process to import pages into Drupal 6:
- create a file that contains your web page layout structure definition and then
- run a PHP script that both reads your layout specification and then plugs your static pages into the DB - looks promising!
Step one establishes the structure needed to avoid the import->layout problem I noted above. Unfortunately, as written the import script did not work on my system… However! It does provide some guidance into the Drupal ‘node/menu’ structure. Extracting a PHP code snippet we see:
// create the page
$node = new StdClass();
$node->uid = 1;
$node->type = 'page';
$node->status = 1; // published
$node->promote = 0; // don't promote to front page
$node->path = $path; // ?q=path
$node->format=3; // full HTML
$node->title = $title;
$node->body = ''; // add later
node_save($node);
$parentLevel = $level-1;
$parentLevelInfo =& $levels[$parentLevel];
// create the menu item
$menuItem = array();
$menuItem['plid'] = $parentLevelInfo[0];
$parentLevelInfo[1]++;
$menuItem['weight']=$parentLevelInfo[1];
$menuItem['link_path']='node/' . $node->nid;
$menuItem['link_title']=$label;
$menuItem['type']=118; // see includes/menu.inc
menu_link_save($menuItem);
A quick review of the above code and we can see many of the database elements used for Drupal ‘nodes’ along with some guidance on the node path/menu structure. In my case I want to use something like:
- Menu Item –> Overall Topic/Subject Summary Page –> Search page OR
- Menu Item –> Overall Topic/Subject Summary Page –> Sub-Topics –> Nodes/pages.
Taking this a bit further, I installed a Drupal plugin to ‘export a node’ and got this output from test Blog and Page posts:
node(code(
  'nid' => NULL,
  'type' => 'blog',
  'language' => '',
  'uid' => '1',
  'status' => '1',
  'created' => NULL,
  'changed' => '1226355506',
  'comment' => '2',
  'promote' => '1',
  'moderate' => '0',
  'sticky' => '0',
  'tnid' => '0',
  'translate' => '0',
  'vid' => NULL,
  'revision_uid' => '1',
  'title' => 'Test blog entry',
  'body' => 'This is a test blog entry This is a test blog entry This is a test blog entry This is a test blog entry ',
  'teaser' => 'This is a test blog entry This is a test blog entry This is a test blog entry This is a test blog entry ',
  'log' => '',
  'revision_timestamp' => '1226355506',
  'format' => '1',
  'name' => 'dale',
  'picture' => '',
  'data' => 'a:0:{}',
  'last_comment_timestamp' => '1226355506',
  'last_comment_name' => NULL,
  'comment_count' => '0',
  'taxonomy' => array ( ),
  'files' => array ( ),
  'menu' => NULL,
  'path' => NULL,
))

node(code(
  'nid' => NULL,
  'type' => 'page',
  'language' => '',
  'uid' => '1',
  'status' => '1',
  'created' => NULL,
  'changed' => '1226356193',
  'comment' => '0',
  'promote' => '0',
  'moderate' => '0',
  'sticky' => '0',
  'tnid' => '0',
  'translate' => '0',
  'vid' => NULL,
  'revision_uid' => '1',
  'title' => 'Test PAGE',
  'body' => 'Thi sis a test page. Thi sis a test page. Thi sis a test page. Thi sis a test page.',
  'teaser' => 'Thi sis a test page. Thi sis a test page. Thi sis a test page. Thi sis a test page.',
  'log' => '',
  'revision_timestamp' => '1226356193',
  'format' => '1',
  'name' => 'dale',
  'picture' => '',
  'data' => 'a:0:{}',
  'last_comment_timestamp' => '1226356193',
  'last_comment_name' => NULL,
  'comment_count' => '0',
  'taxonomy' => array ( ),
  'files' => array ( ),
  'menu' => NULL,
  'path' => NULL,
))
Using the above layout it would be relatively simple to create a script to re-format the static HTML files into the structure used by the Import/Export plugins available for the version of Drupal that I am using, and then import a page at a time… Using something like Adam’s script mentioned above (written in your preferred language) would be one approach to automating the movement of the data into the system. You might also consider using XML-RPC tools, which work really well when moving data from one database to another database.
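To make the ‘re-format’ idea concrete, here is a hedged Python sketch that renders a title and body in the same layout as the ‘page’ export shown earlier. The field list and the node(code( … )) wrapper are copied from that sample output; whether your import plugin accepts exactly this text is an assumption that I have not verified, and the function names are hypothetical.

```python
import time


def php_literal(value):
    """Render a Python value in the PHP-array style seen in the export."""
    if value is None:
        return "NULL"
    if isinstance(value, dict):
        items = " ".join(
            f"{php_literal(k)} => {php_literal(v)}," for k, v in value.items()
        )
        return f"array ( {items} )" if items else "array ( )"
    # Escape backslashes and single quotes for a single-quoted PHP string.
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{text}'"


def make_page_node(title, body, author, now=None):
    """Build the field set of the 'page' export, filled from one static page."""
    stamp = str(int(now if now is not None else time.time()))
    return {
        "nid": None, "type": "page", "language": "", "uid": "1",
        "status": "1", "created": None, "changed": stamp, "comment": "0",
        "promote": "0", "moderate": "0", "sticky": "0", "tnid": "0",
        "translate": "0", "vid": None, "revision_uid": "1",
        "title": title,
        "body": body,
        "teaser": body,  # the sample export repeats the body as the teaser
        "log": "", "revision_timestamp": stamp, "format": "1",
        "name": author, "picture": "", "data": "a:0:{}",
        "last_comment_timestamp": stamp, "last_comment_name": None,
        "comment_count": "0", "taxonomy": {}, "files": {},
        "menu": None, "path": None,
    }


def render_export(node):
    """Return the node in the one-line layout of the sample export."""
    fields = " ".join(
        f"{php_literal(k)} => {php_literal(v)}," for k, v in node.items()
    )
    return f"node(code( {fields} ))"
```

Calling render_export(make_page_node(title, body, author="dale")) for each converted page would give one block of text per page to paste or feed into the import form.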
Note that you *should* be able to locate the Drupal database Schema for each table in the database that you are using; look under the module folders where data structures are declared, i.e. ~/modules/node/node.install. A partial listing from the node.install file:
function node_schema() {
  $schema['node'] = array(
    'description' => t('The base table for nodes.'),
    'fields' => array(
      'nid' => array(
        'description' => t('The primary identifier for a node.'),
        'type' => 'serial',
        'unsigned' => TRUE,
        'not null' => TRUE),
      'vid' => array(
        'description' => t('The current {node_revisions}.vid version identifier.'),
        'type' => 'int',
        'unsigned' => TRUE,
        'not null' => TRUE,
        'default' => 0),
      'type' => array(
        'description' => t('The {node_type}.type of this node.'),
        'type' => 'varchar',
        'length' => 32,
        'not null' => TRUE,
        'default' => ''),
      // ... (listing continues)
In addition to examining the actual database tables you can start a command shell and then ‘cd’ to your Drupal install folder and use the command below to locate additional files for review.
grep -li schema modules/*/*install
(note that your list will vary, since it depends upon which modules you have installed…)
- modules/aggregator/aggregator.install
- modules/block/block.install
- modules/blogapi/blogapi.install
- modules/book/book.install
- modules/comment/comment.install
- modules/contact/contact.install
- modules/datasync/datasync.install
- modules/dblog/dblog.install
- modules/docapi/docapi.install
- modules/filter/filter.install
- modules/forum/forum.install
- modules/job_queue/job_queue.install
- modules/locale/locale.install
- modules/menu/menu.install
- modules/node_import/node_import.install
- modules/node/node.install
- modules/openid/openid.install
- modules/poll/poll.install
- modules/profile/profile.install
- modules/search/search.install
- modules/statistics/statistics.install
- modules/system/system.install
- modules/taxonomy/taxonomy.install
- modules/trigger/trigger.install
- modules/update/update.install
- modules/upload/upload.install
- modules/user/user.install
- modules/views/views.install
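If you would rather see which tables each module declares than just which files mention ‘schema’, a small Python sketch along these lines could help. It relies on the $schema['table_name'] = array( pattern visible in the node.install listing; other modules may declare their tables differently, so treat the regular expression as an assumption. The function names are hypothetical.

```python
import re
from pathlib import Path

# Matches declarations such as: $schema['node'] = array(
SCHEMA_TABLE = re.compile(r"""\$schema\[\s*['"]([A-Za-z0-9_]+)['"]\s*\]\s*=""")


def tables_in_install_text(text):
    """Return the table names declared in one .install file, in order."""
    seen = []
    for name in SCHEMA_TABLE.findall(text):
        if name not in seen:
            seen.append(name)
    return seen


def tables_by_module(drupal_root):
    """Map each module folder name to the tables its .install file declares."""
    result = {}
    for path in sorted(Path(drupal_root).glob("modules/*/*.install")):
        text = path.read_text(encoding="utf-8", errors="replace")
        tables = tables_in_install_text(text)
        if tables:
            result[path.parent.name] = tables
    return result
```

Run it from anywhere by passing the path of your Drupal install folder, e.g. tables_by_module("."), and print the result.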
The ideal solution for this data-migration/import task would be one that was part of the content management system - please let me know if you find (or write) one! The simplest solution will be to use SQL statements and load the data into appropriate tables - the only problem with this approach is that you lose any auto-magic data tagging that the CMS might be doing (i.e. updating counters, indexes, secondary files, etc.)
** - If I were reviewing commercial CMS solutions I would expect to find data import/conversion tools as standard features…
Update - there is a tool (module) for Drupal 5.x that might provide a round-about solution for importing static HTML into Drupal 6.
- install, configure Drupal 5.x
- install, configure the module Html_Import
- import your static pages, review and refine as desired
- install, configure Drupal 6.x
- follow the documented migration steps for moving from Drupal 5.x to Drupal 6.x
In my limited test I encountered a number of problems, mostly having to do with PHP XML tools (required at the OS level), so I did not move past step 3 above. In general, wonderful as they might be, many Open Source solutions depend on myriad other Open Source solutions. The combination of all of these variables may make the Open Source approach more time-consuming than productive; that is something to consider when weighing Open Source against commercial solutions. When tackling somewhat complex problems, Open Source can be a win-win if you have adequate resources to put into the effort AND there is a long-term benefit. If your needs are simple/basic then Open Source solutions are a win as soon as you start using them.
From my chair (FMC), the fastest way to import static HTML pages is to simply create SQL statements to load the data into your combination of frontend/database (Drupal, CMSMS, WordPress, etc.) If needed, you could follow up the raw data import with SQL statements to update counters and indexes…
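For what the direct SQL route might look like, here is a minimal Python sketch that runs against an in-memory SQLite database so it can be tried without touching a real site. The two tables are simplified stand-ins that borrow only a few column names from the node.install listing and the node export above; a real Drupal 6 database has more columns and more tables to keep in step, so check the schema of your own install before adapting this. The function names are hypothetical.

```python
import sqlite3


def create_stand_in_tables(conn):
    """Create simplified stand-ins for the node and node_revisions tables."""
    conn.executescript(
        """
        CREATE TABLE node (
            nid    INTEGER PRIMARY KEY AUTOINCREMENT,
            vid    INTEGER NOT NULL DEFAULT 0,
            type   TEXT    NOT NULL DEFAULT '',
            title  TEXT    NOT NULL DEFAULT '',
            uid    INTEGER NOT NULL DEFAULT 0,
            status INTEGER NOT NULL DEFAULT 1
        );
        CREATE TABLE node_revisions (
            vid    INTEGER PRIMARY KEY AUTOINCREMENT,
            nid    INTEGER NOT NULL,
            title  TEXT,
            body   TEXT,
            teaser TEXT,
            format INTEGER
        );
        """
    )


def load_page(conn, title, body, uid=1, fmt=1):
    """Insert one page and its first revision; return the new node id."""
    with conn:  # one transaction per page: commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO node (type, title, uid, status) VALUES ('page', ?, ?, 1)",
            (title, uid),
        )
        nid = cur.lastrowid
        cur = conn.execute(
            "INSERT INTO node_revisions (nid, title, body, teaser, format) "
            "VALUES (?, ?, ?, ?, ?)",
            (nid, title, body, body, fmt),
        )
        # Point the node at the revision that was just created.
        conn.execute("UPDATE node SET vid = ? WHERE nid = ?", (cur.lastrowid, nid))
    return nid
```

The placeholders (?) let the database driver handle quoting, which matters here because page bodies are full of quotes and angle brackets.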
















