Migrating Static HTML pages to WordPress CMS

By Dale Reagan | September 16, 2009

I previously posted about migrating static pages to Drupal – in this post I will explore importing/migrating data into WordPress.

Solution A – a simple approach to import new POSTS from static HTML files.

At this time, WordPress is one of the best documented (IMO) Open Source solutions; the project’s information sharing model is a great example to replicate.  The WordPress Codex contains links to information for end users as well as for programmers, including, in this case, notes on data import.  Here is a snippet from the Codex page on importing:

Importing from [X]HTML

Using trial and error one can make an e.g., perl script to concatenate [X]HTML files as RSS <item>s, saving into a single file.xml, then import that as RSS. Note however to first remove any newlines between <p>..</p>s, as mentioned above.

The format allowed is quite simple in fact. Just make each HTML file into an <item> as below and concatenate them together:

<item>
<pubDate>Wed, 30 Jan 2009 12:00:00 +0000</pubDate>
<category>Kites</category>
<category>Taiwan</category>
<title>Fun times</title>
<content:encoded><p>What great times we had...</p><p>And then Bob...</p></content:encoded>
</item>
<item>
...
</item>

Just be sure the <content:encoded> line is a single long line with no newlines embedded.
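
The single-line requirement can be met with a small filter. This is a minimal sketch, assuming the body markup has already been extracted and is arriving on standard input; the function name flatten_html is hypothetical and is not part of the Codex text.

```bash
#!/bin/bash
# flatten_html (hypothetical helper): read HTML on stdin and write it
# to stdout as one line, so it can sit inside <content:encoded>.
flatten_html() {
    # turn CR and LF into spaces, then squeeze runs of spaces into one
    tr '\r\n' '  ' | tr -s ' '
}

# usage: two paragraphs spread over several lines become one line
printf '<p>What great times\nwe had...</p>\n<p>And then Bob...</p>\n' | flatten_html
echo
```

Replacing newlines with spaces (rather than deleting them) keeps words that were split across lines from running together.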

The information above is enough to create a simple script (Bash, Ksh, Perl, etc.) to extract textual information from HTML pages and create an RSS import file.  In my initial, simple test I will take ~300 static HTML files and use the ‘formula’ above to create an RSS file for import into WordPress.  The original HTML files used in this test were created using Microsoft FrontPage, so they contain formatting that I will be removing (something you may need to do as well; I considered using tools like ‘html2text’ to strip all HTML).

A very simple Bash shell loop (providing you have simple, non-embellished HTML) would be something like:

#!/bin/bash
# mini loop to extract HTML title & body and output simple RSS
# WARNING - NO error/data checking - files will be REPLACED
# (C) 2009 Dale Reagan - http://web-tech.ga-usa.com/
#####################
## the RSS output will use the original file name + '.rss'
## the output_rss function prints lines between '<body' and '</body>'
## and removes 'body' markup; you can add additional sed lines to remove other markup
#####################
function output_rss {
echo "<item>"
## %z appends the numeric time zone, as in the sample <pubDate> above
echo "<pubDate>$(date '+%a, %d %b %Y %T %z')</pubDate>"
echo "<category>Sample Category</category>"
## first line containing <title> (assumes the whole title sits on one line)
grep -i "<title>" "${FILE}" | head -1
printf "<content:encoded>"
## tr turns CR/LF into spaces so <content:encoded> stays on a single line
awk '/<body/, /<\/body>/' "${FILE}" | \
         sed -e 's/<body[^>]*>//g' | \
         sed -e 's/<\/body>//g' | \
         iconv -f ISO-8859-1 -t UTF-8 | \
         tr '\r\n' '  '
echo "</content:encoded></item>"
}
###### main loop
for FILE in *.html
do
    output_rss > "${FILE}.rss"
done
## combine all *.rss files
cat *.rss > wp-import.RSS
echo "**********"

Now select a few of the new .rss files for ‘import testing’ using WordPress –> Tools –> Import –> RSS.  Import and review – if all is well then you could import all files at once by selecting the ‘wp-import.RSS’ file (as generated in the example above.)
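
One way to prepare such a trial run is to combine only the first few per-page files into a separate import file. The sketch below is an illustration only; the helper name make_sample and the output file name are hypothetical, and the output name should not end in ‘.rss’ so it is not picked up by the pattern.

```bash
#!/bin/bash
# make_sample (hypothetical helper): concatenate the first N per-page
# .rss files in a directory into one small file for a trial import.
# Usage: make_sample DIR N OUTFILE
make_sample() {
    local dir=$1 count=$2 out=$3
    local f n=0
    : > "$out"                          # start with an empty output file
    for f in "$dir"/*.rss; do
        [ -e "$f" ] || continue         # skip the literal pattern when nothing matches
        [ "$f" -ef "$out" ] && continue # never append the output file to itself
        [ "$n" -ge "$count" ] && break  # stop once N files have been added
        cat "$f" >> "$out"
        n=$((n + 1))
    done
    echo "$n"                           # report how many files were combined
}

# example: combine the first 5 generated files into wp-sample.xml
# make_sample . 5 wp-sample.xml
```

If the small import looks right in WordPress, move on to the full combined file.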

This is a simple starting point if you have more files than can be migrated quickly by hand. Note that even after an auto-magic import, review and editing will most likely be required, regardless of the conversion/migration method (including copy & paste).  As you review the import you may find additional text patterns that could be either edited or excluded by the conversion script.  It is also a good idea to remove all HTML formatting (fonts, tables, etc.) since WordPress themes provide the ‘look’ for your content.
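
As an illustration of removing formatting, the filter below drops font tags and a couple of presentational attributes. It is a rough sketch, not a full HTML cleaner: the function name strip_formatting and the list of tags and attributes are assumptions, and markup that spans several lines will not be caught by these line-based sed expressions.

```bash
#!/bin/bash
# strip_formatting (hypothetical helper): read HTML on stdin and drop
# presentational markup. The tag/attribute list is an assumption;
# extend it after reviewing your own pages.
strip_formatting() {
    sed -e 's/<\/\{0,1\}font[^>]*>//g' \
        -e 's/<\/\{0,1\}FONT[^>]*>//g' \
        -e 's/ style="[^"]*"//g' \
        -e 's/ align="[^"]*"//g'
}

# usage: prints <p>Hello</p>
printf '<p align="center"><font face="Arial" size="2">Hello</font></p>\n' | strip_formatting
```

A filter like this could be added to the pipeline in the conversion script, before the encoding conversion.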

Summary of Conversion/Migration Steps

  1. review original web pages for structure (are there relevant categories?)
  2. after installing WordPress, create relevant categories
  3. convert simple HTML to simple RSS files (i.e. using something like the simple script above)
  4. import RSS files using the WordPress RSS import tool
  5. clean-up/edit the resulting POSTS

Note that the RSS specification is very basic and as described above you can only auto-import POSTS and CATEGORIES (no pages or tags.)

I may explore doing a more complete static HTML to WordPress migration (i.e. import an entire site) – if you run the WordPress export process, you can examine the output file and build a more elaborate conversion process that creates posts, pages, categories, tags and whatever other structures the WordPress import/export process supports.  After creating such a file you could simply import the data into a WordPress installation.  As mentioned in the post about Drupal import/export, if you have a small number of HTML pages then a simple copy & paste sequence is usually faster than creating custom tools to convert your static HTML pages…
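
A quick way to start examining an export file is to list which elements it contains and how often each appears. This is only a sketch: the helper name list_elements and the file name in the example are hypothetical, and the pattern matches opening tags by text rather than parsing the XML.

```bash
#!/bin/bash
# list_elements (hypothetical helper): print each distinct opening
# element name found in an XML file with a count, most frequent first.
list_elements() {
    grep -o '<[A-Za-z][A-Za-z0-9_:-]*' "$1" | tr -d '<' | sort | uniq -c | sort -rn
}

# example (file name is a placeholder for your own export file):
# list_elements wordpress-export.xml
```

The resulting list shows which structures a conversion script would need to generate.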

As always, your mileage will vary. 🙂

Notes: iconv – Convert encoding of given files from one encoding to another (in the example above I needed to convert high-bit characters that originated in emails and other on-line documents.)
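
If some of your files are already UTF-8, converting them again from ISO-8859-1 would garble them. The sketch below checks first and converts only when needed. The helper name to_utf8 is hypothetical, and it assumes that anything which is not valid UTF-8 is ISO-8859-1 (the source encoding used in the example above); adjust that for your own files.

```bash
#!/bin/bash
# to_utf8 (hypothetical helper): write a file to stdout as UTF-8.
# Assumption: a file that is not valid UTF-8 is ISO-8859-1.
to_utf8() {
    if iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; then
        cat "$1"                           # already valid UTF-8, pass through
    else
        iconv -f ISO-8859-1 -t UTF-8 "$1"  # convert high-bit characters
    fi
}

# example: to_utf8 page.html > page.utf8.html
```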

Topics: Computer Technology, Problem Solving, Unix-Linux-Os, Web Problem Solving, Web Site Conversions, Web Technologies, Wordpress Software

One Response to “Migrating Static HTML pages to WordPress CMS”

  1. Eddie Says:
    September 16th, 2009 at 11:37 am

    Dale,

    Great tutorial. I’d love to know the PHP equivalent script. Maybe I’ll figure that out and post a blog response 😀 – Good stuff really.

    Eddie

