
Converting HTML to Markdown via Pandoc

By Dale Reagan | November 28, 2011

In the fall of 2011 the hosting for the Open Source Cobbler project (an automated system provisioning tool for bare metal or virtual machines) is moving from FedoraHosted.Org to GitHub.Com. The (wiki) documentation for the project is available as HTML. GitHub can support HTML but also supports Markdown (a ‘pure text’ way of writing content that lends itself to conversion to other formats).

The task – move Wiki Pages from HTML to Markdown

There are ~120 pages in the original Wiki, and about 60 of them have been selected for transition to GitHub/Markdown. Since Markdown is new to me I have a bit of research to do. GitHub (& ‘git’) are also new to me so there is a bit of information for me to pull in before I can actually do anything. After getting an overview of Markdown my initial thinking is:

While the above is relatively simple I decide to search for ‘Markdown’ tools – I find several that convert Markdown to HTML but none that are explicitly written for this task.  Then I locate Pandoc. 🙂

Markdown – simple TEXT to HTML

“Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML). Thus, “Markdown” is two things: (1) a plain text formatting syntax; and (2) a software tool, written in Perl, that converts the plain text formatting to HTML.”

“The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible. The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions. While Markdown’s syntax has been influenced by several existing text-to-HTML filters, the single biggest source of inspiration for Markdown’s syntax is the format of plain text email.”

Pandoc – multi-markup format conversion tool

“If you need to convert files from one markup format into another, pandoc is your swiss-army knife. Need to generate a man page from a markdown file? No problem. LaTeX to Docbook? Sure. HTML to MediaWiki? Yes, that too. Pandoc can read markdown and (subsets of) reStructuredText, textile, HTML, and LaTeX, and it can write plain text, markdown, reStructuredText, HTML, LaTeX, ConTeXt, PDF, RTF, DocBook XML, OpenDocument XML, ODT, GNU Texinfo, MediaWiki markup, textile, groff man pages, Emacs org-mode, EPUB ebooks, and S5 and Slidy HTML slide shows. PDF output (via LaTeX) is also supported with the included markdown2pdf wrapper script.”

Visit the Pandoc web site for details on features and helpful examples for using this super tool. Before I begin any conversion I expect that there will be some content that will require post-conversion editing; if I took the time to learn more about Pandoc I might be able to reduce the need for such editing, but I will start simple & fast. To convert a single page I use something like:

pandoc --ignore-args -r html -w markdown < Some_HTML_File.html > Some_HTML_File.markdown

The results look great so I create a Bash script that will:

After reviewing the files they can simply be copied into the new Wiki for the project (some manual edits may be needed.)

In this case I create a Bash function to call Pandoc. The function includes an ‘awk’ command that skips the first 25 lines of the converted output and a ‘sed’ command that drops all lines from the ‘match’ to the end of the file; together these remove the header/footer sections. The original HTML documents had a consistent structure that made these choices easy; of course the awk/sed would need to be adjusted for other projects. As shown below the script converts only a single file; a simple loop could be added to convert all files in a given folder.

Note that the script writes an ‘auto-convert’ line as the first line in the new file; this line would/should be removed when the new file is copied into the new Wiki. The ‘set -x’ and ‘set -’ lines simply cause the script to display what is being run (you could remove those lines OR add ‘print’ messages if you want to see what’s going on). Note that the new file name is generated from the input filename with ‘p’ = ‘pandoc’ and ‘md’ = ‘markdown’ (for example, Some_HTML_File.html becomes Some_HTML_File.html.p.md). As written below, if the script is run multiple times it will REPLACE existing OUTPUT files.
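The two trimming filters can also be tried on their own before running the full script. What follows is only a sketch: ‘trim_page’ and ‘sample’ are hypothetical names, and the sample text is a made-up stand-in for converted output, not a real page from the Wiki.

```shell
#!/bin/bash
# trim_page: read converted Markdown on stdin, drop the first N lines (header)
# and everything from the first line matching a marker to the end (footer).
# The function name and argument order are illustrative assumptions.
trim_page() {
    local skip="$1" marker="$2"
    awk -v skip="${skip}" 'NR > skip' | sed "/${marker}/,\$d"
}

# Made-up stand-in for converted output: 2 header lines, 2 body lines, a footer.
sample() {
    printf '%s\n' 'site menu' 'search box' \
        '# Page Title' 'Body text.' \
        '### Download in other formats' 'Plain Text'
}

# Prints only the two body lines: '# Page Title' and 'Body text.'
sample | trim_page 2 '### Download in other formats'
```

The marker is handed to ‘sed’ as a regular expression, so a marker containing ‘/’ or other special characters would need escaping first.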

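#!/bin/bash
##
## Convert a single HTML wiki page (named in $1) to Markdown via Pandoc
function do_pandoc {
echo "Converting HTML to Markdown via Pandoc"
set -x
# first line of the new file: the 'auto-convert' marker (remove before publishing)
printf "### Wiki File: %s | %s | DReagan auto-convert via 'pandoc'\n\n" "${FILE}" "$(date)" > "${P_MD}"
# skip the first 25 lines (header); drop from the footer match to end of file
pandoc --ignore-args -r html -w markdown < "${FILE}" | \
        awk 'NR > 25' | sed '/### Download in other formats/,$d'  >> "${P_MD}"
set -
}
###
FILE="$1"
if [[ ! -e "${FILE}" ]] ; then printf "\n\tError - Need File name: '%s' not present...\n\n" "${FILE}" ; exit 1 ; fi
P_MD="${FILE}.p.md"
do_pandoc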
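
The ‘simple loop’ mentioned above might look something like the sketch below. It is not the script that was actually used: the function names, the assumption that every page ends in ‘.html’, and the skip-if-the-output-exists check (added so that a re-run does not REPLACE reviewed files) are all illustrative. It also leaves out the ‘auto-convert’ marker line and the --ignore-args flag to keep things short.

```shell
#!/bin/bash
# Sketch: batch version of the single-file conversion.
# HEADER_LINES and FOOTER_MATCH mirror the values used for this Wiki;
# everything else here is an assumption made for illustration.
HEADER_LINES=25
FOOTER_MATCH='### Download in other formats'

convert_one() {
    local file="$1" out="$1.p.md"
    if [[ -e "${out}" ]]; then
        # Do not overwrite output that may already have been hand-edited.
        printf "Skipping '%s': '%s' already exists\n" "${file}" "${out}"
        return 0
    fi
    pandoc -r html -w markdown < "${file}" \
        | awk -v skip="${HEADER_LINES}" 'NR > skip' \
        | sed "/${FOOTER_MATCH}/,\$d" > "${out}"
}

convert_folder() {
    local dir="${1:-.}" file
    for file in "${dir}"/*.html; do
        [[ -e "${file}" ]] || continue   # no .html files: the glob stays literal
        convert_one "${file}"
    done
}

# Example call (folder name is hypothetical):
# convert_folder ./wiki_pages
```

Deleting an existing ‘.p.md’ file forces that page to be converted again on the next run.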

Some tweaks to reduce/assist manual edits

Task Summary

  1. research as noted above
  2. download Wiki files to local Linux system
  3. install Pandoc
  4. create script to process files
  5. copy new Markdown files into the new wiki (the most time-consuming portion of this effort since it was done manually)
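
Before step 4 it can help to confirm that the tools the script depends on are actually installed. A minimal sketch, assuming a helper named ‘require_cmd’ (a hypothetical name, not part of the original script):

```shell
#!/bin/bash
# require_cmd: succeed only if every named command can be found on PATH.
require_cmd() {
    local cmd missing=0
    for cmd in "$@"; do
        if ! command -v "${cmd}" > /dev/null 2>&1; then
            printf "Missing required command: '%s'\n" "${cmd}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# The conversion script relies on these three tools.
if require_cmd pandoc awk sed; then
    pandoc --version | head -n 1
fi
```

If Pandoc is missing the check only reports it; installing it is left to whatever package manager the system uses.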

Results for this effort

Update – GitHub & “Fork”

If you are working on a similar project and you do not have ‘write access’ to a GitHub Wiki, then:

  1. ‘fork’ the original project
  2. edit the Wiki on your copy of the project (i.e. push your changes to your ‘fork’ since you now ‘own’ and have ‘write’ access to your copy)
  3. then request a ‘pull’ from the original project owner (by-passing the manual copy from step #5 above…)

Using the process above gives the original project owner the option to pull in your changes. At this time any logged-in user can manually edit wiki pages in an Open Source GitHub project, so the manual approach is, perhaps, simpler. I think that the ‘fork’ approach is a bit more efficient… [And yes, I would have taken the ‘fork’ approach if I had been a bit more familiar with GitHub…]

As always, your mileage should vary a bit.  🙂

 

Topics: Computer Technology, Problem Solving, Unix-Linux-Os, Web Problem Solving, Web Site Conversions, Web Technologies
