Building the Download pages

Over the last year, PCC has been experimenting with building the PCC Technical Framework using the IHE Wiki pages. This article describes the process that I use for doing that, and provides links to some of the tools that I've been using.

WikiDownloader

A tool that I use is a home-grown Java program called WikiDownloader, which supports downloading and uploading wiki content, as well as downloading wiki pages as HTML. That tool can be found here. (Link purposely broken to WikiDownloader.zip because I haven't built the zip file yet.)

This utility takes a list of article names and downloads them from, or uploads them to, a wiki.

The tool uses the Java Wiki Bot Framework (http://jwbf.sourceforge.net/). That framework requires several components from the Apache Commons (http://commons.apache.org/).
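
For reference, here is a minimal sketch of reading and writing an article through JWBF. It follows the framework's published example rather than the WikiDownloader source, uses the current JWBF class names (which may differ from the 2007-era release the tool was built against), and the wiki URL, article name, and credentials are placeholders:

import net.sourceforge.jwbf.core.contentRep.Article;
import net.sourceforge.jwbf.mediawiki.bots.MediaWikiBot;

public class JwbfSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the wiki and log in (URL and credentials are placeholders).
        MediaWikiBot bot = new MediaWikiBot("http://wiki.example.org/w/");
        bot.login("username", "password");

        // Download the wikitext of an article.
        Article article = bot.getArticle("Some Article");
        System.out.println(article.getText());

        // Append to the text and save it back as an edit.
        article.setText(article.getText() + "\n<!-- touched by bot -->");
        article.save();
    }
}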

Eventually, I hope to modify this tool to tidy up the XHTML that comes out of the wiki pages. For the most part, the wiki generates very clean XHTML, but if your editors don't do everything exactly right, the documents can become invalid. Tidy (and JTidy) clean up the output and fix common errors. See http://jtidy.sourceforge.net/
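
As a rough illustration of that cleanup step, the sketch below runs JTidy over a downloaded page. This is not part of WikiDownloader today; the file names are placeholders and the option settings are only a reasonable starting point:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.w3c.tidy.Tidy;

public class TidySketch {
    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);         // emit XHTML rather than HTML
        tidy.setQuiet(true);         // suppress the summary report
        tidy.setShowWarnings(false); // only report actual errors

        // Clean up a downloaded page (file names are placeholders).
        try (FileInputStream in = new FileInputStream("html/article.htm");
             FileOutputStream out = new FileOutputStream("html/article-clean.htm")) {
            tidy.parse(in, out);
        }
    }
}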

Options

-u
Upload articles
-w URL
Set the wiki, username and password. The URL should be of the form http://username:password@wikihost
The default uses the value of the WIKI environment variable (see the example after this list).
-s "Summary"
Set the edit summary for the articles. Used when uploading articles. Be sure to enclose the summary in quotes on the command line.
-r file
Read the list of articles to upload or download from a file.
-r -
Read the list of articles to upload or download from stdin.
-p URL
Set the address of the http proxy. (Not Working!)
-m
This is a minor edit. Used when uploading articles.
-f folder
Set the working folder for uploads or downloads.
-x
Skip any further argument processing and exit (for debugging)
-h
Download the HTML of the article.
-i
Force all images to be downloaded even if they exist.
article
Name of an article to upload or download.
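
For example, to point the tool at a wiki explicitly rather than relying on the WIKI environment variable, the connection can be given with -w (the host name and credentials below follow the form shown above and are placeholders):

java org.ihe.wiki.WikiDownloader -w http://username:password@wikihost -f wiki article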

Common Use

Downloading A Single Wiki Page in HTML Format

The following command line will download the article named article to the folder named html in HTML format:

java org.ihe.wiki.WikiDownloader -h -f html article

When downloading content in HTML format, the WikiDownloader utility will create a list of pages that are linked from the downloaded content but were not themselves downloaded. This is an aid to ensure that you have downloaded all the content needed to view a set of related pages.

Downloading Multiple Wiki Pages in HTML Format

The following command line will download all articles listed in file to the folder named html in HTML format:

java org.ihe.wiki.WikiDownloader -h -f html -r file

Downloading A Single Wiki Page in WikiText Format

The following command line will download the article named article to the folder named wiki in WikiText format:

java org.ihe.wiki.WikiDownloader -f wiki article

Downloading Multiple Wiki Pages in WikiText Format

The following command line will download all articles listed in file to the folder named wiki in WikiText format:

java org.ihe.wiki.WikiDownloader -f wiki -r file

Uploading A Single Wiki Page in WikiText Format

The following command line will upload the article named article from the folder named wiki to the wiki, as a major edit, with the edit summary "Restructure article":

java org.ihe.wiki.WikiDownloader -s "Restructure article" -f wiki article

NOTE: When uploading, it is important to give the -s argument before the article name, or the article will be downloaded OVER your local copy.

Uploading Multiple Wiki Pages in WikiText Format

The following command line will upload all articles listed in file from the folder named wiki to the wiki, as a minor edit, with the edit summary "Fix article names":

java org.ihe.wiki.WikiDownloader -s "Fix article names"  -f wiki -r file

NOTE: When uploading, it is important to give the -s argument before the -r argument, or the articles will be downloaded OVER your local copies.

Some HTML Infrastructure

The WikiDownloader does a little bit of restructuring of the HTML content when it is written out: updating links so that they become local, fixing up images, and removing some wiki-related content so that the pages can be used locally. As part of that restructuring, it assumes that you have a sub-folder named images, where the images will be placed, and another sub-folder named skins, where the CSS stylesheets used to display the content will be placed.
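
The exact rewriting that WikiDownloader performs isn't shown here, but the sketch below illustrates the general idea of making article links and image references local. The URL patterns and file names are assumptions for illustration, not the tool's actual code:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocalizeLinksSketch {
    public static void main(String[] args) throws Exception {
        Path page = Paths.get("html/article.htm");   // placeholder file name
        String html = new String(Files.readAllBytes(page), StandardCharsets.UTF_8);

        // Turn wiki article links into links to local .htm files
        // (the URL pattern is an assumption about the wiki's configuration).
        html = html.replaceAll("href=\"/index\\.php/([^\"#?]+)\"", "href=\"$1.htm\"");

        // Point image references at the local images sub-folder.
        html = html.replaceAll("src=\"[^\"]*/images/([^\"/]+)\"", "src=\"images/$1\"");

        Files.write(page, html.getBytes(StandardCharsets.UTF_8));
    }
}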

The skins content used for the PCC downloads is included with the WikiDownloader.zip file.

Producing the PDF

Downloading the HTML

This process takes about half a day, mostly in steps 4 and 6. Downloading the complete HTML content of the wiki takes about 30-45 minutes.

  1. Run the WikiDownloader to get all the content.
  2. Make a list of pages that didn't download fully due to errors in the HTML (these are reported as exceptions).
  3. Clean up those wiki pages so they generate clean HTML.
  4. Rinse, lather and repeat from step 2, re-downloading the cleaned-up pages until they come through clean.
  5. View the list of linked pages that were not downloaded.
  6. Rinse, lather and repeat from step 2, downloading any missed content until you have all the content (see the example after this list).
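
When repeating the download for missed or fixed pages, the same command lines shown earlier apply; adding -i forces images to be re-downloaded even when local copies already exist. The list file name below is just a placeholder:

java org.ihe.wiki.WikiDownloader -h -i -f html -r missed.txt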

Generating a Single File for a Profile

This takes about 5 minutes per Profile, and hasn't been fully tested.

Each PCC Profile is stored in several HTML files. This makes for easily browsed content, but many vendors have requested that profiles be stored in a single file. I've created an XSL stylesheet that does this, included in the WikiDownloader zip and named SingleDocument.xsl. The stylesheet follows the links inserted into the wiki page by the TOCLink wiki template and imports the linked content directly into the file. To use this stylesheet, enter the following command line:

msxsl PCC_TF-1\profile.htm SingleDocument.xsl >PCC_TF-2\profile-single.htm
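
If msxsl isn't available, the same transform could in principle be run with the JAXP APIs that ship with Java. The sketch below is only an alternative, not part of the process described here, and whether SingleDocument.xsl runs unchanged under a different XSLT processor hasn't been verified; the file names reuse those from the command above:

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SingleDocumentTransform {
    public static void main(String[] args) throws Exception {
        // Compile the stylesheet and apply it to the profile's root page.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("SingleDocument.xsl")));
        t.transform(new StreamSource(new File("PCC_TF-1/profile.htm")),
                    new StreamResult(new File("PCC_TF-2/profile-single.htm")));
    }
}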

HTML to Word

This takes about half a day for Volume I, and about a day and a half for Volume II, mostly due to the need to repeat the process over and over from step 3. The new stylesheet (see above) will vastly improve this.

After downloading all the content, I go through a somewhat tedious process of putting the wiki content together, which has recently become easier. It goes through the following steps:

  1. Download all the HTML.
  2. Open a Word Document using the IHE Profile Template. Delete all content.
  3. Open the first HTML page to copy in Internet Explorer (this would probably work in other browsers, but I haven't tested it yet).
  4. Select the content in IE.
  5. Copy it to the clipboard.
  6. Paste it into Word.
  7. Rinse, lather and repeat from step 3, pasting content into the Word document at the appropriate locations until all content is imported. (NOTE: This step can now be omitted because of the stylesheet described above.)


Word Cleanup

This takes about half a day; most of the time is spent on Volume II.

  1. Run several Word macros over and over to clean up tables, figures, and examples, putting them in the appropriate styles, cleaning up hyperlinks, et cetera. This step will eventually be eliminated once I develop a stylesheet to convert the single HTML file into something that Word can import cleanly (e.g., the new Word XML format, which is not great, but which would work for this use).
  2. Add a Table of Contents

PDF Generation

This is perhaps the least tedious step, but the longest running in computer time. It can take an hour or more for Volume II. Don't expect to do much else with your computer while it runs.

  1. Press the Convert to Adobe PDF button.
  2. Review the output.
  3. Fix errors.
  4. Repeat from step 1 until no more errors remain.