Thursday, March 5, 2009

Convert Word .doc files to .html using ASP.NET

Programmatically converting Word DOCs to HTML

This article describes how to use ASP.NET 2 to convert documents in Word .doc format into .html documents. The conversion is done by MS Word itself, which the ASP.NET code drives through Word's COM object library.

The reason for doing this was as follows: I wanted to allow users to upload files to my Intranet through their browser and make them available for other people to look at. But if they uploaded Word documents, only people with Word installed would be able to view them, which causes problems for Mac and Linux users. So I wanted my server to convert the .doc file into a .html file automatically, at the point when the file is uploaded. There was no way that I was going to reverse-engineer the Word format and write my own converter, so instead I used the built-in facility inside MS Word that does this for you. If you give it a Word doc, it will save a .html file plus a separate folder containing all the necessary images, all linked properly from the html file. Yes, I admit the html file is full of weird Word-specific markup, but it does work, and in fact it works very nicely.

How to do it

First, MS Word must be installed on the server where this ASP.NET page is going to run. You then add a reference to your ASP.NET project, telling Visual Studio where to find the vital Word library. To do this:

  1. In Solution Explorer, right-click on your project root and select "Add Reference".
  2. Go to the COM tab and find Microsoft Word 11.0 Object Library (version 11 is Word 2003).
  3. Click on it and then click OK.

Once you have done this, you will be able to use the "Word" namespace in your project.
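Since everything depends on Word being present on the server, it can be worth checking for it before trying to automate it. The following is a minimal sketch, assuming Word registers its usual "Word.Application" COM ProgID; the WordCheck class and IsWordInstalled method are names made up for this illustration and are not part of the Word library.

```csharp
using System;

// Hypothetical helper; the class and method names are illustrative only.
public static class WordCheck
{
    // Returns true when the "Word.Application" COM ProgID is registered
    // on this machine, which normally means Word is installed.
    public static bool IsWordInstalled()
    {
        try
        {
            // The single-argument overload returns null when the ProgID is unknown.
            return Type.GetTypeFromProgID("Word.Application") != null;
        }
        catch (PlatformNotSupportedException)
        {
            // COM only exists on Windows; some runtimes may throw elsewhere.
            return false;
        }
    }
}
```

A page could call WordCheck.IsWordInstalled() once at startup and show a friendly message instead of failing halfway through an upload.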

To test it, make a sample webpage, perhaps called test.aspx, and put a FileUpload, a Button and a Label on it. The FileUpload component is used to upload the file; the Button is clicked to make the process start, and the Label is used to display a success message.

The complete code for the upload routine is here:

protected void Button1_Click(object sender, EventArgs e)
{
    if (!FileUpload1.HasFile)
    {
        return;
    }

    // When we click Button1, the file we specify is uploaded to a temporary
    // folder, then converted into an html document...
    string folder_to_save_in = @"c:\temp\documents\";
    // GetFileName makes sure we only ever use the bare file name.
    string filePath = System.IO.Path.Combine(folder_to_save_in,
        System.IO.Path.GetFileName(FileUpload1.FileName));
    // This bit does the actual file upload:
    FileUpload1.SaveAs(filePath);

    // Word's methods take many optional parameters; we pass Missing.Value for most of them...
    object o_nullobject = System.Reflection.Missing.Value;
    object o_nosave = Word.WdSaveOptions.wdDoNotSaveChanges;

    // Here we set up a Word Application...
    Word.ApplicationClass wordApplication = new Word.ApplicationClass();
    try
    {
        // Opening a Word doc requires many parameters, but we leave most of them blank...
        object o_filePath = filePath;
        Word.Document doc = wordApplication.Documents.Open(ref o_filePath,
            ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject,
            ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject,
            ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject);

        // Here we save it in html format...
        // ChangeExtension swaps only the extension, whatever its case (.doc, .DOC, ...)
        string newfilename = System.IO.Path.ChangeExtension(filePath, ".html");
        object o_newfilename = newfilename;
        object o_format = Word.WdSaveFormat.wdFormatHTML;
        object o_encoding = Microsoft.Office.Core.MsoEncoding.msoEncodingUTF8;
        object o_endings = Word.WdLineEndingType.wdCRLF;
        // Once again, we leave many of the parameters blank.
        // See http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vbawd11/html/womthSaveAs1_HV05213080.asp
        // for full list of parameters.
        doc.SaveAs(ref o_newfilename, ref o_format, ref o_nullobject,
            ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject,
            ref o_nullobject, ref o_nullobject, ref o_encoding, ref o_nullobject,
            ref o_nullobject, ref o_endings, ref o_nullobject);

        // Close the document; it has already been saved as html...
        doc.Close(ref o_nosave, ref o_nullobject, ref o_nullobject);

        // Report success only once the conversion has really finished...
        Label1.Text = "Uploaded successfully!";
    }
    finally
    {
        // Always quit Word, otherwise a WINWORD.EXE process is left running on the server.
        // The cast picks the Quit method rather than the Quit event of the same name.
        ((Word._Application)wordApplication).Quit(ref o_nosave, ref o_nullobject, ref o_nullobject);
    }
}

And that is it really. When you browse to a file and click the upload button, the file is uploaded to your server and stored in the temp folder. Then, this doc file is opened, and a SaveAs performed. This saves the new .html file in the same temp folder, with the associated image files in a subfolder with the same name as the .html file, but with _files appended to its name.
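The file naming described above can be worked out with plain System.IO calls, which is handy if you later want to link to, move or delete the converted files. This is only a sketch: the ConvertedPaths class and its method names are invented for illustration, and it assumes Word names the image folder exactly as described, with _files appended to the .html file's base name.

```csharp
using System;
using System.IO;

// Hypothetical helper; the class and method names are illustrative only.
public static class ConvertedPaths
{
    // True only when the uploaded name ends in .doc, in any letter case.
    public static bool IsWordDoc(string fileName)
    {
        return string.Equals(Path.GetExtension(fileName), ".doc",
            StringComparison.OrdinalIgnoreCase);
    }

    // "report.doc" -> "report.html", in the same folder.
    public static string HtmlPathFor(string docPath)
    {
        return Path.ChangeExtension(docPath, ".html");
    }

    // "report.html" -> "report_files", alongside the .html file.
    public static string ImageFolderFor(string htmlPath)
    {
        string folder = Path.GetDirectoryName(htmlPath) ?? "";
        string name = Path.GetFileNameWithoutExtension(htmlPath) + "_files";
        return Path.Combine(folder, name);
    }
}
```

Checking IsWordDoc before handing the file to Word also covers the "something.doc" assumption, so that a .docx or .pdf upload is rejected rather than half-converted.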