Exporting Word files to HTML

Published: Mar 5, 2003
Tested with: VBScript 5.5, MS Word 2002
Category: ASP

Introduction

In this article we will first discuss the case for and against using Word as your HTML editor. Then we will see how to properly save a Word file to smaller, more compact HTML files. Third and last, we will see how to do this through code, and possibly create a batch process for converting numerous Word files to HTML at once.

The case for and against Word as an HTML editor

Recent versions of Microsoft Office let us save a Word file as HTML. It's a very easy process, and many people create HTML pages this way because:

  • They are already familiar with Word and its formatting features.
  • Word comes installed on their computer, and they do not want to purchase additional HTML authoring software.
  • They have numerous files in Word format that they want on a website in HTML. Simply exporting them to HTML is the fastest way.

Unfortunately, there is a downside to this method: Word does a terrible job of creating compact, cross-browser HTML source code. If this is important to you, then you should probably stay away from using Word as your HTML editor in the first place. That said, it is still possible to clean up the generated code quite a bit, first through Word itself and second through other tools or custom regular expressions.
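As a rough illustration of the regular expression approach, here is a minimal sketch using the VBScript RegExp object. The function name CleanWordHtml and the two patterns are assumptions chosen for this example. They only remove Office namespace tags (such as <o:p>) and class attributes that start with "Mso", so treat this as a starting point and not a complete cleaner.

```vbscript
Option Explicit

'Hypothetical helper: strips a few common Word-specific leftovers
'from a string of HTML. The patterns are illustrative assumptions.
Function CleanWordHtml(ByVal html)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.IgnoreCase = True

    'remove Office namespace tags such as <o:p> and </o:p>
    re.Pattern = "</?o:[^>]*>"
    html = re.Replace(html, "")

    'remove class attributes that reference Word styles,
    'for example class=MsoNormal or class="MsoTitle"
    re.Pattern = "\s+class=""?Mso[a-z0-9]*""?"
    html = re.Replace(html, "")

    CleanWordHtml = html
End Function

'prints: <p>Hello</p>
WScript.Echo CleanWordHtml("<p class=MsoNormal>Hello<o:p></o:p></p>")
```

Run it with cscript so that the result is written to the console instead of a message box. Real Word output contains many more constructs (conditional comments, inline mso- styles), each of which would need its own pattern.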

Saving as HTML from Word

Start by opening an existing Word file on your system, or by creating a new one and typing in some text and pictures. Then click on File > Save as Web Page...

[Screenshot: Save as Web Page...]

Word then displays the Save As dialog box.

[Screenshot: Save As]

We can see that Word took the filename of the DOC file (for any new files it creates a filename based on the title of the document) and is prompting us to save it with the extension .htm. This is clearly shown by the select box labeled Save as type which has Web Page (*.htm; *.html) already selected. We can now perform the normal save operations, like choosing the name and location of the HTML file. However, Word has a save option called Filtered HTML which greatly reduces the HTML code produced.

[Screenshot: Save as Filtered HTML]

It's important to understand the difference between the two options. When Word saves a file as HTML, it still wants to be able to open it back in Word and maintain the same formatting as when you created it. It does this by leaving a lot of proprietary Word code inside the generated HTML file. If, however, we simply want to export our contents to the smallest HTML file possible, without needing to re-open them in Word, we can choose the Filtered HTML option. This produces smaller files, less HTML code and, even more important, source code with better cross-browser compatibility. When you select this option and click on Save, you will get a popup alerting you that the Word-specific formatting will be lost.

[Screenshot: Save alert]

Click on Yes to finish the process. Something else worth noting happens here on save. Suppose you have some images embedded inside your Word file. These images could be GIFs, JPGs, BMPs, PNGs, etc. When you insert an image in Word, the image file is actually embedded inside the file and is saved along with it. When we save the file as HTML, Word exports all these images to a folder that it creates in the same location as the exported HTML file, and then generates links to them inside the HTML code. The exported images are handled like so:

  • They are scaled down or up to match the width and height they were given inside Word.
  • They are converted to GIFs and JPGs.
  • Their names stay the same.
  • The name of the folder that they are stored under is the name of the HTML file that is created, plus the suffix "_files". For example, if the filename is "My company.htm", then the images will be under the folder "My company_files".
  • The link inside the HTML file to the images is relative. For example, <img src="My company_files/house.gif">.
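The folder naming rule described above can be sketched in a few lines of VBScript. The helper name ImageFolderFor is hypothetical. It relies only on the FileSystemObject path functions and does not touch the disk, so the path in the example does not need to exist.

```vbscript
Option Explicit

'Hypothetical helper: given the path of an exported HTML file,
'returns the folder where Word is expected to place its images.
Function ImageFolderFor(htmlPath)
    Dim fso
    Set fso = CreateObject("Scripting.FileSystemObject")
    'base name of the HTML file plus "_files", next to the HTML file
    ImageFolderFor = fso.BuildPath(fso.GetParentFolderName(htmlPath), _
                                   fso.GetBaseName(htmlPath) & "_files")
    Set fso = Nothing
End Function

'prints: C:\Inetpub\wwwroot\word\My company_files
WScript.Echo ImageFolderFor("C:\Inetpub\wwwroot\word\My company.htm")
```

A helper like this is handy when you need to copy or upload the HTML file together with its images.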

Exporting to HTML through code

Let us assume that we have a bunch of Word files sitting inside a directory, and they all need to be converted to HTML files. We could open each one and follow the procedure above, but that can take a long time, depending on how many of them there are. Instead, we can use a little Windows Script Host (WSH) scripting to do this for us. The idea is the same: create an instance of the Word application, loop through the folder, open each DOC file that we find, export it as Filtered HTML, close the file, move on to the next, and finally close the Word application object. Let's first look at the code needed to do this with WSH VBScript, and then we will break it down.

Option Explicit

'declare all variables
Dim objWord
Dim oDoc
Dim objFso
Dim colFiles
Dim curFile
Dim curFileName
Dim folderToScanExists
Dim folderToSaveExists
Dim objFolderToScan

'set some of the variables
folderToScanExists = False
folderToSaveExists = False
Const wdSaveFormat = 10 'for Filtered HTML output

'********************************************
'change the following to fit your system
Const folderToScan = "C:\Word\documentation\"
Const folderToSave = "C:\Inetpub\wwwroot\word\"
'********************************************

'Use FSO to see if the folders to read from
'and write to both exist.
'If they do, then set both flags to True
'and proceed with the conversion
Set objFso = CreateObject("Scripting.FileSystemObject")
If objFso.FolderExists(folderToScan) Then
    folderToScanExists = True
Else
    MsgBox "Folder to scan from does not exist!", 48, "File System Error"
End If
If objFso.FolderExists(folderToSave) Then
    folderToSaveExists = True
Else
    MsgBox "Folder to save to does not exist!", 48, "File System Error"
End If

If (folderToScanExists And folderToSaveExists) Then
    'get your folder to scan
    Set objFolderToScan = objFso.GetFolder(folderToScan)
    'put all the files under it in a collection
    Set colFiles = objFolderToScan.Files
    'create an instance of Word; CreateObject raises an error on
    'failure, so trap it and test for Nothing afterwards
    Set objWord = Nothing
    On Error Resume Next
    Set objWord = CreateObject("Word.Application")
    On Error GoTo 0
    If objWord Is Nothing Then
        MsgBox "Couldn't start Word.", 48, "Application Start Error"
    Else
        'do all this in the background
        objWord.Visible = False
        'for each file
        For Each curFile In colFiles
            'only if the file is of type DOC (in any letter case)
            If LCase(objFso.GetExtensionName(curFile.Name)) = "doc" Then
                'get the filename without extension
                curFileName = objFso.GetBaseName(curFile.Name)
                'open the file inside Word and keep a reference to it
                Set oDoc = objWord.Documents.Open(curFile.Path)
                'save the opened document as Filtered HTML
                oDoc.SaveAs objFso.BuildPath(folderToSave, curFileName & ".htm"), wdSaveFormat
                'close it without prompting to save changes
                oDoc.Close False
                Set oDoc = Nothing
            End If
        Next
        'close Word (only reached when Word actually started)
        objWord.Quit
    End If
    'set all objects and collections to nothing
    Set objWord = Nothing
    Set colFiles = Nothing
    Set objFolderToScan = Nothing
End If

Set objFso = Nothing

Save the code above as a .vbs file (for example, createdoc.vbs) somewhere on your system. Before you use it, you must change the two constants folderToScan and folderToSave. They specify which folder to look in for Word files and which folder to save the HTML files to. Once you have edited both, double-click the .vbs file to run it.

The code first creates an instance of the File System Object and uses it to check that both folders exist. It then maps to the folder defined in folderToScan and puts all the files under it in a collection. Next, it creates an instance of the Word application and loops through the files in the collection. Each Word file that it finds is opened and saved as Filtered HTML. If you now look inside the output folder, folderToSave, you will see the newly created HTML files with their corresponding directories of images.

The constant wdSaveFormat holds the numeric file format that is passed as the second argument to SaveAs. Setting it to 10 (wdFormatFilteredHTML in Word's object model) creates Filtered HTML files. For regular HTML output use the number 8 (wdFormatHTML). This will produce bigger HTML files but will maintain the Word formatting.
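Because a standalone .vbs file does not load Word's type library, the named format constants have to be declared by hand. The sketch below declares a few of them with the values Word uses and pairs each with a file extension. The helper ExtensionFor is hypothetical and exists only to show how the format number and the output filename might be kept in step.

```vbscript
Option Explicit

'values from Word's WdSaveFormat enumeration, declared manually
'because late-bound scripts cannot see the type library
Const wdFormatText = 2
Const wdFormatRTF = 6
Const wdFormatHTML = 8
Const wdFormatFilteredHTML = 10

'Hypothetical helper: picks a file extension for a given format
Function ExtensionFor(fileFormat)
    Select Case fileFormat
        Case wdFormatText
            ExtensionFor = ".txt"
        Case wdFormatRTF
            ExtensionFor = ".rtf"
        Case wdFormatHTML, wdFormatFilteredHTML
            ExtensionFor = ".htm"
        Case Else
            'unknown format: let the caller decide
            ExtensionFor = ""
    End Select
End Function

'prints: .htm
WScript.Echo ExtensionFor(wdFormatFilteredHTML)
```

Using named constants instead of bare numbers makes it easier to see at a glance which output format a script produces.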

Conclusion

We have seen how Word exports files to HTML and how to use this to our advantage. With this method, we can easily convert our Word files to HTML and simply post them to our website. Or, we can use the same code to extract all the images from a bunch of Word files. We can take this further by possibly outputting to other formats as well, or by creating a new document from scratch and then saving it, instead of opening an existing one.
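As one possible take on the last idea, the following sketch creates a new document, types a line of text into it and saves it as Filtered HTML. The output path and the sample text are assumptions made for illustration. The script needs Word installed and an existing output folder, and it has no error handling.

```vbscript
Option Explicit

Const wdFormatFilteredHTML = 10
'hypothetical output path; change it to fit your system
Const outputFile = "C:\Inetpub\wwwroot\word\generated.htm"

Dim objWord
Dim oDoc

'start Word in the background
Set objWord = CreateObject("Word.Application")
objWord.Visible = False

'create an empty document and type some text into it
Set oDoc = objWord.Documents.Add()
objWord.Selection.TypeText "This page was generated from a script."

'save it as Filtered HTML, then close without prompting
oDoc.SaveAs outputFile, wdFormatFilteredHTML
oDoc.Close False
objWord.Quit

Set oDoc = Nothing
Set objWord = Nothing
```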

 


