We recently had a client create hundreds of Microsoft Word (.doc) files and ask us to import their contents into their WordPress site. Importing .doc files via PHP isn’t the quickest task to set up, so we decided to batch-convert the files to .html, which would let us easily read their contents and clean up the code before inserting it into the database.
Unfortunately, all of the batch-conversion programs we tested had trouble with non-English characters and ended up doing more harm than good.
With a bit of research, we found that our beautiful Macs had a command-line application called “textutil” that could take care of this in seconds.
- Open Terminal
- Navigate to the folder holding the original documents
- Enter the following command:
textutil -convert html *.doc
Open the folder in Finder, or run ls, and you’ll see that every .doc file now has a .html companion. The generated HTML is fairly clean, but it includes some markup you may want to strip out.
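If you would rather keep the generated files apart from the originals, the one-liner above can be wrapped in a small loop. The following is only a sketch: the function name convert_docs and the two folder arguments are our own hypothetical choices, and it relies on textutil's -output option to name the destination file for each input.

```shell
#!/bin/sh
# Hypothetical helper (not part of textutil): convert every .doc in a
# source folder to HTML, writing the results into a separate output folder.
convert_docs() {
  src_dir=$1
  out_dir=$2

  # textutil ships with macOS only, so bail out early anywhere else.
  if ! command -v textutil >/dev/null 2>&1; then
    echo "textutil not found (it is a macOS tool)" >&2
    return 1
  fi

  mkdir -p "$out_dir" || return 1

  for doc in "$src_dir"/*.doc; do
    # If no file matched, the glob stays literal; skip it.
    [ -e "$doc" ] || continue
    name=$(basename "$doc" .doc)
    # -output names the destination file for this single input.
    textutil -convert html -output "$out_dir/$name.html" "$doc" ||
      echo "failed to convert: $doc" >&2
  done
}
```

Paste the function into Terminal and call it with two folders of your choosing, for example convert_docs . ./html to convert the current folder into an html subfolder. Quoting the variables keeps file names with spaces intact.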
textutil has many options, including some that can clean up the output. See the full manual in the Mac developer library, or run man textutil in Terminal.
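As one example of a clean-up option, the manual describes -excludedelements, which asks textutil to leave certain HTML elements out of the output. The sketch below is an assumption-laden illustration: the function name doc_to_lean_html is hypothetical, the tag list is just an example, and you should check man textutil for the exact syntax on your version of macOS.

```shell
#!/bin/sh
# Hypothetical helper (not part of textutil): convert one .doc while
# asking textutil to omit some presentational elements.
doc_to_lean_html() {
  doc=$1
  # The parenthesised tag list must reach textutil as a single argument,
  # so it is quoted. The tags chosen here are only an example.
  textutil -convert html -excludedelements '(span, font)' "$doc"
}
```

Calling doc_to_lean_html report.doc should produce report.html next to the original, just as the plain command does, but without the excluded elements.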