Monday, October 17, 2011

Token Replacement inside DocX

Before I get into this: YES, you can do this same thing with some standard Microsoft tools like Excel and Access. There is even a PHP class already built for this.


So have you ever needed to work with "less than computer savvy" clients? Maybe they need document templates for whatever reason. Normally, when you generate a Word document or PDF, you have to get a PHP class to generate it and write the document's content in code. That way you can generate the document on the fly when you need to.

Well, that would take you a long time to type up for, say, 20 documents. Wouldn't it be cool if you (or the client) could just create the templates in Word, and your system would understand them AND be able to generate a document based on each template?

Well, I wouldn't be writing this blog post if the answer was "sorry, we can't do that".

Opening the Docx

So what's the technique? It turns out that docx files are just zip files. You can literally rename the .docx file to .zip and unzip it.

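You get the following files when you unzip it: [Content_Types].xml at the top level, plus the _rels, docProps, and word folders (the exact set can vary a little from document to document).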

Inside the word folder is a file called document.xml. That file holds the actual content of your Word doc, which means you can replace the text right there.

Re-Creating the DocX
So now that you have edited these files, you need to make them into a docx again. Just zip them back up and rename the archive to .docx. Done and done.

PHP can do this for us
Everything we just did by hand, PHP can do for us. That means we can build this into our systems.

Step 1) Create a temp folder to store the extracted contents of the docx file

This step creates the temp directory if it does not exist. If it already exists, it cleans it out. Refer to the YouTube video for the full code, and to see what recursive_remove_directory does (if you haven't figured it out).
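
For anyone who wants something to start from without the video, here is a minimal sketch of this step. The names prepare_temp_dir and remove_directory_contents are hypothetical stand-ins of mine, not the code from the video, and the post's own recursive_remove_directory may well behave differently.

```php
<?php
// Hypothetical stand-in for the post's recursive_remove_directory helper:
// deletes everything inside $dir but leaves $dir itself in place.
function remove_directory_contents(string $dir): void
{
    $items = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::CHILD_FIRST // visit children before their parent folder
    );
    foreach ($items as $item) {
        if ($item->isDir() && !$item->isLink()) {
            rmdir($item->getPathname());
        } else {
            unlink($item->getPathname());
        }
    }
}

// Create the temp directory if it is missing, otherwise empty it.
function prepare_temp_dir(string $dir): string
{
    if (is_dir($dir)) {
        remove_directory_contents($dir);
    } else {
        mkdir($dir, 0700, true);
    }
    return $dir;
}
```

Calling prepare_temp_dir on the same path twice should leave you with an empty folder both times, which is what the later steps assume.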

Step 2) Make sure the directory path is safe for shell use
We're going to be executing Linux commands using shell_exec, and you can't have unescaped spaces and other special characters in the path. So we just need to escape the path first.
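
One way to do the escaping is PHP's built-in escapeshellarg. This is a small sketch only; the path below is a made-up example, and the quoting shown in the comment is what you get on Linux.

```php
<?php
// Hypothetical temp path containing a space and a '#', both of which
// would confuse the shell if passed through unescaped.
$tempDir = '/tmp/docx work/client #1';

// escapeshellarg wraps the value in single quotes (on Linux) so the
// shell treats it as one literal argument.
$safeDir = escapeshellarg($tempDir);

echo $safeDir, "\n"; // '/tmp/docx work/client #1' (including the quotes)
```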
Step 3) Unzip to the temp folder
This is just a shell command
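
A rough sketch of what that command could look like from PHP, assuming the Info-ZIP unzip program is installed on the server. The function names are mine, not the ones from the video.

```php
<?php
// Build the unzip command. -o overwrites existing files without prompting,
// -d sets the directory to extract into. Both paths are escaped for the shell.
function build_unzip_command(string $docxPath, string $tempDir): string
{
    return 'unzip -o ' . escapeshellarg($docxPath)
        . ' -d ' . escapeshellarg($tempDir)
        . ' 2>&1'; // capture error messages in the returned output too
}

// Run the command and return whatever unzip printed.
function unzip_docx(string $docxPath, string $tempDir): string
{
    return (string) shell_exec(build_unzip_command($docxPath, $tempDir));
}
```

Keeping the command building separate from shell_exec makes it easy to print the command and try it by hand when something goes wrong.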
Step 4) Replace tokens.
So now we get the contents of the document.xml file, use str_replace to replace our tokens, then save it back.
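
Here is a minimal sketch of the replacement, assuming tokens written like [NAME] in the template. That token format and the function names are my own illustration. The values are XML-escaped first, since a raw & or < in a value would otherwise break document.xml.

```php
<?php
// Replace every token in the XML string with its (XML-escaped) value.
// $tokens maps token text => replacement, e.g. ['[NAME]' => 'Ann'].
function replace_tokens(string $xml, array $tokens): string
{
    $search = array_keys($tokens);
    $replace = array_map(
        function ($value) {
            return htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        },
        array_values($tokens)
    );
    return str_replace($search, $replace, $xml);
}

// Load word/document.xml from the extracted folder, replace, and save it back.
function replace_tokens_in_docx_dir(string $tempDir, array $tokens): void
{
    $path = $tempDir . '/word/document.xml';
    $xml = file_get_contents($path);
    file_put_contents($path, replace_tokens($xml, $tokens));
}
```

One thing to watch for: Word sometimes splits a piece of text across several XML runs, so a token may not appear as one unbroken string in document.xml. If a token is not being replaced, retyping it in one go in the template usually helps.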
Step 5) Re-zip ONLY the files we need.
We are going to specify the files here, because Mac OS X likes to create random folders when it unzips from the command line...
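
A sketch of the re-zip step, assuming the Info-ZIP zip program is available. The list of parts is my assumption about a typical docx; check it against what your own templates contain when unzipped. The function name is also mine.

```php
<?php
// Build a command that zips only the named docx parts from $tempDir.
// $outputDocx should be an absolute path, because the command changes
// directory before running zip. If $outputDocx already exists, zip will
// update it rather than start fresh, so delete any old copy first.
function build_zip_command(string $tempDir, string $outputDocx): string
{
    // Assumed whitelist of parts; anything else in the folder is skipped.
    $parts = ['[Content_Types].xml', '_rels', 'docProps', 'word'];
    $args = implode(' ', array_map('escapeshellarg', $parts));

    return 'cd ' . escapeshellarg($tempDir)
        . ' && zip -r ' . escapeshellarg($outputDocx) . ' ' . $args
        . ' 2>&1';
}

// Example usage (hypothetical paths):
// echo shell_exec(build_zip_command('/tmp/docx_tmp', '/tmp/generated.docx'));
```

Zipping from inside the temp folder matters: the parts have to sit at the root of the archive, not under an extra top-level folder, or Word will refuse to open the file.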
And that's it!
Now we have programmatically token replaced a docx file.

Here is the full tutorial