Home > API Shares, google, internet, Open-source, PHP, php function, Plugin, political, securenext, selvabalaji, seo, Social Network, web services, Wordpress > Display Contents of Different File Formats Word/Excel/Powerpoint/PDF/RTF as HTML

Display Contents of Different File Formats Word/Excel/Powerpoint/PDF/RTF as HTML

HTML 4.01 vs HTML 5 illustrated

HTML 4.01 vs HTML 5 illustrated (Photo credit: Glutnix)

This is a typical problem which I also raised on Stack Overflow (http://stackoverflow.com/questions/11061929/php-extract-text-from-different-file-formats-word-excel-powerpoint-pdf-rtf#comment14475398_11061929), but there seemed no single resource around the web to solve this particular problem, so since I have solved it I thought it would make sense to provide an approach and a solution, it can be refined better with time

Problem: We have a web application that allows different users to upload different files to share with others, the file types are limited. However just before downloading a file, the user needs to see a preview of the file contents, which is where it becomes tricky since each  file type is different.

Approach:

1. Identify the extension of each file

2. Display the text or HMTL from each file using the appropriate library for the file type, since there is no unified library

The base class is https://gist.github.com/2941076 which requires the path to the file and the extension (since the extension is already stored in the database I do not try to extract it from the file)

The libraries used for the different types of files are the key to the solution:

1. MS Word Documents – LiveDocx Service within Zend Framework (http://www.phplivedocx.org/2009/08/13/convert-docx-doc-rtf-to-html-in-php/) the steps are:

– Create a Mail Merge Document using the word document as a template

– Connect to the LiveDocx service (here you need SOAP and SSL enabled on your local LAMP installation)

– Save the generated document, and render it as HTML

2. MS Excel – PHPExcel from Codeplex so simple its scary (http://phpexcel.codeplex.com)

– Create a PHP Excel class to read the file

– Create an HTML writer to render the HTML

– Save the HTML to a file and read its contents

3. PDF text extract – http://pastebin.com/hRviHKp1 

– Create an instance of the PDF2Text class holding a reference to the PDF file

– Decode the PDF which extracts the text out of the file

d) Powerpoint – Work in Progress to be added later

More as I work in the power point extractions

  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: