A while ago, I ran into a word processing requirement: converting documents to web pages and processing the document text to extract information. While working on this, I ended up wrapping a few APIs, e.g. JODConverter, into a small set of utility classes. I am not sure how valuable this is to other developers; however, documenting it might help someone find the open source APIs I used and reuse this as sample code. So, here are some of the use cases:

I was looking for a way to convert MS Word documents to HTML web pages and Rich Text Format documents, and used JODConverter to accomplish this. It relies on a headless OpenOffice service to carry out the conversion. TextExtractor exposes two methods in this regard, one each to extract plain text and HTML markup from documents.

public String extractTextFromFile(String inputfile);

public String extractHtmlTextFromFile(String inputfile);
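To give an idea of how methods like these could be put together, here is a minimal sketch. It is not the actual TextExtractor code: FileConverter is a hypothetical interface standing in for the JODConverter conversion call, TextExtractorSketch is a made-up class name, and reading the converted file as UTF-8 is an assumption. The sketch also declares IOException, which the signatures above do not. The idea is simply to convert into a temporary file whose extension selects the target format, then read that file back as a string.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical stand-in for the JODConverter conversion call (not a JODConverter type).
interface FileConverter {
    void convert(File input, File output) throws IOException;
}

// Illustrative sketch only; the real TextExtractor may be structured differently.
final class TextExtractorSketch {
    private final FileConverter converter;

    TextExtractorSketch(FileConverter converter) {
        this.converter = converter;
    }

    // Converts the document to plain text and returns the result.
    public String extractTextFromFile(String inputfile) throws IOException {
        return convertAndRead(inputfile, ".txt");
    }

    // Converts the document to HTML and returns the markup.
    public String extractHtmlTextFromFile(String inputfile) throws IOException {
        return convertAndRead(inputfile, ".html");
    }

    // Converts into a temporary file with the given suffix, reads it back,
    // and always removes the temporary file afterwards.
    private String convertAndRead(String inputfile, String suffix) throws IOException {
        File output = File.createTempFile("extract", suffix);
        try {
            converter.convert(new File(inputfile), output);
            // Assumption: the converted output is UTF-8 encoded.
            return new String(Files.readAllBytes(output.toPath()), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(output.toPath());
        }
    }
}
```

Keeping the converter behind a small interface means the extraction logic can be exercised with a fake converter, without a running office process.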

The extraction process relies on the document conversion operation provided by JODConverter. It requires a locally running OpenOffice process, which can be run headless; the configuration for that process can be provided in config.properties on the classpath.
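As a rough illustration of the configuration side, the sketch below loads connection settings from a properties file on the classpath using only the standard library. The class name OfficeConfigSketch and the keys openoffice.host and openoffice.port are hypothetical, chosen for this example rather than taken from the actual config.properties; the default port 8100 is the one commonly used in JODConverter examples.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Illustrative sketch only; key names and defaults are assumptions.
final class OfficeConfigSketch {
    final String host;
    final int port;

    OfficeConfigSketch(Properties props) {
        // Hypothetical keys; fall back to defaults when a key is absent.
        this.host = props.getProperty("openoffice.host", "localhost");
        this.port = Integer.parseInt(props.getProperty("openoffice.port", "8100").trim());
    }

    // Loads the named resource from the classpath; a missing file yields the defaults.
    static OfficeConfigSketch fromClasspath(String resource) throws IOException {
        Properties props = new Properties();
        try (InputStream in = OfficeConfigSketch.class.getClassLoader().getResourceAsStream(resource)) {
            if (in != null) {
                props.load(in);
            }
        }
        return new OfficeConfigSketch(props);
    }
}
```

A caller would then do something like OfficeConfigSketch.fromClasspath("config.properties") and pass the host and port on to whatever opens the connection to the office process.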

Another use case required me to persist an email body (HTML content) while processing incoming emails, and JODConverter came to the rescue again with its simple document conversion operation, which is wrapped in another TextExtractor method.

At the time of writing this post, the code available in the GitHub repository is still being finished off.