If you are looking for a tool to convert an HTML document to a PDF, the iTextSharp converter is a great choice. iTextSharp is the .NET port of iText, an open source PDF library for Java. If you wish to use iTextSharp commercially, however, you will need to pay for a license.
Keep in mind that I am basing my observations on the iTextSharp port to .NET. The Java version may behave differently.
I personally like the iTextSharp tool and, even given its quirks, will continue using it. Please note that this article is focused on observations of the iTextSharp HTMLWorker object. Significant advancements have been made with the iTextSharp XMLWorker object, which you can read more about in another article I have written.
Here are some of the main points to consider when planning or writing your HTML to PDF conversion function using the iTextSharp HTMLWorker object:
- There are two main ways of using the iTextSharp libraries to generate a PDF document:
  - You can write directly to an iTextSharp document as you read from your database (this is nice because it bypasses the need to read from or write to the server’s file system).
  - You can read an HTML file from the server’s file system into a StringWriter() object.
- The iTextSharp converter does not properly handle global page styles applied through a style sheet. In the best case the converter ignores the global style sheet; in the worst case it renders the style sheet as text in the generated PDF document.
- To manually control the font size in a converted iTextSharp document, you need to specify the font size at the HTML element level using either the <font size="1"> element or the in-line style="font-size:10px;" attribute. This also applies to other page-level styles.
- The default font used by iTextSharp when none is specified is 12 pt Helvetica.
- iTextSharp will not correctly handle jagged HTML tables. Specifically, if your 10-column table has a row with only 1 column, the conversion routine will start taking columns from the rows after the 1-column row in order to make up the 10 columns. This can produce surprising results if you are generating your table columns and rows dynamically from a database. Furthermore, iTextSharp does not support the HTML rowspan attribute.
- Including images with your new PDF can be tricky. The three most common methods are:
  - Referencing an image file by path and name on the local file system.
  - Using a System.Drawing.Image object.
  - Passing the location of an image file as a URL.
- Images will be imported at 72 dpi and will most likely lose significant resolution. Use the .ScalePercent(24f) method to try to correct for this (24 percent is 72/300, so the image is scaled as if it had been saved at 300 dpi).
- It’s best to force your generated PDF document into landscape mode when working with HTML table data.
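- Use the iTextSharp.text.Document constructor to set your PDF to an A4 page in landscape layout. The four numeric arguments are the left, right, top and bottom margins, measured in points rather than pixels. The following line uses 1-point left and right margins, a top margin of -100 and a bottom margin of 0: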
Dim myDocument As iTextSharp.text.Document = New iTextSharp.text.Document(iTextSharp.text.PageSize.A4.Rotate(), 1, 1, -100, 0)
- iTextSharp does not support page breaks in the converted HTML. This is a bit of a glaring omission in my opinion. Some third-party add-ons exist that enable page breaks, but given my misgivings about relying on too many independently developed add-ons, I have gone the route of using <br> tags instead. When rendering a Web page to PDF, I pass in a parameter that toggles a series of line breaks that push my report tables onto separate pages. It may not be the best solution, but it works.
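The points above can be pulled together in a short sketch. This is only an illustration under stated assumptions, not the code used for this article: the module name, the ConvertFragments function, and the 10-point margins are hypothetical, and the approach to page breaks is a different technique from the <br> padding described above. Here the HTML is split into fragments before conversion, and Document.NewPage() is called between them.

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports iTextSharp.text
Imports iTextSharp.text.html.simpleparser
Imports iTextSharp.text.pdf

Module HtmlToPdfSketch

    ' Hypothetical helper: converts each HTML fragment with the HTMLWorker
    ' and starts a new PDF page between fragments.
    Function ConvertFragments(fragments As IEnumerable(Of String)) As Byte()
        Using output As New MemoryStream()
            ' A4 landscape; the 10-point margins are an arbitrary choice for this sketch.
            Dim pdfDoc As New Document(PageSize.A4.Rotate(), 10, 10, 10, 10)
            PdfWriter.GetInstance(pdfDoc, output)
            pdfDoc.Open()

            Dim isFirst As Boolean = True
            For Each fragment As String In fragments
                ' Force a page break before every fragment except the first.
                If Not isFirst Then pdfDoc.NewPage()
                isFirst = False

                Using reader As New StringReader(fragment)
                    ' ParseToList turns the HTML into iTextSharp elements;
                    ' no StyleSheet is supplied, so styling must be in-line.
                    For Each element As IElement In HTMLWorker.ParseToList(reader, Nothing)
                        pdfDoc.Add(element)
                    Next
                End Using
            Next

            pdfDoc.Close()
            Return output.ToArray()
        End Using
    End Function

End Module
```

Because the PDF is written to a MemoryStream, the resulting byte array can be sent straight to the browser without touching the server’s file system. Each fragment needs to be HTML that the HTMLWorker can handle on its own, for example one report table per fragment.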
Some specific workarounds when using the HTMLWorker object
In this article I’ve been working with the iTextSharp HTMLWorker object. This object has been deprecated, but many older systems still use it. Converting from the HTMLWorker to the XMLWorker requires you to reformat your HTML from HTML 4.01 to XHTML, which can take a significant amount of effort, so it’s still important to know your way around the HTMLWorker object.
Here are some of the quirks and workarounds I’ve noticed with the HTMLWorker object:
- If you are inserting a blank table cell, the HTML entity for a non-breaking space (&nbsp;) does not work. In fact the cell won’t even be rendered, and if it is the only cell in a row, the row won’t be rendered either. Using a plain space character typed with the spacebar seems to do the trick, though, and forces the cell to render.
- If you are applying font-size styles to your text, the HTMLWorker object seems to really dislike non-numeric font sizes. For example:
  - Using font-size:xx-small; will cause the HTMLWorker object to simply skip rendering the HTML element on which the style is set.
  - Specifically, I set the font size on a span tag that was inside a table cell. Surprisingly, this caused the entire table row not to render when the document was exported to PDF.
  - However, I was able to get my table row to render by changing the font size to a numeric value. When I changed the style of the span tag to font-size:8px; the row rendered in the PDF and the text had the correct font size.
- If you want to apply a background color to a table cell, note that the HTMLWorker object does not recognize the CSS markup background-color:red; and instead accepts the table cell attribute bgcolor="#FF0000".
- If you want to size a table cell or table column, you cannot use CSS. Instead use the <td width="..."> syntax. Furthermore, you cannot size the table cell with an absolute number. Since the PDF conversion algorithm automatically sizes your HTML tables to 100% of the PDF document, you must also size your columns using percentages. So for example <td width="5%"> works like a charm.
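Some of these workarounds can be applied automatically before the HTML reaches the HTMLWorker. The following is a minimal sketch and not part of iTextSharp: the module name, the PrepareHtml function, and the keyword-to-pixel mapping are all assumptions made for illustration. It swaps every &nbsp; entity for a plain space and rewrites named font sizes as numeric ones.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module HtmlWorkerCleanup

    ' Assumed mapping from CSS size keywords to pixel values; adjust as needed.
    Private ReadOnly NamedSizes As New Dictionary(Of String, String)(StringComparer.OrdinalIgnoreCase) From {
        {"xx-small", "8px"}, {"x-small", "10px"}, {"small", "12px"},
        {"medium", "14px"}, {"large", "18px"}, {"x-large", "24px"}, {"xx-large", "32px"}
    }

    ' Hypothetical helper: returns a copy of the HTML with two of the
    ' HTMLWorker workarounds applied.
    Function PrepareHtml(html As String) As String
        ' Replace every non-breaking space entity with a plain space
        ' so that otherwise empty table cells still render.
        Dim result As String = html.Replace("&nbsp;", " ")

        ' Rewrite keyword font sizes (font-size:xx-small) as numeric ones.
        ' Relative keywords such as "smaller" or "larger" are left untouched.
        result = Regex.Replace(result,
            "font-size\s*:\s*(xx-small|x-small|small|medium|large|x-large|xx-large)\b",
            Function(m As Match) "font-size:" & NamedSizes(m.Groups(1).Value),
            RegexOptions.IgnoreCase)

        Return result
    End Function

End Module
```

Calling PrepareHtml on the HTML string just before handing it to the HTMLWorker keeps the workarounds in one place. The background-color and column width changes are harder to automate with simple string replacement and are probably best made in the markup itself.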