Wednesday, March 20, 2013

Dealing with Excel files from Common Lisp - Using ABCL and Apache POI

In my day job, I mostly program in Common Lisp. Most of what I do is file and database work which Common Lisp is pretty well-suited for. Sometimes, though, I have to deal with data that comes in Excel files, and in the past that meant loading the files into Excel, exporting it into some plain text format and then working with those plain text files from Common Lisp.

While this works, it is a manual and error prone process. Also, Excels plain text export mechanism often mangles the data in undesired ways, which requires yet more additional manual steps (or VB scripting), which I'd rather like to avoid for processes that need to be automatic. Thus, I was on the looks for a way to access the Excel files directly with a Common Lisp program.

Writing a new Excel file parser was out of the question - I have real customer needs to fulfill, and implementing a capable Excel file reader is a large infrastructure project. So I looked into using an existing Excel file reading library instead. There are numerous options, commercial and non-commercial, and I've looked into one of the libraries written in C, but the requirement to more or less manually create FFI stubs and foreign structure layouts for a library that in itself was not documented very well and also did not look like being very accessible made me look for other options.

Apache POI

One of the more prominent open source libraries for accessing Microsoft Office files is Apache POI. It has been around for over 10 years and supports most MS Office formats, including the old OLE2 Excel format as well as the newer OpenXML format (that, despite using XML as the base format, is a horribly complex mess that I hope to never have to deal with directly). Apache POI is a Java library, so it can't directly be used from SBCL, which is the Common Lisp implementation that I normally use.

Armed Bear Common Lisp (ABCL)

In the recent months, I have noticed that there was quite some activity around Armed Bear Common Lisp (ABCL). I had tried an earlier release of it, and while it somewhat worked, it seemed to have a fair number of restrictions that made it unsuitable for me at the time. In particular, ABCL lacked support for the Metaobject Protocol which is something that I often use, either directly or as a library dependency. Also, the older version that I tried could not load the Postmodern library that we use to access our Postgres database, which was the final show stopper. But all that was before the recent 1.1.1 release of ABCL.

ABCL is hosted on the Java Virtual Machine (JVM), and maybe the biggest advantage of that is that access to other JVM-hosted code is straightforward and easy from Common Lisp programs running in ABCL. Thus, using Apache POI should be a snap. Also, as ABCL is becoming a reasonably complete implementation of the Common Lisp standard now, I had hopes to be able to use some of my existing infrastructure code in the program that dealt with Excel files.

To make it short: ABCL works great now. It took me very little time to translate the calls that I found in some Apache POI example program to Common Lisp, and I could also use all of the Common Lisp libraries that I needed for the task. There are some important libraries that don't work on ABCL yet (i.e. CXML-STP, CL+SSL), but I don't need these right now. And ABCL, during development, behaves like any other reasonable Common Lisp implementation in that it supports SLIME.

The Excel reading was a snap and the read process is reasonably fast, but ABCL's startup times are a bit annoying. There currently is no way to do the equivalent of "saving the world" on ABCL, so one has to load all required software at startup time. We're using ASDF for that, and it seems that some of the slow startup times need to be attributed to it. The Excel file reader will run as a batch job, so the startup times don't matter for our production uses, but testing the scripts was a tad tardy.

To illustrate how easy accessing Excel files from Common Lisp is, here is some example code that dumps the first worksheet of an Excel file to the standard output in a Tab separated values format:

;; -*- Lisp -*-

(defpackage :export-tsv
  (:use :cl))

(in-package :export-tsv)

(defun init-classpath (&optional (poi-directory "~/poi-3.9/"))
  (let ((*default-pathname-defaults* poi-directory))
    (dolist (jar-pathname (or (directory "**/*.jar")
                              (error "no jars found in ~S - expected Apache POI binary ~
                                        installation there"
                                     (merge-pathnames poi-directory))))
      (java:add-to-classpath (namestring jar-pathname)))))

(defun process-file (pathname)
  (let* ((file-input-stream (java:jnew "java.io.FileInputStream"
                                       (namestring pathname)))
         (workbook (java:jstatic "create"
                                 "org.apache.poi.ss.usermodel.WorkbookFactory"
                                 file-input-stream))
         (sheet (java:jcall "getSheetAt" workbook 0))
         (formatter (java:jnew "org.apache.poi.ss.usermodel.DataFormatter" java:+true+))
         (total-row-count (java:jcall "getLastRowNum" sheet)))
    (dotimes (row-number total-row-count)
      (let* ((row (java:jcall "getRow" sheet row-number))
             (column-count (java:jcall "getLastCellNum" row)))
        (dotimes (column-number column-count)
          (unless (zerop column-number)
            (write-char #\Tab))
          (write-string (java:jcall "formatCellValue"
                                    formatter
                                    (java:jcall "getCell" row column-number))))
        (terpri)))
    (java:jcall "close" file-input-stream)))
Before the process-file function can be used, init-classpath must be called to add the Apache POI jars to the Java class path.

As you can see, the program is rather short, and even if all comments are stripped, the Java version contains a lot more ceremony. No surprise here, maybe you wanted to have your prejudice confirmed :).

ABCL will now have a firm place in my toolkit. Big shouts go to the maintainers who did a great job lifting ABCL up to a level where it will be very useful to me.

Share:

8 comments:

  1. It might seem surprising, but "**/*.jar" matches everything that "*.jar" does, i.e. the slash is only syntactic, not significant.

    ReplyDelete
  2. It was surprising to me to see that what you write did not work for me. That is why I search the two wildcarded paths. Might be an ABCL bug, but ABCL play time is over for now.

    ReplyDelete
  3. ... but I cannot reproduce the problem again now, so I've corrected the sample code. Thanks for pointing it out!

    ReplyDelete
  4. This code misses last row from the excel file. Seems like it's necessary to do
    (1+ (java:jcall "getLastRowNum" sheet)

    P.S.

    I am reusing your code, thanks for the example

    ReplyDelete
  5. The code is loosing the last row from the excel file. It seems necessary to use
    (1+ (java:jcall "getLastRowNum" sheet))

    P.S. I am reusing your code, thanks for the example.

    ReplyDelete
  6. Anton, thank you for the bug report - I have not corrected the original source in the post as I cannot verify it right now, but hopefully the comment will help people who reuse the code in the future. Please don't copy and paste blindly :)

    ReplyDelete
  7. It may be because the row numbering is zero-based in the lib.
    Thanks for the example.

    ReplyDelete
  8. Thank you very much for the starter example.

    I modified the code slightly to allow for empty cells.

    The key part is:

    (java:jcall "getCell" row column-number (java:jfield (java:jclass-of row) "RETURN_BLANK_AS_NULL"))

    ReplyDelete