Ripping Access Databases in Clojure

tom, 2011-10-07

This the first of a series of posts documenting my application of clojure to some “practical” problems at work. I primarily wrote this to force me to get my process down, and maybe to save someone else some time if they come across the same situation. I have not posted code samples (but I intend to in a near-future update).

Problem: there are k-hundred Microsoft Access databases sitting inside of zip archives spread across a several sub-directories (don’t dwell on the reasons for the prolific use of Access). The zip archives may also contain sub-directories that need to be examined. The client wants to retrieve all tables from the access databases (database versions may range from [1997, 2003, 2007]) and dump them into tab-delimited files with the first row being a header (derived from the table fields).

My solution: Use clojure and its java interop to build a stand-alone application to rip the tables. As with other examples on the web and in the Fogus and Halloway books, I aimed to “wrap the crap” by building a nice functional-programming API (nice for me at least) around the java libs. In the process, I ended up learning a lot about Clojure’s interop facilities and how to live with java (I had little to no background with java prior to Clojure). I believe firmly in searching thoroughly through the existing clojure libraries for implementations prior to rolling your own. file-seq is a good example :) Another cool thing is to check out the source to the contrib libs. Often times, they show a clear and practical path for interfacing with the necessary java libraries to accomplish a specific class of tasks. Also, the show function from clojure.contrib.repl-utils is invaluable for interactively learning about classes from the REPL (who needs intellisense? :) ).

General approach: Exploit the utilities in clojure.contrib.java-utils and clojure.contrib.duck-streams to facilitate easy file I/O inside of clojure. Unfortunately, I ended up needing to extract some properties of java’s File from java.io, as well as some boiler-plate java functions from java.util.zip, specifically the ZipFile class. This meant I couldn’t use the contrib libraries exclusively (but they helped immensely). Fortunately, the need to wrap niche functionality pushed me to learn a whole lot about java and how Clojure handles java.

Generally, I approached the problem as a reduction (or fold) over data (a sequence of files) which performs necessary side-effecting IO while accumulating a log-file as a result. I used the excellent file-seq function (which taught me about the awesome tree-seq function) as the basis for my file-fetching. One useful utility function (get-files) is a variation of file-seq with a regex-based filter applied to the resulting file-sequence (I filter based on file-type – #(re-find ".zip|.mdb.accdb" %) ).

When zip files are encountered during the reduction, we extract them with a decompression function to pull all of their entries (ported from java examples to clojure) into a folder with the same root-name as the zipfile (side-effecting IO in the process). The decompression function returns a list of all file-paths extracted so that we can lazily concatenate new files into the sequence of files returned by the original file sequence from get-files.

The only other file-types handled during the reduction are valid Access files, which are ultimately converted into 1 or more tab-delimited files. Surprisingly, database interrogation (at least for Access) ended up being a significant source of frustration and research pain. Most of this was due to the lack of feedback in why a particular connection string would fail. Ideally, when the app encounters a valid database, it uses clojure.contrib.sql to establish a connection, determine the names of the tables, and dump all tables into tab-delimited files. I figured (from my limited knowledge) that we could use some inherent database driver (JDBC) to connect to Access and pump SQL queries. On the surface (and in Holloway’s book), this is actually really easy to do. The SQL library in contrib facilitates the process pretty well….once you have a valid connection string. It took a bit of research to figure out how to build a connection string for a local .accdb or .mdb database. In the process, I explored the innards of the clojure.contrib.sql and clojure.contrib.sql.internal source to see how the lib was interacting with java.sql. Most of the samples online are for Java (not many clojure developers appear to need to interrogate Access), but I was able to triangulate a solution from them. I eventually found the MSDN site for arcane properties that the particular ODBC driver for Access expected ….like the fact that "DBQ=C:\insert\path\yourdb.accdb" is the path property, which goes after the sub-protocol portion of the JDBC connection url.

I ended up feeding clojure.contrib.sql a map specifying the unique Access-specific properties,

{:classname "sun.jdbc.odbc.JdbcOdbcDriver"
 :subprotocol "odbc"
 :subname (str "Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ="
	       dbpath)}

With the connection working, executing queries became trivial. However, I almost hit ANOTHER wall - due to Access - in trying to interrogate the database to get a list of table names (so that I could query them and dump the results). Access hides its table names inside of a hidden system table called MSysObjects. My initial solution was to establish a connection to the database, then execute a query against MsysObjects to pull out the table names. The ODBC driver insisted that I did not have read access for the table, and refused to provide a result. This is one are where the online community just did not have the answers. The recommended “fix” was either to use ADO or ADOX (both windows-specific tech), or to open the database and set some options inside the database to make the system table visible. Given the original problem statement, this was untenable. At literally the last minute, I was rummaging over the code samples for clojure.contrib.sql, and saw the sample for “getting tables from a database”. It turns out that the connection object returned via java interop already interrogates the database and retrieves its schema during connection. I had no idea (because of my ignorance), but the clojure.contrib.sql example was easily adapted to provide a list of table-names from the map the library sample returned.

I combined all of this stuff into a main function, with the intent to provide command-line args (and a JFileChooser interface if no args/invalid args are provided). Main snakes files from the source directory, unpacks zip archives and injects a filtered sequence of valid unzipped-files into the remaining file-set, extracts tables from databases into unique folders, and maintains a map of all input file-paths to output file paths (which is finally dumped into a summary log).

I am still working on deployment (as a standalone executable JAR), but so far Clojure is working nicely to solve this problem.