This the first of a series of posts documenting my application of clojure to some “practical” problems at work. I primarily wrote this to force me to get my process down, and maybe to save someone else some time if they come across the same situation. I have not posted code samples (but I intend to in a near-future update).
Problem: there are k-hundred Microsoft Access databases sitting inside of zip archives spread across a several sub-directories (don’t dwell on the reasons for the prolific use of Access). The zip archives may also contain sub-directories that need to be examined. The client wants to retrieve all tables from the access databases (database versions may range from [1997, 2003, 2007]) and dump them into tab-delimited files with the first row being a header (derived from the table fields).
My solution: Use clojure and its java interop to build a stand-alone
application to rip the tables. As with other examples on the web and in
the Fogus and Halloway books, I aimed to “wrap the crap” by building a
nice functional-programming API (nice for me at least) around the java
libs. In the process, I ended up learning a lot about Clojure’s interop
facilities and how to live with java (I had little to no background with
java prior to Clojure). I believe firmly in searching thoroughly through
the existing clojure libraries for implementations prior to rolling your
own. file-seq
is a good example :) Another cool thing is to check out
the source to the contrib libs. Often times, they show a clear and
practical path for interfacing with the necessary java libraries to
accomplish a specific class of tasks. Also, the show
function from
clojure.contrib.repl-utils
is invaluable for interactively learning
about classes from the REPL (who needs intellisense? :) ).
General approach: Exploit the utilities in clojure.contrib.java-utils
and clojure.contrib.duck-streams
to facilitate easy file I/O inside of
clojure. Unfortunately, I ended up needing to extract some properties of
java’s File
from java.io
, as well as some boiler-plate java
functions from java.util.zip
, specifically the ZipFile
class. This
meant I couldn’t use the contrib libraries exclusively (but they helped
immensely). Fortunately, the need to wrap niche functionality pushed me
to learn a whole lot about java and how Clojure handles java.
Generally, I approached the problem as a reduction (or fold) over data
(a sequence of files) which performs necessary side-effecting IO while
accumulating a log-file as a result. I used the excellent file-seq
function (which taught me about the awesome tree-seq
function) as the
basis for my file-fetching. One useful utility function (get-files
) is
a variation of file-seq
with a regex-based filter applied to the
resulting file-sequence (I filter based on file-type –
#(re-find ".zip|.mdb.accdb" %)
).
When zip files are encountered during the reduction, we extract them with a decompression function to pull all of their entries (ported from java examples to clojure) into a folder with the same root-name as the zipfile (side-effecting IO in the process). The decompression function returns a list of all file-paths extracted so that we can lazily concatenate new files into the sequence of files returned by the original file sequence from get-files.
The only other file-types handled during the reduction are valid Access
files, which are ultimately converted into 1 or more tab-delimited
files. Surprisingly, database interrogation (at least for Access) ended
up being a significant source of frustration and research pain. Most of
this was due to the lack of feedback in why a particular connection
string would fail. Ideally, when the app encounters a valid database, it
uses clojure.contrib.sql
to establish a connection, determine the
names of the tables, and dump all tables into tab-delimited files. I
figured (from my limited knowledge) that we could use some inherent
database driver (JDBC) to connect to Access and pump SQL queries. On the
surface (and in Holloway’s book), this is actually really easy to do.
The SQL library in contrib facilitates the process pretty well….once
you have a valid connection string. It took a bit of research to figure
out how to build a connection string for a local .accdb
or .mdb
database. In the process, I explored the innards of the
clojure.contrib.sql
and clojure.contrib.sql.internal
source to see
how the lib was interacting with java.sql
. Most of the samples online
are for Java (not many clojure developers appear to need to interrogate
Access), but I was able to triangulate a solution from them. I
eventually found the MSDN site for arcane properties that the particular
ODBC driver for Access expected ….like the fact that
"DBQ=C:\insert\path\yourdb.accdb"
is the path property, which goes
after the sub-protocol portion of the JDBC connection url.
I ended up feeding clojure.contrib.sql
a map specifying the unique
Access-specific properties,
{:classname "sun.jdbc.odbc.JdbcOdbcDriver"
:subprotocol "odbc"
:subname (str "Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ="
dbpath)}
With the connection working, executing queries became trivial. However,
I almost hit ANOTHER wall - due to Access - in trying to interrogate the
database to get a list of table names (so that I could query them and
dump the results). Access hides its table names inside of a hidden
system table called MSysObjects
. My initial solution was to establish
a connection to the database, then execute a query against MsysObjects
to pull out the table names. The ODBC driver insisted that I did not
have read access for the table, and refused to provide a result. This is
one are where the online community just did not have the answers. The
recommended “fix” was either to use ADO or ADOX (both windows-specific
tech), or to open the database and set some options inside the database
to make the system table visible. Given the original problem statement,
this was untenable. At literally the last minute, I was rummaging over
the code samples for clojure.contrib.sql
, and saw the sample for
“getting tables from a database”. It turns out that the connection
object returned via java interop already interrogates the database and
retrieves its schema during connection. I had no idea (because of my
ignorance), but the clojure.contrib.sql
example was easily adapted to
provide a list of table-names from the map the library sample returned.
I combined all of this stuff into a main function, with the intent to
provide command-line args (and a JFileChooser
interface if no
args/invalid args are provided). Main snakes files from the source
directory, unpacks zip archives and injects a filtered sequence of valid
unzipped-files into the remaining file-set
, extracts tables from
databases into unique folders, and maintains a map of all input
file-paths to output file paths (which is finally dumped into a summary
log).
I am still working on deployment (as a standalone executable JAR), but so far Clojure is working nicely to solve this problem.