README
for Excel spreadsheet file r/w access scripts for octave (>
3.2.0)
Copyright (C) 2009, 2010 Philip Nienhuis <prnienhuis
at users.sf.net>
This version August 11, 2010.
EXCEL .XLS SUPPORT FILES
doc/README-XLS.html
This file.
xlsread.m
All-in-one
function for reading data from one specific worksheet in an Excel
spreadsheet file. This script has Matlab-compatible
functionality.
xlswrite.m
All-in-one
function for writing data to one specific worksheet in an Excel
spreadsheet file. This script has Matlab-compatible functionality.
xlsfinfo.m
All-in-one
function for exploring basic properties of an Excel spreadsheet file.
This script has Matlab-compatible functionality.
xlsopen.m
Function
for "opening" (= providing a handle to) an Excel
spreadsheet file ("workbook"). This function sorts out
which interface to use for .xls access (i.e.,COM; Java & Apache
POI; JexcelAPI; etc.), but it's choice can be overridden.
xls2oct.m
Function
for reading data from a specific worksheet pointed to in a struct
created by xlsopen.m. xls2oct can be called multiple times
consecutively using the same pointer struct, each time allowing to
read data from different ranges and/or worksheets. Data are returned
in the form of a 2D heterogeneous cell array that can be parsed by
parsecell.m. xls2oct is a mere wrapper for interface-dependent
scripts that do the actual low-level reading .
oct2xls.m
Function
for writing data to a specific worksheet pointed to in a struct
created by xlsopen.m. octxls can be called multiple times
consecutively using the same pointer struct, each time allowing to
write data to different ranges and/or worksheets. oct2xls is a mere
wrapper for interface-dependent scripts that do the actual low-level
writing.
xlsclose.m
Function
for closing (the handle to) an Excel workbook. When data have been
written to the workbook oct2xls will write the workbook to disk.
Otherwise, the file pointer is simply closed and possibly used
interfaces for Excel access (COM/ActiveX/Excel.exe) will be shut down
properly.
parsecell.m
Function for separating the data in raw arrays returned by xls2oct, into numerical/logical and text (cell) arrays.
spsh_chkrange.m
spsh_prstype.m
getusedrange.m
calccelladdress.m
parse_sp_range.m
Support files called by the scripts and not meant for direct invocation by users.
REQUIRED
SUPPORT SOFTWARE
For
the Excel/COM interface:
A windows computer with Excel installed
Octave-forge Windows-1.0.8 or later package WITH LATEST SVN PATCHES APPLIED
For the Java / Apache POI / JExcelAPI interfaces (general):
octave-forge java-1.2.5 package or later version on Linux
octave-forge java-1.2.6 with latest svn fixes on Windows/MingW
Java jre or jdk > 1.6.0 (hasn't been tested with earlier versions)
Apache POI specific:
class .jars: poi-3.5-FINAL-<date>.jar & poi-ooxml-3.5-FINAL-<date>.jar (or later versions) in classpath
Get it here: http://poi.apache.org/download.html
and here:
for OOXML support (only available with Apache POI):
poi-ooxml-schemas-<version>.jar, xbean.jar, dom4j-1.6.1.jar in javaclasspath.
Get them here:
http://poi.apache.org/download.html (“xmlbeans” and poi-ooxml-schemas)
http://sourceforge.net/projects/dom4j/files (dom4j-<version>)
JExcelAPI specific:
class .jar: jxl.jar in classpath
Get it here: http://sourceforge.net/projects/jexcelapi/files/
Hint: simply put the relevant javaaddpath statements in the .octaverc file.
USAGE
xlsread
and xlswrite are mere wrappers for
xlsopen-xls2oct-xlsclose-parsecell and
xlsopen-oct2xls-xlsclose sequences, resp. They
exist for the sake of Matlab compatibility.
xlsfinfo
can be used for finding out what worksheet names exist in the file.
For OOXML files the input parameter REQINTF should be specified with
a value of 'poi' (case-insensitive) and -obviously- the complete POI
interface must have been installed.
Invoking
xlsopen/..../xlsclose directly provides for much more
flexibility, speed, and robustness than xlsread / xlswrite.
Indeed, using the same file handle (pointer struct) you can mix
reading & writing before writing the workbook out to disk using
xlsclose.
And: xlsopen / xlsclose hide the gory
interface details from the user.
Currently only .xls files (BIFF8)
can be read/written; using JExcelAPI BIFF5 can be read as well. For
OOXML files the complete Apache POI interface must be installed (and
possibly the REQINTF parameter specified with a value of
'poi').
When using xlsopen....xlsclose be sure
to keep track of the file handle struct.
A possible
scenario:
xlh = xlsopen (<excel_filename> , [rw],
[<requested interface>])
# Set rw to 1 if you want to
write to a workbook immediately.
# In that case the check for file
existence is skipped and
# -if needed- a new workbook created.
#
If you really want an other interface than auto-selected
# by
xlsopen you can request that. But xlsopen still checks
# proper
support for your choice.
# Read some data
[
rawarr1, xlh ] = xls2oct (xlh, <SomeWorksheet>, <Range>)
#
Be sure to specify xlh as output argument as xls2oct keeps
# track
of changes and the need to write the workbook to disk
# in the
xlhstruct. And the origin range is conveyed through
# the xlh
pointer struct.
# Separate data into numeric and text
data
[ numarr1, txtarr1, lim1 ] = parsecell (rawarr1)
#
Get more data from another worksheet in the same workbook
[
rawarr2, xlh ] = xls2oct (xlh, <SomeOtherWorksheet>, <Range>)
[
numarr2, txtarr2, lim2 ] = parsecell (rawarr2)
# <...
Analysis and preparation of new data in cell array Newdata....>
#
Add new data to spreadsheet
xlh = oct2xls (Newdata, xlh,
<AnotherWorksheet>, <Range>)
# Close the
workbook and write it to disk; then clear the handle
xlh =
xlsclose (xlh)
clear xlh
When not using the COM interface, specify a value of 'POI' for parameter REQINTF when accessing OOXML files in xlsread, xlswrite, xlsopen, xlsfinfo (and be sure the complete Apache POI interface is installed).
Using JExcelAPI (JXL), after writing into a worksheet you MUST save the file – adding data to the same or another worksheet is no more possible after the first call to oct2xls().
SPREADSHEET FORMULA SUPPORT
When
using the POI and JXL interfaces you can:
(When reading, xls2oct) either read spreadsheet formula results (like in COM interface), or the literal formula text strings;
(When writing, oct2xls) either enter formulas in the worksheet as formulas, or enter them as literal text strings. The former is also like in COM.
In short, you can enter spreadsheet formulas and in a later stage read them back, change them and re-enter them in the worksheet.
The behaviour is controlled by an option structure options which for now has only one (logical) field:
options.formulas_as_text = 0 (the default) implies enter formulas as formulas and read back formula results
options.formulas_as_text =1 (or any positive integer) means enter formulas as text strings and read them back as text strings.
Be aware that there's no formula evaluator in JexcelAPI (JXL). So if you create formulas in your spreadsheet using oct2xls or xlswrite with 'JXL', do not expect meaningful results when reading those files later on unless you open them in Excel and write them back to disk.
While both Apache POI and JExcelAPI feature a formula validator, not all spreadsheet functions present in Excel have been implemented (yet).
Worse, older Excel versions feature less functions than newer versions. So be wary as this may make for interesting confusion.
MATLAB COMPATIBILITY
xlsread,
xlswrite and xlsfinfo are for the most part
Matlab-compatible. Some small differences are mentioned below. When
using the Java interfaces octave supplies some formula manipulation
support.
xlsread
Matlab's xlsread
supports invoking extra functions while reading ("passing
function handle"); octave not. But this can be simulated outside
xlsread.
Matlab's xlsread flags some spreadsheet
errors, octave-forge just returns blank cells.
When using a
Java interface, octave-forge returns info about the actual (rather
than the requested) cell range where the data came from; Matlab and
ActiveX/Excel can't. This is due to Excel automatically trimming
returned arrays from empty outer columns and rows. When using COM,
octave has the same limitation.
This might be solved using
advanced Visual Basic statements into getusedrange.m but my VB
proficiency is inadequate for this :-)
Personally I find it very useful to know from what part of a worksheet the data originate so I've put quite some effort in it :-)
Matlab's
xlsread ignores all non-numeric data values outside the smallest
rectangle encompassing all numerical values. Octave's xlsread
doesn't. This means that Matlab ignores all row/column headers, not
very user-friendly IMO.
When using the Java interface,
reading and writing xls-files by octave-forge is
platform-independent. On systems w/o installed Excel, Matlab can only
read Excel 95 formatted .xls files (written using ML xlswrite's
'Basic” option).
Matlab's
xlsread returns strings for cells containing date values. Octave
returns numerical data (where 0 = 1/1/1900 – you need to
transfer them into proper octave date values yourself, see bottom of
this document for more info).
Matlab's xlsread invokes
csvread if no Excel interface is present. Octave-forge's
xlsread doesn't.
xlswrite
Octave-forge's
xlswrite works on systems w/o Excel support, Matlab's doesn't
(properly).
When specifying a sheet number larger than the number of existing sheets in an .xls file, Matlab's xlswrite adds empty sheets until the new sheet number is created; Octave's xlswrite only adds one sheet called “Sheet<number>” where <number> is the specified sheet number.
Even better (IMO) while M's xlswrite always creates Sheet1/Sheet2/Sheet3 when creating a new spreadsheet, octave's xlswrite only creates the requested worksheet. (Did you know that you can instruct Excel to create spreadsheets with just one, or any number of, worksheets? Look in Tools | Options, General tab.)
And octave doesn't touch the “active sheet”.
If
the specified write range is larger than the actual data array,
Matlab's xlswrite adds #N/A cells to fill up the lowermost rows and
rightmost columns; octave-forge's xlswrite doesn't.
xlsfinfo
When
invoking Excel/COM interface, octave-forge's xlsfinfo also
echoes the type of sheet (worksheet, chart), not just the sheet
names. Using Java I haven't found similar functionality (yet).
COMPARISON OF INTERFACES & USAGE
Using
Excel itself (through COM / ActiveX on Windows systems) is probably
the most robust and versatile and especially FAST option. There's one
gotcha: in case of some type of COM errors Excel will keep running
invisibly, you can only end it through Task Manager.
A tiny
problem is that one cannot find out easily through COM what file
types are supported; xls, wks, wk1, xlsx, etc.
Another -obvious-
limitation is that COM Excel access only works on Windows systems
where Excel is installed.
As Excel/COM trims returned arrays from
empty outer row and columns it is not easy to find out from what cell
range the returned data originate.
JExcelAPI (Java-based and
therefore platform-independent) is proven technology but switching
between reading and writing is quite involved and memory-hungry when
processing large spreadsheets. As the docs state, JExcelAPI is
optimized for reading and it does do that well - but slower than
Excel/COM. The fact that upon a switch from reading to writing the
existing spreadsheet is overwritten in place by a blank one
and that you can only get the contents back wen writing out all of
the changes is worrying - and any change after the first write() is
lost as a next write() doesn't seem to work, worse yet, you may
completely loose the spreadsheet in question. The first is by
JExcelAPI design, the second is probably a bug (in octave-forge/Java
or JExcelAPI ? I don't know). Adding data to existing spreadsheets
does work, but IMO undue user confidence is needed.
JExcelAPI supports BIFF5 (only reading) and BIFF8 (Excel 95 and Excel 97-2003, respectively).
JexcelAPI,
unlike ApachePOI, doesn't evaluate functions while reading but
instead relies on cached results (i.e. results computed by Excel
itself). Depending on Excel settings (“Automatic calculation”
ON or OFF) this may or may not yield incorrect results.
Apache
POI (Java-based and platform-independent too) is based on the
OpenOffice.org I/O Excel r/w routines. It is a little more versatile
than JExcelAPI, while it doesn't support BIFF5 it does support BIFF8
(Excel 97 – 2003) and OOXML (Excel 2007).
It
is slower than native JXL let alone Excel & COM but it features
active formula evaluation, although at the moment (v. 3.6) not all
Excel functions have been implemented. I've made the relevant
subfunction (xls2jpoi2oct) fall back to cached formula results (and
yield a suitable warning) for non-implemented Excel functions while
reading Excel files.
All in all, of the two Java options I'd
prefer Apache POI rather than JexcelAPI. But the latter is
indispensable for BIFF5 formats.
Some notes on the choice for Java:
It saves a LOT of development time to use ready-baked Java classes rather than developing your own routines and thus effectively reinvent the wheel.
A BIG advantage is that a Java-based solution is platform-independent (“portable”).
But Java is known to be not very conservative with resources, especially not when processing XML-based formats.
So Java is a compromise between portability and rapid development time versus capacity (and speed).
But IMO data sets larger than 5.105 cells should not be kept in spreadsheets anyway. Better use real databases for such data sets.
A NOTE ON JAVA MEMORY USAGE
Java memory pool allocation size
The java virtual machine (JVM), when initialized by octave, reserves one big chunk of your computer's RAM in which all java classes and methods etc. are to be loaded: the java memory pool. It does this because java has a very sophisticated “garbage collection” system. At least on Windows, the initial size is 2MB and the maximum size is 16 MB. On Linux this allocated size might differ. This part of memory is where the java-based XLS/ODS octave routines live and keep their variables etc.
For transferring large pieces of information to and from spreadsheets you might hit the limits of this pool. E.g. to be able to handle I/O of an array of around 500,000 cells I needed a memory pool size of 512 MB.
The memory size can be increased by inserting a file called “java.opts” (without quotes) in the directory ./share/octave/packages/java-<version> (where the script file javaclasspath.m is located), containing just the following lines:
-Xms16m
-Xmx512m
(where 16 = initial size, 512 = maximum size, m stands for Megabyte).
After processing a large chunk of spreadsheet information you might notice that octave's memory footprint does not shrink so it looks like java's memory pool does not shrink back; but rest assured, the memory footprint is the allocated (reserved) memory size, not the actual used size. After the JVM has done its garbage collection, only the so-called “working set” of the memory allocation is really in use and that is a trimmed-down part of the memory allocation pool. On Windows systems it often suffices to minimize the octave terminal for a few seconds to get a more reasonable memory footprint.
TROUBLESHOOTING
Some hints for troubleshooting Excel support are contained in this thread:
dated August 10, 2010.
A more structured approach is below:
Check if COM / ActiveXworks (only under Windows OS). Do a pkg list and see
a. If there's a windows package mentioned (then it's installed). If not, install it.
b. If there's an asterisk on the windows package line (then the package is loaded). If not, do a pkg load windows
Check if the ActiveX server works. Do:
exl = actxserver ('Excel.Application') ## Note the period between “Excel” and “Application”
If a COM object is returned, ActiveX / COM / Excel works. Do: exl.Quit(); delete (exl) to shut down the (hidden) Excel invocation.
If you get an error message, your last resort is re-installing the windows package, or trying the Java-based interfaces.
Check if java works. Do a pkg list and see
a. If there's a java package mentioned (then it's installed). If not, install it.
b. If there's an asterisk on the java package line (then the package is loaded). If not, do a pkg rebuild-auto java
Check java memory settings. Try javamem
a. If it works, check if it reports sufficiently large max memory (had better be 200 MiB, the bigger the better)
b. If it doesn't work, do:
rt = java_invoke ('java.lang.Runtime', 'getRuntime')
rt.gc
rt.maxMemory ().doubleValue () / 1024 / 1024
The last command will show MaxMemory in MiB.
c. In case you have insufficient memory, see in “GOTCHAS”, “Java memory pool allocation size”, how to increase java's memory pre-reservation.
Check if all classes (.jarfiles) are in class path. Do a 'javaclasspath' (under unix/linux, do 'tmp = javaclasspath; strsplit (tmp,”:”)' (w/o quotes). See above under “REQUIRED SUPPORT SOFTWARE” what classes should be mentioned.
If classes (.jar files) are missing, download and put them somewhere and add them to the javaclass path with their fully qualified pathname (in quotes) using javaaddpath().
Once all classes are present and in the javaclasspath, the xls interfaces should just work. The only remaining showstoppers left are insufficient write privileges for the working directory, a wrecked up octave or some other problem outside octave.
Try opening an xls file:
xls1 = xlsopen ('test.xls', 1, 'poi'). If this works and xls1 is a struct with various fields containing objects, the Apache POI interface (POI) works. Do an xls1 = xlsclose (xls1) to close the file.
xls2 = xlsopen ('test.xls', 1, 'jxl'). If this works and xls2 is a struct with various fields containing objects, the JExcelAPI interface (JXL) works as well. Don'tforget to do xls2 = xlsclose (xls2) to close the file.
DEVELOPMENT
xlsopen/xlsclose and friends have been written so that adding other interfaces (Perl? native octave? ...?) should be very easily accomplished. xlsopen merely needs two stanzas, xlsfinfo and getusedrange each need an additional elseif stanza, and xlsclose needs a small stanza for closing the pointer struct and writing to disk.
The
real work lies in creating the relevant xls2<...>2oct
& oct2<...>2xls subfunction scripts in
xls2oct and oct2xls, resp., but that shouldn't be
really hard, depending on the interface support libraries' quality
and documentation. Separating the file access functions and the
actual reading/writing from/to the workbook in memory has made
developer's life (I mean: my time developing this stuff) much
easier.
Some other options for development (who?):
Speeding up, especially Java worksheet/cell access. For cracks, not me.
Automatic conversion of Excel date/time values into octave ones and vice versa. But then again Excel's dates are 01-01-1900 based (octave's 0-0-0000) and buggy (Excel thinks 1900 is a leap year), and I sometimes have to use dates from before 1900. Maybe as an option?
Creating Excel graphs (a significant enterprise to write from scratch).
Support for "passing function handle" in xlsread.
Enjoy!
Philip Nienhuis, August 10, 2010