Does anyone have, or know of, any Perl code or modules that will read SAS Transport (XPORT) datasets? XPORT files are machine-independent files made up of 80-column records, used for data archiving and transfer, particularly in the pharmaceutical industry, where XPORT is one of the formats accepted by the US FDA for data submission. The layout of the format is documented in Technical Document TS-140, available from the SAS web site. The XPORT format has several advantages: all datasets in a SAS library can be exported into a single XPORT file with just a few lines of code from any version of SAS from 5 onwards; the format itself is frozen, so there are no version issues; it preserves the full numeric precision of exported floating point numbers; and it preserves dataset names, variable names, variable labels, and variable informats and formats. It does not appear to support dataset labels, nor does it support the long dataset or variable names or labels introduced in SAS Version 7/8.
XPORT files ought to be easy to read using Perl. Unfortunately, the XPORT format stores numeric data in IBM mainframe double precision floating point format, so unless you happen to have an IBM mainframe, you need to convert each number to something more usable, such as the IEEE representation. SAS supplies some C source code to do this, and that code could be integrated into Perl as an extension. Has anyone else tackled this problem? As far as I can tell, there are no modules on CPAN (Comprehensive Perl Archive Network) which address this issue.
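To make the conversion concrete, here is a minimal sketch, written in Python only because it is compact; the same arithmetic should carry over to Perl using `unpack`. It assumes the standard IBM System/360 double layout (1 sign bit, a 7-bit excess-64 exponent in base 16, and a 56-bit fraction, stored big-endian). The function name `ibm_double_to_float` is my own, and the sketch does not try to recognise SAS missing values, which would need separate handling according to TS-140.

```python
import math
import struct


def ibm_double_to_float(raw: bytes) -> float:
    """Convert one 8-byte big-endian IBM double to a native (IEEE) float.

    Assumed layout: bit 0 is the sign, bits 1-7 are the base-16 exponent
    with a bias of 64, and bits 8-63 are the fraction F, so that
    value = (-1)**sign * (F / 2**56) * 16**(exponent - 64).
    """
    if len(raw) != 8:
        raise ValueError("an IBM double is exactly 8 bytes")
    (bits,) = struct.unpack(">Q", raw)      # one unsigned 64-bit integer
    sign = -1.0 if bits >> 63 else 1.0
    exponent = (bits >> 56) & 0x7F
    fraction = bits & 0x00FFFFFFFFFFFFFF    # low 56 bits
    if fraction == 0:
        return 0.0
    # 16**(e - 64) == 2**(4 * (e - 64)); dividing F by 2**56 subtracts 56.
    return sign * math.ldexp(fraction, 4 * (exponent - 64) - 56)


if __name__ == "__main__":
    print(ibm_double_to_float(bytes.fromhex("4110000000000000")))
    print(ibm_double_to_float(bytes.fromhex("C276A00000000000")))
```

Note that the IBM fraction has 56 bits while an IEEE double holds 53, so the last few bits of the fraction can be rounded away in the conversion.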
Alternatively, what seems to be needed is an ASCII-based file format which achieves the same as the SAS XPORT format. It would be quite easy to write a portable SAS macro which, using nothing more than Base SAS, automatically exported a SAS dataset into this alternative format. Such a format could include information which is missing from the XPORT format, such as the dataset label, the number of observations in the dataset, and perhaps even a checksum or hash of the data values to check data integrity.
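As a rough illustration of what such a file might carry, here is a small Python sketch of the reading and checking side. Everything about the layout is hypothetical and invented for this example: four `#key=value` header lines (dataset name, label, observation count, SHA-256 of the body) followed by tab-separated rows. The names `export_ascii` and `verify_ascii` are mine, and the sketch assumes values contain no tabs or newlines.

```python
import hashlib

HEADER_LINES = 4  # hypothetical: dataset, label, nobs, sha256


def export_ascii(name, label, columns, rows):
    """Build the text of a hypothetical ASCII export with metadata and a hash."""
    body_lines = ["\t".join(columns)]
    for row in rows:
        # repr() of a float gives a string that reads back to the same value.
        body_lines.append(
            "\t".join(repr(v) if isinstance(v, float) else str(v) for v in row)
        )
    body = "\n".join(body_lines) + "\n"
    digest = hashlib.sha256(body.encode("ascii")).hexdigest()
    header = (
        f"#dataset={name}\n#label={label}\n#nobs={len(rows)}\n#sha256={digest}\n"
    )
    return header + body


def verify_ascii(text):
    """Recompute the body hash and compare it with the one in the header."""
    lines = text.splitlines(keepends=True)
    header, body = lines[:HEADER_LINES], "".join(lines[HEADER_LINES:])
    meta = dict(line[1:].rstrip("\n").split("=", 1) for line in header)
    return hashlib.sha256(body.encode("ascii")).hexdigest() == meta["sha256"]


if __name__ == "__main__":
    text = export_ascii("DEMO", "Demo data", ["id", "x"], [[1, 0.1], [2, 2.5]])
    print(text)
    print(verify_ascii(text))
```

The SAS macro would of course have to compute the same hash over the same bytes, which is one reason to keep the body layout as simple as possible.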
The only difficulty with this is how to ensure that maximum numeric precision is always retained and represented in the ASCII export file without making the file enormous (by allowing for a large number of digits for every numeric variable). Does anyone have any insights into this problem?
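One possible approach, sketched below in Python, is to write each number with the fewest significant digits that still read back to exactly the same value, rather than a fixed wide field. For IEEE doubles, 17 significant digits are always enough, so the loop tries 1 to 17. The function name `shortest_roundtrip` is my own; whether the same loop can be done efficiently in a Base SAS macro is something I have not checked.

```python
def shortest_roundtrip(x: float) -> str:
    """Return the shortest %g-style string that converts back to exactly x."""
    for digits in range(1, 18):
        text = f"{x:.{digits}g}"
        if float(text) == x:
            return text
    # Only reached for values that never compare equal to themselves (NaN).
    return repr(x)


if __name__ == "__main__":
    for value in (0.1, 2.5, 1.0 / 3.0, 123456789.125):
        text = shortest_roundtrip(value)
        print(text, float(text) == value)
```

With this scheme, "simple" values such as 0.1 or 2.5 cost only a few characters, and only values that genuinely need 16 or 17 digits take up that much space. A fixed-width alternative would be to write the raw bits in hexadecimal (16 characters per number), which is exact but not human-readable.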