Date: Thu, 21 Jul 2005 19:46:48 -0500
Reply-To: "Oliver, Richard" <roliver@SPSS.COM>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: "Oliver, Richard" <roliver@SPSS.COM>
Subject: Re: [BULK] Re: Peeve: SPSS and character data type
Content-Type: text/plain; charset="us-ascii"
A few comments:
1. I can't argue that it certainly would be useful if SPSS was more flexible in the treatment of strings of different lengths.
2. AFAIK, it's simply not true that "SPSS picks up the column width from the value in the first row (is it now the first few rows?), and anything longer than what occurs there is truncated." At least it isn't true for Excel files in Excel 97 (or maybe 95) format or later read via GET DATA instead of GET TRANSLATE. In my simple test, the longest value didn't occur until row 1000 in the Excel file, but the string width in SPSS was based on that value.
3. You can fix the length of strings to some arbitrarily long width by reading the Excel files as if they were database tables with the Database Wizard (GET DATA TYPE=ODBC /ASSUMEDSTRINGWIDTH=value). In fact, if you want to merge multiple sheets from the same workbook, that's what you should do anyway. You can read all the sheets from the same book in a single command -- and strings of differing lengths don't matter! (at least string length didn't matter for the ID/key variable in my simple test)
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of Richard Ristow
Sent: Thursday, July 21, 2005 4:34 PM
Subject: [BULK] Re: Peeve: SPSS and character data type
At 12:30 PM 7/21/2005, Allan Reese AFH CEFAS wrote (subject "Re:
SPSSX-L Digest - 19 Jul 2005 to 20 Jul 2005 (#2005-196) Peeve re text
>I sympathise with Richard having frequently had to work on merging
>files with "incompatible" data types. Maybe there could be a merge
>option of "if two variables have the same name but one is a longer
>string, pad the short variable to that width."
Yes, yes indeed. That is exactly what I would like to see.
>[Sometimes this] derives from Excel [ugh] when naive users type in data
>values without regard to type and the file gets converted to SPSS with
>an arbitrary column width.
First, as to Excel: It isn't just naive users. Naive users are apt to enter text and numerical values in the same column [double UGH!], which is even worse. But experienced users need, and use, columns of text data. They enter different widths in different cells because their values have different widths, and they don't want to pad to a fixed width with blanks and a period; and, why should they?
>One approach to the problem is to ensure that any data copied from
>Excel has its first row as dummy data to establish the consistent
>format for each column.
That's almost required, and not just for consistent formats between sheets. SPSS picks up the column width from the value in the first row (is it now the first few rows?), and anything longer than what occurs there is truncated. The SPSS quirk makes this much worse with multiple
sheets: you have to fill in a template value matching the longest string found in *any* of the sheets.
>On the other hand, this perversity usually derives from Excel [and its
>variable-length text datatype].
As I wrote, there are plenty of other ways. A lot of text values, like names of places or people, have no natural length or maximum length.
Lengths assigned in databases (SPSS, Access, SAS, or whatever) are designers' estimates or experience. If you can't standardize the data dictionary project-wide at the outset, lengths will vary between sources because of designers' different estimates and experience. And even lengths that should be identical, e.g. the same item in different tables in one Access database, often are not.
>It is clear that the SPSS data model, of a table accessed row-wise, is
>not consistent with variable-width character strings.
It's neither here nor there now, but that's not totally clear. There are reasonably good ways to implement variable-width strings in such a model. Width variable width up to a defined maximum is pretty easy.
Completely free width is harder, especially to handle efficiently, but far from impossible.
That's neither here nor there because we're talking SPSS, not abstractions. Adding varying-length strings with a defined maximum length would take a major change in SPSS's file structure. Adding varying-length strings with no defined maximum length (or a large fixed limit, say 32K bytes) would take rebuilding the file structure, and all code to manage it, from the ground up.