LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (May 2008, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Wed, 21 May 2008 18:35:26 -0400
Reply-To:   Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:   Re: Nested Data Accesses?
In-Reply-To:   <6c9d07d8-9bb0-42ba-a2b3-4def9d0342c3@w7g2000hsa.googlegroups.com>
Content-Type:   text/plain; charset="us-ascii"

Let me say up front that I would not agree to perpetuate such a contrived legacy system for creating database keys. While related programming may continue indefinitely, projects such as these too often lead to deadend careers. While I understand that many database programmers have to learn how to convert legacy systems to viable databases, it really doesn't make sense to prop up an inferior database scheme with tricky programs.

In general I see far too many questions on SAS-L on how to use SAS to "abnormalize" databases. I see bad database design as a practice as bad as, if not worse, than what Art Carpenter satirizes in his lectures as programming with job security in mind. In one common form of bad database design, a programmer asks how to create and manage a number of data artifacts, such as the key values that we see in the Nested Data Accesses problem. Other variants include hierarchic data in sequential rows of a dataset, variable values that point to other variables in the same row of a dataset, and different blocks of rows in one dataset. Often a database programmer claims that these data artifacts improve system performance, or have to be that way for various local or regulatory reasons, or make programming easier and more intuitive. In fact, none of these claims has merit. SAS and other database programming environments provide way fast access to related datasets and make programming simpler and easier to extend and maintain. The same programmers that claim to be working in special situations requiring data artifacts more likely or not will learn the advantages of good database designs and drop the artifacts one by one.

I spent some time last evening staring at this problem. I was trying to figure out why anyone would come up with this scheme in the first place, and why isn't it obvious to everyone that this analog of division in Roman numerals would be more trouble than it's worth. Perhaps it appears to have an "autokey generator" function and many database programmers believe (falsely) that database systems should create unique primary keys whenever alternative (candidate) keys contain duplicated values in a dataset.

So what's so wrong here? First, the key values combine an odd mixture of sequential numbers (dates?), what appears to be a day value, and a constant with a sequential number appended to it. We have no idea what the elements of the keys actually represent, making it more difficult to understand their role in a database, and we can only guess how they relate to attributes of relations in a database. The first SQL query in a basic program (below) shows how one might increment a character string representing a sequential number in a key value so that the sequential number exceeds by one the maximum value in a dataset (OLDIDS). The second query concatenates the string into the middle of a string that would replace a NULL string in a key value. All of this would work acceptably well except when operating on more than one tuple (row) with NULL strings in key values. The yield of the query has more than one tuple (row). Fortunately a SQL UPDATE statement requires a yield of only one value per attribute to be updated; that is, the query must yield the correct "relvar type". Testing the UPDATE with sample data shows it not to be the case in general or when operating on this sample of data.

Data OLDIDS; input id $ 1-6 dat $ 7-10 matchfield $ 11-15 ; cards; 070100da1tmatc1 070101da2tmatc1 070102da3tmatc1 070200da4tmatc2 070201da5tmatc2 070202da6tmatc3 070203da7tmatc3 070300da8tmatc4 ; run;

/* NEWDATA (underscore indicates no value) cards; ______da2tmatc1 ______da4tmatc2 ______da9tmatc3 ______da0tmatc3 */

/* Initial Match (done already): */ Data unmatched; input id $ 1-6 dat $ 7-10 matchfield $ 11-15 ; cards; 070101da2tmatc1 070200da4tmatc2 da9tmatc3 da0tmatc3 ; run; proc sql undo_policy=NONE; select distinct put(input( (select distinct max(substr(id,3,4)) from OLDIDS ) ,2.)+1 ,2.) from unmatched as R1 inner join OLDIDS as R2 on R1.matchfield=R2.matchfield where R1.id IS NULL ; quit; proc sql undo_policy=NONE; select distinct max(substr(R2.id,1,2)) ||put(input( (select distinct max(substr(id,3,4)) from OLDIDS ) ,2.)+1 ,2.) ||max(substr(R2.id,5,6)) as id, R1.dat as dat,R1.matchfield as matchfield from unmatched as R1 inner join OLDIDS as R2 on R1.matchfield=R2.matchfield where R1.id IS NULL group by substr(R2.id,1,2),substr(R2.id,5,6)

; quit;

proc sql undo_policy=NONE; insert into OLDIDS select id,dat,matchfield from (select distinct trim(max(substr(R2.id,1,2))) ||put(input( (select distinct max(substr(id,3,4)) from OLDIDS group by substr(id,3,4) ) ,2.)+1 ,2.) ||max(substr(R2.id,5,6)) as id, R1.dat as dat,R1.matchfield as matchfield from unmatched as R1 inner join OLDIDS as R2 on R1.matchfield=R2.matchfield where R1.id IS NULL group by substr(R2.id,1,2),substr(R2.id,5,6) ) ; quit;

Yes, looping through the UPDATES one key value at a time may solve the problem. Perhaps that solution makes sense as a stopgap measure. I'd recommend better database design as a better solution and one that may save time and effort beginning now. S

-----Original Message----- From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Laptop765@gmail.com Sent: Wednesday, May 21, 2008 12:23 PM To: sas-l@uga.edu Subject: Re: Nested Data Accesses?

Thanks everyone for your help... I finally figured out a way to do it. I wrote a macro with a for loop. In each iteration, I saved the matchfield value and some unique identifiers so I could close the dataset. I ran this through a modified version of the SQL to get both values. I then used the unique identifiers to add the two values to the field.

I appreciate all of the information everyone has given me, I learned a great deal through all of this.


Back to: Top of message | Previous page | Main SAS-L page