I'm sure this is much easier than I'm making it. I have a comma-separated text file with just over 10,000 records. Each record is 4 fields. A combination of the first two fields (strings) is my search term. I want to search through the file (from that line forward) for a duplicate of that string. If no duplicate is found, output the row. If a duplicate is found, output the first row instance and do not output the duplicate instance.
My approach has been to make a copy of the file before I start, read a line from the original file, then loop through the copy searching for it. The caveat is that I of course find each record once (on the same line).
Here is a snippet of the file I'm using:
...
100010,MD9,07-27-2009,09:34
93079,MD9,07-27-2009,09:35
93078,MD9,07-27-2009,09:35
93077,MD9,07-27-2009,09:35
93080,MD9,07-27-2009,09:35
93081,MD9,07-27-2009,09:35
...
OPEN "iphistt1.dat" FOR INPUT AS #1
DO WHILE NOT EOF(1)
  LINE INPUT #1, row$
  cnt = cnt + 1
LOOP
CLOSE
OPEN "iphistt1.dat" FOR INPUT AS #1
OPEN "iptemp.dat" FOR OUTPUT AS #3
FOR i = 1 TO cnt
  INPUT #1, ser$, loc$, dt$, tm$: srch$ = ser$ + loc$
  lne$ = ser$ + cm$ + loc$ + cm$ + dt$ + cm$ + tm$
  FOR j = 1 TO cnt
    OPEN "iphistt2.dat" FOR INPUT AS #2
    INPUT #2, ser2$, loc2$: srch2$ = ser2$ + loc2$
    IF srch$ = srch2$ THEN
      CLOSE #2
      found = 1
      EXIT FOR
    END IF
    CLOSE #2
  NEXT
  IF found = 0 THEN
    dup = dup + 1
    LOCATE 10, 10: PRINT dup; "/"; cnt
    PRINT #3, lne$
  ELSE
    found = 0
  END IF
NEXT
CLOSE
I now have it to where all duplicates are skipped, though the first instance should be printed to the output file. Changing where and how file #2 was opened also sped it up quite a bit.
OPEN "iphistt1.dat" FOR INPUT AS #1
DO WHILE NOT EOF(1)
  LINE INPUT #1, row$
  cnt = cnt + 1
LOOP
CLOSE
OPEN "iphistt1.dat" FOR INPUT AS #1
OPEN "iptemp.dat" FOR OUTPUT AS #3
FOR i = 1 TO cnt
  INPUT #1, ser$, loc$, dt$, tm$: srch$ = ser$ + loc$
  lne$ = ser$ + cm$ + loc$ + cm$ + dt$ + cm$ + tm$
  OPEN "iphistt2.dat" FOR INPUT AS #2
  FOR j = 1 TO cnt
    INPUT #2, ser2$, loc2$: srch2$ = ser2$ + loc2$
    IF srch$ = srch2$ THEN
      found = found + 1
    END IF
  NEXT j
  IF found < 2 THEN
    dup = dup + 1
    LOCATE 10, 10: PRINT dup; "/"; cnt
    PRINT #3, lne$
  ELSE
    found = 0
  END IF
  CLOSE #2
NEXT i
CLOSE
Every time you close and open a file in a loop, it restarts at the beginning
Posted Aug 01 2009
When a duplicate is found, you set found = 1:
IF srch$ = srch2$ THEN
  CLOSE #2
  found = 1 'found flag
  EXIT FOR
END IF
But you look for found = 0 to find the duplicate:
IF found = 0 THEN 'change to IF found THEN
  dup = dup + 1
  LOCATE 10, 10: PRINT dup; "/"; cnt
  PRINT #3, lne$
  found = 0
END IF 'changed, you don't need else
All you do in the ELSE is make found = 0. That should be done if found = 1
NOTE: When testing a flag number, you don't need a comparison for any value but 0.
IF flag THEN is the same as IF flag <> 0 THEN
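A quick way to check this is the short test below. It is only an illustration (the variable name flag is arbitrary): an IF treats any nonzero value as true, negative values included, and only 0 as false.

```basic
' Illustration only: how IF treats a numeric flag.
flag = 2
IF flag THEN PRINT "2 is treated as true"
flag = -1
IF flag THEN PRINT "-1 is treated as true as well"
flag = 0
IF flag THEN PRINT "not printed" ELSE PRINT "0 is treated as false"
```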
sintral
Re: Clippy
Posted Aug 05 2009
I'm pretty sure that I do want the program to close and then reopen the iphistt2.dat file for each search term in iphistt1.dat. I wouldn't want to start searching for the next term on the line following the last found search term. I'm incrementing the found variable because I know it is going to find each search term at least once.
Here is my latest rewrite, which stores each duplicate search term in an array. This code results in an error: Extended Error 183.
DIM skip$(700) ' Create an array for storing records to skip
OPEN "iphistt1.dat" FOR INPUT AS #1
DO WHILE NOT EOF(1)
  LINE INPUT #1, row$
  cnt = cnt + 1 ' Get a count of records in the input file
LOOP
CLOSE
OPEN "iphistt1.dat" FOR INPUT AS #1
OPEN "iptemp.dat" FOR OUTPUT AS #3
FOR i = 1 TO cnt
  INPUT #1, ser$, loc$, dt$, tm$ ' Input entire line, but with each field separated
  srch$ = ser$ + loc$ ' Combine serial number and location fields to create the search term
  lne$ = ser$ + cm$ + loc$ + cm$ + dt$ + cm$ + tm$ ' Rebuild the line in csv format for printing to output file
  IF dup > 0 THEN ' If at least one duplicate has been found so far, it is safe to loop through the skip$() array.
    FOR m = 1 TO dup
      IF skip$(m) = srch$ THEN ' If one of the items in our skip list matches our current search term.
        skip = 1
        m = dup ' Exit the skip array if we found a match.
      END IF
    NEXT m
  END IF
  IF skip = 0 THEN ' If we're dealing with the first instance of this search term
    OPEN "iphistt2.dat" FOR INPUT AS #2
    FOR j = 1 TO cnt
      INPUT #2, ser2$, loc2$: srch2$ = ser2$ + loc2$
      IF srch$ = srch2$ THEN
        found = found + 1 ' Build a total count, but do not exit the for loop
      END IF
    NEXT j
    IF found > 1 THEN
      FOR x = 1 TO found
        dup = dup + 1
        skip$(dup) = srch$ ' Add current search term to skip list
        LOCATE 10, 20: PRINT dup; "/"; cnt
      NEXT x
      PRINT #3, lne$ ' Print this instance, now that any following are to be ignored.
    END IF
    found = 0 ' Reset found for the next search term.
    CLOSE #2
  END IF
  skip = 0 ' Reset skip boolean
NEXT i
CLOSE
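One way to get the behaviour asked for at the top of the thread (keep the first instance, drop any later one) without a skip array is to compare each record only against the records that come before it in the copy. The sketch below is untested and rests on assumptions: iphistt2.dat is an exact copy of iphistt1.dat, every line has exactly four comma-separated fields, and there is no trailing blank line. As far as I know, INPUT # does not discard the rest of a line, so the inner loop reads all four fields to stay lined up with the records.

```basic
' Sketch only: assumes iphistt2.dat is an identical copy of iphistt1.dat.
OPEN "iphistt1.dat" FOR INPUT AS #1
OPEN "iptemp.dat" FOR OUTPUT AS #3
i = 0
DO WHILE NOT EOF(1)
  INPUT #1, ser$, loc$, dt$, tm$
  i = i + 1
  found = 0
  IF i > 1 THEN
    OPEN "iphistt2.dat" FOR INPUT AS #2
    FOR j = 1 TO i - 1 ' Only the records before record i
      INPUT #2, ser2$, loc2$, dt2$, tm2$ ' Read all four fields to stay in step
      IF ser$ = ser2$ AND loc$ = loc2$ THEN
        found = 1 ' An earlier record has the same key
        EXIT FOR
      END IF
    NEXT j
    CLOSE #2
  END IF
  ' Write the record only when no earlier record shares its key
  IF found = 0 THEN PRINT #3, ser$; ","; loc$; ","; dt$; ","; tm$
LOOP
CLOSE
```

Comparing the two fields separately also avoids treating different serial/location pairs as equal just because their concatenations happen to match. The cost is still roughly the square of the record count, so with 10,000 records it will be slow.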
Moneo
Remove duplicate records
Posted Aug 05 2009
Have you considered sorting your text file, on the search term or key, first?
With that size of a file you would need an external file sorting utility. Do you have such a utility available? I've been using a DOS-based utility called Opttech Sort, for many years.
Assuming you have such a utility, you would give it the size of the key as the largest key size that you have. After sorting your file to a new sorted file, reading the sorted file and eliminating duplicates, which are now adjacent, is a simple program to write.
Regards..... Moneo
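A minimal sketch of that second step might look like the following. It assumes the sorted output is in a file called ipsorted.dat (a made-up name) and that every line has exactly four comma-separated fields; it has not been run against the real data.

```basic
' Sketch only: ipsorted.dat is a hypothetical name for the already-sorted file.
OPEN "ipsorted.dat" FOR INPUT AS #1
OPEN "iptemp.dat" FOR OUTPUT AS #3
prev$ = "" ' Key of the last record written; a real key always contains a comma
DO WHILE NOT EOF(1)
  INPUT #1, ser$, loc$, dt$, tm$
  srch$ = ser$ + "," + loc$ ' Serial number and location form the key
  IF srch$ <> prev$ THEN
    ' First record with this key: write it and remember the key
    PRINT #3, ser$; ","; loc$; ","; dt$; ","; tm$
    prev$ = srch$
  END IF
LOOP
CLOSE
```

Because duplicates are adjacent after sorting, only the previous key has to be kept in memory, so the file size is not a problem for this part.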
Moneo
MSDOS Sort
Posted Aug 05 2009
If you decide to sort your file, you might try the sort utility called SORT which comes with MSDOS.
To see the SORT help, from the command line enter: sort /?
If it scrolls too fast, you may need to do: sort /? | more
Regards..... Moneo
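If you want to run it from inside the program, something like this sketch could work. The output name ipsorted.dat is made up, and the command sorts on the whole line rather than on a separate key.

```basic
' Sketch only: ipsorted.dat is a hypothetical output name.
' SORT compares whole lines, so records that begin with the same
' "serial,location," text end up next to each other.
SHELL "sort < iphistt1.dat > ipsorted.dat"
```

Two things to check first. As far as I remember, the SORT shipped with MS-DOS itself only handles about 64K of input, which a file of 10,000 records would exceed, while the SORT in an NT-family Windows command prompt does not have that limit. Also, within a group of duplicates the order is decided by the rest of the line (date and time text), not by the original file order, so "first instance" may not mean the same record as before.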
09cOdE
Posted Aug 07 2009
I'm surprised this doesn't throw an error,
as you are not closing #1, you are just closing
DIM skip$(700) ' Create an array for storing records to skip
OPEN "iphistt1.dat" FOR INPUT AS #1
DO WHILE NOT EOF(1)
  LINE INPUT #1, row$
  cnt = cnt + 1 ' Get a count of records in the input file
LOOP
CLOSE