sintral
Remove duplicates from a text file.
Posted Aug 01 2009
I'm sure this is much easier than I'm making it. I have a comma-separated text file with just over 10,000 records. Each record is 4 fields. A combination of the first two fields (strings) is my search term. I want to search through the file (from that line forward) for a duplicate of that string. If no duplicate is found, output the row. If a duplicate is found, output the first row instance and do not output the duplicate instance.

My approach has been to make a copy of the file before I start, read a line from the original file, then loop through the copy searching for it. The caveat is that I, of course, find each record once (on the same line).

Here is a snippet of the file I'm using:
...
100010,MD9,07-27-2009,09:34
93079,MD9,07-27-2009,09:35
93078,MD9,07-27-2009,09:35
93077,MD9,07-27-2009,09:35
93080,MD9,07-27-2009,09:35
93081,MD9,07-27-2009,09:35
...

My (non-working) code is below. Any suggestions?


cm$ = CHR$(44)
SHELL ("COPY iphist.dat iphistt1.dat")
SHELL ("COPY iphist.dat iphistt2.dat")
CLS

OPEN "iphistt1.dat" FOR INPUT AS #1
DO WHILE NOT EOF(1)
   LINE INPUT #1, row$
   cnt = cnt + 1
LOOP
CLOSE

OPEN "iphistt1.dat" FOR INPUT AS #1
OPEN "iptemp.dat" FOR OUTPUT AS #3

FOR i = 1 TO cnt
   INPUT #1, ser$, loc$, dt$, tm$: srch$ = ser$ + loc$
   lne$ = ser$ + cm$ + loc$ + cm$ + dt$ + cm$ + tm$
   FOR j = 1 TO cnt
      OPEN "iphistt2.dat" FOR INPUT AS #2
      INPUT #2, ser2$, loc2$: srch2$ = ser2$ + loc2$
      IF srch$ = srch2$ THEN
         CLOSE #2
         found = 1
         EXIT FOR
      END IF
      CLOSE #2
   NEXT
   IF found = 0 THEN
      dup = dup + 1
      LOCATE 10, 10: PRINT dup; "/"; cnt
      PRINT #3, lne$
   ELSE
      found = 0
   END IF
NEXT
CLOSE

SHELL ("RENAME iphist.dat iphist.bak")
SHELL ("RENAME iptemp.dat iphist.dat")
KILL "iphistt2.dat": KILL "iphistt1.dat"
END

sintral
Slight rewrite
Posted Aug 01 2009
I now have it to where all duplicates are skipped, though the first instance should be printed to the output file. Changing where and how file #2 was opened also sped it up quite a bit.


cm$ = CHR$(44)
SHELL ("COPY iphist.dat iphistt1.dat")
SHELL ("copy iphist.dat iphistt2.dat")
CLS

OPEN "iphistt1.dat" FOR INPUT AS #1
DO WHILE NOT EOF(1)
   LINE INPUT #1, row$
   cnt = cnt + 1
LOOP
CLOSE

OPEN "iphistt1.dat" FOR INPUT AS #1
OPEN "iptemp.dat" FOR OUTPUT AS #3

FOR i = 1 TO cnt
   INPUT #1, ser$, loc$, dt$, tm$: srch$ = ser$ + loc$
   lne$ = ser$ + cm$ + loc$ + cm$ + dt$ + cm$ + tm$
   OPEN "iphistt2.dat" FOR INPUT AS #2
   FOR j = 1 TO cnt
      INPUT #2, ser2$, loc2$: srch2$ = ser2$ + loc2$
      IF srch$ = srch2$ THEN
         found = found + 1
      END IF
   NEXT j
   IF found < 2 THEN
      dup = dup + 1
      LOCATE 10, 10: PRINT dup; "/"; cnt
      PRINT #3, lne$
   ELSE
      found = 0
   END IF
   CLOSE #2
NEXT i
CLOSE

SHELL ("RENAME iphist.dat iphist.bak")
SHELL ("RENAME iptemp.dat iphist.dat")
KILL "iphistt2.dat": KILL "iphistt1.dat"
END

Clippy
Every time you close and open a file in a loop, it restarts at the beginning
Posted Aug 01 2009
Then, when a duplicate is found, you set found = 1:

IF srch$ = srch2$ THEN
         CLOSE #2
         found = 1 'found flag
         EXIT FOR
END IF

But you look for found = 0 to find the duplicate:

IF found = 0 THEN 'change to IF found THEN
      dup = dup + 1
      LOCATE 10, 10: PRINT dup; "/"; cnt
      PRINT #3, lne$
      found = 0
   END IF     'changed, you don't need else

All you do in the ELSE is make found = 0. That should be done if found = 1

NOTE: When testing a flag, you only need an explicit comparison when you are checking for 0.

IF flag THEN is the same as IF flag <> 0 THEN
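
As a small illustration of the flag test described above (not code from the thread), the loop below runs a flag through -1, 0 and 1. In QBasic any nonzero value counts as true, negative values included, so only 0 takes the ELSE branch.

```basic
' Illustration only: how QBasic treats a numeric flag in an IF test.
FOR flag = -1 TO 1
   IF flag THEN
      PRINT flag; "is treated as true"
   ELSE
      PRINT flag; "is treated as false"
   END IF
NEXT flag
```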

sintral
Re: Clippy
Posted Aug 05 2009
I'm pretty sure that I do want the program to close and then reopen the iphistt2.dat file for each search term in iphistt1.dat. I wouldn't want to start searching for the next term on the line following the last found search term. I'm incrementing the found variable because I know it is going to find each search term at least once.

Here is my latest rewrite, which stores each duplicate search term in an array. This code results in an error: Extended Error 183.


cm$ = CHR$(44) ' ASCII comma character
SHELL ("COPY iphist.dat iphistt1.dat")
SHELL ("COPY iphist.dat iphistt2.dat")
CLS

DIM skip$(700) ' Create an array for storing records to skip
OPEN "iphistt1.dat" FOR INPUT AS #1
DO WHILE NOT EOF(1)
   LINE INPUT #1, row$
   cnt = cnt + 1 ' Get a count of records in the input file
LOOP
CLOSE

OPEN "iphistt1.dat" FOR INPUT AS #1
OPEN "iptemp.dat" FOR OUTPUT AS #3

FOR i = 1 TO cnt
   INPUT #1, ser$, loc$, dt$, tm$ ' Input entire line, but with each field separated
   srch$ = ser$ + loc$ ' Combine serial number and location fields to create the search term
   lne$ = ser$ + cm$ + loc$ + cm$ + dt$ + cm$ + tm$ ' Rebuild the line in csv format for printing to output file
   IF dup > 0 THEN ' If at least one duplicate has been found so far, it is safe to loop through the skip$() array.
      FOR m = 1 TO dup
         IF skip$(m) = srch$ THEN ' If one of the items in our skip list matches our current search term.
            skip = 1
            m = dup ' Exit the skip array if we found a match.
         END IF
      NEXT m
   END IF
   IF skip = 0 THEN ' If we're dealing with the first instance of this search term
      OPEN "iphistt2.dat" FOR INPUT AS #2
      FOR j = 1 TO cnt
         INPUT #2, ser2$, loc2$: srch2$ = ser2$ + loc2$
         IF srch$ = srch2$ THEN
            found = found + 1 ' Build a total count, but do not exit the for loop
         END IF
      NEXT j
      IF found > 1 THEN
         FOR x = 1 TO found
            dup = dup + 1
            skip$(dup) = srch$ ' Add current search term to skip list
            LOCATE 10, 20: PRINT dup; "/"; cnt
         NEXT x
         PRINT #3, lne$ ' Print this instance, now that any following are to be ignored.
      END IF
      found = 0 ' Reset found for the next search term.
      CLOSE #2
   END IF
   skip = 0 ' Reset skip boolean
NEXT i
CLOSE
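
One possible way to keep the first instance and drop the later ones, offered as a sketch rather than a tested fix, is to compare record i only against the records before it. A record then never matches itself, and no skip list is needed. The sketch assumes four comma-separated fields per line and no trailing blank line, and it reuses the file names from the posts above. It reads all four fields on every pass, because INPUT # works field by field: reading only two variables would leave the date and time to be picked up as the next serial number and location.

```basic
' Sketch only: keep the first record for each serial+location pair.
' Assumes four comma-separated fields per line and no trailing blank line.
cm$ = CHR$(44)
SHELL "COPY iphist.dat iphistt2.dat"   ' scratch copy to scan, as in the posts above
OPEN "iphist.dat" FOR INPUT AS #1
OPEN "iptemp.dat" FOR OUTPUT AS #3
i = 0
DO WHILE NOT EOF(1)
   INPUT #1, ser$, loc$, dt$, tm$
   i = i + 1
   found = 0
   IF i > 1 THEN
      ' Opening the copy again makes the scan start at its first line.
      OPEN "iphistt2.dat" FOR INPUT AS #2
      FOR j = 1 TO i - 1               ' only the records BEFORE record i
         INPUT #2, ser2$, loc2$, dt2$, tm2$
         IF ser2$ = ser$ AND loc2$ = loc$ THEN
            found = 1
            EXIT FOR
         END IF
      NEXT j
      CLOSE #2
   END IF
   IF found = 0 THEN PRINT #3, ser$; cm$; loc$; cm$; dt$; cm$; tm$
LOOP
CLOSE #1
CLOSE #3
KILL "iphistt2.dat"
```

The scan is still quadratic in the number of records, so on roughly 10,000 lines it may take a while. The output keeps the original record order.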

Moneo
Remove duplicate records
Posted Aug 05 2009
Have you considered sorting your text file, on the search term or key, first?

With that size of a file you would need an external file sorting utility. Do you have such a utility available? I've been using a DOS-based utility called Opttech Sort, for many years.

Assuming you have such a utility, you would give it the size of the key as the largest key size that you have. After sorting your file to a new sorted file, reading the sorted file and eliminating duplicates, which are now adjacent, is a simple program to write.
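
The second step described above might look something like the sketch below. It assumes the sorted file is called iphists.dat (a made-up name) and has the same four comma-separated fields per line. A record is written only when its serial number or location differs from the previous record written.

```basic
' Sketch only: drop adjacent duplicates from an already-sorted file.
' iphists.dat (sorted input) is an assumed name; iptemp.dat is the output.
cm$ = CHR$(44)
OPEN "iphists.dat" FOR INPUT AS #1
OPEN "iptemp.dat" FOR OUTPUT AS #3
first = 1
DO WHILE NOT EOF(1)
   INPUT #1, ser$, loc$, dt$, tm$
   ' first forces the very first record out, whatever its fields hold.
   IF first OR ser$ <> lastser$ OR loc$ <> lastloc$ THEN
      PRINT #3, ser$; cm$; loc$; cm$; dt$; cm$; tm$
      lastser$ = ser$
      lastloc$ = loc$
      first = 0
   END IF
LOOP
CLOSE #1
CLOSE #3
```

Note that sorting changes the record order, so the record kept for each key is whichever one sorts first, not necessarily the one that came first in the original file.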

Regards..... Moneo

Moneo
MSDOS Sort
Posted Aug 05 2009
If you decide to sort your file, you might try the sort utility called SORT which comes with MSDOS.

To see the SORT help, from the command line enter: sort /?
If it scrolls too fast, you may need to do: sort /? | more
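
From inside a QBasic program, the sort could be run through SHELL with input and output redirection, roughly as sketched below. The output name iphists.dat is an assumption. Whether SORT copes with a file of this size depends on the DOS or Windows version in use, so that is worth checking first.

```basic
' Sketch only: sort the data file with the DOS SORT command from QBasic.
' iphists.dat is an assumed name for the sorted output.
SHELL "SORT < iphist.dat > iphists.dat"
```

SORT compares whole lines starting at the first column, so lines that begin with the same serial number and location end up next to each other. The order is character-based rather than numeric.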

Regards..... Moneo

09cOdE
Posted Aug 07 2009
I'm surprised this doesn't throw an error, as you are not closing #1; you are just closing:


DIM skip$(700) ' Create an array for storing records to skip
OPEN "iphistt1.dat" FOR INPUT AS #1
DO WHILE NOT EOF(1)
   LINE INPUT #1, row$
   cnt = cnt + 1 ' Get a count of records in the input file
LOOP
CLOSE

Clippy
CLOSE just closes ALL files open.
Posted Aug 07 2009
You may not want to use that.
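
To illustrate the difference (a sketch only, with demo.out as a made-up output name and iphist.dat assumed to exist), CLOSE followed by a file number closes just that file, while CLOSE on its own closes every file that is open at that point.

```basic
' Illustration only: closing one file number versus closing everything.
OPEN "iphist.dat" FOR INPUT AS #1
OPEN "demo.out" FOR OUTPUT AS #3
CLOSE #1                      ' closes only file #1; #3 stays open
PRINT #3, "still writable"    ' works because #3 is still open
CLOSE                         ' no file number: closes every open file, here just #3
```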

The QBasic Station, (C) Copyright 1997-2010