Devart Forum Index

The time now is Thu 09 Sep 2010 04:14

deleting duplicate records

 
Post new topic   Reply to topic    Devart Forum Index -> Universal Data Access Components
View previous topic :: View next topic  
Author Message
sandy771



Joined: 22 May 2007
Posts: 68

PostPosted: Wed 28 Jul 2010 15:59    Post subject: deleting duplicate records Reply with quote

I have a largeish database (10 million+ records)

I need to add a few thousand records to the database and then remove any duplicates. This is an operation that I have to do reasonably often.

I am after ideas as to the most efficient way to do this.

At the moment drop the index, append the records, sort and then step through and remove duplicates this would take a couple of minutes (acceptable) but I have a problem.

I am using UniQuery->next() to go through the table

If I check the current record against the previous and see that there is a duplicate I could use an sql query to delete the current record based on ROWID (sqlite). As I understand it the only way to do this is with a separate uniquery component - otherwise I would lose the cursor position when my query changes. Is this right

Is there a better way of doing this?


Last edited by sandy771 on Thu 29 Jul 2010 09:22; edited 1 time in total
Back to top
View user's profile Send private message
DepSoft



Joined: 27 Jul 2010
Posts: 17
Location: Western Australia

PostPosted: Thu 29 Jul 2010 04:07    Post subject: Reply with quote

I'm not familiar with SQLite but I would think it would be more efficient to do it something like this:

create a temp table with the same structure as your main table
add your batch of records to the temp table
delete from the temp table where key value in main table
add records in temp table to main table

It depends on the nature of the records and primary keys/indexes and whether your check for duplicates needs code to check duplicity, etc

It should be faster since more of the work is being done on the server without fetching all the records back

Regards, Paul
Back to top
View user's profile Send private message
sandy771



Joined: 22 May 2007
Posts: 68

PostPosted: Thu 29 Jul 2010 08:09    Post subject: Reply with quote

DepSoft wrote:
I'm not familiar with SQLite but I would think it would be more efficient to do it something like this:

create a temp table with the same structure as your main table
add your batch of records to the temp table
delete from the temp table where key value in main table
add records in temp table to main table

It depends on the nature of the records and primary keys/indexes and whether your check for duplicates needs code to check duplicity, etc

It should be faster since more of the work is being done on the server without fetching all the records back

Regards, Paul


There is a real possibilitythatthe temp table would be too big.

I have considered adding a flag field to each record, setting it for each duplicate and then clearing each record with the flag set. But looking for ideas.
Back to top
View user's profile Send private message
DepSoft



Joined: 27 Jul 2010
Posts: 17
Location: Western Australia

PostPosted: Thu 29 Jul 2010 09:18    Post subject: Reply with quote

The temp table would be created as an empty table and only be populated with your records for insertion into the main table.

If those records would make a temp table too big, wouldn't they also cause your main table to be too big once inserted?

Regards, Paul.
Back to top
View user's profile Send private message
sandy771



Joined: 22 May 2007
Posts: 68

PostPosted: Thu 29 Jul 2010 09:24    Post subject: Reply with quote

Ah OK sorry I have now read your first post properly, that may work I'll have a play.
Back to top
View user's profile Send private message
sandy771



Joined: 22 May 2007
Posts: 68

PostPosted: Sat 31 Jul 2010 11:43    Post subject: Reply with quote

The solution I went for in the end - which works well - was to sort the database, step through a file at a time, set a flag on each row that is the same as the one before and then do a delete from table where flag = 1

On 8 Million rows it still completes in about a minute
Back to top
View user's profile Send private message
bork
Devart Team


Joined: 12 Mar 2010
Posts: 467

PostPosted: Mon 02 Aug 2010 09:26    Post subject: Reply with quote

Hello

For the table:
CREATE TABLE test_table (
ID INTEGER NOT NULL,
Name VARCHAR(250) NOT NULL
)

You can select duplicates:
select rowid, * from test_table
where rowid not in (select min(rowid) from btest0 group by id, name)

You can delete duplicates:
delete from test_table
where rowid not in (select min(rowid) from btest0 group by id, name)

For your table you should add all your fields in the group by statement.
Back to top
View user's profile Send private message
sandy771



Joined: 22 May 2007
Posts: 68

PostPosted: Mon 02 Aug 2010 10:00    Post subject: Reply with quote

Thanks Bork I'll look at that.
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    Devart Forum Index -> Universal Data Access Components All times are GMT
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum