| View previous topic :: View next topic |
| Author |
Message |
sandy771
Joined: 22 May 2007 Posts: 68
|
Posted: Wed 28 Jul 2010 15:59 Post subject: deleting duplicate records |
|
|
I have a largeish database (10 million+ records)
I need to add a few thousand records to the database and then remove any duplicates. This is an operation that I have to do reasonably often.
I am after ideas as to the most efficient way to do this.
At the moment drop the index, append the records, sort and then step through and remove duplicates this would take a couple of minutes (acceptable) but I have a problem.
I am using UniQuery->next() to go through the table
If I check the current record against the previous and see that there is a duplicate I could use an sql query to delete the current record based on ROWID (sqlite). As I understand it the only way to do this is with a separate uniquery component - otherwise I would lose the cursor position when my query changes. Is this right
Is there a better way of doing this?
Last edited by sandy771 on Thu 29 Jul 2010 09:22; edited 1 time in total |
|
| Back to top |
|
 |
DepSoft
Joined: 27 Jul 2010 Posts: 17 Location: Western Australia
|
Posted: Thu 29 Jul 2010 04:07 Post subject: |
|
|
I'm not familiar with SQLite but I would think it would be more efficient to do it something like this:
create a temp table with the same structure as your main table
add your batch of records to the temp table
delete from the temp table where key value in main table
add records in temp table to main table
It depends on the nature of the records and primary keys/indexes and whether your check for duplicates needs code to check duplicity, etc
It should be faster since more of the work is being done on the server without fetching all the records back
Regards, Paul |
|
| Back to top |
|
 |
sandy771
Joined: 22 May 2007 Posts: 68
|
Posted: Thu 29 Jul 2010 08:09 Post subject: |
|
|
| DepSoft wrote: | I'm not familiar with SQLite but I would think it would be more efficient to do it something like this:
create a temp table with the same structure as your main table
add your batch of records to the temp table
delete from the temp table where key value in main table
add records in temp table to main table
It depends on the nature of the records and primary keys/indexes and whether your check for duplicates needs code to check duplicity, etc
It should be faster since more of the work is being done on the server without fetching all the records back
Regards, Paul |
There is a real possibilitythatthe temp table would be too big.
I have considered adding a flag field to each record, setting it for each duplicate and then clearing each record with the flag set. But looking for ideas. |
|
| Back to top |
|
 |
DepSoft
Joined: 27 Jul 2010 Posts: 17 Location: Western Australia
|
Posted: Thu 29 Jul 2010 09:18 Post subject: |
|
|
The temp table would be created as an empty table and only be populated with your records for insertion into the main table.
If those records would make a temp table too big, wouldn't they also cause your main table to be too big once inserted?
Regards, Paul. |
|
| Back to top |
|
 |
sandy771
Joined: 22 May 2007 Posts: 68
|
Posted: Thu 29 Jul 2010 09:24 Post subject: |
|
|
| Ah OK sorry I have now read your first post properly, that may work I'll have a play. |
|
| Back to top |
|
 |
sandy771
Joined: 22 May 2007 Posts: 68
|
Posted: Sat 31 Jul 2010 11:43 Post subject: |
|
|
The solution I went for in the end - which works well - was to sort the database, step through a file at a time, set a flag on each row that is the same as the one before and then do a delete from table where flag = 1
On 8 Million rows it still completes in about a minute |
|
| Back to top |
|
 |
bork Devart Team
Joined: 12 Mar 2010 Posts: 467
|
Posted: Mon 02 Aug 2010 09:26 Post subject: |
|
|
Hello
For the table:
CREATE TABLE test_table (
ID INTEGER NOT NULL,
Name VARCHAR(250) NOT NULL
)
You can select duplicates:
select rowid, * from test_table
where rowid not in (select min(rowid) from btest0 group by id, name)
You can delete duplicates:
delete from test_table
where rowid not in (select min(rowid) from btest0 group by id, name)
For your table you should add all your fields in the group by statement. |
|
| Back to top |
|
 |
sandy771
Joined: 22 May 2007 Posts: 68
|
Posted: Mon 02 Aug 2010 10:00 Post subject: |
|
|
| Thanks Bork I'll look at that. |
|
| Back to top |
|
 |
|