top of page
Search
Writer's pictureMathilde Benedetto

Deduplication in PL/SQL based on date & serial

Updated: Oct 27, 2023

Here is an example of how to deduplicate a single table with PL/SQL.


Context:

We have a unique table. The client wants to get a data cleaning, based on an EMAIL field: many rows have email duplicates and the goal is to delete them and get a table where:

Number of distinct emails = total Nb rows of the whole table

We work here with the assumption that our criteria to deduplicate are date fields. We also have a serial on the table that allows us to identify unequivocally our records, every new entry is associated with a new incremental value.


Problem:

We have a lot of duplicate contacts in the table, based on this EMAIL field.

We must deduplicate the table for the client, and we have to:

- 1: analyze what is the current state of this table: what are the fields we can use to de-duplicate, do the different countings and get a picture on how to proceed

- 2: explain to the client the way we can deduplicate and get his feedback on the different criteria we will use.

- 3: prepare and launch the queries


Recap:

  • Oracle environment

  • 1 single table, no other table involved here

  • date fields + serial field are the criteria chosen to de-duplicate

  • PL/SQL language

Main table name: CORETABLE

Fields to use: We know our generic structure in terms of product.

We know we have :

- EMAIL: email of the contacts

- ID: our UNIQUE key, a serial automatically added on any new entry

- DATE_SUBSCRIBED: a date field AUTOMATICALLY filled in when the new record is inserted in the table (so ALWAYS has a value, cannot be empty)

- DATE_UNSUBSCRIBED: date field of the unsubscription, can be empty; if value, contact is unsubscribed, not reachable

- DATE_QUARANTINE: date field of the quarantine (hard bounces in Email marketing), can be empty; if value, contact is quarantined, so not reachable


Find below some cases and the results of the final decisions:

Date format: dd-mm-yyyy

Duplicate rows based on the EMAIL field
An example of duplicate rows based on the EMAIL field

Decisions:

- Keep active* contacts and delete active with a duplicate email in unsubscribe or quarantine

- Keep unsubscribed (no value on DATE_QUARANTINE)

- Keep quarantined (no value on DATE_UNSUBSCRIBED)

- Keep only oldest DATE_SUBSCRIBED or lower ID (if DATE_SUBSCRIBED is the same) when a contact is duplicated on the database with:

- on one line a value on DATE_QUARANTINE (DATE_UNSUBSCRIBED empty)

- on another line a value on DATE_UNSUBSCRIBED (DATE_QUARANTINE empty)


*Active means, row with a value on DATE_SUBSCRIBED and no value on DATE_UNSUBSCRIBED and DATE_QUARANTINE


Strategy:

- Use a flag to identify the rows to keep: we add a new field in the table and we push a single value in it.

- Use the ROW_NUMBER() function to do the extraction of the correct rows


QUERIES:


---------------------------------------------------------------
-- Backup original table and add a field to FLAG the contacts on CORETABLE
---------------------------------------------------------------
CREATE TABLE CORETABLE_BCKUP_16oct19 AS SELECT * FROM CORETABLE;

ALTER TABLE CORETABLE ADD (TOKEEP NVARCHAR2(6));
COMMIT;

UPDATE CORETABLE SET TOKEEP = null;
COMMIT;

------------------------------------------------------------------------
-- QUERY A: Keep oldest active and delete active with a duplicate email in unsubscribe or quarantine 
------------------------------------------------------------------------
UPDATE CORETABLE
SET TOKEEP = 'tokeep'
WHERE ID IN(
	SELECT 
	--COUNT(*)
	ID
	FROM(
		SELECT ct.*, ROW_NUMBER () OVER (PARTITION BY lower(ct.EMAIL) ORDER BY ct.DATE_SUBSCRIBED ASC, ct.ID ASC) AS rn
		FROM CORETABLE ct
		WHERE ct.DATE_UNSUBSCRIBED IS NULL 
		AND ct.DATE_QUARANTINE IS NULL
		AND lower(ct.EMAIL) NOT IN(
			SELECT lower(EMAIL)
			FROM CORETABLE
			WHERE DATE_UNSUBSCRIBED IS NOT NULL OR DATE_QUARANTINE IS NOT NULL
			)
		)
	WHERE rn = 1
);--510,920 rows updated.

COMMIT;

------------------------------------------------------------------------
-- QUERY B: Keep unsubscribed (contact is unsubscribed NOT quarantined)
------------------------------------------------------------------------
UPDATE CORETABLE
SET TOKEEP = 'tokeep' WHERE ID IN(
	SELECT 
	--COUNT(*)
	ID
	FROM (
		SELECT ct.*, ROW_NUMBER () OVER (PARTITION BY lower(ct.EMAIL) ORDER BY ct.DATE_SUBSCRIBED ASC, ct.ID ASC) AS rn
		FROM CORETABLE ct
		WHERE ct.DATE_UNSUBSCRIBED IS NOT NULL AND ct.DATE_QUARANTINE IS NULL
		)
	WHERE rn = 1
);--160,920 rows updated.

COMMIT;

------------------------------------------------------------------------
-- QUERY C: Keep quarantined (contact is quarantined NOT unsubscribed)
------------------------------------------------------------------------ 
UPDATE CORETABLE
SET TOKEEP = 'tokeep' WHERE ID IN(
	SELECT 
	--COUNT(*)
	ID
	FROM(
		SELECT ct.*, ROW_NUMBER () OVER (PARTITION BY lower(ct.EMAIL) ORDER BY ct.DATE_SUBSCRIBED ASC, ct.ID ASC) AS rn
		FROM CORETABLE ct
		WHERE ct.DATE_QUARANTINE IS NOT NULL AND ct.DATE_UNSUBSCRIBED IS NULL
		)
	WHERE rn = 1
);--171,585 rows updated. 

COMMIT;

------------------------------------------------------------------------
-- QUERY D: Keep quarantined AND unjoined (value in both DATE_UNSUBSCRIBED and DATE_QUARANTINE), with no duplicates with only DATE_UNSUBSCRIBED or only DATE_QUARANTINE - dedup the quar/unsub contacts exclusively
------------------------------------------------------------------------ 
UPDATE CORETABLE
SET TOKEEP = 'tokeep' 
WHERE ID IN(
	SELECT 
	--COUNT(*)
	ID
	FROM (
		SELECT ct.*, ROW_NUMBER () OVER (PARTITION BY lower(ct.EMAIL) ORDER BY ct.DATE_SUBSCRIBED ASC, ct.ID ASC) AS rn
		FROM CORETABLE ct
		WHERE ct.DATE_QUARANTINE IS NOT NULL AND ct.DATE_UNSUBSCRIBED IS NOT NULL
			AND lower(EMAIL) NOT IN (
				SELECT lower(EMAIL) 
				FROM CORETABLE ct
				WHERE (DATE_UNSUBSCRIBED IS NOT NULL AND DATE_QUARANTINE IS NULL) OR (DATE_QUARANTINE IS NOT NULL AND DATE_UNSUBSCRIBED IS NULL))
		)
	WHERE rn = 1
);--1,769 rows updated. 

COMMIT;

------------------------------------------------------------------------
-- QUERY E: Keep only oldest DATE_SUBSCRIBED or lower ID when a contact is twice on the database with :
-- DATE_QUARANTINE in one line (DATE_UNSUBSCRIBED empty) 
-- DATE_UNSUBSCRIBED in another line (DATE_QUARANTINE empty)
-- > to keep the contact who has the oldest DATE_SUBSCRIBED or lower ID when the DATE_SUBSCRIBED is the same.
------------------------------------------------------------------------
UPDATE CORETABLE
SET TOKEEP = null 
WHERE ID IN(
	SELECT 
	--COUNT(*)
	ID
	FROM (
		SELECT ct.*, ROW_NUMBER () OVER (PARTITION BY lower(ct.EMAIL) ORDER BY ct.DATE_SUBSCRIBED ASC, ct.ID ASC) AS rn
		FROM CORETABLE ct
		WHERE tokeep='tokeep' and lower(EMAIL) in (
			SELECT lower(EMAIL)
			FROM CORETABLE
			WHERE tokeep='tokeep'
			GROUP BY LOWER(EMAIL)
			HAVING COUNT(*) > 1
		)
	)WHERE rn != 1
);--2,683 rows updated.

COMMIT;--Commit complete.

------------------------------------------------------------------------
-- Countings
------------------------------------------------------------------------
SELECT count(*) FROM CORETABLE;-- 1192650
SELECT count (distinct(lower(email))) FROM CORETABLE;--842511

SELECT count(lower(email)) 
FROM CORETABLE 
WHERE tokeep='tokeep';-- 842511 rows

SELECT count(distinct lower(email)) 
FROM CORETABLE 
WHERE tokeep='tokeep';-- 842511 rows

SELECT count(lower(email)) FROM CORETABLE WHERE tokeep is null;--350139

------------------------------------------------------------------------
-- DELETE
------------------------------------------------------------------------
DELETE FROM CORETABLE WHERE tokeep is null;--350,139 rows deleted.



12 views0 comments

Recent Posts

See All

댓글


bottom of page