Here is an example of how to deduplicate a single table with PL/SQL.
Context:
We have a unique table. The client wants to get a data cleaning, based on an EMAIL field: many rows have email duplicates and the goal is to delete them and get a table where:
Number of distinct emails = total Nb rows of the whole table
We work here with the assumption that our criteria to deduplicate are date fields. We also have a serial on the table that allows us to identify unequivocally our records, every new entry is associated with a new incremental value.
Problem:
We have a lot of duplicate contacts in the table, based on this EMAIL field.
We must deduplicate the table for the client, and we have to:
- 1: analyze what is the current state of this table: what are the fields we can use to de-duplicate, do the different countings and get a picture on how to proceed
- 2: explain to the client the way we can deduplicate and get his feedback on the different criteria we will use.
- 3: prepare and launch the queries
Recap:
Oracle environment
1 single table, no other table involved here
date fields + serial field are the criteria chosen to de-duplicate
PL/SQL language
Main table name: CORETABLE
Fields to use: We know our generic structure in terms of product.
We know we have :
- EMAIL: email of the contacts
- ID: our UNIQUE key, a serial automatically added on any new entry
- DATE_SUBSCRIBED: a date field AUTOMATICALLY filled in when the new record is inserted in the table (so ALWAYS has a value, cannot be empty)
- DATE_UNSUBSCRIBED: date field of the unsubscription, can be empty; if value, contact is unsubscribed, not reachable
- DATE_QUARANTINE: date field of the quarantine (hard bounces in Email marketing), can be empty; if value, contact is quarantined, so not reachable
Find below some cases and the results of the final decisions:
Date format: dd-mm-yyyy
Decisions:
- Keep active* contacts and delete active with a duplicate email in unsubscribe or quarantine
- Keep unsubscribed (no value on DATE_QUARANTINE)
- Keep quarantined (no value on DATE_UNSUBSCRIBED)
- Keep only oldest DATE_SUBSCRIBED or lower ID (if DATE_SUBSCRIBED is the same) when a contact is duplicated on the database with:
- on one line a value on DATE_QUARANTINE (DATE_UNSUBSCRIBED empty)
- on another line a value on DATE_UNSUBSCRIBED (DATE_QUARANTINE empty)
*Active means, row with a value on DATE_SUBSCRIBED and no value on DATE_UNSUBSCRIBED and DATE_QUARANTINE
Strategy:
- Use a flag to identify the rows to keep: we add a new field in the table and we push a single value in it.
- Use the ROW_NUMBER() function to do the extraction of the correct rows
QUERIES:
---------------------------------------------------------------
-- Backup original table and add a field to FLAG the contacts on CORETABLE
---------------------------------------------------------------
CREATE TABLE CORETABLE_BCKUP_16oct19 AS SELECT * FROM CORETABLE;
ALTER TABLE CORETABLE ADD (TOKEEP NVARCHAR2(6));
COMMIT;
UPDATE CORETABLE SET TOKEEP = null;
COMMIT;
------------------------------------------------------------------------
-- QUERY A: Keep oldest active and delete active with a duplicate email in unsubscribe or quarantine
------------------------------------------------------------------------
UPDATE CORETABLE
SET TOKEEP = 'tokeep'
WHERE ID IN(
SELECT
--COUNT(*)
ID
FROM(
SELECT ct.*, ROW_NUMBER () OVER (PARTITION BY lower(ct.EMAIL) ORDER BY ct.DATE_SUBSCRIBED ASC, ct.ID ASC) AS rn
FROM CORETABLE ct
WHERE ct.DATE_UNSUBSCRIBED IS NULL
AND ct.DATE_QUARANTINE IS NULL
AND lower(ct.EMAIL) NOT IN(
SELECT lower(EMAIL)
FROM CORETABLE
WHERE DATE_UNSUBSCRIBED IS NOT NULL OR DATE_QUARANTINE IS NOT NULL
)
)
WHERE rn = 1
);--510,920 rows updated.
COMMIT;
------------------------------------------------------------------------
-- QUERY B: Keep unsubscribed (contact is unsubscribed NOT quarantined)
------------------------------------------------------------------------
UPDATE CORETABLE
SET TOKEEP = 'tokeep' WHERE ID IN(
SELECT
--COUNT(*)
ID
FROM (
SELECT ct.*, ROW_NUMBER () OVER (PARTITION BY lower(ct.EMAIL) ORDER BY ct.DATE_SUBSCRIBED ASC, ct.ID ASC) AS rn
FROM CORETABLE ct
WHERE ct.DATE_UNSUBSCRIBED IS NOT NULL AND ct.DATE_QUARANTINE IS NULL
)
WHERE rn = 1
);--160,920 rows updated.
COMMIT;
------------------------------------------------------------------------
-- QUERY C: Keep quarantined (contact is quarantined NOT unsubscribed)
------------------------------------------------------------------------
UPDATE CORETABLE
SET TOKEEP = 'tokeep' WHERE ID IN(
SELECT
--COUNT(*)
ID
FROM(
SELECT ct.*, ROW_NUMBER () OVER (PARTITION BY lower(ct.EMAIL) ORDER BY ct.DATE_SUBSCRIBED ASC, ct.ID ASC) AS rn
FROM CORETABLE ct
WHERE ct.DATE_QUARANTINE IS NOT NULL AND ct.DATE_UNSUBSCRIBED IS NULL
)
WHERE rn = 1
);--171,585 rows updated.
COMMIT;
------------------------------------------------------------------------
-- QUERY D: Keep quarantined AND unjoined (value in both DATE_UNSUBSCRIBED and DATE_QUARANTINE), with no duplicates with only DATE_UNSUBSCRIBED or only DATE_QUARANTINE - dedup the quar/unsub contacts exclusively
------------------------------------------------------------------------
UPDATE CORETABLE
SET TOKEEP = 'tokeep'
WHERE ID IN(
SELECT
--COUNT(*)
ID
FROM (
SELECT ct.*, ROW_NUMBER () OVER (PARTITION BY lower(ct.EMAIL) ORDER BY ct.DATE_SUBSCRIBED ASC, ct.ID ASC) AS rn
FROM CORETABLE ct
WHERE ct.DATE_QUARANTINE IS NOT NULL AND ct.DATE_UNSUBSCRIBED IS NOT NULL
AND lower(EMAIL) NOT IN (
SELECT lower(EMAIL)
FROM CORETABLE ct
WHERE (DATE_UNSUBSCRIBED IS NOT NULL AND DATE_QUARANTINE IS NULL) OR (DATE_QUARANTINE IS NOT NULL AND DATE_UNSUBSCRIBED IS NULL))
)
WHERE rn = 1
);--1,769 rows updated.
COMMIT;
------------------------------------------------------------------------
-- QUERY E: Keep only oldest DATE_SUBSCRIBED or lower ID when a contact is twice on the database with :
-- DATE_QUARANTINE in one line (DATE_UNSUBSCRIBED empty)
-- DATE_UNSUBSCRIBED in another line (DATE_QUARANTINE empty)
-- > to keep the contact who has the oldest DATE_SUBSCRIBED or lower ID when the DATE_SUBSCRIBED is the same.
------------------------------------------------------------------------
UPDATE CORETABLE
SET TOKEEP = null
WHERE ID IN(
SELECT
--COUNT(*)
ID
FROM (
SELECT ct.*, ROW_NUMBER () OVER (PARTITION BY lower(ct.EMAIL) ORDER BY ct.DATE_SUBSCRIBED ASC, ct.ID ASC) AS rn
FROM CORETABLE ct
WHERE tokeep='tokeep' and lower(EMAIL) in (
SELECT lower(EMAIL)
FROM CORETABLE
WHERE tokeep='tokeep'
GROUP BY LOWER(EMAIL)
HAVING COUNT(*) > 1
)
)WHERE rn != 1
);--2,683 rows updated.
COMMIT;--Commit complete.
------------------------------------------------------------------------
-- Countings
------------------------------------------------------------------------
SELECT count(*) FROM CORETABLE;-- 1192650
SELECT count (distinct(lower(email))) FROM CORETABLE;--842511
SELECT count(lower(email))
FROM CORETABLE
WHERE tokeep='tokeep';-- 842511 rows
SELECT count(distinct lower(email))
FROM CORETABLE
WHERE tokeep='tokeep';-- 842511 rows
SELECT count(lower(email)) FROM CORETABLE WHERE tokeep is null;--350139
------------------------------------------------------------------------
-- DELETE
------------------------------------------------------------------------
DELETE FROM CORETABLE WHERE tokeep is null;--350,139 rows deleted.
댓글