Hello
I working on the doing an application to customer match , where an application can have multiple customers (corporate) and customer can have multiple applications.
so i am doing a cartesian join and using UTL_MATCH .JARO_WINKLER_SMILIARTY for string matching.
Table A (applications) Approximatly 20,000 per day
Table B (CusomterPfile) approximatly 400,000 per month.
I have to match apps for month, so intermeidate volume is 20,000*400,000 = 8,000,000,000 and final data set with JW_SCORE>970
Its time consuming to create the table of 8billion and fillter for200K Approx having jw_score>970.
And also to filter in sub querry. please suggest a better approach .
CREATE TABLE tablea
(
APP_ID NUMBER(14),
APP_LOAD_DATE DATE,
SSN VARCHAR2(20 BYTE),
DOB DATE,
FIRST_NAME VARCHAR2(50 BYTE),
LAST_NAME VARCHAR2(50 BYTE),
ADDR VARCHAR2(92 BYTE),
CITY VARCHAR2(50 BYTE),
STATE VARCHAR2(20 BYTE),
ZIP VARCHAR2(20 BYTE)
);
CREATE TABLE tableb
(
ACCOUNT_NUMBER NUMBER(9) NOT NULL,
DOB DATE,
SSN NUMBER(9),
FIRST_NAME VARCHAR2(32 BYTE),
LAST_NAME VARCHAR2(60 BYTE),
ADDR VARCHAR2(100 BYTE),
CITY VARCHAR2(26 BYTE),
STATE CHAR(2 BYTE),
ZIP CHAR(5 BYTE),
ZIP_4 CHAR(4 BYTE)
);
SELECT A.* , JW_SSN+JW_DOB+JW_FNAME+JW_LNAME+JW_ADDR AS JW_SCORE FROM (
SELECT /*+ PARALLEL(16) */
a.appid
, a.ssn a_ssn
, b.ssn b_ssn
, a.dob AS a_dob
, b.dob AS b_dob
, a.first_name AS a_first_name
, b.first_name AS b_first_name
, a.last_name AS a_last_name
, b.last_name AS b_last_name
, A.FIRST_NAME ||' '|| A.LAST_NAME AS A_FULLNAME
, B.FIRST_NAME ||' ' ||B.LAST_NAME AS B_FULLNAME
, a.addr AS a_addr
, b.addr b_addr
, a.city AS a_city
, b.city b_city
, a.state AS a_state
, b.state b_state
, a.zip AS a_zip
, b.zip b_zip
, ( CASE WHEN ( A.ssn IS NULL or B.SSN IS NULL ) THEN 100
WHEN ( ( A.SSN = 111111111 OR B.SSN = 111111111)) THEN 100
ELSE UTL_MATCH.jaro_winkler_similarity ( LPAD( A.ssn,9,0) , LPAD( b.ssn ,9,0))
END)* 3 as JW_SSN
,( CASE WHEN ( A.dob IS NULL OR B.dob IS NULL ) THEN 100
ELSE UTL_MATCH.jaro_winkler_similarity ( A.dob ,B.dob)
END )*2 as JW_DOB
,( CASE WHEN ( A.first_name IS NULL OR B.first_name IS NULL ) THEN 100
ELSE UTL_MATCH.jaro_winkler_similarity ( A.first_name ,B.first_name)
END ) *2as JW_FNAME
, (CASE WHEN ( A.last_name IS NULL OR B.last_name IS NULL ) THEN 100
ELSE UTL_MATCH.jaro_winkler_similarity ( A.last_name ,B.last_name)
END) *2 as JW_LNAME
, CASE WHEN ( A.FIRST_NAME || A.LAST_NAME IS OR AND B.FIRST_NAME||B.LAST_NAME IS NULL ) THEN 100
ELSE UTL_MATCH.jaro_winkler_similarity ( A.FIRST_NAME ||' '|| A.LAST_NAME , B.FIRST_NAME ||' ' ||B.LAST_NAME)
END as JW_FULLNAME
, ( CASE WHEN ( A.addr IS NULL AND B.addr IS NULL ) THEN 100
ELSE UTL_MATCH.jaro_winkler_similarity ( A.addr ,B.addr)
END) * 1 as JW_ADDR
FROM tablea a, tableb b
WHERE 1 = 1
AND b.start_service_date >= a.app_load_date
) A
WHERE JW_SSN+JW_DOB+JW_FNAME+JW_LNAME+JW_ADDR > 970;
Plan hash value: 1977758127
----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 733M| 242G| | 5713 (98)| 00:00:12 | | | |
| 1 | PX COORDINATOR FORCED SERIAL| | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10001 | 733M| 242G| | 5713 (98)| 00:00:12 | Q1,01 | P->S | QC (RAND) |
| 3 | MERGE JOIN | | 733M| 242G| | 5713 (98)| 00:00:12 | Q1,01 | PCWP | |
| 4 | SORT JOIN | | 479K| 72M| 170M| 74 (5)| 00:00:01 | Q1,01 | PCWP | |
| 5 | PX RECEIVE | | 479K| 72M| | 71 (0)| 00:00:01 | Q1,01 | PCWP | |
| 6 | PX SEND BROADCAST | :TQ10000 | 479K| 72M| | 71 (0)| 00:00:01 | Q1,00 | P->P | BROADCAST |
| 7 | PX BLOCK ITERATOR | | 479K| 72M| | 71 (0)| 00:00:01 | Q1,00 | PCWC | |
| 8 | TABLE ACCESS FULL | JW_BAN_TABLE | 479K| 72M| | 71 (0)| 00:00:01 | Q1,00 | PCWP | |
|* 9 | FILTER | | | | | | | Q1,01 | PCWP | |
|* 10 | SORT JOIN | | 611K| 114M| 258M| 94 (5)| 00:00:01 | Q1,01 | PCWP | |
| 11 | PX BLOCK ITERATOR | | 611K| 114M| | 91 (2)| 00:00:01 | Q1,01 | PCWC | |
| 12 | TABLE ACCESS FULL | JW_APP | 611K| 114M| | 91 (2)| 00:00:01 | Q1,01 | PCWP | |
----------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
9 - filter(CASE WHEN (("A"."SSN" IS NULL) OR ("B"."SSN" IS NULL)) THEN 100 WHEN ((TO_NUMBER("A"."SSN")=111111111) OR
("B"."SSN"=111111111)) THEN 100 ELSE "UTL_MATCH"."JARO_WINKLER_SIMILARITY"(LPAD("A"."SSN",9,'0'),LPAD(TO_CHAR("B"."SSN"),9
,'0')) END *3+CASE WHEN (("A"."DOB" IS NULL) OR ("B"."DOB" IS NULL)) THEN 100 ELSE
"UTL_MATCH"."JARO_WINKLER_SIMILARITY"(INTERNAL_FUNCTION("A"."DOB"),INTERNAL_FUNCTION("B"."DOB")) END *2+CASE WHEN
(("A"."FIRST_NAME" IS NULL) OR ("B"."FIRST_NAME" IS NULL)) THEN 100 ELSE
"UTL_MATCH"."JARO_WINKLER_SIMILARITY"("A"."FIRST_NAME","B"."FIRST_NAME") END *2+CASE WHEN (("A"."LAST_NAME" IS NULL) OR
("B"."LAST_NAME" IS NULL)) THEN 100 ELSE "UTL_MATCH"."JARO_WINKLER_SIMILARITY"("A"."LAST_NAME","B"."LAST_NAME") END
*2+CASE WHEN (("A"."ADDR" IS NULL) AND ("B"."ADDR" IS NULL)) THEN 100 ELSE
"UTL_MATCH"."JARO_WINKLER_SIMILARITY"("A"."ADDR","B"."ADDR") END *1>970)
10 - access(INTERNAL_FUNCTION("B"."START_SERVICE_DATE")>=INTERNAL_FUNCTION("A"."APP_LOAD_DATE"))
filter(INTERNAL_FUNCTION("B"."START_SERVICE_DATE")>=INTERNAL_FUNCTION("A"."APP_LOAD_DATE"))
Note
-----
- Degree of Parallelism is 16 because of hint
thanks in Advance
You need "more" to join on to give a better reduction of rows through the join.
So you can provide some assumptions that *you* are probably making through the database. For example, you might assume that we'll match on first letter of each name, and the state they live in.
AND b.start_service_date >= a.app_load_date
and substr(b.first_name,1,1) = substr(a.first_name,1,1)
and substr(b.last_name,1,1) = substr(a.last_name,1,1)
and b.state = a.state