Dear Colleagues,
My program deals with URLs and needs to separate the URLs that probably contain date. Date format is unknown, but let's assume only number formats for simplicity. Currently I solve this task with a self-written package. It contains a list of reasonable date formats and tries to extract numbers and separators, and to convert these substrings to date data type using this list of formats. Obviously, performance is not too high.
My question is following: Maybe there is some elegant variant to guess presence of a date in a string when date format is unknown? It is better to make a false-positive decision (date exists) than false-negative (miss existing date).
Example:
with t as
(
select 1 as id, 'http://some1.com/file-02.02.2022.xls' as url from dual union all
select 2 as id, 'http://some2.com/file-20220202.xls' as url from dual union all
select 3 as id, 'http://some1.com/file-980992022.xls' as url from dual
)
select * from t
;
ID = 1 - date exists;
ID = 2 - date exists;
ID = 3 - date does not exist.
Thank you!
Regards,
Yury
I don't know of an in-built way to do this. This is a hard problem to solve in general.
Here's one way to approach it:
- Search the table for values that might be dates
- Extract the candidate date strings from the URL
- Pass these to a validation function to see if they are real dates. From 12.2 you can do this with VALIDATE_CONVERSION; you'll have to roll your own in earlier releases.
Here's an example to get you started:
with t as (
select 1 as id, 'http://some1.com/file-02.02.2022.xls' as url from dual union all
select 2 as id, 'http://some2.com/file-20220202.xls' as url from dual union all
select 3 as id, 'http://some1.com/file-980992022.xls' as url from dual
), possible_dates as (
select t.*,
coalesce (
regexp_substr ( url, '[0-9]{8}' ),
regexp_substr ( url, '([0-9]{2}[[:punct:]]){2}[0-9]{4}' )
) dt
from t
where regexp_like (
url, '[0-9]{8}'
)
or regexp_like (
url, '([0-9]{2}[[:punct:]]){2}[0-9]{4}'
)
)
select id, url, dt
from possible_dates d
where 1 in (
validate_conversion ( dt as date, 'ddmmyyyy' ),
validate_conversion ( dt as date, 'yyyymmdd' ),
validate_conversion ( dt as date, 'dd.mm.yyyy' )
);
ID URL DT
---------- ------------------------------------ ------------------------------------
1 http://some1.com/file-02.02.2022.xls 02.02.2022
2 http://some2.com/file-20220202.xls 20220202
There are lots of risks with this - the first is ensuring you cover all the possible date formats. The good news is Oracle Database is relatively forgiving of differing punctuation in the string.
For example, the conversion below works even though the string has hyphens and the format slashes:
select to_date ( '01-01-2022', 'dd/mm/yyyy' ) from dual;
TO_DATE('01
-----------
01-JAN-2022
The downside to the method above is the regexes return the first match. This can overlook possible dates. For example, this string contains the possible date 1 Jan 2022, but is prefixed by 99:
select regexp_substr ( '9901012022', '[0-9]{8}' )
from dual;
REGEXP_S
--------
99010120
So the regex returns the start which is an invalid date. You could get clever with the regex (e.g. "([0-3][0-9]){2}[0-9]{4}" ) to reduce the chance of this, but covering off all the possible day/month/year orders is a pain.