I developed this UDT for doing natural language sorting. I'll post it as is if you would like to use it.
I recently changed the type to support Unicode and use map instead of order method since the number of comparisons in the order method become inefficient with larger sets.
The map method is fairly efficient, but you don't get something for nothing. Tests of sorting ~85k records takes ~5 sec.
Hope this helps.
-- Create test collection type if not exists
create or replace type vc_coll as table of varchar2(32000);
/
create or replace type nlsort_obj_map_unicode as object
(
-------------------------------------------------------------------------------
-- Program: nlsort_obj_map_unicode
-- Description:
-- Object type for Natural Language Sort with FULL UNICODE SUPPORT.
-- Properly handles all Unicode characters including Chinese, Japanese,
-- Arabic, Cyrillic, emoji, etc.
--
-- By default, Oracle will sort values in ASCII value order while many times
-- users would expect a more natural ordering.
-- Ex:
-- ASCII NLSort NLSort (case insensitive)
-- A1 A1 A1
-- A10 A2 a1
-- A101 A10 A2
-- A2 A101 A10
-- a1 a1 a11
-- a11 a11 A101
--
-- nlsort_obj_map_unicode facilitates natural language sorting in SQL by passing the column name to sort.
-- Optional parameters can modify the way it will sort.
-- The constructor will split the text into array elements looking for two alternating groups.
--
-- "Not Number" [^0-9] (referenced as "Alpha" here out)
-- "Number" [0-9]
--
-- It will stop parsing when both the Number and Alpha search return null.
-- The resultant parsing will produce an array of values to use for sorting with most significant
-- in the lowest array index. The parser will determine if the value begins with a numeric or alpha.
-- Subsequent searches will alternate between Number and Alpha once the first type is determined.
--
-- Alpha | Number | Alpha | Number | .... | null | null
--
-- Comparison:
-- Sorting comparsion is dependent on the two values being compared.
-- The following are the basic rules applied for comparing elements A and B.
-- 1. If both A and B both start with the same type then the corresponding indicies can be compared as a number or
-- as string.
-- 2. If A and B do not start with the same type then the indicies will not match and so all comparisons for A and B
-- will be string.
--
-- The parameters p_sortOrder, p_caseSensitive, p_nullsFirst and p_negativeNumber can affect the resulting sort.
-- See below for those definitions.
--
-- Constructor parameters:
-- self - object instance is implicit in call. Do not provide in constructor call
-- self is defined here to use the NOCOPY option to pass by reference.
-- p_txt - Text to sort. Typically a column name but any text source can be provided.
-- p_sortOrder - [ASC|DESC] Default ASC. Determines Ascending sort order or Descending sort order.
-- When p_nullsFirst is set to N, null values will follow the p_sortOrder process.
-- ASC - nulls at bottom
-- DESC - nulls at top
-- p_caseSensitive - [Y|N] Default Y. Controls whether the comparison will be case sensitive or not
-- p_nullsFirst - [Y|N] Default N. Matches Oracle's default for how null values will sort.
-- Set to Y to always place p_txt=null values first regardless of the p_sortOrder value.
-- This does not affect the sorting of nulls within each array index. That is controlled by
-- p_sortOrder value.
-- p_negativeNumber - [Y|N] Default N. By default a dash ("-") preceding a number will parse the dash
-- with the Alpha element. When p_negativeNumber = Y, a single dash preceding a
-- number will parse with the numeric values making it negative. More than one dash will
-- only take one dash with the numbers leave then remaining dashes with the Alpha.
--
-- p_checkForDates - [Y|N] Default N. Instruct parser to look for date patterns in the text. Dates will be
-- converted to a text string that can sort chronologically (YYYYMMDD). Default date
-- formats are provided in the p_dateFormats parameter.
-- p_dateFormats - Default values for date formats are provided. This parameter allows overriding those
-- values if desired.
-- p_prependDateFmts - Parameter allows additional date formats to be provided in addition to the p_dateFormats
-- This is useful if the default data formats are desired but there are also others.
-- p_prependDateFmts allows additional to be added without knowing or overriding the p_dateFormats.
-- p_y2kCentLB - Lower bound for checking whether a 2 digit year is in 20th (19xx) or 21st (20xx) century
-- p_y2kCentLB - Upper bound for checking whether a 2 digit year is in 20th (19xx) or 21st (20xx) century
--
-- Notes:
--
-- Modified to use MAP MEMBER FUNCTION for better performance
-- The map function converts the multi-element array into a single NUMBER
-- that preserves sort order using hierarchical decimal encoding
--
-- UNICODE SUPPORT: Uses RAWTOHEX with CONVERT to extract true Unicode code points
-------------------------------------------------------------------------------
text varchar2(4000)
, valArr vc_coll
, sortOrder integer
, caseSensitive varchar2(1)
, nullsFirst varchar2(1)
, negativeNumber varchar2(1)
, debugMode varchar2(1)
, firstIdxNumeric varchar2(1)
, member procedure dbg(self in nlsort_obj_map_unicode, p_txt in varchar2)
, map member function getSortValue return number deterministic
, constructor function nlsort_obj_map_unicode
(
self in out nocopy nlsort_obj_map_unicode
, p_txt in varchar2
, p_sortOrder in varchar2 := 'ASC'
, p_caseSensitive in varchar2 := 'Y'
, p_nullsFirst in varchar2 := 'N'
, p_negativeNumber in varchar2 := 'N'
, p_checkForDates in varchar2 := 'N'
, p_dateFormats in vc_coll := vc_coll('MM/DD/YYYY', 'MM-DD-YYYY', 'DD-MON-YY')
, p_prependDateFmts in vc_coll := vc_coll()
, p_y2kCentLB in integer := 40
, p_y2kCentUB in integer := 99
, p_debugMode in varchar2 := 'N'
)
return self as result deterministic
);
/
show err
create or replace type body nlsort_obj_map_unicode
as
---------------------------------------------------------------------------
member procedure dbg(self in nlsort_obj_map_unicode, p_txt in varchar2)
is
begin
if self.debugMode = 'Y' then
dbms_output.put_line(p_txt);
end if;
end;
---------------------------------------------------------------------------
-- MAP MEMBER FUNCTION with FULL UNICODE SUPPORT
-- Converts the multi-element array into a single sortable number
-- Uses hierarchical decimal encoding with decreasing precision for each position
--
-- Encoding strategy:
-- Position 0: Integer part (9 digits)
-- Position 1: .0000001 to .9999999 (7 decimal places)
-- Position 2: .00000000001 to .99999999999 (11 decimal places after pos 1)
-- etc.
--
-- Each element (text or number) is normalized to a 0-9999999 range:
-- - Numbers: Normalized using sigmoid-like function to handle wide ranges
-- - Text: Converted using TRUE Unicode code points (not just ASCII)
---------------------------------------------------------------------------
map member function getSortValue
return number deterministic
is
c_max_elements constant integer := 8; -- Limit elements to avoid precision issues
c_element_range constant number := 10000000; -- 10 million per element (7 digits)
c_text_max_chars constant integer := 16; -- Use first 16 chars of text for encoding
l_result number := 0;
l_multiplier number := 1;
l_idx integer;
l_elem_value number;
l_is_numeric boolean;
l_num_val number;
l_text_val varchar2(4000);
l_text_numeric number;
l_char_code integer;
l_position integer := 0;
-- Normalize a number to 0-9999999 range using a bounded scaling function
function normalize_number(p_num in number) return number
is
l_abs_num number := abs(p_num);
l_sign number := sign(p_num);
l_normalized number;
begin
-- Use logarithmic scaling for large numbers, linear for small
if l_abs_num = 0 then
l_normalized := 5000000; -- Middle of range
elsif l_abs_num < 1000 then
-- Linear scaling for numbers 0-999
l_normalized := 5000000 + (l_sign * least(l_abs_num, 999) * 4000);
else
-- Logarithmic scaling for larger numbers
-- Map to range using: 5000000 + sign * (log10(abs(x)) * 500000)
-- This gives us roughly:
-- 1000 -> ~1500000 offset
-- 1000000 -> ~3000000 offset
-- Still fits in 7 digit range
l_normalized := 5000000 + (l_sign * least(log(10, l_abs_num) * 500000, 4999999));
end if;
return l_normalized;
end normalize_number;
-- Convert text to numeric value using TRUE Unicode code points
-- Uses UTL_RAW and CONVERT to properly extract Unicode values
function text_to_number(p_text in varchar2) return number
is
l_result number := 0;
l_char varchar2(4);
l_raw raw(100);
l_hex varchar2(100);
l_char_val number;
l_len integer := least(length(p_text), c_text_max_chars);
c_max_value constant number := 1114112; -- Maximum Unicode code point (U+10FFFF)
begin
if p_text is null then
return 0;
end if;
-- Process up to first 16 characters
for i in 1..l_len loop
l_char := substr(p_text, i, 1);
begin
-- Convert character to UTF-8 bytes, then to hex
l_raw := utl_raw.cast_to_raw(l_char);
l_hex := rawtohex(l_raw);
-- For single-byte (ASCII): hex is 2 chars (e.g., '41' for 'A')
-- For multi-byte (Unicode): hex is 4-8 chars (e.g., 'C3A9' for 'é')
-- Take first 4 hex digits (2 bytes) as the primary sort key
l_char_val := to_number(substr(l_hex, 1, least(4, length(l_hex))), 'XXXX');
exception
when others then
-- Fallback to ASCII if conversion fails
l_char_val := ascii(l_char);
end;
-- Build hierarchical encoding
l_result := l_result + (l_char_val * power(c_max_value, c_text_max_chars - i));
end loop;
-- Normalize to 0-9999999 range
l_result := (l_result / power(c_max_value, c_text_max_chars)) * 9999999;
return l_result;
end text_to_number;
begin
dbg('=== getSortValue for: ' || self.text || ' ===');
-- Handle null text
if self.text is null then
if self.nullsFirst = 'Y' then
-- Return very small number to sort first
return -1e125 * self.sortOrder;
else
-- Return very large number based on sort order
return 1e125 * self.sortOrder;
end if;
end if;
-- Start with base multiplier (10^9 = billions place)
l_multiplier := power(10, 9);
l_is_numeric := (self.firstIdxNumeric = 'Y');
l_idx := self.valArr.first;
-- Process each array element up to max limit
while l_idx is not null and l_position < c_max_elements loop
if self.valArr(l_idx) is not null then
if l_is_numeric then
-- Process numeric element
l_num_val := to_number(self.valArr(l_idx));
l_elem_value := normalize_number(l_num_val);
dbg('Position ' || l_position || ' (NUM): ' || l_num_val || ' -> ' || l_elem_value);
else
-- Process text element
l_text_val := self.valArr(l_idx);
l_elem_value := text_to_number(l_text_val);
dbg('Position ' || l_position || ' (TXT): ' || l_text_val || ' -> ' || l_elem_value);
end if;
-- Add to result with current multiplier
l_result := l_result + (l_elem_value * l_multiplier);
-- Reduce multiplier for next position (divide by element range)
-- This creates the hierarchical decimal structure
l_multiplier := l_multiplier / c_element_range;
else
-- Null element - use 0 value
dbg('Position ' || l_position || ': NULL');
end if;
-- Toggle between numeric and text for next iteration
l_is_numeric := not l_is_numeric;
l_idx := self.valArr.next(l_idx);
l_position := l_position + 1;
end loop;
dbg('Final map value: ' || l_result);
dbg('With sortOrder: ' || (l_result * self.sortOrder));
-- Apply sort order (DESC = -1 multiplier reverses order)
return l_result * self.sortOrder;
exception
when others then
dbg('ERROR in getSortValue: ' || sqlerrm);
raise;
end;
---------------------------------------------------------------------------
-- CONSTRUCTOR
---------------------------------------------------------------------------
constructor function nlsort_obj_map_unicode
(
self in out nocopy nlsort_obj_map_unicode
, p_txt in varchar2
, p_sortOrder in varchar2 := 'ASC'
, p_caseSensitive in varchar2 := 'Y'
, p_nullsFirst in varchar2 := 'N'
, p_negativeNumber in varchar2 := 'N'
, p_checkForDates in varchar2 := 'N'
, p_dateFormats in vc_coll := vc_coll('MM/DD/YYYY', 'MM-DD-YYYY', 'DD-MON-YY')
, p_prependDateFmts in vc_coll := vc_coll()
, p_y2kCentLB in integer := 40
, p_y2kCentUB in integer := 99
, p_debugMode in varchar2 := 'N'
)
return self as result
deterministic
is
iter integer := 1;
idx integer := 0;
patCharOrig varchar2(21) := '(\D+)\d*?';
patNumOrig varchar2(21) := '(\d+)\D*?';
patCharNeg varchar2(21) := '(\D+)-{0,1}?\d*?';
patNumNeg varchar2(21) := '(-{0,1}?\d+)\D*?';
pat1 varchar2(50);
pat2 varchar2(50);
pat1Iterx varchar2(50);
pat2Iterx varchar2(50);
pat1Offset integer := 0;
pat2Offset integer := 0;
chkDateFmt varchar2(20) := null;
reDateFmt varchar2(20) := null;
dateText varchar2(20) := null;
padYearIdx integer;
dateFormats vc_coll := vc_coll();
function convertIfNumber(p_txt in varchar2)
return Boolean
is
num number;
begin
num := to_number(p_txt);
self.valArr.extend(3);
self.valArr(1) := p_txt;
self.valArr(2) := null;
self.valArr(3) := null;
self.firstIdxNumeric := 'Y';
dbg('convertIfNumber: true ' || p_txt);
return true;
exception
when value_error then
dbg('convertIfNumber: false ' || p_txt);
return false;
when others then
raise;
end;
begin
text := p_txt;
valArr := vc_coll();
caseSensitive := case when p_caseSensitive in ('Y', 'N') then p_caseSensitive else 'Y' end;
nullsFirst := case when p_nullsFirst in ('Y', 'N') then p_nullsFirst else 'N' end;
negativeNumber := case when p_negativeNumber in ('Y', 'N') then p_negativeNumber else 'N' end;
sortOrder := case upper(p_sortOrder) when 'DESC' then -1 else 1 end;
debugMode := case when p_debugMode in ('Y', 'N') then p_debugMode else 'N' end;
if caseSensitive = 'N' then
text := upper(text);
end if;
if p_checkForDates = 'Y' then
dateFormats := p_prependDateFmts;
dateFormats := dateFormats multiset union p_dateFormats;
idx := dateFormats.first;
while (idx is not null)
loop
chkDateFmt := upper(dateFormats(idx));
dbg('idx, fmt=' || idx || ',' || chkDateFmt);
reDateFmt := regexp_replace(chkDateFmt, 'MON', '\w\w\w');
reDateFmt := regexp_replace(reDateFmt, '[MDYR]', '\d');
dateText := regexp_substr(text, reDateFmt);
dbg('reDateFmt, dateText=' || reDateFmt || ',' || dateText);
if length(regexp_substr(chkDateFmt, '[Y|R]+')) = 2 then
padYearIdx := regexp_instr(chkDateFmt, '[Y|R]');
chkDateFmt := substr(chkDateFmt,1,padYearIdx - 1) || 'YY' || substr(chkDateFmt, padYearIdx);
end if;
dbg('idx, fmt=' || idx || ',' || chkDateFmt);
while (dateText is not null)
loop
if padYearIdx > 0 then
dateText := substr(dateText,1,padYearIdx - 1) ||
case when to_number(substr(dateText, padYearIdx)) between p_y2kCentLB and p_y2kCentUB
then '19' else '20'
end ||
substr(dateText, padYearIdx);
end if;
dbg('date=' || to_char(to_date(dateText, chkDateFmt), 'MM/DD/YYYY'));
dateText := to_char(to_date(dateText, chkDateFmt), ' YYYYMMDD ');
text := regexp_replace(text, reDateFmt, dateText);
dateText := regexp_substr(text, reDateFmt);
end loop;
idx := dateFormats.next(idx);
padYearIdx := 0;
end loop;
end if;
if convertIfNumber(self.text) then
null;
else
dbg('Parsing...');
if regexp_instr(self.text, '^'|| patNumNeg, 1,1,0,'i',1) > 0 then
dbg('Numeric first');
self.firstIdxNumeric := 'Y';
pat1Iterx := case when p_negativeNumber = 'Y' then patNumNeg else patNumOrig end;
pat2Iterx := case when p_negativeNumber = 'Y' then patCharNeg else patCharOrig end;
pat1 := '^' || patNumNeg;
pat2 := patCharNeg;
if instr(self.text, '-') = 1 then
pat2Offset := 1;
end if;
else
dbg('Character first');
self.firstIdxNumeric := 'N';
pat1Iterx := case when p_negativeNumber = 'Y' then patCharNeg else patCharOrig end;
pat2Iterx := case when p_negativeNumber = 'Y' then patNumNeg else patNumOrig end;
pat1 := '^' || patCharNeg;
pat2 := case when p_negativeNumber = 'Y' then patNumNeg else patNumOrig end;
end if;
dbg('pat1,pat1Iter,firstIdx=' || pat1 || ',' || pat1Iterx || ',' || self.firstIdxNumeric);
dbg('pat2,pat2Iter,firstIdx=' || pat2 || ',' || pat2Iterx || ',' || self.firstIdxNumeric);
iter := 1;
loop
self.valArr.extend(2);
self.valArr(self.valArr.last-1) := regexp_substr(self.text, pat1, 1,iter + pat1Offset,'i',1);
self.valArr(self.valArr.last) := regexp_substr(self.text, pat2, 1,iter + pat2Offset,'i',1);
exit when valArr(valArr.last) is null and valArr(valArr.last -1) is null;
if iter = 1 then
pat1 := pat1Iterx;
pat2 := pat2Iterx;
end if;
iter := iter + 1;
end loop;
end if;
if p_negativeNumber = 'Y' then
idx := self.valArr.first;
if self.firstIdxNumeric = 'Y' then
idx := self.valArr.next(idx);
end if;
while (idx is not null)
loop
self.valArr(idx) := regexp_replace(self.valArr(idx), '(.*?)-$', '\1');
idx := self.valArr.next(idx);
idx := self.valArr.next(idx);
end loop;
end if;
return;
end;
end;
/
show err
Example uses:
sql->-- ======================================
sql->-- Test 1: Basic Natural Language Sort
sql->-- ======================================
sql->
sql->with test_data as (
2 select column_value as val
3 from table(vc_coll('A1', 'A10', 'A101', 'A2', 'a1', 'a11', 'File1', 'File2', 'File10', 'File20'))
4 )
5 select
6 val as original_value
7 from test_data
8 order by nlsort_obj_map_unicode(val);
ORIGINAL_VALUE
----------------------------------------
A1
A2
A10
A101
File1
File2
File10
File20
a1
a11
10 rows selected.
Elapsed: 00:00:00.09
sql->
sql->--
sql->-- With case-insensitive:
sql->with test_data as (
2 select column_value as val
3 from table(vc_coll('A1', 'A10', 'A101', 'A2', 'a1', 'a11', 'File1', 'File2', 'File10', 'File20'))
4 )
5 select
6 val as original_value
7 from test_data
8 order by nlsort_obj_map_unicode(val, p_caseSensitive => 'N');
ORIGINAL_VALUE
----------------------------------------
A1
a1
A2
A10
a11
A101
File1
File2
File10
File20
10 rows selected.
Elapsed: 00:00:00.09
sql->
sql->
sql->-- ======================================
sql->-- Test 2: Complex Mixed Values
sql->-- ======================================
sql->
sql->with test_data as (
2 select column_value as val
3 from table(vc_coll(
4 'Version 1.0.0',
5 'Version 1.0.10',
6 'Version 1.0.2',
7 'Version 1.10.0',
8 'Version 2.0.0',
9 'Build-100',
10 'Build-20',
11 'Build-3',
12 'Item_A_1',
13 'Item_A_10',
14 'Item_A_2'
15 ))
16 )
17 select
18 val as original_value
19 from test_data
20 order by nlsort_obj_map_unicode(val);
ORIGINAL_VALUE
----------------------------------------
Build-3
Build-20
Build-100
Item_A_1
Item_A_2
Item_A_10
Version 1.0.0
Version 1.0.2
Version 1.0.10
Version 1.10.0
Version 2.0.0
11 rows selected.
Elapsed: 00:00:00.08
sql->
sql->
sql->-- ======================================
sql->-- Test 3: Descending Order
sql->-- ======================================
sql->
sql->with test_data as (
2 select column_value as val
3 from table(vc_coll('A1', 'A2', 'A10', 'A20', 'A100'))
4 )
5 select
6 val as original_value
7 from test_data
8 order by nlsort_obj_map_unicode(val, p_sortOrder => 'DESC');
ORIGINAL_VALUE
----------------------------------------
A100
A20
A10
A2
A1
Elapsed: 00:00:00.08
sql->
sql->
sql->-- ======================================
sql->-- Test 4: Negative Numbers
sql->-- ======================================
sql->
sql->with test_data as (
2 select column_value as val
3 from table(vc_coll('Test-1', 'Test-10', 'Test-2', 'Test1', 'Test2', 'Test10'))
4 )
5 select
6 val as original_value,
7 'Standard' as sort_type
8 from test_data
9 order by nlsort_obj_map_unicode(val)
10 ;
ORIGINAL_VALUE |SORT_TYP
----------------------------------------|--------
Test1 |Standard
Test-1 |Standard
Test2 |Standard
Test-2 |Standard
Test10 |Standard
Test-10 |Standard
6 rows selected.
Elapsed: 00:00:00.09
sql->
sql->with test_data as (
2 select column_value as val
3 from table(vc_coll('Test-1', 'Test-10', 'Test-2', 'Test1', 'Test2', 'Test10'))
4 )
5 select
6 val as original_value,
7 'Negative Nums' as sort_type
8 from test_data
9 order by nlsort_obj_map_unicode(val, p_negativeNumber => 'Y');
ORIGINAL_VALUE |SORT_TYPE
----------------------------------------|-------------
Test-10 |Negative Nums
Test-2 |Negative Nums
Test-1 |Negative Nums
Test1 |Negative Nums
Test2 |Negative Nums
Test10 |Negative Nums
6 rows selected.
Elapsed: 00:00:00.09
sql->
sql->
sql->-- ======================================
sql->-- Test 5: NULL Handling
sql->-- ======================================
sql->
sql->with test_data as (
2 select column_value as val
3 from table(vc_coll('A1', 'A2', null, 'B1', null, 'C1'))
4 )
5 select
6 nvl(val, '<NULL>') as original_value,
7 'Nulls Default' as null_handling
8 from test_data
9 order by nlsort_obj_map_unicode(val)
10 ;
ORIGINAL_VALUE |NULL_HANDLING
----------------------------------------|-------------
A1 |Nulls Default
A2 |Nulls Default
B1 |Nulls Default
C1 |Nulls Default
<NULL> |Nulls Default
<NULL> |Nulls Default
6 rows selected.
Elapsed: 00:00:00.09
sql->
sql->
sql->with test_data as (
2 select column_value as val
3 from table(vc_coll('A1', 'A2', null, 'B1', null, 'C1'))
4 )
5 select
6 nvl(val, '<NULL>') as original_value,
7 'Nulls First' as null_handling
8 from test_data
9 order by nlsort_obj_map_unicode(val, p_nullsFirst => 'Y');
ORIGINAL_VALUE |NULL_HANDLI
----------------------------------------|-----------
<NULL> |Nulls First
<NULL> |Nulls First
A1 |Nulls First
A2 |Nulls First
B1 |Nulls First
C1 |Nulls First
6 rows selected.