Skip to Main Content

Breadcrumb

May 4th

Question and Answer

Connor McDonald

Thanks for the question.

Asked: October 10, 2020 - 4:10 am UTC

Last updated: October 29, 2020 - 2:18 pm UTC

Version: 19.8

Viewed 1000+ times

You Asked

Hello TOM,

Is there a way to perform a accent-insensitive conversion ONLY?
I'm using the construct:

SELECT utl_raw.cast_to_varchar2(NLSSORT('ABCöüä', 'nls_sort=binary_ai')) FROM dual;


which - as byproduct - is performing also a case-insensitive replacement, that is not required.

Something like this:
SELECT my_funct('ABCöüä') FROM dual;

should output:
ABCoua

and not:
abcoua


In case you're wondering why i need this, it's because we're storing the data in UNICODE format in our database, but we need to export it using LATIN-based characters to an international institution.

Thanks,

and we said...

Unfortunately, there is currently no built-in function to perform conversion to baseletters, which is what you are asking for. The method with NLSSORT is not supported. In addition, it adds the NUL character '\0' at the end of the string. You may not see it on the screen but it is there.

If you are exporting to Latin, then I assume you mean 7-bit ASCII (basic Latin letters A-Za-z), not the so-called Latin1 character set. Latin1 (aka WE8ISO8859P1) supports Western European accented letters, so no stripping of accents is usually necessary. Assuming you know the languages stored in your database and their alphabets -- the alphabets can usually be found in Wikipedia -- you can use the TRANSLATE function to map any unsupported characters to their preferred Latin equivalents.

A more generic approach could be to use DECODE to map accented letters to basic Latin followed by combining accents (Unicode code points in range U+0300-U+036F) and than use TRANSLATE to remove those accents. However, there are some Latin letters, like "eth" (Ð) or "ae ligature" (Æ), which do not decompose. Moreover, DECODE of a string close to 4000 byte string could exceed the allowed length of VARCHAR2 (with MAX_STRING_SIZE=EXTENDED, a string near the length of 32767 bytes would do this). Therefore, a simple TRANSLATE call to map known accented letters to base Latin equivalents is preferable. With TRANSLATE, you can map up to 4000 different letters, more than necessary.

We are thinking about adding relevant mapping functionality to the UTL_I18N.TRANSLITERATE function but this will not happen any time soon.

Rating

  (4 ratings)

Is this answer out of date? If it is, please let us know via a Comment

Comments

Follow-up

Alex, October 29, 2020 - 4:47 pm UTC

Hi Sergiusz,

Your answer was exactly what i encountered:
(1) NLSSORT: Yes, ițve seen the ugly NUL in the output which is the reason of this topic actually; Adding the NUL character (from reasons i don't understand) is a show-stopper for my needs, although all the other conversions it's doing are PERFECT.
(2) option 2 (translate); there are 2 drawbacks here:
(a) TRANSLATE requires a 1-1 mapping so if i want to translate German 'ö' to 'oe', i'm out of luck.
(b) TRANSLATE requires me to enter ALL European (national) characters, both upper-case and lower-case, so it ends up being a long (and probably incomplete) list.
This approach is what i've actually implemented, TRANSLATE + big chain of single characters translated into their English correspondent (it's not optimal, but it's honest work).

Cheers,

To Sergiusz

J. Laurindo Chiappa, October 30, 2020 - 4:43 pm UTC

Yes, NLSSORT adds an extra CHR(0), but I´m able to remove it easily with REPLACE, and to my needs (I´m using Portuguese, and the Portuguese alphabet can be directly converted) NLS_SORT works perfectly, this way :

SQL> select utl_raw.cast_to_varchar2(nlssort('À NOITE,VOVÔ MÜLLER VÊ O ÍMÃ CAIR NO PÉ E VOVÓ SÓ PÕE AÇÚCAR NO CHÁ SEM ÂNIMO', 'nls_sort=binary_ai')) as result from dual;

the result is :

RESULT -------------------------------------------------------------------------------------------
a noite,vovo muller ve o ima cair no pe e vovo so poe acucar no cha sem animo

==> This phrase is just a Portuguese pangram, ie, a sentence using all letters of the particular language's alphabet.. You will not see here, but sqlplus shows a small square at the end, representing the extra CHR(0) inserted)...
Adding a replace :

SQL> select replace(utl_raw.cast_to_varchar2(nlssort('À NOITE,VOVÔ MÜLLER VÊ O ÍMÃ CAIR NO PÉ E VOVÓ SÓ PÕE AÇÚCAR NO CHÁ SEM ÂNIMO', 'nls_sort=binary_ai')), chr(0)) as result from dual;

the result is perfect - afaik this will not work for non-Latin alphabets such as German, but for LAs it works perfectly...

Regards,

Chiappa

To Chiappa

Dieter, October 30, 2020 - 7:23 pm UTC

There are two problems with your approach:

- Upper / lower case is not preserved, as required by the original post.
- The cast behaves inconsistently, see 2nd line of my example ...

WITH a AS
 (SELECT 'À NOITE,VOVÔ MÜLLER VÊ O ÍMÃ CAIR NO PÉ E VOVÓ SÓ PÕE AÇÚCAR NO CHÁ SEM ÂNIMO' text
    FROM dual
  UNION ALL
  SELECT '¡¿ !? Åû -> Æ æ œ <- õč ÄÖÜ äöü -> ß <-'
    FROM dual)
SELECT text,
       REPLACE(utl_raw.cast_to_varchar2(nlssort(text,
                                                'nls_sort=binary_ai')),
               chr(0)) AS RESULT
  FROM a;

To Dieter

J. Laurindo Chiappa, November 01, 2020 - 3:53 pm UTC

Yes, as I said the NLSSORT solution works only for the base symbols in Portuguese alphabet - for sure, all special symbols (such as Euro, macron, superscrit ordinals, etc) will not work/will not be transalated... Same for decomposable symbols, such as the latin ligature, what must be decompost in A and E - so, Sergiusz will not be able to use it but for Other readers, using only base Latin symbols, this solution can be feasible/usable...

Regards,

Chiappa