determine charset/encoding of blob parameter


Question and Answer

Connor McDonald

Thanks for the question, Hans-Werner.

Asked: August 14, 2017 - 6:40 am UTC

Last updated: March 18, 2022 - 4:49 am UTC

Version: 12.1.0.2.0

Viewed 10K+ times

You Asked

origin of sources:
https://oracle-base.com/articles/misc/oracle-rest-data-services-ords-restful-web-services-handling-complex-json-payloads

How can I guess the encoding of submitted data?
Consider a web service call like the one described in this very helpful article series on oracle-base.com.
The POST data may be created in different encodings: UTF-8, ISO 8859-1, ...
The apex_json.get_varchar2 calls return a VARCHAR2 in the encoding of the database:
l_emp_tab(l_emp_tab.last).ename := APEX_JSON.get_varchar2(p_path => 'departments[%d].department.employees[%d].employee_name', p0 => i, p1 => j);
Now I want to know whether I have to convert these strings before storing them in the database.
They may contain special characters (umlauts) that are represented differently in different encodings.

This is a complicated instance of 'How do I guess the encoding of binary data?', the kind of thing Notepad++ or the Unix file command can do.

Kind regards and thanks in advance

Hans



and Connor said...

A couple of things are in play here:

1) For character data, character set conversion is handled by the database automatically. If you have data in the database in character set #1, and you retrieve it with an Oracle client in character set #2, then we'll do the conversion *if possible*. (Obviously, if you have data in #1 that cannot be mapped to #2, you get garbage.) For anything I build or do nowadays, I endeavour to do it all in UTF-8 to cater for the broadest possible requirements. But obviously, sometimes you are at the whim of data being passed to you from an external source - in these cases, it's reasonable to insist that the character set used to encode the data is provided to you as part of the metadata of the request.
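
As a quick sanity check (not part of the original answer, just a common approach), you can see which character set your VARCHAR2 data is stored in, and experiment with explicit conversions via the built-in CONVERT function:

-- which character set VARCHAR2/CLOB data is stored in
select value
from   nls_database_parameters
where  parameter = 'NLS_CHARACTERSET';

-- convert between character sets explicitly (for experimentation;
-- normal client/server conversion is driven automatically by NLS_LANG)
select convert('Grüße', 'WE8ISO8859P1', 'AL32UTF8') from dual;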

2) For binary data, it is just that... binary. The responsibility for 'understanding' that data and what it represents falls to you. That might involve capturing (say) the MIME type of the data when it's first loaded, if possible, or utilizing things like file signatures to attempt to classify the data in the same way that 'file' does:

https://en.wikipedia.org/wiki/List_of_file_signatures
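
To illustrate the signature idea, here is a minimal sketch (the function name and the short signature list are assumptions for illustration, not a complete implementation) that peeks at the leading bytes of a BLOB, much like the unix 'file' command:

create or replace function guess_file_type(p_blob in blob) return varchar2 is
  -- grab the first few bytes; assumes the blob holds at least 8 bytes
  l_head raw(8) := dbms_lob.substr(p_blob, 8, 1);
begin
  if utl_raw.substr(l_head, 1, 4) = hextoraw('89504E47') then
    return 'PNG image';
  elsif utl_raw.substr(l_head, 1, 4) = hextoraw('25504446') then  -- %PDF
    return 'PDF document';
  elsif utl_raw.substr(l_head, 1, 2) = hextoraw('504B') then      -- PK
    return 'ZIP archive (also docx/xlsx/jar)';
  elsif utl_raw.substr(l_head, 1, 3) = hextoraw('474946') then    -- GIF
    return 'GIF image';
  else
    return 'unknown';
  end if;
end;
/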



Comments

You don't answer the question

Santiago Coria, March 15, 2022 - 6:03 pm UTC

Hans-Werner asks, "How can I guess the encoding of submitted data?"
But the reply does not answer how, or even whether it is possible.
I understand that once the file is in a BLOB in the database, the charset information is lost. Is that correct?
Connor McDonald
March 18, 2022 - 4:49 am UTC

There is no "character set" as such for the BLOB, because it is character-set independent (assuming a multibyte character set for the database, which is the default).


Mustafa KALAYCI, March 31, 2022 - 4:51 am UTC

Some developers store text data in a BLOB column, and they don't do this by converting the text into raw data. They read the "text file" as a byte array and insert that byte array into the BLOB column (I am certainly against this method; always use a CLOB for text data). In this case the text file itself (not just its content) is written into the BLOB, and when you try to read it as text (to_clob, dbms_lob.converttoclob, etc.) some characters might come back as non-alphabetical characters because they are lost in conversion.

To solve this, you must know the base encoding of the text file, and while converting the data from BLOB to CLOB you must use the character set ID, for example to_clob(my_column, 46). 46 is the charset ID of the WE8ISO8859P15 character set. (At worst you can guess what it could be: use the nls_charset_id function to get the charset ID of a character set and try to convert via that ID; if there are no strange characters at the end of the conversion, you might have found your encoding.)
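
For completeness, a sketch of that conversion using dbms_lob.converttoclob, which takes the source character set id explicitly (the table and column names are hypothetical):

declare
  l_blob     blob;
  l_clob     clob;
  l_dest_off integer := 1;
  l_src_off  integer := 1;
  l_ctx      integer := dbms_lob.default_lang_ctx;
  l_warn     integer;
begin
  select my_column into l_blob from my_table where id = 1;  -- hypothetical source

  dbms_lob.createtemporary(l_clob, true);
  dbms_lob.converttoclob(
    dest_lob     => l_clob,
    src_blob     => l_blob,
    amount       => dbms_lob.lobmaxsize,
    dest_offset  => l_dest_off,
    src_offset   => l_src_off,
    blob_csid    => nls_charset_id('WE8ISO8859P15'),  -- your best guess at the source encoding
    lang_context => l_ctx,
    warning      => l_warn);

  -- a warning here usually means the guessed character set was wrong
  if l_warn = dbms_lob.warn_inconvertible_char then
    dbms_output.put_line('inconvertible characters - try another character set');
  end if;
end;
/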

If you didn't check the encoding at the beginning before inserting the file into the BLOB, then you can only "guess" the encoding of the file. There are some examples on Google of how to guess a file's encoding by looking at its first bytes, but none of them is guaranteed.
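
One of the few reliable "first bytes" clues is a byte order mark (BOM). A minimal sketch (my own illustration; it only helps when the producer actually wrote a BOM, which many don't):

declare
  l_head raw(4);
begin
  select dbms_lob.substr(my_column, 4, 1) into l_head
  from   my_table
  where  id = 1;  -- hypothetical source, assumed to hold at least 4 bytes

  if utl_raw.substr(l_head, 1, 3) = hextoraw('EFBBBF') then
    dbms_output.put_line('UTF-8 BOM found');
  elsif utl_raw.substr(l_head, 1, 2) = hextoraw('FEFF') then
    dbms_output.put_line('UTF-16 big-endian BOM found');
  elsif utl_raw.substr(l_head, 1, 2) = hextoraw('FFFE') then
    dbms_output.put_line('UTF-16 little-endian BOM found');
  else
    dbms_output.put_line('no BOM - the encoding has to be supplied or guessed');
  end if;
end;
/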

I strongly suggest reading the text file content and inserting it into a CLOB. Even if a CLOB is not an option for some reason, again read the content of the file, then convert it to raw and insert that into the BLOB. That way you won't be dealing with conversions.
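
A sketch of that approach against a hypothetical table t: keep the text as text, or, if a BLOB is unavoidable, produce the bytes yourself so the encoding (the database character set) is known:

-- preferred: text stays text
insert into t (c_clob) values ('some text with umlauts: äöü');

-- if a BLOB is unavoidable: cast_to_raw produces bytes in the
-- database character set, so there is nothing to guess later
insert into t (c_blob) values (utl_raw.cast_to_raw('some text with umlauts: äöü'));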

My two cents.
