Understanding Different Character Sets for our Database

Let's explore the various character sets that we use for our databases and what they really mean.

We have used and seen database character sets such as US7ASCII, WE8ISO8859P1 and AL32UTF8, and national character sets such as AL16UTF16. Let's understand each of them at a high level.

In the early days of computing, the US7ASCII character set was the standard for storing and handling English text. US7ASCII is a 7-bit ASCII code, which is sufficient for the English letters, digits and symbols found on our keyboards (#, $, %, <, >, * etc.). With 7 bits it can represent at most 128 values (codes 0 to 127), some of which are control characters. It is a single-byte character set: each character occupies 1 byte, of which only 7 bits are used.
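As a rough illustration of the 7-bit limit, here is a small Python sketch. It assumes Python's built-in "ascii" codec as a stand-in for US7ASCII; the variable names are only for illustration.

```python
# Number of distinct values that fit in 7 bits (codes 0 through 127).
code_points = 2 ** 7

# A few English letters and keyboard symbols.
sample = "A#$%<>*"
encoded = sample.encode("ascii")

# Each character occupies exactly one byte, and the 8th (high) bit is never set.
one_byte_each = len(encoded) == len(sample)
high_bit_unused = all(b < 128 for b in encoded)

print(code_points, list(encoded), one_byte_each, high_bit_unused)
```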

Later, many software vendors and computer manufacturers started using the 8th bit as well, raising the maximum number of characters and symbols to 256. This made room for additional characters from Western European languages such as German, French and Dutch.

This 8th bit is used by the WE8ISO8859P1 character set. As its name suggests, it stores additional characters and symbols from Western Europe (WE), and it uses 8 bits (1 byte) to represent each character or symbol.
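The effect of the 8th bit can be sketched in Python. This assumes Python's "latin-1" codec (ISO 8859-1) as a stand-in for WE8ISO8859P1; the sample text is just an example.

```python
# "Größe café" written with escapes so the source file encoding does not matter.
text = "Gr\u00f6\u00dfe caf\u00e9"

# ISO 8859-1 stores every one of these characters in a single byte.
latin1_bytes = text.encode("latin-1")

# Plain 7-bit ASCII has no codes for ö, ß or é, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Bytes with the 8th bit set (value >= 128) are the Western European additions.
high_bytes = [b for b in latin1_bytes if b >= 128]

print(len(text), len(latin1_bytes), ascii_ok, high_bytes)
```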

As technology evolved and we entered the internet era, we needed to store and manage the alphabets and symbols of all languages, including Chinese, Japanese and Korean.

The AL32UTF8 character set is used for that purpose. AL32UTF8 is Oracle's name for the Unicode UTF-8 encoding, which can store characters from all languages. UTF-8 is a variable-width encoding: it uses between 1 and 4 bytes (up to 32 bits) for each character or symbol. This makes it a multi-byte character set, in contrast to the single-byte character sets US7ASCII and WE8ISO8859P1.

Please note that 4 bytes is only the maximum for any character. An English letter, for example, takes just 1 byte in AL32UTF8 (and not the entire 4 bytes). A character or symbol that needs multiple bytes, such as a Chinese character, uses only as many bytes as it requires.
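To see the variable width in practice, here is a minimal Python sketch. It assumes Python's "utf-8" codec behaves like AL32UTF8 for these characters; the sample characters are arbitrary picks.

```python
# One sample character from each UTF-8 width class, written as escapes.
samples = {
    "english_A": "A",             # basic Latin letter
    "french_e_acute": "\u00e9",   # é
    "chinese_zhong": "\u4e2d",    # 中
    "emoji": "\U0001F600",        # a supplementary character
}

# Number of bytes each character needs when encoded as UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, size in utf8_lengths.items():
    print(name, size)
```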

This means that the UTF-8 encoding scheme stores English (ASCII) characters in 1 byte, using exactly the same codes as US7ASCII. The additional Western European characters that take 1 byte in WE8ISO8859P1, such as é or ü, take 2 bytes in UTF-8, and characters from most other languages take 2 to 4 bytes.
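A side-by-side comparison can make the storage difference concrete. The sketch below assumes Python's "ascii", "latin-1" and "utf-8" codecs as stand-ins for US7ASCII, WE8ISO8859P1 and AL32UTF8; the names used are only examples.

```python
plain = "Miller"
accented = "M\u00fcller"  # "Müller"

# Pure ASCII text is byte-for-byte identical in all three encodings.
same_bytes = (
    plain.encode("ascii") == plain.encode("latin-1") == plain.encode("utf-8")
)

# The accented name has 6 characters, but its size depends on the encoding.
size_latin1 = len(accented.encode("latin-1"))  # ü stored as 1 byte
size_utf8 = len(accented.encode("utf-8"))      # ü stored as 2 bytes

print(same_bytes, size_latin1, size_utf8)
```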

In Oracle Database, we also have a national character set, typically AL16UTF16. This character set uses the UTF-16 encoding scheme, which stores each character in 2 bytes (16 bits), or 4 bytes for supplementary characters. The national character set is useful when your database is not set up with a Unicode character set and still uses an older one such as US7ASCII, but you want to store characters from additional languages in specific columns of a specific table. In that case, you can use the national character set for just those columns by defining them with the NCHAR or NVARCHAR2 data type.
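The byte widths of UTF-16 can be checked with a short Python sketch. It assumes Python's "utf-16-be" codec (big-endian, no byte order mark) is a reasonable stand-in for AL16UTF16; the sample characters are arbitrary.

```python
samples = {
    "english_A": "A",
    "chinese_zhong": "\u4e2d",   # 中
    "emoji": "\U0001F600",       # supplementary character (surrogate pair)
}

# Number of bytes each character needs when encoded as UTF-16.
utf16_lengths = {
    name: len(ch.encode("utf-16-be")) for name, ch in samples.items()
}

for name, size in utf16_lengths.items():
    print(name, size)
```

Unlike UTF-8, even a plain English letter takes 2 bytes here.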

That's a high-level explanation of the character sets. Please share your thoughts and feedback in the comment section.

Oracle & PostgreSQL - Developer & Administrator
