Let's explore the various
character sets that we use in our databases and what they really mean.
We have all seen database character sets
such as US7ASCII, WE8ISO8859P1 and AL32UTF8, and national character
sets such as AL16UTF16. Let's understand each of them at a high
level.
In the early days of computing, the
US7ASCII character set was the standard for storing and handling English
characters. US7ASCII is 7-bit ASCII, which is enough to store all English
letters, digits and the symbols found on our keyboards (#, $, %, <, >, * etc.).
A 7-bit code can represent at most 128 values (0 to 127), control characters
included. It is a single-byte character set: each character occupies 1 byte,
of which only 7 bits are used.
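The 7-bit limit can be checked with a few lines of Python. This is only a sketch: Python's "ascii" codec is used as a stand-in for US7ASCII, and the helper name is_us7ascii is made up for illustration.

```python
# Python's "ascii" codec stands in for US7ASCII here (an assumption for illustration).
ascii_code_points = 2 ** 7  # 7 bits give 128 possible values, numbered 0..127


def is_us7ascii(text: str) -> bool:
    """Return True if every character of text fits in 7 bits (hypothetical helper)."""
    return all(ord(ch) < 128 for ch in text)


sample = "Price: $5 (#1) <ok>"
encoded = sample.encode("ascii")

print(ascii_code_points)             # number of values a 7-bit code can hold
print(is_us7ascii(sample))           # keyboard symbols fit in 7 bits
print(len(encoded) == len(sample))   # one byte per character
print(max(encoded) < 128)            # the 8th bit of every byte stays unset
```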
Later, many software vendors and
computer manufacturers started using the 8th bit as well, raising the maximum
number of characters and symbols to 256. This made room for the additional
characters used in Western European languages, such as accented German, French
or Dutch letters.
This 8th bit is used in the WE8ISO8859P1
character set. As you can see from its name, this character set stores the
additional characters and symbols of Western Europe (WE), uses 8 bits
(1 byte) to represent a character or symbol, and follows the ISO 8859-1 standard.
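To see what the 8th bit buys, here is a small Python sketch. It assumes Python's "latin-1" codec (ISO 8859-1) is a fair stand-in for WE8ISO8859P1, and the sample text is made up.

```python
# "latin-1" is Python's name for ISO 8859-1, used here as a stand-in for WE8ISO8859P1.
text = "Gr\u00f6\u00dfe, caf\u00e9"  # "Größe, café" written with escapes

latin1_bytes = text.encode("latin-1")

# Every character, accented or not, occupies exactly one byte.
print(len(text), len(latin1_bytes))

# Accented letters land in the upper half (128..255) opened up by the 8th bit.
high_bytes = [b for b in latin1_bytes if b >= 128]
print(len(high_bytes))

# A 7-bit codec cannot represent the same text.
try:
    text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)
```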
As technology evolved and we entered
the internet era, we needed to store and manage the alphabets and symbols
of all languages, including Chinese, Japanese and Korean.
The AL32UTF8 character
set is used for that purpose. AL32UTF8 is Oracle's name for the Unicode UTF-8
encoding scheme, which can store characters from all languages (AL). It uses up
to 32 bits (4 bytes) per character or symbol. This makes it a variable-width
multi-byte character set, as opposed to the single-byte character sets we saw
in US7ASCII and WE8ISO8859P1.
Please note that 4 bytes is the
maximum, not a fixed size. For example, an English letter needs
only 1 byte, so AL32UTF8 takes up only 1 byte for it (and not the
entire 4 bytes). But if you use a character or symbol that needs
multiple bytes (such as a Chinese character), it uses the required
number of bytes.
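The variable width is easy to observe. The sketch below uses Python's "utf-8" codec, assumed here to behave like AL32UTF8 for the purpose of counting bytes. The sample characters are arbitrary picks.

```python
# One sample character for each UTF-8 length (arbitrary picks for illustration).
samples = {
    "A": "English letter",
    "\u00e9": "accented Latin letter (e with acute)",
    "\u4e2d": "Chinese character",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, label in samples.items():
    print(f"{label}: {utf8_lengths[ch]} byte(s)")
```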
This means that the UTF-8 encoding
scheme stores the English characters of US7ASCII in 1 byte each, using exactly
the same byte values. Accented Western European characters, which take 1 byte
in WE8ISO8859P1, take 2 bytes in UTF-8, and characters from most other
languages take 2 to 4 bytes.
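A quick way to compare the three encodings side by side is to encode the same text with each. This is a sketch only: the Python codecs "ascii", "latin-1" and "utf-8" are assumed stand-ins for US7ASCII, WE8ISO8859P1 and AL32UTF8, and byte_lengths is a hypothetical helper.

```python
def byte_lengths(text: str) -> dict:
    """Return the encoded size of text per codec, or None if it cannot be encoded."""
    lengths = {}
    for codec in ("ascii", "latin-1", "utf-8"):
        try:
            lengths[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            lengths[codec] = None  # the character set has no code for some character
    return lengths


plain = byte_lengths("Zurich")          # unaccented letters only
accented = byte_lengths("Z\u00fcrich")  # same word with u-umlaut

print(plain)
print(accented)

# Plain English text is byte-for-byte identical in all three encodings.
same_bytes = "Zurich".encode("ascii") == "Zurich".encode("utf-8")
print(same_bytes)
```

The accented word grows by one byte in UTF-8, which is worth keeping in mind when a column is sized in bytes rather than characters.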
In Oracle Database, we also have a
national character set, typically AL16UTF16. This character set uses the UTF-16
encoding scheme, which stores each character in one or two 16-bit units
(2 or 4 bytes). The national character set is useful when your database is not
set up with a Unicode character set and still uses an older character set such
as US7ASCII, but you want to store characters from additional languages in
specific columns of a specific table. In that case, you can use the national
character set only for those columns, by defining them with the NCHAR,
NVARCHAR2 or NCLOB datatype.
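For comparison with UTF-8, the sketch below counts bytes under UTF-16. It uses Python's "utf-16-be" codec so that no byte-order mark is prepended. Treating that codec as a stand-in for AL16UTF16 is an assumption, and utf16_bytes is a hypothetical helper.

```python
def utf16_bytes(ch: str) -> int:
    """Size of one character in UTF-16 (big-endian variant, so no byte-order mark)."""
    return len(ch.encode("utf-16-be"))


utf16_lengths = {
    "English letter": utf16_bytes("A"),
    "Chinese character": utf16_bytes("\u4e2d"),
    "emoji (needs a surrogate pair)": utf16_bytes("\U0001F600"),
}

for label, size in utf16_lengths.items():
    print(f"{label}: {size} bytes")
```

Unlike UTF-8, even a plain English letter takes 2 bytes here, so UTF-16 is never smaller than 2 bytes per character.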
That's a high-level explanation of character
sets. Please share your thoughts and feedback in the comment section.