Qt-interest Archive, July 2007
What is Size of a Japanese Character?
Message 1 in thread
Hi All,
I would like to know the size of a Japanese character.
In my code below, each Japanese character takes 3 bytes of memory, but it is supposed to take only 2 bytes.
Please help me with this problem.
#include <QApplication>
#include <QString>
#include <cstdio>

int main(int argc, char *argv[])
{
    QApplication cApp(argc, argv);
    char str[]="ああ"; //Japanese text with 2 characters
    QString str1="ああああああああああああ"; // Japanese text with 12 characters
    printf("size of str %d\n",sizeof(str));
    printf("size of str1 %d\n",sizeof(str1.toLatin1().data()));
    return 0;
}
OUTPUT:
size of str 7
size of str1 4
Message 2 in thread
Hi!
> char str[]="ああ"; //Japanese text with 2 characters
Probably your compiler encodes c-strings in UTF-8.
QString handles strings internally as UTF-16 (or at least as Unicode,
but that is not really important here). If I'm not mistaken, you
could get the text into a QString (and so into UTF-16) this way:
QString str = QString::fromUtf8("ああ");
printf("length of str %d", str.length());
If you want to have the UTF-16 representation you could do the following:
const ushort *uc = str.utf16();
Or the STL way:
std::wstring ws = str.toStdWString();
If you want some kind of Japanese encoding, try this:
QTextCodec *codec = QTextCodec::codecForName("ISO-2022-JP");
QByteArray jap_str = codec->fromUnicode(str);
Please read the following detailed documentation:
http://doc.trolltech.com/4.3/qtextcodec.html
http://doc.trolltech.com/4.3/qstring.html
Greetings
Niklas
--
[ signature omitted ]
Message 3 in thread
Niklas Hofmann wrote:
> Hi!
>
>> char str[]="ああ"; //Japanese text with 2 characters
>
> Probably your compiler encodes c-strings in UTF-8.
Most compilers don't touch the encoding of string literals. They just
read the data as a string of bytes and shove that into the literal. So
the encoding of the string literal is whatever your text editor is using.
For that reason I tend to avoid encoding strings above 7-bit ASCII in
source code. If you have to do it, make REALLY sure everybody who works
on the code has their text editor set to use the same text encoding
(possibly using markup in the file, or if your compiler can handle it a
Unicode BOM).
> QString str = QString::fromUtf8("ああ");
> printf("length of str %d", str.length());
If the file is utf-8 encoded, then yes QString::fromUtf8(..) will work.
Of course, the length printed has nothing to do with what the writer
wanted to know - the size of a japanese character. That depends on how
it's represented. Inside a QString the text is stored as UTF-16, so it'll
be 2 bytes per character here (4 for characters outside the Basic
Multilingual Plane, which need a surrogate pair).
As utf-8 it's:
"\xe3\x81\x82\xe3\x81\x82"
ie 3 bytes per character. In simplified terms that's because each character
is a UTF-8 lead byte (the "escape", 0xe3 here) followed by 2 continuation bytes.
As utf-16 (little-endian, with a BOM) it's:
"\xff\xfeB0B0"
ie
"\xff\xfe\x42\x30\x42\x30"
including the 2-byte BOM (so the characters themselves are 2 bytes each).
(if the backslashes appear as yen symbols to you, then you're decoding
this message as if it were encoded with the shift-JIS text codec,
whereas it's actually written in utf-8. Fix your mail client or get a
better one. Your original message was tagged as being shift-JIS /
ISO-2022-JP but used backslashes with the usual ASCII code, so it was
wrong.).
> If you want some kind of Japanese encoding, try this:
>
> QTextCodec *codec = QTextCodec::codecForName("ISO-2022-JP");
> QByteArray jap_str = codec->fromUnicode(str);
... but be very careful to make sure all your source is in that
encoding, rather than some files or some literals within a file being in
(eg) utf-8 and others in shift-jis.
The most important thing is to think about text encodings, and to draw a
careful distinction between "byte strings" that may be encoded in any of
a number of ways (so they're only meaningful if you know how they're
encoded) and true Unicode text strings (like QString) that know their
own encoding.
Message 4 in thread
Hi!
> Most compilers don't touch the encoding of string literals. They just
> read the data as a string of bytes and shove that into the literal. So
> the encoding of the string literal is whatever your text editor is using.
>
> For that reason I tend to avoid encoding strings above 7-bit ASCII in
> source code. If you have to do it, make REALLY sure everybody who works
> on the code has their text editor set to use the same text encoding
> (possibly using markup in the file, or if your compiler can handle it a
> Unicode BOM).
Exactly, you are right!
>> QString str = QString::fromUtf8("ああ");
>> printf("length of str %d", str.length());
>
> If the file is utf-8 encoded, then yes QString::fromUtf8(..) will work.
I guess it is, if I interpret the output mentioned in the original
poster's mail correctly.
> Of course, the length printed has nothing to do with what the writer
> wanted to know - the size of a japanese character.
That is true.. but the size can be anything if we talk about bytes..
that is why I mentioned the different encodings. The writer has to choose
the encoding, and on the basis of that encoding he can determine a size
in bytes.
> (if the backslashes appear as yen symbols to you,
they appeared as backslashes... but were mixed up in the reply post..
maybe we should submit a bug report to mozilla (thunderbird) ;-) but the
topic is not my mail program but the question of the original poster :-)
>
>> If you want some kind of Japanese encoding, try this:
>>
>> QTextCodec *codec = QTextCodec::codecForName("ISO-2022-JP");
>> QByteArray jap_str = codec->fromUnicode(str);
>
> ... but be very careful to make sure all your source is in that
> encoding, rather than some files or some literals within a file being in
> (eg) utf-8 and others in shift-jis.
at this point I assumed a valid QString object... If this is the case,
this snippet should work, shouldn't it?
>
> The most important thing is to think about text encodings, and to draw a
> careful distinction between "byte strings" that may be encoded in any of
> a number of ways (so they're only meaningful if you know how they're
> encoded) and true Unicode text strings (like QString) that know their
> own encoding.
Exactly! That is what I wanted to highlight in my previous post!
Greetings
Niklas
Message 5 in thread
Niklas Hofmann wrote:
> QString str = QString::fromUtf8("ああ");
>>> printf("length of str %d", str.length());
>>
>> If the file is utf-8 encoded, then yes QString::fromUtf8(..) will work.
>
> I guess it is, if I interpret the output mentioned in the original
> poster's mail correctly.
Nah, in the original message it's iso-2022-jp text according to the MIME
headers in the mail source. First time I've noticed shift-JIS's
backslash-to-Yen substitution in practice, actually. It probably looked
like utf-8 because a well behaved mail client would've converted it to
something more suitable for your locale (like utf-8) if you copied the
chars from the message.
>>> If you want some kind of Japanese encoding, try this:
>>>
>>> QTextCodec *codec = QTextCodec::codecForName("ISO-2022-JP");
>>> QByteArray jap_str = codec->fromUnicode(str);
>>
>> ... but be very careful to make sure all your source is in that
>> encoding, rather than some files or some literals within a file being in
>> (eg) utf-8 and others in shift-jis.
>
> at this point I assumed a valid QString object... If this is the case,
> this snippet should work, shouldn't it?
Yep, definitely. It's rarely safe to assume a valid QString when dealing
with people new to text encoding issues, though. In particular, unless
the user defines QT_NO_CAST_FROM_ASCII (QT_NO_CAST_ASCII in Qt 3; I've
never seen it used in practice, even though it's awfully handy) they can
pass a `const char*' as `str' and Qt will make it into a QString ... er ...
somehow. I love the way this usually seems to work for the developer doing
the testing, but goes splat when run on a system with another locale :S
You accurately point out that that has little to do with the encoding
specified for QTextCodec::codecForName(...) though. As you said, your
code snippet just relies on a sane QString as input.
I guess GIGO is an awfully accurate term when it comes to text encodings
;-) .
Message 6 in thread
Hi!
>> QString str = QString::fromUtf8("ああ");
>>>> printf("length of str %d", str.length());
>>> If the file is utf-8 encoded, then yes QString::fromUtf8(..) will work.
>> I guess it is, if I interpret the output mentioned in the original
>> poster's mail correctly.
>
> Nah, in the original message it's iso-2022-jp text according to the MIME
> headers in the mail source. First time I've noticed shift-JIS's
> backslash-to-Yen substitution in practice, actually. It probably looked
> like utf-8 because a well behaved mail client would've converted it to
> something more suitable for your locale (like utf-8) if you copied the
> chars from the message.
I talked about the code and the printf-calls. If we have the following
situation, I guess the encoding of the source code file is UTF-8:
char str[]="ああ";
printf("size of str %d\n",sizeof(str));
=> size of str 7
I noticed too that the original mail was encoded in iso-2022-jp and I
have to say I have no experience with iso-2022-jp / shift-JIS and stuff
like this.. :-) Therefore I am not sure about the reason for
backslash-to-Yen substitution... Maybe you can shed some light on this
topic?
> I guess GIGO is an awfully accurate term when it comes to text encodings
> ;-) .
:-)
Greetings
Niklas
Message 7 in thread
Niklas Hofmann schrieb:
> Hi!
> ...
> I noticed too that the original mail was encoded in iso-2022-jp and I
> have to say I have no experience with iso-2022-jp / shift-JIS and stuff
> like this.. :-) Therefore I am not sure about the reason for
> backslash-to-Yen substitution... Maybe you can shed some light on this
> topic?
In a Latin encoding the byte value of the ASCII backslash (0x5C) really is
displayed as a backslash (\), whereas with iso-2022-jp the same "ascii"
value can be displayed as a yen sign.
So if the original posting was in iso-2022-jp (or the other encoding,
shift-JIS, I'm no expert either) and showed a yen sign, and you hit the
reply button and change the encoding to, say, latin1, then this sign
will appear as a backslash.
I was quite puzzled when I tested our application on a real Japanese
Windows XP and saw all the yen signs in the paths, e.g. in the Windows
File Explorer, quite apart from being completely taken aback by all the
other Japanese characters all over the place ;)
On another note: I guess the best practice is to use English text in all
source code (for user-displayable text) and do the translation with
Linguist, which takes care of all this Unicode stuff.
And if one really needs to read in an external text file, make sure you
a) know the encoding of the text file (e.g. UTF-8 for XML files) and b)
set this encoding on the QTextStream before reading in the file
(QTextStream::setEncoding in Qt 3, QTextStream::setCodec in Qt 4). Once
this text is in a QString everything is fine :)
Cheers, Oliver
Message 8 in thread
naveen.kmrvm@xxxxxxxxx wrote:
> Hi All,
>
> I would like to know the size of a Japanese character.
>
> In my code below, each Japanese character takes 3 bytes of memory, but it is supposed to take only 2 bytes.
Japanese characters are not valid in the latin-1 text encoding. Because the
literal's bytes are by default read in as if they were latin-1 and
toLatin1() reverses that step, Qt may be able to preserve this mangling for
you, but you should never rely on it. You should probably be using a more
appropriate 8-bit text encoding such as utf-8 or shift-jis instead.
The amount of memory the QString uses is not the same as the size of the
temporary buffer (a QByteArray in Qt 4) that holds an 8-bit representation
of the text when one is requested. QString uses a 16-bit (UTF-16) character
representation internally; have a look at the QString source to see how
it works. You can see the exact amount of memory used by a particular
QString, including over-allocation space, using a debugger.
Note that your second sizeof() usage is sizeof(char*), which evaluates
the size of a POINTER to char, ie 4 bytes on a 32 bit platform. So it
has nothing to do with what you wanted.