
Qt-interest Archive, February 2008
Utf8?


Message 1 in thread

Hi,

QFile file("./1.txt");
file.open(QFile::WriteOnly);
file.write(QString::fromUtf8("黑体").toUtf8());
file.close();

Why is the output always "??"?

Thanks,
Lingfa

--
 [ signature omitted ] 

Message 2 in thread

On Fri, Feb 01, 2008 at 04:32:09PM -0500, Lingfa Yang wrote:
> Hi,
>
> QFile file("./1.txt");
> file.open(QFile::WriteOnly);
> file.write(QString::fromUtf8("éä").toUtf8());
> file.close();
>

Are you sure that your character sequence is really UTF-8? My mail client
shows two rectangles.

Your code works well with Russian characters in both UTF-8 and KOI8-R
locales (after replacing the string with valid UTF-8).

-- 
 [ signature omitted ] 



Message 3 in thread

On Friday 01 February 2008 16:32, Lingfa Yang wrote:
> Hi,
> 
> QFile file("./1.txt");
> file.open(QFile::WriteOnly);
> file.write(QString::fromUtf8("??").toUtf8());
> file.close();
> 
> Why the output is always "??"?

"X" is a const char*
L"X" is a const wchar_t*

So, you should be using L" ... ".


--
 [ signature omitted ] 

Message 4 in thread

Lingfa Yang wrote:
>Hi,
>
>QFile file("./1.txt");
>file.open(QFile::WriteOnly);
>file.write(QString::fromUtf8("éä").toUtf8());
>file.close();
>
>Why the output is always "??"?

First of all, if x is a valid UTF-8 string, QString::fromUtf8(x).toUtf8() 
== x. So your code above is pointless: you're converting your two 
characters from UTF-8 to UTF-16, then back, only to get your two 
characters again. If you just wanted to write those two characters, you 
could have done exactly that:

   file.write("éä");
or, even better:
   char msg[] = "éä";
   file.write(msg, strlen(msg)); // avoids a QByteArray temporary

Second, the four lines above do not constitute a compilable example. 
Please give us the rest of the *small* program.

When dealing with character encoding, please indicate the hex dump of what 
you would have liked to see and what you actually got.
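
One way to produce such a hex dump from inside the program is sketched below. hexDump() is a hypothetical helper name, not a Qt function, and the sketch only formats the bytes it is given; it makes no assumption about their encoding.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of Qt): format a byte buffer as
// space-separated, two-digit uppercase hex values.
std::string hexDump(const char *data, std::size_t size)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < size; ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned value = static_cast<unsigned char>(data[i]);
        std::snprintf(buf, sizeof(buf), "%02X", value);
        if (i != 0)
            out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hexDump() on the buffer just before it is passed to file.write() shows exactly which bytes the compiler put into the string literal.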

-- 
 [ signature omitted ] 



Message 5 in thread

Thiago Macieira wrote:
> Lingfa Yang wrote:
>   
>> Hi,
>>
>> QFile file("./1.txt");
>> file.open(QFile::WriteOnly);
>> file.write(QString::fromUtf8("éä").toUtf8());
>> file.close();
>>
>> Why the output is always "??"?
>>     
>
> First of all, if x is a valid UTF-8 string, QString::fromUtf8(x).toUtf8() 
> == x. So your code above is pointless: you're converting your two 
> characters from UTF-8 to UTF-16, then back, only to get your two 
> characters again. If you just wanted to write those two characters, you 
> could have done exactly that:
>
>    file.write("éä");
> or, even better:
>    char msg[] = "éä";
>    file.write(msg, strlen(msg)); // avoids a QByteArray temporary
>
> Second, the four lines above do not constitute a compileable example. 
> Please give us the rest of the *small* program.
>
> When dealing with character encoding, please indicate the hex dump of what 
> you would have liked to see and what you actually got.
>
>   
Here is the whole code:

#include <QFile>
int main(int argc, char *argv[])
{
QFile file("./1.txt");
file.open(QFile::WriteOnly);

//char text[] = "éä"; // "??"
//file.write(text, strlen(text)); // output is wrong!

char t[7]; t[0] = 0xFF; t[1] = 0xFE; t[2] = 0xD1; t[3]=0x9E; t[4]=0x53; 
t[5]=0x4F; t[6]='\0';
file.write(t, 6); // output is correct.

file.close();

return 1;
}

The text is two Chinese characters, which were copied from a 
document.xml inside a docx file. The original is:
<a:ea typeface="黑体" pitchFamily="2" charset="-122" />


--
 [ signature omitted ] 

Message 6 in thread

Hi,

> [...]
> //char text[] = "éä"; // "??"
> [...]
> char t[7]; t[0] = 0xFF; t[1] = 0xFE; t[2] = 0xD1; t[3]=0x9E; t[4]=0x53; 
> t[5]=0x4F; t[6]='\0';
> [...]

I understand the character sequence put in 't' is encoded using UTF-8 as 
expected. Good.

If the characters put in 'text' are different, then clearly the encoding of 
the source file is different. Check the encoding used by your editor to start 
with. Then check the encoding expected by the compiler. At least one of them 
is not UTF-8.

--
 [ signature omitted ] 

Message 7 in thread

> Then check the encoding expected by the compiler. At least one of them 
> is not UTF-8.

You are right. It is not UTF-8. It is UTF-16 (starts with FF FE).
Thank you,
Lingfa


--
 [ signature omitted ] 

Message 8 in thread

Lingfa Yang wrote:
>char t[7]; t[0] = 0xFF; t[1] = 0xFE; t[2] = 0xD1; t[3]=0x9E; t[4]=0x53;
>t[5]=0x4F; t[6]='\0';
>file.write(t, 6); // output is correct.

The encoding above is not UTF-8. It's UTF-16 Little Endian with a Byte Order 
Mark (BOM): the bytes FF FE are the little-endian form of U+FEFF.
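
For reference, the common byte order marks can be told apart from the first two or three bytes of a buffer: EF BB BF for UTF-8, FF FE for UTF-16 little endian, FE FF for UTF-16 big endian. The following is only a sketch; detectBom() is a hypothetical helper name, and it deliberately ignores UTF-32 (whose little-endian BOM also starts with FF FE).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: guess an encoding from a leading byte order mark.
// UTF-32 is not handled in this sketch.
std::string detectBom(const unsigned char *data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";
    return "unknown"; // no BOM found; the data may still be in any encoding
}
```

A missing BOM proves nothing, since UTF-8 files are often saved without one.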

-- 
 [ signature omitted ] 



Message 9 in thread

Thiago Macieira wrote:
> Lingfa Yang wrote:
>   
>> char t[7]; t[0] = 0xFF; t[1] = 0xFE; t[2] = 0xD1; t[3]=0x9E; t[4]=0x53;
>> t[5]=0x4F; t[6]='\0';
>> file.write(t, 6); // output is correct.
>>     
>
> The encoding above is not UTF-8. It's UTF-16 Little Endian with a Byte Order 
> Mark (BOM).
>   

Yes, you're right! I opened the main.cpp file in a hex editor and found 
that its first two bytes are "FF FE" (the UTF-16 BOM) instead of "EF BB 
BF" (the UTF-8 BOM). My editor is VS2005. It silently converted the file 
from ASCII to UTF-16 when I added the Chinese characters, and I was not 
aware of the change.
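
The six bytes from the t[] array earlier in the thread can be read the same way, as UTF-16 little-endian code units. A minimal sketch, assuming a hypothetical helper named utf16leUnits() that skips a leading FF FE BOM and does not combine surrogate pairs:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: read UTF-16 little-endian bytes into 16-bit code
// units, skipping a leading FF FE BOM. Surrogate pairs are left as-is and
// a trailing odd byte is ignored.
std::vector<char16_t> utf16leUnits(const unsigned char *data, std::size_t size)
{
    std::vector<char16_t> units;
    std::size_t i = 0;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        i = 2; // skip the BOM
    for (; i + 1 < size; i += 2) {
        // Little endian: the low byte comes first.
        units.push_back(static_cast<char16_t>(data[i] | (data[i + 1] << 8)));
    }
    return units;
}
```

For the bytes FF FE D1 9E 53 4F this yields the two code units 0x9ED1 and 0x4F53.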

Regards,
Lingfa


--
 [ signature omitted ] 

Message 10 in thread

	 Lingfa Yang writes

> Here is the whole code:
> 
> #include <QFile>
> int main(int argc, char *argv[])
> {
> QFile file("./1.txt");
> file.open(QFile::WriteOnly);
> 
> //char text[] = "éä"; // "??"
> //file.write(text, strlen(text)); // output is wrong!
> 
> char t[7]; t[0] = 0xFF; t[1] = 0xFE; t[2] = 0xD1; t[3]=0x9E; t[4]=0x53; 
> t[5]=0x4F; t[6]='\0';
> file.write(t, 6); // output is correct.
> 
> file.close();
> 
> return 1;
> }

These Chinese characters are U+9ED1 and U+4F53, whose UTF-8 encodings are 
0xE9 0xBB 0x91 and 0xE4 0xBD 0x93. Notice that this is not what you're 
putting in your t[] array above, so could it be that your expectation of 
the output is wrong in the first place? In the commented-out call to 
file.write(), what output bytes are you expecting, and what are you 
getting instead?
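
The UTF-8 bytes quoted above can be derived from the code points by hand. The sketch below does that with a hypothetical encodeUtf8() helper; it assumes the input is a valid code point up to U+10FFFF and does not reject surrogates.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp <= 0x10FFFF; surrogate values are not rejected.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx followed by 3 x 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Both characters fall in the three-byte range, which is why a correct UTF-8 file containing just these two characters is 6 bytes long.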

Also, assuming that the source code is in utf8 and you want the 
compiled form to be in utf8 as well, it's also necessary that the 
compiler behaves as expected. What happens if you compile and run this:

#include <stdio.h>
int main(int argc, char** argv)
{
  const char t[]="éä";
  for (unsigned int i=0; i<sizeof(t); ++i) {
    printf("%02hhx ", t[i]); // print each byte as two hex digits
  }
}

-- 
 [ signature omitted ] 

Message 11 in thread

> In the commented call to file.write(), what output bytes are you 
> expecting, and what are you getting instead?

I expected to see the two Chinese characters as they appear in the editor, 
but instead I saw two question marks.

> Also, assuming that the source code is in utf8 and you want the 
> compiled form to be in utf8 as well, it's also necessary that the 
> compiler behaves as expected. What happens if you compile and run this:
>
> #include <stdio.h>
> int main(int argc, char** argv)
> {
>  const char t[]="éä";
>  for (unsigned int i=0; i<sizeof(t); ++i) {
>    printf("%02hhx ", t[i]);
>  }
> }
>

I compiled and ran your code. On the console, I saw nothing.
I checked the main.cpp file and found that it is UTF-16 :-(
Thank you,
Lingfa



--
 [ signature omitted ] 

Message 12 in thread

	 Lingfa Yang writes

> I compiled and run your code. In console, I saw nothing.
> I checked the main.cpp file and found it is UTF-16 :-(

Up in the thread I see you're using VS2005. Have you tried to specify 
an explicit UTF-8 encoding for your source file in the VS IDE? (see 
"File/Advanced save options")? I remember it has worked for me in the 
past when troubleshooting encoding issues. I believe that this option 
is not just for the editor, it's forwarded to the compiler as well.

Otherwise, don't you get compilation warnings about using an inadequate 
codepage?

-- 
 [ signature omitted ] 

Message 13 in thread

>  (see "File/Advanced save options")?

Yes, this is crucial! Choosing the correct encoding there solved the 
problem.

Thank you very much,
Lingfa

--
 [ signature omitted ] 

Message 14 in thread

> //char text[] = "éä"; // "??"
> //file.write(text, strlen(text)); // output is wrong!
The write() function takes the number of *bytes* to be written. So, if strlen()
calculates the length of the string in UTF-8 characters (likely on *nix systems),
write() will write (sorry for the tautology :) ) 2 bytes instead of 6.

-- 
 [ signature omitted ] 

Message 15 in thread

Constantin Makshin wrote:
>> //char text[] = "éä"; // "??"
>> //file.write(text, strlen(text)); // output is wrong!
>
>write() function receives number of *bytes* to be written. So, if
> strlen() calculates length of the string in UTF-8 format (likely for
> *nix systems), write() will write (sorry for tautology :) ) 2 bytes
> instead of 6.

strlen() returns the number of bytes in the 8-bit string. It doesn't care 
about encoding.

If you want to know the number of characters, either use QString and then 
toUcs4(), or use mbstowcs().

-- 
 [ signature omitted ] 
