Hi,
QFile file("./1.txt");
file.open(QFile::WriteOnly);
file.write(QString::fromUtf8("黑体").toUtf8());
file.close();
Why is the output always "??"?
Thanks,
Lingfa
--
[ signature omitted ]
On Fri, Feb 01, 2008 at 04:32:09PM -0500, Lingfa Yang wrote:
> Hi,
>
> QFile file("./1.txt");
> file.open(QFile::WriteOnly);
> file.write(QString::fromUtf8("黑体").toUtf8());
> file.close();
>
Are you sure that your character sequence is really UTF-8? My mail client
shows two rectangles.
Your code works well with Russian characters in both UTF-8 and KOI8-R
locales (after replacing the string with valid UTF-8).
--
[ signature omitted ]
On Friday 01 February 2008 16:32, Lingfa Yang wrote:
> Hi,
>
> QFile file("./1.txt");
> file.open(QFile::WriteOnly);
> file.write(QString::fromUtf8("??").toUtf8());
> file.close();
>
> Why the output is always "??"?
"X" is a const char*
L"X" is a const wchar_t*
So, you should be using L" ... ".
--
[ signature omitted ]
Lingfa Yang wrote:
>Hi,
>
>QFile file("./1.txt");
>file.open(QFile::WriteOnly);
>file.write(QString::fromUtf8("黑体").toUtf8());
>file.close();
>
>Why the output is always "??"?
First of all, if x is a valid UTF-8 string, QString::fromUtf8(x).toUtf8()
== x. So your code above is pointless: you're converting your two
characters from UTF-8 to UTF-16, then back, only to get your two
characters again. If you just wanted to write those two characters, you
could have done exactly that:
file.write("黑体");
or, even better:
char msg[] = "黑体";
file.write(msg, strlen(msg)); // avoids a QByteArray temporary
Second, the four lines above do not constitute a compilable example.
Please give us the rest of the *small* program.
When dealing with character encoding, please indicate the hex dump of what
you would have liked to see and what you actually got.
--
[ signature omitted ]
Thiago Macieira wrote:
> Lingfa Yang wrote:
>
>> Hi,
>>
>> QFile file("./1.txt");
>> file.open(QFile::WriteOnly);
>> file.write(QString::fromUtf8("黑体").toUtf8());
>> file.close();
>>
>> Why the output is always "??"?
>>
>
> First of all, if x is a valid UTF-8 string, QString::fromUtf8(x).toUtf8()
> == x. So your code above is pointless: you're converting your two
> characters from UTF-8 to UTF-16, then back, only to get your two
> characters again. If you just wanted to write those two characters, you
> could have done exactly that:
>
> file.write("黑体");
> or, even better:
> char msg[] = "黑体";
> file.write(msg, strlen(msg)); // avoids a QByteArray temporary
>
> Second, the four lines above do not constitute a compilable example.
> Please give us the rest of the *small* program.
>
> When dealing with character encoding, please indicate the hex dump of what
> you would have liked to see and what you actually got.
>
>
Here is the whole code:
#include <QFile>
int main(int argc, char *argv[])
{
    QFile file("./1.txt");
    file.open(QFile::WriteOnly);
    //char text[] = "黑体"; // "??"
    //file.write(text, strlen(text)); // output is wrong!
    char t[7]; t[0] = 0xFF; t[1] = 0xFE; t[2] = 0xD1; t[3]=0x9E; t[4]=0x53;
    t[5]=0x4F; t[6]='\0';
    file.write(t, 6); // output is correct.
    file.close();
    return 0; // 0 reports success to the caller
}
The text is two Chinese characters, copied from the document.xml inside
a docx file. The original is:
<a:ea typeface="黑体" pitchFamily="2" charset="-122" />
--
[ signature omitted ]
Hi,
> [...]
> //char text[] = "黑体"; // "??"
> [...]
> char t[7]; t[0] = 0xFF; t[1] = 0xFE; t[2] = 0xD1; t[3]=0x9E; t[4]=0x53;
> t[5]=0x4F; t[6]='\0';
> [...]
I understand the character sequence put in 't' is encoded using UTF-8 as
expected. Good.
If the characters put in 'text' are different, then clearly the encoding
of the source file is different. Check the encoding used by your editor
to start with. Then check the encoding expected by the compiler. At
least one of them is not UTF-8.
--
[ signature omitted ]
> Then check the encoding expected by the compiler. At least one of them
> is not UTF-8.
You are right. It is not UTF-8. It is UTF-16 (starts with FF FE).
Thank you,
Lingfa
--
[ signature omitted ]
Lingfa Yang wrote:
>char t[7]; t[0] = 0xFF; t[1] = 0xFE; t[2] = 0xD1; t[3]=0x9E; t[4]=0x53;
>t[5]=0x4F; t[6]='\0';
>file.write(t, 6); // output is correct.
The encoding above is not UTF-8. It's UTF-16 Little Endian with a Byte
Order Mark (BOM): FF FE is the little-endian BOM, and the byte pairs
D1 9E and 53 4F read as U+9ED1 and U+4F53.
--
[ signature omitted ]
Thiago Macieira wrote:
> Lingfa Yang wrote:
>
>> char t[7]; t[0] = 0xFF; t[1] = 0xFE; t[2] = 0xD1; t[3]=0x9E; t[4]=0x53;
>> t[5]=0x4F; t[6]='\0';
>> file.write(t, 6); // output is correct.
>>
>
> The encoding above is not UTF-8. It's UTF-16 Little Endian with a Byte
> Order Mark (BOM).
>
Yes, you're right! I took a hex editor, opened the main.cpp file and
found that it starts with "FF FE" (UTF-16) instead of "EF BB BF" (UTF-8).
My editor is VS2005. It switches the file from ASCII to UTF-16
automatically when I add Chinese characters, and I was not aware of the
change.
Regards,
Lingfa
--
[ signature omitted ]
Lingfa Yang writes
> Here is the whole code:
>
> #include <QFile>
> int main(int argc, char *argv[])
> {
>     QFile file("./1.txt");
>     file.open(QFile::WriteOnly);
>
>     //char text[] = "黑体"; // "??"
>     //file.write(text, strlen(text)); // output is wrong!
>
>     char t[7]; t[0] = 0xFF; t[1] = 0xFE; t[2] = 0xD1; t[3]=0x9E; t[4]=0x53;
>     t[5]=0x4F; t[6]='\0';
>     file.write(t, 6); // output is correct.
>
>     file.close();
>
>     return 0;
> }
These Chinese characters are U+9ED1 and U+4F53, whose UTF-8 encodings are
0xE9 0xBB 0x91 and 0xE4 0xBD 0x93. Notice that this is not what you're
putting in your t[] array above, so could it be that your expectation
of the output is wrong in the first place? In the commented-out call to
file.write(), what output bytes are you expecting, and what are you
getting instead?
Also, assuming that the source code is in UTF-8 and you want the
compiled form to be in UTF-8 as well, the compiler must also behave as
expected. What happens if you compile and run this:
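#include <stdio.h>
int main(int argc, char** argv)
{
    const char t[] = "黑体";
    /* Print every byte of the literal, including the terminating NUL. */
    for (unsigned int i = 0; i < sizeof(t); ++i) {
        printf("%02hhx ", t[i]);
    }
    printf("\n");
    return 0;
}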
--
[ signature omitted ]
> In the commented call to file.write(), what output bytes are you
> expecting, and what are you getting instead?
I expected to see the two Chinese characters as they appear in the
editor, but instead I saw two question marks.
> Also, assuming that the source code is in utf8 and you want the
> compiled form to be in utf8 as well, it's also necessary that the
> compiler behaves as expected. What happens if you compile and run this:
>
> #include <stdio.h>
> int main(int argc, char** argv)
> {
>     const char t[] = "黑体";
>     for (unsigned int i = 0; i < sizeof(t); ++i) {
>         printf("%02hhx ", t[i]);
>     }
> }
>
I compiled and ran your code. In the console, I saw nothing.
I checked the main.cpp file and found it is UTF-16 :-(
Thank you,
Lingfa
--
[ signature omitted ]
Lingfa Yang writes
> I compiled and ran your code. In the console, I saw nothing.
> I checked the main.cpp file and found it is UTF-16 :-(
Up in the thread I see you're using VS2005. Have you tried to specify an
explicit UTF-8 encoding for your source file in the VS IDE (see
"File/Advanced save options")? I remember it has worked for me in the
past when troubleshooting encoding issues. I believe that this option is
not just for the editor, it's forwarded to the compiler as well.
Otherwise, don't you get compilation warnings about using an inadequate
codepage?
--
[ signature omitted ]
> (see "File/Advanced save options")?
Yes. This is crucial! Choosing the correct encoding solved the problem.
Thank you very much,
Lingfa
--
[ signature omitted ]
> //char text[] = "黑体"; // "??"
> //file.write(text, strlen(text)); // output is wrong!
The write() function receives the number of *bytes* to be written. So, if
strlen() calculates the length of the string in UTF-8 format (likely for
*nix systems), write() will write (sorry for the tautology :) ) 2 bytes
instead of 6.
--
[ signature omitted ]
Constantin Makshin wrote:
>> //char text[] = "黑体"; // "??"
>> //file.write(text, strlen(text)); // output is wrong!
>
>write() function receives number of *bytes* to be written. So, if
> strlen() calculates length of the string in UTF-8 format (likely for
> *nix systems), write() will write (sorry for tautology :) ) 2 bytes
> instead of 6.
strlen() returns the number of bytes in the 8-bit string. It doesn't care
about encoding.
If you want to know the number of characters, either use QString and then
toUcs4 or use mbstowcs.
--
[ signature omitted ]