Qt-interest Archive, July 2007
MacOS X and special characters in file names
Message 1 in thread
Dear all,
I hope some of you can help me with the following:
I scan a directory for files which I open one after the other. Assume
we have a QString filename
that holds the filename "foo.jpg". I can open the (image) file using
QImageReader imagereader(filename).
Works fine.
Now I have another file called "bàr.jpg" (note the accent). Using the
very same command imagereader(filename) produces a NULL image which
is not so good.
I have another method written in C that expects "char *" which I get
by using filename.toLatin8().data(),
which does not work. So QImageReader causes problems when it's
passed the QString.
Bottom line: Anyone with ideas for "localized filenames", preferably
on OS X? What about other platforms?
Thanks a lot in advance,
Shimaron
--
[ signature omitted ]
Message 2 in thread
Shimaron Greywolf wrote:
> by using filename.toLatin8().data(), which does not work.
You probably meant toLatin1(), but what exactly doesn't work? Do the
special characters become "garbled"?
Maybe you need to specify some other codec?
toLocal8Bit()
toAscii()
toUtf8()
also see QTextCodec
Cheers,
Peter
Message 3 in thread
Shimaron Greywolf wrote:
> Now I have another file called "bàr.jpg" (note the accent). Using the
> very same command imagereader(filename) produces a NULL image which is
> not so good.
>
> I have another method written in C that expects "char *" which I get by
> using filename.toLatin8().data(),
... which is almost certainly wrong. When interacting with the local
system you almost always want to use `toLocal8Bit()' which will use the
system locale to select the correct 8-bit encoding.
Have a look at the LANG and LC_ALL environment variables (or the output
of the `locale' command if Mac OS X provides it). You'll probably find
that you're not using a latin-1 (iso-8859-1) locale.
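The same check can be made from inside a program. The following is a minimal, POSIX-only sketch in plain C++ (no Qt); the helper name `localeCodeset` is made up for this illustration, and it only reports what the C library thinks the locale's encoding is, not how Qt picks its own locale codec.

```cpp
#include <clocale>
#include <string>
#include <langinfo.h>  // POSIX only; not available on Windows

// Return the name of the character encoding implied by the user's locale
// settings (LANG, LC_ALL, ...), for example "UTF-8" or "ISO-8859-1".
// Note: this changes the process-wide LC_CTYPE locale as a side effect.
std::string localeCodeset() {
    // A C/C++ program starts in the "C" locale; adopt the environment's.
    std::setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

If this reports something other than a latin-1 codeset, bytes produced by toLatin1() will not match what the system expects for non-ASCII characters.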
The only time you should really be using toUtf8() or toLatin1() is if
you specifically know that a certain function takes 8-bit data that it
expects to be in a particular encoding rather than the system's default
8-bit encoding. Library routines and similar that have such requirements
will generally document it clearly. It's also useful when writing files
that must be of a particular encoding (think XML with a utf-8
declaration) but then you're usually better off setting the codec in
your stream writer.
The rest of the time, use toLocal8Bit(). But think about every use and
where the data is going to make sure it's really the right thing.
Also beware of the implicit conversion provided by Qt from `const char*'
to QString. It's often a good idea to explicitly use
QString::fromLocal8Bit() (or whatever alternative is appropriate)
whenever you're converting 8-bit character data to a QString. This can
bite you when you call a function that takes a QString with 8-bit
character data if Qt converts the character data by assuming it's in one
encoding (say, latin-1) and it's really data in another (say, utf-8 from
a platform with a utf-8 locale).
In other words, if your OS assumes 8-bit character data is utf-8, and Qt
is using latin-1 as its default codec, the following:
QImageReader("bàr.jpg");
... actually does an implicit `const char*' to `QString' conversion with
the wrong text codec. It interprets the character data wrong and gives
you a mangled QString.
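To make the mangling concrete, here is a small sketch in plain C++ (no Qt) that mimics decoding bytes with the wrong codec: it treats every byte as one latin-1 character and writes the result back out as UTF-8. The function name `latin1ToUtf8` is hypothetical and only illustrates the effect; it is not how Qt implements the conversion.

```cpp
#include <string>

// Interpret each input byte as one latin-1 character (code point == byte
// value) and re-encode the resulting text as UTF-8.
std::string latin1ToUtf8(const std::string& bytes) {
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 0x80) {
            // ASCII is identical in latin-1 and UTF-8.
            out += static_cast<char>(c);
        } else {
            // U+0080..U+00FF need two bytes in UTF-8.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Feeding it the UTF-8 bytes of "bàr.jpg" (where 'à' is the two bytes 0xC3 0xA0) turns the single 'à' into two characters, 'Ã' followed by a no-break space. A plain ASCII name such as "foo.jpg" passes through unchanged, which is why the bug only shows up with accented names.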
You can explicitly set Qt's default text codec to the right one for your
platform, or you can be careful and explicit about all conversions at
Unicode <-> 8-bit boundaries.
Defining QT_NO_CAST_FROM_ASCII (as the macro is named in Qt 4) makes
things cumbersome, but forces you to
deal with all these nasty potential bugs explicitly, and may be worth
considering. It tells Qt that there is no default conversion from 8-bit
character data to Unicode and forces you to specify one wherever it's
required.
Personally I think the Qt documentation needs an explanation of 8-bit
text encodings and locale issues that covers things like
QT_NO_CAST_FROM_ASCII and the fact that just because using toLatin1() gets
you a `char*' doesn't mean it's getting you the right `char*' on all
platforms/locales. I've seen lots of these bugs in several apps; it's a
really common misunderstanding.
Message 4 in thread
Craig Ringer wrote:
> Shimaron Greywolf wrote:
>
>> Now I have another file called "bàr.jpg" (note the accent). Using the
>> very same command imagereader(filename) produces a NULL image which is
>> not so good.
>>
>> I have another method written in C that expects "char *" which I get by
>> using filename.toLatin8().data(),
>
> ... which is almost certainly wrong. When interacting with the local
> system you almost always want to use `toLocal8Bit()' which will use the
> system locale to select the correct 8-bit encoding.
I thought OS X uses UTF8 for POSIX-style file paths? So wouldn't using
toUtf8() be right?
Message 5 in thread
Paul Miller wrote:
> I thought OS X uses UTF8 for POSIX-style file paths? So wouldn't using
> toUtf8() be right?
Sure, on Mac OS X.
Then someone builds and runs your app on Windows, or on Linux, and
wonders why they can't open `café.txt'.
I wasted quite a bit of time tracking down bugs like this (caused by
lack of understanding of, or lack of care about, text encodings) in a
couple of apps. Since they're generally so easily avoided I see little
reason to get into a position where they can occur in the first place.
In this case, for example, you don't really specifically need a UTF-8
encoded string. That's not the end goal, only the immediate requirement
to achieve the goal. What you really want is "an 8-bit representation of
my unicode string that the current platform & locale will interpret the
same way I do." QString::toLocal8Bit() does the trick nicely for that, and on
Mac OS X it should be equivalent to QString::toUtf8().
Consider Linux. All that variety can be nice, but it's a royal pain as
well. Linux systems may run in a utf-8 locale (like Mac OS X) or they
may use a variety of nationality/language specific 8-bit text encodings.
Not all of these even have 7-bit ASCII as a common subset, e.g.
Shift-JIS, so you're not even truly safe if you're only reading and
writing ASCII. If your app uses something like QString::toUtf8() to get
a char* that it passes to a platform library call like fopen() it'll
work on some systems but not others. Sometimes even some user accounts
but not others. Fun.
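A rough sketch of why this breaks, in plain C++ without Qt: the same Unicode file name produces different bytes depending on the 8-bit encoding, so a name encoded as UTF-8 will not match a file whose name is stored in latin-1. The helper names `encodeUtf8` and `encodeLatin1` are made up for this illustration.

```cpp
#include <string>

// Encode Unicode code points as UTF-8 (no validation of surrogates or range).
std::string encodeUtf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Encode as latin-1 (ISO-8859-1); anything above U+00FF becomes '?'.
std::string encodeLatin1(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        out += (cp <= 0xFF) ? static_cast<char>(cp) : '?';
    }
    return out;
}
```

For `café.txt`, the 'é' becomes the two bytes 0xC3 0xA9 in UTF-8 but the single byte 0xE9 in latin-1, while a pure ASCII name like `foo.jpg` comes out identical in both. For file names specifically, Qt also offers QFile::encodeName(), which converts a QString to the 8-bit form expected by native file APIs.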
So ... why hard code such an unnecessary platform-specific assumption
and potential bug?
Message 6 in thread
Craig Ringer wrote:
> Paul Miller wrote:
>
>> I thought OS X uses UTF8 for POSIX-style file paths? So wouldn't using
>> toUtf8() be right?
>
> Sure, on Mac OX X.
>
> Then someone builds and runs your app on Windows, or on Linux, and
> wonders why they can't open `café.txt'.
Good point. But I was asking specifically about OS X. In my
cross-platform stuff, I use the toLocal8Bit() calls myself.
Message 7 in thread
Paul Miller wrote:
> Craig Ringer wrote:
>> Paul Miller wrote:
>>
>>> I thought OS X uses UTF8 for POSIX-style file paths? So wouldn't using
>>> toUtf8() be right?
>>
>> Sure, on Mac OX X.
>>
>> Then someone builds and runs your app on Windows, or on Linux, and
>> wonders why they can't open `café.txt'.
>
> Good point. But I was asking specifically about OS X. In my
> cross-platform stuff, I use the toLocal8Bit() calls myself.
Fair enough. I was going to suggest that there's really no difference,
but a quick look at QString shows that QString::toUtf8() uses a
different implementation for the conversion than QString::toLocal8Bit().
The latter uses a QTextCodec as one would expect, but the former is
implemented directly in QString.
At a glance it seems like QString::toUtf8() does funny business to try
to preserve latin-1 strings through QString::fromUtf8("blah").toUtf8().
I'm far from sure that interpretation is correct, however, as it's
mostly based on the comments and my experience seeing apps getting away
with the horrible torture of QString's encoding conversions without
apparent ill effects. (It'd be nice if a debug Qt build detected and
warned about this).
It seems that QTextCodec doesn't try anything like that. So there is an
(unexpected, at least to me) difference between QString::toUtf8() and
QString::toLocal8Bit() in a utf-8 locale. Well, if you or another app
are already doing horrible things to strings that might cause the
QString::toUtf8 special case to be hit, anyway.
I'd be very interested in finding out definitively why this difference
exists and what other possible effects might result from it, actually.