
Qt-interest Archive, April 2007
UTF-16 native storage of QString - what about a UTF-8 option?


Message 1 in thread

Hey
 
I was modifying the QSQLiteDriver for some optimizations I wanted to
make (we need to optimize for data storage size) and came to realize
that QStrings store their data internally as UTF-16 (unsigned short*),
which explains why the driver deals exclusively with the sqlite3_*16()
functions.
 
Now I understand the bias of UTF-8 toward Western character sets, how
Java/C# support of native UTF-16 creates a momentum in that direction,
and the fact that modern systems are not fazed by a doubling in string
size (for ASCII text), but it looks to me like Qt tries to be encoding
agnostic in its API.
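
To put numbers on the size argument, here is a minimal sketch in plain standard C++ (no Qt). The helper names utf8Bytes and utf16Bytes are hypothetical and exist only for this illustration; they count the bytes each encoding needs for a sequence of code points, without any terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to store the code points as UTF-8 (no terminator).
std::size_t utf8Bytes(const std::u32string &text)
{
    std::size_t n = 0;
    for (char32_t cp : text) {
        if (cp < 0x80)         n += 1;  // ASCII
        else if (cp < 0x800)   n += 2;  // e.g. Latin-1 supplement, Greek, Cyrillic
        else if (cp < 0x10000) n += 3;  // rest of the BMP, e.g. CJK
        else                   n += 4;  // supplementary planes
    }
    return n;
}

// Bytes needed to store the code points as UTF-16 (no terminator).
std::size_t utf16Bytes(const std::u32string &text)
{
    std::size_t n = 0;
    for (char32_t cp : text)
        n += (cp < 0x10000) ? 2 : 4;    // one code unit, or a surrogate pair
    return n;
}
```

For pure ASCII text UTF-16 takes twice the space of UTF-8, while for text made of three-byte UTF-8 characters (most CJK) UTF-16 is the smaller of the two.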
 
So would it be possible to create a build option that makes QString use
UTF-8 as its default data storage (thus QString::toUtf8() would be
cheap while QString::utf16() would be expensive)? This seems like an
optimization preference app developers could heartily use. Obviously the
current UTF-16 option would still be the default and preferred method
for people who are targeting a strong international user base, but it
would be nice to have a choice.
 
Does anybody else feel similarly? Or strongly reject such a suggestion? 
 
I guess the major hurdle may be the technical one, as Qt internals may
rely heavily on UTF-16 based assumptions.
--
 [ signature omitted ] 

Message 2 in thread

"Gabe F. Rudy" <rudy@xxxxxxxxxxxxxxx> wrote in message 
news:4BF696DFEB9D674EB9C92FCFC7E3AA811EABDD@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> I was modifying the QSQLiteDriver for some optimizations I wanted to
> make (we need to optimize for data storage size) and came to realize
> that QStrings store their data internally as UTF-16 (unsigned short*)
> which accounted for dealing exclusively with sqlite_*16() methods.
>
> Now I understand the bias of UTF-8 to western character sets and how
> Java/C# support of native UTF-16 creates a momentum in that direction
> and the fact that modern systems are not phased by a doubling in string
> size (for ASCII text), but it looks to me like QT tries to be encoding
> agnostic in their API .
>
> So would it be possible to create a build option that makes QString use
> UTF-8 as their default data storage (thus QString::toUtf8() would be
> cheap while QString::utf16() would be expensive). This seems like a
> optimization preference app developers could heartily use. Obviously the
> current UTF-16 option would be still the default and preferred method
> for people who are targeting a strong international user base, but it
> would be nice to have a choice.
>
> Does anybody else feel similarly? Or strongly reject such a suggestion?
>
> I guess the major hurdle may be the technical one as QT internals may
> rely heavily on UTF-16 based assumptions.

Indeed, and who owns the memory allocated for the utf16 representation? 
The function would need to return a QVector<ushort> for this.
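
The ownership question can be illustrated with a minimal sketch in plain standard C++. Utf8String is a hypothetical class invented for this example, not a Qt type, and std::vector<std::uint16_t> stands in for QVector<ushort>. The sketch assumes well-formed UTF-8 input and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical string class that stores its text as UTF-8.
class Utf8String {
public:
    explicit Utf8String(std::string utf8) : data_(std::move(utf8)) {}

    // Cheap: hands out a reference to the internal buffer.
    const std::string &toUtf8() const { return data_; }

    // Expensive: builds a new buffer on every call. Returning it by value
    // makes the caller the owner, so no raw pointer can dangle.
    std::vector<std::uint16_t> utf16() const
    {
        std::vector<std::uint16_t> out;
        out.reserve(data_.size());
        std::size_t i = 0;
        while (i < data_.size()) {
            const unsigned char b0 = static_cast<unsigned char>(data_[i]);
            std::uint32_t cp;
            std::size_t len;
            if (b0 < 0x80)      { cp = b0;        len = 1; }
            else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
            else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
            else                { cp = b0 & 0x07; len = 4; }
            // Fold in the continuation bytes (6 payload bits each).
            for (std::size_t k = 1; k < len && i + k < data_.size(); ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(data_[i + k]) & 0x3F);
            i += len;
            if (cp < 0x10000) {
                out.push_back(static_cast<std::uint16_t>(cp));
            } else {
                // Code points above the BMP become a surrogate pair.
                cp -= 0x10000;
                out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
                out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
            }
        }
        return out;
    }

private:
    std::string data_;
};
```

With UTF-16 storage, utf16() can return a pointer into the string's own buffer. With UTF-8 storage there is no such buffer to point into, which is why the converted data has to travel in a container that the caller owns.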

Since all Win32 APIs take UTF16 strings (or locale-specific encoding), 
storing text in any encoding other than UTF16 or locale (which is only 
used on Win9x anyway) would actually introduce a lot of overhead at least 
on this platform.


Volker


--
 [ signature omitted ] 

Message 3 in thread

> Indeed, and who owns the memory allocated for the utf16
> representation?
> The function would need to return a QVector<ushort> for this.
>
> Since all Win32 APIs take UTF16 strings (or locale-specific encoding),
> storing text in any encoding other than UTF16 or locale (which is only
> used on Win9x anyway) would actually introduce a lot of overhead at
> least on this platform.

Very good point; that pretty much kills the idea. I bet the situation
is similar on the Mac, where the native OS X calls expect 2-byte
Unicode strings of some form.

KDE on Linux, being based on Qt, would be the only one to benefit, and
only if the whole environment were compiled with that flag (although
some distros like DSL might enjoy such an option if it resulted in
real-world memory savings).

--
 [ signature omitted ] 

Message 4 in thread

> "Gabe F. Rudy" <rudy@xxxxxxxxxxxxxxx> wrote in message
> news:4BF696DFEB9D674EB9C92FCFC7E3AA811EABDD@xxxxxxxxxxxxxxxxxxxxxxxxxxxx..
> > I was modifying the QSQLiteDriver for some optimizations I wanted to
> > make (we need to optimize for data storage size) and came to realize
> > that QStrings store their data internally as UTF-16 (unsigned short*)
> > which accounted for dealing exclusively with sqlite_*16() methods.
> >
> > Now I understand the bias of UTF-8 to western character sets and how
> > Java/C# support of native UTF-16 creates a momentum in that direction
> > and the fact that modern systems are not phased by a doubling in string
> > size (for ASCII text), but it looks to me like QT tries to be encoding
> > agnostic in their API .
> >
> > So would it be possible to create a build option that makes QString use
> > UTF-8 as their default data storage (thus QString::toUtf8() would be
> > cheap while QString::utf16() would be expensive). This seems like a
> > optimization preference app developers could heartily use. Obviously the
> > current UTF-16 option would be still the default and preferred method
> > for people who are targeting a strong international user base, but it
> > would be nice to have a choice.
> >
> > Does anybody else feel similarly? Or strongly reject such a suggestion?
> >
> > I guess the major hurdle may be the technical one as QT internals may
> > rely heavily on UTF-16 based assumptions.
>
> Indeed, and who owns the memory allocated for the utf16 representation?
> The function would need to return a QVector<ushort> for this.
>
> Since all Win32 APIs take UTF16 strings (or locale-specific encoding),
> storing text in any encoding other than UTF16 or locale (which is only
> used on Win9x anyway) would actually introduce a lot of overhead at least
> on this platform.
>
>
> Volker

Not if you tied the UTF-8 default behavior to calling the non-Unicode
versions of the Win32 API.

As the OP stated, there are people who would want the memory footprint
benefits and who don't have Unicode language needs.

Scott

--
 [ signature omitted ] 

Message 5 in thread

> > Since all Win32 APIs take UTF16 strings (or locale-specific encoding),
> > storing text in any encoding other than UTF16 or locale (which is only
> > used on Win9x anyway) would actually introduce a lot of overhead at least
> > on this platform.
> >
> >
> > Volker
>
> Not if you tied the utf8 default behavior to call the non-uni versions
> of the win32 code.

The 'A' versions of the Win32 APIs do not expect UTF-8, though. They
expect text in the encoding of the current locale, which is some Win32
code page, not UTF-8. So you would have to convert UTF-8 -> UTF-16 ->
locale.
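
The two-step conversion can be sketched portably as follows. This is only an illustration under stated assumptions: Latin-1 stands in for "the current code page", the function names are hypothetical, and the input is assumed to be well-formed UTF-8. On Windows itself the two steps would typically be MultiByteToWideChar with CP_UTF8 followed by WideCharToMultiByte with CP_ACP.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Step 1: UTF-8 -> UTF-16. Assumes well-formed input; no validation.
std::u16string utf8ToUtf16(const std::string &utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b0 < 0x80)      { cp = b0;        len = 1; }
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
        else                { cp = b0 & 0x07; len = 4; }
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Step 2: UTF-16 -> single-byte code page (Latin-1 as a stand-in).
// Characters the code page cannot represent are replaced with '?'.
std::string utf16ToLatin1(const std::u16string &utf16)
{
    std::string out;
    for (std::size_t i = 0; i < utf16.size(); ++i) {
        const char16_t u = utf16[i];
        if (u < 0x100) {
            out.push_back(static_cast<char>(static_cast<unsigned char>(u)));
        } else {
            out.push_back('?');
            // Skip the low half of a surrogate pair so it yields one '?'.
            if (u >= 0xD800 && u <= 0xDBFF && i + 1 < utf16.size())
                ++i;
        }
    }
    return out;
}

// What every call into an 'A' API would cost with UTF-8 storage.
std::string utf8ToLocal(const std::string &utf8)
{
    return utf16ToLatin1(utf8ToUtf16(utf8));
}
```

Besides the extra work per call, the second step is lossy: any character outside the code page is replaced, which a UTF-16 string passed to the 'W' APIs would not suffer.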


Volker


--
 [ signature omitted ]