Qt-interest Archive, October 2007
How to parse an html?
Message 1 in thread
All XML players,
Not all tags in HTML files come in open/close pairs.
For example, a line break tag is <br>, not <br />;
an image tag is <img src="1.jpg">, not <img src="1.jpg" />.
Web browsers display them correctly, but when I pass the document to
QDomDocument's setContent, these tags cause a parse error.
Is there a way to skip or auto-fix these unmatched tags?
Is a DOM parser the right parser for sick HTML?
Thanks,
Lingfa
--
[ signature omitted ]
Message 2 in thread
Lingfa Yang wrote:
> All XML players,
> ...
> Is DOM parser a correct parser to parse sick htmls?
No. "Sick HTML" != "Valid XML"
;)
Cheers, Oliver
--
[ signature omitted ]
Message 3 in thread
>>
>> Is DOM parser a correct parser to parse sick htmls?
>
>
> No. "Sick HTML" != "Valid XML"
>
> ;)
>
Oliver,
You are right that sick HTML is not valid XML. But it is everywhere,
and we have to face it.
I wonder why web browsers (IE or Firefox) are able to parse it;
probably they have an internal correction mechanism. I
wish Qt's parser could gain that power. Maybe it is time to subclass
QDomDocument and override setContent.
Thanks,
Lingfa
--
[ signature omitted ]
Message 4 in thread
Lingfa Yang wrote:
>
>>>
>>> Is DOM parser a correct parser to parse sick htmls?
>>
>>
>> No. "Sick HTML" != "Valid XML"
>>
>> ;)
>>
> Oliver,
>
> You are right the sick html is not a valid xml. But they are everywhere,
> and we have to face them.
That's black magic, but you might want to read about "quirks mode" vs.
"standards mode" (or "strict mode"), for example here:
http://en.wikipedia.org/wiki/Quirks_mode
Note that when you really /force/ your browser into
standards/strict (XML) mode by setting the proper DOCTYPE *and* the
proper MIME type "application/xhtml+xml" (1), for example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>...</head>
<body>...</body>
</html>
then you will *also* get plain XML error messages if your document
(which claims to be of MIME type "application/xhtml+xml") has a single
mistake!
(1) How to set the MIME type to "application/xhtml+xml": Read
http://juicystudio.com/article/content-negotiation.php (A) and
http://www.developershome.com/wap/wapServerSetup/tutorial.asp?page=settingUpMIME
(B)
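As a rough illustration of the content negotiation that (A) describes, here is a minimal sketch in plain C++. The function name chooseContentType is hypothetical and not part of Qt or of any web server API; it only shows the idea of sending "application/xhtml+xml" to clients that advertise it in their HTTP Accept header and falling back to "text/html" otherwise. It assumes a simple case-sensitive substring check and ignores q-values, which a real server should honour.

```cpp
#include <string>

// Hypothetical helper: choose the Content-Type for an XHTML page from the
// client's HTTP Accept header.
// Simplifications (assumptions of this sketch): a case-sensitive substring
// search is used, and q-values such as "application/xhtml+xml;q=0" are ignored.
std::string chooseContentType(const std::string& acceptHeader)
{
    const std::string xhtml = "application/xhtml+xml";
    if (acceptHeader.find(xhtml) != std::string::npos) {
        return xhtml;    // client advertises XHTML: it will parse the page as XML
    }
    return "text/html";  // fallback for clients that do not list the XHTML type
}
```

A client that receives the page as "text/html" will use its forgiving HTML parser, so the strict XML error messages only show up for clients served the XHTML type.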
For testing purposes you can also give the file the extension
".xhtml" instead of ".html". This tells your (Firefox/Opera) browser
to use the pure XML parser internally!
Also note that IE won't like this, since IE does *not* handle
"application/xhtml+xml" documents properly, but offers a "Save as..."
box instead. Quoting (A):
"Most modern browsers, with the exception of Internet Explorer 6,
support the MIME type application/xhtml+xml."
I think IE 7 does not like "application/xhtml+xml" either.
In fact, your XHTML document rendered in "strict" mode might
really look slightly different compared to "quirks mode". This is
because XHTML defines some properties, such as the baseline for images,
differently from HTML 4.01, for example.
You might also consider this page your friend:
http://validator.w3.org/
> I wonder why web browsers (IE or Firefox) have the capability to parse
> them, or probably, web browsers have internal correction mechanism.
That's a fact: web browsers have *different* parsers built in, and
depending on the DOCTYPE/MIME type one or the other is used.
> wish Qt's parser can gain that power. Maybe it is time to subclass
> QDomDocument and overwrite setContent.
Well... that would take a looooooooong time :) - good luck! - and
eventually you would end up with something like Gecko or WebKit! That
would be a *huge* subclass ;)
Maybe you wish to wait for Qt 4.4 instead, which to my understanding
will provide a WebKit interface! At least you can /render/ (X)HTML pages
then; I am not sure whether it gives you access to the parser...
http://en.wikipedia.org/wiki/WebKit
Cheers, Oliver
--
[ signature omitted ]
Message 5 in thread
Oliver,
Your answers are always accurate, thoughtful, and full of useful
information.
Thanks a lot,
Best regards,
Lingfa
--
[ signature omitted ]
Message 6 in thread
> Not all tags in htmls have open-close paired.
Right
> For example, a line break tag is <br>, not <br />;
> an image tag is <img src="1.jpg">, not <img src="1.jpg" />
That's what XHTML is for :-) ... then you must use <br /> :-)
> Web Browser display them correctly, but when I setContent to a
> QDomDocument, these tags bring parse error.
Which is correct, because HTML (without the X) is not valid XML.
It's not just a question of missing closing tags; there are more problems,
such as the case insensitivity of HTML ....
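To make the two problems above concrete, here is a minimal sketch in plain C++ of the nesting check that any XML parser effectively performs. It is not Qt's parser and not a full well-formedness check: the function name isWellNested is hypothetical, tags are found with a simple regular expression, and comments, CDATA sections and ">" inside attribute values are not handled. It shows why an unclosed <br> and a <P>...</p> pair are both rejected once tag names are compared case-sensitively.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns true if every start tag has a matching end tag
// in the right order, comparing names case-sensitively as XML does.
// Self-closing tags such as <br /> open and close in one step.
bool isWellNested(const std::string& markup)
{
    // group 1: "/" for an end tag, group 2: tag name, group 3: "/" if self-closing
    static const std::regex tag(R"(<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*?(/?)>)");
    std::vector<std::string> open;  // stack of currently open element names

    for (std::sregex_iterator it(markup.begin(), markup.end(), tag), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        const bool closing = m[1].length() > 0;
        const bool selfClosing = m[3].length() > 0;
        const std::string name = m[2].str();

        if (closing) {
            // An end tag must match the most recently opened element exactly.
            if (open.empty() || open.back() != name) {
                return false;
            }
            open.pop_back();
        } else if (!selfClosing) {
            open.push_back(name);
        }
    }
    return open.empty();  // anything left on the stack was never closed
}
```

With this sketch, "<p>line<br />next</p>" passes, while "<p>line<br>next</p>" and "<P>text</p>" both fail, which mirrors the parse errors reported for the original HTML.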
> Is there a way to skip/auto-fix these open-close not-matched tags?
> Is DOM parser a correct parser to parse sick htmls?
AFAIK DOM is only usable for true XML, so the answer is no.
Regards,
Malte
--
[ signature omitted ]
Message 7 in thread
You should try QXmlStreamReader.
On 10/29/07, Lingfa Yang <lingfa@xxxxxxx> wrote:
> All XML players,
>
> Not all tags in htmls have open-close paired.
> For example, a line break tag is <br>, not <br />;
> an image tag is <img src="1.jpg">, not <img src="1.jpg" />
>
> Web Browser display them correctly, but when I setContent to a
> QDomDocument, these tags bring parse error.
>
> Is there a way to skip/auto-fix these open-close not-matched tags?
> Is DOM parser a correct parser to parse sick htmls?
>
> Thanks,
> Lingfa
--
[ signature omitted ]
Message 8 in thread
You could also try running your html through htmltidy to get an xhtml version
of your data and then process that with DOM/SAX/other parsers.
Sean
On Monday 29 October 2007, Lingfa Yang wrote:
> All XML players,
>
> Not all tags in htmls have open-close paired.
> For example, a line break tag is <br>, not <br />;
> an image tag is <img src="1.jpg">, not <img src="1.jpg" />
>
> Web Browser display them correctly, but when I setContent to a
> QDomDocument, these tags bring parse error.
>
> Is there a way to skip/auto-fix these open-close not-matched tags?
> Is DOM parser a correct parser to parse sick htmls?
>
> Thanks,
> Lingfa
--
[ signature omitted ]
Message 9 in thread
What do you actually want to *do* with the invalid HTML? Where does the HTML come from?
If you're aiming to extract data from invalid HTML rather than just displaying it, treating it as XML may be the wrong approach: you might be better off pattern matching with regular expressions.
On the other hand, if all the HTML is invalid in the same or similar ways, you may be able to use regular expressions to clean it up well enough for SAX or DOM parsing.
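The clean-up idea can be sketched as follows in plain C++, for the specific case raised in this thread (unclosed <br> and <img> tags). The function name selfCloseVoidElements and the list of elements it covers are assumptions made for illustration; the sketch does not lowercase tag names, quote attribute values, or cope with ">" inside attribute values, so real-world HTML will often need more than this before a DOM or SAX parser accepts it.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrites empty ("void") HTML elements such as <br> or
// <img src="1.jpg"> into the XML form <br /> or <img src="1.jpg" />.
// Tags that are already self-closed are left equivalent, so the function can
// be applied more than once without changing the result.
std::string selfCloseVoidElements(const std::string& html)
{
    // group 1: element name, group 2: attributes (without trailing whitespace)
    static const std::regex voidTag(
        R"(<(br|hr|img|input|meta|link)\b([^<>]*?)\s*/?>)",
        std::regex::icase);
    return std::regex_replace(html, voidTag, "<$1$2 />");
}
```

For example, the input "<p>a<br>b</p>" would come out as "<p>a<br />b</p>", which could then be handed to an XML parser, provided the rest of the markup is already well-formed.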
Sam Dutton
-----Original Message-----
From: Lingfa Yang [mailto:lingfa@xxxxxxx]
Sent: Monday 29 October 2007 14:50
To: Qt Interest
Subject: How to parse an html?
All XML players,
Not all tags in htmls have open-close paired.
For example, a line break tag is <br>, not <br />; an image tag is <img src="1.jpg">, not <img src="1.jpg" />
Web Browser display them correctly, but when I setContent to a QDomDocument, these tags bring parse error.
Is there a way to skip/auto-fix these open-close not-matched tags?
Is DOM parser a correct parser to parse sick htmls?
Thanks,
Lingfa
--
[ signature omitted ]