Qt-interest Archive, October 2007
How to parse an html?


Message 1 in thread

All XML players,

Not all tags in HTML come in open/close pairs.
For example, a line break tag is <br>, not <br />;
an image tag is <img src="1.jpg">, not <img src="1.jpg" />.

Web browsers display them correctly, but when I pass the HTML to
QDomDocument::setContent(), these tags cause a parse error.

Is there a way to skip or auto-fix these unmatched tags?
Is a DOM parser the right parser for sick HTML?

Thanks,
Lingfa



Message 2 in thread

Lingfa Yang wrote:
> All XML players,
> ...
> Is a DOM parser the right parser for sick HTML?

No. "Sick HTML" != "Valid XML"

;)

Cheers, Oliver


Message 3 in thread

>>
>> Is a DOM parser the right parser for sick HTML?
>
>
> No. "Sick HTML" != "Valid XML"
>
> ;)
>
Oliver,

You are right that sick HTML is not valid XML. But it is everywhere,
and we have to face it.

I wonder why web browsers (IE or Firefox) are able to parse it; presumably
they have an internal correction mechanism. I wish Qt's parser could gain
that power. Maybe it is time to subclass QDomDocument and override
setContent.

Thanks,
Lingfa


Message 4 in thread

Lingfa Yang wrote:
>
>>>
>>> Is a DOM parser the right parser for sick HTML?
>>
>>
>> No. "Sick HTML" != "Valid XML"
>>
>> ;)
>>
> Oliver,
>
> You are right that sick HTML is not valid XML. But it is everywhere,
> and we have to face it.

That's black magic, but you might want to read up on "quirks mode" vs.
"standards mode" (or "strict mode"), for example here:

   http://en.wikipedia.org/wiki/Quirks_mode

Note that when you really /force/ your browser into standards/strict
(XML) mode by setting the proper DOCTYPE *and* the proper MIME type
"application/xhtml+xml" (1), for example:

   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
       "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
   <html xmlns="http://www.w3.org/1999/xhtml">
     <head>...</head>
     <body>...</body>
   </html>

then you will *also* get plain XML error messages if your document 
(which claims to be of MIME type "application/xhtml+xml") has a single 
mistake!

(1) How to set the MIME type to "application/xhtml+xml": read

     http://juicystudio.com/article/content-negotiation.php (A) and
     http://www.developershome.com/wap/wapServerSetup/tutorial.asp?page=settingUpMIME (B)

For testing purposes you can also give the file the extension ".xhtml"
instead of ".html". This will tell your browser (Firefox/Opera) to use
the pure XML parser internally!

Also note that IE won't like this, since IE does *not* handle 
"application/xhtml+xml" documents properly, but offers a "Save as..." 
box instead. Quoting (A):

"Most modern browsers, with the exception of Internet Explorer 6, 
support the MIME type application/xhtml+xml."

I think IE 7 does not like "application/xhtml+xml" either.

In fact, your rendered XHTML document (rendered in "strict" mode) might
actually look slightly different from the same page in "quirks mode".
This is because XHTML defines some properties, such as the baseline for
images, differently from HTML 4.01, for example.

You might also consider this page your friend:

   http://validator.w3.org/

> I wonder why web browsers (IE or Firefox) are able to parse it; presumably
> they have an internal correction mechanism.

That's a fact: web browsers have *different* parsers built in; depending
on the DOCTYPE/MIME type, one or the other is used.
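
To make the "tolerant" side a bit more concrete, here is a minimal sketch. It is in Python rather than Qt, purely for illustration, and Python's html.parser is only a forgiving tokenizer, far simpler than what Gecko or WebKit do. The names TagCollector and collect_tags are made up for this example. The point is only that such a parser walks through tag soup without complaining about the unclosed <br> and <img>:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag the tokenizer sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))


def collect_tags(html):
    """Feed possibly malformed HTML and return the (tag, attributes) pairs."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Unclosed <br> and <img>, an upper-case tag, and no closing </p>.
sick_html = '<P>one<br>two<img src="1.jpg">'
tags = collect_tags(sick_html)
```

Note that this gives you a stream of tags, not a corrected DOM tree; building the tree the way a browser does is the hard part.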

> I wish Qt's parser could gain that power. Maybe it is time to subclass
> QDomDocument and override setContent.

Well... that would take a looooooooong time :) - good luck! - and
eventually you would end up with something like Gecko or WebKit! That
would be a *huge* subclass ;)

Maybe you wish to wait for Qt 4.4 instead, which to my understanding
will provide a WebKit interface! At least you can /render/ (X)HTML pages
then; not sure if it gives you access to the parser...

   http://en.wikipedia.org/wiki/WebKit

Cheers, Oliver


Message 5 in thread

Oliver,

Your answers are always accurate, thoughtful, and full of useful
information.
Thanks a lot,

Best regards,
Lingfa


Message 6 in thread

> Not all tags in HTML come in open/close pairs.
Right.

> For example, a line break tag is <br>, not <br />;
> an image tag is <img src="1.jpg">, not <img src="1.jpg" />.
That's what XHTML is for :-) ... then you must use <br /> :-)

> Web browsers display them correctly, but when I pass the HTML to
> QDomDocument::setContent(), these tags cause a parse error.
Which is correct, because HTML (without the X) is not valid XML.
It's not just a question of missing closing tags; there are more problems,
such as the case insensitivity of HTML ...

> Is there a way to skip or auto-fix these unmatched tags?
> Is a DOM parser the right parser for sick HTML?
AFAIK DOM is only usable for true XML - so the answer is no.
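
A small sketch of the points above, written in Python rather than Qt purely for illustration (xml.dom.minidom stands in for a strict XML DOM parser here; the helper name is_well_formed is made up). A strict parser should accept the XHTML-style markup and reject both the unclosed tags and the mixed-case tag pair:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup (hypothetical helper)."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


# HTML style: <br> and <img> are never closed.
html_style = '<p>line one<br>line two<img src="1.jpg"></p>'
# XHTML style: the empty elements are self-closed.
xhtml_style = '<p>line one<br />line two<img src="1.jpg" /></p>'
# Tag names differ only in case: fine for HTML, a mismatch for XML.
mixed_case = '<P>text</p>'

results = {
    "html_style": is_well_formed(html_style),
    "xhtml_style": is_well_formed(xhtml_style),
    "mixed_case": is_well_formed(mixed_case),
}
```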

Regards,
Malte


Message 7 in thread

You should try QXmlStreamReader.

On 10/29/07, Lingfa Yang <lingfa@xxxxxxx> wrote:
> All XML players,
> ...


Message 8 in thread

You could also try running your HTML through htmltidy to get an XHTML version
of your data and then process that with DOM/SAX/other parsers.
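
A rough sketch of that pipeline, in Python for illustration only. It assumes a command-line "tidy" executable is installed and on the PATH and that it accepts the -q, -asxhtml, -numeric and -utf8 options; the function name tidy_to_xhtml is made up. Treat it as a starting point and check the options against your tidy version:

```python
import shutil
import subprocess


def tidy_to_xhtml(html, tidy_exe="tidy"):
    """Pipe HTML through an external tidy binary and return its XHTML output.

    Hypothetical helper: assumes tidy_exe names a tidy executable on the PATH.
    """
    if shutil.which(tidy_exe) is None:
        raise FileNotFoundError("tidy executable not found: %s" % tidy_exe)

    result = subprocess.run(
        [tidy_exe, "-q", "-asxhtml", "-numeric", "-utf8"],
        input=html,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    # tidy conventionally exits with 1 for warnings and 2 for errors;
    # warnings are expected for sick HTML, so only errors are fatal here.
    if result.returncode >= 2:
        raise RuntimeError("tidy failed: %s" % result.stderr.strip())
    return result.stdout
```

The returned string can then be handed to whichever XML parser you prefer.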

Sean


Message 9 in thread

What do you actually want to *do* with the invalid HTML?  Where does the HTML come from?

If you're aiming to extract data from invalid HTML rather than just displaying it, treating it as XML may be the wrong approach: you might be better off pattern matching with regular expressions.
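
As a minimal sketch of the pattern-matching route (Python used for illustration; the names IMG_SRC_RE and image_sources are made up, and the pattern only handles quoted src values), pulling the image URLs out of tag soup could look like this:

```python
import re

# Matches the quoted src value of an <img> tag, whether or not the tag is closed.
IMG_SRC_RE = re.compile(
    r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def image_sources(html):
    """Return the src values of all <img> tags found in the text."""
    return IMG_SRC_RE.findall(html)


sick_html = '<p>x<br><img src="1.jpg"><IMG alt="a" SRC=\'2.png\'></p>'
sources = image_sources(sick_html)
```

A pattern like this is easy to fool (attributes such as data-src, or a ">" inside an attribute value), so it only suits input whose quirks you already know.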

On the other hand, if all the HTML is invalid in the same or similar ways, you may be able to use regular expressions to clean it up well enough for SAX or DOM parsing.
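
For the clean-up route, here is a sketch that assumes the only problem is unclosed empty elements such as <br> and <img> (Python for illustration; VOID_TAGS, close_void_tags and the tag list itself are assumptions for this example). It rewrites those tags in self-closed, lower-case form and then hands the result to a strict XML DOM parser:

```python
import re
from xml.dom import minidom

# Assumed list of empty elements to fix; extend it to match your input.
VOID_TAGS = ("br", "hr", "img", "input", "meta", "link")

_VOID_RE = re.compile(
    r"<(%s)\b([^<>]*?)\s*/?>" % "|".join(VOID_TAGS),
    re.IGNORECASE,
)


def close_void_tags(html):
    """Rewrite <br>, <img ...> and friends as self-closed, lower-case tags."""
    return _VOID_RE.sub(
        lambda m: "<%s%s />" % (m.group(1).lower(), m.group(2)),
        html,
    )


sick_html = '<p>a<br>b<IMG src="1.jpg"></p>'
cleaned = close_void_tags(sick_html)

# The cleaned text is now well-formed, so a strict DOM parser accepts it.
doc = minidom.parseString(cleaned)
first_img_src = doc.getElementsByTagName("img")[0].getAttribute("src")
```

Tags that are already self-closed pass through unchanged, but anything beyond this one kind of error (unquoted attributes, mismatched case on paired tags, unclosed <p>) would need further rules or a proper tidying tool.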

Sam Dutton
