Qt-interest Archive, January 2008
Parse a huge XML file

Message 1 in thread

I have to read a 60GB XML file.

Since this is too big to fit in memory, I want to build an index on it. 
Is there a way with Qt classes to parse XML data without holding the whole 
document in memory, for instance with a non-blocking parser that 
sends an event at each XML tag it encounters, but without storing the data 
incrementally in a QDomDocument?



Thanks,

Etienne

--
 [ signature omitted ] 

Message 2 in thread

Hi,

> Since this is too big to fit in memory, I want to build an index on it. 
> Is there a way with Qt classes to parse XML data without having all 
> document in memory, for instance with a non blocking parser that will 
> send events at each xml tag he encounters, but without storing data 
> incrementally in a QDomDocument?

Use the SAX parser instead of DOM:
	http://doc.trolltech.com/4.3/qtxml.html#the-qt-sax2-classes

--
 [ signature omitted ] 

Message 3 in thread

Dimitri wrote:
> Hi,
> 
>> Since this is too big to fit in memory, I want to build an index on
>> it. Is there a way with Qt classes to parse XML data without having
>> all document in memory, for instance with a non blocking parser that
>> will send events at each xml tag he encounters, but without storing
>> data incrementally in a QDomDocument?
> 
> Use the SAX parser instead of DOM:
>     http://doc.trolltech.com/4.3/qtxml.html#the-qt-sax2-classes

That's certainly a good start, but bear in mind that if
you are parsing an XML doc, you will probably be building
some kind of internal representation of it, and for a 60GB
document, that internal representation may still be huge.

OTOH, if you're merely searching through the doc to detect
the presence or absence of some particular element, or to
gather light-weight statistics, or to modify it and regenerate
it on the fly, SAX is the way to go.

-- 
 [ signature omitted ] 

Message 4 in thread

Yes, as the data is huge, I want to build an index of the byte position of 
tags in the XML file, so that I can do fast random access to these tags 
afterwards. My index will be written directly to a SQL server. The SAX 
parser is exactly what I want except for one thing: I cannot find how 
to get the current position in the file when an event is triggered.

Thanks,

Etienne



Stephen Collyer wrote:
> That's certainly a good start, but bear in mind that if
> you are parsing an XML doc, you will probably be building
> some kind of internal representation of it, and for a 60GB
> document, that internal representation may still be huge.
>
> OTOH, if you're merely searching through the doc to detect
> the presence or absence of some particular element, or to
> gather light-weight statistics, or to modify it and regenerate
> it on the fly, SAX is the way to go.

--
 [ signature omitted ] 

Message 5 in thread

Have you considered using the database to store an indexed version of  
the document rather than just the index ?

On Jan 26, 2008, at 7:58 AM, Etienne SANDRE wrote:

> Yes, as data is huge, I want to build an index of the byte position  
> of tags in the XML file, so that I can do fast random access to  
> these tags afterwards. My index will be written directly on a SQL  
> server. The SAX parser is exactly what I want except for one  
> thing : I cannot find how to get the actual position in the file  
> when an event is triggered.

--
 [ signature omitted ] 

Message 6 in thread

Yes. Let me explain the problem in more detail.

There are three types of tags in my XML file. Each has an id attribute 
that is unique for that kind of tag.

Type 1: points. These describe the coordinates of 2D points.
Type 2: segments. These have two point ids, describing the segment's ends.
Type 3: segmented lines. These have an arbitrary number of segment ids.

I want to store all this data in a single SQL table with GIS (spatial 
data) extensions. There you can store an arbitrary vector geometry in a 
table field and use very efficient 2D indexes on each object's bounding 
box, which is why I prefer that layout.

If I directly recreate three tables (points, segments, lines) in a SQL 
database from the unprocessed XML data, I will have to write a conversion 
program that reads the data from those SQL tables, converts it to GIS 
objects, and stores them in another SQL table. As the first DB schema is 
inefficient, reading the data for a full line is expensive: after 
retrieving the segment ids, you have to query them in the segment table, 
then query the point table for the corresponding ids. I want to avoid this 
and build the final SQL table directly, without building a temporary one.

In short: I do not want to store the XML data directly in SQL, because 
I want to build the table with a completely different schema. This 
requires random-access reads of the segment and point data. This is why 
I would like to create an index that maps object ids to tag locations in 
the XML file.

Any ideas appreciated (but this is a little off-topic...)

Thanks!

Etienne





Dan White wrote:
> Have you considered using the database to store an indexed version of 
> the document rather than just the index ?

--
 [ signature omitted ] 

Message 7 in thread

In reply to Etienne SANDRE's message of Saturday 26 January 2008 14:47:47:
Ok, let's assume your XML looks like this:
<line>
	<seg> <point>..</point><point>..</point> </seg>
	<seg> <point>..</point><point>..</point> </seg>
	<seg> <point>..</point><point>..</point> </seg>
</line>
<line>
...
</line>
etc.

I'd recommend you use SAX, read one <line /> at a time (keeping it in memory), and 
store it directly in the DB in the correct format.  
However, that all depends on how the data is laid out in the XML file and how 
you want it in the DB. 
If this is no help, I have a few more suggestions for you, but I think we 
should take that off the list.

Happy coding,
Eric


--
 [ signature omitted ] 

Message 8 in thread

Maybe QXmlStreamReader class? (Qt4.3)

It has a characterOffset() member function...
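
One caveat, hedged: a character offset is not necessarily a byte offset. If the file is UTF-8 and contains non-ASCII text, the two diverge, so a value taken from characterOffset() may need converting before it can be used to seek in the file. The sketch below assumes the offset counts UTF-16 code units (QString is UTF-16, but whether characterOffset() counts this way should be checked against the Qt documentation) and that the input is well-formed UTF-8. The function name is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Converts an offset counted in UTF-16 code units into a byte offset within
// UTF-8 text. Assumption: `utf8` is well-formed UTF-8.
std::size_t byteOffsetForUtf16Offset(const std::string& utf8,
                                     std::size_t utf16Offset)
{
    std::size_t bytes = 0;  // bytes consumed so far
    std::size_t units = 0;  // UTF-16 code units those bytes decode to
    while (bytes < utf8.size() && units < utf16Offset) {
        const unsigned char lead = static_cast<unsigned char>(utf8[bytes]);
        std::size_t len = 1;              // ASCII
        if (lead >= 0xF0)      len = 4;   // outside the BMP: surrogate pair
        else if (lead >= 0xE0) len = 3;
        else if (lead >= 0xC0) len = 2;
        bytes += len;
        units += (len == 4) ? 2 : 1;
    }
    return bytes;
}
```

On a 60GB file the same counting would have to be done incrementally while reading, not on one in-memory string; for pure ASCII input the two offsets are identical and no conversion is needed.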

--
 [ signature omitted ] 

Message 9 in thread

Tr3wory: Thanks, QXmlStreamReader is much better than QXmlSimpleReader:
it is simpler to implement, and characterOffset() is exactly what I need.


Eric:
Nope. Unfortunately, my XML looks like this:

<point id=24235 x=3436.325 y=4306.96340 />
<point id=431665 x=457.2745 y=346.346 />
...
<seg id=2436 start_node_id=324 end_node_id=34373745 />
...
<line id=23511>
  <lineseg seg_id=2436 />
  <lineseg seg_id=45468 />
  <lineseg seg_id=253278 />
   ...
</line>

So I need random access to the data (because of the id references). This
is why I need to build an index. True, I could do it with MySQL in a
temporary table, but I would prefer something lighter.

Message 10 in thread

On Tuesday 29 January 2008 11:58:30 Etienne Sandré wrote:
> Tr3wory : Thanks, QXmlStreamReader is much better than QXmlSimpleReader :
> more simple to implement, and characterOffset() is exactly what I need.
>
>
> Eric:
> Nope. Unfortunately, my XML looks like this:
>
> <point id=24235 x=3436.325 y=4306.96340 />

BTW, this line is not valid XML (attribute values must be quoted), but this one is:
<point id="24235" x="3436.325" y="4306.96340" />

-- 
 [ signature omitted ] 


Message 11 in thread

Etienne Sandré wrote:
> So I need to do random access to the data (due to the id references). 
> This is why I need to build an index. This is true, I could do it with 
> mysql in a temporary table, but I would prefer something lighter.

Any arguments against a custom solution in this case? There seem to be 
only a few kinds of XML elements, so you could write something yourself 
that builds an index in one pass.

regards,
Georg

--
 [ signature omitted ] 

Message 12 in thread

Etienne Sandré wrote:
> So I need to do random access to the data (due to the id references).
> This is why I need to build an index. This is true, I could do it with
> mysql in a temporary table, but I would prefer something lighter.


You can use a native XML database such as Berkeley DB XML. It has a lot of
advantages for storing XML over either a SQL model that is not what you want
or plain XML files.

Take a look here:
http://www.oracle.com/database/berkeley-db/xml/index.html

Hope this helps



Message 13 in thread

Georg Fritzsche wrote:
> Etienne Sandré wrote:
>> So I need to do random access to the data (due to the id references). 
>> This is why I need to build an index. This is true, I could do it with 
>> mysql in a temporary table, but I would prefer something lighter.
> 
> any arguments against a custom solution in this case? the xml-elements 
> seem to be few so you could write something yourself that build an index 
> in one pass.

+1

I also believe that in this case a simple XML parser like
this one

http://devnull.samersoff.net/cgi-bin/hg/hgwebdir.cgi/libdms5/file/559fd6c22e81/src/tools/src/dsXMLReader.cxx

is the best solution.

-- 
 [ signature omitted ] 

Message 14 in thread

In reply to Etienne Sandré's message of Tuesday 29 January 2008 10:58:30:
I think you can use what you know about how the XML is structured and parsed 
to your advantage. 
You know, for example, that the first point you encounter has id 245, the second 
one id 3457532, etc. The same goes for segments and for lines.
So with one pass you can 'remap' all the ids into separate files that are a 
bit more efficient, then start again from there.

For example, you parse the XML file, looking only for lines.
Each line you store in memory: 
struct line {
    unsigned int* lineids;    // segment ids as they appear in the XML
    unsigned int* lineIndex;  // new segment numbers, filled in on the second pass
    unsigned char count;      // assuming you have no more than 255 segments in a line, ever
};

For each line, you simply allocate room for its number of segments and store 
their ids and the count. 

Now parse again; this time, whenever you see a new segment, assign it a new 
number (i++), 
look for any line that contains this id, and assign the number to the same 
location in lineIndex.

Finally, write all the lines to a new file in a fixed format, for example: 
count, index, index, index, count, index, index, count, index, count, index, 
index, index, index
The index will actually correspond to the index in the segment file you're 
going to build next in the same way (a line will consist of 4 points, so you can 
navigate through that file by using the index).

Something like that.
Just a thought.
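
The remapping idea above could be sketched roughly as follows. This is one possible reading of it, not Eric's exact procedure: it swaps the "look for any line that contains this id" scan for a hash map from original segment id to a dense number, which needs memory proportional to the number of segments. All names (Line, DenseMap, numberSegments, remapLine, flatten) are hypothetical.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Line {
    std::vector<std::uint64_t> segIds;    // segment ids as written in the XML
    std::vector<std::uint32_t> segIndex;  // dense numbers, filled by remapLine
};

using DenseMap = std::unordered_map<std::uint64_t, std::uint32_t>;

// Assigns 0, 1, 2, ... to segment ids in the order they appear in the file.
DenseMap numberSegments(const std::vector<std::uint64_t>& segIdsInFileOrder)
{
    DenseMap dense;
    std::uint32_t next = 0;
    for (std::uint64_t id : segIdsInFileOrder)
        if (dense.emplace(id, next).second)  // true only for a new id
            ++next;
    return dense;
}

// Fills line.segIndex; returns false if the line references an unknown segment.
bool remapLine(Line& line, const DenseMap& dense)
{
    line.segIndex.clear();
    for (std::uint64_t id : line.segIds) {
        const auto it = dense.find(id);
        if (it == dense.end())
            return false;
        line.segIndex.push_back(it->second);
    }
    return true;
}

// Flattens lines into the "count, index, index, ..." layout described above.
std::vector<std::uint32_t> flatten(const std::vector<Line>& lines)
{
    std::vector<std::uint32_t> out;
    for (const Line& line : lines) {
        out.push_back(static_cast<std::uint32_t>(line.segIndex.size()));
        out.insert(out.end(), line.segIndex.begin(), line.segIndex.end());
    }
    return out;
}
```

Because the dense numbers are consecutive, a segment file written with fixed-size records can be addressed as number times record size, with no separate index lookup.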



--
 [ signature omitted ] 

Message 15 in thread

Hi Etienne, 
 
Having read access to the character offset does not give you random access to the data: a general XML parser will still have to reparse from the beginning of the file to get back to any particular data element. 
 
Ideally, you would have a simple XML parser that allows you to save and RESTORE the whole parser state, but I have not seen anything that does this. And in your case, the saved state at each index entry is probably bigger than the data! 
 
Regards, 
 
Tony Rietwyk
 
 
