I have to read a 60GB XML file. Since this is too big to fit in memory, I want to build an index on it. Is there a way with Qt classes to parse XML data without holding the whole document in memory, for instance with a non-blocking parser that sends an event for each XML tag it encounters, without incrementally storing the data in a QDomDocument?

Thanks,
Etienne
--
[ signature omitted ]
Hi,

> Since this is too big to fit in memory, I want to build an index on it.
> Is there a way with Qt classes to parse XML data without holding the
> whole document in memory, for instance with a non-blocking parser that
> sends an event for each XML tag it encounters, without incrementally
> storing the data in a QDomDocument?

Use the SAX parser instead of DOM:
http://doc.trolltech.com/4.3/qtxml.html#the-qt-sax2-classes
--
[ signature omitted ]
Dimitri wrote:
> Use the SAX parser instead of DOM:
> http://doc.trolltech.com/4.3/qtxml.html#the-qt-sax2-classes

That's certainly a good start, but bear in mind that if you are parsing an XML doc, you will probably be building some kind of internal representation of it, and for a 60GB document, that internal representation may still be huge.

OTOH, if you're merely searching through the doc to detect the presence or absence of some particular element, or to gather light-weight statistics, or to modify it and regenerate it on the fly, SAX is the way to go.
--
[ signature omitted ]
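To illustrate the "light-weight statistics" case, here is a minimal sketch in plain standard C++ (not Qt's SAX API). It streams through the input once and counts start tags per element name, so only the counters stay in memory. The names isNameChar and countStartTags are made up for this example, and the scanner assumes well-formed input with no '<' or '>' inside attribute values, comments or CDATA sections.

```cpp
#include <cctype>
#include <istream>
#include <map>
#include <sstream>   // std::istringstream, handy for trying this on small inputs
#include <string>

// Hypothetical helper: characters accepted as part of an element name.
static bool isNameChar(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) != 0
        || c == '_' || c == ':' || c == '-' || c == '.';
}

// Hypothetical helper: count start tags (including empty-element tags)
// per element name in a single streaming pass.
std::map<std::string, long long> countStartTags(std::istream& in)
{
    std::map<std::string, long long> counts;
    char c = 0;
    while (in.get(c)) {
        if (c != '<')
            continue;                    // character data: ignore it
        std::string name;
        while (in.get(c) && isNameChar(c))
            name += c;                   // collect the element name
        if (!name.empty())
            ++counts[name];              // end tags, comments and PIs give an empty name
        while (in && c != '>' && in.get(c)) {
            // skip the rest of the tag
        }
    }
    return counts;
}
```

Memory use depends on the number of distinct element names, not on the size of the file. For the real file you would pass a std::ifstream opened in binary mode instead of a string stream.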
Yes, as the data is huge, I want to build an index of the byte positions of the tags in the XML file, so that I can do fast random access to these tags afterwards. My index will be written directly to an SQL server. The SAX parser is exactly what I want except for one thing: I cannot find how to get the current position in the file when an event is triggered.

Thanks,
Etienne

Stephen Collyer wrote:
> That's certainly a good start, but bear in mind that if
> you are parsing an XML doc, you will probably be building
> some kind of internal representation of it, and for a 60GB
> document, that internal representation may still be huge.
>
> OTOH, if you're merely searching through the doc to detect
> the presence or absence of some particular element, or to
> gather light-weight statistics, or to modify it and regenerate
> it on the fly, SAX is the way to go.
--
[ signature omitted ]
Have you considered using the database to store an indexed version of the document rather than just the index?

On Jan 26, 2008, at 7:58 AM, Etienne SANDRE wrote:
> Yes, as the data is huge, I want to build an index of the byte positions
> of the tags in the XML file, so that I can do fast random access to
> these tags afterwards. My index will be written directly to an SQL
> server. The SAX parser is exactly what I want except for one
> thing: I cannot find how to get the current position in the file
> when an event is triggered.
--
[ signature omitted ]
Yes. Let me explain the problem in more detail.

There are three types of tags in my XML file. Each has an id attribute which is unique for that kind of tag.

Type 1: points. These describe the coordinates of 2D points.
Type 2: segments. These have two point ids, describing the segment's ends.
Type 3: segmented lines. These have an arbitrary number of segment ids.

I want to store all this data in a single SQL table with GIS (spatial data) extensions. You can store an arbitrary vector geometry in a table field, and you can use very efficient 2D indexes on the objects' bounding boxes, so that is the better layout.

If I directly recreate three tables (points, segments, lines) in an SQL database from the unprocessed XML data, I will have to write a conversion program that reads the data from those tables, converts it to GIS objects, and stores them in another SQL table. As the first schema is inefficient, reading a full line is expensive: after retrieving the segment ids, you have to query them in the segment table, then query the point table for the corresponding ids. I want to avoid this and build the final SQL table directly, without a temporary one.

In short: I do not want to store the XML data directly in SQL, because I want to build the table with a completely different schema. This requires random-access reads of the segment and point data. This is why I would like to create an index from object ids to tag locations in the XML file.

Any ideas appreciated (but this is a little off-topic...)

Thanks!
Etienne

Dan White wrote:
> Have you considered using the database to store an indexed version of
> the document rather than just the index?
--
[ signature omitted ]
On Saturday 26 January 2008 14:47:47 Etienne SANDRE wrote:

OK, let's assume your XML looks like this:

<line>
  <seg> <point>..</point><point>..</point> </seg>
  <seg> <point>..</point><point>..</point> </seg>
  <seg> <point>..</point><point>..</point> </seg>
</line>
<line> ... </line>
etc.

I'd recommend you use SAX, read one <line /> at a time (keep it in memory) and store it directly to the DB in the correct format.

However, that all depends on how the data exists in the XML file and how you want it in the DB.

If this is no help, I have a few more suggestions for you, but I think we should take that off the list.

Happy coding,
Eric

> There are three types of tags in my XML file. Each has an id attribute
> which is unique for that kind of tag.
>
> Type 1: points. These describe the coordinates of 2D points.
> Type 2: segments. These have two point ids, describing the segment's ends.
> Type 3: segmented lines. These have an arbitrary number of segment ids.
>
> [...]
>
> In short: I do not want to store the XML data directly in SQL, because
> I want to build the table with a completely different schema. This
> requires random-access reads of the segment and point data. This is why
> I would like to create an index from object ids to tag locations in the
> XML file.
--
[ signature omitted ]
Maybe QXmlStreamReader class? (Qt4.3) It has a characterOffset() member function... -- [ signature omitted ]
Tr3wory: Thanks, QXmlStreamReader is much better than QXmlSimpleReader: simpler to implement, and characterOffset() is exactly what I need.

Eric: Nope. Unfortunately, my XML looks like this:

<point id=24235 x=3436.325 y=4306.96340 />
<point id=431665 x=457.2745 y=346.346 />
...
<seg id=2436 start_node_id=324 end_node_id=34373745 />
...
<line id=23511>
<lineseg seg_id=2436 />
<lineseg seg_id=45468 />
<lineseg seg_id=253278 />
...
</line>

So I need random access to the data (because of the id references). This is why I need to build an index. True, I could do it with MySQL in a temporary table, but I would prefer something lighter.
On Tuesday 29 January 2008 11:58:30 Etienne Sandré wrote:
> Eric:
> Nope. Unfortunately, my XML looks like this:
>
> <point id=24235 x=3436.325 y=4306.96340 />

BTW, this line is not valid XML, but this one is:

<point id="24235" x="3436.325" y="4306.96340" />
--
[ signature omitted ]
Etienne Sandré wrote:
> So I need random access to the data (because of the id references).
> This is why I need to build an index. True, I could do it with
> MySQL in a temporary table, but I would prefer something lighter.

Any arguments against a custom solution in this case? The XML elements seem to be few, so you could write something yourself that builds an index in one pass.

Regards,
Georg
--
[ signature omitted ]
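A custom one-pass indexer along the lines Georg suggests could look roughly like the sketch below. It is plain standard C++, not a Qt API, and the names TagIndex, attributeValue and buildIndex are invented for illustration. It assumes attribute values are double-quoted (id="24235"), that a single space precedes the id attribute, and that '>' never appears inside an attribute value.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>   // std::istringstream, handy for trying this on small inputs
#include <string>
#include <utility>

// Key: (element name, id value). Value: byte offset of the tag's '<'.
using TagIndex = std::map<std::pair<std::string, std::string>, std::uint64_t>;

// Hypothetical helper: value of attribute `attr` in the raw tag text, or "" if absent.
std::string attributeValue(const std::string& tag, const std::string& attr)
{
    const std::string key = " " + attr + "=\"";
    const std::size_t start = tag.find(key);
    if (start == std::string::npos)
        return "";
    const std::size_t valueStart = start + key.size();
    const std::size_t valueEnd = tag.find('"', valueStart);
    if (valueEnd == std::string::npos)
        return "";
    return tag.substr(valueStart, valueEnd - valueStart);
}

// Hypothetical helper: scan the stream once and record where each tag
// carrying an id attribute starts.
TagIndex buildIndex(std::istream& in)
{
    TagIndex index;
    std::uint64_t offset = 0;            // number of bytes consumed so far
    char c = 0;
    while (in.get(c)) {
        ++offset;
        if (c != '<')
            continue;
        const std::uint64_t tagStart = offset - 1;
        std::string tag;                 // text between '<' and '>'
        while (in.get(c)) {
            ++offset;
            if (c == '>')
                break;
            tag += c;
        }
        // An end tag such as "/line" yields an empty name and is skipped.
        const std::string name = tag.substr(0, tag.find_first_of(" \t\r\n/"));
        const std::string id = attributeValue(tag, "id");
        if (!name.empty() && !id.empty())
            index[std::make_pair(name, id)] = tagStart;
    }
    return index;
}
```

The std::map is only there to keep the example short. For a 60GB file each (name, id, offset) entry would more likely be written straight to the database or to an index file as it is found, instead of being kept in memory.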
Etienne Sandré wrote:
> So I need random access to the data (because of the id references).
> This is why I need to build an index. True, I could do it with
> MySQL in a temporary table, but I would prefer something lighter.

You can use a native XML database like Berkeley DB XML. It has a lot of advantages for storing XML, compared with an SQL model that is not what you want, or with plain XML files. Take a look here:
http://www.oracle.com/database/berkeley-db/xml/index.html

Hope this helps.
Georg Fritzsche wrote:
> Any arguments against a custom solution in this case? The XML elements
> seem to be few, so you could write something yourself that builds an
> index in one pass.

+1

I also believe that in this case a simple XML parser like this one

http://devnull.samersoff.net/cgi-bin/hg/hgwebdir.cgi/libdms5/file/559fd6c22e81/src/tools/src/dsXMLReader.cxx

is the best solution.
--
[ signature omitted ]
On Tuesday 29 January 2008 10:58:30 Etienne Sandré wrote:
I think you can use what you know about the structure of the XML, and about
how it is parsed, to your advantage.
You know, for example, that the first point you encounter has id 245, the
second one id 3457532, etc. The same goes for segments and for lines.
So with one pass, you can 'remap' all the ids into separate files that are a
bit more efficient. Then start again from there.
For example, you parse the XML file, looking only for lines.
Each line you store in memory:
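struct line {
    unsigned int* lineids;    // segment ids of this line, as read from the XML
    unsigned int* lineIndex;  // remapped number of each segment (filled in on the second pass)
    unsigned char count;      // number of segments; assumes no more than 255 segments in a line, ever
};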
For each line, you simply allocate room for the number of segments, and store
their ids and the count.
Now parse again; this time, whenever you see a new segment, assign it a new
number (i++),
look for any line that contains this id, and assign the number to the same
location in lineIndex.
Finally, write all the lines to a new file in a fixed format, for example:
count, index, index, index, count, index, index, count, index, count, index,
index, index, index
The index will actually correspond to the index in the segment file you're
going to build next in the same way (a line will consist of 4 points so you can
navigate through that file by using the index).
Something like that.
Just a thought.
> Tr3wory : Thanks, QXmlStreamReader is much better than QXmlSimpleReader :
> more simple to implement, and characterOffset() is exactly what I need.
>
>
> Eric:
> Nope. Unfortunately, my XML looks like this:
>
> <point id=24235 x=3436.325 y=4306.96340 />
> <point id=431665 x=457.2745 y=346.346 />
> ...
> <seg id=2436 start_node_id=324 end_node_id=34373745 />
> ...
> <line id=23511>
> <lineseg seg_id=2436 />
> <lineseg seg_id=45468 />
> <lineseg seg_id=253278 />
> ...
> </line>
>
> So I need to do random access to the data (due to the id references). This
> is why I need to build an index. This is true, I could do it with mysql in
> a temporary table, but I would prefer something lighter.
--
[ signature omitted ]
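The second pass of the remapping Eric describes above could be sketched as follows in standard C++. Two things differ from his outline and are my own substitutions: std::vector replaces the raw pointers and the 255-segment count, and a hash map from segment id to the places that reference it replaces the "look for any line that contains this id" scan. The names Line and remapSegments are hypothetical, and the sketch assumes all lines fit in memory, as his approach does.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Line {
    std::vector<std::uint32_t> segIds;    // segment ids as read from the XML (first pass)
    std::vector<std::uint32_t> segIndex;  // sequential segment numbers (second pass)
};

// segOrder holds the segment ids in the order the <seg> elements appear
// in the file; the position in that order becomes the new segment number.
void remapSegments(std::vector<Line>& lines,
                   const std::vector<std::uint32_t>& segOrder)
{
    // For every segment id, remember which (line, slot) pairs reference it.
    std::unordered_map<std::uint32_t,
                       std::vector<std::pair<std::size_t, std::size_t>>> uses;
    for (std::size_t l = 0; l < lines.size(); ++l) {
        lines[l].segIndex.assign(lines[l].segIds.size(), 0);
        for (std::size_t s = 0; s < lines[l].segIds.size(); ++s)
            uses[lines[l].segIds[s]].push_back(std::make_pair(l, s));
    }

    std::uint32_t next = 0;               // the "i++" counter
    for (std::uint32_t id : segOrder) {
        const auto it = uses.find(id);
        if (it != uses.end()) {
            for (const auto& use : it->second)
                lines[use.first].segIndex[use.second] = next;
        }
        ++next;
    }
}
```

In a real run segOrder would not be a vector: the loop body would be called from the parser each time a <seg> element is seen. Segments that no line references still consume a number, so the numbers stay aligned with positions in the segment file.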
Hi Etienne,

Having read access to the character offset does not give you random access to the data: you will still have to reparse from the beginning of the file to get back to any particular data element.

Ideally, you would have a simple XML parser that allows you to save and RESTORE the whole parser state. But I have not seen anything that does this. And in your case, the saved state at each index entry is probably bigger than the data!

Regards,
Tony Rietwyk

-----Original Message-----
From: etienne.sandre.chardonnal@xxxxxxxxx [mailto:etienne.sandre.chardonnal@xxxxxxxxx] On Behalf Of Etienne Sandré
Sent: Tuesday, 29 January 2008 20:59
To: qt-interest@xxxxxxxxxxxxx
Subject: Re: Parse a huge XML file

Tr3wory: Thanks, QXmlStreamReader is much better than QXmlSimpleReader: simpler to implement, and characterOffset() is exactly what I need.
[...]
So I need random access to the data (because of the id references). This is why I need to build an index. True, I could do it with MySQL in a temporary table, but I would prefer something lighter.
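On Tony's caveat: a general XML parser does need its state to resume in the middle of a document, but for self-contained tags such as the <point ... /> and <seg ... /> elements in this thread, a stored byte offset may be enough to seek to and read one tag without a parser. The sketch below is standard C++ with a made-up function name, readTagAt. It assumes the stored value really is a byte offset (as far as I know characterOffset() counts characters, so the two match only for single-byte encodings) and that '>' does not occur inside attribute values.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this on small inputs
#include <string>

// Hypothetical helper: return the raw text of the tag starting at byte
// `offset` (from '<' up to and including '>'), or "" if there is no
// complete tag at that position.
std::string readTagAt(std::istream& in, std::uint64_t offset)
{
    in.clear();                          // reset eof/fail flags left by earlier reads
    in.seekg(static_cast<std::streamoff>(offset), std::ios::beg);

    char c = 0;
    if (!in.get(c) || c != '<')
        return "";                       // offset does not point at a tag

    std::string tag(1, c);
    while (in.get(c)) {
        tag += c;
        if (c == '>')
            return tag;                  // complete tag read
    }
    return "";                           // ran off the end of the input
}
```

This only covers elements whose data sits entirely in their own attributes. For a <line> element the children would still have to be read up to the matching end tag, which is where Tony's point about parser state applies.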