Using very large datasets

Sep 30, 2016 at 5:14 PM
Edited Sep 30, 2016 at 5:18 PM
I've been using the netDxf library for exporting proprietary data to dxf, and it works amazingly well. The library helped me to stay away as far as possible from writing bare dxf codes ;-)

Now the next challenge: exporting very large datasets!
Exporting about 1 million records results in an OutOfMemoryException.
That is logical of course, because the whole document has to be constructed in-memory first, before it can be saved to a file.
Looking at the source, I think it might be possible to create another sort of DxfWriter that is capable of creating the file sequentially.

My questions are:
  • do others have the same experience with large datasets?
  • is there anyone who has a bright idea? I have, maybe not so bright, but any other input is welcome!
Oct 3, 2016 at 5:53 PM
With so little information it is hard to know what is happening. I have run tests generating files with up to two million lines and I have not seen any problems. A file with a million entities is big, but not that big; it should not cause any problems.

When and who is raising the OutOfMemoryException?

Are you working with a 32 or 64 bit system?

Does your problem only occur when saving the file? Without more information, I can think of two places that could raise an OutOfMemoryException for very big files: how I am storing the document handles, and the way I am caching the text conversions during the saving process. Both of them should be easy to fix.

What is the content of your file? Perhaps the problem comes from some kind of combination of different DxfObjects.

Does your file contain Text or MText entities with lots of text, by any chance?

Have you tried saving your file with different versions? If you have not, try to save your file as an AutoCad2010 version. If that version works, the problem might be in the way I am caching the text conversions. Prior versions require the non-ASCII characters of a text string to be encoded in a special way.

Oct 3, 2016 at 7:47 PM
Edited Oct 3, 2016 at 7:51 PM
Thank you for your info.

Sorry, I realize it's not a lot of information.
I just think that it is logical that eventually there will be too much data in memory. So I thought it would be neat to write the dxf sequentially, without first building the whole DxfDocument. Though I understand that is difficult because of dependencies within the dxf...

The dotnet app is a simple console application (at this moment), with targetFramework 4.0. The OS is Windows 7 Professional x64.
I did not check where the OutOfMemoryException occurred: while building up the dxf, or during the save.
But what I saw was that the memory used by the console kept rising, up to about 1.2GB+.
The entities that were used are just polygons, not texts. So using another version will not make much difference, I think.

But besides all this: I think it would be great if the dxf file could be written without first building it. (Not hindered by knowledge and the impossibilities....)

For a POC, I added a public delegate property to the DxfWriter. This delegate is called when the entities are being written.
And I added an "appendEntity" method that uses the ICodeValueWriter to write entities to the stream.

The delegate function (outside the DxfWriter) creates a new entity and uses "appendEntity" to get it written.
Of course, as you will know already, this doesn't work. One problem is that the entity doesn't have an owner.
But when I fixed that, some other dxf dependency was missing. Don't know what exactly, but the dxf would not open in AutoCAD anymore...

Besides that, I've been refactoring a bit, so that the Save method is not 1 big method, but uses the new DxfClassesSectionWriter, DxfHeaderSectionWriter, etc.
Oct 5, 2016 at 11:26 AM
It is not possible to do what you suggest with the current implementation. Bear in mind that the data of a DxfDocument can be built not only by manually generating its contents; it can also come from an external dxf that you edit and then save as a new version. Therefore the whole document must be fully loaded into memory.

The information about the entities is spread across the different sections of the dxf. Some of that data must be defined before the actual entities section while other data is defined after it, so it is not that straightforward to write information about entities. You cannot just write your information in the entities section and be done with it; you also have to take care of writing information in other places, before and after the actual definition of the entity.
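
For contrast, here is a minimal sketch of what sequential writing looks like when those dependencies are avoided altogether. It is not netDxf code, and all function names are hypothetical. It assumes an old R12-style file that consists of an ENTITIES section only, with every polygon on the default layer "0" and no handles, tables, or owner references. Whether a given CAD program accepts such a stripped-down file is not something the sketch guarantees, and it does not carry over to the newer dxf versions discussed here.

```python
import io


def write_pair(out, code, value):
    """Write one DXF group: the code on one line, the value on the next."""
    out.write("{}\n{}\n".format(code, value))


def stream_polygons(out, polygons):
    """Write closed 2D polygons one at a time; no document is built in memory."""
    write_pair(out, 0, "SECTION")
    write_pair(out, 2, "ENTITIES")
    count = 0
    for points in polygons:  # polygons may be a generator
        write_pair(out, 0, "POLYLINE")
        write_pair(out, 8, "0")   # default layer, so no layer table is written
        write_pair(out, 66, 1)    # "vertices follow" flag
        write_pair(out, 70, 1)    # closed polyline
        for x, y in points:
            write_pair(out, 0, "VERTEX")
            write_pair(out, 8, "0")
            write_pair(out, 10, x)
            write_pair(out, 20, y)
        write_pair(out, 0, "SEQEND")
        count += 1
    write_pair(out, 0, "ENDSEC")
    write_pair(out, 0, "EOF")
    return count


def squares(n):
    """Hypothetical data source: yields n unit squares, one at a time."""
    for i in range(n):
        x = float(i * 2)
        yield [(x, 0.0), (x + 1.0, 0.0), (x + 1.0, 1.0), (x, 1.0)]


buffer = io.StringIO()  # a real export would pass an open file instead
written = stream_polygons(buffer, squares(3))
```

Memory use stays flat here only because nothing refers to anything else. As soon as entities need handles, custom layers, or an owner, the writer has to know about them before or after the ENTITIES section.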

Since you are working with polygons, which entity are you using, LwPolylines or Polylines? If your polygons are flat, always use LwPolylines; they are a lot more memory efficient. I ran a couple more tests creating a document with a million polygons with ten vertexes each, and in neither case did I find any issues, besides the obvious slowdown. The LwPolyline case generated a file of 630MB while the Polyline case was 1.55GB.
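
As a rough check on those figures, the arithmetic below turns the two file sizes into an average number of bytes per vertex. It assumes decimal units (1 MB = 10**6 bytes) and ignores the fixed overhead of the other sections. The numbers describe the files on disk, not the memory used by the document.

```python
# Figures quoted in the post above; decimal units are an assumption.
POLYGONS = 1_000_000
VERTICES_PER_POLYGON = 10
LWPOLYLINE_FILE_BYTES = 630 * 10**6   # first case, 630MB
POLYLINE_FILE_BYTES = 1550 * 10**6    # second case, 1.55GB


def bytes_per_vertex(file_bytes):
    """Average file size per vertex, ignoring headers and tables."""
    return file_bytes / (POLYGONS * VERTICES_PER_POLYGON)


lw = bytes_per_vertex(LWPOLYLINE_FILE_BYTES)
pl = bytes_per_vertex(POLYLINE_FILE_BYTES)
print(lw, pl, round(pl / lw, 2))  # prints 63.0 155.0 2.46
```

So the second file needs roughly two and a half times as many bytes per vertex as the LwPolyline file.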