January 19, 2009
I enjoyed this piece by Jochem on a test conversion of his company’s database from MS SQL to open source Postgres. Performance was similar, disk space was 75% less and licensing costs? Well… those would go to $0.
I enjoyed this piece by Jochem on a test conversion of his company’s database from MS SQL to open source Postgres. Performance was similar, disk space was 75% less and licensing costs? Well… those would go to $0.
Zack Steinkamp said:
on January 20, 2009 at 9:39 am
While I applaud the move away from MSSQL, I’m curious why someone would put 300KB of XML into one column of a database row.
Something like that should be stored outside of the DB, named with the row’s primary key value. It’s a tiny bit more work to implement, but will pay off in database scalability, performance, and overall system flexibility.
brian said:
on January 20, 2009 at 12:58 pm
Mmm, I don’t know that I agree. Particularly with XML support inside most enterprise databases now, XML is a very valid resource to store in the database. What’s the difference between 300kb of XML or 300kb of text? Would you store the text on the file system? If the database can act on the XML, I think it makes sense to keep it internal.
I also question whether or not it will actually improve “performance”. Which performance? It may improve database scalability, but that is only one component of the system.
File system storage has its own mixed bag of issues… good for some things, not for others. While I personally store binary assets like images on the filesystem and use the database to provide a pointer, I don’t think it’s a blanket solution. It’s an option but in my opinion by no means a best practice.
Zack Steinkamp said:
on January 20, 2009 at 2:34 pm
Absolutely no difference, unless as you’ve said the DB can make use of the XML as a native type.
It would be a very very special case app for which this architecture would make sense.
brian said:
on January 20, 2009 at 4:01 pm
So you would stick 300kb of text in the file system too?
What, exactly, do you use your database for?
Zack Steinkamp said:
on January 20, 2009 at 4:37 pm
Well, in *a* file system. May be on the DB server, may be on a different host. If your DB serves to lookup which blob you’re interested in using, then there’s a more efficient system to serve the blob, as long as the blob is large enough so that it can be fetched or stored or replicated more efficiently than in the RDBMS.
This gets nicely at the fuzzy area between data that is efficiently stored and retrieved in the DB and those that are not. That size threshold will vary based on lots of factors. 300KB is very very likely to be larger than what is efficient.
This is not a black and white issue when details of the application, platform software, and hardware are not known.
brian said:
on January 21, 2009 at 3:51 pm
Fair enough – too many undefined parameters. Where might you be better off storing 300kb of XML that you wanted to store and process? Are there options other than local-to-the-app server storage? Anything over the (public) wire is going to be orders of magnitude slower.
Zack Steinkamp said:
on January 21, 2009 at 4:42 pm
Of course, going through the internet to fetch the data is not going to be a performant solution vs. fetching from a local DB.
If the DB was on the local machine, then the performance difference between the DB software reading (and caching) 300KB from the disk is not going to appreciably different than the OS reading (and caching) 300KB from the disk.
In the example you linked to, 24GB of 32GB (75%) of the data set was this XML data. If that 24GB was moved to a memcached farm (or other RAM-based clustered caching solution), then performance would likely be better across the board vs. a disk-based solution, assuming a LAN connection between hosts and a normal distribution of requests to chunks of XML.
Maintenance of the database will be faster and simpler without that extra 24GB of data to push around. Not sure how Postgres does “ALTER TABLES” but MySQL always makes a full copy of the table…
As usual, the answer is “it all depends” based on performance need, nature of the system, and available resources.