We are currently developing a system which wants to analyze all the domains in the internet. This is a really challenging task and not easily done in a few months time. Besides loads of problems, like finding so many domains and parsing them in a reasonable amount of time we also implement a MongoDB cluster to store the analyzed information. Our database has currently 200GB split into two shards but we expect this to grow up to 1-2 TB of data.
There are a lot of posts like this on the web so I’m probably not telling you something new (especially if you are senior dev dba who just goes: “oh my god… i knew that 10 years ago, it’s the same with every database” :)) but I really wanted to share the following things which bugged me quite a while:
Run every query in production
So you wrote a nicely tdd’ed, async, evented backend job system which runs smoothly on you latest mac book pro and suddenly in production with are large dataset the system stops responding. Queries are slow and not returning in an appropriate amount of time. This happened to us when we switched from SSDs back to normal SATA drives (space/price reasons) and we suddenly realized we now really have to tune and understand those queries. So: Run every query you need in production or on a very similar environment and be sure to have a index for the most important queries at least.
Explain every query and set proper indexes
In order to understand your queries better you should always run .explain() on them to really understand what they are doing. Are they using the right index? How fast is the query in general? Do I need to hint() the query optimizer maybe? Run an .explain() for every query
Make sure your cluster can handle the amount of writes
In our first highly async, evented solution which basically killed the db with .update() statements when switched from SSDs to SATA drives which could not handle the write load. the disk were running at 100% according to iostat most of the time. We needed to think about a solution which does not hammer the database so much. We did this by combining jobs into groups and other smaller tweaks. What kind of queries are you doing? Is your cluster write or read intensive?
TDD does not help here…. experiment with the real data
This DBA related kind of work is not solvable with a TDD approach. You have to experiment and test/work with real data in a real environment. I do not think one will find a good DB design with TDD. At least we experienced this in this special case. TDD won’t solve all your problems 🙂
Mongo needs WAY more hd space than traditional RDBMS
Because we outgrew our first SSDs so quickly we analyzed why that happened and compared a 30M db import into mysql with mongo. Because mongo needs to store meta information for a doc, i.e. the keys inside the doc. we found that it needs roughly three times more space than a mysql db to store the same amount of data. Your findings might vary. Also if you have growing documents because you evolve your schema or add keys later Mongo might need to rewrite the doc onto another position on disk which is slow in an update. Check out the Padding Factor to understand this more deeply and how to avoid it. Use short keys and and preallocate documents in Mongo to not waste space.
Go for SSDs!
As you can see from the points above it’s really important to not be IO bound in any way. The major bottleneck are hard drives nowadays and SSDs are cheaper and cheaper so you should consider using them if you are working on a decent data size. They solve a lot of problems, even badly designed queries 😉 So to get a good feeling for your system it might be actually better to start with SATA and when you tuned that go to SSDs to really rock. SSDs are awesome and worth the price!
That’s it! If you want to check the sites which we currently power with this database have a look at PagesLike which is a tool to find similar websites which are related to another or WebInformationService where we analyze the visitors of a webpage and give additional technical information. UPDATE: We recently launched our german pedants as well. If you know german and would like to check them out please have a look at WebInformationService and PagesLike – Ähnliche Webseiten finden.