It may seem like I’ve forgotten about this blog lately, but that’s not the case. The truth is last week I was on vacation, and before and after that I’ve been working on a project tangentially related to home automation, which I’ll probably be posting lots about in a couple of weeks.
However, today I wanted to touch on a topic that many of you will be familiar with: database design. When we talk about database design, we mean a database schema or, more generally, an entity relationship diagram (ERD).
If you do any kind of data logging, or you’re using a database as the data store for your configuration data, you’ll have to do some kind of database design. Both of these cases call for a “normalized” design. In fact, de-normalized designs are typically only used for heavy-duty data-mining applications, so they’re pretty rare. The advantage of a normalized database is that it follows the “once and only once” (OAOO) software development principle, which says there should be one, and only one, definitive source for any particular fact. So, for instance, don’t store the operator’s name all over the place; rather, store the operator’s name in a table called Operator, include an OperatorId column that’s assigned once when the operator’s row is created and never changes, and then use the OperatorId as a foreign key in your other tables. This gives you several advantages: less database storage (an Id is typically shorter than a name), a single place to change the name (typos are common, and people do change their names), and if you do have to change it, you only have to lock one row during the transaction, instead of every row that uses this person’s name.
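Here’s a minimal sketch of that layout. The table and column names (Operator, OperatorId, WorkOrder) are hypothetical examples to illustrate the idea, not a prescribed schema:

```python
import sqlite3

# A normalized layout: the operator's name lives in exactly one place,
# and everything else refers to it by OperatorId.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Operator (
        OperatorId INTEGER PRIMARY KEY,   -- assigned once, never changes
        Name       TEXT NOT NULL
    );

    CREATE TABLE WorkOrder (
        WorkOrderId INTEGER PRIMARY KEY,
        OperatorId  INTEGER NOT NULL REFERENCES Operator (OperatorId),
        CompletedAt TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO Operator (OperatorId, Name) VALUES (1, 'Jane Smith')")

# Correcting a typo (or a legal name change) touches exactly one row,
# no matter how many work orders reference this operator.
conn.execute("UPDATE Operator SET Name = 'Jane Smyth' WHERE OperatorId = 1")
```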
That’s pretty standard stuff, but I want to take a slight tangent: by default, don’t store data you can calculate from other data. The reason is the same. For instance, you wouldn’t store a person’s age, you’d store their birth date, because the age changes all the time while the birth date never does. I’m not saying you’d never store a calculated value, but doing so is an optimization, and “premature optimization is the root of all evil.”
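In code, that just means deriving the value on demand. A quick sketch (the birth date shown is made up):

```python
from datetime import date

def age_on(birth_date: date, as_of: date) -> int:
    """Whole years elapsed between birth_date and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)

# Store the birth date; compute the age whenever you need it.
print(age_on(date(1980, 6, 15), date.today()))
```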
Let me give you a real-life example. Let’s say you wanted to record the production throughput of an automobile assembly line. Let’s assume you’re already storing the VIN of each vehicle, along with some other data (various part serial numbers, etc.). I’ve seen implementations where someone added a new table called LineThroughput, with one row per time period and a counter in each row (in fairness, I’ve done it too). Every time a vehicle comes off the line, the application finds the applicable row and increments the counter (or adds a new row as required). PLC programmers are particularly likely to do this because we’re used to having limited memory in the PLC, and PLCs come with built-in counter instructions that make this really easy. However, this is a subtle form of denormalization. The database already knows how many vehicles were made, because it has a record for each VIN. All you have to do is make sure it has a datetime column recording when the vehicle rolled off the line; a simple query will then give you the total number of vehicles in any time period. If you go the route of adding the LineThroughput table, you risk a numerical discrepancy (maybe the database isn’t available when you go to increment the counter, for instance).
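Here’s a sketch of the normalized alternative: no counter table at all, just a completion timestamp on each vehicle’s row. The table and column names (Vehicle, Vin, CompletedAt) and the sample data are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Vehicle (
        Vin         TEXT PRIMARY KEY,
        CompletedAt TEXT NOT NULL   -- when the vehicle rolled off the line
    )
""")
conn.executemany(
    "INSERT INTO Vehicle (Vin, CompletedAt) VALUES (?, ?)",
    [
        ("VIN0001", "2011-05-02 08:12:00"),
        ("VIN0002", "2011-05-02 08:47:00"),
        ("VIN0003", "2011-05-02 09:05:00"),
    ],
)

# Throughput for any period is just an aggregate over rows you already have,
# so it can never disagree with the per-VIN records.
for hour, built in conn.execute("""
    SELECT strftime('%Y-%m-%d %H:00', CompletedAt) AS Hour,
           COUNT(*)                                AS VehiclesBuilt
    FROM   Vehicle
    GROUP  BY Hour
    ORDER  BY Hour
"""):
    print(hour, built)
```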
Just storing the datetime field has one more advantage: the database becomes more “immutable”. If data is only ever written, never updated, errors are less likely. If you do want to create a summary table later (for performance reasons, because you query it a lot), you can create each summary row once its time period is over, and once you’ve written the record, you never have to update it again. Again, this is better because the row is “immutable”. The data is supposed to be a historical record. Pretend it’s written in pen, not pencil. (You might be horrified to know that some electronic voting machines seem to use the LineThroughput-table method to record your votes, which makes them extremely susceptible to vote-tampering.)
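Continuing the assembly-line sketch, a write-once summary might look like this. Again, all names here (HourlyThroughput, close_out_hour) are hypothetical illustrations, not a recommended API:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Vehicle (
        Vin         TEXT PRIMARY KEY,
        CompletedAt TEXT NOT NULL
    );
    CREATE TABLE HourlyThroughput (
        HourStart     TEXT PRIMARY KEY,
        VehiclesBuilt INTEGER NOT NULL
    );
""")

def close_out_hour(conn, hour_start, hour_end):
    """Summarize one finished hour; the resulting row is never touched again."""
    conn.execute("""
        INSERT INTO HourlyThroughput (HourStart, VehiclesBuilt)
        SELECT ?, COUNT(*)
        FROM   Vehicle
        WHERE  CompletedAt >= ? AND CompletedAt < ?
    """, (hour_start, hour_start, hour_end))

# Run only after the period has ended, so the summary row is written in pen.
close_out_hour(conn, "2011-05-02 08:00:00", "2011-05-02 09:00:00")
```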
I hope that’s enough information to make my points: normalize your database, don’t record redundant information or information you can calculate, and avoid situations where you have to update rows repeatedly, particularly if you’re doing data logging.