Single Source of Truth

Creating a reliable Database
database

One of the most important software engineering principles is to ensure that we have a single source of truth. For the context of creating web applications, the single source of truth is the database. Therefore, designing a database to reliably store and manipulate data is one of our utmost priorities.

To keep things simple for this beginner series, we will focus on a SQL Database. Examples will be given using an open-source Database Engine known as PostgreSQL. For other SQL Engines, e.g., Oracle, their syntax may slightly differ, however, the similar principles will apply. We will also be narrowing our scope to learn about database constraints and normalization. Other topics such as security and performance will be left for another day.

Database Constraints

constraints

Database constraints give you control over the data stored inside the tables

In order to ensure that our database is the single source of truth, we need to ensure that our validations occur at the database level. This can be done by adding constraints to what we can store in our database through the different kinds of data types.

Let's say we want to store a user's username into our database. We will be using a character varying data type. We can further strengthen the database validation by limiting the username to be a maximum of 50 characters long:

CREATE TABLE user(id serial primary key, username varchar(50));

Checks

In addition to using the different data types that Postgres provides, we can also add additional constraints. Let's say we want to store the price of an item in the database. Logically speaking we would want to store the price as a non-negative number with 2 decimal places. We can achieve this in the following manner


CREATE TABLE products (
  id integer primary key,
  name varchar(50),
  price numeric(5,2) CHECK (price >= 0)
);
                

The

numeric(5,2)
allows us to have a maximum and minimum of 5 digits of which 2 are decimal points(-999.99 to 999.99)

Primary Keys

Primary Keys are a unique identifier for a row of data

Primary keys are usually indexed automatically by the database to allow for fast access. For example, according to PostgreSQL documentation a B-Tree index is automatically created for primary keys.

Foreign keys

Foreign keys are references to a row with a primary key in another table

For Foreign keys, there are additional behaviors that you can set when the primary key is deleted.


CREATE TABLE order_items (
    product_id integer REFERENCES products ON DELETE RESTRICT,
    order_id integer REFERENCES orders ON DELETE CASCADE,
    quantity integer,
    PRIMARY KEY (product_id, order_id)
);
                

ActionDescription
NO ACTION (Default)Referenced row will not be deleted
RESTRICTPrevents Deletion of Primary Key
CASCADEReferenced rows will be deleted as well
Not Null

Not Null constraint does not allow null values

It is important to note that null does not mean empty string. If you want to validate for a non-empty string, you need to add an additional length check.


CREATE TABLE user (
  id serial primary key,
  username VARCHAR NOT NULL CONSTRAINT non_empty CHECK(length(username) > 0)
)
                

Unique

Unique constraint means that every value in the column must be unique

It is important to note that a unique contain can also be placed on multiple columns at the same time. For example, if we want to go back to our trustworthy AirBnB application, lets say we want to create a favorites table to store flats that the user saved. In order to prevent double insertions (saving to favorites twice), we add a unique constraint on user_id and flat_id


class CreateFavorites < ActiveRecord::Migration[6.1]
  def change
    create_table :favorites do |t|
      t.references :user, null: false, foreign_key: true, index: true
      t.references :flat, null: false, foreign_key: true, index: true
      t.timestamps
    end
    add_index :favorites, [:user_id, :flat_id], unique: true
  end
end
                

Database Normalization

database normalization

Normalization is a database design technique that reduces data redundancy, and anomalies.

Now that we have talk about the different database constrains for each column, we will now turn our attention to the table itself. To keep things simple, we will only focus from 1st Normal Form to 3.5 Normal Form, otherwise known as Boyce-Codd normal form(BCNF) which is usually the requirements needed to pass most technical assessments. There are greater levels of normalization but we won't be covering them here.

To keep things simple for the examples, we will assume that columns without the word ID contains more information related to that domain. E.g. Student will be used to represent StudentName, StudentAge etc.

1st Normal Form

Each Table Cell should contain a single value

Values stored in a column should be of the same data type

Columns and Tables should have unique names

Order in which the data is stored should not matter

Let's say we have to store the following information in the database about a students' grade report.


Student_Grade_Report (StudentId, Student, Major, Advisor, [Course, Grade])
                

It is easy to identify from the current table design that one student will take multiple courses and obtain one grade for each course. However, in order to satisfy the rule of each table cell should contain a single value and not an array of value, we should split the tables into the following manner.


Student (StudentId, Student, Major, Advisor)
StudentCourse (StudentId, Course, Grade)
                

So why do we want to divide the data this way, and why is it better? Well first of all, notice that by dividing the table into 2, updates can safely occur to both tables at the same time. If we were to go with our original design, if 2 read and writes were to occur at almost the same time, the data of one may override another:


Row1 = student, [course1, course2]
User1 and User2 reads Row1
User1 Adds course3 to Row1 and press submit
User2 Remove course2 to Row1 and press submit
Result: Row1 = student, [course1]
                

2nd Normal Form

Be in 1st Normal Form

There should be no partial dependency

What 2nd Normal Form means is that we should ensure that the relationship between the different domains are stored in a separate table, together with the data that is dependent on that relationship. For this case we have 2 different domains, student and course. The relationship table would allow us to know which courses the student are taking and its resulting grade. Therefore we end up with the following design:


Student (StudentId, Student, Major, Advisor)
StudentCourse (StudentId, CourseId, Grade)
Course (CourseId, Course)
                

This dependency table allows us to keep our data clean. Notice that before we used to have a lot of repeated data if we were to populate our StudentCourse table. The course data for all the students taking the same course will repeat for every record. This repeated data is bad because if we want to update the course information we have to find all the records containing the course details and update it. If there are 1 billion records in the database, this operation will a long time to complete. This change will also allow simultaneous updates of the course grade and the course information. E.g. Admin Staff updates the course information while the professor keys in the grades for the student. Both transactions can occur without interfering with each other.

3rd Normal Form

Be in 2nd Normal Form

No transitive functional dependencies

In order to get a table into the 3rd normal form, we need to take a look at each table in 2nd Normal Form and ask ourselves if it contains any further dependencies that should be separated. If we look at the student table, we will notice that it actually consist of 2 types information, the student information and the major along with their advisor. There is a dependency here as a student can take multiple majors, hence it is possible for a student to have 2 records in this table. In order to remove this dependency we should store these 2 items in separate tables.


Student (StudentId, Student)
StudentMajor (StudentId, Major, Advisor)
StudentCourse (StudentId, CourseId, Grade)
Course (CourseId, Course)
                

Boyce-Codd Normal Form(BCNF)

Be in 3rd Normal Form

A relation is in BCNF if, and only if, every determinant is a candidate key.

BCNF is actually a special case of 3rd Normal Form and is highly depended on the business rules. To keep things simple we will be focusing on the StudentMajor table. Let's say we have the following business rules:

  1. A Student can have multiple Majors
  2. For each Major, a Student have 1 Advisor
  3. A Major have several Advisors
  4. An Advisor can only advise one Major
  5. An Advisor can advise many Students


StudentMajor (StudentId, Advisor, Major)
                

If we take a look at our previous table design above we can see that we are not able to establish these business rules. From the table above we have no idea if Advisor can advise several Majors, or if a Major can have several Advisors. The only way we are able to establish these business rules is to split the tables into the following.


StudentAdvisor (StudentId, AdvisorId)
AdvisorMajor (AdvisorId, Advisor, MajorId)
Major (MajorId, Major)
                

By looking at the table above, we can now see. A student have an Advisor. An Advisor advises one major and a Major can have several advisors.

Conclusion

The database of our web application is our user's single source of truth. Therefore, we need to design it using the normalization rules to reduce redundancies and anomalies. To help developers, some companies employ a Database Administrator to approve the database design.

In addition, although the advantages of database normalization cannot be seen immediately, an incorrectly designed database will result in huge re-writes in the future. This will hinder the development velocity of a particular product. It may also result in performance issues as the application scales to handle thousands if not millions of users.

Now that we have covered the basics of backend design, I will be covering more of the frontend portion of the sprint planning in the following week.

Next Article: Designing the Interface coming on 07/03/2022

Last Updated: 28/02/2022