Data Normalization

9 November 2020, 11:19 am by antelove19

Sources:

Wrox Press Beginning MySQL 2005 1


One of the concepts most important to a relational database is that of normalized data.

Normalized data is organized into a structure that preserves the integrity of the data while minimizing redundancy. Storing each fact only once helps prevent lost data and inconsistent data.

A normalized database is one whose tables are structured according to the rules of normalization. These rules — referred to as normal forms — specify how to organize data so that it is considered normalized.

In his early work on the relational model, E. F. Codd defined three normal forms. More normal forms have been introduced since then, but the first three remain the most critical to the relational model.

The degree to which a database is considered normalized depends on which normal forms its tables satisfy.

For example, some database designs aim for only the second normal form, whereas others strive to conform to the fourth or fifth normal form. There is often a trade-off between strict adherence to the normal forms and system performance. Often, the more normalized the data, the more taxing it can be on a system, because related data is spread across more tables that must be joined back together at query time. As a result, a database design must strike a balance between a fully normalized database and system performance. In most situations, the first three normal forms provide that balance.

First Normal Form

Of all the normal forms, the first is the most important. It provides the foundation on which all other normal forms are built and represents the core characteristics of any table. To be in compliance with the first normal form, a table must meet the following requirements:

  • Each column in a row must be atomic. In other words, the column can contain only one value for any given row.
  • Each row in a table must contain the same number of columns. Given that each column can contain only one value, this means that each row must contain the same number of values.
  • All rows in a table must be different. Although rows might include the same values, each row, when taken as a whole, must be unique in the table.

Take a look at an example to help illustrate these requirements.

Figure 4-2 (Wrox Press, Beginning MySQL, 2005)

Figure 4-2 contains a table that violates the first normal form.

For example, the fifth row contains two values in the BookTitle column: Postcards and The Shipping News. Although a value can consist of more than one word, as in The Shipping News, only one value can exist in a column. As a result, the BookTitle column for that row is not atomic because it contains two values. In addition, the row as a whole contains more values than the other rows in the table, which also violates the first normal form.

Another way in which the table violates the first normal form is found in the second and third rows, which are identical. Duplicate rows can exist for a number of reasons, and without the necessary data to distinguish them, you cannot tell whether this is an error in data entry or whether there are supposed to be two records for this one book.

For example, the rows might be duplicated because they refer to different editions of the book, or perhaps the book has been translated into different languages. The point is, each row in a table must be unique.
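The three requirements can also be checked mechanically. The following is a minimal sketch rather than the book's method: it assumes a table is held as a list of Python tuples, with a list standing in for a column that holds more than one value. The function name and the sample rows are illustrative and do not reproduce Figure 4-2.

```python
def first_normal_form_violations(rows):
    """Return a list of first-normal-form problems found in `rows`.

    `rows` is a list of tuples. A list, tuple, or set inside a row stands
    in for a column that holds more than one value.
    """
    problems = []

    # Every row must contain the same number of columns.
    if len({len(row) for row in rows}) > 1:
        problems.append("rows have different numbers of columns")

    # Every column in every row must hold a single (atomic) value.
    for number, row in enumerate(rows, start=1):
        if any(isinstance(value, (list, tuple, set)) for value in row):
            problems.append(f"row {number} holds a non-atomic value")

    # Every row, taken as a whole, must be unique.
    # repr() is used because rows containing lists are not hashable.
    if len({repr(row) for row in rows}) < len(rows):
        problems.append("duplicate rows")

    return problems


# Illustrative rows only (author last name, book title).
SAMPLE_ROWS = [
    ("Thompson", "Hell's Angels"),
    ("Thompson", "Hell's Angels"),                    # duplicate row
    ("Proulx", ["Postcards", "The Shipping News"]),   # two titles in one column
]

print(first_normal_form_violations(SAMPLE_ROWS))
```

A table that passes all three checks returns an empty list.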

To conform to the first normal form, you must eliminate the multiple values in the BookTitle column, ensure that each row contains the same number of values, and avoid duplicated rows. One way to achieve the necessary normalization is to place the data in separate tables, based on the objects represented by the data.

In this case, the obvious place to start is with authors and books. All data related to authors is placed in one table, and all data related to books is placed in another table, as shown in Figure 4-3.

Figure 4-3 (Wrox Press, Beginning MySQL, 2005)

Notice that a row has been created for each book and that a translation-related column has been added to that table. This eliminates the duplicated rows, as long as two books with the same name are not translated into the same language.

To get around the possibility of two rows containing data about books with the same name and language, an identifying column (BookID) is added to the table and configured as the primary key (shown in gray). Because the column is the primary key, each value in the column must be unique. As a result, even the rows that contain duplicated book titles and languages remain unique from one another (when taken as a whole). The same is true of the Authors table. Because the AuthID column is defined as a primary key (shown in gray), authors can share the same name and each row will still be unique.
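As a rough illustration of how a primary key keeps such rows distinct, the sketch below uses Python's built-in sqlite3 module rather than MySQL so that it runs without a server. The column name Translation, the helper names, and the BookID values are assumptions made for the example and are not taken from Figure 4-3.

```python
import sqlite3


def new_books_table():
    """Create an in-memory Books table with BookID as the primary key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE Books (
            BookID      INTEGER PRIMARY KEY,  -- each value must be unique
            BookTitle   TEXT NOT NULL,
            Translation TEXT                  -- hypothetical column name
        )
        """
    )
    return conn


def insert_book(conn, book_id, title, translation):
    """Insert one book; return False if the primary key rejects the row."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Books (BookID, BookTitle, Translation) "
                "VALUES (?, ?, ?)",
                (book_id, title, translation),
            )
    except sqlite3.IntegrityError:
        return False
    return True


conn = new_books_table()
# Same title and language, different BookID values: both rows are accepted.
print(insert_book(conn, 1, "Postcards", "English"))
print(insert_book(conn, 2, "Postcards", "English"))
# Reusing a BookID value is rejected, whatever the other columns contain.
print(insert_book(conn, 2, "The Shipping News", "French"))
```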

By creating tables for both authors and books, adding a primary key column to each table, and placing only one value in each column, you are ensuring that the data conforms to the first normal form. As you can see in Figure 4-3, a third table (AuthorBook) is also being used. This table allows you to match the IDs for authors and books in a way that supports books written by multiple authors, authors who have written multiple books, and multiple authors who have written multiple books. Had you tried to match the authors to their books in one of the two tables, the table would potentially fill with enormous amounts of redundant data, which would defeat one of the purposes of the relational database.
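The three-table arrangement might look something like the following sketch, again written against sqlite3 for portability. The table names come from the text above; the exact column lists, the ID values, and the helper functions are assumptions for the sake of the example.

```python
import sqlite3

# Column lists are assumed; Figure 4-3 may contain more columns.
SCHEMA = """
CREATE TABLE Authors (
    AuthID INTEGER PRIMARY KEY,
    AuthFN TEXT,
    AuthLN TEXT
);
CREATE TABLE Books (
    BookID    INTEGER PRIMARY KEY,
    BookTitle TEXT
);
CREATE TABLE AuthorBook (
    AuthID INTEGER REFERENCES Authors(AuthID),
    BookID INTEGER REFERENCES Books(BookID),
    PRIMARY KEY (AuthID, BookID)
);
"""


def build_library():
    """Create the three tables and load a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Authors VALUES (1009, 'Annie', 'Proulx')")
    conn.executemany(
        "INSERT INTO Books VALUES (?, ?)",
        [(1, "Postcards"), (2, "The Shipping News")],
    )
    # One author matched to two books; no author data is repeated in Books.
    conn.executemany(
        "INSERT INTO AuthorBook VALUES (?, ?)",
        [(1009, 1), (1009, 2)],
    )
    return conn


def titles_by_author(conn, auth_id):
    """Follow AuthorBook from an author ID to the matching book titles."""
    rows = conn.execute(
        "SELECT b.BookTitle FROM AuthorBook ab "
        "JOIN Books b ON b.BookID = ab.BookID "
        "WHERE ab.AuthID = ? ORDER BY b.BookTitle",
        (auth_id,),
    )
    return [title for (title,) in rows]


print(titles_by_author(build_library(), 1009))
```

Adding a co-author to a book would require only one more row in AuthorBook, with nothing changed in Authors or Books beyond the new author's own row.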

Another thing to notice is that a primary key has been defined on the AuthorBook table. The primary key is made up of two columns. (Both columns are shown in gray.) As a result, each set of values in the two columns must be unique. In other words, no two rows can contain the same AuthID and BookID values, although values can be repeated in individual columns. For example, the AuthID value of 1009 is repeated, but each instance of that value is associated with a different BookID value. Because of the primary key, no AuthID value can be associated with a BookID value more than once.

A primary key made up of more than one column is referred to as a composite primary key.
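A small sketch of a composite primary key in action, assuming sqlite3 and hypothetical helper names; the BookID values are made up for the example.

```python
import sqlite3


def new_author_book_table():
    """Create an AuthorBook table whose primary key spans both columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE AuthorBook (
            AuthID INTEGER NOT NULL,
            BookID INTEGER NOT NULL,
            PRIMARY KEY (AuthID, BookID)  -- the pair must be unique
        )
        """
    )
    return conn


def link_author_to_book(conn, auth_id, book_id):
    """Record that an author wrote a book; return False for a repeated pair."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO AuthorBook (AuthID, BookID) VALUES (?, ?)",
                (auth_id, book_id),
            )
    except sqlite3.IntegrityError:
        return False
    return True


conn = new_author_book_table()
print(link_author_to_book(conn, 1009, 1))  # accepted
print(link_author_to_book(conn, 1009, 2))  # accepted: AuthID repeats, pair is new
print(link_author_to_book(conn, 1009, 1))  # rejected: this pair already exists
```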

It might appear as though all you’ve done by creating these tables is to end up with a lot more data than you started with. The example here, however, represents only a small amount of data. The advantages of normalizing data are best seen when working with large quantities of data.

Second Normal Form

The second normal form builds on and expands the first normal form. To be in compliance with the second normal form, a table must meet the following requirements:

  • The table must be in first normal form.
  • All nonprimary key columns in the table must be dependent on the entire primary key.

Given that the first of these two rules is fairly obvious, take a look at the second one. As you’ll recall from earlier in the chapter, a primary key is one or more columns in a table that uniquely identify each row so that no two rows, when taken as a whole, are identical. To illustrate how the second normal form works, first take a look at an example of a table that violates the second normal form.

In the AuthorBook table in Figure 4-4, a primary key is defined on the AuthLN and BookTitle columns. Together, the values in these two columns must uniquely identify each row in the table.

Figure 4-4 (Wrox Press, Beginning MySQL, 2005)

You can see how the primary key works in the fourth and fifth rows, which are related to the author Annie Proulx. Although both rows are concerned with the same author, they refer to different books. As a result, the values Proulx and Postcards identify one row, and the values Proulx and The Shipping News identify the second row. Although the values in either one of the individual primary key columns can be duplicated (in that column), the values in both columns, when taken as a whole, must be unique. This is another example of a composite primary key.

Now examine how this table applies to the second normal form. As previously stated, all nonprimary key columns in the table must be dependent on the entire primary key, which, in this case, is made up of the author’s last name and the book title. Based on the way that the table is currently defined, the AuthFN and AuthMN columns are dependent on the AuthLN column, and the Copyright column is dependent on the BookTitle column. The AuthFN and AuthMN columns are not dependent on the BookTitle column, though, and the Copyright column is not dependent on the AuthLN column. As a result, the table violates the second normal form.
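To make the idea of depending on only part of the key concrete, here is a minimal sketch that searches sample rows for such partial dependencies. It is an illustration, not a design tool: data can only disprove a dependency, never prove one, so the result is only as good as the sample. The rows are illustrative rather than copied from Figure 4-4, the AuthMN column is omitted for brevity, and the function names are hypothetical.

```python
from itertools import combinations


def determines(rows, determinant, dependent):
    """True if rows that agree on `determinant` always agree on `dependent`."""
    seen = {}
    for row in rows:
        key = tuple(row[column] for column in determinant)
        value = tuple(row[column] for column in dependent)
        if seen.setdefault(key, value) != value:
            return False
    return True


def partial_dependencies(rows, key, non_key):
    """List (key subset, column) pairs where part of the key fixes a column."""
    found = []
    for column in non_key:
        for size in range(1, len(key)):  # proper subsets of the key only
            for subset in combinations(key, size):
                if determines(rows, subset, [column]):
                    found.append((subset, column))
    return found


# Illustrative rows; the co-authored title keeps BookTitle from
# appearing to determine the author's first name.
ROWS = [
    {"AuthFN": "Hunter", "AuthLN": "Thompson", "BookTitle": "Hell's Angels", "Copyright": 1966},
    {"AuthFN": "Annie", "AuthLN": "Proulx", "BookTitle": "Postcards", "Copyright": 1992},
    {"AuthFN": "Annie", "AuthLN": "Proulx", "BookTitle": "The Shipping News", "Copyright": 1993},
    {"AuthFN": "Terry", "AuthLN": "Pratchett", "BookTitle": "Good Omens", "Copyright": 1990},
    {"AuthFN": "Neil", "AuthLN": "Gaiman", "BookTitle": "Good Omens", "Copyright": 1990},
]

print(partial_dependencies(ROWS, ("AuthLN", "BookTitle"), ("AuthFN", "Copyright")))
```

On this sample the function reports that AuthFN follows from AuthLN alone and Copyright follows from BookTitle alone, which are the two dependencies described above.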

Another problem with the table is the columns used for the primary key. By defining the primary key in this way, you’re assuming that two authors with the same last name won’t write a book with the same title and that no one author will write two books with the same title. This assumption, though, might not necessarily be true. If two authors with the same last name write books with the same title, the primary key would prevent you from adding the second book to the table.

The most effective way to normalize the data in the AuthorBook table is to use the solution that you saw for the first normal form: Create a table for the authors and one for the books, add a primary key column to each table, and create a third table that matches up the identifiers for authors and books, as shown in Figure 4-5. For the Authors table, the primary key is the AuthID column, and for the Books table, the primary key is the BookID column. Now the columns in each table are dependent on their respective primary keys, and no columns exist that are not dependent on the primary key.

Figure 4-5 (Wrox Press, Beginning MySQL, 2005)

In addition, a primary key has been defined on the AuthID and BookID columns of the AuthorBook table. As a result, the combination of AuthID and BookID values in each row must be unique across the table. Because there are no other columns in this table, the issue of dependent columns is not a concern, so you can assume that this table also conforms to the second normal form.

Third Normal Form

As with the second normal form, the third normal form builds on and expands the previous normal form. To be in compliance with the third normal form, a table must meet the following requirements:

  • The table must be in second normal form.
  • All nonprimary key columns in the table must be dependent on the primary key and must be independent of each other.

If you take a look at Figure 4-6, you see an example of a table that violates the third normal form. Notice that a primary key is defined on the BookID column. For each book, there is a unique ID that identifies that book. No other book can have that ID; therefore, all characteristics related to that book are dependent on that ID. For example, the BookTitle and Copyright columns are clearly dependent on the primary key. For each book ID, there is a title and a copyright date.

To illustrate this better, take a look at the first row in the table. As you can see, the book is assigned a BookID value of 14356. The title for this ID is Hell’s Angels, and the copyright is 1966. Once that ID is assigned to that title and copyright, that title and copyright become dependent on that ID. It identifies that title and copyright as part of a unique row. Despite their dependence on the primary key, the BookTitle and Copyright columns are independent of each other. In other words, you can include the BookTitle and the Copyright columns, but you don’t necessarily need to include both because neither depends on the other for its meaning.

The ChineseSign column is very different from the BookTitle and Copyright columns. It provides the Chinese astrological year sign for the year that the book was copyrighted. The ChineseSign value has nothing to do with the BookID and is not related to the book itself. Instead, the ChineseSign column is totally dependent on the Copyright column. Without the Copyright column, the ChineseSign column would have no meaning. As a result, the ChineseSign column violates the third normal form.
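One way to see that ChineseSign depends on Copyright rather than on BookID is that the sign can be worked out from the year alone. The sketch below is a simplification that treats each calendar year as a single sign and ignores the fact that the Chinese year begins in late January or February; the function name is hypothetical.

```python
# The twelve signs in cycle order, starting from Rat.
SIGNS = (
    "Rat", "Ox", "Tiger", "Rabbit", "Dragon", "Snake",
    "Horse", "Goat", "Monkey", "Rooster", "Dog", "Pig",
)


def chinese_sign(copyright_year):
    """Return the sign for a year using only the year itself.

    1984 was a Rat year and 1984 - 4 is divisible by 12, so
    (year - 4) % 12 gives the position in SIGNS.
    """
    return SIGNS[(copyright_year - 4) % 12]


# The BookID (14356) and the title play no part in the calculation.
print(chinese_sign(1966))
```

Because the value is fully determined by a nonkey column, storing it in the same table as the books repeats the same year-to-sign fact for every book that shares a copyright year.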

Figure 4-6 (Wrox Press, Beginning MySQL, 2005)

To ensure that the data conforms to the third normal form, you should separate the data into two tables, one for books and one for Chinese astrological years, as shown in Figure 4-7. From there, you should assign a primary key to the Year column of the ChineseYears table. Because each year must be unique, it is a good candidate for a primary key column. You don’t necessarily have to add a column to a table to use as a primary key if an existing column or columns will work.

By separating the data into two tables, each column is now dependent on its respective primary key, and no columns are dependent on nonkey columns.
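A sketch of the two-table split, using sqlite3 and the one row quoted in the text. The table name ChineseYears and its Year column come from the text above; the remaining column names and the helper functions are assumptions.

```python
import sqlite3


def build_third_normal_form():
    """Create Books and ChineseYears and load one illustrative row in each."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ChineseYears (
            Year        INTEGER PRIMARY KEY,  -- existing column used as the key
            ChineseSign TEXT NOT NULL
        );
        CREATE TABLE Books (
            BookID    INTEGER PRIMARY KEY,
            BookTitle TEXT NOT NULL,
            Copyright INTEGER REFERENCES ChineseYears(Year)
        );
        """
    )
    conn.execute("INSERT INTO ChineseYears VALUES (?, ?)", (1966, "Horse"))
    conn.execute(
        "INSERT INTO Books VALUES (?, ?, ?)",
        (14356, "Hell's Angels", 1966),
    )
    return conn


def sign_for_book(conn, book_id):
    """Look up a book's sign by joining through its copyright year."""
    row = conn.execute(
        "SELECT y.ChineseSign FROM Books b "
        "JOIN ChineseYears y ON y.Year = b.Copyright "
        "WHERE b.BookID = ?",
        (book_id,),
    ).fetchone()
    return row[0] if row else None


print(sign_for_book(build_third_normal_form(), 14356))
```

The sign is now stored once per year instead of once per book, and it can still be retrieved for any book with a join.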

Figure 4-7 (Wrox Press, Beginning MySQL, 2005)

By making certain that the data conforms to the third normal form, you’re ensuring that it has been normalized according to all three normal forms. Although there are more normal forms that you can conform to, the first three meet most database design needs. If you plan to focus heavily on database design or plan to design complex databases, you’re encouraged to research other references for more details about all normal forms and the relational model.

In the meantime, you can go a long way toward achieving a normalized database by thinking in terms of separating data into entities, that is, discrete categories of information. For example, books represent one entity; publishers represent another. If you keep in mind that each table should represent a distinct entity, you are well on the way to designing a database that achieves the third normal form.


  1. Wrox Press, "Beginning MySQL", 2005, page 111



