98-364 MTA Microsoft Database Fundamentals – Relational Database Concepts
Normalization is the process of structuring a relational database in order to improve data integrity and reduce data redundancy. So there are basically two main reasons why we would want to normalise a database. The first one is to improve data integrity by creating rules for our data and enforcing these rules. The second one is to reduce data redundancy by splitting tables into smaller logical units and creating relationships between them. Imagine you have a customer table storing the name, date of birth, email, and address information of each customer. This works great, but then you come across a customer who has two addresses. To store both of this customer’s addresses, you’d need two entries for this customer. As a result of this, information such as name, date of birth, and email for one customer would appear twice in the customer table.
This is what we call redundant data, and it is the major cause of data inconsistency. This is where normalisation comes in. To normalise a relational database, there are a set of rules we have to follow. These set of rules are referred to as the “normal forms.” For the exam, it is imperative that you know how to normalise a relational database to at least the third normal form. These are usually the standard normal forms used when normalising a relational database. So the first normal form of a relation in a relational database is the first normal form only if the following conditions are met: the table must have no duplicate rows. Secondly, the table must not have multivalued attributes. Let’s say we have a table where we store data about our products, such as name, product number, and color. Say we have a product that is both white and blue. Storing both colours in the same column would then be a violation of this criterion. To satisfy this criterion, we need to have only atomic or single-valued attributes. So either we store blue or white or come up with a brilliant way to indicate a mixed-colored product while still maintaining an atomic value. Third, the entries in the column or attribute must be of the same data type.
If a column is, for example, made to store text values, then you shouldn’t insert date and time values into that column. And last but not least, the table must not have repeating attributes. Repeating attributes are columns that repeat the same data across rows. Let’s look at an example. We have been given the following table: Our duty is to normalise this table so it satisfies the first normal form. If we look at the table, we see that there are no duplicate rows. Despite the fact that this relationship is unique, it remains in an unnormalized state. That’s because we have multivalued attributes in the name column, and we also have repeating attributes. Having the last name and first name stored in a single column makes it very inefficient to search based on specifically the last name or the first name. Also, sorting on one of these values becomes very problematic. To fix this, we’re going to split the name column into a last name and a first name column. Once we’ve done that, we get the following result: The next thing we’re going to look at are the repeating attributes in this relationship. The repeating attributes in this table are the course one and course two attributes. Both attributes repeat the same data across each row. As you can see, Liam Smith appears twice in this table. That’s because he has enrolled in four courses.
Assume someone has enrolled in ten courses. This would mean that we would see the same last name, first name, and birth date combination at least five times. and that’s a lot of redundant data. Another reason why a repeating group should always be avoided is because they produce a lot of nulls. If someone, for example, enrolled in an uneven number of courses, for example, three or five courses, we would have a null in one of the columns. While nulls are not forbidden, it’s always good practise to avoid them if possible. So to fix this, we’re going to remove these two columns and create a new relationship for managing data about the courses. Once we’ve done that, we come up with the following results: But we’re still not done yet. In our unnormalized table, we had an analogical relationship between students and courses. Basically, a student could take multiple courses, and a single course could be taken by multiple students. When we have these types of relationships, we speak of a many-to-many relationship. To reduce data redundancy, we introduce an associative entity, or a junction table. This table will keep track of the relationships between the students and the courses. Because a student can only take each course once, every combination between the student ID and the course ID will be unique. Therefore, we can make the primary key a composite key consisting of both the student ID and the course ID columns. Our table is now normalised to the first normal form.
Let’s take a look at the second normal form. Only if the following conditions are met does a relation have the second normal form: there should be no partial dependencies and no partial dependences. This means that every non-primary key attribute should be fully functional and dependent on the primary key. So if we have a table with a composite primary key, every column that is not part of the primary key should be fully functional and dependent on all the primary keys. It’s very important to know that the second normal form only applies to relations where we have a composite primary key and the table itself has at least one non-primary key attribute. If we have a relation where we have a single primary key attribute and that relation is already normalised to the first normal form, it is safe to say that the relation is also at least in the second normal form. In our previous example, the primary key consisted only of the student ID column. So, to COVID the second normal form, we’ll look at a relation where the primary key is also a composite key. This table stores the grades a student receives for a specific course and the name of that course. The primary key is a combination of the student ID and the course ID column. Each row in this relation is uniquely identified by this combination.
Therefore, the relation is normalised to at least the first normal form. To verify if this table is in the second normal form, we have to test the dependency of the grade and the course name columns. Let’s first look for partial dependencies. A partial dependency is a situation where a non-primary key attribute depends only on a part of the primary key and not on the whole primary key. There are a number of questions we can ask when testing for partial dependencies. Let’s first examine the grade column. If I have the student ID, will I be able to determine a single grade? The answer is no. A student can have multiple grades for multiple courses, so it’s impossible to determine a single grade based on only the student ID. There is no partial dependency here. If I have the course ID, would I be able to determine a single grade? The answer is also no. Multiple students can have the same course and thus have multiple grades, so it’s impossible to determine a single grade based on only the course ID. Also, there is no partial dependency here. Because we haven’t identified any partial dependencies for the grade column, the grade column is fully functional and dependent on the whole primary key. Let’s examine the course name column.
If I have the student ID, would I be able to determine a single course name? The answer is no. Again, the same reason applies. A student can have multiple courses, and each of these courses will have a different name. So it’s not possible to determine a single-course name based on only the student ID. There is no partial dependency here. If I have the course ID, would I be able to determine a single course name? The answer is yes. If I were to ask you what the course name is, where the course ID is equal to 1, you’d be able to give me one single answer. Same if I were to ask what the course name is where the course ID is equal to five. Even though there are two rows where the course ID is equal to five, you’d still be able to give me a single answer. This means that the course name is functionally dependent on the course ID. Since the course ID is part of the primary key, the course name is partially dependent on the whole primary key and violates the second normal form. So these are the dependencies we have identified in this relationship.
A combination of the student ID and the course ID determines the grade, which is a fully functional dependency. A combination of the student ID and the course ID determines the course name, which is a partial dependency because the course ID, which is part of the primary key, determines the course name. The easiest way to get rid of this partial dependency is to create a new table where we manage the courses, remove the course name column from the student grade table, and enforce a relationship between these two tables based on the Coursed column. Once we’ve done that, we get the following result: the student grade table is now normalised to the second normal form. Let’s look at the third normal form. A relationship belongs to the third normal for Only if the following conditions are met: therelation must be in the second normal form, and no transitive dependencies exist. This means that no primary key attributes should not be transitively dependent on the primary key. Transitive dependency is a term used to indicate an indirect dependency. Let’s look at a different example where the table is already normalised to the second normal form. In this table, there’s a single primary key attribute, namely the employee ID.
To normalise this relation to the third normal form, we need to search for transitive dependencies and remove them. If we have the employee ID, we can determine the last name, the first name, the department ID, and the department name. If we look at the department ID and the department name columns, you’ll notice that the department name can also be determined by the department ID column. Because we can determine the department name using the department ID, the department name is fully functionally dependent on the department ID. Since the department ID is fully functionally dependent on the employee ID, we have a transitive dependency between the employee ID and the department name. These are the dependencies we have identified in this relationship. We need to remove the transitive dependency between the employee ID and the department name. To achieve this, we’ll create a new table that manages the departments. Once we’ve done that, we come up with the following result: our table is now normalised to the third normal form. We’ve now covered the most commonly used normal forms when normalizing a relational database. That’s it for this video. In the next video, we’re going to look at the different types of constraints we can use to enforce data integrity. Thank you for watching. see you then.
Data integrity refers to the accuracy, consistency, and reliability of data that is stored in a database. In the last decade, we’ve seen that data is frequently used to make business decisions or predictions. Now, imagine you have an inaccurate database. This would obviously result in a lot of inaccurate business decisions, which might eventually lead to business failure. Therefore, it’s imperative that you have data integrity within your database. You, who are watching as a future database designer or database developer, will be responsible for implementing data integrity, and that is done primarily using constraints. So what is a constraint in general? A constraint is a rule or limit. We can define one or a number of attributes to enforce data integrity.
This way, we can make sure that only valid data is entered into the database, which will result in only valid data being retrieved from the database. There are several types of constraints we can use. First, we have the data type constraint. Whenever we specify a data type for a column, we’re basically creating a hard rule for that column. This rule defines what type of data can be inserted into the column. Certain data types provide the ability to specify parameters to further restrict the data, such as defining the maximum size of data that can be entered or the maximum number of decimals a numerical value can have. The data type constraint is the most important constraint because not only does it specify what type of data can be stored, but it also defines how that data will be stored and the amount of space that will be used to store the data. This is why it’s important to always choose the right data type.
Secondly, we have the primary key constraint, which we have talked about in the previous videos. A table can have only one primary key and can be declared on one or a number of attributes to enforce uniqueness within a table. This means that there can never be duplication of values within the primary key columns. Another thing about this constraint is that it doesn’t accept null markers. A null marker indicates an empty or unknown field. So either you have the database generate the values for a primary key attribute or you explicitly supply them when inserting new data. Next, we have the foreign key constraint. This type of constraint is used to enforce referential integrity. What this essentially means is that if we have two tables that relate to each other, we need to ensure that the relationship is always valid. We can do this by using the primary key of the parent table as a foreign key in the child table; this way, only the values in the parent table can be referenced in the child table. Then we have the unique key constraint.
The unique key constraint is used to enforce uniqueness in one column or in a combination of columns. Let’s say, for example, you have a user table and don’t want to allow someone to use an existing username. So in this case, you’d create a unique key constraint on the username column, thus enforcing that no duplicate usernames exist in comparison to the primary key constraint. The unique key constraint accepts only one occurrence of a null marker and can be defined multiple times within a table. Then we have the null ability constraint. The nullability constraint allows us to specify whether a column is optional or required. Imagine you have a tablet that stores customer information. One of the most important pieces of information about a customer is the customer’s name. We don’t want a situation where we have a customer whose name is not defined. So when we create a new customer, we need to ensure that a value has to be entered for this attribute. Otherwise, we don’t want the database to accept the data. On the other hand, we also have attributes that are less important, but we still like to store them if they exist.
For example, a secondary address It’s a piece of nice information to have about a customer, but it’s not necessarily required; it’s just optional. So in conclusion, with the null ability constraint, we can determine whether it’s necessary or not that data be entered for a specific field when adding or modifying data. Then we have the check constraint. With this constraint, we can create complex criteria for our data. Let’s say we accept only customers who are at least 18 years old. We could then create a check constraint that would validate the age of the customer. If the customer is at least 18 years old, the row will be inserted, while if the customer does not meet the criterion, the statement will raise an error. This is just a simple example of what a check constraint can do. Finally, we have the default constraint. This constraint is used for providing a default value if no value was specified by the user when adding or modifying data. These are all the constraints you should be familiar with. That’s it for this part. Thank you for watching. In the next video, we’re going to discuss COVID database keys. see you then.
They ensure that each row within a table can be uniquely identified by one or a combination of attributes. Secondly, they help to establish and identify the relationship between relations, along with enforcing referential integrity in relational databases. We have a number of keys, but in this course we’re only going to cover a few of them. We’re going to start with the candidate key. A candidate key is one attribute or the sort-of-smallest combination of attributes that can uniquely identify each row in a table. Let’s take a look at an example. Here we have a customer table that stores certain information about the customers of a company.
To identify the candidate keys in this relation, we must first determine which attributes can be used to uniquely identify each row. If we look at the first column, you’ll notice that we only have unique values here. The name also gives a nice hint. When we have a column with a prefix or suffix ID, it’s usually a column that can be used to uniquely identify either a group of rows or a single row. In this case, the customer ID can be used to uniquely identify each row. This is our first candidate key. Next, we have the company name. Generally, the company name can be used to uniquely identify each row. However, due to the fact that we have customers in about 125 countries, it might be possible to encounter duplicate values in this column. So this column on its own cannot be used to uniquely identify each row. Then we have the contact name. The likelihood of someone having the same name, specifically the same first name, initial, and family name, is extremely low. Therefore, this column qualifies as a candidate key and can also be used to uniquely identify each row. The country attribute comes next. There’s not a lot to say here, as it is obvious that this column does not qualify as a candidate key.
Next, we have the phone number attribute. Having the country code registered along with a phone number makes it nearly impossible for someone to have the same phone number. So the phone number can also be used to uniquely identify each row. Finally, we have the email attribute. This is also an easy one. It’s not possible for two people to have the same email address. Therefore, this column also qualifies as a candidate key. The company name cannot be used as a candidate key on its own. Because we have customers all over the world, it is inevitable that we’ll come across companies that share the same name. However, having two registered companies that share the same name within a single jurisdictional country is not possible. Therefore, if I combine the company name attribute with the country attribute, I would be able to uniquely identify each row, making this combination a candidate key. These are all of the candidate keys found in this relation. The next key we’re going to look at is the primary key. A primary key is the candidate key on which the primary key constraint will be defined and is the main reference key for the table. In the previous slide, we identified the following candidate keys: Our duty now is to choose which one of these keys will be the most appropriate key to become the primary key. When choosing the primary key, there are a few things we need to pay attention to. First of all, you don’t want a primary key whose values might be changed at some point in the future.
This can have a serious impact on the system. Therefore, the contact name, phone number, and email aren’t suitable to be the primary key. That leaves two candidate keys. When we create a primary key constraint, a clustered index is also created by default. The cost of indexes is that they need to be maintained. Maintaining a large index is more expensive than maintaining a small index, so having a primary key consisting of multiple attributes should be a carefully thought-out decision. The most appropriate candidate key to use as the primary key for this table is the customer ID attribute. The other candidate keys that are not chosen to be the primary key become secondary or alternate keys. The next key we’re going to look at is the composite key. A composite key is a primary key that consists of at least two attributes that uniquely identify each row in a table, one or more of which are not simple keys in their own right. A simple key is a single attribute that can uniquely identify each row within a table.
Let’s take a look at the following table: As you can see, we have removed the customer ID column. Given the new definition of the table, we have identified the following candidate keys: Out of these candidate keys that we’ve identified, three of them are simple keys. Due to the fact that their values might be changed at a later date, we’re not going to use any of them as the primary key. If we look at the last candidate key, we cannot use the attributes individually to uniquely identify each row in the table. The company name in its own right does not uniquely identify each row, nor does the country attribute. But if we combine these attributes, we get a unique identifier for each row within the table. Therefore, the primary key can be the combination of the company name and the country attribute, making this primary key a composite key. The next key we’re going to look at is the surrogate key. A surrogate key is an artificial key that does not contain a fact about the object being modeled. It’s usually a system-generated integer that will be used as the primary key to uniquely identify each row in a table. In the previous example, we were able to find a suitable candidate key to be used as the primary key. But what if we didn’t find one? Let’s take a look at the following table: This table stores the users who interact with a specific application. We have two candidate keys: the username and the email attribute. Both of these can be changed at a later date and therefore aren’t the best options for a primary key. Since we have not found any suitable key to be used as the primary key, we introduce a surrogate key.
This is the user ID attribute. It does not contain a fact about the object being modeled, but it can effectively be used as the main reference key to this table. This is usually how primary keys are created. To create a surrogate key in SQL Server, we need to enable the identity property of the column. This will automatically generate the values for the primary key attribute. The next key we’re going to look at is the foreign key. A foreign key is a primary key from one table that is being referenced from another table. Unlike the other keys we’ve covered, the foreign key has a different purpose. As we’ve seen in the previous videos, the foreign key is used to establish relationships between tables and also to enforce referential integrity. Let’s look at the following example: As you can see, both the student ID and the courseID attribute, as well as the student grade table, are foreign keys. That’s because they reference a primary key from another table. This way, we’re ensuring that the relationship between the tables is always valid. Finally, we’re going to look at the compound key. A compound key is a primary key made up of at least two attributes that uniquely identify each row in a table, and the attributes that comprise the compound key are simple keys in their own right.
This sounds very similar to a composite key, and that’s because every compound key is also a composite key. However, not every composite key is a compound key. For a key to be a compound key, each column it’s made of has to be a key in its own right. In one of the previous slides, we came up with the following composite key: This is an example of a composite key that is not a compound key. None of these attributes are keys on their own, but if they are taken together, they form a unique identifier for each row, which makes them a composite key. Let’s take a look at an example of a compound key. In our system, a student can take a course only once. Therefore, any combination of a student ID and a course ID results in a unique identifier for each row in the table. These keys can form the primary key. Because both the student ID and the course ID are foreign keys on their own, a combination of them becomes a compound key and can also be referred to as a composite key. We’ve now covered the fundamentals of database keys. Thank you for watching. In the next section, we’re going to start with querying, querying, and manipulating data. see you then.
Popular posts
Recent Posts