ASQ CQA – 5. Quality Tools and Techniques Part 9

  1. 5C Basic Statistics

Let’s understand some basic concepts related to statistics. Statistics deals with numbers, and it has two main branches. One is descriptive statistics, and the second is inferential statistics. When we say descriptive statistics, descriptive means to describe, to tell. Inferential means to infer, to draw a conclusion about something from limited data. So in descriptive statistics, we describe something. Let’s say you have a class in which you have 20 students.

You measure each of these students and find out their heights. Then with those heights, you can say that, on average, the height of students in this class is this much. By taking the mean of those heights, you can describe that class. On the other hand, in inferential statistics you take a sample, and based on that sample, you make a guess about the whole group. So let’s say you take a sample of 500 people and ask them which party they are going to vote for. Once you know the answers of these few hundred people, then based on your survey, you can make a guess about which party is going to win. This is inferential statistics. We will not be talking much about inferential statistics.

Our focus will be on descriptive statistics, where we will be describing the data, describing whatever information we have collected. Within descriptive statistics, there are two key aspects. One is the measurement of central tendency, and the second is the measurement of dispersion. We will be talking about each of these, and about the measurements related to central tendency and to dispersion, in the next few videos. But for now, let’s understand that central tendency is something which, in common language, you can think of as an average.

In the example which I gave earlier, you took the heights of students in a class and took the average of those. That was central tendency. Dispersion is how much variation you have in the heights. Some people might be taller, some people might be shorter. What is the dispersion, what is the range, what is the spectrum of these heights? When you look at that sample, that is what is called the measurement of dispersion. We will be talking about three measurements of central tendency, which are mean, median and mode. And we will be talking about three measurements of dispersion, which are range, standard deviation and variance. Let’s understand each of these in the next few videos.

  1. 5C1 Measures of Central Tendency

So, as we said earlier, when we talk of measurement of central tendency, there are three measurements: mean, median and mode. Let’s quickly look at these, and then we will go into the details of each one, and I will explain them with the help of a very simple example. Mean is what is most commonly known as the average. In the case of the heights of students, you took the average of those heights. That was the mean. Median is the middle value when you arrange the data in ascending or descending order, increasing or decreasing order. And what is mode? Mode is the most frequently occurring item in the data. Now, with this basic understanding, let’s look at an example of mean. How do we calculate the mean? What are the advantages and disadvantages of the mean? As I said earlier, the mean is commonly known as the average. So in this simple example, let’s say I ask you to find the mean of these five numbers: 104, 98, 90, 104 and 104. That’s very simple. What you do is add all these numbers, so 104 plus 98 plus 90 plus 104 plus 104, and then divide by the number of items, which in this case is five. The sum is 500, and once you divide that by five, this will give you 100.
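The mean calculation just described can be checked with a few lines of Python. This is only a minimal sketch; the variable names (`heights`, `total`, `count`) are chosen here for illustration and do not come from the course.

```python
# The five numbers from the example.
heights = [104, 98, 90, 104, 104]

# Mean = sum of all values divided by the number of values.
total = sum(heights)   # 104 + 98 + 90 + 104 + 104 = 500
count = len(heights)   # 5 items
mean = total / count

print(mean)  # 100.0
```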

So that is the mean. But then what is the problem with the mean? The main problem with the mean is that if you have an outlier, and by outlier I mean a value that is very high or very low compared with the rest, it can distort the mean. Let’s say you have 20 friends, you ask each of them their salary, and you find the average. But if one of your friends is, let’s say, Bill Gates or Warren Buffett, then just because of this one person’s income, the average will shift drastically. That average no longer represents a typical member of the group. So any outlier, any extreme value, can influence the mean in a big way. That is the reason sometimes you don’t use the mean, and instead you use the median. So with this, what is the median?

Earlier we took the mean of these numbers, and now we are taking the median of the same numbers: 104, 98, 90, 104 and 104. To take the median, the first thing you do is arrange these numbers in ascending or descending order, increasing or decreasing order, whichever you prefer. What I have done is put these numbers in increasing order. So the first number here is 90, which is the lowest one. The next one is 98. Then 104 comes three times, so I have put 104 three times. Then the median is the middle number in this list, and I have five numbers here.

So if I leave two numbers on the left and two numbers on the right, the middle value is the third number, which is 104. So that is the median. Now, earlier I was saying that the median is not affected by extreme values. Let’s say that instead of 104, I change one of these numbers to 1040. So this is an extreme value. Unlike the other four numbers, this fifth number is extreme. Now, even with this number, if I put them in increasing order, the fifth number will become 1040. By changing 104 to 1040, the mean will definitely get affected, because when you add all these numbers, this fifth number is a big number. So the mean which we calculated earlier will change. But when you look at the median, it doesn’t get affected by this extreme value, because whether the last number is 104 or 1040 or 1 million, that doesn’t affect the median. The median is still the third number in this ascending list, which is 104, even if you have an extreme value.
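To see the effect of the extreme value in numbers, here is a small Python sketch that compares the mean and the median before and after changing one 104 to 1040. The helper names `mean` and `median_odd` are made up for this illustration, and `median_odd` only handles lists with an odd number of items, as in the five-number example.

```python
def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def median_odd(values):
    """Middle value of a sorted list that has an odd number of items."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]

original = [104, 98, 90, 104, 104]
with_outlier = [104, 98, 90, 104, 1040]  # one 104 changed to 1040

print(mean(original), median_odd(original))          # 100.0 104
print(mean(with_outlier), median_odd(with_outlier))  # 287.2 104
```

The mean moves from 100 to 287.2, while the median stays at 104.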

So that’s the reason sometimes you use the median, just to avoid extreme values influencing the central tendency. So this was about the median. But one question remains: here we had five numbers, so it was easy to find the middle number. If you left two numbers on the left and two numbers on the right, the middle number was the third number. What if, instead of five, you had six numbers in your original data? Let’s look at that example on the next slide. So here I have one more number added to the list, and that number is 85. Now, if I want to find the median of this, just like earlier, the first thing I will do is arrange these numbers in ascending order.

And now the first number is 85, not 90 as it was earlier, and everything else remains the same. Now, the median is the central value, but here there is no single central value. The centre falls between the third and the fourth number. If I leave three numbers on the left and three numbers on the right, I don’t have any number which is the central number. What I can do is take the two numbers on either side of the centre, one on the left and one on the right, and take the average of those. That will give me the median. So the median is the average of 98 and 104 here. I add 98 plus 104 and divide by two. That gives me the median as 101. So this was about the median. We have talked about the mean, and we have talked about the median. Now let’s talk about the third measurement of central tendency, which is the mode. Mode is the most frequently occurring item.

So if I look at these items, 104, 98, 90, 85, 104 and 104, what you need to find is which of these numbers occurs most frequently. I see that 104 comes three times here, 98 comes one time, 90 comes one time, and 85 comes one time. So based on that, I can say that 104 is the most frequently occurring item, and the mode is 104. Where do you use the mode? You use it where you want to find out which value occurs most often. Let’s say I have a shirt manufacturing company, and you know that people have different sizes, so the sizes could be 30, 34, 38, whatever sizes these are.

Once I take a survey of the population and find out what sizes people have, what I need to do is focus on the shirt sizes which occur most often, because I need to increase my sales. I really cannot make 1000 shirts, let’s say, for each of these sizes. So if, in this population, a majority of people have a size of 34, then I need to focus on that particular shirt size. That’s one example where you can use the mode rather than the mean or the median. So with this, we complete our discussion on the three measurements of central tendency: mean, median and mode. Mean is the average, median is the central value when you arrange the data in ascending or descending order, and mode is the most frequently occurring item. With this, let’s move on to the next item, which is the measurement of dispersion. Let’s do that next.
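As a recap, the six-number example can be checked in Python. This is a rough sketch with helper names (`median`, `mode`) invented for the illustration. The `median` function covers both the odd case (take the middle value) and the even case (average the two middle values), and `mode` simply returns the most frequent value without any special handling of ties.

```python
from collections import Counter

def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def mode(values):
    """Most frequently occurring value (ties are not treated specially in this sketch)."""
    return Counter(values).most_common(1)[0][0]

six_numbers = [104, 98, 90, 104, 104, 85]

print(sorted(six_numbers))  # [85, 90, 98, 104, 104, 104]
print(median(six_numbers))  # 101.0, the average of 98 and 104
print(mode(six_numbers))    # 104, which occurs three times
```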

  1. 5C2 Measures of Dispersion

Coming to the measurement of dispersion, or variation, the first question would be: why do we need this measurement? Once we know the average of something, shouldn’t that be good enough? Let’s take an example: you appear for an interview and you have been offered a job with a salary of $100,000 a year. Now you challenge that: why are you offering me $100,000? HR tells you that this is the average salary which everyone at your level of experience gets. Will you be satisfied? Probably the next thing which should come to your mind is: what is the variation here? You can have $100,000 as an average where most of the people have, let’s say, a salary of $99,000 to $101,000. So everyone has 99 to 101 thousand, and you have been given $100,000. Okay, that’s fair, that’s not a problem. But then there could be another scenario where you have too much variation.

Too much variation means some people in that grade might have a salary of, let’s say, $80,000, and some might have a salary of $200,000. So the salaries run from $80,000 to $200,000, and you have been placed at $100,000. This still might be the average, but it is not in the middle of that range. So you need to understand the range. Whatever data you are collecting, you still need to understand the spread, the dispersion or the variation in it. Whether it is a measurement taken in manufacturing, or a measurement of how much time each call takes in your call center, whatever measurement you are taking, you need to understand its dispersion as well. When you talk about measurement of dispersion or variation, there are three commonly used measurements. One is the range, which is quite simple. The second is the standard deviation. And the third is the variance. Let’s start with the simple one, which is the range. Range is the difference between the highest value and the lowest value. That’s simple. Let’s take the same data which we used for mean, median and mode. Here I have the numbers 104, 98, 90, 85, 104 and 104. These are heights of students, let’s say. If you want to find the range, the highest number here is 104. That goes first, minus the lowest, which is 85.

The difference is 19. That’s the range. It is a very simple measurement of variation. How much variation do you have in this group? If you look at the range, that’s 19. Easy to calculate. But there are some problems with this. The biggest problem is the outlier, or the extreme value. If you have one extreme value here, that will change the range drastically. Let’s say that instead of 104, I make one number 1040. Then the range will become huge. That’s the reason the range sometimes doesn’t give you a good indication of the spread: even a single number can change the whole thing, and the rest of the numbers don’t count. You are using only two numbers out of this big list of numbers.
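The range calculation, and its sensitivity to a single extreme value, can be sketched in Python as follows. The function name `value_range` is an assumption made here for illustration (it avoids shadowing Python’s built-in `range`).

```python
def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

heights = [104, 98, 90, 85, 104, 104]
with_outlier = [1040, 98, 90, 85, 104, 104]  # one 104 changed to 1040

print(value_range(heights))       # 104 - 85 = 19
print(value_range(with_outlier))  # 1040 - 85 = 955
```

Only the largest and smallest values enter the calculation, so one changed number moves the range from 19 to 955.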

But then what is the other option? The other option would be the standard deviation and the variance, two measurements which are connected to each other. Let’s move on to the next slide and understand how we calculate the standard deviation. Before we go through the calculation, let’s understand one thing: the standard deviation is also known as sigma. This is the same sigma which you see in Six Sigma. So when you talk of Six Sigma, you are talking of this sigma, which is the standard deviation. To find the standard deviation, there are a number of steps. Let’s go slowly, step by step. The first thing you want to do is find the mean of these numbers.

The mean of these numbers is 100, which we calculated earlier as well. You add these five numbers and divide by five, and that gives you 100. So that’s the mean, and that’s step number one in calculating the standard deviation. Step number two is to find out how far each of these numbers is from the mean. The first number was 104. How far is this from 100? That will be 104 - 100. The second number was 98. How far is it from 100? That’s 98 - 100, and so on. So you find out how far each of these numbers is from the mean, and those distances are here on the right: 4, -2, -10, 4 and 4. If you look at these numbers, finding the average of these distances from the mean is probably the sort of thing we want to do when we want to calculate the dispersion, the spread: how far each of these numbers is from the mean.

So this is the distance from the mean. But if you find the average of these distances, that average is always going to be zero. Whatever numbers you have, if you take their mean, find each distance from the mean, and then average those distances, that will be zero. Here, if you take the average of 4, -2, -10, 4 and 4, that will be twelve on the plus side and twelve on the minus side, so the average will be zero. This will always be zero. So that’s something which we cannot use. What we need to do instead is square each distance: take the square of 4, take the square of -2, and so on. The one thing which squaring does is that the sign goes away. The square of -2 is -2 multiplied by -2, which is equal to +4. So we take the square of each distance. Let’s do that on the next slide. Here I have taken the square of all these distances from the mean, and the sign which you see here, after the 4 and the -2, is the sign for square. So now the squares are: 4 squared is 16, -2 squared is +4, -10 squared is +100, 4 squared is 16, and 4 squared is 16. These are the numbers which I get as the squares of the distances from the mean. Now let’s take the average of those on the next slide. If I take the average, or mean, of these squared distances, that will give me the variance.
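The two steps above, finding the distances from the mean and then squaring them, can be sketched in Python like this. The variable names (`deviations`, `squared`) are illustrative choices, not terms from the course.

```python
heights = [104, 98, 90, 104, 104]

# Step 1: the mean.
mean = sum(heights) / len(heights)        # 100.0

# Step 2: distance of each value from the mean.
deviations = [x - mean for x in heights]  # [4.0, -2.0, -10.0, 4.0, 4.0]

# The plain distances always cancel out, so their sum (and average) is zero.
print(sum(deviations))                    # 0.0

# Step 3: square each distance so the signs go away.
squared = [d ** 2 for d in deviations]    # [16.0, 4.0, 100.0, 16.0, 16.0]
print(sum(squared))                       # 152.0
```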

So I add all these numbers, which gives 152, and divide by five items. That gives me the variance, which is 30.4. Remember, we said we would talk about two more measurements of dispersion here: standard deviation and variance. These are connected with each other. Variance is the square of the standard deviation. One thing which you need to remember here is that you can find the variance of the whole population or the variance of a sample. Just remember that if I tell you these numbers are a sample, then instead of five, you divide by four. If I tell you that this is the whole population, then you divide by the number of items. That’s something which you need to remember.

It is a matter of n versus n minus 1. When you are looking at the variance or standard deviation of the whole population, you divide by n, and that’s what we are doing here. But if you get an example where I say that the data is sample data, then instead of dividing by five, you need to divide by five minus one. So with this, we have found the variance, and the variance is 30.4. And as I have said, the variance is the square of the standard deviation.
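The difference between dividing by n and by n minus 1 can be made concrete with a short Python sketch. The function name `variance` and its `sample` flag are assumptions introduced here for illustration only.

```python
def variance(values, sample=False):
    """Average squared distance from the mean.

    Divides by n for a whole population, or by n - 1 when the
    data is a sample (sample=True).
    """
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return sum_of_squares / divisor

heights = [104, 98, 90, 104, 104]

print(variance(heights))               # 152 / 5 = 30.4 (population)
print(variance(heights, sample=True))  # 152 / 4 = 38.0 (sample)
```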

Now, if I have to find the standard deviation, I can take the square root of the variance. The square root of 30.4 will be the standard deviation in this particular example. So here I have that statement: the square root of the variance is the standard deviation. The square root of 30.4 is about 5.51, so that is the standard deviation. Many times you can simply use a calculator for this. The square root is represented by this sign (√), in case you are not aware, but I am sure many of you will know it. So if you need to take the square root of 30.4, this is the sign which you put on it.
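In place of a calculator, the last step can be sketched in Python using `math.sqrt` from the standard library. The data is treated as a whole population (dividing by n), as in the example, and the variable names are illustrative.

```python
import math

heights = [104, 98, 90, 104, 104]

# Population variance: average squared distance from the mean.
mean = sum(heights) / len(heights)
population_variance = sum((x - mean) ** 2 for x in heights) / len(heights)

# Standard deviation (sigma) is the square root of the variance.
sigma = math.sqrt(population_variance)

print(population_variance)  # 30.4
print(round(sigma, 2))      # 5.51
```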
