# Ch 2.3 and 2.4 Percentile, Boxplot and Outliers

**Percentile and Quartiles**

**Percentiles** are measures of location. They are denoted by P_{1}, P_{2}, …, P_{99} and divide a set of data into 100 groups with about 1% of the values in each group.

If x is at the 90^{th} percentile, then about 90% of all data values are less than x. Note that a percentile is not the same as a percentage.
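The distinction between percentile and percentage can be illustrated with a minimal Python sketch. The function name `percent_below` and the list of test scores are hypothetical, chosen only for illustration.

```python
def percent_below(values, x):
    """Return the percent of data values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical test scores, each recorded as a percentage of points earned.
scores = [52, 58, 61, 64, 67, 70, 73, 77, 81, 95]

# A score of 95 (a percentage) has 9 of the 10 scores below it,
# so in this sample it sits at about the 90th percentile.
print(percent_below(scores, 95))  # 90.0
```

Here the score of 95 is a percentage, while its position relative to the other scores (90% of them are lower) is what the percentile describes.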

**Quartiles** (Q_{1}, Q_{2}, Q_{3}):

Quartiles are measures of location, which divide a set of data into four groups with about 25% of the values in each group.

Q_{1} – First quartile or P_{25}. It separates the bottom 25% of values from the top 75%.

Q_{2} – Second quartile or P_{50} or median. It separates the bottom 50% of values from the top 50%.

Q_{3} – Third quartile or P_{75}. It separates the bottom 75% of values from the top 25%.
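The link between quartiles and percentiles can be checked with a short sketch using Python's standard `statistics.quantiles` function. The data values are hypothetical. The function's default ("exclusive") method is only one of several quartile conventions, so other software may report slightly different values for the same data.

```python
from statistics import quantiles

# Hypothetical data set of 11 values (for illustration only).
values = [2, 4, 5, 7, 8, 10, 12, 15, 18, 21, 30]

# n=4 returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(values, n=4)

# n=100 returns the 99 cut points P1 ... P99 (index 0 holds P1).
percentiles = quantiles(values, n=100)
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

print(q1, q2, q3)     # 5.0 10.0 18.0
print(p25, p50, p75)  # 5.0 10.0 18.0
```

For this data set, Q_{1}, Q_{2} and Q_{3} coincide with P_{25}, P_{50} and P_{75}, as the definitions state.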

**Five-number-summary, IQR and Boxplot:**

The five-number summary consists of the minimum, Q_{1}, the median, Q_{3} and the maximum. These five values divide the data into four groups of about 25% each.

*IQR* = *Q*_{3} – *Q*_{1} (interquartile range)

The **interquartile range (IQR)** is a number that indicates the spread of the middle half or the middle 50% of the data. It is the difference between the third quartile (*Q*_{3}) and the first quartile (*Q*_{1}).
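Because the IQR depends only on the middle half of the data, an extreme value at either end does not change it. The sketch below compares the IQR with the range (maximum minus minimum) on two hypothetical data sets that differ only in their largest value. The helper names `iqr` and `data_range` are illustrative, and the quartiles come from the default method of Python's `statistics.quantiles`, which may differ slightly from other software.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

def data_range(values):
    """Range: maximum - minimum."""
    return max(values) - min(values)

# Hypothetical data; the second list replaces the largest value with 300.
base = [2, 4, 5, 7, 8, 10, 12, 15, 18, 21, 30]
stretched = base[:-1] + [300]

print(iqr(base), data_range(base))            # 13.0 28
print(iqr(stretched), data_range(stretched))  # 13.0 298
```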

A **boxplot** is a graph that shows where the data are concentrated. It is constructed from the 5-number summary: the box runs from Q1 to Q3 with a line at the median, so the box contains the middle 50% of all data. The plot shows how the data are distributed across the 25%, 50% and 75% points.

- Whiskers extend from the two ends of the box to the minimum and maximum values.

#### Find the 5-number summary and boxplot with Statdisk

- Enter data in a column in Statdisk.

- Select Data, Explore Data, Descriptive Statistics, select the column and click Evaluate.

Ex1. The time (in min.) that each student in a sample of 15 students spent exercising daily is given:

0, 40, 60, 30, 60, 10, 46, 30, 300, 90, 30, 120, 60, 0, 20

a) Find the 5-number summary and sketch a boxplot.

Use Statdisk: Data / Explore Data / select the data column / Evaluate. The five-number summary and boxplot will be shown.

b) What percent of students exercise from 0 to 60 min?

Because 60 is Q3, about 75% of all students exercise from 0 to 60 min.

c) What percent of students exercise between 20 and 60 min?

Because 20 is Q1 and 60 is Q3, about 50% of all students exercise from 20 to 60 min.

Answer: Use Statdisk:

a) Min = 0, Q1=20, Med=40, Q3=60, Max = 300

b) Since Q3 = 60, about 75% of students exercise from 0 to 60 min.

c) Since Q1 = 20 and Q3 = 60, about 50% of students exercise from 20 to 60 min.
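For readers without Statdisk, the five-number summary above can be reproduced with a short Python sketch. The quartile rule used here is an assumption: compute the locator L = (k/100)·n; if L is a whole number, average the L-th and (L+1)-th sorted values; otherwise round L up and take that value. This is one common textbook convention and is not necessarily what Statdisk does internally, although it gives the same values for this data set. The function names are illustrative.

```python
def percentile(sorted_values, k):
    """Return the k-th percentile (integer k, 0 < k < 100) of an ascending list.

    Assumed rule: locator L = (k/100) * n. If L is a whole number, average the
    L-th and (L+1)-th values; otherwise round L up and take that value.
    """
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    n = len(sorted_values)
    whole, remainder = divmod(k * n, 100)
    if remainder == 0:
        # L is a whole number: average positions L and L+1 (1-based).
        return (sorted_values[whole - 1] + sorted_values[whole]) / 2
    # L is not whole: rounding up gives 1-based position whole + 1.
    return sorted_values[whole]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    return (
        ordered[0],
        percentile(ordered, 25),
        percentile(ordered, 50),
        percentile(ordered, 75),
        ordered[-1],
    )

# Exercise times (in min.) from Ex1.
times = [0, 40, 60, 30, 60, 10, 46, 30, 300, 90, 30, 120, 60, 0, 20]
print(five_number_summary(times))  # (0, 20, 40, 60, 300)
```

Other quartile conventions (for example, linear interpolation between neighboring values) can give a slightly different Q1 for the same data, so small differences between software packages are expected.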

#### Outliers and IQR

*IQR* is used to determine potential **outliers**: values below the lower fence, Q1 – 1.5(IQR), or above the upper fence, Q3 + 1.5(IQR).

Ex1. If Q1 = 34 and Q3 = 70, find the lower fence and upper fence for outliers.

IQR = 70 - 34 = 36

Lower fence = 34 – 1.5(36) = -20, upper fence = 70 + 1.5(36) = 124

Values between -20 and 124 are not outliers; values outside this range are outliers.
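The fence calculation in this example can be written as a small Python function. The name `fences` and the `multiplier` parameter are illustrative; 1.5 is the multiplier used in the calculation above.

```python
def fences(q1, q3, multiplier=1.5):
    """Return (lower fence, upper fence) computed from the quartiles."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

lower, upper = fences(34, 70)
print(lower, upper)  # -20.0 124.0
```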

A potential outlier is a data point that is significantly different from the other data points. These special data points may be errors or some kind of abnormality or they may be a key to understanding the data.

**C) Modified boxplot and outliers:**

A modified boxplot can be graphed to show outliers without calculating the IQR and applying the fences Q1 – 1.5(IQR) and Q3 + 1.5(IQR) by hand. Outliers are shown as markers in the boxplot.

- Use Statdisk, click Data, Boxplot.

- Select the column of data and click **modified boxplot**. Any outliers will be shown as markers at the lowest or highest end of the boxplot.

- If there are no markers, there are no outliers in the dataset.

- To find the values of the outliers, sort the data. The outliers will be at the beginning or the end of the sorted data.

Ex2. Determine whether outliers exist in the exercise times of the 15 students.

0, 40, 60, 30, 60, 10, 46, 30, 300, 90, 30, 120, 60, 0, 20

By calculation:

Since Q1 = 20 and Q3 = 60, IQR = 60 – 20 = 40

Lower fence = Q1 – 1.5(IQR) = 20 – 1.5(40) = -40

Upper fence = Q3 + 1.5(IQR) = 60 + 1.5(40) = 120

Values lower than -40 or higher than 120 are outliers. So the value 300 is an outlier. The value 120 is not an outlier because it is not higher than the upper fence.
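The same check can be sketched in Python. The quartiles Q1 = 20 and Q3 = 60 are taken from the text rather than recomputed, and the variable names are illustrative.

```python
# Exercise times (in min.) from Ex2.
times = [0, 40, 60, 30, 60, 10, 46, 30, 300, 90, 30, 120, 60, 0, 20]

q1, q3 = 20, 60  # quartiles as given in the text
iqr = q3 - q1    # 40

lower_fence = q1 - 1.5 * iqr  # -40.0
upper_fence = q3 + 1.5 * iqr  # 120.0

# A value is an outlier only if it lies strictly outside the fences.
outliers = [t for t in times if t < lower_fence or t > upper_fence]
print(outliers)  # [300]
```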

Graph a modified boxplot to identify outliers.

Use Statdisk: Data / Boxplot / select Modified boxplot.

There is one outlier at the high end of the data. To find the outlier, sort the data and locate the highest value.

Use Statdisk, Sort, one column, select the column containing the data. The last value (300) is the outlier.
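The sorting step can also be sketched in Python. The sketch additionally assumes the usual modified-boxplot convention that each whisker stops at the most extreme value that is not an outlier. The fences of -40 and 120 are taken from the calculation above, and the variable names are illustrative.

```python
# Exercise times (in min.) from Ex2.
times = [0, 40, 60, 30, 60, 10, 46, 30, 300, 90, 30, 120, 60, 0, 20]

lower_fence, upper_fence = -40, 120  # fences from the calculation above

ordered = sorted(times)
print(ordered[-1])  # 300, the highest value in the sorted data

# Split the sorted data into values inside the fences and outliers.
inside = [t for t in ordered if lower_fence <= t <= upper_fence]
outliers = [t for t in ordered if t < lower_fence or t > upper_fence]

# Assumed convention: whiskers end at the extreme non-outlier values.
whisker_low, whisker_high = inside[0], inside[-1]
print(whisker_low, whisker_high)  # 0 120
print(outliers)                   # [300]
```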