Summary statistics
Slides for “Summary statistics”
Control questions
- What do the mean, median and mode aim to describe?
- How would you find the median if you had an even number of observations?
- Which percentile is the median?
- What is the 100th percentile usually called?
- What is the relationship between variance and standard deviation?
- Why prefer the standard deviation to the variance?
Python
In this section, we look more closely at how to calculate these summary statistics in Python, both for a single vector of numbers and for a dataset.
A small startup tracks how many new users signed up each day (in hundreds) during a 12-day promotional campaign. The daily numbers are: \[9.0, 5.4, 12.3, 6.6, 15.5, 7.0, 11.3, 6.6, 8.4, 14.4, 10.3, 13.2\] We start by reading the data into Python:
import pandas as pd
# Import numbers as vector:
x = pd.Series((9.0, 5.4, 12.3, 6.6, 15.5, 7.0, 11.3, 6.6, 8.4, 14.4, 10.3, 13.2))
We can find the mean by doing the calculation
\[\bar x = \frac{1}{n} \sum_{i=1}^n x_i = \frac{1}{12}(9.0+5.4+12.3+\cdots +13.2) = \frac{120}{12}=10.0,\]
or use Python:
x.mean()
np.float64(10.0)
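As a small check of the formula above, the mean can also be computed step by step as the sum divided by the number of observations. This is only a sketch; the names `n`, `total` and `manual_mean` are introduced here for illustration.

```python
import pandas as pd

# The same 12 daily sign-up numbers as in the text
x = pd.Series((9.0, 5.4, 12.3, 6.6, 15.5, 7.0, 11.3, 6.6, 8.4, 14.4, 10.3, 13.2))

n = len(x)               # number of observations
total = x.sum()          # sum of all observations
manual_mean = total / n  # the mean formula, written out

# Rounded to hide tiny floating-point differences
print(round(manual_mean, 10))
print(round(x.mean(), 10))
```

Both lines should agree up to floating-point rounding, since `x.mean()` performs the same calculation.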
The mode is the most frequent observation. This is mostly relevant for discrete observations, but in this particular example we have two observations with value 6.6, and therefore this is the mode. In Python, we would generate a frequency table and the most frequent value is the mode:
x.value_counts()
6.6 2
5.4 1
9.0 1
12.3 1
15.5 1
7.0 1
11.3 1
8.4 1
14.4 1
10.3 1
13.2 1
Name: count, dtype: int64
Hence, the mode is 6.6.
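Instead of reading the mode off the frequency table, pandas can also return it directly. The sketch below uses `Series.mode()`, which returns a Series because there can be several modes, and, as an alternative, picks the label with the highest count from `value_counts()`. The variable names are chosen here for illustration.

```python
import pandas as pd

x = pd.Series((9.0, 5.4, 12.3, 6.6, 15.5, 7.0, 11.3, 6.6, 8.4, 14.4, 10.3, 13.2))

# mode() returns all values that share the highest frequency
modes = x.mode()
print(modes.tolist())

# Alternative: take the value with the largest count in the frequency table
most_frequent = x.value_counts().idxmax()
print(most_frequent)
```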
To find the minimum, maximum and median, we sort the numbers:
x.sort_values()
1 5.4
3 6.6
7 6.6
5 7.0
8 8.4
0 9.0
10 10.3
6 11.3
2 12.3
11 13.2
9 14.4
4 15.5
dtype: float64
and find the minimum value, maximum value and middle observation. The minimum is 5.4 and the maximum is 15.5. In this case, we have two observations in the middle (9.0 and 10.3). There are several ways of calculating the median in such cases. One way is to find the middle point of these two middle observations: \[\text{Median} = \frac{9.0+10.3}2 = 9.65.\] In Python, we find it by:
print("Minimum:")
print(x.min())
print("\nMaximum:")
print(x.max())
print("\nMedian:")
print(x.median())
Minimum:
5.4
Maximum:
15.5
Median:
9.65
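The midpoint rule for an even number of observations can be written out by hand. The following is a minimal sketch of that rule (sort, then average the two middle values); `sorted_x` and `manual_median` are names introduced here.

```python
import pandas as pd

x = pd.Series((9.0, 5.4, 12.3, 6.6, 15.5, 7.0, 11.3, 6.6, 8.4, 14.4, 10.3, 13.2))

sorted_x = x.sort_values().to_numpy()  # sorted values as a plain array
n = len(sorted_x)

if n % 2 == 0:
    # Even n: average the two middle observations (0-based positions n/2 - 1 and n/2)
    manual_median = (sorted_x[n // 2 - 1] + sorted_x[n // 2]) / 2
else:
    # Odd n: the single middle observation
    manual_median = sorted_x[n // 2]

print(manual_median)
print(x.median())
```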
The median is the middle observation. To find the 1st and 3rd quartiles (the 25th and 75th percentiles), we look for the values that have 25% and 75% of the data below them. With 12 observations, these are at sorted positions \((12-1)\cdot 25\%+1 = 3.75\) and \((12-1)\cdot 75\%+1 = 9.25\). The 25th percentile is thus between sorted observations 3 and 4, and we find it by a weighted average of the two. Since 3.75 is closer to 4, the fourth observation gets the higher weight. \[\text{1st quartile}=\text{25th percentile}=0.25 \cdot x_{(3)}+0.75 \cdot x_{(4)}=0.25\cdot 6.6+0.75\cdot 7.0 = 6.9.\] Likewise, the 75th percentile is between sorted observations 9 and 10, with the ninth observation getting the higher weight: \[\text{3rd quartile}=\text{75th percentile}=0.75 \cdot x_{(9)}+0.25 \cdot x_{(10)}=0.75\cdot 12.3+0.25\cdot 13.2 = 12.525.\]
In Python, we find these quantities by
print("1st quartile:")
print(x.quantile(.25))
print("\n3rd quartile:")
print(x.quantile(.75))
1st quartile:
6.9
3rd quartile:
12.525
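The weighted-average rule used above can be sketched as a small function. The helper `quantile_linear` is hypothetical (written for this example, not part of pandas); it uses 0-based positions, so its position \((n-1)q\) is the position from the text minus one. It matches the default linear interpolation of `Series.quantile`.

```python
import math
import pandas as pd

x = pd.Series((9.0, 5.4, 12.3, 6.6, 15.5, 7.0, 11.3, 6.6, 8.4, 14.4, 10.3, 13.2))

def quantile_linear(values, q):
    """Quantile by linear interpolation between neighbouring sorted values."""
    s = sorted(values)
    pos = (len(s) - 1) * q          # fractional 0-based position
    lo = math.floor(pos)            # index of the observation just below
    hi = min(lo + 1, len(s) - 1)    # index of the observation just above
    frac = pos - lo                 # weight given to the upper observation
    return (1 - frac) * s[lo] + frac * s[hi]

q1 = quantile_linear(x, 0.25)
q3 = quantile_linear(x, 0.75)
print(q1, x.quantile(0.25))
print(q3, x.quantile(0.75))
```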
Two central measures of variability in the data are the variance and the standard deviation. We can find the variance of the data by \[s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2 = \frac{1}{11}((9.0-10.0)^2+ (5.4-10.0)^2+\cdots +(13.2-10.0)^2)=\frac{123.76}{11}=11.25\] and then find the standard deviation \(s\) by taking the square root of the variance: \[s=\sqrt{s^2}=\sqrt{11.25}=3.35.\] In Python, we find this by
print("Variance:")
print(x.var())
print("\nStandard deviation:")
print(x.std())
Variance:
11.250909090909092
Standard deviation:
3.354237482783396
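To connect the formula to the pandas result, the variance can be computed term by term. This sketch also shows the `ddof` argument: pandas divides by \(n-1\) by default (`ddof=1`), while `ddof=0` divides by \(n\). The intermediate names are introduced here for illustration.

```python
import pandas as pd

x = pd.Series((9.0, 5.4, 12.3, 6.6, 15.5, 7.0, 11.3, 6.6, 8.4, 14.4, 10.3, 13.2))

n = len(x)
mean = x.mean()
sum_sq = ((x - mean) ** 2).sum()   # sum of squared deviations from the mean
sample_var = sum_sq / (n - 1)      # divide by n - 1, as in the formula for s^2
sample_std = sample_var ** 0.5     # standard deviation is the square root

print(sample_var, x.var())
print(sample_std, x.std())

# Dividing by n instead gives a slightly smaller number
print(x.var(ddof=0))
```

One reason the standard deviation is often easier to interpret is that it is in the same unit as the data (here, hundreds of users), whereas the variance is in squared units.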
The interquartile range (IQR) is simply the difference between the 75th and 25th percentiles of the data, which we found above. Thus \[\text{IQR}=\text{75th percentile}-\text{25th percentile} = 12.525 -6.9 = 5.625.\] In Python:
# interquartile range
IQR = x.quantile(.75)-x.quantile(.25)
print(IQR)
5.625
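Most of the quantities computed one by one in this section can also be obtained in a single call with `Series.describe()`, which reports the count, mean, standard deviation, minimum, quartiles and maximum. A short sketch, with `summary` and `iqr` as names chosen here:

```python
import pandas as pd

x = pd.Series((9.0, 5.4, 12.3, 6.6, 15.5, 7.0, 11.3, 6.6, 8.4, 14.4, 10.3, 13.2))

# One table with count, mean, std, min, 25%, 50%, 75% and max
summary = x.describe()
print(summary)

# The interquartile range can be read off the same table
iqr = summary["75%"] - summary["25%"]
print(iqr)
```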
Credit card dataset
Say we have a simulated dataset with the following variables:
- default: Binary (yes/no)
- student: Binary (yes/no)
- balance: Numeric. Balance on credit card after monthly payment.
- income: Numeric. Income of customer.
The data is from the book An Introduction to Statistical Learning. We can load the data by:
import pandas as pd
# load the data:
default = pd.read_csv("https://raw.githubusercontent.com/holleland/TECH3/refs/heads/main/data/Default.csv",
sep = ";")
# Print first rows:
default.head()
| | id | default | student | balance | income |
|---|---|---|---|---|---|
| 0 | 1 | No | No | 729.526495 | 44361.62507 |
| 1 | 2 | No | Yes | 817.180407 | 12106.13470 |
| 2 | 3 | No | No | 1073.549164 | 31767.13895 |
| 3 | 4 | No | No | 529.250605 | 35704.49394 |
| 4 | 5 | No | No | 785.655883 | 38463.49588 |
We can then find summary statistics for the income variable similarly to what we did above:
print("Mean income:")
print(default["income"].mean())
print("Median income:")
print(default["income"].median())
print("Standard deviation of income:")
print(default["income"].std())
Mean income:
33516.98187595744
Median income:
34552.6448
Standard deviation of income:
13336.639562731905
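Several statistics for one column can also be requested in a single call with `agg`. To keep the sketch runnable without downloading the file, it rebuilds only the five rows shown by `default.head()` above; the name `head_rows` is made up for this example, and the results therefore differ from the full-data values printed above.

```python
import pandas as pd

# Only the first five rows of the dataset, copied from the table above
head_rows = pd.DataFrame({
    "default": ["No", "No", "No", "No", "No"],
    "student": ["No", "Yes", "No", "No", "No"],
    "balance": [729.526495, 817.180407, 1073.549164, 529.250605, 785.655883],
    "income": [44361.62507, 12106.13470, 31767.13895, 35704.49394, 38463.49588],
})

# Mean, median and standard deviation of income in one call
income_stats = head_rows["income"].agg(["mean", "median", "std"])
print(income_stats)
```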
We could also find the mode of the variable default (binary) by making a frequency table:
# Frequency of defaults
default["default"].value_counts()
default
No 9667
Yes 333
Name: count, dtype: int64
Hence, “No” is the modal value of the default column.
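The counts in the frequency table can be turned into proportions, which are often easier to interpret. The sketch below starts from the two counts printed above rather than the full dataset; on the full data, `value_counts(normalize=True)` would give the proportions directly.

```python
import pandas as pd

# Counts taken from the frequency table above
counts = pd.Series({"No": 9667, "Yes": 333}, name="count")

# Share of each category
proportions = counts / counts.sum()
print(proportions)

# The mode is the category with the highest count
print(counts.idxmax())
```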
Group by
Sometimes, we want to calculate summary statistics for different groups of data. For instance, in the dataset above, say we wanted to calculate the average income of students and non-students. In the Pandas library, there is functionality for this:
print("Mean grouped by student status: \n")
print(default.groupby("student")["income"].mean())
print("Standard deviation grouped by student status: \n")
print(default.groupby("student")["income"].std())
Mean grouped by student status:
student
No 40011.952857
Yes 17950.230775
Name: income, dtype: float64
Standard deviation grouped by student status:
student
No 10010.288665
Yes 4533.007954
Name: income, dtype: float64
Thus, students have an average income of 17,950, while non-students have an average income of 40,012. Similarly, the standard deviations are 4,533 and 10,010, respectively.
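Grouped statistics do not have to be computed one at a time: `agg` accepts a list of statistics and returns one row per group. The sketch uses only the five rows shown by `default.head()` (stored in the made-up name `head_rows`), so the numbers are not the full-data results. With a single student among those rows, the standard deviation for that group is missing (`NaN`).

```python
import pandas as pd

# Student status and income for the first five rows of the dataset
head_rows = pd.DataFrame({
    "student": ["No", "Yes", "No", "No", "No"],
    "income": [44361.62507, 12106.13470, 31767.13895, 35704.49394, 38463.49588],
})

# One row per group, one column per statistic
grouped = head_rows.groupby("student")["income"].agg(["mean", "std", "count"])
print(grouped)
```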
loc
It is also useful to filter the data, meaning that we select just a specific set of observations (rows). Say, for instance, that we did not want to calculate the mean and standard deviation of income for both students and non-students, but only for the students. We can do this by first making a new data object called students, which contains only the rows where the variable student is “Yes”, and then calculating the summary statistics for this new dataset. We do this using the loc functionality:
students = default.loc[default["student"] == "Yes"]
print("Mean income of students: ")
print(students["income"].mean())
print("\nStandard deviation of student income:")
print(students["income"].std())
Mean income of students:
17950.230775053467
Standard deviation of student income:
4533.00795394927
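It can help to look at what is passed to `loc`: the comparison produces a Series of `True`/`False` values (a boolean mask), and `loc` keeps the rows where the mask is `True`. The sketch below again uses only the five rows from `default.head()`; `head_rows` and `is_student` are names introduced here.

```python
import pandas as pd

# Student status and income for the first five rows of the dataset
head_rows = pd.DataFrame({
    "student": ["No", "Yes", "No", "No", "No"],
    "income": [44361.62507, 12106.13470, 31767.13895, 35704.49394, 38463.49588],
})

# The comparison gives one True/False value per row
is_student = head_rows["student"] == "Yes"
print(is_student.tolist())

# loc keeps only the rows where the mask is True
students_head = head_rows.loc[is_student]
print(students_head)
```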
This was filtering based on a discrete (binary) variable, but we can also filter based on numerical comparisons. Say we wanted to find the mean and standard deviation of income for students with a balance above 1900. How many students have a balance above 1900? And what is the maximum balance that any student has in the dataset?
# Select students with balance above 1900:
students1900 = default.loc[
    (default["student"] == "Yes") & (default["balance"] > 1900)
]
print("Mean income of students with balance >1900")
print(students1900["income"].mean())
print("\nIncome standard deviation for students with balance >1900")
print(students1900["income"].std())
print("\nHow many students have balance above 1900?")
print(len(students1900))  # number of rows in the filtered data
print("\nWhat is the maximum balance of a student in the data?")
print(students["balance"].max())  # maximum over all students
The mean income of students with a balance above 1900 is 17859.55, and the maximum balance of any student in the data is 2654.32. The code also prints the standard deviation of income for these students and how many of them there are; for comparison, the data contain 2944 students in total.
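When combining conditions, each comparison needs its own parentheses, because `&` binds more tightly than `==` and `>`. Counting the matching rows can be done either with `len` on the filtered data or by summing the mask. The sketch below uses the five rows from `default.head()` and a threshold of 700 picked purely for illustration, so that some of those rows match.

```python
import pandas as pd

# Student status and balance for the first five rows of the dataset
head_rows = pd.DataFrame({
    "student": ["No", "Yes", "No", "No", "No"],
    "balance": [729.526495, 817.180407, 1073.549164, 529.250605, 785.655883],
})

# Illustrative threshold (not from the text); both conditions in parentheses
mask = (head_rows["student"] == "No") & (head_rows["balance"] > 700)
subset = head_rows.loc[mask]

print(len(subset))   # number of matching rows
print(mask.sum())    # same count: True counts as 1
print(subset["balance"].max())
```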