Last updated on Aug 1, 2024
Powered by AI and the LinkedIn community
1 What is z-score?
2 How to calculate z-score?
3 How to define outliers using z-score?
4 How to remove outliers using z-score?
5 What are the benefits and limitations of using z-score?
6 Here’s what else to consider
Outliers are data points that deviate significantly from the rest of the distribution. They can skew the results of your machine learning models and affect their performance. One way to detect and remove outliers from a dataset is to use the z-score, a measure of how many standard deviations a value lies from the mean. In this article, you will learn how to use the z-score to identify and filter out outliers in Python.
1 What is z-score?
Z-score, also known as standard score, is a numerical value that indicates how many standard deviations a data point is above or below the mean of the dataset. The mean is the average value of all the data points, and the standard deviation is a measure of how spread out the data is. A z-score of zero means that the data point is equal to the mean, a positive z-score means that the data point is above the mean, and a negative z-score means that the data point is below the mean. The higher the absolute value of the z-score, the more unusual the data point is.
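As a quick illustration of the definition (the sample values are invented for this sketch), the z-scores of a small dataset can be computed directly:

```python
# Hypothetical sample: how many standard deviations is each point from the mean?
data = [48, 50, 52, 70]
mean = sum(data) / len(data)                                    # 55.0
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5   # population std
z_scores = [(x - mean) / std for x in data]
# 70 sits well above the mean, so its z-score is the largest and positive
```

By construction, z-scores always average to zero over the sample, and the most unusual point gets the largest absolute value.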
- Deepika A Google Certified Data Analyst | IBM Certified ML Engineer | CCNA | Kaggle 3x Expert| Problem Solver - Bronze Badge @Codechef | Student at KPR Institute of Engineering and Technology
Scaling methods help reduce the influence of outliers and ensure the data is on the same scale. Common feature-scaling techniques include standardization and normalization (min-max scaling). Normalization is a data preprocessing technique that shifts and rescales the values of a feature to a common range, typically 0 to 1; it is also known as min-max scaling.
Min-max normalization: v' = ((v − min) / (max − min)) × (new_max − new_min) + new_min
Z-score normalization: v' = (v − μ) / σ, where μ is the mean and σ is the standard deviation.
- Shivani Paunikar, MSBA Data Engineer @Tucson Police Department | ASU Grad Medallion | Snowflake Certified | BGS Member
As I delved into the world of statistics, I encountered a concept that fascinated me – the Z-score. Essentially, the Z-score, also referred to as the standard score, serves as a numerical indicator, offering insights into the relative position of a data point within a dataset. It measures the number of standard deviations a particular data point deviates from the mean, the average value of the dataset. Reflecting on my own experience, I found that a Z-score of zero aligned perfectly with the dataset's mean, signifying an exact match. A positive Z-score, on the other hand, hinted at a data point above the mean, while a negative Z-score suggested a point below the mean.
- Vijayant Mehla Financial Risk Engineer | Goldman Sachs
The Z-score is a statistical measure that describes the relationship of a value to the mean of a group of values. The Z-score is expressed in standard deviations from the mean. A Z-score of 0 implies that the data point's value is the same as the mean score, while a Z-score of 1.0 indicates that the result is one standard deviation from the mean. They can be positive or negative, with a positive number indicating that the score is higher than the mean and a negative value indicating that it is lower than the mean.
- Nagesh Singh Chauhan Director - Data Science at OYO
The z-score, also known as standard score, quantifies how many standard deviations a data point is from the mean in a normal distribution. It's calculated by subtracting the mean from the data point and dividing by the standard deviation. A positive z-score indicates a value above the mean, while a negative score is below. Z-scores help assess the relative position of a data point within a distribution, facilitating comparison across different scales and distributions. They are crucial in statistical analysis, hypothesis testing, and identifying outliers or unusual observations in data sets.
- Abhinay Kalyankar Graduate student at W. P. Carey School of Business, ASU | Aspiring Business Analyst, Data Scientist
A Z-score, also known as a standard score, provides a normalized indication of how far an individual data point lies from the mean, in terms of standard deviations. A positive Z-score indicates a data point above the mean, while a negative Z-score indicates a data point below the mean.
2 How to calculate z-score?
To calculate the z-score of a data point, you need to subtract the mean from the value and divide it by the standard deviation. For example, if the mean of a dataset is 50 and the standard deviation is 10, and you have a data point of 70, the z-score is (70-50)/10 = 2. This means that the data point is two standard deviations above the mean. You can use the scipy library in Python to calculate the z-score of a dataset using the zscore() function. For example, if you have a list of values called data, you can use the following code to get the z-scores:
import scipy.stats as stats

z_scores = stats.zscore(data)
- Deepika A Google Certified Data Analyst | IBM Certified ML Engineer | CCNA | Kaggle 3x Expert| Problem Solver - Bronze Badge @Codechef | Student at KPR Institute of Engineering and Technology
Z-score normalization, also known as standardization, is a statistical method used to transform a dataset into a standard normal distribution. The formula for the Z-score of a data point x in a feature with mean μ and standard deviation σ is: Z = (x − μ) / σ. Here's a step-by-step explanation of the process:
1) Calculate the mean (μ): find the average value of the feature.
2) Calculate the standard deviation (σ): determine the amount of variation or dispersion in the feature.
3) Calculate the Z-score for each data point in the feature using the formula above.
- Shivani Paunikar, MSBA Data Engineer @Tucson Police Department | ASU Grad Medallion | Snowflake Certified | BGS Member
Consider a high school where the average student height is 165 cm, with a standard deviation of 10 cm. If a student is 185 cm tall, the Z-score, calculated as (X - μ) / σ, yields 2. This Z-score of 2 indicates that the student's height is two standard deviations above the mean. In simpler terms, the student is significantly taller than the average, offering a quick, standardized measure to understand the uniqueness of their height within the school population.
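The height example above checks out in code (numbers taken from the contribution, not real measurements):

```python
mu, sigma = 165, 10   # school-wide mean and standard deviation of height (cm)
x = 185               # one student's height
z = (x - mu) / sigma  # (185 - 165) / 10 = 2.0
```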
- Nagesh Singh Chauhan Director - Data Science at OYO
To calculate the z-score, subtract the mean from the data point and divide the result by the standard deviation. The formula is Z = (X - μ) / σ, where Z is the z-score, X is the individual data point, μ is the mean, and σ is the standard deviation. A positive z-score indicates the data point is above the mean, while a negative score denotes below. Z-scores aid in comparing data across different scales, identifying outliers, and assessing a data point's position in a normal distribution.
- Disant Upadhyay Senior Software Engineer @ REBL.ai | Rust engineer | Dev Ops | Quantum information theory, Blockchain, Renewable Energy
🔍 Calculating Z-Score Made Easy: Let's say you want to find out how a specific data point stands in a dataset. First, subtract the mean from your data point. Then, divide this difference by the standard deviation. For instance, in a dataset with a mean of 50 and a standard deviation of 10, a data point of 70 has a z-score of (70-50)/10 = 2. This data point is 2 standard deviations above the mean. Using Python? The scipy library's zscore() function makes this a breeze for any dataset. 🐍 #DataAnalytics #PythonTips
3 How to define outliers using z-score?
There is no definitive rule for defining outliers using z-score, but a common practice is to use a threshold value that determines how extreme a data point has to be to be considered an outlier. For example, you can use a threshold of 3, which means that any data point with a z-score greater than 3 or less than -3 is an outlier. This corresponds to about 0.3% of the data under a normal distribution. You can adjust the threshold depending on your data and your goals, but keep in mind that a lower threshold will remove more data points and a higher threshold will retain more data points.
- Deepika A Google Certified Data Analyst | IBM Certified ML Engineer | CCNA | Kaggle 3x Expert| Problem Solver - Bronze Badge @Codechef | Student at KPR Institute of Engineering and Technology
1) Calculate Z-scores: for each data point in the dataset, calculate its Z-score using the formula Z = (x − μ) / σ, where x is the data point, μ is the mean, and σ is the standard deviation.
2) Define a threshold: choose a threshold value beyond which Z-scores are considered outliers. A common choice is Z greater than 3 or less than -3.
3) Identify outliers: data points with a Z-score beyond the chosen threshold are considered outliers. For example: i) if Z > 3, the data point is an outlier on the high side; ii) if Z < -3, the data point is an outlier on the low side.
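The steps above can be sketched with NumPy (the dataset and the planted outlier are invented for illustration; note that with population z-scores the largest attainable |z| in a sample of n points is √(n−1), so a threshold of 3 can only ever fire for samples of at least 11 points):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 100], dtype=float)
z = (data - data.mean()) / data.std()  # population z-scores (ddof=0)
threshold = 3
outlier_mask = np.abs(z) > threshold   # True only for the planted 100
```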
- Shivani Paunikar, MSBA Data Engineer @Tucson Police Department | ASU Grad Medallion | Snowflake Certified | BGS Member
To spot outliers with the z-score, set a threshold—commonly 3. This means any z-score beyond 3 or below -3 is an outlier. It captures about 0.3% of data in a normal distribution. Adjust the threshold based on your data and goals. A lower threshold weeds out more data points, focusing on extreme outliers, while a higher one retains more data, potentially including moderate deviations. The choice depends on balancing outlier sensitivity with preserving a representative dataset.
- Nagesh Singh Chauhan Director - Data Science at OYO
Outliers using z-scores are typically defined as data points with z-scores beyond a certain threshold, often ±2 or ±3. A z-score greater than +2 or less than -2 signifies a data point significantly distant from the mean, suggesting it's an outlier. Adjusting the threshold to ±3 increases stringency. This method is efficient for identifying extreme values and is widely applied in statistical analysis to pinpoint unusual observations in a dataset.
- Abhinay Kalyankar Graduate student at W. P. Carey School of Business, ASU | Aspiring Business Analyst, Data Scientist
Typically, a Z-score threshold of ±3 is employed for outlier identification, signifying that any data point lying beyond this range is considered an outlier. This choice aligns with the empirical rule, suggesting that within ±3 standard deviations from the mean, approximately 99.73% of the data falls. In other words, the vast majority of the data, nearly 99.73%, is expected to reside within this range according to a normal distribution curve. This standardized approach to outlier detection leverages the statistical characteristics of the data to identify instances that significantly deviate from the norm, contributing to a robust and interpretable analysis.
- Mohamed Maoui Biomedical Engineering | Machine Learning | Signal Processing Developer at MaintainX
Outliers are data points that are significantly different from other observations. They might be caused by variability in the data or experimental errors. Outliers can skew statistical measures and data distributions, leading to misleading results. The Z-score is a measure of how far away a data point is from the mean in terms of standard deviations. Data points with a Z-score greater than 3 or less than -3 are generally considered outliers.
4 How to remove outliers using z-score?
To remove outliers using z-score, you need to filter out the data points whose z-score lies beyond the threshold. You can use a boolean mask to create a new dataset that only contains the data points whose z-score falls within the threshold. For example, if you have a NumPy array of values called data and an array of z-scores called z_scores, and you want to use a threshold of 3, you can use the following code to remove the outliers:
import numpy as np

threshold = 3
mask = np.abs(z_scores) < threshold
data_cleaned = data[mask]
- Deepika A Google Certified Data Analyst | IBM Certified ML Engineer | CCNA | Kaggle 3x Expert| Problem Solver - Bronze Badge @Codechef | Student at KPR Institute of Engineering and Technology
Steps to remove outliers using Z-score: 1) calculate Z-scores, 2) define a threshold, 3) identify outliers, 4) remove outliers. Here is an implementation on a sample dataset using the scipy library:
import numpy as np
from scipy.stats import zscore

# Example dataset
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Calculate Z-scores
z_scores = zscore(data)

# Define a threshold (note: with only 10 points the population
# z-score cannot exceed 3, so a strict cutoff of 3 would never fire here)
threshold = 2.5

# Identify outliers
outliers = np.where(np.abs(z_scores) > threshold)[0]

# Remove outliers
filtered_data = np.delete(data, outliers)
print("Original data:", data)
print("Filtered data without outliers:", filtered_data)
The first step is to calculate the z-score for each data point, call it x, using the formula (x − μ) / σ, where μ is the mean and σ is the standard deviation. Second, we define a threshold value, typically ±2 or ±3. Lastly, we remove the data points whose z-scores fall outside the chosen threshold.
- Nagesh Singh Chauhan Director - Data Science at OYO
To remove outliers using z-scores, set a threshold (commonly ±2 or ±3). Identify data points with z-scores beyond this limit and exclude them from the dataset. This process helps mitigate the influence of extreme values, ensuring a more robust and representative analysis. However, choosing an appropriate threshold requires consideration of the dataset's characteristics and the desired stringency in outlier removal. Additionally, alternative methods like the interquartile range (IQR) may complement z-score-based approaches for a comprehensive outlier detection strategy.
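As a complement, the IQR rule mentioned above can be sketched as follows (the 1.5× multiplier is the conventional default, and the data are invented for illustration):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 100], dtype=float)
q1, q3 = np.percentile(data, [25, 75])             # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # Tukey fences
cleaned = data[(data >= lower) & (data <= upper)]  # drops the planted 100
```

Unlike the z-score, the IQR rule depends on quartiles rather than the mean and standard deviation, so a single extreme value cannot inflate the fences that are supposed to catch it.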
- Ria Raj Data Analyst | IIM Alumnus | Turning Numbers into Insights | MySQL | Postgres | Python | Power BI | Storyteller with Data
Curating a clean dataset is vital for robust analysis. When tackling outliers, employing the Z-score method is a great way to identify and remove these extreme values. Z-scores help measure how many standard deviations a data point is from the mean. By setting a threshold (often around ±3), data points beyond this range can be considered outliers and subsequently excluded from the dataset. This ensures a more accurate analysis, allowing us to focus on the trends and patterns that truly represent the data.
5 What are the benefits and limitations of using z-score?
Using z-score to remove outliers is a simple and effective method that can improve the accuracy and robustness of your machine learning models. It can also help you normalize your data and reduce the impact of scaling and units. However, using z-score also has some limitations that you should be aware of. For example, z-score assumes that your data follows a normal distribution, which may not be the case for some datasets. It also does not account for the context and meaning of the data, and it may remove some valuable information that is not actually an outlier. Therefore, you should always check your data before and after applying z-score, and use other methods such as box plots, histograms, or domain knowledge to verify and complement your results.
- Deepika A Google Certified Data Analyst | IBM Certified ML Engineer | CCNA | Kaggle 3x Expert| Problem Solver - Bronze Badge @Codechef | Student at KPR Institute of Engineering and Technology
Benefits of using Z-score:
1) Standardization
2) Identifying outliers
3) Normal distribution
4) Data transformation
Limitations of using Z-score:
1) Sensitive to outliers
2) Assumes normal distribution
3) Not robust to skewed data
4) Dependent on sample size
5) Assumption of independence
6) Interpretation complexity
- Shivani Paunikar, MSBA Data Engineer @Tucson Police Department | ASU Grad Medallion | Snowflake Certified | BGS Member
Z-scores: quick pros & cons.
Pros:
- Improves model accuracy & robustness (removes outlier influence)
- Normalizes data, reducing scaling/unit issues
- Easy to calculate & interpret
Cons:
- Assumes normal distribution (may not hold for all data)
- Ignores data context & meaning (valuable info loss)
- Requires verifying results with other methods
- Nagesh Singh Chauhan Director - Data Science at OYO
Benefits: Z-scores standardize data, aiding comparison and outlier identification. They play a crucial role in statistical analyses, providing a normalized metric for interpretation and facilitating assessments of normal distribution.
Limitations: Z-scores assume normal distribution and are sensitive to outliers, impacting reliability. Skewed data and small sample sizes can distort interpretations, requiring caution in diverse datasets.
- Marco Antonio Peña Cubillos PhD Student | Data Scientist | AI Researcher | Machine Learning Engineer | ML Researcher |
In my experience, the benefits I have obtained are:
Pros:
1. Standardizing the data
2. Transforming the data
3. Scaling a DataFrame
Cons:
1. Not a very robust method
2. Assumes normality
3. Can be difficult to interpret; sometimes I need to use other methods, because it removes context and valuable information.
- Anand Prabhat Cloud Security Expert | AWS | Azure | Oracle Cloud
Pros:
- Using z-score to remove outliers is a simple and effective method that can improve the accuracy and robustness of your machine learning models.
- It can also help you normalize your data and reduce the impact of scaling and units.
Limitations:
- The Z-score itself is influenced by outliers, which can affect its accuracy in identifying other outliers.
- Z-score assumes that your data follows a normal distribution, which may not be the case for some datasets.
- It also does not account for the context and meaning of the data, and it may remove some valuable information that is not actually an outlier.
6 Here’s what else to consider
This is a space to share examples, stories, or insights that don’t fit into any of the previous sections. What else would you like to add?
First, to remove outliers using z-scores, calculate each data point's z-score, indicating how many standard deviations it is from the mean. Generally, data points with z-scores above +3 or below -3 are outliers. You can then filter these out from your dataset. Z-scores assume a normal distribution, which might not always fit your data. This method can be less effective in high-dimensional data, so consider using dimensionality reduction techniques like PCA. Also, be cautious in fields like healthcare, where outliers could be crucial. Always check your data's distribution and context before deciding.
- Karthik Azhagesan Senior Data Scientist @ Mercedes-Benz Mobility AG
Z-score can be useful for removing outliers in continuous features (assuming they follow a normal distribution). In most datasets, we have mixed feature types (both categorical and continuous). For removing outliers in categorical or mixed-type data, Z-score might not be useful. One way to remove outliers in a categorical or mixed-feature dataset is by using the Mahalanobis distance. Mahalanobis distance was introduced by P. C. Mahalanobis in 1936 and is such a cool metric even today! Steps are as follows:
1. Compute the Mahalanobis distance between each observation and the whole distribution.
2. The distances are then compared using a chi-squared statistic and a cut-off is decided.
3. The cut-off can be used to remove the outliers.
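A minimal sketch of the steps above, assuming purely numeric features (true categorical columns would first need a numeric encoding, which this snippet does not attempt) and an invented dataset with one planted outlier:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # hypothetical 3-feature dataset
X = np.vstack([X, [[8.0, 8.0, 8.0]]])   # plant one obvious outlier

# Step 1: squared Mahalanobis distance of every row to the distribution
mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Steps 2-3: chi-squared cut-off (df = number of features), then filter
cutoff = chi2.ppf(0.999, df=X.shape[1])
inliers = X[d2 <= cutoff]
```

The cut-off is justified because squared Mahalanobis distances of multivariate-normal data approximately follow a chi-squared distribution with degrees of freedom equal to the number of features.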
In practice, it is good practice to do a quick manual investigation of the identified outliers. Often, outliers are caused by measurement errors or wrongly entered values and should be removed for clean results. However, sometimes (seeming) outliers offer deeper insights into the underlying reality or the data collection process that go unnoticed at first sight.