Last updated on Aug 1, 2024
Powered by AI and the LinkedIn community
1 What is z-score?
2 How to calculate z-score?
3 How to define outliers using z-score?
4 How to remove outliers using z-score?
5 What are the benefits and limitations of using z-score?
6 Here’s what else to consider
Outliers are data points that deviate significantly from the rest of the distribution. They can skew the results of your machine learning models and affect their performance. One way to detect and remove outliers from a dataset is to use the z-score, a measure of how many standard deviations a value lies from the mean. In this article, you will learn how to use the z-score to identify and filter out outliers in Python.
1 What is z-score?
Z-score, also known as standard score, is a numerical value that indicates how many standard deviations a data point is above or below the mean of the dataset. The mean is the average value of all the data points, and the standard deviation is a measure of how spread out the data is. A z-score of zero means that the data point is equal to the mean, a positive z-score means that the data point is above the mean, and a negative z-score means that the data point is below the mean. The higher the absolute value of the z-score, the more unusual the data point is.
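As a quick illustration of the definition (the sample values are invented for this sketch), the z-scores of a small dataset can be computed directly:

```python
# Hypothetical sample: how many standard deviations is each point from the mean?
data = [48, 50, 52, 70]
mean = sum(data) / len(data)                                    # 55.0
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5   # population std
z_scores = [(x - mean) / std for x in data]
# 70 sits well above the mean, so its z-score is the largest and positive
```

By construction, z-scores always average to zero over the sample, and the most unusual point gets the largest absolute value.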
- Deepika A Google Certified Data Analyst | IBM Certified ML Engineer | CCNA | Kaggle 3x Expert| Problem Solver - Bronze Badge @Codechef | Student at KPR Institute of Engineering and Technology
Scaling methods help reduce the influence of outliers and ensure the data is on the same scale. Common feature-scaling techniques include standardization and normalization (min-max scaling). Normalization is a data preprocessing technique that shifts and rescales the values of a feature to a common range, typically 0 to 1; it is also known as min-max scaling.
Min-max normalization: v' = ((v − min) / (max − min)) × (new_max − new_min) + new_min
Z-score normalization: v' = (v − μ) / σ, where μ is the mean and σ is the standard deviation.
- Shivani Paunikar, MSBA Data Engineer @Tucson Police Department | ASU Grad Medallion | Snowflake Certified | BGS Member
As I delved into the world of statistics, I encountered a concept that fascinated me – the Z-score. Essentially, the Z-score, also referred to as the standard score, serves as a numerical indicator, offering insights into the relative position of a data point within a dataset. It measures the number of standard deviations a particular data point deviates from the mean, the average value of the dataset. Reflecting on my own experience, I found that a Z-score of zero aligned perfectly with the dataset's mean, signifying an exact match. A positive Z-score, on the other hand, hinted at a data point above the mean, while a negative Z-score suggested a point below the mean.
- Vijayant Mehla Financial Risk Engineer | Goldman Sachs
The Z-score is a statistical measure that describes the relationship of a value to the mean of a group of values. The Z-score is expressed in standard deviations from the mean. A Z-score of 0 implies that the data point's value is the same as the mean score, while a Z-score of 1.0 indicates that the result is one standard deviation from the mean. They can be positive or negative, with a positive number indicating that the score is higher than the mean and a negative value indicating that it is lower than the mean.
- Nagesh Singh Chauhan Director - Data Science at OYO
The z-score, also known as standard score, quantifies how many standard deviations a data point is from the mean in a normal distribution. It's calculated by subtracting the mean from the data point and dividing by the standard deviation. A positive z-score indicates a value above the mean, while a negative score is below. Z-scores help assess the relative position of a data point within a distribution, facilitating comparison across different scales and distributions. They are crucial in statistical analysis, hypothesis testing, and identifying outliers or unusual observations in data sets.
- Abhinay Kalyankar Graduate student at W. P. Carey School of Business, ASU | Aspiring Business Analyst, Data Scientist
A Z-score, also known as a standard score, provides a normalized indication of how far an individual data point lies from the mean, in terms of standard deviations. A positive Z-score indicates a data point above the mean, while a negative Z-score indicates a data point below the mean.
2 How to calculate z-score?
To calculate the z-score of a data point, you need to subtract the mean from the value and divide it by the standard deviation. For example, if the mean of a dataset is 50 and the standard deviation is 10, and you have a data point of 70, the z-score is (70-50)/10 = 2. This means that the data point is two standard deviations above the mean. You can use the scipy library in Python to calculate the z-score of a dataset using the zscore() function. For example, if you have a list of values called data, you can use the following code to get the z-scores:
import scipy.stats as stats

z_scores = stats.zscore(data)
- Deepika A Google Certified Data Analyst | IBM Certified ML Engineer | CCNA | Kaggle 3x Expert| Problem Solver - Bronze Badge @Codechef | Student at KPR Institute of Engineering and Technology
Z-score normalization, also known as standardization, is a statistical method used to transform a dataset into a standard normal distribution. The formula for the Z-score of a data point x in a feature with mean μ and standard deviation σ is: Z = (x − μ) / σ. Here's a step-by-step explanation of the process:
1) Calculate the mean (μ): find the average value of the feature.
2) Calculate the standard deviation (σ): determine the amount of variation or dispersion in the feature.
3) Calculate the Z-score for each data point in the feature using the formula above.
- Shivani Paunikar, MSBA Data Engineer @Tucson Police Department | ASU Grad Medallion | Snowflake Certified | BGS Member
Consider a high school where the average student height is 165 cm, with a standard deviation of 10 cm. If a student is 185 cm tall, the Z-score, calculated as (X - μ) / σ, yields 2. This Z-score of 2 indicates that the student's height is two standard deviations above the mean. In simpler terms, the student is significantly taller than the average, offering a quick, standardized measure to understand the uniqueness of their height within the school population.
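The height example above checks out in code (numbers taken from the contribution, not real measurements):

```python
mu, sigma = 165, 10   # school-wide mean and standard deviation of height (cm)
x = 185               # one student's height
z = (x - mu) / sigma  # (185 - 165) / 10 = 2.0
```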
- Nagesh Singh Chauhan Director - Data Science at OYO
To calculate the z-score, subtract the mean from the data point and divide the result by the standard deviation. The formula is Z = (X - μ) / σ, where Z is the z-score, X is the individual data point, μ is the mean, and σ is the standard deviation. A positive z-score indicates the data point is above the mean, while a negative score denotes below. Z-scores aid in comparing data across different scales, identifying outliers, and assessing a data point's position in a normal distribution.
- Disant Upadhyay Senior Software Engineer @ REBL.ai | Rust engineer | Dev Ops | Quantum information theory, Blockchain, Renewable Energy
🔍 Calculating Z-Score Made Easy: Let's say you want to find out how a specific data point stands in a dataset. First, subtract the mean from your data point. Then, divide this difference by the standard deviation. For instance, in a dataset with a mean of 50 and a standard deviation of 10, a data point of 70 has a z-score of (70-50)/10 = 2. This data point is 2 standard deviations above the mean. Using Python? The scipy library's zscore() function makes this a breeze for any dataset. 🐍 #DataAnalytics #PythonTips
3 How to define outliers using z-score?
There is no definitive rule for defining outliers using z-score, but a common practice is to use a threshold value that determines how extreme a data point has to be to be considered an outlier. For example, you can use a threshold of 3, which means that any data point with a z-score greater than 3 or less than -3 is an outlier. This corresponds to about 0.3% of the data under a normal distribution. You can adjust the threshold depending on your data and your goals, but keep in mind that a lower threshold will remove more data points and a higher threshold will retain more data points.
- Deepika A Google Certified Data Analyst | IBM Certified ML Engineer | CCNA | Kaggle 3x Expert| Problem Solver - Bronze Badge @Codechef | Student at KPR Institute of Engineering and Technology
1) Calculate Z-scores: for each data point in the dataset, calculate its Z-score using the formula Z = (x − μ) / σ, where x is the data point, μ is the mean, and σ is the standard deviation.
2) Define a threshold: choose a threshold value beyond which Z-scores are considered outliers. A common choice is Z greater than 3 or less than -3.
3) Identify outliers: data points with a Z-score beyond the chosen threshold are considered outliers. For example: i) if Z > 3, the data point is an outlier on the high side; ii) if Z < -3, the data point is an outlier on the low side.
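The steps above can be sketched with NumPy (the dataset and the planted outlier are invented for illustration; note that with population z-scores the largest attainable |z| in a sample of n points is √(n−1), so a threshold of 3 can only ever fire for samples of at least 11 points):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 100], dtype=float)
z = (data - data.mean()) / data.std()  # population z-scores (ddof=0)
threshold = 3
outlier_mask = np.abs(z) > threshold   # True only for the planted 100
```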
- Shivani Paunikar, MSBA Data Engineer @Tucson Police Department | ASU Grad Medallion | Snowflake Certified | BGS Member
To spot outliers with the z-score, set a threshold—commonly 3. This means any z-score beyond 3 or below -3 is an outlier. It captures about 0.3% of data in a normal distribution. Adjust the threshold based on your data and goals. A lower threshold weeds out more data points, focusing on extreme outliers, while a higher one retains more data, potentially including moderate deviations. The choice depends on balancing outlier sensitivity with preserving a representative dataset.
- Nagesh Singh Chauhan Director - Data Science at OYO
Outliers using z-scores are typically defined as data points with z-scores beyond a certain threshold, often ±2 or ±3. A z-score greater than +2 or less than -2 signifies a data point significantly distant from the mean, suggesting it's an outlier. Adjusting the threshold to ±3 increases stringency. This method is efficient for identifying extreme values and is widely applied in statistical analysis to pinpoint unusual observations in a dataset.
- Abhinay Kalyankar Graduate student at W. P. Carey School of Business, ASU | Aspiring Business Analyst, Data Scientist
Typically, a Z-score threshold of ±3 is employed for outlier identification, signifying that any data point lying beyond this range is considered an outlier. This choice aligns with the empirical rule, suggesting that within ±3 standard deviations from the mean, approximately 99.73% of the data falls. In other words, the vast majority of the data, nearly 99.73%, is expected to reside within this range according to a normal distribution curve. This standardized approach to outlier detection leverages the statistical characteristics of the data to identify instances that significantly deviate from the norm, contributing to a robust and interpretable analysis.
- Mohamed Maoui Biomedical Engineering | Machine Learning | Signal Processing Developer at MaintainX
Outliers are data points that are significantly different from other observations. They might be caused by variability in the data or experimental errors. Outliers can skew statistical measures and data distributions, leading to misleading results. The Z-score is a measure of how far away a data point is from the mean in terms of standard deviations. Data points with a Z-score greater than 3 or less than -3 are generally considered outliers.
4 How to remove outliers using z-score?
To remove outliers using z-score, you need to filter out the data points whose z-score lies beyond the threshold. You can use a boolean mask to create a new dataset that only contains the data points whose z-score falls within the threshold. For example, if you have a NumPy array of values called data and an array of z-scores called z_scores, and you want to use a threshold of 3, you can use the following code to remove the outliers:
import numpy as np

threshold = 3
mask = np.abs(z_scores) < threshold
data_cleaned = data[mask]
- Deepika A Google Certified Data Analyst | IBM Certified ML Engineer | CCNA | Kaggle 3x Expert| Problem Solver - Bronze Badge @Codechef | Student at KPR Institute of Engineering and Technology
Steps to remove outliers using Z-score: 1) calculate Z-scores, 2) define a threshold, 3) identify outliers, 4) remove outliers. Here is an implementation on a sample dataset using the scipy library:
import numpy as np
from scipy.stats import zscore

# Example dataset
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Calculate Z-scores
z_scores = zscore(data)

# Define a threshold (note: with only 10 points the population
# z-score cannot exceed 3, so a strict cutoff of 3 would never fire here)
threshold = 2.5

# Identify outliers
outliers = np.where(np.abs(z_scores) > threshold)[0]

# Remove outliers
filtered_data = np.delete(data, outliers)
print("Original data:", data)
print("Filtered data without outliers:", filtered_data)
The first step is to calculate the z-score for each data point, call it x, using the formula (x − μ) / σ, where μ is the mean and σ is the standard deviation. Second, we define a threshold value, typically ±2 or ±3. Lastly, we remove the data points whose z-scores fall outside the chosen threshold.
- Nagesh Singh Chauhan Director - Data Science at OYO
To remove outliers using z-scores, set a threshold (commonly ±2 or ±3). Identify data points with z-scores beyond this limit and exclude them from the dataset. This process helps mitigate the influence of extreme values, ensuring a more robust and representative analysis. However, choosing an appropriate threshold requires consideration of the dataset's characteristics and the desired stringency in outlier removal. Additionally, alternative methods like the interquartile range (IQR) may complement z-score-based approaches for a comprehensive outlier detection strategy.
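As a complement, the IQR rule mentioned above can be sketched as follows (the 1.5× multiplier is the conventional default, and the data are invented for illustration):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 100], dtype=float)
q1, q3 = np.percentile(data, [25, 75])             # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # Tukey fences
cleaned = data[(data >= lower) & (data <= upper)]  # drops the planted 100
```

Unlike the z-score, the IQR rule depends on quartiles rather than the mean and standard deviation, so a single extreme value cannot inflate the fences that are supposed to catch it.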
- Ria Raj Data Analyst | IIM Alumnus | Turning Numbers into Insights | MySQL | Postgres | Python | Power BI | Storyteller with Data
Curating a clean dataset is vital for robust analysis. When tackling outliers, employing the Z-score method is a great way to identify and remove these extreme values. Z-scores help measure how many standard deviations a data point is from the mean. By setting a threshold (often around ±3), data points beyond this range can be considered outliers and subsequently excluded from the dataset. This ensures a more accurate analysis, allowing us to focus on the trends and patterns that truly represent the data.
5 What are the benefits and limitations of using z-score?
Using z-score to remove outliers is a simple and effective method that can improve the accuracy and robustness of your machine learning models. It can also help you normalize your data and reduce the impact of scaling and units. However, using z-score also has some limitations that you should be aware of. For example, z-score assumes that your data follows a normal distribution, which may not be the case for some datasets. It also does not account for the context and meaning of the data, and it may remove some valuable information that is not actually an outlier. Therefore, you should always check your data before and after applying z-score, and use other methods such as box plots, histograms, or domain knowledge to verify and complement your results.
- Deepika A Google Certified Data Analyst | IBM Certified ML Engineer | CCNA | Kaggle 3x Expert| Problem Solver - Bronze Badge @Codechef | Student at KPR Institute of Engineering and Technology
Benefits of using Z-score:
1) Standardization
2) Identifying outliers
3) Normal distribution
4) Data transformation
Limitations of using Z-score:
1) Sensitive to outliers
2) Assumes normal distribution
3) Not robust to skewed data
4) Dependent on sample size
5) Assumption of independence
6) Interpretation complexity
- Shivani Paunikar, MSBA Data Engineer @Tucson Police Department | ASU Grad Medallion | Snowflake Certified | BGS Member
Z-scores: quick pros & cons.
Pros:
- Improves model accuracy & robustness (removes outlier influence)
- Normalizes data, reducing scaling/unit issues
- Easy to calculate & interpret
Cons:
- Assumes normal distribution (may not hold for all data)
- Ignores data context & meaning (valuable info loss)
- Requires verifying results with other methods
- Nagesh Singh Chauhan Director - Data Science at OYO
Benefits: Z-scores standardize data, aiding comparison and outlier identification. They play a crucial role in statistical analyses, providing a normalized metric for interpretation and facilitating assessments of normal distribution.
Limitations: Z-scores assume normal distribution and are sensitive to outliers, impacting reliability. Skewed data and small sample sizes can distort interpretations, requiring caution in diverse datasets.
- Marco Antonio Peña Cubillos PhD Student | Data Scientist | AI Researcher | Machine Learning Engineer | ML Researcher |
In my experience, the benefits I have obtained are:
Pros:
1. Standardizing the data
2. Transforming the data
3. Scaling a DataFrame
Cons:
1. Not a very robust method
2. Assumes normality
3. Can be difficult to interpret; sometimes I need to use other methods, because it removes context and valuable information.
- Anand Prabhat Cloud Security Expert | AWS | Azure | Oracle Cloud
Pros:
- Using z-score to remove outliers is a simple and effective method that can improve the accuracy and robustness of your machine learning models.
- It can also help you normalize your data and reduce the impact of scaling and units.
Limitations:
- The Z-score itself is influenced by outliers, which can affect its accuracy in identifying other outliers.
- Z-score assumes that your data follows a normal distribution, which may not be the case for some datasets.
- It also does not account for the context and meaning of the data, and it may remove some valuable information that is not actually an outlier.
6 Here’s what else to consider
This is a space to share examples, stories, or insights that don’t fit into any of the previous sections. What else would you like to add?
First, to remove outliers using z-scores, calculate each data point's z-score, indicating how many standard deviations it is from the mean. Generally, data points with z-scores above +3 or below -3 are outliers. You can then filter these out from your dataset. Z-scores assume a normal distribution, which might not always fit your data. This method can be less effective in high-dimensional data, so consider using dimensionality reduction techniques like PCA. Also, be cautious in fields like healthcare, where outliers could be crucial. Always check your data's distribution and context before deciding.
- Karthik Azhagesan Senior Data Scientist @ Mercedes-Benz Mobility AG
Z-score can be useful for removing outliers in continuous features (assuming they follow a normal distribution). In most datasets, we have mixed feature types (both categorical and continuous). For removing outliers in categorical or mixed-type data, Z-score might not be useful. One way to remove outliers in a categorical or mixed-feature dataset is by using the Mahalanobis distance. Mahalanobis distance was introduced by P. C. Mahalanobis in 1936 and is such a cool metric even today! Steps are as follows:
1. Compute the Mahalanobis distance between each observation and the whole distribution.
2. The distances are then compared using a chi-squared statistic and a cut-off is decided.
3. The cut-off can be used to remove the outliers.
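A minimal sketch of the steps above, assuming purely numeric features (true categorical columns would first need a numeric encoding, which this snippet does not attempt) and an invented dataset with one planted outlier:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # hypothetical 3-feature dataset
X = np.vstack([X, [[8.0, 8.0, 8.0]]])   # plant one obvious outlier

# Step 1: squared Mahalanobis distance of every row to the distribution
mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Steps 2-3: chi-squared cut-off (df = number of features), then filter
cutoff = chi2.ppf(0.999, df=X.shape[1])
inliers = X[d2 <= cutoff]
```

The cut-off is justified because squared Mahalanobis distances of multivariate-normal data approximately follow a chi-squared distribution with degrees of freedom equal to the number of features.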
In practice, it is good practice to do a quick manual investigation of the identified outliers. Often, outliers are caused by measurement errors or wrongly entered values and should be removed for clean results. However, sometimes (seeming) outliers offer deeper insights into the underlying reality or the data collection process that go unnoticed at first sight.