Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
Parameters:
percentiles : list-like of numbers, optional
include : 'all', list-like of dtypes or None (default), optional
exclude : list-like of dtypes or None (default), optional
Returns:
summary : Series/DataFrame of summary statistics
See also
DataFrame.count, DataFrame.max, DataFrame.min, DataFrame.mean, DataFrame.std, DataFrame.select_dtypes
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
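The percentile rows can be changed through the percentiles parameter. The following is a minimal sketch, assuming percentiles are passed as fractions between 0 and 1; the sample values are made up for illustration.

```python
import pandas as pd

# Illustrative data, not taken from the documentation above.
s = pd.Series([1, 2, 3, 4, 5])

# Ask for the 10th, 50th and 90th percentiles instead of the default 25/50/75.
summary = s.describe(percentiles=[0.1, 0.5, 0.9])

# The percentile rows are labelled with a percent sign.
print(summary.index.tolist())

# The 50% row matches the median of the data.
print(summary["50%"] == s.median())
```

Individual statistics can be read from the result by label, for example summary["90%"].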
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
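As a rough illustration of how these four entries relate to other Series methods, the sketch below compares them with count, nunique and value_counts on a small string Series. The variable names are only for illustration.

```python
import pandas as pd

s = pd.Series(["a", "a", "b", "c"])
summary = s.describe()

# value_counts() sorts by frequency, so its first entry is the most common value.
counts = s.value_counts()

print(summary["count"] == s.count())       # number of non-null values
print(summary["unique"] == s.nunique())    # number of distinct values
print(summary["top"] == counts.index[0])   # most common value
print(summary["freq"] == counts.iloc[0])   # how often the most common value occurs
```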
If multiple object values have the highest count, then the top result will be arbitrarily chosen from among those with the highest count.
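A small sketch of the tie case, using made-up data in which two values share the highest count. Because the choice is arbitrary, the code checks membership rather than relying on a specific value.

```python
import pandas as pd

# 'a' and 'b' both occur twice, so either may be reported as top.
s = pd.Series(["a", "a", "b", "b", "c"])
summary = s.describe()

# Do not depend on which of the tied values is returned.
print(summary["top"] in {"a", "b"})

# The frequency is the same whichever tied value is picked.
print(summary["freq"])
```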
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
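The sketch below illustrates both statements on a small, made-up DataFrame. It assumes that include selects columns in the same way as DataFrame.select_dtypes, which is listed under "See also"; the column names are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"numeric": [1, 2, 3], "object": ["a", "b", "c"]})

# On a DataFrame, include limits which columns are summarized.
numeric_only = df.describe(include=[np.number])
selected = df.select_dtypes(include=[np.number])
print(list(numeric_only.columns) == list(selected.columns))

# On a Series, include and exclude have no effect on the result.
with_args = df["numeric"].describe(include=[object])
without_args = df["numeric"].describe()
print(with_args.equals(without_args))
```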
Examples
Describing a numeric Series.
    >>> s = pd.Series([1, 2, 3])
    >>> s.describe()
    count    3.0
    mean     2.0
    std      1.0
    min      1.0
    25%      1.5
    50%      2.0
    75%      2.5
    max      3.0
    dtype: float64
Describing a categorical Series.
    >>> s = pd.Series(['a', 'a', 'b', 'c'])
    >>> s.describe()
    count     4
    unique    3
    top       a
    freq      2
    dtype: object
Describing a timestamp Series.
    >>> s = pd.Series([
    ...     np.datetime64("2000-01-01"),
    ...     np.datetime64("2010-01-01"),
    ...     np.datetime64("2010-01-01")
    ... ])
    >>> s.describe()
    count                       3
    unique                      2
    top       2010-01-01 00:00:00
    freq                        2
    first     2000-01-01 00:00:00
    last      2010-01-01 00:00:00
    dtype: object
Describing a DataFrame. By default only numeric fields are returned.
    >>> df = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']],
    ...                   columns=['numeric', 'object'])
    >>> df.describe()
           numeric
    count      3.0
    mean       2.0
    std        1.0
    min        1.0
    25%        1.5
    50%        2.0
    75%        2.5
    max        3.0
Describing all columns of a DataFrame regardless of data type.
    >>> df.describe(include='all')
            numeric object
    count       3.0      3
    unique      NaN      3
    top         NaN      b
    freq        NaN      1
    mean        2.0    NaN
    std         1.0    NaN
    min         1.0    NaN
    25%         1.5    NaN
    50%         2.0    NaN
    75%         2.5    NaN
    max         3.0    NaN
Describing a column from a DataFrame by accessing it as an attribute.
    >>> df.numeric.describe()
    count    3.0
    mean     2.0
    std      1.0
    min      1.0
    25%      1.5
    50%      2.0
    75%      2.5
    max      3.0
    Name: numeric, dtype: float64
Including only numeric columns in a DataFrame description.
    >>> df.describe(include=[np.number])
           numeric
    count      3.0
    mean       2.0
    std        1.0
    min        1.0
    25%        1.5
    50%        2.0
    75%        2.5
    max        3.0
Including only string columns in a DataFrame description.
    >>> df.describe(include=[object])
           object
    count       3
    unique      3
    top         b
    freq        1
Excluding numeric columns from a DataFrame description.
    >>> df.describe(exclude=[np.number])
           object
    count       3
    unique      3
    top         b
    freq        1
Excluding object columns from a DataFrame description.
    >>> df.describe(exclude=[object])
           numeric
    count      3.0
    mean       2.0
    std        1.0
    min        1.0
    25%        1.5
    50%        2.0
    75%        2.5
    max        3.0