Statistics: Definitions
We have obtained a set of N different values for the quantity X: $\{X_i\}$.
- Mean value (average)
$$ \mu \equiv \frac{\sum_i X_i}{N}$$
- Standard deviation
$$ \sigma \equiv \sqrt{\frac{\sum_i (X_i - \mu)^2}{N-1}}$$
- Centered moments
$$\mu_n \equiv \frac{\sum_i (X_i - \mu)^n}{N-1}$$
- Skewness
$${\rm Skew} \equiv \frac{\mu_3}{\sigma^3}$$
- Kurtosis
$${\rm Kur} \equiv \frac{\mu_4}{\sigma^4}$$
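As a concrete illustration, the definitions above can be computed directly. The following is a minimal Python sketch (the function name `moments` and the use of plain lists are our own choices, not part of VOSA); note that it keeps the $N-1$ denominators used in the definitions:

```python
import math

def moments(values):
    """Mean, standard deviation, skewness and kurtosis,
    following the definitions above (N-1 denominators)."""
    n = len(values)
    mu = sum(values) / n
    # n-th centered moment: mu_k = sum((X_i - mu)^k) / (N - 1)
    def mu_k(k):
        return sum((x - mu) ** k for x in values) / (n - 1)
    sigma = math.sqrt(mu_k(2))
    skew = mu_k(3) / sigma ** 3
    kurt = mu_k(4) / sigma ** 4
    return mu, sigma, skew, kurt
```

For a symmetric sample such as [1, 2, 3, 4, 5] the skewness is exactly zero.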
Definitions for a grouped distribution
The values can be grouped in different bins, so that we have a set of ordered pairs {value,frequency}.
$$ \{X_i,Freq(X_i)\}$$
$${\rm with}\ X_i > X_{i-1}$$
- Percentiles.
A percentile is the value below which a given percentage of observations in a group of observations fall.
In other words, the Percentile $P_k$ is defined as the value so that k/100 of the values in the distribution are smaller than it.
Let's define some notations for the case of grouped values:
$N = \sum Freq(X_i)$ (total number of values)
$ S_n = \sum_{i \le n} Freq(X_i) $ (cumulative sum of frequencies up to the n-th bin)
$ S_k = k \cdot N/100 $ is the cumulative count corresponding to the k-th percentile (for instance, if we are looking for $P_{73}$ in a distribution with 1000 values, $S_k = 730$)
When we are looking for the k-th percentile and there is a bin n such that $S_n = S_k$, then $P_k = X_n$.
But it often happens that $S_{n-1} < S_k < S_n$ instead. In this case, the k-th percentile can be calculated using a linear interpolation:
$$P_k = X_{n-1} + (X_n - X_{n-1}) \frac{S_k - S_{n-1}}{S_n - S_{n-1}} $$
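The rule above can be sketched as follows. This is a hypothetical `percentile` helper (not VOSA code), assuming the (value, frequency) pairs are sorted by value:

```python
def percentile(pairs, k):
    """P_k for grouped data: pairs is a sorted list of (value, frequency)."""
    n = sum(freq for _, freq in pairs)   # total number of values
    s_k = k * n / 100.0                  # target cumulative count S_k
    cum = 0
    prev_x, prev_cum = pairs[0][0], 0
    for x, freq in pairs:
        cum += freq
        if cum == s_k:                   # S_n = S_k: the percentile is X_n
            return x
        if cum > s_k:                    # S_{n-1} < S_k < S_n: interpolate
            return prev_x + (x - prev_x) * (s_k - prev_cum) / (cum - prev_cum)
        prev_x, prev_cum = x, cum
    return pairs[-1][0]
```

For instance, with pairs = [(1, 2), (2, 3), (3, 5)] (so N = 10), percentile(pairs, 50) is exactly 2, while percentile(pairs, 73) interpolates to 2.46.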
- Quartiles
The quartiles of a distribution are defined as the 25, 50 and 75 percentiles. That is:
$$Q_1 = P_{25}$$
$$Q_2 = P_{50}$$
$$Q_3 = P_{75}$$
- Median
The median is defined as the X value such that half the values in the distribution are smaller and the other half are larger; it is the middle point of the distribution.
In practice, it is defined as $P_{50}$.
$${\rm Median} = P_{50}$$
- Mode
The mode is the value that appears most often in a set of data.
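In Python, for example, the mode can be obtained with `collections.Counter` (when several values share the highest frequency, this picks the one seen first):

```python
from collections import Counter

def mode(values):
    """Return the value that appears most often in the data."""
    return Counter(values).most_common(1)[0][0]
```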
Normality tests
There are several tests that can be used to assess whether a given set of values corresponds to an underlying normal distribution.
In VOSA we have implemented Pearson's chi-squared goodness-of-fit test. It is used both in the Bayes analysis and in the Chi2 model fit (when parameter uncertainties are estimated using a Monte Carlo method).
Pearson's chi-squared test
Pearson's chi-squared test uses a measure of goodness of fit which is the sum of differences between observed and expected outcome frequencies (that is, counts of observations), each squared and divided by the expectation:
$$ \chi ^{2} = \sum _{i=1}^{n} \frac{ (O_{i}-E_{i})^{2} }{E_{i}} $$
where:
- $O_i$ = observed frequency for bin i.
- $E_i$ = expected frequency for bin i.
The expected frequency is calculated by:
$$ E_{i} = N \cdot [ F(Y_{u}) - F(Y_{l}) ] $$
where:
- F = the cumulative distribution function for the normal distribution.
- $Y_u$ = the upper limit for bin i,
- $Y_l$ = the lower limit for bin i, and
- N = the sample size
Once the value of $\chi^2$ has been obtained, we compare it to the chi-squared distribution with the corresponding number of degrees of freedom and obtain a range of values for the probability that our values, $\{X_i, Freq(X_i)\}$, correspond to an underlying normal distribution.
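The computation of the statistic itself can be sketched as follows. This is a minimal Python illustration of the formulas above (the function names are our own, and `math.erf` is used to evaluate the normal cumulative distribution function F):

```python
import math

def normal_cdf(y, mu, sigma):
    """Cumulative distribution function F of the normal distribution."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

def chi2_statistic(edges, observed, mu, sigma):
    """Pearson chi-squared statistic for binned counts against a normal
    distribution. edges holds the len(observed) + 1 bin boundaries."""
    n = sum(observed)
    chi2 = 0.0
    for i, o_i in enumerate(observed):
        # E_i = N * [F(Y_u) - F(Y_l)]
        e_i = n * (normal_cdf(edges[i + 1], mu, sigma) - normal_cdf(edges[i], mu, sigma))
        chi2 += (o_i - e_i) ** 2 / e_i
    return chi2
```

Counts close to what a standard normal distribution predicts yield a small statistic, while strongly skewed counts yield a large one.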
See, for instance, the Wikipedia article on Goodness of fit for more details.