# 2.65.4.2 Algorithm for Data Transformation

Origin provides three functions for transforming data to follow a normal distribution: the Box-Cox transformation, the Johnson transformation, and the Yeo-Johnson transformation.

The Box-Cox and Yeo-Johnson transformations are both power transformations. The difference is that the Box-Cox transformation applies only to data that are all positive, whereas the Yeo-Johnson transformation can be used on any data without restriction. The Johnson transformation instead uses the Johnson distribution system; it can check the normality of the original data as well as transform the data.

## Box-Cox Transformation

The Box-Cox transformation is a kind of power transformation, and it works only for positive data. The transformed value is given by:

$Y'=\left\{ \begin{array}{ll} Y^\lambda&\lambda \neq 0\cr \ln(Y)&\lambda = 0 \end{array} \right.$

Here $\lambda$ is in the range of $[-5, 5]$.
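As a minimal sketch of the formula above, the following Python function applies the transformation to a list of values. The function name `box_cox` and the error handling are illustrative assumptions, not part of Origin.

```python
import math

def box_cox(values, lam):
    """Return Y**lam for lam != 0, or ln(Y) for lam == 0 (hypothetical helper)."""
    if not -5 <= lam <= 5:
        raise ValueError("lambda is expected to lie in [-5, 5]")
    if any(y <= 0 for y in values):
        raise ValueError("the Box-Cox transformation requires positive data")
    if lam == 0:
        return [math.log(y) for y in values]
    return [y ** lam for y in values]
```

For example, `box_cox([1.0, 4.0, 9.0], 0.5)` takes the square root of each value.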

### Optimal $\lambda$

Origin estimates the optimal $\lambda$ in the range $[-5, 5]$ as the value that gives the minimal standard deviation of the transformed data. Because the scale of the transformed data depends on $\lambda$, the transformed data must be standardized before the standard deviations are compared. The following formula is used for the standardization:

$Z_i=\left\{ \begin{array}{ll} \frac{Y_i^\lambda -1}{\lambda G^{\lambda-1}}&\lambda \neq 0\cr G \ln(Y_i)&\lambda = 0 \end{array} \right.$

where $i$ indexes the $i$th data point and $G$ is the geometric mean of the original data. The standard deviation is then calculated from $Z$.
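The standardization can be sketched in Python as follows. The helper names `geometric_mean` and `standardized_box_cox` are assumptions made for illustration; the geometric mean is computed through logarithms, which is why the data must be positive.

```python
import math

def geometric_mean(values):
    """Geometric mean G of positive data, computed via the mean of logs."""
    return math.exp(sum(math.log(y) for y in values) / len(values))

def standardized_box_cox(values, lam):
    """Return the standardized values Z_i for a given lambda."""
    g = geometric_mean(values)
    if lam == 0:
        return [g * math.log(y) for y in values]
    scale = lam * g ** (lam - 1)
    return [(y ** lam - 1) / scale for y in values]
```

With $\lambda = 1$ the scale factor is 1, so each $Z_i$ is simply $Y_i - 1$ and the standard deviation of $Z$ equals that of the original data.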

The optimization uses the golden section search algorithm. The detailed steps are:

1. Initialize the range for the optimization, here from -5 to 5, and the tolerance for stopping the iteration.
2. Narrow down the range by the golden ratio, that is
   $GoldenRatio=(\sqrt{5}+1)/2$
   $LengthOfOldRange=OldLargeEndPoint-OldSmallEndPoint$
   $NewSmallEndPoint=OldSmallEndPoint+LengthOfOldRange/GoldenRatio$
   $NewLargeEndPoint=OldLargeEndPoint-LengthOfOldRange/GoldenRatio$
   This gives a smaller new range.
3. Take the two end points of the new range as two values of $\lambda$, calculate the $Z$ values for each, and then the standard deviation of each.
4. Compare the two standard deviations.
   If the standard deviation at the small end point of the new range is larger than the one at the large end point of the new range, update the range to run from the small end point of the old range to the small end point of the new range.
   Otherwise, update the range to run from the large end point of the new range to the large end point of the old range.
5. Take the updated range from step 4 as the old range and repeat steps 2 to 4 until the length of the old range is smaller than the specified tolerance. This old range is the final range.
6. The midpoint of the final range is taken as the optimal $\lambda$.
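The steps above can be sketched in Python as a generic minimizer. The function name, the `objective` callback, and the default tolerance of `1e-6` are assumptions for illustration; in this context the objective would be the standard deviation of $Z$ for a given $\lambda$. Because $1/GoldenRatio \approx 0.618$, the point computed by the $NewSmallEndPoint$ formula lies above the one computed by the $NewLargeEndPoint$ formula, so the sketch calls them `upper_probe` and `lower_probe`.

```python
import math

GOLDEN_RATIO = (math.sqrt(5) + 1) / 2

def golden_section_minimize(objective, low=-5.0, high=5.0, tol=1e-6):
    """Shrink [low, high] around a minimum of objective; return the midpoint."""
    while high - low > tol:
        length = high - low
        # Step 2: the two probe points, using the formulas from the text.
        upper_probe = low + length / GOLDEN_RATIO   # NewSmallEndPoint formula
        lower_probe = high - length / GOLDEN_RATIO  # NewLargeEndPoint formula
        # Steps 3 and 4: evaluate both probes and keep the better side.
        if objective(upper_probe) > objective(lower_probe):
            high = upper_probe   # range becomes [old low, upper_probe]
        else:
            low = lower_probe    # range becomes [lower_probe, old high]
    # Step 6: midpoint of the final range.
    return (low + high) / 2

# Toy usage with a function whose minimum is known to be at 1.3.
best = golden_section_minimize(lambda lam: (lam - 1.3) ** 2)
```

This simple version evaluates the objective twice per iteration, as the steps describe; it assumes the objective has a single minimum inside the range.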

How is the standard deviation calculated?

1. For subgroup data, that is, when the subgroup size is greater than 1, the unbiased pooled standard deviation is estimated.
2. For individuals data, that is, when the subgroup size is 1, the standard deviation is estimated from the average moving range, using a moving range of length 2.
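The two estimates can be sketched as below. This is an illustration under stated assumptions, not Origin's exact computation: the pooled estimate here omits any unbiasing constant, and the moving-range estimate divides by the conventional control-chart constant $d_2 = 1.128$ for ranges of two points. A constant factor does not change which $\lambda$ gives the smallest standard deviation.

```python
import math
import statistics

D2_FOR_SPAN_2 = 1.128  # conventional constant for moving ranges of length 2 (assumed here)

def pooled_std(subgroups):
    """Pooled standard deviation across subgroups (no unbiasing constant applied)."""
    numerator = sum((len(g) - 1) * statistics.variance(g) for g in subgroups)
    degrees_of_freedom = sum(len(g) - 1 for g in subgroups)
    return math.sqrt(numerator / degrees_of_freedom)

def moving_range_std(values):
    """Estimate sigma from the average absolute difference of consecutive points."""
    ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    return statistics.mean(ranges) / D2_FOR_SPAN_2
```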

Origin also provides an option to round the optimal $\lambda$ to the nearest multiple of 0.5: after the optimal $\lambda$ is found, it is replaced by the closest value that is a multiple of 0.5.
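A one-line sketch of this rounding, assuming that exact ties are rounded upward (the tie rule is not stated here, and `round_to_half` is a hypothetical name):

```python
import math

def round_to_half(lam):
    """Round lam to the nearest multiple of 0.5; ties go upward (assumed)."""
    return math.floor(lam * 2 + 0.5) / 2
```

For instance, an optimal $\lambda$ of 0.37 would become 0.5, and -1.3 would become -1.5.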

## Johnson Transformation

The Johnson system consists of three families of distributions, SB, SL, and SU, in which the variable is bounded (SB), lognormal (SL), and unbounded (SU), respectively. The transformation functions for these three families are:

$Y'=\left\{ \begin{array}{ll} SB = \gamma + \eta \ln\frac{Y-\epsilon}{\lambda+\epsilon-Y} &\eta, \lambda > 0, -\infty < \gamma < \infty, -\infty < \epsilon < \infty, \epsilon < Y < \epsilon + \lambda\cr SL = \gamma + \eta \ln(Y-\epsilon) & \eta > 0, -\infty < \gamma < \infty, -\infty < \epsilon < \infty, \epsilon < Y\cr SU = \gamma + \eta \sinh^{-1}\frac{Y-\epsilon}{\lambda} &\eta, \lambda > 0, -\infty < \gamma < \infty, -\infty < \epsilon < \infty, -\infty < Y < \infty, \sinh^{-1}(x) = \ln(x+\sqrt{1+x^2}) \end{array} \right.$
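The three transformation functions can be written directly in Python, as sketched below for a single value. The function and parameter names are illustrative, with `lam` standing for $\lambda$; the domain checks mirror the constraints listed in the formula.

```python
import math

def johnson_sb(y, gamma, eta, epsilon, lam):
    """Bounded family: requires epsilon < y < epsilon + lam."""
    if not epsilon < y < epsilon + lam:
        raise ValueError("y must lie strictly between epsilon and epsilon + lam")
    return gamma + eta * math.log((y - epsilon) / (lam + epsilon - y))

def johnson_sl(y, gamma, eta, epsilon):
    """Lognormal family: requires y > epsilon."""
    if y <= epsilon:
        raise ValueError("y must be greater than epsilon")
    return gamma + eta * math.log(y - epsilon)

def johnson_su(y, gamma, eta, epsilon, lam):
    """Unbounded family: defined for every real y."""
    return gamma + eta * math.asinh((y - epsilon) / lam)
```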

The goal of this algorithm is to select the best transformation function from the three Johnson families. Here, "best" means:

1. When the Anderson-Darling test is performed on the transformed data, the corresponding p-value is the largest among all candidates.
2. That largest p-value is greater than the specified p-value criterion (default is 0.1).

The general flow for picking the best transformation function is:

1. Almost all the potential transformation functions from the above three Johnson families are considered as candidates.
2. For each candidate:
   1. Estimate the parameters by using the method described in Youn-Min Chou, Alan M. Polansky & Robert L. Mason (1998) Transforming Non-Normal Data to Normality in Statistical Process Control, Journal of Quality Technology, 30:2, 133-141, DOI: 10.1080/00224065.1998.11979832
   2. Transform the original data by the candidate function with the estimated parameters.
   3. Perform the Anderson-Darling test on the transformed data and get the p-value. (Note: the paper cited above uses the Shapiro-Wilk normality test instead.)
3. Pick the "best" transformation function according to the criteria above. If no candidate meets the criteria, no transformation is considered appropriate for the data.
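The selection step can be sketched as follows, assuming the candidates have already been fitted and applied (the parameter estimation from the cited paper is not reproduced here). The Anderson-Darling p-value uses the widely used D'Agostino and Stephens approximation for the case where mean and standard deviation are estimated from the data; whether Origin computes the p-value this way is an assumption. The names `anderson_darling_p` and `pick_best` are hypothetical.

```python
import math
import statistics

def anderson_darling_p(values):
    """Approximate p-value of the Anderson-Darling normality test."""
    n = len(values)
    dist = statistics.NormalDist(statistics.mean(values), statistics.stdev(values))
    cdf = [dist.cdf(x) for x in sorted(values)]
    total = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
                for i in range(n))
    a2 = -n - total / n
    a = a2 * (1 + 0.75 / n + 2.25 / n ** 2)  # small-sample adjustment
    if a >= 0.6:
        return math.exp(1.2937 - 5.709 * a + 0.0186 * a ** 2)
    if a >= 0.34:
        return math.exp(0.9177 - 4.279 * a - 1.38 * a ** 2)
    if a >= 0.2:
        return 1 - math.exp(-8.318 + 42.796 * a - 59.938 * a ** 2)
    return 1 - math.exp(-13.436 + 101.14 * a - 223.73 * a ** 2)

def pick_best(candidates, p_criterion=0.1):
    """candidates maps a label to its transformed data; return (label, p) or None."""
    scored = {name: anderson_darling_p(data) for name, data in candidates.items()}
    best_name = max(scored, key=scored.get)
    if scored[best_name] > p_criterion:
        return best_name, scored[best_name]
    return None  # no candidate meets the criterion
```

Returning `None` corresponds to the case where no transformation is considered appropriate.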

## Yeo-Johnson Transformation

The Yeo-Johnson transformation is another kind of power transformation. Unlike the Box-Cox transformation, it works for any data: positive, negative, and zero. The transformed value is given by:

$Y'=\left\{ \begin{array}{ll} \frac{(Y+1)^\lambda - 1}{\lambda} &\lambda \neq 0,Y\ge0\cr \ln(Y+1) &\lambda = 0,Y\ge0\cr -\frac{(-Y+1)^{2-\lambda}-1}{2-\lambda} &\lambda \neq 2,Y < 0\cr -\ln(-Y+1) & \lambda = 2,Y < 0 \end{array} \right.$

Here $\lambda$ is also restricted to the range $[-5, 5]$.
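The four cases of the formula map onto a short Python function, sketched here for a single value. The name `yeo_johnson` is illustrative.

```python
import math

def yeo_johnson(y, lam):
    """Apply the Yeo-Johnson transformation to one value y."""
    if y >= 0:
        if lam == 0:
            return math.log(y + 1)
        return ((y + 1) ** lam - 1) / lam
    # Negative values use the exponent 2 - lam on the reflected value.
    if lam == 2:
        return -math.log(-y + 1)
    return -((-y + 1) ** (2 - lam) - 1) / (2 - lam)
```

With $\lambda = 1$ both branches reduce to the identity, so `yeo_johnson(3, 1)` returns `3.0` and `yeo_johnson(-3, 1)` returns `-3.0`.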

### Optimal $\lambda$

Origin estimates the optimal $\lambda$ with the same algorithm as for the Box-Cox transformation. However, that algorithm needs the geometric mean of the original data, which cannot be calculated if the data contain negative values or zeros. To make the optimization work for non-positive data, a positive value is added to the original data so that all values become positive before the algorithm is applied.

For more details about the optimization, refer to the Optimal $\lambda$ section under Box-Cox Transformation.