or testing of the product were noted. Even if such statistical tests indicate that one or more values are outliers, they should still be retained in the record. Including or excluding outliers in calculations to assess conformance to acceptance criteria should be based on scientific judgment and the internal policies of the manufacturer. It is often useful to perform the calculations with and without the outliers to evaluate their impact.
总之,拒绝或者保留一个明显的异常值都会导致明显偏倚。(异常值)检验(方法)的特性以及对生产过程和分析方法的科学理解都必须在确定这个异常值的来源时予以考虑。一个异常值的检验永远不能代替全面的实验室调查分析。实际上,只有在调查分析中无法找出确切原因,也没有发现在产品生产和检测中存在偏离时才能使用异常值检验。即使这样的统计学检验显示有一个或者多个数据是异常值,也仍要将它们保留在原始记录中。在评估标准符合性的计算过程中,保留或排除这些异常值都应该基于科学判断和生产商内部政策。在时,使用包含异常值和不包含异常值分别计算的方法对于评价异常值的影响通常是有用的。
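The suggestion to run the calculations with and without the outliers can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function names, the data, and the choice of mean and standard deviation as summary statistics are all assumptions made for the example.

```python
from statistics import mean, stdev

def summarize(values):
    """Return (mean, sample standard deviation) of a list of results."""
    return mean(values), stdev(values)

def with_and_without(values, suspect_indices):
    """Summarize the data with all values and with the suspect values excluded.

    The input list is never modified, so the suspect values stay in the record.
    """
    suspects = set(suspect_indices)
    kept = [v for i, v in enumerate(values) if i not in suspects]
    return {"all": summarize(values), "excluding_suspects": summarize(kept)}

# Hypothetical assay results (% of label claim); the last value is the suspect one.
results = [99.8, 100.1, 100.3, 99.9, 100.2, 95.0]
report = with_and_without(results, suspect_indices=[5])
for label, (avg, sd) in report.items():
    print(f"{label}: mean = {avg:.2f}, SD = {sd:.2f}")
```

Comparing the two summaries side by side shows how much a single suspect value moves the mean and the standard deviation. Whether to exclude it remains a matter of scientific judgment and manufacturer policy.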
Outliers that are attributed to measurement process mistakes should be reported (i.e., footnoted), but not included in further statistical calculations. When assessing conformance to a particular acceptance criterion, it is important to define whether the reportable result (the result that is compared to the limits) is an average value, an individual measurement, or something else. If, for example, the acceptance criterion was derived for an average, then it would not be statistically appropriate to require individual measurements to also satisfy the criterion because the variability associated with the average of a series of measurements is smaller than that of any individual measurement.
对于那些测量过程错误导致的异常值都需要进行记录(如使用脚注),但是不用将其包含在接下来的计算中。当评价是否符合某一特定接受标准时,非常重要的一件事是确定需报告的结果(即与限值比较的结果)是均值、单次测量值,还是其他的值。比如,如果接受标准是来自于均值,那么要求单个测量值也满足这个标准在统计学意义上就是不适当的,因为一系列测量均值的变异性要小于任何一个单独测量值的变异性。
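The statement that an average varies less than an individual measurement can be made explicit. The symbols here are assumptions introduced for illustration: n independent measurements, each with the same standard deviation σ, and their average X̄.

```latex
\operatorname{Var}(\bar{X}) = \frac{\sigma^{2}}{n},
\qquad
\operatorname{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}
```

For example, under these assumptions the average of n = 4 measurements has half the standard deviation of a single measurement, so a limit derived for that average would be too tight if applied to each individual value.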
COMPARISON OF ANALYTICAL PROCEDURES
分析方法的比较
It is often necessary to compare two procedures to determine if their average results or their variabilities differ by an amount that is deemed important. The goal of a procedure comparison experiment is to generate adequate data to evaluate the equivalency of the two procedures over a range of values. Some of the considerations to be made when performing such comparisons are discussed in this section.
我们经常需要比较两种(分析)方法以确定它们的平均结果或变异性是否存在重要差异。方法比较实验的目的是获得足够的数据,以便评价在一定范围内两种方法的等效性。下面的内容给出了在进行这种比较时应该做出的考虑。
Precision 精密度
Precision is the degree of agreement among individual test results when the analytical procedure is applied repeatedly to a homogeneous sample. For an alternative procedure to be considered to have “comparable” precision to that of a current procedure, its precision (see Analytical Performance Characteristics in <1225>, Validation) must not be worse than that of the current procedure by an amount deemed important. A decrease in precision (or increase in variability) can lead to an increase in the number of results expected to fail required specifications. On the other hand, an alternative procedure providing improved precision is acceptable.
精密度是指使用分析方法对均质样本进行重复测定时,各实验结果一致的程度。因为一个替代方法应当被认为具有与现行方法“相似”的精密度,其精密度(参见<1225>中分析性能属性,确认)与现有方法相比必须不能存在明显的差异。精密度的下降(或者说变异的增大)可导致不符合规定质量标准的实验结果数量增加。另一方面,体现出更佳精密度的替代方法是可以接受的。
One way of comparing the precision of two procedures is by estimating the variance for each procedure (the sample variance, s², is the square of the sample standard deviation) and calculating a one-sided upper confidence interval for the ratio of (true) variances, where the ratio is defined as the variance of the alternative procedure to that of the current procedure. An example, with this assumption, is outlined in Appendix D. The one-sided upper confidence limit should be compared to an upper limit deemed acceptable, a priori, by the analytical laboratory. If the one-sided upper confidence limit is less than this upper acceptable limit, then the precision of the alternative procedure is considered acceptable in the sense that the use of the alternative procedure will not lead to an important loss in precision. Note that if the one-sided upper confidence limit is less than one, then the alternative procedure has been shown to have improved precision relative to the current procedure.
比较两种方法精密度的一种方式是通过评价每种方法的方差(样本方差s2即是样本标准偏差的平方),并计算替代方法与现用方法的(真)方差比值的单侧置信上限(one-sided upper confidence limit)。附录D给出了这种假设的一个具体实例。理所当然的,该单侧置信上限应该与分析实验室确定的可接受上限进行比较。如果所计算的单侧置信上限低于可接受上限,该替代方法的精密度就被认为可以接受,即认为使用该替代方法不会导致重要的精密度损失。应该注意的是,如果计算所得的单侧置信上限小于1,那么替代方法已经显示出比原使用方法的精密度高的结论。
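The one-sided upper confidence limit for the variance ratio can be sketched as follows. This is a minimal illustration under stated assumptions: both data sets are independent samples from normal distributions, the function names and data are hypothetical, and the F quantile is passed in as a table lookup rather than computed. Note the order of the degrees of freedom for the table value: (current − 1, alternative − 1).

```python
from statistics import variance

def variance_ratio_upper_limit(alternative, current, f_upper_quantile):
    """One-sided upper confidence limit for var(alternative) / var(current).

    f_upper_quantile is the tabulated upper quantile of the F distribution
    (for example the 0.95 quantile for 95% confidence) with
    len(current) - 1 numerator and len(alternative) - 1 denominator
    degrees of freedom.
    """
    ratio = variance(alternative) / variance(current)
    return ratio * f_upper_quantile

def precision_acceptable(upper_limit, acceptable_limit):
    """True if the upper confidence limit is below the a priori acceptable limit."""
    return upper_limit < acceptable_limit

# Hypothetical results from ten determinations by each procedure.
current = [100.2, 99.7, 100.5, 99.9, 100.1, 99.6, 100.4, 100.0, 99.8, 100.3]
alternative = [100.1, 99.8, 100.4, 100.0, 99.7, 100.3, 99.9, 100.2, 100.0, 99.9]

F_095_9_9 = 3.18         # table value: 0.95 quantile of F with (9, 9) degrees of freedom
ACCEPTABLE_LIMIT = 4.0   # hypothetical limit chosen a priori by the laboratory

upper = variance_ratio_upper_limit(alternative, current, F_095_9_9)
print(f"upper confidence limit for the variance ratio: {upper:.2f}")
print("precision acceptable:", precision_acceptable(upper, ACCEPTABLE_LIMIT))
```

An upper limit below 1 would indicate that the alternative procedure has been shown to be more precise than the current one.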
The confidence interval method just described is preferred to applying the two-sample F-test to test the statistical significance of the ratio of variances. To perform the two-sample F-test, the calculated ratio of sample variances would be compared to a critical value based on tabulated values of the F distribution for the desired level of confidence and the number of degrees of freedom for each variance. Tables providing F-values are available in most standard statistical textbooks. If the calculated ratio exceeds this critical value, a statistically significant difference in precision is said to exist between the two procedures. However, if the calculated ratio is less than the critical value, this does not prove that the procedures have the same or equivalent level of precision; but rather that there was not enough evidence to prove that a statistically significant difference did, in fact, exist.
上述置信区间方法优于采用两样本F检验来判断方差比值统计学显著性的做法。要进行两样本的F检验,需要将样本方差比与临界值进行比较,临界值可以根据预期的置信度和每个方差的自由度在F分布表中查出。大部分的统计书籍都提供这样的F值表。如果所计算的比值超过临界值,则认为两种方法的精密度在统计学上存在显著差异。但如果所计算的比值小于临界值,并非证明两种方法具有相同或等效水平的精密度,而只能认为没有足够的证据证明两者之间在统计学上有显著差异。
Accuracy 准确度
Comparison of the accuracy (see Analytical Performance Characteristics in <1225>, Validation) of procedures provides information useful in determining if the new procedure is equivalent, on the average, to the current procedure. A simple method for making this comparison is by calculating a confidence interval for the difference in true means, where the difference is estimated by the sample mean of the alternative procedure minus that of the current procedure.
一般认为,方法间准确度(参见<1225>中分析性能属性,确认)的比较,在确定新方法在平均水平上是否与现有方法等效方面可提供非常有用的信息。一个进行比较的简单方法就是计算真实均值之差异的置信区间,这里,该差异是通过替代方法测得结果的均值减去现用方法的结果均值进行评估的。
The confidence interval should be compared to a lower and upper range deemed acceptable, a priori, by the laboratory. If the confidence interval falls entirely within this acceptable range, then the two procedures can be considered equivalent, in the sense that the average difference between them is not of practical concern. The lower and upper limits of the confidence interval only show how large the true difference between the two procedures may be, not whether this difference is considered tolerable. Such an assessment can be made only within the appropriate scientific context. This approach is often referred to as TOST (two one-sided tests; see Appendix F).
理所当然的,计算所得的置信区间应该与实验室确定的置信上限和下限进行比较。如果置信区间完全落在其确定的可接受置信上下限内,那么可以认为两种方法是等效的;即认为两种方法的均值没有实际差异。该置信区间的上下限仅显示两种方法的真值差异有多大,而不是说明这种差异是否可以被容忍。对于是否可以容忍这种差异的评估只有在科学的背景下才能进行。这种方法一般被称为TOST(双单侧检验;参见附录F)。
The confidence interval method just described is preferred to the practice of applying a t-test to test the statistical significance of the difference in averages. One way to perform the t-test is to calculate the confidence interval and to examine whether or not it contains the value zero. The two procedures have a statistically significant difference in averages if the confidence interval excludes zero. A statistically significant difference may not be large enough to have practical importance to the laboratory because it may have arisen as a result of highly precise data or a larger sample size. On the other hand, it is possible that no statistically significant difference is found, which happens when the confidence interval includes zero, and yet an important practical difference cannot be ruled out. This might occur, for example, if the data are highly variable or the sample size is too small. Thus, while the outcome of the t-test indicates whether or not a statistically significant difference has been observed, it is not informative with regard to the presence or absence of a difference of practical importance.
上述置信区间方法优于使用t检验来检验两均值差异统计显著性的做法。进行t检验的一种方式是先计算其置信区间,然后检查其是否包含0值。当该置信区间不包括0值时,说明两方法的均值差有显著差异。但是,统计学上的显著差异对于实验室并不一定有多么重要的实际意义,因为差异的增大可能来自于高精密度的数据或者大样本量的数据。另一方面,当置信区间包括0值时,也会出现虽然结果显示两者无统计显著性差异,但也并不能排除存在具有重要实际意义的差异。比如,当数据具有较大变异性或者样本量太小时,这种情况常会发生。所以,不论t检验的结果是否显示有显著性差异,都不能充分证明是否存在有实际重要意义的差异。
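The contrast between the equivalence check and the t-test can be sketched with one confidence interval used two ways. This is a minimal illustration under stated assumptions: independent samples with a common variance (a pooled estimate is used, which the text does not prescribe), hypothetical function names and data, and a t quantile supplied from a table. Using the 0.95 quantile gives a 90% two-sided interval, which corresponds to two one-sided tests at the 5% level each.

```python
from math import sqrt
from statistics import mean, variance

def mean_difference_ci(alternative, current, t_quantile):
    """Confidence interval for mean(alternative) - mean(current).

    Assumes independent samples with a common variance (pooled estimate).
    t_quantile is the tabulated Student t quantile with
    len(alternative) + len(current) - 2 degrees of freedom.
    """
    n_a, n_c = len(alternative), len(current)
    pooled_var = ((n_a - 1) * variance(alternative)
                  + (n_c - 1) * variance(current)) / (n_a + n_c - 2)
    half_width = t_quantile * sqrt(pooled_var * (1 / n_a + 1 / n_c))
    diff = mean(alternative) - mean(current)
    return diff - half_width, diff + half_width

def is_equivalent(ci, lower_limit, upper_limit):
    """Equivalence: the whole interval lies inside the acceptable range."""
    low, high = ci
    return lower_limit < low and high < upper_limit

def excludes_zero(ci):
    """Statistical significance: the interval does not contain zero."""
    low, high = ci
    return low > 0 or high < 0

# Hypothetical results from ten determinations by each procedure.
current = [100.2, 99.7, 100.5, 99.9, 100.1, 99.6, 100.4, 100.0, 99.8, 100.3]
alternative = [100.4, 100.1, 100.7, 100.3, 100.0, 100.6, 100.2, 100.5, 100.3, 100.2]

T_095_18 = 1.734  # table value: 0.95 quantile of Student t with 18 degrees of freedom
ci = mean_difference_ci(alternative, current, T_095_18)
print(f"90% CI for the mean difference: ({ci[0]:.3f}, {ci[1]:.3f})")
print("equivalent within +/-1.0 (hypothetical limits):", is_equivalent(ci, -1.0, 1.0))
print("statistically significant difference:", excludes_zero(ci))
```

The two checks answer different questions, so an interval can exclude zero and still lie entirely within the acceptable range, or include zero and still extend beyond it.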
Determination of Sample Size
样本量计算
Sample size determination is based on the comparison of the accuracy and precision of the two procedures³ and is similar to that for testing hypotheses about average differences in the former case and variance ratios in the latter case, but the meaning of some of the input is different. The first component to be specified is δ, the largest acceptable difference between the two procedures that, if achieved, still leads to the conclusion of equivalence. That is, if the two procedures differ by no more than δ, on the average, they are considered acceptably similar. The comparison can be two-sided as just expressed, considering a difference of δ in either direction, as would be used when comparing means. Alternatively, it can be one-sided as in the case of comparing variances where a decrease in variability is acceptable and equivalency is concluded if the ratio of the variances (new/current, as a proportion) is not more than 1.0 + δ. A researcher will need to state δ based on knowledge of the current procedure and/or its use, or it may be calculated. One consideration, when there are specifications to satisfy, is that the new procedure should not differ by so much from the current procedure as to risk generating out-of-specification results. One then chooses δ to have a low likelihood of this happening by, for example, comparing the distribution of data for the current procedure to the specification limits. This could be done graphically or by using a tolerance interval, an example of which is given in Appendix E. In general, the choice for δ must depend on the scientific requirements of the laboratory.
根据两种方法进行准确性和精密度比较的需要来确定样本量3,在准确性比较时样本量类似于均值差异检验假设所需,在精密度比较时样本量类似于方差差异检验假设所需,但是计算样本量时所需的一些输入参量的意义是不同的。第一个所需参量是δ,它代表两种方法最大可接受的差异,如果满足条件就可以给出等效性结论。如果两种方法的差异小于δ,一般认为两者等效。考虑到在两个方向上δ的差异,方法的比较可以选择均值比较时所使用的双侧检验。或者,如果可以接受变异性降低,在比较方差时也可以选择单侧比较,并且如果方差比值(新方法方差/现行方法方差的比值)不大于1.0 +δ,新方法和现行方法就被认为是等效的。研究人员需要根据现行方法和/或其应用等的相关知识来规定δ值,或者计算δ值。当合规性检测时,其中的一项考虑就是新方法不应与现行方法出现较大差异,以导致出现超标结果(OOS)的风险。这时人们应该通过选择δ值来降低发
³ In general, the sample size required to compare the precision of two procedures will be greater than that required to compare the accuracy of the procedures.
通常用来两种方法精密度比较所需的样本量应该大于准确度比较所需。