Time to Revisit Endpoint Selection in Clinical Trials

In clinical trials, power calculation is often performed based on a single primary endpoint to determine the sample size required for achieving the study objective with a desired power at a pre-specified level of significance. In practice, power calculation based on a single primary endpoint has been criticized for several reasons. First, it is unclear how the single primary endpoint should be selected from a group of candidate primary endpoints. Second, a single primary endpoint may not be sufficient to adequately inform complex cohorts, the disease status, and/or the treatment effect of the test treatment under investigation. Third, different study endpoints with different data types (e.g., continuous versus binary response) may result in different sample sizes. In addition, with a given sample size, some (single) endpoints may achieve the study objective while others fail to do so. In this opinion article, we propose a conceptual innovation: the development of a therapeutic index that fully utilizes information from all relevant study endpoints.


Introduction
In clinical trials, power analysis for sample size calculation (power calculation) is often performed based on a single primary study endpoint, a co-primary endpoint, or a composite endpoint for determining a sample size required for achieving the study objective with a desired power at a pre-specified level of significance.
Thus, the selection of the study endpoint for power calculation plays an important role in the success of the intended clinical trial. Different study endpoints with different data types, such as continuous, binary response, or time-to-event data, will lead to different sample size requirements for achieving the study objective with a desired power at a pre-specified level of significance. In other words, with a given sample size, we may achieve the study objective with some endpoints but not with others, thereby underestimating the potential value of some endpoints.
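To illustrate how the data type of the endpoint drives the required sample size, the following minimal sketch compares a continuous endpoint with a binary endpoint. It assumes the standard normal-approximation formulas for a two-sided, two-sample test with equal allocation. The function names and all numerical inputs (mean difference, standard deviation, response rates) are hypothetical and are not taken from any trial discussed in this article.

```python
import math
from statistics import NormalDist


def n_per_group_continuous(delta, sigma, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two means (normal approximation).

    delta: assumed true mean difference; sigma: assumed common standard deviation.
    """
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    return math.ceil(2 * (z_sum * sigma / delta) ** 2)


def n_per_group_binary(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two response rates (normal approximation).

    p1, p2: assumed true response rates in the two groups.
    """
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(z_sum ** 2 * var_sum / (p1 - p2) ** 2)


# Hypothetical inputs: a standardized mean difference of 0.5 for the continuous
# endpoint, and response rates of 30% versus 50% for the binary endpoint.
n_cont = n_per_group_continuous(delta=0.5, sigma=1.0)
n_bin = n_per_group_binary(p1=0.30, p2=0.50)
print(n_cont, n_bin)
```

Under these assumed inputs, the continuous endpoint requires 63 subjects per group and the binary endpoint requires 91, so a trial sized for the former would be underpowered for the latter.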
In practice, for the sake of (statistical) convenience, a single primary endpoint is often selected for power calculation. This approach, however, has been criticized by many authors [1] because a single primary endpoint can only partially inform the disease status and/or treatment effect of the test treatment under investigation and cannot provide a complete clinical picture regarding its safety and effectiveness. Besides, the selected single primary endpoint may be highly related to other endpoints that are not selected as the primary endpoint for the intended trial. These endpoints carry more or less valuable information regarding the safety and effectiveness of the test treatment under investigation. In practice, it is well recognized that these endpoints may not translate to one another. In addition, it is unclear which endpoint reveals "the truth" regarding the safety and effectiveness of the test treatment under investigation. Thus, there is a risk that the selected primary endpoint will not accurately reflect the disease status and the treatment effect of the test treatment under study.
Consequently, regulatory decisions may be made based on a single, biased primary endpoint and hence may be misleading. As a result, we may put patients at greater risk or withhold potentially beneficial interventions because of the inherent flaw of single-endpoint selection. (Table 1).
As can be seen from Table 1, a total of 57 submissions were approved by the FDA between 1990 and 2002. Among the 57 applications, 18 were approved based on a survival endpoint alone, while 18 were approved based on RR and/or TTP alone. About 9 submissions were approved based on RR plus tumor-related signs and symptoms (co-primary endpoints). Table 1 indicates that none of the study endpoints is superior to the others in these regulatory submissions. More recently, Zhou et al. [5] provided a list of oncology and hematology drugs approved by the FDA between 2008 and 2016. Similar results were observed. Neither Williams et al. [4] nor Zhou et al. [5] indicates which study endpoint (a single endpoint, a co-primary endpoint, or a composite endpoint of multiple endpoints) should be used for evaluation and regulatory approval of the drug product under investigation. In practice, it is a concern that these endpoints may not translate to one another, and it is not clear which endpoint can best inform the disease status and/or therapeutic effect of the test treatment under investigation.
Suppose that the commonly considered study endpoint in can- [...] Some endpoints may be more efficient than others. Moreover, different study endpoints may not translate to one another, although they may be highly correlated with one another. It should be noted that different endpoints may result in different sample sizes required for achieving the study objective with a desired power at the 5% level of significance.
In practice, the traditional approach of using a single primary endpoint or co-primary endpoints has been criticized not only because it is unclear whether the selected endpoint is the most accurate endpoint for the stated goals of informing disease status and/or measuring treatment effect, but also because the selected endpoint does not fully utilize the information collected from all related study endpoints. To overcome these problems, alternatively, Filozof et al. [...]

Development of Therapeutic Index
Subsequent to the proposal of Filozof et al. [3], Chow and Huang considered a therapeutic index of the general form

$$TI_i = f_i(\omega_i, e), \quad i = 1, \ldots, K,$$

where $e = (e_1, e_2, \ldots, e_J)'$ is the vector of study endpoints, $\omega_i = (\omega_{i1}, \omega_{i2}, \ldots, \omega_{iJ})'$ is a vector of weights with $\omega_{ij}$ being the weight for $e_j$ with respect to index $TI_i$, and $f_i(\cdot)$ is a utility function (linear or nonlinear) for constructing the therapeutic index $TI_i$ based on $\omega_i$ and $e$. Generally, $e_j$ can be of different data types (e.g., continuous, binary, or time-to-event), and $\omega_{ij}$ is pre-specified (or calculated based on pre-specified criteria); different choices of weights may consequently lead to a different therapeutic index $TI_i$.
Moreover, the utility function typically generates a vector of indices $(TI_1, TI_2, \ldots, TI_K)'$; if $K = 1$, it reduces to a single (composite) index.
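As a minimal sketch of how such an index could be computed, the code below assumes a linear utility function, i.e., a weighted sum of endpoints that have already been placed on a common scale (e.g., standard scores). The linear form, the function names, the weights, and the endpoint values are all illustrative assumptions; a nonlinear utility can be supplied through the hypothetical `utility` argument.

```python
def therapeutic_index(weights, endpoints, utility=None):
    """Compute one index TI_i from a weight vector omega_i and endpoint vector e.

    If no utility function is given, a linear utility (weighted sum) is assumed.
    """
    if len(weights) != len(endpoints):
        raise ValueError("weights and endpoints must have the same length")
    if utility is None:
        return sum(w * e for w, e in zip(weights, endpoints))
    return utility(weights, endpoints)


def therapeutic_index_vector(weight_matrix, endpoints, utility=None):
    """Compute (TI_1, ..., TI_K) from K weight vectors; K = 1 gives a single index."""
    return [therapeutic_index(w, endpoints, utility) for w in weight_matrix]


# Hypothetical standardized endpoint values e = (e_1, e_2, e_3).
e = [0.8, -0.2, 1.5]

# Hypothetical weights: the first index blends all endpoints; the second puts
# all weight on e_1 and therefore reproduces a single-endpoint analysis.
W = [
    [0.5, 0.3, 0.2],
    [1.0, 0.0, 0.0],
]

ti = therapeutic_index_vector(W, e)
print(ti)
```

The second row of weights shows that the conventional single primary endpoint is a special case of the index in which one endpoint receives all of the weight.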
As an example, consider

Practical and Challenging Issues
The development of a therapeutic index sounds reasonable and scientifically justifiable. However, several challenging issues have been raised, which are briefly described below.

Study Endpoints with Different Data Types
Another challenge is that the multiple endpoints may be of different data types, such as continuous, binary response, or time-to-event data. In order to study the statistical properties of the developed therapeutic index [...] Without loss of generality, the treatment effect $\theta_j$ assessed by endpoint $e_j$ is tested by the following hypotheses:

$$H_{0j}: \theta_j \le \delta_j \quad \text{versus} \quad H_{aj}: \theta_j > \delta_j, \qquad (3)$$

where $\delta_j$, $j = 1, \ldots, J$, are pre-specified margins. Under some appropriate assumptions, we can calculate the p-value $p_j$ for each $H_{0j}$ based on the sample of $e_j$, and the weights $\omega_i$ can be constructed based on $p = (p_1, \ldots, p_J)'$. That is, $\omega_i = \omega_i(p)$, which is reasonable since each p-value indicates the significance of the treatment effect based on its corresponding endpoint.
Thus, it is possible to use all the information available to construct an effective therapeutic index.
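One possible way to turn endpoint-specific p-values into weights is sketched below. Two assumptions are made purely for illustration and should not be read as the construction intended above: (i) each p-value comes from a one-sided z-test of the treatment effect against its margin, and (ii) the weight of an endpoint is proportional to the negative logarithm of its p-value, normalized so that the weights sum to one. The function names and the numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist


def one_sided_p_value(estimate, std_error, margin=0.0):
    """p-value of a one-sided z-test that the effect exceeds the margin (assumed test)."""
    z = (estimate - margin) / std_error
    return 1.0 - NormalDist().cdf(z)


def weights_from_p_values(p_values, floor=1e-12):
    """Map p-values to weights proportional to -log(p), normalized to sum to one.

    This weighting rule is an illustrative assumption; smaller p-values
    receive larger weights. The floor avoids taking log(0).
    """
    scores = [-math.log(max(p, floor)) for p in p_values]
    total = sum(scores)
    if total == 0:
        # All p-values equal 1: no endpoint is favoured, so use equal weights.
        return [1.0 / len(p_values)] * len(p_values)
    return [s / total for s in scores]


# Hypothetical effect estimates, standard errors, and margins for J = 3 endpoints.
estimates = [0.45, 0.20, 0.05]
std_errors = [0.15, 0.12, 0.10]
margins = [0.0, 0.0, 0.0]

p = [one_sided_p_value(est, se, d) for est, se, d in zip(estimates, std_errors, margins)]
omega = weights_from_p_values(p)
print(p, omega)
```

Because the weights are computed from the same data that will enter the index, such a data-driven rule would need to be pre-specified in the protocol, consistent with the requirement that the weights be pre-specified or calculated based on pre-specified criteria.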

Criteria for Evaluation of the Therapeutic Index
Although $e_j$ can be of different data types, without loss of generality and for illustration purposes, we assume they are of the same type at this step (e.g., after the study endpoints have been converted to standard scores). On one hand, we would like to investigate the predictability of $TI_i$ given that $e_j$ can inform the disease status (or drug effect). On the other hand, we are also interested in the predictability of $e_j$ given that $TI_i$ is informative. In particular, we may consider the following two conditional probabilities as criteria for statistical evaluation of the developed therapeutic index:

$$P_{1ij} = \Pr(TI_i \text{ is informative} \mid e_j \text{ is informative})$$

and

$$P_{2ij} = \Pr(e_j \text{ is informative} \mid TI_i \text{ is informative}),$$

for $i = 1, \ldots, K$ and $j = 1, \ldots, J$. Intuitively, we would expect $P_{1ij}$ to be relatively large given that $e_j$ is informative, since $TI_i$ is a function of $e_j$, especially when a relatively high weight is assigned to $e_j$. On the other hand, $P_{2ij}$ could be small even if $TI_i$ is predictive, since the information contained in $TI_i$ may be attributed to another endpoint $e_{j'}$ rather than $e_j$.
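The behaviour of these two criteria can be explored by simulation. The sketch below is a hedged illustration under several assumptions that are not part of the article: endpoints are represented by independent, normally distributed standardized test statistics; the index is a linear combination rescaled to unit variance; and an endpoint or index is called "informative" when its statistic exceeds a one-sided critical value. The function name, weights, and effect sizes are hypothetical.

```python
import math
import random


def simulate_conditional_probs(weights, effects, j, n_sim=20000, crit=1.645, seed=0):
    """Monte Carlo estimates of (P1, P2) for endpoint j under the stated assumptions.

    P1 estimates Pr(index informative | endpoint j informative).
    P2 estimates Pr(endpoint j informative | index informative).
    """
    rng = random.Random(seed)
    # Rescale the linear index so that it has unit variance under independence.
    norm = math.sqrt(sum(w * w for w in weights))
    n_e = n_ti = n_both = 0
    for _ in range(n_sim):
        z = [rng.gauss(mu, 1.0) for mu in effects]
        ti = sum(w * zj for w, zj in zip(weights, z)) / norm
        e_informative = z[j] > crit
        ti_informative = ti > crit
        n_e += e_informative
        n_ti += ti_informative
        n_both += e_informative and ti_informative
    p1 = n_both / n_e if n_e else float("nan")
    p2 = n_both / n_ti if n_ti else float("nan")
    return p1, p2


# Hypothetical scenario: the first endpoint has no true effect and a low weight,
# while the second endpoint has a large effect and carries most of the weight.
p1, p2 = simulate_conditional_probs(weights=[0.1, 0.9], effects=[0.0, 3.0], j=0)
print(p1, p2)
```

In this scenario the index is informative mainly because of the second endpoint, so the estimate of the second probability for the first endpoint is small, which matches the intuition that the information in the index may be attributable to a different endpoint.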

Concluding Remarks
About 32% (18 out of 57) of oncology regulatory submissions were approved based on a survival endpoint. Among these 32% of regulatory submissions, it is not clear what percentage were approved based on progression-free survival (PFS). In recent years, the FDA has appeared to focus on approving regulatory submissions based on PFS rather than overall survival.
As discussed here, however, the use of PFS alone for evaluation of the safety and effectiveness of oncologic drug products, especially immunotherapy cancer drug products, may be neither appropriate nor clinically or statistically justifiable. Thus, in pharmaceutical/clinical research and development of a drug product with multiple endpoints, we propose that a therapeutic index, which incorporates all relevant endpoints, be developed whenever possible for a more accurate and reliable assessment of the test treatment under investigation.