Lost in Translation? From Conventional Scoring Tools to Modern Data-Driven Risk Assessment in Critical Care Medicine

High-resolution, longitudinal health data have become widely available in intensive care units in recent years. Patient risk assessment, however, is still primarily based on conventional scores that take into account only a few parameters recorded at single time points, which frequently leads to inaccurate predictions in clinical practice. Likewise, the contribution of AI approaches remains sparse, as current machine-learning models are inherently difficult to interpret, and even impressive results rarely contribute to disease understanding. This review focuses on the limitations of conventional risk scores, and on recent developments and limitations of modern data-driven approaches to risk assessment in critical care medicine.


Introduction
Intensive care medicine is a fast-paced medical field where treatment decisions are often made ad hoc and without full knowledge of the disease extent or patient background. In general, there is a slim margin for error, and inaccurate or late treatment decisions have severe consequences and might even be fatal [1].
Therefore, a quick but precise assessment of the current patient status is crucial to reduce morbidity and mortality [2].
In current medical practice, the evaluation of short-term disease progression is primarily based on the clinical judgement of the treating physician [3,4]. A study by Fleig and colleagues demonstrated that physicians' decisions are highly subjective and often do not predict disease progression and outcome reliably [5]. Factors such as the age, experience, attitude and religious as well as ethical beliefs of the physician influence their judgement [6-8]. Even under normal clinical conditions, health care professionals can cognitively assess only a limited amount of the continuous data stream offered by the monitors and machines of an intensive care set-up. Trends are therefore often recognized too late to allow targeted intervention.
In the case of a crisis, the quality of human judgement might be further compromised by exhaustion, fear, the lack of current knowledge and the scarcity of normally available resources [9]. Given the combination of high interpatient variability and the fast pace of critical care, standardized approaches are needed to classify patients into risk and progression subgroups that allow reliable assessment and prognosis.
In conventional risk assessment, a patient's score is compared to a historical patient group, often leading to an under- or overestimation of the current mortality risk, especially when the current case differs from the case mix of the original cohorts. For example, elderly patients (>80 years of age, a high-risk group for COVID-19) are severely underrepresented in the cohorts of most established scores [14], which substantially compromises the accuracy of the prognosis in this patient group.
It is therefore not surprising that both the SAPS II and the APACHE II score show a significant Hosmer-Lemeshow goodness-of-fit test (p < 0.001), indicating poor mortality prediction performance, even though both models achieved moderately good AUCs in ROC analyses [14,15]. Another major limitation of existing scores is their lack of diagnostic value over time. Most intensive care risk assessment scores were developed as a one-time assessment within the first 24 hours of admission to the ICU and are not suitable for daily assessment. Nevertheless, these scores are used for repeated assessments due to a lack of alternatives. Moreover, scores do not take into account changes in medical conditions or medical resources, and conventional prognostic scores can only predict the acute mortality risk at admission; they do not consider further endpoints, such as quality of life after the ICU stay. Furthermore, most general risk assessment scores do not differentiate between the patient subpopulations that form due to the increasing modernization and specialization of medical fields. They perform poorly for specific patient groups such as burn victims or cardiac surgery patients, but also for newly spreading diseases such as COVID-19 [16]. Especially in pandemic situations, this lack of adaptability is fatal, as it undermines the reliability of triage and resource allocation.
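The distinction between discrimination (AUC) and calibration (Hosmer-Lemeshow) made above is worth making concrete. The following Python sketch (purely illustrative, not taken from the cited studies) shows that a monotone distortion of predicted risks leaves the AUC unchanged, because the ranking of patients is preserved, while calibration can still be arbitrarily poor:

```python
def auc(y, p):
    """ROC AUC via the Mann-Whitney formulation: the fraction of
    positive/negative pairs whose risks are ranked correctly."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if pi > ni else 0.5 if pi == ni else 0.0
               for pi in pos for ni in neg)
    return wins / (len(pos) * len(neg))

def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-square statistic over risk groups,
    comparing observed vs. expected events per group. The value is
    judged against a chi-square distribution with groups-2 degrees
    of freedom; large values indicate poor calibration."""
    pairs = sorted(zip(p, y))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        expected = sum(pi for pi, _ in chunk)   # expected events
        observed = sum(yi for _, yi in chunk)   # observed events
        for obs, exp in ((observed, expected),
                         (len(chunk) - observed, len(chunk) - expected)):
            stat += (obs - exp) ** 2 / max(exp, 1e-9)
    return stat

# A monotone distortion (here: sqrt) changes calibration but not ranking:
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auc(y, p))                        # 0.75
print(auc(y, [pi ** 0.5 for pi in p]))  # still 0.75
```

This is exactly the pattern reported for SAPS II and APACHE II: a usable AUC combined with a significant Hosmer-Lemeshow statistic.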

Artificial Intelligence for Medical Decision Support – Advances and Limitations
Continuous advances in medical technology and IT infrastructure now provide access to high-resolution longitudinal patient data in unprecedented detail [17,18]. It is therefore not surprising that researchers have started to apply AI approaches developed for big data analysis to create novel prognostic models.
The implementation of machine-learning techniques in critical care medicine in particular has the potential to significantly change the way medical progress is achieved. A self-learning AI system that supports treatment decisions through continuous, real-time analysis of all relevant patient data would be the state-of-the-art technology in tomorrow's critical care medicine. A prominent current example is the early prediction of acute kidney injury, where an AI system recognized a dangerous trend in the data 12-24 h before the conventional laboratory test exceeded its threshold [19]. Machine learning has also been applied to the management of patients with viral respiratory infections [20] and to the prediction of respiratory decompensation in the ICU [21]. However, applying artificial intelligence, and especially machine-learning approaches, to medical data has multiple drawbacks:

1. The prediction routines typically depend on the availability of a massive amount of training data (thousands of patients) [22]. In the case of novel or rare diseases, like COVID-19, such data is often not available. Moreover, the training and implementation of the underlying algorithms is computationally intensive and requires sophisticated IT resources that are not commonly available.

2. Machine learning is a stochastic, not a deterministic, approach, meaning that it lacks common-sense physical or physiological constraints [22,23]. Medical data collected during routine operation, however, is intrinsically flawed (i.e., missing, faulty or badly annotated data), introducing the possibility of logical fallacies in fully automated set-ups [22]. Furthermore, some multiple-testing-based machine-learning approaches utilizing massive amounts of data can run into the problem of p-hacking, i.e., finding spuriously significant correlations simply because of the number of correlations tested [24].

3. Machine-learning approaches generally have poor transfer-learning ability, meaning that each narrow machine-learning application needs to be trained separately. The results heavily depend on the selection and annotation of the training data, which is often not representative of the general population. Therefore, many machine-learning approaches are too limited in scope and lack sufficient generalization to achieve a significant benefit for clinical routine [24,25]. Even if the training data is well selected, the same algorithm can arrive at different solutions that perform well on the training set while acting very differently in real environments, a problem known as underspecification [26]. Furthermore, ethical and moral constraints and concerns will make it difficult to translate any machine-learning-based solution that might affect human health and well-being [27,28].

4. The currently developed machine-learning algorithms might be able to predict a specific state reliably; however, a mechanistic understanding of the algorithm is often not possible, and no logical models can be derived from the final output [22,29]. Therefore, the final step in a machine-learning pipeline is often expert-based interpretation [30], which might be subjective and biased towards a certain hypothesis [31].
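The multiple-testing pitfall raised in point 2 is easy to reproduce: screening many random, meaningless candidate features against a random outcome will reliably turn up "significant" correlations. The following self-contained Python simulation (with invented sample sizes and the approximate two-sided 5% significance threshold |r| > 2/sqrt(n)) illustrates the effect:

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(0)
n_patients, n_features = 100, 500

# Outcome and all candidate features are pure, unrelated noise.
outcome = [random.random() for _ in range(n_patients)]
features = [[random.random() for _ in range(n_patients)]
            for _ in range(n_features)]

# |r| > 2/sqrt(n) corresponds roughly to two-sided p < 0.05.
threshold = 2 / math.sqrt(n_patients)
false_hits = sum(1 for f in features
                 if abs(pearson_r(f, outcome)) > threshold)
print(f"{false_hits} of {n_features} random features look 'significant'")
```

By construction, roughly 5% of the random features are expected to cross the threshold; without a correction for multiple testing, each of them could be reported as a "discovery".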

Conclusions and Future Perspective
Incorporating medical domain knowledge into data-driven models is one promising way forward. Such knowledge can be encoded in different ways: causal graphs [32], hybrid mechanistic-machine-learning models [33] and well-designed regularization schemes [34], to mention just a few possibilities. It can act as a guide during the training of the algorithms, helping to select the domain-relevant solutions from all the possible solutions they explore. Current research in these areas tries to produce models that retain the predictive ability and performance that have fueled the current machine-learning boom, while achieving better generalization, robustness and some level of interpretability, all of which are essential characteristics in the medical domain.
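As a toy illustration of the regularization route, consider a one-variable logistic model whose coefficient is penalized toward a plausible prior effect size (all numbers here are invented for illustration; this is a minimal sketch, not any of the cited methods). The penalty keeps the fitted coefficient from drifting to implausible values when the data alone would allow it, e.g. on a perfectly separable sample:

```python
import math

def fit_logistic(xs, ys, w_prior=0.0, lam=0.0, lr=0.1, steps=500):
    """Gradient descent on the logistic loss plus a quadratic penalty
    lam * (w - w_prior)^2 that pulls the coefficient toward a
    domain-informed prior value (single coefficient, no intercept)."""
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-w * x))  # predicted probability
            grad += (p - y) * x                 # logistic-loss gradient
        grad += 2.0 * lam * (w - w_prior)       # domain-knowledge penalty
        w -= lr * grad / len(xs)
    return w

# Perfectly separable toy data: the unpenalized coefficient keeps
# growing, while the penalized fit stays near the assumed prior of 0.5.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w_free = fit_logistic(xs, ys)
w_reg = fit_logistic(xs, ys, w_prior=0.5, lam=5.0)
```

Here the prior of 0.5 stands in for, e.g., a mechanistically expected effect size; causal graphs and hybrid mechanistic models pursue the same goal with far richer structure.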