### Data preparation and pre-processing

In this evaluation, we used the dataset provided by the *Johns Hopkins University Center for Systems Science and Engineering* (JHU CSSE)^{46,47}. This dataset has been collected since January 2020 from various sources such as the *World Health Organization* (WHO), the *European Centre for Disease Prevention and Control* (ECDC), and the *United States Centers for Disease Control and Prevention* (US CDC)^{46,47}. In detail, the dataset contains the daily numbers of total infectious (including recovered and deceased), recovered, and deceased cases in all countries around the world.

In our experiment, we used the data from January 2020 to the end of June 2020 as training data and the July 2020 data as the testing period. As presented in “The BeCaked model”, the model requires an input of four values for each day: susceptible, infectious, recovered, and deceased, but the above dataset provides only total infectious, recovered, and deceased cases. Therefore, we also need the total world population to recalculate the required input. The world population data are provided by Worldometers^{48}. Equations (9a)–(9d) show how the input of the BeCaked model is calculated from the dataset.

$$Input_{Susceptible} = Total\_Population - Total\_Infectious$$

(9a)

$$Input_{Infectious} = Total\_Infectious - Recovered - Deceased$$

(9b)

$$Input_{Recovered} = Recovered$$

(9c)

$$Input_{Deceased} = Deceased$$

(9d)

The inputs of our model are then normalized by dividing all values by the total population, to match the input format of our proposed method described in “The BeCaked model”.
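The derivation in Eqs. (9a)–(9d) and the subsequent normalization can be sketched as follows. This is an illustrative sketch only: the function name, argument names, and the rounded population figure are our assumptions, not the exact values or identifiers used in the JHU CSSE files.

```python
# Hypothetical world population (approximate 2020 figure, per Worldometers).
WORLD_POPULATION = 7_800_000_000

def prepare_inputs(total_infectious, recovered, deceased, population=WORLD_POPULATION):
    """Derive the four BeCaked inputs for one day and normalize them
    to proportions of the total population."""
    susceptible = population - total_infectious           # Eq. (9a)
    infectious = total_infectious - recovered - deceased  # Eq. (9b)
    # Recovered (9c) and deceased (9d) are used as reported.
    return tuple(x / population for x in (susceptible, infectious, recovered, deceased))

# Example with made-up cumulative counts for a single day.
s, i, r, d = prepare_inputs(total_infectious=10_000_000,
                            recovered=5_000_000,
                            deceased=500_000)
```

By construction, the four normalized values sum to 1, consistent with the compartmental assumption that every individual is in exactly one of the four states.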

### Global evaluation

In the evaluation process, to choose the optimal day-lag number *n*, we conducted experiments at the global scale using different day-lag values. According to the results of McAloon’s study (2020)^{49}, the lag varies from 5 to 14 days; therefore, we tested our model with 7-, 10-, and 14-day lags to find the most suitable one.

To determine a suitable value *n* of the lag days, we used the *recursive strategy*^{50} to perform *k*-step forecasting. In detail, a *k*-step forecasting process works as follows. First, we use only the last *n* days of data (June \((30-n+1)\)th–June 30th) as the initial input to forecast the next day, where *n* (the day lag) is set to 7, 10, and 14, respectively. We then repeat this process *k* times: at each step, we predict the numbers of cases (susceptible, infectious, recovered, deceased) for the next day and feed these predictions back as part of the input for the next iteration. Because the testing data cover July 2020, *k* can vary from 1 to 31; we eventually chose the maximal possible value, \(k = 31\).
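The recursive strategy above can be sketched in a few lines. Here `model_predict` is a hypothetical stand-in for the trained BeCaked model: we assume it maps a window of the last *n* daily state vectors to the next day's vector.

```python
def recursive_forecast(model_predict, history, k):
    """Recursive k-step forecasting: predict one day at a time and
    feed each prediction back into the sliding n-day input window.

    history: list of the last n observed daily states.
    Returns the k predicted daily states."""
    window = list(history)
    predictions = []
    for _ in range(k):
        next_day = model_predict(window)   # forecast day n+1 from the n-day window
        predictions.append(next_day)
        window = window[1:] + [next_day]   # slide the window over the prediction
    return predictions
```

For example, with a trivial persistence model (`lambda w: w[-1]`), every forecast step simply repeats the last value, which illustrates how errors in a real model would compound as predictions replace observations in the window.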

Table 1 shows our forecast results using 7-, 10-, and 14-day lags in terms of the *R Squared* (\(R^2\)) (Eq. (10)) and *Mean Absolute Percentage Error* (*MAPE*) (Eq. (11)) metrics.

$$R^2 = 1 - \frac{RSS}{TSS}$$

(10)

In Eq. (10), *RSS* and *TSS* denote the *Residual Sum of Squares* and the *Total Sum of Squares*, respectively. The \(R^2\) metric provides insight into the similarity between real and predicted data: the closer \(R^2\) is to 1, the better the model explains the data. The *MAPE* given in Eq. (11) is the mean of the absolute percentage errors over the *k* forecasting steps; the closer it is to 0, the better the results.

$$MAPE = \frac{1}{k}\sum_{i=1}^{k} \frac{|Y_i - \hat{Y}_i|}{Y_i}$$

(11)

In Eq. (11), *k*, \(Y_i\), and \({\hat{Y}}_i\) denote the number of steps, the actual cases, and our predicted cases, respectively.
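The two metrics translate directly into code. The sketch below implements Eqs. (10) and (11) exactly as defined above, with no assumptions beyond the formulas themselves.

```python
def r_squared(actual, predicted):
    """Eq. (10): R^2 = 1 - RSS/TSS."""
    mean_y = sum(actual) / len(actual)
    rss = sum((y - f) ** 2 for y, f in zip(actual, predicted))  # residual sum of squares
    tss = sum((y - mean_y) ** 2 for y in actual)                # total sum of squares
    return 1 - rss / tss

def mape(actual, predicted):
    """Eq. (11): mean absolute percentage error over k steps."""
    k = len(actual)
    return sum(abs(y - f) / y for y, f in zip(actual, predicted)) / k
```

A perfect forecast yields \(R^2 = 1\) and \(MAPE = 0\); for instance, predictions of 110 and 180 against actual values of 100 and 200 give a MAPE of 0.1 (10%).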

According to the results in Table 1, our model achieved its highest performance with a 10-day lag, so we fixed the day lag at 10 for further evaluations. We also visualized the 10-day-lag results for a better overview. Figure 8a compares the daily infectious cases between real data and BeCaked's forecasts, while Fig. 8b shows the same comparison for the total infectious cases. The comparisons of recovered and deceased cases are presented in Fig. 8c,d, respectively. Moreover, we visualized the parameters (\(\beta\), \(\gamma\), \(\mu\)) corresponding to the forecasts in Fig. 9. In this period (July 01st–July 31st), the pandemic started to come under control in many countries, so the overall transmission rate decreased. Due to the slower transmission, the recovery rate increased: the health systems in those countries were under less load, and doctors could pay more attention to currently infected patients. For the same reasons, the mortality rate decreased as well.

In Table 3, we compared our model with several well-established forecasting methods, ranging from statistical to machine learning models: *Autoregressive Integrated Moving Average* (ARIMA)^{12}, *Ridge*^{51}, *Least Absolute Shrinkage and Selection Operator* (LASSO)^{52}, *Support Vector Machine for Regression* (SVR)^{53}, *Decision Tree Regression* (DTR)^{54}, *Random Forest Regression* (RFR)^{55}, and *Gradient Boost Regression* (GBR)^{56}. Except for ARIMA, the other models use the same day-lag number *n* as our BeCaked does (which is 10) to forecast the future. The specific configuration of each model is listed in Table 2. Even though these models are widely used for time-series prediction and achieve competitive results^{50}, in the case of long-term COVID-19 forecasting, only Ridge and LASSO (besides our model) obtain acceptable results. The results below show that our model attains overall competitive performance compared with these methods.
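To illustrate how such lag-based baselines are set up, the sketch below frames a Ridge regressor with a 10-day lag: each training sample uses the previous 10 daily values as features to predict the next day's value. This is our own minimal illustration under assumed settings (a placeholder series, default `alpha`), not the exact configurations of Table 2.

```python
import numpy as np
from sklearn.linear_model import Ridge

def make_lagged(series, n=10):
    """Turn a 1-D time series into (X, y) pairs with an n-day lag window."""
    X = np.array([series[i:i + n] for i in range(len(series) - n)])
    y = np.array(series[n:])
    return X, y

# Placeholder for a daily case-count series (perfectly linear, for illustration).
series = list(range(100))
X, y = make_lagged(series, n=10)

model = Ridge(alpha=1.0).fit(X, y)
# One-step-ahead forecast from the last 10 observed days.
next_day = model.predict([series[-10:]])[0]
```

The same `make_lagged` framing works unchanged for LASSO, SVR, DTR, RFR, or GBR by swapping the estimator; multi-step forecasts are then produced with the recursive strategy described earlier.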

### Country evaluation

At the country level, we also conducted the same evaluation process as at the global scale. We chose six countries differing in location, social policy, anti-epidemic strategy, etc., to test our model under various conditions.

First, we fitted our model for each country with data until the end of June 2020. Then, we used the July 2020 cases for the testing phase. In this comparison, we used the same methods and configurations as at the global level. The evaluations below were performed with 1-step (Table 4), 7-step (Table 5), and 15-step (Table 6) forecasting of the daily infectious cases. The reason for these additional comparisons is discussed in “Discussion”.

In July 2020, many countries began to control the development of the COVID-19 pandemic better by restricting outside agents from spreading the virus. However, some countries, such as the United States, Australia, and Italy, reopened after lockdown. This created favorable conditions for external factors to directly increase the number of cases in those countries. Therefore, to forecast the long-term situation of those countries effectively, a forecasting model must be able to adapt to emerging changes in the pandemic's exponential growth rate. Accordingly, the comparison results below reflect the strong adaptive capacity of our BeCaked model relative to the others.

For a more challenging evaluation, we kept the training set fixed to the end of June 2020 while simulating the progress of long-term forecasting. In detail, we first used the last 10 days of June 2020 as the initial input and predicted the pandemic in July and August 2020. On the next simulated day, we used the last nine days of June plus July 1st as the input and reproduced the forecast until the end of August. This process mirrors the natural behavior of most real-life forecast systems, where the forecast is re-run each day to obtain the most accurate result. With this test, our model shows not only its performance but also its capacity to catch changes in the pandemic. Figures 10, 11, 12 and 13 show the prediction results of daily infectious cases at the beginning and middle of July and August 2020, respectively. According to these figures, like other prediction models, BeCaked only works well when given sufficient historical input. As a result, in Fig. 13, BeCaked achieves good accuracy in all countries.
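The rolling simulation above can be sketched as follows. As before, `model_predict` is a hypothetical stand-in for the trained model; each simulated day, one more real observation enters the 10-day window and the forecast is re-run out to the fixed end of the horizon.

```python
def rolling_forecasts(model_predict, june_tail, july_actuals, horizon_end):
    """Rolling-origin evaluation with a fixed training set.

    june_tail: the last 10 observed days of June.
    july_actuals: observed July days, revealed one per simulated day.
    horizon_end: number of days from July 1st to the end of the horizon.
    Returns one forecast (list of predicted days) per simulated day."""
    runs = []
    observed = list(june_tail)
    for day in range(len(july_actuals) + 1):
        window = list(observed[-10:])          # most recent 10 observed days
        forecast = []
        for _ in range(horizon_end - day):     # forecast to the fixed end date
            nxt = model_predict(window)
            forecast.append(nxt)
            window = window[1:] + [nxt]        # recursive one-step updates
        runs.append(forecast)
        if day < len(july_actuals):
            observed.append(july_actuals[day])  # reveal the next real day
    return runs
```

Each successive run starts one day later with one more real observation, so the forecasts get shorter but better informed, which is exactly what lets the model "catch" changes in the pandemic as they appear in the data.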

Moreover, unlike black-box prediction models, BeCaked can also provide an explanation for its results in terms of the parameters (\(\beta\), \(\gamma\), \(\mu\)). For example, in Fig. 13, consider the cases of Spain and the United Kingdom. Even though the predicted curves of these two countries have similar shapes, their parameters tell different stories. What the two countries have in common is that they failed to control the transmission rate (perhaps their lockdown policies did not have sufficient impact). However, Spain had been suffering from a high mortality rate for a long time, showing that its health system had difficulty dealing with a large number of infectious cases, even though the recovery rate had also been increasing (i.e., more patients were cured daily). In contrast, in the United Kingdom, the recovery and mortality rates gradually decreased in the early stage, indicating that the government controlled the situation reasonably well in this period, which was also implied by the reduction of the transmission rate. However, when the transmission rate began to increase (corresponding to the time the lockdown policy was relaxed in this country), the situation quickly worsened in terms of mortality. At the end of this experiment, even though the regression models of the two countries generate similar shapes, the parameters (\(\beta\), \(\gamma\), \(\mu\)) indicate that Spain had already brought the situation under control and things would improve, whereas the United Kingdom still had hard times ahead. Real data taken afterward confirmed our predictions.

For the other countries, BeCaked can also tell us what happened “behind the scenes” of the generated results. In Australia, the major turn occurred when the government succeeded in controlling the transmission rate with its lockdown policy. From that point, the recovery rate increased and the mortality rate decreased, leading to the stable situation the country is enjoying now.

Meanwhile, in Russia, the situation went up and down many times, reflecting the rapid policy changes in this country during this period. However, the country generally attempted to maintain less direct contact in the community to reduce the transmission rate, which leads our model to project a better situation for it.

In the United States, the social distancing policies proved somewhat effective, as both the transmission rate and the mortality rate gradually decreased. However, the absolute number of infectious cases still increased steadily, which can be explained by the decreasing number of recovered cases. This shows that the country was struggling to handle the patients infected in the earlier days of the outbreak.

In a broader view, Spain and the United Kingdom kept an almost unchanged policy, which led to a familiar situation, so our model could predict their pandemic future accurately from very early on. The evidence is that the shapes of the \(\beta\), \(\gamma\), \(\mu\) lines of these two nations in Figs. 10, 11, 12 and 13 are almost the same. Australia and the United States had faced the same situation in the past, but they did not provide any actions or policies to prevent external factors from spreading the virus, which is why their infectious cases increased dramatically. However, because they had faced this situation before, our model could give good forecast results after “realizing” it (after about 30 days). Considering the parameter lines of these two nations, we can see that during the simulation they are not stable, but in general their directions match the first forecast. For Italy and Russia, our model takes a little more time to change the direction of the forecast line, due to the unfamiliar situation. This can be seen by comparing the parameter lines of Australia and Russia in Figs. 10, 11, 12 and 13: the direction of these lines changed as the policies of these two nations became looser. Simply put, this is because there is no pattern for this situation in the training data (pandemic data until the end of June 2020).

From the above results, we can conclude that our proposed solution can catch up with changes in the pandemic situation. With an unchanged training set, our model gives very good forecasts when the situation is stable. When an unfamiliar situation occurs, our model needs time to give more accurate forecasts.

In real-life applications, forecasting models are updated regularly using reinforcement learning methods to keep them up to date with new situations. Therefore, to obtain better real-life results from our model, we need to fine-tune it with new data every one or two weeks.