The 2021 edition of Mirror, Mirror was constructed using the same methodological framework developed for the 2017 report in consultation with an expert advisory panel.2 Another expert advisory panel was convened to review the data, measures, and methods used in the 2021 edition.3
Using data available from Commonwealth Fund international surveys of the public and physicians and other sources of standardized data on quality and health care outcomes, and with the guidance of the independent expert advisory panel, we carefully selected 71 measures relevant to health care system performance, organizing them into five performance domains: access to care, care process, administrative efficiency, equity, and health care outcomes. The criteria for selecting measures and grouping within domains included: importance of the measure, standardization of the measure and data across the countries, salience to policymakers, and relevance to performance-improvement efforts. We examined correlations among indicators within each domain, removing a few highly correlated measures. Mirror, Mirror is unique in its inclusion of survey measures designed to reflect the perspectives of patients and professionals — the people who experience health care in each country during the course of a year. Nearly three-quarters of the measures come from surveys designed to elicit the public’s experience of its health system.
Changes Since 2017
The majority of measures included in this report are the same as in the 2017 edition of Mirror, Mirror (Appendix 2). Seventeen measures were dropped, either because a survey question was no longer included in the Commonwealth Fund International Health Policy Survey or because we had reason to believe the responses might be less valid owing to the effects of the COVID-19 pandemic (for example, questions in the timeliness subdomain related to wait times, which were fielded during the spring of 2020). Ten measures were considered “modified” in the 2021 report because the wording of a survey item had been altered since the 2017 version.
We worked to include new measures to fill previously identified gaps in performance measurement across the 11 countries and considered a wide array of potential new measures related to topics such as quality of behavioral and mental health care, hospital care, pediatric care, and safety. We considered the data availability of new measures, how recently they had been updated, and how they correlated with other measures in each domain. In the end we included 16 new measures across the five domains (see How We Measured Performance for details).
Data for this report were derived from several sources. Survey data are drawn from Commonwealth Fund International Health Policy Surveys fielded during 2017, 2019, and 2020. Since 1998, in collaboration with international partners, the Commonwealth Fund has supported these surveys of the public’s and primary care physicians’ experiences of their health care systems. Each year, in collaboration with researchers in the 11 countries, a common questionnaire is developed, translated, adapted, and pretested. The 2020 survey covered the general population; the 2017 survey covered adults age 65 and older. Both examined patients’ views of the health care system, quality of care, care coordination, medical errors, patient–physician communication, wait times, and access problems. The 2019 survey was administered to primary care physicians and examined their experiences providing care to patients, use of information technology, and use of teams to provide care.
The Commonwealth Fund International Health Policy Surveys (2017, 2019, and 2020) include nationally representative samples drawn at random from the populations surveyed. The sampling frames for the 2017 and 2020 surveys were generated using probability-based overlapping landline and mobile phone sampling designs and, in some countries, listed or nationwide population registries. The 2019 sample was drawn from government or private company lists of practicing primary care doctors in each country, except in France, where respondents were selected from a nationally representative panel of primary care physicians. Appendix 9 presents the number of respondents and response rates for each survey, and further details of the survey methods are described elsewhere.4,5,6
In addition to the survey items, standardized data were drawn from recent reports of the Organisation for Economic Co-operation and Development (OECD) and the World Health Organization (WHO). Our study included data from the OECD on screening, immunization, preventable hospital admissions, population health, and disease-specific outcomes. WHO data were used to measure health care outcomes.
The method for calculating performance scores and rankings is similar to that used in the 2017 report, except that we modified the calculation of relative performance because the U.S. was a distinct and substantial outlier (see below).
Measure performance scores: For each measure, we converted each country’s result (e.g., the percentage of survey respondents giving a certain response or a mortality rate) to a measure-specific, “normalized” performance score. This score was calculated as the difference between the country result and the 10-country mean, divided by the standard deviation of the results for each measure (see Appendix 3). Normalizing the results based on the standard deviation accounts for differences between measures in the range of variation among country-specific results. A positive performance score indicates the country performs above the group average; a negative score indicates the country performs below the group average. Performance scores in the equity domain were based on the difference between higher-income and lower-income groups, with a wider difference interpreted as a measure of lower equity between the two income strata in each country.
The normalized scoring approach assumes that results are normally distributed. In 2021, we noted that the U.S. was such a substantial outlier that it was negatively skewing the mean performance, violating the assumption. In 2017, we had included all 11 countries to calculate the mean and standard deviation of each measure. After conducting an outlier analysis (see below), we chose to adjust the calculation of average performance by excluding the U.S., using the other 10 countries as the sample group for calculating the mean performance score and standard deviation. This modification changes a country’s performance scores relative to the mean but does not affect the ranking of countries relative to one another.
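The normalization described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the report's actual code: the function name, the `higher_is_better` flag (used to flip the sign for measures such as mortality rates, so that a positive score always means better-than-average performance), the use of the sample standard deviation, and the example values are all assumptions.

```python
from statistics import mean, stdev

def normalized_scores(results, reference_exclude=("US",), higher_is_better=True):
    """Return {country: normalized score} for a single measure.

    The mean and standard deviation come from the reference group only
    (every country not listed in reference_exclude), but all countries,
    including the excluded ones, receive a score.
    """
    reference = [v for c, v in results.items() if c not in reference_exclude]
    mu = mean(reference)
    sigma = stdev(reference)  # assumption: sample (n - 1) standard deviation
    sign = 1.0 if higher_is_better else -1.0  # assumption: flip "lower is better" measures
    return {c: sign * (v - mu) / sigma for c, v in results.items()}

# Hypothetical results for one "lower is better" measure (not data from the report).
example = {"A": 10.0, "B": 20.0, "C": 30.0, "US": 60.0}
scores = normalized_scores(example, higher_is_better=False)
```

Within a single measure, changing the reference group only shifts and rescales the scores, so the ordering of countries on that measure stays the same.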
Domain performance scores and ranking: For each country, we calculated the mean of the measure performance scores in that domain. Then we ranked each country from 1 to 11 based on the mean domain performance score, with 1 representing the highest performance score and 11 representing the lowest performance score.
Overall performance scores and ranking: For each country, we calculated the mean of the five domain-specific performance scores. Then, we ranked each country from 1 to 11 based on this summary mean score, again with 1 representing the highest overall performance score and 11 representing the lowest overall performance score.
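The two aggregation steps above (measure scores to domain scores, then domain scores to an overall score and rank) can be sketched as follows. The data layout, the function names, and the example values are hypothetical. The sketch also assumes that every country has a score on every measure and that there are no tied scores, neither of which the text specifies.

```python
from statistics import mean

def rank_descending(scores):
    """Rank countries 1..n, with 1 for the highest score (ties are not handled)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {country: i + 1 for i, country in enumerate(ordered)}

def domain_scores(measure_scores):
    """{domain: {measure: {country: score}}} -> {domain: {country: mean score}}."""
    out = {}
    for domain, measures in measure_scores.items():
        countries = next(iter(measures.values())).keys()
        out[domain] = {c: mean(m[c] for m in measures.values()) for c in countries}
    return out

def overall_scores(by_domain):
    """Equal-weighted mean of the domain scores for each country."""
    countries = next(iter(by_domain.values())).keys()
    return {c: mean(d[c] for d in by_domain.values()) for c in countries}

# Hypothetical normalized scores for three countries and two domains.
data = {
    "access": {
        "m1": {"A": 1.0, "B": 0.0, "C": -1.0},
        "m2": {"A": 0.5, "B": 1.0, "C": -1.5},
    },
    "equity": {
        "m3": {"A": -1.0, "B": 1.0, "C": 0.0},
    },
}
by_domain = domain_scores(data)
domain_ranks = {d: rank_descending(s) for d, s in by_domain.items()}
overall = overall_scores(by_domain)
ranks = rank_descending(overall)
```

Because the overall score averages domain means rather than pooling all measures, a domain with few measures carries the same weight as a domain with many.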
Outlier analysis: We applied Tukey’s boxplot method of detecting statistical outliers and identified several domains or subdomains (affordability, preventive care, equity, and health care outcomes) in which the U.S. was a statistical outlier. The test also identified isolated instances of other countries as statistical outliers on specific measures, but the pattern for other countries was inconsistent and the outlier differences were smaller than those for the U.S.
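For readers unfamiliar with Tukey's boxplot rule, here is a minimal sketch. It flags any value lying more than 1.5 interquartile ranges below the first quartile or above the third quartile. The 1.5 multiplier is the conventional default, and the quartile interpolation method is an assumption, since the text does not say which variant was used. The example values are hypothetical.

```python
from statistics import quantiles

def tukey_outliers(results, k=1.5):
    """Return the {country: value} pairs that fall outside Tukey's fences."""
    values = list(results.values())
    # assumption: "inclusive" quartile interpolation; other conventions exist
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {c: v for c, v in results.items() if v < low or v > high}

# Hypothetical results on one measure (not data from the report).
example = {"A": 10, "B": 11, "C": 12, "D": 13, "E": 14, "US": 30}
flagged = tukey_outliers(example)
```

Here the quartiles are 11.25 and 13.75, so the fences sit at 7.5 and 17.5 and only the hypothetical "US" value of 30 is flagged.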
Sensitivity analysis: We checked the sensitivity of the results to different methods of excluding the U.S. as an outlier (see above). In one analysis, we removed the U.S. from the performance score calculation of each domain in which it was a statistical outlier on at least one indicator, while keeping the U.S. in the calculation of the other domains, where it was not an outlier (see Appendix 3). In another sensitivity analysis, we excluded the U.S. and other countries from the domains in which they were outliers, but the results were essentially similar.
We tested the stability of the ranking method by running two tests based on Monte Carlo simulation to observe how changes in the measure set or changes in the results on some measures would affect the overall rankings. For the first test, we removed three measure results from the analysis at random and then calculated the overall rankings on the remaining 68 measure results, repeating this procedure for 1,000 combinations selected at random. For the second test, we reassigned at random the survey measure results derived from the Commonwealth Fund International Health Policy surveys across a range of plus or minus 3 percentage points — approximately the 95 percent confidence interval for most measures — recalculating the overall rankings based on the adjusted data and repeating this procedure 1,000 times.
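Both stability tests can be sketched as a resampling loop around the ranking calculation. Everything below is an illustrative sketch under stated assumptions rather than the report's code: the helper names and data layout are hypothetical, all measures are treated as "higher is better", the perturbation is drawn from a uniform distribution on plus or minus 3 points (the text does not specify the distribution), and the toy data set has 6 measures rather than 71.

```python
import random
from collections import Counter
from statistics import mean, stdev

def overall_rank(raw, domains, exclude=("US",)):
    """raw: {measure: {country: value}}; domains: {measure: domain name}."""
    countries = list(next(iter(raw.values())))
    by_domain = {}
    for measure, results in raw.items():
        ref = [v for c, v in results.items() if c not in exclude]
        mu, sigma = mean(ref), stdev(ref)
        z = {c: (results[c] - mu) / sigma for c in countries}
        by_domain.setdefault(domains[measure], []).append(z)
    # Mean of domain means, with every domain weighted equally.
    overall = {c: mean(mean(z[c] for z in zs) for zs in by_domain.values())
               for c in countries}
    ordered = sorted(countries, key=overall.get, reverse=True)
    return {c: i + 1 for i, c in enumerate(ordered)}

def drop_measures_test(raw, domains, n_drop=3, n_iter=1000, seed=0):
    """Test 1: drop n_drop measures at random, re-rank, and tally the ranks."""
    rng = random.Random(seed)
    tally = {c: Counter() for c in next(iter(raw.values()))}
    for _ in range(n_iter):
        dropped = set(rng.sample(sorted(raw), n_drop))
        kept = {m: r for m, r in raw.items() if m not in dropped}
        for c, rank in overall_rank(kept, domains).items():
            tally[c][rank] += 1
    return tally

def perturb_test(raw, domains, survey_measures, width=3.0, n_iter=1000, seed=0):
    """Test 2: jitter survey-based results by up to +/- width, re-rank, and tally."""
    rng = random.Random(seed)
    tally = {c: Counter() for c in next(iter(raw.values()))}
    for _ in range(n_iter):
        jittered = {
            m: {c: (v + rng.uniform(-width, width)) if m in survey_measures else v
                for c, v in r.items()}
            for m, r in raw.items()
        }
        for c, rank in overall_rank(jittered, domains).items():
            tally[c][rank] += 1
    return tally

# Hypothetical data: 6 measures, 2 domains, 4 countries.
raw = {f"m{i}": {"A": 90.0 + i, "B": 70.0 - i, "C": 50.0 + i, "US": 10.0}
       for i in range(6)}
domains = {f"m{i}": "access" if i < 3 else "outcomes" for i in range(6)}
t1 = drop_measures_test(raw, domains, n_iter=200)
t2 = perturb_test(raw, domains, survey_measures={"m0", "m1"}, n_iter=200)
```

Each tally records how often a country landed at each rank across the simulations, which is the kind of summary needed to say that a country was "nearly always" in a given group.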
The sensitivity tests showed that the overall performance scores for each country varied but that the ranks clustered within several groups similar to those shown in Exhibit 2. Among the simulations, Norway, the Netherlands, and Australia were nearly always ranked among the top three countries; the U.S. was always ranked at the bottom, while Canada, France, and Switzerland were nearly always ranked between eighth and tenth. The other four countries varied in order between the fourth and seventh ranks. These results suggest that the selected ranking method was only slightly sensitive to the choice of indicators.
Four OECD indicators from the health care outcomes domain (30-day in-hospital mortality rate following acute myocardial infarction, 30-day in-hospital mortality rate following ischemic stroke, maternal mortality, and deaths from suicides) are included in the OECD measures of treatable and preventable mortality. To evaluate the potential impact of double-counting these four measures, we examined the correlations between each of the four measures and the two composite measures and recalculated the performance scores after removing these four measures. The correlations were modest or low. We found little difference in the overall performance scores for the 11 countries after removing the four potentially duplicative OECD indicators.
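The correlation check described above can be illustrated with a short sketch. It assumes a Pearson correlation computed across countries; the text does not name the coefficient used, and the values below are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of country results."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical values for four countries (not data from the report).
component = [1.0, 2.0, 3.0, 4.0]   # e.g., one disease-specific indicator
composite = [2.0, 1.0, 4.0, 3.0]   # e.g., a composite mortality measure
r = pearson(component, composite)
```

A coefficient near 1 would suggest that the component indicator largely duplicates the composite, while a modest or low value suggests that it contributes distinct information.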
This report has limitations. Some are particular to our analysis, while others are inherent in any effort to assess overall health system performance. First, no international comparative report can encapsulate every aspect of a complex health care system. As described above, our sensitivity analyses suggest that country rankings in the middle of the distribution (but not the extremes) are somewhat sensitive to small changes in the data or indicators included in the analysis.
Second, despite improvements in recent years, standardized cross-national data on health system performance are limited. The Commonwealth Fund surveys offer unique and detailed data on the experiences of patients and primary care physicians but do not capture important dimensions that might be obtained from medical records or administrative data. Furthermore, patients’ and physicians’ assessments might be affected by their expectations, which could differ by country and culture. Augmenting the survey data with standardized data from other international sources adds to our ability to evaluate population health and disease-specific outcomes. Some topics, such as hospital care and mental health care, are not well covered by currently available international data.
Third, we base our assessment of overall health system performance on five domains (access to care, care process, administrative efficiency, equity, and health care outcomes), which we weight equally in calculating each country’s overall performance score. Other elements of system performance, such as innovative potential or public health preparedness, are also important. We continue to seek feasible standardized indicators to measure other domains.
Fourth, in defining the five domains, we recognize that some measures could plausibly fit within several domains. To inform action, country performance should be examined at the level of individual measures in addition to the domains we have constructed.