|Year : 2017 | Volume
| Issue : 1 | Page : 3-8
Evidence-based medicine in action: Critical appraisal of articles on therapy or intervention
Department of General Medicine, Pondicherry Institute of Medical Sciences, Puducherry, India
Date of Submission: 22-Apr-2017
Date of Acceptance: 20-May-2017
Date of Web Publication: 12-Jul-2017
Department of General Medicine, Pondicherry Institute of Medical Sciences, Kalapet, Puducherry - 605 014
Source of Support: None, Conflict of Interest: None
Evidence-based medicine (EBM) has gained acceptance as a means to improve patient outcomes. However, the practice of EBM requires certain skills: formulating an answerable clinical question, searching the literature effectively for the best evidence, critically evaluating that evidence for validity, and applying good evidence to patients with due respect to their preferences and the clinician's individual expertise. Systematic analysis of the evidence is crucial for identifying the presence and degree of bias in a study, determining the magnitude of the results in clinically relevant terms, and deciding on their applicability to a particular patient. This process is termed critical appraisal. As research questions decide the appropriate study designs, the tools used for critically appraising different types of articles also vary. In general, critical appraisal of articles on therapy or intervention seeks to clarify three issues: the internal validity of the study, the magnitude and precision of the results, and the external validity of the study. This review aims to elucidate practical ways to approach each of these components and thereby enable the clinician to decide whether to use or ignore the evidence at hand. Since acquiring skills such as critical appraisal needs repeated exposure and continuous constructive feedback, journal clubs offer ideal occasions where they can be initiated, pursued, and mastered. Judicious application of critical appraisal should aid the effective practice of EBM and ultimately improve patient care.
Keywords: Critical appraisal, evidence-based medicine, intervention, outcomes, therapy
How to cite this article:
Basheer A. Evidence-based medicine in action: Critical appraisal of articles on therapy or intervention. J Curr Res Sci Med 2017;3:3-8
How to cite this URL:
Basheer A. Evidence-based medicine in action: Critical appraisal of articles on therapy or intervention. J Curr Res Sci Med [serial online] 2017 [cited 2023 May 28];3:3-8. Available from: https://www.jcrsmed.org/text.asp?2017/3/1/3/210336
Introduction
Evidence-based medicine (EBM) has evolved considerably since its inception in the early 1980s by David Sackett and colleagues and is now a cornerstone of modern medical practice. The term EBM, coined in 1992, encompasses three pivotal concepts: best evidence, the clinical expertise of the physician, and patient preferences or values. Evidence alone, even of the highest quality and credibility, is not sufficient to practice EBM. In fact, no evidence, however good, can be applied to patient care without due consideration of patient values and the clinical judgment of the doctor.
The practice of EBM revolves around four essential and often sequential steps: generating a clinically relevant, answerable question; searching systematically to acquire the evidence that answers it; critically evaluating the obtained evidence for its credibility; and finally applying it to the patient, incorporating his or her preferences and tempering it with the clinician's expertise. This ask, acquire, assess, and apply sequence is necessary to ensure that the clinician chooses the most appropriate and valid information from the vast ocean of medical literature with minimum effort and time.
Crucial to deciding whether a particular article found by literature search is appropriate and valid for applying to patients is the process of systematically analyzing it, termed critical appraisal. Critical appraisal helps the reader assess the strengths and weaknesses of a study in a relatively objective manner. Since study designs are predominantly determined by the research question they aspire to answer, the critical appraisal of each article also depends on the type of question it addresses. Therefore, it would not be appropriate to use the same set of tools to evaluate an article exploring the effectiveness of a therapy or intervention and another that addresses the prognosis of a particular disease. This review discusses the critical appraisal of articles on therapy or intervention.
Risk of Bias in Studies
It is logical to assume that studies are always prone to errors because the best of sampling techniques cannot reflect the entire population. While all studies are designed with the intention of finding the truth, it is never possible to determine whether a result obtained after the study reflects this truth because there is always a play of chance. Random error is a directionless error attributable to chance whereas systematic error (also termed bias) has a direction and occurs due to certain defects introduced at various stages of a study. Such systematic deviations from truth can seriously compromise the applicability and credibility of a study. One major aim of critical appraisal is to identify the presence and degree of bias in a study. In subsequent sections, we will discuss how this bias can be identified.
Components of Critical Appraisal of Article on Therapy
Although the efficacy of interventions or treatment is best studied in a randomized controlled trial (RCT), investigators may also employ other study designs. Such designs are at higher risk of bias, and therefore an RCT ranks above any other study design for questions of therapy or intervention.
There are three components in the critical appraisal of an article on therapy or intervention: internal validity, magnitude of effect, and external validity.
Internal Validity
Internal validity refers to the extent to which a study measures what it is intended to measure. In other words, it reflects the degree of bias associated with the study. What are the various points in a study where bias is likely to be introduced? Perhaps at every stage right from the start to the end.
Consider the case of a dart game; the shooter aims to hit the bull's eye which can be compared to the truth, every single time. When all darts hit bull's eye, the result is a cluster of hits within the truth. This is a case of no bias [Figure 1]a. Now consider a situation where all darts hit the outermost ring. In this case, the shooter is reliable as he has consistently been hitting the same area. However, the hits are all clustered away from the actual truth [Figure 1]b. This is a case of gross bias. There may be another situation as well where the hits are not exactly in the bull's eye but scattered immediately and uniformly around it. In this case, the hits are clustered around the truth, and the average of the hits will be close to the truth with minimal bias [Figure 1]c.
|Figure 1: (a) Example of dart game depicting a situation with no bias at all; (b) example illustrating gross bias with all darts clustered on one side (systematic deviation from bull's eye); (c) illustration of the more common scenario where darts cluster relatively uniformly around the bull's eye or truth, so that the average is close to the truth with minimal bias|
In case of RCTs (or for that matter any study), we expect most studies to deliver results as close to the truth as possible with minimal bias as in the last situation, being well aware that in real life no study can determine the actual truth. At the same time, we want to avoid or disregard studies whose results are deviant from the truth due to a significant bias. To assess internal validity, we address the following four levels of the study.
- An RCT typically begins by selecting a patient population that is subsequently randomized to two (intervention and control) groups. As is often said, apples cannot be compared with oranges; it would be unsuitable to compare these groups after the study if they were grossly dissimilar to begin with. For example, in a study to determine the effectiveness of a new mouthwash for reduction of periodontal infections, if one group has more patients with diabetes (which is a risk factor for periodontal infections), the study is likely to be biased as this group may finally show a poor response to the mouthwash.
A properly performed randomization generally provides sufficient (but not absolute) guarantee that the two groups are balanced prognostically at the beginning of the study. How do we ascertain this from a given study? Search in the article for a table [Figure 2] that describes the baseline characteristics of the two groups. Reflect for a while on the potential factors that can affect the prognosis of the condition being studied. Thereafter, find out from the table whether the proportion of participants in the two groups with these prognostic variables is relatively equal. If that is so, then you can safely assume that the randomization has worked and that the groups were comparable from the beginning. For example, in a study on the efficacy of an antidiabetic drug versus placebo, one important factor that is likely to affect the outcome is the duration of diabetes. If the group receiving the drug has a large proportion of patients with diabetes of <1 year duration compared to the placebo group where many have diabetes of >10 years duration, the results are likely to show that the new drug is more effective in controlling diabetes purely because of this imbalance.
- As the study proceeds, one is interested to know whether all trial participants completed the study and whether they adhered to the intervention to which they were randomized. This is crucial as it is well known that participants lost to follow-up or not adhering to the allocated regimen fare worse than others. Thus their poor outcomes are likely to bias the study. But how are we as readers equipped to identify this? Again, look for a flow chart describing the recruitment of participants to the study and the numbers lost to follow-up. A dropout rate of <10% gives us much more confidence in the validity of a study than a loss to follow-up of 20% or more. One must also be aware that these are relatively arbitrary cutoffs and actual alarm levels may depend on absolute sample size. A 10% loss to follow-up in a study with 100 patients may have a greater impact on validity than a similar percentage lost to follow-up in a study with 1000 participants.
- Another aspect that may affect the validity of an RCT is the way in which such dropout (if any) was handled by the investigators. The most accepted method requires that the individuals be analyzed in the same groups to which they have been allocated originally irrespective of whether they complete the study, adhere to the allocated treatment or switch from one group to other during the trial period. This method is called intention to treat (ITT) analysis.
How are we as clinicians going to check this aspect from a study? There are generally two ways of doing this. First, look at the flow chart describing participant recruitment, randomization and follow-up. Most good studies explicitly state how many participants remained in each arm at the end of the study and how many were included in the final analysis. If the number included in the final analysis equals the number of participants originally randomized to each arm, then it is an ITT analysis. Second, look at the table that displays the results of the primary outcome of the study. If the denominator for each study arm in the results table is the same as the originally randomized number, it implies that an ITT analysis has been done.
ITT provides a practical way to handle dropouts with minimal interference to validity. However, it does not nullify the impact on validity, and becomes ineffective when lost to follow-up rates exceed 15%–20%.
- As the study proceeds, the baseline balance achieved between groups through randomization may be threatened by at least three factors. Consider a study testing the effect of a new drug for insomnia. If the study individuals are aware of the group to which they are randomized (intervention drug or placebo), they are likely to report biased outcomes of sleep quality. By blinding participants to the treatment allocation, this form of bias can be tackled easily (single blind study). Now imagine another trial on the effect of a new antibiotic on the clinical improvement of pneumonia. If the investigating clinician is aware that a patient is randomized to the new drug, he might provide additional care in the form of intensive monitoring, early and aggressive use of inotropic agents or respiratory supports (co-interventions) that may tilt the final outcome in favour of the group receiving the new drug. Moreover, he is likely to be biased while assessing the outcome (clinical improvement, which is a relatively subjective variable). Blinding the investigators to treatment allocation will safeguard against these (double-blind study). When primary outcomes are relatively objective variables like mortality, the assessment by the investigator is less likely to be affected by prior knowledge of treatment allocation. Finally, if treatment allocation is unknown to individuals analyzing the data, it ensures further minimization of bias (triple blind). Double blinding provides adequate validity to an RCT on most occasions. However, beware of trials that simply report their method as “double blinded” since there is ample evidence to show that investigators and readers tend to perceive this term in several different ways as to who are blinded. Therefore, the latest guidelines for reporting of RCTs (CONSORT 2010) recommend that authors explicitly state the groups of persons who were blinded rather than the use of these terms.
No study would qualify for 100% internal validity. There may be varying degrees of bias introduced at the four levels described above [Table 1]. If the degree of systematic deviations introduced into the study is minimal, the internal validity must be rated good and the article worth pursuing further. On the other hand, studies may have serious issues at one or more of these levels (like grossly different baseline characteristics, huge loss to follow-up or lack of blinding of investigators and so on) or may fail to provide any information on these. The internal validity in these cases is deemed to be compromised, and the reader may choose either not to follow it further at all or interpret its results with extreme skepticism.
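The dropout and ITT checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the "intermediate" label for the band between the two cutoffs, and the flow-chart numbers are all hypothetical and do not come from any trial or from the article.

```python
def dropout_rate(randomized, completed):
    """Percentage of randomized participants who did not complete follow-up."""
    return 100.0 * (randomized - completed) / randomized


def dropout_concern(rate_pct):
    """Map a dropout percentage onto the (arbitrary) cutoffs quoted in the text."""
    if rate_pct < 10:
        return "low"
    if rate_pct < 20:
        # Assumed label: the text names only the <10% and >=20% bands.
        return "intermediate"
    return "high"


def is_intention_to_treat(randomized, analyzed):
    """True if every arm is analyzed with its originally randomized denominator."""
    return all(analyzed.get(arm) == n for arm, n in randomized.items())


# Hypothetical flow-chart numbers, invented for illustration only.
randomized = {"drug": 250, "placebo": 250}
completed = {"drug": 231, "placebo": 224}
analyzed = {"drug": 250, "placebo": 250}

for arm in randomized:
    rate = dropout_rate(randomized[arm], completed[arm])
    print(f"{arm}: {rate:.1f}% lost to follow-up ({dropout_concern(rate)} concern)")

print("ITT analysis:", is_intention_to_treat(randomized, analyzed))
```

Such a check only mirrors what the reader does by eye with the flow chart and the results table; it does not replace judgment about whether the cutoffs suit the sample size at hand.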
|Figure 2: Example of a table depicting baseline characteristics of study populations.|
Magnitude and Precision of Effect
Should the reader choose to appraise the article further, he/she would now need to ascertain the magnitude, relevance and precision of the results. In most RCTs, the outcomes are reported as dichotomous variables (for example, the number of individuals alive or dead in each group). The statistical analysis of such data usually yields risk estimates such as the odds ratio (OR) or relative risk (RR), which help quantify the magnitude of the treatment effect. In general, RCTs and cohort studies report results as RR while case–control studies use OR; however, these are not hard and fast rules. When time-to-event analysis (survival analysis, Kaplan-Meier analysis) is used, the point estimate is termed the hazard ratio (HR). When outcomes are assessed from continuous data, means are compared, and weighted mean differences are used to express the magnitude of effect.
The most effective way to interpret results is to scrutinize the table depicting results of the primary outcome. One can get a quick idea of how good or bad the treatment is by looking at the relative risk. RR is the ratio of the proportion of outcomes in the experimental group to the proportion of outcomes in the control group. A relative risk of 1 indicates no effect, meaning that the group receiving the experimental treatment and the group receiving placebo experienced the outcome of interest (such as death, myocardial infarction or stroke) at similar rates. In a study assessing the efficacy of an antiplatelet drug in preventing myocardial infarction (MI), if 45 of the 100 (45%) patients receiving the drug develop MI compared to 60 of the 90 (67%) patients receiving placebo, the RR would be 45/67 or 0.67. This means that the risk of MI among users of the drug is 67% of the risk among those on placebo. In other words, administration of the drug lowers the risk of MI by 33% (100 − 67), which is termed the relative risk reduction.
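The arithmetic of the antiplatelet example can be reproduced with a short Python sketch. The function names are illustrative choices, not part of any library. Note that working from the raw counts without rounding the placebo risk to 67% gives an RR of 0.675 and a relative risk reduction of 32.5%, which round to the 0.67 and 33% quoted in the text.

```python
def relative_risk(events_treated, n_treated, events_control, n_control):
    """Ratio of the event proportion in the treated arm to that in the control arm."""
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    return risk_treated / risk_control


def relative_risk_reduction(rr):
    """Proportional reduction in risk attributable to treatment (1 - RR)."""
    return 1.0 - rr


# Antiplatelet example from the text: 45/100 on the drug vs 60/90 on placebo.
rr = relative_risk(45, 100, 60, 90)
rrr = relative_risk_reduction(rr)

print(f"RR  = {rr:.3f}")
print(f"RRR = {rrr:.1%}")
```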
A statistically significant result may not necessarily be clinically significant. Consider a study in which a drug X has been given for 2 years to reduce the incidence of stroke among patients at high risk for cardiovascular events. At the end of the trial, stroke occurred in 30% of patients who received drug X and in 50% of patients who were on placebo (control group); this difference is found to be statistically significant (P < 0.01). Expressed as relative risk or risk ratio (RR), the event rate in the intervention group divided by the event rate in the control group, this is 30/50 or 0.60. In other words, drug X reduces the risk of stroke by 1 − 0.60, or 40%, when given to high-risk persons for 2 years. As a clinician, you would obviously want to use this drug.
Now think of another trial in which a drug Y has been tried for preventing stroke in a similar population over 2 years. At the end of 2 years, we find that 1.2% of patients receiving drug Y developed stroke compared with 2% of patients on placebo, a statistically significant difference (P = 0.001). The RR here is 1.2/2, which is 0.60. Thus drug Y also reduces stroke risk by 40%, as in the above study. But how confident would you be in prescribing drug Y when the percentage of persons who developed stroke is so small in both groups?
The absolute risk reduction helps us in such situations. In the first study, the absolute risk reduction with drug X was 50% − 30%, which is 20%. Dividing 100 by this absolute risk reduction gives the number needed to treat (NNT). Thus, the NNT here is 100/20 or 5. This simply means that if we treat 5 patients for 2 years with drug X, we can prevent 1 additional stroke. The lower the NNT, the better the clinical utility. In the second trial, however, the absolute risk reduction with drug Y is 2% − 1.2%, which is 0.8%. The NNT is 100/0.8 or 125, meaning that we would need to treat 125 patients with drug Y for 2 years to prevent one additional stroke. It is now apparent why we would not choose the second drug although its relative risk and relative risk reduction were similar to those of drug X.
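The contrast between drug X and drug Y can be checked with a minimal Python sketch. The function names are illustrative, and rounding the NNT up to the next whole patient is a common convention assumed here; the text itself only divides 100 by the absolute risk reduction.

```python
import math


def absolute_risk_reduction(control_rate_pct, treated_rate_pct):
    """Difference in event rates (control minus treated), in percentage points."""
    return control_rate_pct - treated_rate_pct


def number_needed_to_treat(arr_pct):
    """100 / ARR, rounded up to a whole patient (rounding up is an assumed convention)."""
    if arr_pct <= 0:
        raise ValueError("NNT is only defined for a positive absolute risk reduction")
    # round() guards against floating-point noise before taking the ceiling.
    return math.ceil(round(100.0 / arr_pct, 9))


# Event rates (%) from the two hypothetical stroke trials in the text.
trials = {"drug X": (50.0, 30.0), "drug Y": (2.0, 1.2)}

for name, (control, treated) in trials.items():
    rr = treated / control
    arr = absolute_risk_reduction(control, treated)
    nnt = number_needed_to_treat(arr)
    print(f"{name}: RR = {rr:.2f}, ARR = {arr:.1f}%, NNT = {nnt}")
```

Both drugs share the same RR of 0.60, yet the NNT differs 25-fold, which is the point the two trials are meant to illustrate.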
Equally important is the width of the confidence interval (CI) provided along with the point estimate of RR or OR. The CI, generally expressed as a 95% CI, conveys the precision of the estimate: if the study were repeated many times and a 95% CI computed each time, about 95% of those intervals would be expected to contain the true value. A wide CI generally reduces our confidence in the result, and if the CI includes 1, the result is statistically insignificant as well. For example, suppose the RR of 0.6 in the above study had a 95% CI of 0.4–1.3. This would indicate that the true RR could plausibly lie anywhere between 0.4 and 1.3. Since this CI includes 1, the result is not statistically significant; moreover, the data are compatible with as much as a 30% (1.3 − 1.0) increase in the risk of stroke with the use of the drug. Occasionally, a statistically insignificant result may be clinically important if one of the boundaries of the associated CI includes clinical benefit. For example, let us assume that the RR for a drug for preventing ventricular fibrillation (a potentially fatal arrhythmia) is 1.07 (95% CI 0.45–1.2); P > 0.05. At first glance we would not want to use this drug because the RR is around 1, indicating almost no difference or even a slight increase in the risk of arrhythmia. However, the lower limit of the CI is 0.45, which means that the real effect could be as low as an RR of 0.45, a risk reduction of 55%. Thus, we would not reject the drug outright but wait for more conclusive results from other trials.
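To make the link between counts and interval width concrete, the sketch below computes an approximate 95% CI for an RR. It uses the standard large-sample log method, which is an assumption introduced here since the article does not say how its intervals were derived, and it applies the method to the antiplatelet counts given earlier (45/100 vs 60/90). All function names are illustrative.

```python
import math


def rr_with_ci(events_treated, n_treated, events_control, n_control, z=1.96):
    """RR with an approximate 95% CI from the large-sample log method (assumed here)."""
    rr = (events_treated / n_treated) / (events_control / n_control)
    # Standard error of log(RR) for two independent binomial proportions.
    se_log_rr = math.sqrt(
        1 / events_treated - 1 / n_treated + 1 / events_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


def ci_includes_no_effect(lower, upper):
    """A CI for a ratio that contains 1 is not statistically significant."""
    return lower <= 1.0 <= upper


rr, lower, upper = rr_with_ci(45, 100, 60, 90)
print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
print("Includes 1:", ci_includes_no_effect(lower, upper))

# The two intervals quoted in the text both include 1.
print(ci_includes_no_effect(0.4, 1.3), ci_includes_no_effect(0.45, 1.2))
```

For these counts the interval runs from roughly 0.52 to 0.88 and excludes 1. Smaller trials with the same proportions would give a wider interval, which is why precision has to be read alongside the point estimate.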
Thus, it is important for clinicians to understand that the magnitude of the effect (expressed as RR, OR or HR), as well as the precision of the result (determined by the associated CI), are both critical to evaluate the results of a trial. The clinical relevance of results is best determined by deriving the NNT using the absolute risk reduction.
External Validity
External validity of a study refers to whether the results of an internally valid study can be applied to our patient. In other words, can we use the drug or intervention tested in the particular trial for the patient in front of us? As with the other two components of critical appraisal, this is also done objectively by answering a few questions.
- Is our patient grossly different from the population that was studied? In general, this is not the case, but if it is, we cannot apply the results to our patient. For example, if a study of a new antihypertensive drug was done on elderly patients with resistant hypertension, the results would not be relevant to a young hypertensive being worked up for secondary hypertension.
- Are there specific social, cultural, racial, regional or religious preferences and values to be addressed for your patient? However strong and compelling the evidence might be it can be applied only if the patient's values and preferences permit us to do so. For example, available evidence and your clinical judgment may warrant blood transfusion for your patient for his or her specific illness; however, if the patient's belief forbids receiving another person's blood, we cannot apply this piece of evidence.
- Do risks outweigh the benefits of applying this result to our patient? This needs a careful estimation of individual's baseline risk of the disease if no treatment is given. Look for the results table on safety of the drug or intervention. This is particularly important in case of interventions or drugs that have potentially harmful side effects. The magnitude of this adverse event also needs to be considered. For example, an intervention might reduce the risk of cardiac arrhythmia in elderly patients with cardiac failure by 30%. However, if the study also reports that 5% of the patients randomized to this intervention died during procedure compared to no deaths in the control group, would we as clinicians recommend this intervention? Probably, that depends on the risk and benefits perceived by the patient as well. We as responsible evidence-based practitioners need to discuss this with patients in as simple and succinct a manner as possible.
- Is this particular treatment feasible in our settings? Suppose an RCT finds that ventilating patients with acute respiratory distress syndrome in the prone position reduces mortality. The study is internally valid, and the results are clinically relevant. The study was conducted in a specialist intensive care setup that has been practising prone ventilation for the past 5 years. However, could this intervention possibly be successful in a hospital where the nursing staff are not trained at all in prone position ventilation? This lack of local expertise in a particular intervention or technique limits the external validity or generalizability of the study. Similarly, a new drug for rheumatoid arthritis might show excellent results in a well-conducted RCT; if the cost of this drug is exorbitant and our patient is from a low socioeconomic stratum, the applicability is restricted. Therefore, one needs to determine the feasibility based on local expertise, availability, cost, and any other factor that is relevant to a particular region and patient before the results can be applied.
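The risk-benefit question above can be made more tangible with a rough Python sketch. Several things here are assumptions not found in the article: the patient's baseline risk of 20% is invented for illustration, the patient-specific absolute risk reduction is taken as baseline risk multiplied by the relative risk reduction, and the harm is expressed as a "number needed to harm" by applying the same 100/difference arithmetic to the 5% procedural mortality. Function names are illustrative.

```python
import math


def number_needed(abs_difference_pct):
    """100 / absolute difference in percentage points, rounded up to a whole patient."""
    if abs_difference_pct <= 0:
        raise ValueError("absolute difference must be positive")
    return math.ceil(round(100.0 / abs_difference_pct, 9))


def patient_specific_arr(baseline_risk_pct, relative_risk_reduction):
    """Expected absolute benefit for a patient with a given untreated risk (assumed model)."""
    return baseline_risk_pct * relative_risk_reduction


baseline_risk_pct = 20.0       # ASSUMED untreated risk of arrhythmia, not from the article
rrr = 0.30                     # 30% relative risk reduction quoted in the text
procedural_deaths_pct = 5.0    # deaths in the intervention arm quoted in the text
control_deaths_pct = 0.0       # no deaths in the control arm

arr = patient_specific_arr(baseline_risk_pct, rrr)
nnt = number_needed(arr)
nnh = number_needed(procedural_deaths_pct - control_deaths_pct)

print(f"ARR for this patient = {arr:.1f}%, NNT = {nnt}, NNH (procedural death) = {nnh}")
```

Under these assumed numbers, about 17 patients would need the intervention to prevent one arrhythmia while one procedural death would be expected for every 20 treated, which is the kind of trade-off to discuss with the patient.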
Conclusion
Developing skills to critically appraise literature is imperative to separate clinically relevant, credible and valid evidence from studies with bias and clinically insignificant results. Following a simple algorithm of focused questions addressing internal validity, results and external validity enables rapid screening of articles and selection of the most appropriate evidence applicable to the individual patient problem. Journal clubs should act as catalysts for enhancing and popularizing critical appraisal of literature for the ultimate goal of EBM: improving patient outcomes.
Financial support and sponsorship
None.
Conflicts of interest
There are no conflicts of interest.
References
Evidence-Based Medicine Working Group. Evidence-based medicine. A new approach to teaching the practice of medicine. JAMA 1992;268:2420-5.
Horsley T, Hyde C, Santesso N, Parkes J, Milne R, Stewart R. Teaching critical appraisal skills in healthcare settings. Cochrane Database Syst Rev 2011;11:CD001270.
Pannucci CJ, Wilkins EG. Identifying and avoiding bias in research. Plast Reconstr Surg 2010;126:619-25.
Kabisch M, Ruckes C, Seibert-Grafe M, Blettner M. Randomized controlled trials: Part 17 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2011;108:663-8.
Guérin C, Reignier J, Richard JC, Beuret P, Gacouin A, Boulain T, et al. Prone positioning in severe acute respiratory distress syndrome. N Engl J Med 2013;368:2159-68.
Gupta SK. Intention-to-treat concept: A review. Perspect Clin Res 2011;2:109-12.
Schulz KF, Grimes DA. Blinding in randomised trials: Hiding who got what. Lancet 2002;359:696-700.
Karanicolas PJ, Farrokhyar F, Bhandari M. Practical tips for surgical research: Blinding: Who, what, when, why, how? Can J Surg 2010;53:345-8.
Miller LE, Stewart ME. The blind leading the blind: Use and misuse of blinding in randomized controlled trials. Contemp Clin Trials 2011;32:240-3.
Schulz KF, Altman DG, Moher D; CONSORT Group. CONSORT 2010 statement: Updated guidelines for reporting parallel group randomised trials. BMJ 2010;340:c332.
Viera AJ. Odds ratios and risk ratios: What's the difference and why does it matter? South Med J 2008;101:730-4.
Spruance SL, Reid JE, Grace M, Samore M. Hazard ratio in clinical trials. Antimicrob Agents Chemother 2004;48:2787-92.
Cook RJ, Sackett DL. The number needed to treat: A clinically useful measure of treatment effect. BMJ 1995;310:452-4.
Flechner L, Tseng TY. Understanding results: P-values, confidence intervals, and number need to treat. Indian J Urol 2011;27:532-5.