Doing Evidence Right

This article was originally published in the Winter 2017 edition of PA TIMES, which focused on workforce management.

The views expressed are those of the author and do not necessarily reflect the views of ASPA as an organization.

By Shelley Metzenbaum and Steven Kelman
January 8, 2018

For the two of us—who believe deeply that government can, should and does improve the quality of people’s lives, but who also believe that as individuals and a society we should live our lives figuring out facts, learning from experience and applying the best logic we can muster when relevant experience is lacking—the idea of using evidence to guide government decisions comes very close to a bedrock belief. In today’s environment, we embrace this belief more than ever.

But, we also want to do evidence right. The dominant view in the evidence-based government community is that we use evidence to figure out “what works,” and then direct resources toward what works and away from what does not. The view also holds that randomized controlled trials (RCTs) are the best method to produce evidence and that we should give pride of place to RCTs for the evidence in evidence-based government. Neither idea is completely mistaken. However, both have real limits we need to consider if we want to do evidence right.

We should not use program evaluations primarily to define programs as “effective” or “ineffective,” but to help find ways to improve.

Many “what works clearinghouses” and efforts to facilitate the search for evidence-based programs eligible for government funding suffer from oversimplified findings. They use average results to designate programs as effective, promising, ineffective or inconclusive. Yet a program or practice often is neither fully “effective” nor “ineffective”; paying attention to variations in performance is important. When that is the case (which it is most of the time), it is essential to consider more closely for whom, where and when the treatment worked or not, especially when government spending is restricted to evidence-based practices.

One reason a program may not be “effective” or “ineffective” is because, as renowned statistician Dick Light pointed out many years ago, a “program” may have tens or even hundreds of different design features, some of which may be effective and others ineffective. If we simply call something “job training” without attending to its many component parts and their potential permutations and combinations, conclusions about program effectiveness may well be meaningless. To learn something, we need to understand which design features are or are not associated with success.

A second reason is that some programs may show beneficial effects for only a subset of the population. A program might be ineffective for most people but effective for a few. Paying attention to the one person who benefitted from an otherwise failed drug trial led to the discovery of a class of patients who responded well to an “ineffective” drug, The New York Times explained in a June 2017 story about pembrolizumab, now marketed as Keytruda.

“The drug is the happy result of a failed RCT. A nearly identical drug was given to 33 colon cancer patients, and just one showed any response—but his cancer vanished altogether,” according to the article. Because at least one doctor took the time to follow up on the patient for whom the otherwise ineffective treatment worked, looking for characteristics that might explain his improvement, the doctor was able to find others with tumors the drug might help, run a trial on a sample of those patients and develop a drug likely to help 60,000 people annually in the United States alone.

Conversely, a program may be effective for the majority of people, organizations, places or situations but not for everyone. As Anthony Bryk pointed out, Reading Recovery, the first-grade literacy program, is effective for most children but not for a substantial number. By trying to decipher the characteristics of the children for whom a program does not work, policymakers can decide whether those not helped should be a priority and, if so, follow up with additional testing to find practices that do work for them.
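
To make the arithmetic behind this point concrete, the short Python sketch below uses entirely made-up numbers (hypothetical subgroup labels and outcome scores, not data from any study mentioned here) to show how an average treatment effect can mask a pattern in which most participants see no benefit while a small subgroup benefits substantially.

```python
# A minimal, illustrative sketch with made-up numbers: the overall treatment
# effect looks modest, but it is driven entirely by a small responder subgroup
# while most participants see no benefit at all.

from statistics import mean

# Hypothetical trial records: (subgroup, treated?, outcome score).
records = [
    ("responders", True, 80), ("responders", False, 50),
    ("non_responders", True, 50), ("non_responders", False, 50),
    ("non_responders", True, 49), ("non_responders", False, 51),
    ("non_responders", True, 51), ("non_responders", False, 49),
]

def effect(rows):
    """Difference in mean outcomes between treated and untreated rows."""
    treated = [score for _, is_treated, score in rows if is_treated]
    control = [score for _, is_treated, score in rows if not is_treated]
    return mean(treated) - mean(control)

print("Average effect, all participants:", round(effect(records), 1))
for group in sorted({g for g, _, _ in records}):
    subset = [r for r in records if r[0] == group]
    print(f"Effect for {group}:", round(effect(subset), 1))
```

With these invented numbers, the program looks mildly “effective” on average, yet it does nothing for most participants and a great deal for a few, which is precisely the variation a fund-or-defund decision based on the average would miss.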

An evaluation should be seen as a waystation on a journey to performance improvement, not the last stop.

That programs often are not merely “effective” or “ineffective” raises a crucial feature of doing evidence right. Evaluations normally try to determine whether a program worked and should be replicated, but typically do not discuss how the program can be improved. Too often, evaluation results are the end point of the journey, followed only by a decision to fund or defund. We believe that a crucial part of doing evidence right is to see most evaluations, particularly for important programs, as a waystation along a journey to performance improvement. More often, evaluations should adopt an iterative learning approach to evidence-based government: not stopping when RCT results are favorable for most, but treating that as a starting point and tweaking the program’s ingredients and practices to find ways to improve over time. Even when a proven set of practices is known, government should continue to test adjustments, measure frequently and find ways to enhance performance on multiple dimensions, not just outcomes but also people’s experience with government and unwanted side effects.

This way of thinking moves the evaluation-oriented tradition of evidence-based government closer to the tradition of performance measurement and management, where one measures performance against objectives, detects shortcomings and tries alternatives to current practice to see if they work better, rather than simply deciding to stop the program because it is ineffective.

To take an iterative, learning approach toward evidence-based government, we must move beyond using evidence only from randomized controlled trials.

In the performance measurement and management tradition, one typically uses data to inform goal-setting and then tests alternative ways to deal with a problem using methods considerably less rigorous than an RCT. For example, one can use convenience samples that are more readily available to real-life organizations, with considerably smaller sample sizes. These methods have the virtue of providing fast feedback. Further, if the alternative to the less rigorous approach in the performance measurement tradition is not an RCT but no testing at all, we should be careful not to let the perfect be the enemy of the good. Evidence from analyzing performance measures is a good way to start the performance improvement journey without a full-blown RCT.

Analytics applied to performance and other data to find trends, variations across different subsets, positive and negative outliers, anomalies and relationships is a valuable non-RCT source of evidence. This kind of analysis helps detect problems needing attention, find promising practices worth trying to replicate, inform priorities, identify root causes to try to influence and refine program design. We have become increasingly aware of how the private sector analyzes “big data” to understand individuals’ purchasing patterns, increasing sales through targeted marketing. Parts of government have begun analyzing big data to look for anomalous patterns that might point to fraud. There also is increased discussion of, and experience with, predictive analytics in government: analyzing past trends to predict the results of various interventions and using other statistical methods to identify which approaches to a problem are worth focusing on.
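
As a rough illustration of this kind of analysis, the sketch below (in Python, with invented office names and rates) applies a simple z-score rule to a performance measure to flag positive and negative outliers; a real analysis would use richer data and methods, but the logic is the same.

```python
# A rough sketch of performance analytics with invented office names and rates:
# flag field offices whose on-time rate sits well outside the typical range,
# using a simple z-score rule as a stand-in for richer analytic methods.

from statistics import mean, stdev

# Hypothetical performance measure: share of claims processed on time, by office.
on_time_rate = {
    "office_a": 0.91, "office_b": 0.88, "office_c": 0.90,
    "office_d": 0.72, "office_e": 0.89, "office_f": 0.97,
}

avg = mean(on_time_rate.values())
sd = stdev(on_time_rate.values())

for office, rate in sorted(on_time_rate.items(), key=lambda kv: kv[1]):
    z = (rate - avg) / sd
    if z <= -1.0:
        note = "negative outlier: a problem worth diagnosing"
    elif z >= 1.0:
        note = "positive outlier: practices possibly worth replicating"
    else:
        note = "within the typical range"
    print(f"{office}: {rate:.0%} ({note})")
```

With these made-up numbers, one office is flagged as a negative outlier worth diagnosing and another as a positive outlier whose practices may be worth replicating.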

Useful evidence comes in different shapes and sizes. Advocates of evidence-based government typically treat RCTs as the “gold standard” of evidence. Yet RCTs typically are relatively large, costly, lengthy and, as a consequence, rare. If we want to expand the scope of evidence-based government, not least if we want to adopt an iterative, learning approach that sees evaluations as a waystation in a journey of continuous improvement, we need to use not only performance analytics but also other forms of evidence that are quicker and less expensive to gather.

Many programs are learning to employ “rapid-cycle evaluations,” essentially small-scale, quickly executed, iterative RCTs that test the impact of discrete changes in policy, management or practice, rather than evaluating an entire program, to point to ways to improve. This approach may be on the rise due to growing familiarity with agile IT and web design practices, such as A/B testing, which uses random assignment principles to examine the impact of alternative web design features. Another form of small-scale RCT is the “nudge” intervention, which tests alternative ways to achieve a variety of objectives, such as reducing hiring bias, increasing taxpayer compliance or boosting school attendance rates.
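
A rapid-cycle test of this kind can be quite simple in structure. The sketch below simulates a hypothetical nudge experiment: recipients are randomly assigned to one of two invented reminder letters, and response rates are compared with a basic two-proportion z-test. The letter names, response probabilities and sample size are assumptions for illustration only.

```python
# A simplified sketch of a rapid-cycle, A/B-style nudge test: recipients are
# randomly assigned to one of two hypothetical reminder letters, and response
# rates are compared with a basic two-proportion z-test. The letter names,
# response probabilities and sample size are invented for illustration.

import random
from math import sqrt

random.seed(42)

ASSUMED_RESPONSE_RATES = {"standard_letter": 0.10, "simplified_letter": 0.13}

def simulate_response(letter: str) -> bool:
    """Stand-in for the real world: did this recipient respond?"""
    return random.random() < ASSUMED_RESPONSE_RATES[letter]

# Randomly assign 2,000 hypothetical recipients to the two letter versions.
responses = {"standard_letter": [], "simplified_letter": []}
for _ in range(2000):
    letter = random.choice(list(ASSUMED_RESPONSE_RATES))
    responses[letter].append(simulate_response(letter))

n1, n2 = len(responses["standard_letter"]), len(responses["simplified_letter"])
p1 = sum(responses["standard_letter"]) / n1
p2 = sum(responses["simplified_letter"]) / n2
pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
z = (p2 - p1) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

print(f"standard_letter:   {p1:.1%} response rate (n={n1})")
print(f"simplified_letter: {p2:.1%} response rate (n={n2})")
print(f"difference: {p2 - p1:+.1%}, z = {z:.2f}")
```

Because assignment is random, a sizable difference in response rates can reasonably be attributed to the letter itself rather than to differences in who happened to receive each version.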

In addition to RCTs and performance analytics, role-playing exercises like FEMA’s tabletop exercises are another useful source of evidence. They predict how people are likely to act in different situations and what problems are likely to arise when responding to actual events, while giving those involved the opportunity to practice, learn from experience and sort out future roles.

Defunding what does not work does not always make sense.

It is appealing and sometimes justified to defund programs that do not work. This is a strong reason to use evidence, especially to counteract political forces encouraging a failing status quo. Yet we should understand that if a government program does not work, the problem the program was intended to address likely still exists. If the problem itself is serious, we should be cautious about prematurely removing funding from programs that do not work before going through a performance improvement journey to try to locate new approaches, or evidence about positive outliers, that might produce improvement.

Evidence-based government is a good thing, and clearly better than the alternative. But, we should try to do evidence right to achieve the most learning and on-the-ground benefits with the fewest costs.


Authors: Shelley Metzenbaum is a senior fellow at the Volcker Alliance and a good government catalyst. She led federal efforts to improve government outcomes, cost effectiveness and accountability as OMB associate director for performance and personnel management in the first term of the Obama administration, subsequently serving as founding president of The Volcker Alliance. Prior to these positions, she was founding director of University of Massachusetts-Boston’s Collins Center for Public Management and director of the Kennedy School Executive Session on Public Sector Performance Management. She can be reached at [email protected].

Steven Kelman, the Weatherhead professor of public management at Harvard University’s John F. Kennedy School of Government, is the author of many books and articles on the policymaking process and improving the management of government organizations. From 1993 to 1997, Kelman served as administrator of OMB’s Office of Federal Procurement Policy. A fellow of the National Academy of Public Administration, he can be reached at [email protected].
