2014 Rossi Award Winner

Larry L. Orr

Acceptance Remarks

(November 6, 2014)

I am deeply honored by this award. No professional recognition is as meaningful as the recognition of one's peers. And I have always considered APPAM the professional association of my colleagues and peers.

It is also a great honor to be included in the company of Rossi awardees. The prior awardees are an elite group of people for whom I have enormous respect, both personally and professionally.

1. Rob Hollister is one of my most longstanding professional colleagues. (I almost said "oldest", but I think we've both gotten a little sensitive about that word!) When I was a kid just out of grad school, drowning in my first professional job as an Assistant Professor, Rob took me under his wing, helped me direct my first research project (a summer project, with a budget of maybe $50k), and then, as if that weren't enough, got me a job for which I was much better suited and that I really loved! (I'm deliberately not saying when that was!) Over the years, Rob has always been available for sage advice or just to lend an ear.

2. Fast-forward to the 80s (yeah -- we're only up to the 80s!), when I spent 8 of the most important-and rewarding-years of my life working with Judy Gueron and Howard Bloom on the National JTPA Study. Wow! What I learned from that pair! And not just about research methodology (though it was a real education in that), but about how one thinks about evaluation and designs an evaluation and, very importantly, implements an evaluation in the field. We didn't always agree-especially about the latter-but it was certainly a learning experience.

3. I haven't had as much opportunity to work with Becka Maynard and Tom Cook, though I have always followed and greatly admired their work. And Becka and I keep trying to collaborate on a book, but silly things like her stint as Commissioner of the National Center for Education Evaluation keep getting in the way. But you may yet see a Maynard and Orr volume on the shelves in your neighborhood bookstore.

While I am acknowledging the formative influences on my career, I also want to mention several non-awardees who had an enormous impact on my professional development: First, Joe Newhouse, with whom I worked closely for 7 years on the design and implementation of the Health Insurance Experiment. Joe taught me what experimental design was all about.

And Steve Bell, with whom I have collaborated on various projects, including the JTPA Study, over the past 30 years. Steve showed me how to solve design problems from first principles and put structure on even the messiest research problem, while still maintaining high standards of rigor.

Last, but not least, I want to acknowledge my professional debt to my wife, Kathleen Flanagan, who worked with me on evaluation projects for 4 or 5 years-and still married me! Kathleen taught me that evaluation is a team sport and that a team can be more than the sum of its parts. She also taught me that all-nighters don't improve the quality of the proposal or report.

I didn't know Peter Rossi personally, but I certainly know of his prodigious work on social programs and promotion of sound evaluation methodology in the interest of social justice-and for that, I think we all owe him an enormous debt.


So. As Doug has explained the rules to me, in order to get that nice plaque, I am supposed to write a publishable paper. With no equations, t-statistics or impact estimates. I'll leave it to Jacob to decide how publishable these remarks are, but I do have some things I'd like to say to my professional peers that, you will be pleased to know, don't involve equations or statistics.

For 40 years, the evaluation profession has been consumed in a battle over internal validity. And, as Judy Gueron and Howard Rolston have reminded us, it WAS a battle. In their aptly titled book, Fighting for Reliable Evidence, they speak of the "long struggle" to convince researchers and policymakers that random assignment is the method of choice to produce credible answers to important questions, like the effect of different welfare regimes on the work effort of the poor. Today, that battle has been decided. Random assignment, while still far from universal in practice, is almost universally acknowledged as the preferred method for impact evaluation. There is even a term for it: "the gold standard".

So I would like to declare that war over, and think about the next set of challenges we face in improving our craft.

In thinking about those challenges, I would like to start with what I call the Standard Model of impact evaluation-again, far from universally implemented, but almost universally agreed to be the best practice model. The Standard Model has several main features:

First, random assignment of a sample of individuals or groups of individuals, like classrooms or schools, to one or more interventions (the treatment group(s)) or to the status quo or "business as usual" (the control group) in a small number of sites.

Second, 1-2 rounds of follow-up surveys for all sample members.

Third, a process or implementation analysis.

Fourth, a benefit-cost analysis.

The Standard Model is focused on random assignment and internal validity. And from that limited perspective, it is the gold standard. But viewed more broadly, the Standard Model is anything but a gold standard. And I know about gold-I grew up in a gold mining town. I actually worked in a gold mine. So I know my precious metals. Peter Rossi may have known his base metals, like iron and brass. But I know precious metals. And the Standard Model is NOT a gold standard. It's not even a bronze standard.

The Standard Model of impact evaluation has at least two major flaws. First, as generally practiced, it has terrible external validity. And, second, it fails to take into account Peter Rossi's Iron Law. I see these flaws as posing the next big challenges for the evaluation community. Let me speak to each in turn.

First, external validity. In working on the issue of external validity over the past several years, I've discovered that term means very different things to different people. So let me state clearly what I mean by external validity. An externally valid evaluation provides unbiased estimates of the impact of an intervention on the population of interest for policy-that is, the population that would be affected if the intervention were adopted as policy (or, in the case of an ongoing program, the population currently served by that program). Suppose, for example, that the U.S. Department of Labor tests a new approach to job placement for unemployment insurance recipients; the population of policy interest is all UI recipients nationwide. If the state of Wyoming tests such a policy, the population of policy interest is all UI recipients in Wyoming. Why? Because the U.S. Department of Labor makes policy for the nation as a whole, not just Dayton or San Antonio. And the state of Wyoming makes policy for the entire state, not just Cheyenne or Newcastle.

Unfortunately, our evaluations almost never test interventions on samples that are representative of the population of interest for policy. Instead, we test interventions in sites that are convenient or cooperative, without regard to how well they represent the population of interest. Often, we do not even specify the population of interest. The most recent edition of the Digest of Social Experiments describes 273 randomized trials. Of those 273, seven were designed to be representative of the population of policy interest.

Does it matter? Well, if interventions have the same impact everywhere, it doesn't matter where you test them. But if impacts vary across sites, it does. And there is pretty good evidence that they do. For example, the Charter School study found school-specific impacts that varied from significantly negative to significantly positive. And Howard Bloom and Christina Weiland have found equally striking variation in impacts on various outcomes in the National Head Start Study. So we have to at least allow that choice of sites may matter.

Let me be clear. We don't know that the other 266 evaluations in the Digest yielded biased estimates of effects on the population of policy interest. But we don't know that they didn't-just as we never know whether nonexperimental estimates are internally biased. And just as we have become unwilling to accept estimates of unknown internal validity, we should be unwilling to accept estimates of unknown external validity.

I know what you are thinking: "So every evaluation has to have a random sample of the U.S. population? Right. Like that's going to happen!" That's not what I'm saying. I'm just saying that external validity is a problem that needs to be taken seriously, and that the smart people in this room have to figure out how to do better. I don't have any magic solutions. But I do know that until evaluation sponsors demand more representative results, and evaluators apply their considerable talents to producing them, the situation isn't going to change.

And I can't resist saying that it is possible to conduct evaluations on nationally representative samples, even for large programs. That's what was done in the evaluations of Job Corps, the Food Stamp Employment and Training Program, and Head Start. In the Benefit Offset National Demonstration, the Social Security Administration is implementing an experimental treatment that encompasses a 20 percent random sample of all Social Security Disability Insurance beneficiaries nationwide. So it can be done.

But I'm not asking you to draw nationally representative samples, because I know you won't. Here is what I am asking you to do-in every evaluation you conduct:

1. Define the population of policy interest at the outset.

2. Think about how you can select sites and draw samples that have a reasonable relationship to that population of interest.

3. Compare your sample to the population of policy interest on relevant characteristics and outcomes.

4. Document all of this in your design report.

5. Once you have results, use one of the various techniques available to project your estimates to the population of policy interest. Report those results.

I am convinced that if every evaluation followed these simple, easy steps, the evaluation community would be much more cognizant of, and committed to achieving, external validity. And the advice we provide to policy makers would be correspondingly better and more useful.

The second major flaw in the Standard Model is that it fails to take account of Rossi's Iron Law:

"The expected value of any net impact assessment of any large scale social program is zero."

My initial reaction to the Iron Law was probably about the same as yours: "Gee, that's a clever bit of hyperbole! Of course, nobody would take it literally." Peter himself substantially backed off a literal interpretation of the Iron Law in remarks he made at APPAM a few years ago.

I have come to believe, though, that the Iron Law is a pretty good description of reality. Note that saying that the expected value of the impact of the interventions we test is zero is not the same as saying that they all have zero effect. It just means that the distribution of effects is centered on zero. That implies that roughly half of the interventions we test have zero or negative effect (i.e., are no better than the status quo). That appears to be about right.

For example, a review by the Coalition for Evidence-Based Policy found that of the 90 interventions evaluated in randomized trials commissioned by the Institute of Education Sciences (IES) between 2002 and 2013, approximately 90% were found to have weak or no positive effects. Six of 10 randomized evaluations of science, technology, engineering, and math programs found weak or no effects. Of the 13 interventions evaluated in Department of Labor randomized trials that have reported results since 1992, about 75% were found to have weak or no positive effects.

In medicine, reviews have found that 50-80% of positive results in initial, "phase II" clinical studies are overturned in subsequent, more definitive "phase III" trials. A recent study published in the Journal of Clinical Epidemiology shows that 82% of diagnostic tests don't improve patient outcomes.

In his book Uncontrolled, Jim Manzi reports that of 13,000 randomized trials of new products/strategies conducted by Google and Microsoft, 80-90% have found no significant effects.

Manzi also cites a University of Cambridge review of 122 randomized field trials in criminology conducted between 1957 and 2004 in which only about 20 percent found statistically significant reductions in crime from the interventions tested. On the basis of this and other evidence, Manzi concludes that, "the vast majority of criminal justice, social welfare, and education programs fail replicated, independent, well-designed RFTs".

One implication one might draw from this dismal hit rate is that we should test better interventions. It appears to be the case, however, that it is almost impossible to predict with any confidence which interventions are likely to succeed. That is, after all, why we test them.

Even in medicine, with its highly structured sequence of tests leading to clinical trials, treatments that appear promising in Phase II studies frequently fail large, rigorous Phase III randomized trials. For example, Zia et al. (2005) report a success rate of only 28 percent among 43 Phase III studies based on Phase II trials of the identical chemotherapeutic regimens. Zia et al. also cite success rates of 2 to 24 percent across all Phase III trials in several oncology specialties. If extensive Phase II testing cannot yield more effective interventions for Phase III testing in medicine, it seems unlikely that social scientists and policy analysts can do much better.

So what does all this have to do with the Standard Model of evaluation? Well, an evaluation designed according to the Standard Model typically takes 5 years to complete and costs upward of several million dollars. Most agencies' research budgets will only support one or two of these per year. If we cannot improve the success rate by choosing better interventions, at this rate it is going to take a very long time to identify any appreciable number of effective interventions.

So what can we do? We can do better, cheaper, faster experiments.

In a review of the Department of Labor's evaluation program, Becka, Jon Baron, and I made several recommendations to take into account the low hit rate of social interventions. I believe that these recommendations are more generally applicable to other policy areas. First, we urged the Department to choose interventions for testing by strategically searching the existing evaluation literature to identify the strongest candidates-i.e., those most likely to produce sizable positive impacts. This evidence might take the form of small-scale trials or program components that appeared to be important in earlier rigorous tests of more comprehensive interventions. Raising the bar for investing in a rigorous test should improve the hit rate. But given the evidence in other fields, that is unlikely to be enough.

Our second recommendation, therefore, was to conduct experiments in a two-stage process. The first stage would be an experimental evaluation to measure the intervention's impact on the primary outcome(s) of interest (e.g., earnings)-if possible, using low-cost administrative data. This first stage would thus be designed to answer the most important question for policy: does the intervention produce the main hoped-for effects?

We suggested that this initial evaluation also obtain basic information on the implementation and cost of the intervention being evaluated, to help inform its replication should it be found effective. However, at this stage, we cautioned against large investments in process or implementation evaluations and data collection to support exploratory analyses (e.g., to learn about the mechanisms through which impacts occur, the factors that may interfere with effectiveness, and implementation challenges) for programs, policies, and practices that will typically not, in the end, be sufficiently effective to warrant adoption.

For interventions that demonstrate program effectiveness relative to cost in Stage 1, a second stage involving more comprehensive data collection and analyses would go forward. Stage 2 evaluations would look more like the Standard Model, with more intensive process analysis, data collection, and benefit-cost analysis. But unlike the current Standard Model, restricting such tests to interventions that have already demonstrated positive effects on central outcomes is almost certain to yield a higher hit rate.

A numerical example: If we could do Stage 1 trials at 40% of the cost of full Stage 2 trials, we could increase the number of trials-and the number of effective interventions identified-by over 50%. Within the same budget.
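The remarks do not spell out the arithmetic behind this figure. One way it could work out is sketched below, under assumptions that are not in the remarks: every intervention gets a Stage 1 trial, a hypothetical 20% of them advance to Stage 2, and each Stage 2 trial costs as much as a full Standard Model trial. The budget figure and all names are illustrative only.

```python
def trials_affordable(budget, stage1_cost, stage2_cost, advance_rate):
    """Number of interventions that can be tested under a two-stage strategy.

    Every intervention gets a Stage 1 trial; a share `advance_rate`
    of them goes on to a Stage 2 trial as well.
    """
    expected_cost_per_intervention = stage1_cost + advance_rate * stage2_cost
    return budget / expected_cost_per_intervention


FULL_TRIAL_COST = 1.0                 # unit of cost: one full Standard Model trial
BUDGET = 30.0                         # hypothetical: enough for 30 full trials
STAGE1_COST = 0.4 * FULL_TRIAL_COST   # from the remarks: 40% of a full trial
ADVANCE_RATE = 0.2                    # assumption: share of Stage 1 trials that advance

one_stage = BUDGET / FULL_TRIAL_COST
two_stage = trials_affordable(BUDGET, STAGE1_COST, FULL_TRIAL_COST, ADVANCE_RATE)
increase = two_stage / one_stage - 1

print(f"Trials under the Standard Model: {one_stage:.0f}")
print(f"Trials under the two-stage strategy: {two_stage:.0f}")
print(f"Increase in trials: {increase:.0%}")
```

Under these assumptions the increase comes to about two-thirds, consistent with "over 50%". A higher share advancing to Stage 2 would shrink the gain, so the result is sensitive to that assumed rate.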

The two-stage evaluation strategy also has a more subtle advantage. With a low hit rate, we run the risk of a very high "false discovery rate". Allow me to explain. At conventional levels of statistical significance, all those ineffective interventions-the ones with a true effect of zero-have a 5-10% chance of coming up statistically significant-i.e., of being "false positives" or Type I errors. So when one looks only at the interventions that yield statistically significant effects, a lot more than 5-10% of them will be false positives.

Numerical example: Suppose we test 100 interventions at the 10% significance level and that only 10 of them are truly effective. We will identify 8 or 9 of the truly effective interventions as statistically significant. But the 90 interventions that are ineffective will produce about 9 false positives. Result: about half the statistically significant findings are false positives. This is the "false discovery rate". It means that if we adopt all the interventions that yield statistically significant results in a single trial, half of them could be totally ineffective.
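The arithmetic in this example can be checked with a short sketch. The statistical power of 0.85 is an assumption, chosen only so that "8 or 9" of the 10 truly effective interventions are detected; the function name is illustrative.

```python
def false_discovery_rate(n_tested, n_effective, alpha, power):
    """Expected share of statistically significant findings that are false positives."""
    true_positives = n_effective * power                 # effective and detected
    false_positives = (n_tested - n_effective) * alpha   # ineffective but significant
    return false_positives / (true_positives + false_positives)


# From the example: 100 interventions, 10 truly effective, 10% significance level.
# Assumption: power of 0.85, so that 8 or 9 of the 10 effective ones are detected.
fdr = false_discovery_rate(n_tested=100, n_effective=10, alpha=0.10, power=0.85)
print(f"Expected false discovery rate: {fdr:.0%}")
```

With these inputs, about 9 of roughly 17.5 significant findings are false positives, which is the "about half" in the example.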

The surest protection against false positives is replication. In a single test, each intervention has a 5-10% chance of being a false positive. But the chance of an intervention being a false positive twice, in two successive replications, is less than one-tenth of 1 percent. The two-stage evaluation strategy automatically replicates all statistically significant results, effectively driving the false discovery rate to almost nothing.

Finally, there is one area in which we can do lots and lots of really cheap, really quick experiments. This is in the province of the "M" in APPAM. Program management involves a huge number of decisions, many of them relatively small, but collectively of central importance. In many cases, managers face a clear choice among two or more relatively well-defined options. Many of these choices are susceptible to rigorous analysis with randomized trials. For example, which style of letter gets a better response? Should this position be staffed with a Master of Social Work or can we use a BA? If the suggestion that management decisions like these should be decided with randomized trials seems weird, that's just because we are all used to management by gut instinct. As Jim Manzi points out in his book, major corporations like Google, Microsoft, and Capital One run thousands of experiments to decide questions like this.

I first encountered this use of randomization back in the 90s when we were working with a direct mail firm, designing a brochure to encourage adult workers to return to school to upgrade their skills. The choice was between a positive message-"Get ahead, stay ahead"-and a negative one-"Avoid layoffs". The firm we were working with did a randomized test as a matter of routine, sending out 10,000 of each brochure to randomly selected addresses. We were all gratified to see that the positive message got the better response.

So what I am suggesting is not at all new-it's just not done in government. It should be. Experiments like these are very cheap to carry out and their results are available in weeks or months, not years. If we believe in evidence-based decision-making, this is an area ripe for exploitation.

I raise these issues not because I think that doing any of the things I suggest will be easy. On the contrary, they will take imagination and persistence to move the profession out of the comfortable rut it has settled into. But the payoff, in terms of better advice to policy makers and better program management, could be enormous. I urge you to take on these challenges.

Thank you all for your time and patience. And thanks to the Association for this tremendous honor. I look forward to hearing the panel's thoughts on these issues.

