# We're so good at medical studies that most of them are wrong



## Ernest Nagel (Mar 4, 2010)

Some very good and intriguing points. :bow:

http://arstechnica.com/science/news...dical-studies-that-most-of-them-are-wrong.ars


*We're so good at medical studies that most of them are wrong*
By John Timmer

It's possible to get the mental equivalent of whiplash from the latest medical findings, as risk factors are identified one year and exonerated the next. According to a panel at the American Association for the Advancement of Science, this isn't a failure of medical research; it's a failure of statistics, and one that is becoming more common in fields ranging from genomics to astronomy. The problem is that our statistical tools for evaluating the probability of error haven't kept pace with our own successes, in the form of our ability to obtain massive data sets and perform multiple tests on them. Even given a low tolerance for error, the sheer number of tests performed ensures that some of them will produce erroneous results at random.

The panel consisted of Suresh Moolgavkar from the University of Washington, Berkeley's Juliet P. Shaffer, and Stanley Young from the National Institute of Statistical Sciences. The three gave talks that partially overlapped, at least when it came to describing the problem, so it's most informative to tackle the session at once, rather than by speaker.

*Why we can't trust most medical studies*

Statistical validation of results, as Shaffer described it, simply involves testing the null hypothesis: that the pattern you detect in your data occurs at random. If you can reject the null hypothesis (science and medicine have settled on rejecting it when there's only a five percent or less chance that it occurred at random), then you accept that your actual finding is significant.

The problem now is that we're rapidly expanding our ability to do tests. Various speakers pointed to data sources as diverse as gene expression chips and the Sloan Digital Sky Survey, which provide tens of thousands of individual data points to analyze. At the same time, the growth of computing power has meant that we can ask many questions of these large data sets at once, and each one of these tests increases the prospects that an error will occur in a study; as Shaffer put it, "every decision increases your error prospects." She pointed out that dividing data into subgroups, which can often identify susceptible subpopulations, is also a decision, and increases the chances of a spurious result. Smaller populations are also more prone to random associations.

In the end, Young noted, by the time you reach 61 tests, there's a 95 percent chance that you'll get a significant result at random. And, let's face it, researchers want to see a significant result, so there's a strong, unintentional bias towards trying different tests until something pops out.
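Young's 61-test figure follows from a standard independence calculation: if each test runs a 5 percent false-positive risk, the chance that at least one of n tests comes up significant by luck is 1 - 0.95^n. A minimal sketch (the function name is mine, not from the talk):

```python
def chance_of_spurious_hit(n_tests, alpha=0.05):
    """Probability that at least one of n independent tests is
    'significant' purely by chance, at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests

print(round(chance_of_spurious_hit(61), 3))  # about 0.956: past the 95% mark
print(round(chance_of_spurious_hit(14), 3))  # about 0.512: a coin flip by 14 tests
```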

Young went on to describe a study, published in JAMA, that was a multiple testing train wreck: exposures to 275 chemicals were considered, 32 health outcomes were tracked, and 10 demographic variables were used as controls. That was about 8,800 different tests, and as many as 9 million ways of looking at the data once the demographics were considered.
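The arithmetic behind those numbers can be checked directly; note that the 2**10 factor below (treating each demographic control as optionally included or excluded) is my reading of how 8,800 tests become roughly 9 million ways, not something the article spells out:

```python
chemicals, outcomes, demographics = 275, 32, 10

# Every exposure crossed with every health outcome is one test.
base_tests = chemicals * outcomes
print(base_tests)  # 8800

# One plausible reading of "9 million ways": each of the 10 demographic
# controls can be included in or left out of the model, giving 2**10
# possible adjustment sets per test.
ways = base_tests * 2 ** demographics
print(ways)  # 9011200
```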

*The problem with models*

Both Young and Moolgavkar then discussed the challenges of building a statistical model. Young focused on how the models are intended to help eliminate bias. Items like demographic information often correlate with risks of specific health outcomes, and researchers need to adjust for those when attempting to identify the residual risk associated with any other factors. As Young pointed out, however, you're never going to know all the possible risk factors, so there will always be error that ends up getting lumped in with whatever you're testing.

Moolgavkar pointed out a different challenge related to building the statistical models: even the same factor can be accounted for using different mathematical means. The models also make decisions on how best to handle things like measuring exposures or health outcomes. The net result is that two models can be fed an identical dataset, and still produce a different answer.

At this point, Moolgavkar veered into precisely the issues we covered in our recent story on scientific reproducibility: if you don't have access to the models themselves, you won't be able to find out why they produce different answers, and you won't fully appreciate the science behind what you're seeing.

*Consequences and solutions*

It's pretty obvious that these factors create a host of potential problems, but Young provided the best measure of where the field stands. In a survey of the recent literature, he found that 95 percent of the results of observational studies on human health had failed replication when tested using a rigorous, double-blind trial. So, how do we fix this?

The consensus seems to be that we simply can't rely on the researchers to do it. As Shaffer noted, experimentalists who produce the raw data want it to generate results, and the statisticians do what they can to help them find those results. The problems with this are well recognized within the statistics community, but statisticians are loath to engage in the sort of self-criticism that could make a difference. (The attitude, as Young described it, is "We're both living in glass houses, we both have bricks.")

Shaffer described how there were tools (the "family-wise error rate") that were once used for large studies, but they were so stringent that researchers couldn't use them and claim much in the way of positive results. The statistics community started working on developing alternatives about 15 years ago but, despite a few promising ideas, none of them gained significant traction within the community.
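The family-wise error rate Shaffer mentioned is classically controlled with a Bonferroni-style correction: divide the significance threshold by the number of tests. The sketch below (Bonferroni is my concrete example; the talk did not name a specific method) shows why researchers found such tools stringent: with a thousand tests, the per-test bar falls from 0.05 to 0.00005.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Family-wise error control: a result counts as significant only if
    its p-value clears alpha divided by the total number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Three "findings" buried among 1000 tests; only the strongest survives.
p_values = [0.00001, 0.004, 0.03] + [0.5] * 997
print(bonferroni_significant(p_values)[:3])  # [True, False, False]
```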

Both Moolgavkar and Young argued that the impetus for change had to come from funding agencies and the journals in which the results are published. These are the only groups that are in a position to force some corrections, such as compelling researchers to share both data sets and the code for statistical models.

Moolgavkar also made a forceful argument that journal editors and reviewers needed to hold studies to a minimal standard of biological plausibility. Focusing on studies of the health risks posed by particulates, he described one study indicating that the particulates in a city were as harmful as smoking 40 cigarettes daily, while another concluded that particulates had a significant protective effect when it comes to cardiovascular disease. "Nobody is going to tell you that, for your health, you should go out and run behind a diesel bus," Moolgavkar said. "How did this get past the reviewers?"

But, in the meantime, Shaffer seemed to suggest that we simply have to recognize the problem and communicate it to the public, so that people don't leap to health conclusions each time a new population study gets published. Medical researchers recognize the value of replication, and they don't start writing prescriptions based on the latest gene expression study; they wait for the individual genes to be validated. As we wait for any sort of reform to arrive, caution, and explaining to the public the reasons for this caution, seems like the best we can do.


----------



## Melian (Mar 4, 2010)

The root of the problem is that our recent biotechnology development boom was not accompanied by a similar increase in bioinformatician training/recruitment. We are using microarray and deep sequencing platforms that can interrogate millions of loci per sample on a single chip or plate - repeat this for a decent sample size and you are flooded with data, BUT...few algorithms exist to handle such a volume (and those that do are not incredibly well tested). So the more established analysts can invent their own algorithms and deal with the data in a logical manner, but the rest try to adapt old programs or invent analysis techniques that don't take all variables into account, and this is why we don't see much reproducibility!

Also, the protocols for a lot of the high throughput techniques are not standardized. The science community is not blind to the issue, though. For example, the MIAME (Minimum Information About a Microarray Experiment) guidelines have been established and places like GEO are enforcing them strictly, in an attempt to make all array data comparable and (hopefully) decrease the number of false positives.

It will be a few more years before researchers really learn to handle the firepower they're using today :/


----------



## Ernest Nagel (Mar 4, 2010)

Melian, on top of all that the people who are now in senior research positions and charged with experimental design have bio-statistics training/understanding that is 10-20 or more years out of date. They don't design the validation protocols themselves but their outdated awareness informs the research design. In terms of their relationship to the data it's not unlike a lumberjack from 1910 trying to come up with a suitable timber harvest strategy with no direct exposure to 2010 technology. We should probably be more grateful for the progress that is made. :bow: :doh:


----------



## Melian (Mar 6, 2010)

Ernest Nagel said:


> Melian, on top of all that *the people who are now in senior research positions and charged with experimental design have bio-statistics training/understanding that is 10-20 or more years out of date*. They don't design the validation protocols themselves but their outdated awareness informs the research design. In terms of their relationship to the data it's not unlike a lumberjack from 1910 trying to come up with a suitable timber harvest strategy with no direct exposure to 2010 technology. We should probably be more grateful for the progress that is made. :bow: :doh:



This is true, but (from my experience) a lot of them are _trying_ to stay current. It really comes down to the budget of the lab - my lab is fairly wealthy, so we employ a team of bioinformaticians, some seasoned and some fresh out of doctoral training, and they guide all data analysis and experimental design. These guys really know their stuff and they provide training for lab members, as well as the PI! The labs with less funding, however, are the ones who have difficulty keeping current - unfortunately, a lot of labs fall into that demographic, especially the ones doing studies that are more "social science" vs "pure science" (and these are usually the interest-piece studies that get published for public consumption, thus the public is constantly exposed to their flaws...). The clinical studies are well funded, but are often forced forward due to pressure from industry partners, so although they really could afford to do things correctly, a lot of issues are overlooked as they crank that paper out.

Sigh.


----------



## Webmaster (Mar 6, 2010)

I was pretty involved in all this when I did the research for my Ph.D. (A Multi-Variate Analysis of the American Household Energy System), which was at the very dawn of computer-assisted statistical analysis. As far as I am concerned, a) sponsorship is a not inconsiderable variable in the outcome of almost any study, b) proving one's own hypotheses is an equally strong variable in the outcome, and c) the statistical science behind it all is complex enough that there are not many who can both truly comprehend it AND analyze and meaningfully present the findings. Finally, the media's tendency to mindlessly sensationalize data contributes to public scorn and skepticism.


----------

