Thu Nov 20th, 2008 at 07:15:08 PM EST
I have said many times on ET that I don't trust most computer simulations of reality, but I never supported this claim with any argument. Since this is the area where I work, I think I can offer some ideas on the topic based on personal experience. My plan is to write a robust and extensive argument on this issue in the future. For now, I will start with a simple example, using mostly naive arguments. Feedback is highly encouraged (especially antagonistic viewpoints that are rationally grounded).
Caveat: I exclude computational chemistry and physics from all this (a preamble explaining why is due in the future). By computational chemistry and physics I mean only what theoretical chemists call ab-initio and semi-empirical methods.
Caveat: I will trade rigor for clarity (in particular, the word gene will be abused).
Undergrad in computer science (with lots of experience in software development)
MSc in bioinformatics (area: population genetics, conservation genetics)
PhD in tropical medicine (ongoing - theoretical study of the spread of drug-resistant malaria)
For obvious reasons it is quite difficult for me to talk about the PhD work. I will just say that I have even more reasons to believe in what I believe.
Population genetics and simulated reality:
In many research studies involving population genetics, empirical data is compared to data simulated by a computer. Conclusions about the real-world data are drawn from how far that data deviates from the simulated data.
Imagine trying to find which genes in a certain population are under selection. Say that a certain population of fish is able to adapt to very cold waters while others aren't; you might want to know which genes are responsible for that (assuming there is a genetic basis). The procedure is more or less like this: you go to the field and collect samples (something that has DNA) from individuals in several populations. Then, in the lab, several candidate genes are genotyped for each individual. Each individual has 2 copies of each gene, and different versions of the same gene (different alleles) may exist. You are looking for genes that seem to be "different" from the majority. The underlying hypothesis is that "normal" genes are neutral while "different" genes are under selection (and hopefully are related to thermal regulation, which is what you are looking for).
Imagine that you sample 2 populations (one in normal waters, another in cold waters), 10 individuals from each population (so 20 gene copies per gene per population), and (only) 3 different genes. Imagine that you observe the following allele counts (I will call the alleles of the first gene a and A, of the second gene b and B, and of the third c and C).
Population |  a   A |  b   B |  c   C
Normal     | 10  10 |  0  20 | 10  10
Cold       | 10  10 | 20   0 | 10  10
It would seem that the first and third genes have the same distribution in both populations and that the second gene is quite "different".
This is, of course, a simple and highly skewed example, for explanation purposes.
Reality is normally much more complex, involves more genes, more populations, and possibly more alleles per gene.
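To make "different" concrete, here is a minimal Python sketch that turns the counts of the toy example into allele frequencies and a per-gene differentiation statistic. The choice of Fst (in its simplest two-population form, with equal sample sizes and no sample-size correction) is an assumption made for illustration, since nothing above fixes a particular statistic. The names counts, freq, fst and summarize are hypothetical and not taken from any software package.

```python
# Allele counts from the toy example: (lowercase allele, uppercase allele)
# per population. 10 diploid individuals give 20 gene copies per population.
counts = {
    "gene 1 (a/A)": {"Normal": (10, 10), "Cold": (10, 10)},
    "gene 2 (b/B)": {"Normal": (0, 20), "Cold": (20, 0)},
    "gene 3 (c/C)": {"Normal": (10, 10), "Cold": (10, 10)},
}


def freq(pair):
    """Frequency of the first (lowercase) allele in one population."""
    first, second = pair
    return first / (first + second)


def fst(p1, p2):
    """Simplest two-population Fst = (Ht - Hs) / Ht for a biallelic gene.

    Assumption: equal sample sizes and no sample-size correction.
    """
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)            # expected heterozygosity, pooled
    hs = p1 * (1 - p1) + p2 * (1 - p2)      # mean within-population value
    return 0.0 if ht == 0 else (ht - hs) / ht


def summarize(table):
    """Map each gene name to its Fst between the Normal and Cold samples."""
    return {
        gene: fst(freq(pops["Normal"]), freq(pops["Cold"]))
        for gene, pops in table.items()
    }


for gene, value in summarize(counts).items():
    print(f"{gene}: Fst = {value:.2f}")
```

With these counts, genes 1 and 3 come out at 0 (no differentiation) and gene 2 at 1 (complete differentiation), which is why the example is called highly skewed.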
So what might researchers do? They simulate a bunch of neutral ("normal") genes on a computer, calculate confidence intervals for some statistics on that simulation, and then compare each real gene (or the statistics derived from it) against the computer simulation results. Genes that fall outside the confidence intervals from the simulation are deemed candidates for being "under selection".
So, real data is compared against simulated data to derive conclusions.
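The compare-against-simulation step can be sketched roughly as follows. Note the swap: this is not a coalescent simulator but a crude forward Wright-Fisher drift model, chosen only because it fits in a few lines. Every parameter value (ancestral allele frequency 0.5, 100 gene copies per population, 20 generations of isolation, 500 replicates, a one-sided 95% cut-off) is invented, and all function names are hypothetical. The observed values are the simple two-population Fst of the toy example (0 for genes 1 and 3, 1 for gene 2).

```python
import random


def binomial(n, p, rng):
    """Number of successes in n independent draws with success probability p."""
    return sum(rng.random() < p for _ in range(n))


def fst(p1, p2):
    """Simplest two-population Fst for a biallelic gene (no corrections)."""
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)
    hs = p1 * (1 - p1) + p2 * (1 - p2)
    return 0.0 if ht == 0 else (ht - hs) / ht


def drift(p, copies, generations, rng):
    """Neutral Wright-Fisher drift: resample all gene copies each generation."""
    for _ in range(generations):
        p = binomial(copies, p, rng) / copies
    return p


def neutral_fst(rng, p0=0.5, copies=100, generations=20, sample=20):
    """One simulated neutral gene: two isolated populations, then sampling."""
    sampled = []
    for _ in range(2):
        p = drift(p0, copies, generations, rng)
        sampled.append(binomial(sample, p, rng) / sample)
    return fst(sampled[0], sampled[1])


def neutral_threshold(replicates=500, quantile=0.95, seed=1):
    """Empirical upper quantile of Fst across simulated neutral genes."""
    rng = random.Random(seed)
    values = sorted(neutral_fst(rng) for _ in range(replicates))
    return values[int(quantile * (replicates - 1))]


observed = {"gene 1": 0.0, "gene 2": 1.0, "gene 3": 0.0}
threshold = neutral_threshold()
candidates = [gene for gene, value in observed.items() if value > threshold]
print(f"neutral 95% threshold: {threshold:.2f}")
print("candidates for selection:", candidates)
```

With these made-up settings only gene 2 ends up above the threshold. The catch is that the threshold depends on every invented parameter: change the population size, the number of generations or the mating scheme and the cut-off moves with it.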
So what about the realism of that simulated data? In population genetics, most simulations (not only in my selection detection example) are based on coalescent theory. I am not going to explain the coalescent here, or this would become huge. I will present some of the common assumptions made in coalescent simulations (note that some simulators might do things differently).
- Random mating among individuals. The meaning of this is obvious. Or maybe not.
- Some simulators don't have the notion of gender. So random mating might mean really random mating of genes.
- Random mating is not very good in many situations. We all know of many species where mating is not random at all. But it is said that random mating is a good approximation for humans (especially because we don't see one-male-for-all-females behavior - though the Irish might complain here (search for Niall), or Genghis Khan's descendants, by the way). So we have random mating used for species which don't mate randomly at all, and for humans, where it is seen as an acceptable approximation. I will not even go into politically dangerous terrain about random mating among humans; I will just say this: if you do some trivial math, random mating produces lots of half-siblings (where one of the parents is shared but the other is not) and almost no full siblings (where both parents are shared). Do you think this is a reasonable approximation of the reality that you know? Think especially in terms of genetic diversity (which is what we are discussing here).
- Random mating is actually an extreme in the sense that it maximizes the transmission of genetic diversity. For instance, if in the real scenario only one male is responsible for the next generation's gene pool (this is quite typical in, say, domesticated animals), you lose the genetic diversity that all the other males might contribute. A random mating approach loses very little in comparison.
- Meta-population structure. Let me give an example with humans. Humans are normally studied theoretically as three populations (Africans, Asians and Europeans - forget about Native Americans to start with). These 3 populations are, you guessed it, in random mating, and individuals migrate among populations (with equal rates of migration among populations). So according to this model, a Portuguese like me (think of the simulation as modeling ages before planes or cars existed) had a much bigger probability of mating with a Finn than with a Moroccan (as a funny aside, people from Lisbon and further south, like me, are called Moors by northern Portuguese).
- Hierarchical population. Continuing and detailing the example above, each population is normally in random mating. It is known that genetic diversity varies quite a lot within Europe (the Finns are normally presented as an example of a highly inbred nation - more than the Icelanders). But most models take no account of this internal population structure. In most models there is normally no notion of hierarchy (i.e., the 3 human populations are not internally subdivided).
- Independent generations. According to most existing models we can only mate with people of the same generation. No Charles Chaplin for you (or Niall, or Genghis Khan ;) ). Actually I think this is a bigger problem with some non-human species.
- On a completely different front are field and lab errors. Repeated sampling of the same individual? It might happen, especially with elusive species. Errors in the lab? There is ample evidence that researchers make lots of errors in interpreting results from sequencers. How does a given method cope with average error rates?
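The "trivial math" about siblings in the random mating item can be written down explicitly. This is a minimal sketch under assumptions that are mine: two sexes, non-overlapping generations, and every offspring drawing its mother and its father independently and uniformly at random. Function names and population sizes are hypothetical.

```python
import random
from itertools import combinations


def sibling_probabilities(n_females, n_males):
    """Chance that two random offspring are full or half siblings."""
    same_mother = 1 / n_females
    same_father = 1 / n_males
    full = same_mother * same_father
    half = same_mother * (1 - same_father) + (1 - same_mother) * same_father
    return full, half


def count_sibling_pairs(n_females, n_males, n_offspring, seed=1):
    """Simulate one generation and count full- and half-sibling pairs."""
    rng = random.Random(seed)
    parents = [
        (rng.randrange(n_females), rng.randrange(n_males))
        for _ in range(n_offspring)
    ]
    full = half = 0
    for (mother1, father1), (mother2, father2) in combinations(parents, 2):
        shared = (mother1 == mother2) + (father1 == father2)
        if shared == 2:
            full += 1
        elif shared == 1:
            half += 1
    return full, half


full_p, half_p = sibling_probabilities(100, 100)
print(f"half-sibling pairs per full-sibling pair: {half_p / full_p:.0f}")
print("simulated (full, half) pairs:", count_sibling_pairs(100, 100, 200))
```

With N females and N males the ratio of half-sibling pairs to full-sibling pairs is 2(N - 1), so 198 for N = 100. The larger the population, the rarer full siblings become under this model.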
Now that I think of it, most of the assumptions discussed above (the last being a more ambiguous case) tend to bias the simulation toward increased genetic diversity...
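For the single-male case from the list above, the size of that bias can be estimated with the standard textbook formula for effective population size under an unequal sex ratio, Ne = 4 * Nm * Nf / (Nm + Nf), together with the expected fraction of heterozygosity retained after t generations, (1 - 1/(2 * Ne))^t. The sketch below is only an illustration: the herd of 100 breeding animals and the 10 generations are invented numbers, and the function names are hypothetical.

```python
def effective_size(n_males, n_females):
    """Effective population size with unequal numbers of breeding sexes."""
    return 4 * n_males * n_females / (n_males + n_females)


def heterozygosity_retained(ne, generations):
    """Expected fraction of heterozygosity left after drift at size ne."""
    return (1 - 1 / (2 * ne)) ** generations


# Hypothetical herd of 100 breeding animals, followed for 10 generations.
scenarios = {
    "random mating, 50 males + 50 females": (50, 50),
    "one male sires everything, 1 male + 99 females": (1, 99),
}
for label, (males, females) in scenarios.items():
    ne = effective_size(males, females)
    kept = heterozygosity_retained(ne, generations=10)
    print(f"{label}: Ne = {ne:.2f}, heterozygosity kept = {kept:.0%}")
```

The same 100 animals behave like an effective population of 100 under an even sex ratio and of 3.96 when a single male does all the breeding, so a model that assumes random mating keeps far more diversity than the real herd would.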
Notice also that I am discussing approximations that are easy to take apart. In many cases much of the history of populations (especially non-human ones) is mostly unknown: population size, population structure, mating habits, genomic structure. So you invent.
Now, people are aware of this and the counter-argument is: "most methods are robust to changing assumptions". The immediate reply then becomes: "Show me proof that your method is robust" (this can be done with theoretical studies where you change a certain assumption and compare the results). Normally the answer becomes: "I have tested for a certain specific change" (normally from one demographic model to another). "What about these other n differences, which are also important (like, say, mating)?" The answer is invariably: "You can do it yourself, that would be nice to see".
I am consciously avoiding a discussion of the economics and sociology of this, and an explanation of why people don't test for all these conditions. There are reasons why people avoid doing this, but that is not the point of this exercise (though a discussion about the sociology of science would be cool).
The only thing that I want to do is to plant some seeds of doubt about the current use of computational simulations as a proof/forecast mechanism in science, and to do that by explaining, roughly, the underlying science.
By the way, I also think that in many cases we are really not talking about science: many of these methods are falsifiable, but people seem to prefer to ignore that ("this method can still be used in many realistic scenarios", "this method is valid if the (unrealistic) assumptions are; if you have different assumptions, then it is your responsibility").
This text was, I have to admit, half-baked. Please accept my apologies for that. It is my intention to develop the argument over time, but I think I owed some explanation of why I am so doubtful about using computer models for making predictions. Please allow for some language abuse; I just want to get the rough idea across for now. Comments are very welcome.