
Talking to the dead study: does it hold up?

Posted in Skepdude by Skepdude on August 16, 2010

A little while ago, I was listening to The Skeptics’ Guide to the Universe podcast when Steven Novella mentioned that he’d been on the Skeptiko podcast debating Near Death Experience research with the host, Alex Tsakiris. I subscribed to Skeptiko to hear the debate. My initial reaction was that Alex was honestly trying to evaluate the evidence. However, the way he was interpreting it was unsatisfactory to my skeptical mind. Thus, I decided to listen to a couple of other episodes to see if my initial impression was correct.

In the next episode, Alex’s guests were the hosts of a skeptical podcast I wasn’t aware of, called Righteous Indignation, and the main thing the four of them spent a lot of time discussing was a study about mediums and communication with the dead. The study is titled “Anomalous Information Reception by Research Mediums Demonstrated Using a Novel Triple-Blind Protocol” by Julie Beischel and Gary E. Schwartz. I sent Alex an e-mail to make sure this is in fact the study in question, and he replied confirming that it is indeed the study they were discussing on that show.

Alex took exception to the skeptics’ comment that the study’s methodology was questionable. After reading the study myself, I find myself agreeing, not surprisingly, with the skeptics. This study has glaring issues and leaves too many important pieces of information out. I tried to reach the study’s lead author, Julie Beischel, via e-mail to ask her a few questions, but the address listed on the study came back with an error message. Unfortunately, those questions will remain unanswered.

So without further ado let me get into the meat of things.

Study Summary

The study’s purpose was to investigate the “anomalous reception of information about deceased individuals by research mediums under experimental conditions that eliminate conventional explanations.” In other words, the authors wanted to set up conditions which made it impossible for the mediums to get information in any way besides “anomalous reception”, a.k.a. psychically, and then figure out the success rate.

8 students were selected, 4 with a deceased parent and 4 with a deceased peer. Each student was paired with a student from the other group, so that each pair of 2 students had one deceased parent and one deceased peer between them, both deceased individuals of the same gender, resulting in 4 pairs of “sitters”. An unrelated third person, who had no knowledge of the sitters or the dead people, served as a “proxy sitter”. In other words, the proxy sitter was given the names of the 2 dead people, which he/she then relayed to the medium over the phone. The medium, working solely with the first name of the dead person, would then go on to produce a reading for the pair (2 readings per medium: one for the dead parent, one for the dead peer). Each pair of sitters received readings from 2 separate mediums.

So, to summarize: 8 students organized in 4 pairs; 8 mediums; each pair got readings from 2 mediums. We have 16 readings altogether. Next comes the scoring.

I’m not going to spend much time on the technicalities of the scoring process. For the purposes of this summary, it suffices to say that each student was presented with the readings for the pair and asked to choose the one that better fit their deceased person. So if you were the student with the dead parent, you’d get two readings: the one meant for your dead parent and the one meant for the dead peer of the other student in your pair. You would not know which was which, and you had to pick the one that best fit your dead parent. After this process, 13 out of the 16 readings were correctly identified.

The authors concluded with strong words:

The present findings provide evidence for anomalous information reception but do not directly address what parapsychological mechanisms are involved in that reception. In and of themselves, the data cannot distinguish among hypotheses such as (a) survival of consciousness (the continued existence, separate from the body, of an individual’s consciousness or personality after physical death) and (b) mind reading (ESP or telepathy) or super-psi (retrieval of information via a generalized psychic information channel or physical quantum field, also called super-ESP).

So what is the verdict here? Does this study really provide convincing evidence for anomalous reception?

Basic Criteria for evaluating a scientific paper

Before we start analyzing how well, or how poorly, this study followed basic methodological principles, it is important, I think, to review the basic characteristics we expect to see in a well-designed and well-run scientific study. They are:

  1. No fraud – This one is pretty obvious; the very first requirement is that no fraud was perpetrated by the authors: no hiding of data, no making up of data, and that sort of thing.
  2. Statistical competency – We would also expect the authors to have done their statistics properly, using the correct analytical techniques.
  3. Sample Size – This refers to the number of subjects drafted to participate in a study. For any given confidence level and confidence interval, a minimum sample size (referred to as n in statistics) is necessary. The smaller the sample, the less reliable the results of the study. Sample size is directly related to the total population about which we are trying to draw a conclusion, the confidence level, and the confidence interval. For a quick calculator and a quick refresher on what these terms mean, you can check out this website.
  4. Randomization – Random selection of test subjects is important because it helps to reduce the effect of bias on the study results.
  5. Control Group – Very important for weeding out perceived, but not real, effects/benefits of whatever is being studied. Thus, when testing a drug, there will be one group of test subjects receiving the medicine being studied, and another group, separate and distinct from the first, receiving a sugar pill. Neither group knows which it is being administered. The results from the control group are compared with the results from the medicine group to see if there is a real effect, beyond placebo.
  6. Blinding – Single/double/triple: blinding comes in many flavors. The gold standard is double blinding, in which neither the test subject nor the person administering the thing being tested knows what they are dealing with. Triple blinding is also possible, in which the people doing the statistical analysis of the raw data are not told which data set they are analyzing. So, for example, in the drug scenario double blinding means the test subject does not know if he is getting the medicine or the sugar pill, and the person handing out the pills does not know if she is handing out the medicine or the sugar pill. In the triple-blinded case, the statistician would not be told “here is the data for the medicine and here is the data for the sugar pill”; instead, she would be told “here is data set A and here is data set B”.

These are the core, basic requirements of a properly designed scientific study. Now, going back to the study at hand: the skeptics claimed that the methodology, a fancy way of saying the design, of the study was inappropriate; “highly dubious”, I believe, were the exact words, if my memory serves correctly. Let us go through the list and see if that is indeed the case, or if Alex was right that this study has a very good design. Only one of them can be right, so let us try to find out which.

Study Analysis

I will skip over #1 and #2 and give the study’s authors a “Pass”, for two simple reasons: I am not aware of any evidence of fraud, so unless such evidence comes to light I am inclined to believe none was present; and since I am not an expert in statistics, I cannot scrutinize the statistical methods and results, so I am willing to give the study the benefit of the doubt in that regard as well. Let’s look at the other criteria, those that any lay person can evaluate for themselves.

#1 Sample Size – Was the sample size appropriate? Well, what is the sample size in this study? Is it the number of students recruited? The number of mediums? Given that what is being studied here is not the effect of the reading on the sitter but the effectiveness of the medium at giving a correct reading, I would suggest that the sample here is the total number of readings performed, thus n=16. Is this sample size appropriate? No, not to enable us to reach any conclusions whatsoever. Even if everything else were done perfectly, and all the other criteria followed to the letter, a sample size of 16 at best indicates that a larger sample is needed. No conclusions can be drawn from 16 data points.

You do not have to take just my word for it. Let us refer to the calculator I linked to before. How can we apply it to this case? Simple: the study concludes that 13 of the 16 readings were correctly identified, and takes that as strong evidence for psychic powers, or anomalous reception of information; the unstated premise is that those 13 readings must have been on target. So the readings are our data points. Now let us think for a moment: how many such readings take place, in the US alone, in any given year? I would venture a guess of something in the hundreds of thousands. Let us say, for argument’s sake, that we have a population of only 100,000 readings.

Now we ask the question: how many readings do we have to study for the sample size to be appropriate? That depends on the desired confidence level and interval. No study I have ever read has used a confidence level of less than 95%, and if I am not mistaken this study is using a 99.9% confidence level, but for argument’s sake we will use the lower level of 95%, which requires a smaller sample size. The interval is the plus-or-minus figure that usually follows poll results; I have usually seen values of a few points, so let us go with 5. Please type all this information into the calculator:

  • Confidence Level – 95%
  • Confidence Interval – 5
  • Population – 100,000

The result? 383. In other words, you’d need to look at 383 readings to be 95% sure that the result is within 5% of the true value. All of a sudden 16 looks really, really tiny, doesn’t it? Strike One!
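
If you would rather check the arithmetic than trust a website, here is a minimal sketch of the calculation such calculators perform. The exact formula (normal approximation with a finite population correction, worst-case proportion of 0.5) is my assumption about how the linked calculator works:

```python
import math

# Minimal sketch of a standard sample-size calculation (assumed to match
# the linked calculator): normal approximation with a finite population
# correction, using the worst-case proportion p = 0.5.
def sample_size(z=1.96, margin=0.05, population=100_000, p=0.5):
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)  # infinite-population size, ~384.16
    return math.ceil(n0 / (1 + (n0 - 1) / population))  # finite population correction

# 95% confidence (z = 1.96), +/-5 interval, population of 100,000 readings
print(sample_size())  # -> 383, versus the study's n = 16
```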

#2 Randomization – Were the test subjects chosen at random? No, neither the sitters nor the mediums were chosen at random from their respective populations. I do see why that would be so with the mediums (you want to test the best of the best, after all, if you want to sort this thing out, and you don’t want the charlatans in the medium population to dilute the effect), but I do not understand why this simple requirement was not followed when it came to the sitters. The authors had a pool of 1,600 students to choose from, more than enough to draw a nice, random sample from. Instead, the sitters were selected based on answers of “yes” or “unsure” to questions about their belief in the afterlife and mediums. Furthermore, the final 8 were chosen based on their answers and on the desired pairing, in order to optimize “the ability of blinded raters to differentiate between two gender-matched readings during scoring”.

What does all this mean? Simply put, it means that the authors hand-picked who they wanted as sitters based on the survey questions, and even went so far as to make sure that the paired deceased were as different from each other as possible. That takes randomization and throws it out the window, no questions asked.

So what exactly were these survey questions the volunteers had to answer? What were the answers of the final 8? We do not know, and unfortunately, since Dr. Beischel’s e-mail did not work, I could not ask. But these are crucial pieces of information to have. What if all 8 had answered “yes” to the questions “do you believe in an afterlife?” and “do you believe in mediums and their ability to contact the dead?” Wouldn’t you think that would severely bias the way they looked at the readings? Strike Two!

#3 Control Group – This was a sticking point between the skeptics and Alex in the podcast. Alex kept insisting that there was a control: that the fact that each person got their intended reading plus another reading constituted a control. However, he is missing the main point about controls: it is supposed to be a control group, separate and distinct from the “treatment” group. The magnitude of the placebo effect, random chance, etc. cannot be gauged by having the same test subject choose between treatment A and the placebo. That is just not how science works, and if we are purporting to run a scientific experiment we must play by the rules of science. You cannot make up a new definition of “control”; that would be having your cake and eating it too!

So what would the control have looked like in this experiment? Sticking with the way this experiment was run, the control group would be a second group of 8 students, identical to the first 8, who would be getting the same sort of readings, not from a “medium” but from a mentalist who can produce such readings without claiming paranormal powers. Then you would run the exact same experiment and tally the results. If there were a statistically significant difference between the first group of 8 students and the control group of another 8, then one might reasonably say that more study is needed. This study, as run, lacked a control group. Strike Three!
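
To make that comparison concrete, here is a toy sketch of how the tallies from such a two-group design could be compared. The group sizes and hit counts below are made up purely for illustration, not taken from the study:

```python
import random

# Toy sketch of the proposed control design: compare hit counts from a
# "medium" group and a "mentalist" control group of readings, using a
# simple permutation test.
def permutation_p(hits_a, hits_b, n=16, trials=100_000):
    # Pool all individual hit/miss outcomes, then repeatedly reshuffle them
    # into two groups to see how often chance alone produces a difference
    # at least as large as the observed one.
    pool = [1] * (hits_a + hits_b) + [0] * (2 * n - hits_a - hits_b)
    observed = abs(hits_a - hits_b)
    extreme = 0
    for _ in range(trials):
        random.shuffle(pool)
        if abs(sum(pool[:n]) - sum(pool[n:])) >= observed:
            extreme += 1
    return extreme / trials

# e.g. mediums get 13/16 "correct", mentalists get 11/16 (hypothetical numbers)
print(permutation_p(13, 11))  # large p-value: no evidence of a real difference
```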

#4 Blinding – Is this really a triple-blinded study, as the authors proclaim? Well, remember that triple blinding means the participants are blinded (they do not know whether they are getting the real or the control treatment), the person handing out the treatment does not know what they are giving out, and the statisticians analyzing the results do not know what they are analyzing. This study fails on all three counts.

First, the test subjects were not blinded, simply because there was no control group. Every student knew they were getting a “real” reading. You cannot have participant blinding without a control group, and having the test subject choose between a fake and a real reading does not constitute blinding, especially when the readings are set up to be as different as possible. That is a basic fact, and anyone who has a problem with it does not understand control and blinding as they are used in science.

Second, the mediums were not blinded. To effectively blind the mediums, they should not have known whether the name they were given was that of a dead person or a living one. Not only did the mediums work in complete confidence that they were dealing only with dead people, but they also knew the gender and approximate age of the dead people they were supposed to give a reading for. That is not blinding; that is the opposite of blinding. The medium goes in knowing three pieces of information: the person is indeed dead (so there is no chance of giving a reading for a living person), the person’s gender (gleaned from the name), and the person’s approximate age (roughly late teens to early twenties for the dead peer, and late 40s and higher for the dead parent). That is a lot of information for someone skilled at the guessing game. The way the experiment is set up betrays one important thing: the authors went into this study already assuming the mediums can indeed talk to the dead, so they did not even bother to control for the possibility of fraud or guessing.

Third, there is no indication in the paper of any blinding of the statisticians and the other people involved in interpreting the data. The authors refer to the proxy sitters as their third layer of blinding, but that is not what triple blinding means. As a matter of fact, the presence of the proxy sitters is completely baffling. They did not need to be there, they add nothing to the overall methodology, and it seems their sole purpose was to pass a name on to the medium, which could easily have been done otherwise. Anyone familiar with triple blinding can confirm that this is not it.

So the test subjects and the mediums were not properly blinded, and it appears the statisticians weren’t either. Strike Four!

Other problems with the study

Besides the methodological problems described above, here are more problems that need to be worked out before we can place any reliance on the results of this study.

  • There is no mention here of how accurately the mediums’ readings matched the descriptions of the deceased that the students gave at the beginning. Were there any specific pieces of information provided (such as the deceased’s birth date, death date, place of death, mode of death, social security number, etc., something specific to the person being “read”)?
  • The participants were forced to choose one of the two readings provided. They were not asked to pick a reading only if it very closely applied to their deceased person; they had to choose one of the two. When you combine this 50-50 forced choice with the fact that the students were hand-picked to participate (possibly chosen for their propensity to believe) and the fact that the two readings would have been fairly different (the medium knew the approximate ages), that can easily explain 13 out of 16 hits, as the sketch after this list illustrates. The fact that we lack a control group makes that number almost useless, as we have nothing to compare it to.
  • When the students were chosen from the pool of 1,600, it was done in order to “optimize testing conditions…based on answers “yes” or “unsure” to survey questions about his/her beliefs”, yet no explanation is given of exactly what this optimization process involved.
  • When the dead people were paired, it was done in a way so as to “optimize differences” across various characteristics. Again, no description is given. When they say a pairing was optimized for age, does that mean the age difference within the pair was decreased or increased? The answer is unknown.
  • The second part of the reading was the Life Questions, in which the medium was to answer 4 specific questions about the dead person. The results on the accuracy of these answers are not available.
  • Each medium reading was transcribed and turned into a numbered list of individual items. It is unknown how specific the items the experimenter included in the list were. In other words, did it say “Bob died in a motorcycle crash on I-95” or did it say “Bob died peacefully”? Those kinds of things always matter in a study of this sort.
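
To put rough numbers on the forced-choice point above, here is a minimal sketch using the binomial tail probability. The 0.5 hit rate is pure guessing; the 0.8 rate is a purely illustrative assumption about how much the age and gender cues might help a skilled guesser, not a figure from the study:

```python
from math import comb

# P(X >= k) for X ~ Binomial(n, p): the chance of k or more hits in n readings
def tail(n, k, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(f"{tail(16, 13, 0.5):.4f}")  # ~0.0106: 13+ hits is rare under pure guessing
print(f"{tail(16, 13, 0.8):.4f}")  # ~0.5981: quite likely once cues raise the hit rate
```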

We could get into more detail about other open questions that remain and have not been properly addressed. In the Results section the authors promise more details in a future manuscript, but I have not been able to find it, and, as stated previously, my attempt to contact the lead author was futile.

Conclusion

So what can we take from this study? How reliable is it? Unfortunately for the talking-to-the-dead enthusiasts, this study is scientifically worthless. It had a ridiculously small sample size, it lacked a control group, and it had no randomization or proper blinding, not in a scientific sense at least. There are many other unanswered questions and missing crucial pieces of information that could shed some light on the results. The authors forced the subjects to pick one of the two answers, which alone gives a 50-50 chance of a hit and which, coupled with the other points raised earlier, more than explains the results observed. And more importantly, nothing was reported on the accuracy of the mediums’ readings: how specific they were, especially in the Life Questions section, and how well they matched the subjects’ descriptions on specific items.

Would it not have been easier to ask the students to provide ten pieces of information specific to the dead person, ask the medium to do their reading covering those ten specific pieces of information, and then ask a third party to analyze how well the mediums’ answers matched them, as opposed to relying on a forced choice between two options? I think so. Why wasn’t it done? I’d rather not speculate.

8 Responses

  1. Pekka S said, on August 20, 2010 at 2:27 PM

    Thanks! Very interesting post!

  2. Troythulu said, on August 20, 2010 at 2:28 PM

    Damn. I have to commend you for the work you’ve put into this. As an admitted armchair skeptic (I’m in the process of fixing that), I don’t quite have the skill to do an investigation like this just yet. I strongly suspect that you’re spot-on with this though, and I’ll have to look more closely at some of Ray Hyman’s techniques and tips for analyzing claims like this as well. Good job!

    • Skepdude said, on August 21, 2010 at 5:12 PM

      Well, in the interest of fairness, what I have done is armchair skepticism as well, which is OK, because not all of us can get our hands dirty on the ground, so to speak.

      • Troythulu said, on August 21, 2010 at 11:06 PM

        I guess you’re right about that. I suppose that it isn’t so much armchair skepticism to be avoided as knee-jerk skepticism.

        • Skepdude said, on August 23, 2010 at 11:08 AM

          Yeah, definitely, knee-jerk skepticism is not a very skeptical thing to do, but it is quite human. I’m confident I’ve fallen for it quite a few times myself.

  3. Troythulu said, on August 23, 2010 at 11:31 AM

    Me too (raises hand).

  4. […] Reflections on the Tina Wilkins interview — listen here Skepdude on the Beischel Protocol — read the article […]

  5. Anon said, on October 16, 2010 at 5:06 AM

    There are a few valid points here but also a number of crass errors.
    In any case, it would be good if you provided a reference for the evaluation criteria.

    First error: what you say about sample size relates to a very different situation. Say you want to know who people will vote for in an upcoming election. Maybe 60% of all voters prefer party A, but alas, the random sample of people you poll just so happens to have only a 40% preference for party A.
    The fewer people you ask, the more easily this can happen. This is what the calculator deals with.

    The study is a completely different statistical situation. There is a so-called null hypothesis. This hypothesis is used to calculate the likelihood of the observed outcome.
    In a clinical trial with a treatment group and a control group, the null is that there is no difference between the groups. If there really is no difference, then any seeming difference must be due to the luck of the draw, i.e. during the randomization, when the participants were split into the groups, one group just happened to get more especially frail or sturdy individuals.
    The likelihood of the outcome is called the p-value. When it is low, one concludes that it is more plausible that something other than the luck of the draw was responsible for the difference.

    Second error: from the preceding explanation we can see why randomization is very important in clinical trials. If it is not done properly, then the calculation of the p-value is off.
    More generally, if the randomization is not as assumed by the null hypothesis, then the null is wrong, but for a boring and trivial reason.
    Selecting study participants is not improper randomization; it is quite reasonable. If you wanted to test an antibiotic, you would select subjects with a bacterial infection; testing it on the healthy, or on those with viral infections, would not make sense, unless you were interested in matters other than mere efficacy.

    About control groups: looking for the control group is very good advice. When a medical trial features no control group, you can pretty much forget the result.
    There may be good and legitimate reasons why there is no control group, but the result is dubious at best.
    The error here is that a mentalist group is not a control group. If the medium performs better than the mentalist, then we only know that the medium is better at that task than the mentalist. It does not tell us why.
    Also, a control is not necessarily a control group. It is valid to call the presentation of two readings a control: it rules out the possibility that subjects endorse a reading simply due to the Barnum effect.

    So what is actually wrong with this study?
    I’ll argue that a better question is how we should interpret the result.
    The result was such that, if the null hypothesis is true, we should expect one like it or more impressive in about 1% of cases. Run the exact same experiment 100 times, and on average 1 of the runs will have 13 or more “hits”.
    This is where the small size of the experiment comes in. Running 100 such experiments is easily possible. The null hypothesis may still be true.

    Small experiments also have a reputation for being less well conducted. Just because the description of an experiment is neat and tidy does not mean that the experiment was actually conducted that way. Maybe someone screwed up, and that is why the null is wrong, not for any interesting or repeatable reason.

    These are general issues. They are why one should be careful before taking any study, and most especially any small study, at face value.
    There has been organized study of mediumship for over 150 years. When faced with such a small study, the reaction should be, IMHO, to just say “So what?” There isn’t really any point in giving it any attention.
    Nevertheless, I have still done exactly that.

    Now, let’s assume that the null really is wrong, and for some repeatable reason rather than some unnoted mishap.
    Would that mean that mediumship is real? Of course not.
    It just means that the null is wrong.
    The null in this case is that subjects faced with a reading for them and one for another matched subject will choose their own 50% of the time.
    One reason this may be wrong is that the mediums were given the first name of the deceased. This name is chosen by the parents; it is not random. Names are subject to fashion. As such, a first name may give a clue about the person’s age and background.
    This is a plausible explanation of the result, but I would quite simply not accept it based on the evidence, mainly because the study is so small.
    Also, the lack of a control group is a serious problem here. We only know, or rather suspect, that the null is wrong. We don’t know whether this is in any way connected to the giving of names.
    In some scientific disciplines one cannot use control groups. For example, in the geosciences we don’t have a bunch of control planets. In these disciplines it is often said that “the null is always wrong”, or similar things. We can usually be fairly certain that we have NOT thought of everything that may influence something. We can never be sure that we did.
    Control groups are neat because they relieve us of that necessity. We only need to make sure that the groups are treated identically except for one variable. Then we can conclude that there is a connection between the variable and the outcome.
    That, of course, still does not tell us anything about the nature of the connection.

    This is a general problem with parapsychology: there is just this huge gap between the result in any study and the mind-boggling conclusions that are drawn.
    Trying to find errors in their studies is all nice and well. It is basically trying to find explanations for why the null is wrong, and one does like to explain riddles. But it conceals the fact that errors or mistakes in the protocol or its implementation are not where parapsychology falls down.

    Finally, there is an error in the statistics of this study. The subjects come in (age-)matched pairs. Let’s say you give a reading to the first in the pair saying that the deceased was young. Then you’d have to be stupid not to say to the second subject that his deceased was old.
    There’s a 50/50 chance of getting the first right, but when you get the first right you will also get the second right: either both are right or both are wrong. In actuality, things are probably less clear-cut.
    The way the study calculates it, there should be a 50/50 chance for the first in the pair and another, independent, 50/50 for the second. This does not help the mediums score better, but it will make the score seem more unusual than it really is.
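
    To make the size of this effect concrete, here is a minimal sketch (my own illustration, not a calculation from the study). Under perfect pairing the hit count is always even, so take 14 of 16 as the nearest case, and compare treating the 16 readings as independent trials with treating the 8 pairs as the independent trials:

    ```python
    from math import comb

    # P(X >= k) for X ~ Binomial(n, p = 0.5), the study's own chance level
    def tail(n, k, p=0.5):
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    print(f"{tail(16, 14):.4f}")  # ~0.0021 if 14/16 readings were independent trials
    print(f"{tail(8, 7):.4f}")    # ~0.0352 for the same outcome counted as 7/8 pairs
    ```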


Leave a comment