
View Diary: USA Today: Key climate denier report was plagiarism (162 comments)


  •  The problem with "Data Mining" (1+ / 0-)
    Recommended by:
    ebohlman

    When you publish statistically significant findings, there is usually a p value threshold that must be met.  A p value of 0.01 means that, if there were no real effect, there would be only a 1 in 100 chance of seeing a result at least this extreme by chance alone.  Some journals accept p values of up to 0.05 (1 in 20), but many are starting to demand 0.01 or even 0.001 (1 in 1000).

    Now there is another problem: as you examine more sets, your p value threshold must become more stringent.  Why?  Because every additional test is another chance for a false positive.  This is especially important when you examine thousands of sets and variables with p set at 0.01.
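
    To put rough numbers on that, here is a minimal sketch.  It assumes the tests are independent and that no real effect exists in any of them; the function names are mine, chosen for illustration.  The Bonferroni correction shown is one common (and conservative) way to tighten the threshold, not the only one.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent tests,
    each run at threshold alpha, when no real effect exists."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test threshold that keeps the overall false positive
    chance at or below alpha across m tests."""
    return alpha / m


if __name__ == "__main__":
    for m in (1, 10, 100, 1000):
        fwer = familywise_error_rate(0.01, m)
        print(f"{m:>5} tests: chance of >=1 false positive = {fwer:.4f}, "
              f"Bonferroni threshold = {bonferroni_threshold(0.01, m):.6f}")
```

    With a threshold of 0.01, a single test has a 1% chance of a false positive, but by 100 tests the chance of at least one is already above 60%.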

    So if you set your p value at 0.01 and examine 10,000 items in which nothing real is going on, you will get on average 100 false positives (10,000 x 0.01).
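
    A quick simulation shows the same thing.  This is only a sketch under a stated assumption: when there is no real effect, p values are uniformly distributed between 0 and 1, so each "test" here is just a random draw rather than a real statistical test on data.  The function name and the seed are illustrative choices of mine.

```python
import random


def count_false_positives(n_tests: int, alpha: float, seed: int = 0) -> int:
    """Draw n_tests p values under the null (uniform on [0, 1)) and
    count how many fall below alpha, i.e. look 'significant'."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_tests) if rng.random() < alpha)


if __name__ == "__main__":
    n_tests, alpha = 10_000, 0.01
    print("expected false positives:", n_tests * alpha)
    print("simulated false positives:", count_false_positives(n_tests, alpha))
    # Same draws, but with the threshold divided by the number of tests.
    print("with corrected threshold:",
          count_false_positives(n_tests, alpha / n_tests))
```

    The simulated count lands near 100 and varies from seed to seed; with the threshold divided by the number of tests, it drops to essentially zero.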

    It doesn't surprise me that 100 runs out of 10,000 random ones generated a hockey stick if the p value is set to 0.01.  That's how statistics works, and that is the problem with data mining.
