victorykmfk.blogg.se - Data dredging vs data mining

Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.

Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. It is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information (with intelligent methods) from a data set and transform the information into a comprehensible structure for further use.

If you use exploratory analyses to generate hypotheses, be sure to test those hypotheses on data sets other than the one used for exploratory analysis. If you have a very large data set (with hundreds or thousands of samples), it may be feasible to use a random subset of samples for exploratory analysis and test any hypotheses derived from it on the remaining samples. If you are using data mining procedures to test large data sets for 'significant' associations, be sure to correct for multiple testing and other purely statistical phenomena that might mislead interpretation.
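To make the advice about multiple testing concrete, here is a minimal sketch of two common corrections, Bonferroni and Benjamini-Hochberg, written in plain Python. The function names and the list of p-values are illustrative assumptions, not part of the original text; in practice the p-values would come from your own battery of tests.

```python
def bonferroni(p_values, alpha=0.05):
    """True where a test stays significant after Bonferroni correction."""
    m = len(p_values)
    # Each test is held to the stricter threshold alpha / m.
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """True where a test is rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from ten separate association tests (made up for illustration).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

uncorrected = [p <= 0.05 for p in p_values]
print("uncorrected:", sum(uncorrected))
print("Bonferroni:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)))
```

With these made-up values, five tests fall below 0.05 before correction, but only one survives Bonferroni and two survive Benjamini-Hochberg. Bonferroni controls the chance of any false positive and is the more conservative of the two. Benjamini-Hochberg controls the expected proportion of false positives among the rejected tests, which is often a better fit for large exploratory screens.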

"Data dredging" (sometimes called "data fishing") is a real risk which may invalidate any conclusions you draw from your analysis. It can take forms such as the following.

Exploratory analyses are used to find subsets of data that confirm (or are more likely to confirm) an a priori hypothesis which may not be generalisable to the whole (statistical) population.

Exploratory analyses are used to generate a hypothesis from a given data set which is then tested using the same data set.

To avoid this, evaluate whether your data supports a hypothesis based on previous knowledge and research. If using data transformations or discarding data, ensure that there is a solid rationale for doing so. If not, you may simply be 'massaging' the data for a (probably false) signal.
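The danger of generating and testing a hypothesis on the same data can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not a method from the original text: it builds 50 features of pure random noise, picks the one most correlated with a random outcome in an exploratory half of the samples, and then re-checks that feature on the held-out half. All names, sizes, and the seed are arbitrary choices.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def subset(values, idx):
    """Pick the entries of `values` at the given sample indices."""
    return [values[i] for i in idx]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples, n_features = 200, 50

# Every feature and the outcome are independent noise: there is no real signal.
features = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_features)]
outcome = [rng.gauss(0, 1) for _ in range(n_samples)]

# Randomly split the samples into an exploratory half and a confirmatory half.
indices = list(range(n_samples))
rng.shuffle(indices)
explore_idx, confirm_idx = indices[:100], indices[100:]

y_explore = subset(outcome, explore_idx)
y_confirm = subset(outcome, confirm_idx)

# "Dredge" the exploratory half: keep the feature with the strongest correlation.
explore_r = [pearson(subset(f, explore_idx), y_explore) for f in features]
best = max(range(n_features), key=lambda j: abs(explore_r[j]))

# Test that single, pre-selected hypothesis on samples it has never seen.
confirm_r = pearson(subset(features[best], confirm_idx), y_confirm)

print("best feature:", best)
print("correlation on exploratory half:", round(explore_r[best], 3))
print("correlation on confirmatory half:", round(confirm_r, 3))
```

Because the winning feature was chosen for having the largest correlation among 50 candidates, its exploratory correlation is inflated by selection even though no real relationship exists. The confirmatory correlation carries no such selection effect and will usually be much closer to zero, which is why the hypothesis should be judged on the held-out samples only.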