Reliability and Validity
The term reliability refers to the consistency of a research study or measuring test; reliability is thus the degree of consistency of a measure. A test is reliable when it gives the same result repeatedly under the same conditions.
For example, if a person weighs themselves several times over the course of a day, they would expect to see similar readings. Scales that measured weight differently each time would be of little use.
Types of Reliability
1. Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time.
- Example: A test designed to assess student learning in psychology could be given to a group of students twice, with the second administration perhaps coming a week after the first. The obtained correlation coefficient would indicate the stability of the scores.
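The Time 1 / Time 2 correlation described above can be sketched in a few lines of Python. The scores below are made up for illustration; the same computation serves for parallel forms reliability, where the two score lists come from the two versions of the test rather than two administrations.

```python
# Hypothetical scores of ten students on the same psychology test,
# administered at Time 1 and again one week later (Time 2).
time1 = [72, 85, 64, 90, 78, 55, 88, 69, 74, 81]
time2 = [70, 88, 60, 92, 75, 58, 85, 72, 71, 83]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

r = pearson_r(time1, time2)
print(f"Test-retest reliability: r = {r:.2f}")
```

A coefficient close to 1 indicates that the scores are stable over time; a low coefficient would suggest the test is unreliable.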
2. Parallel forms reliability is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same construct, skill, knowledge base, etc.) to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions.
- Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms.
3. Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed.
- Example: Inter-rater reliability might be employed when different judges are evaluating the degree to which art portfolios meet certain standards. Inter-rater reliability is especially useful when judgements can be considered relatively subjective. Thus, the use of this type of reliability would probably be more likely when evaluating artwork as opposed to maths problems.
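Agreement between two raters can be quantified in several ways; one common statistic is Cohen's kappa, which corrects the raw percentage of agreement for the agreement that would be expected by chance. A minimal sketch, using two hypothetical judges rating ten portfolios against a standard:

```python
from collections import Counter

# Hypothetical categorical ratings of the same ten art portfolios
# by two independent judges.
rater_a = ["meets", "meets", "below", "meets", "below",
           "meets", "meets", "below", "meets", "below"]
rater_b = ["meets", "meets", "below", "below", "below",
           "meets", "meets", "meets", "meets", "below"]

def cohens_kappa(a, b):
    """Inter-rater agreement corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same category at random.
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")
```

Kappa runs from below 0 (worse than chance) to 1 (perfect agreement), so it gives a fairer picture than raw percentage agreement when one category dominates.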
4. Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results.
- Average inter-item correlation is a sub type of internal consistency reliability. It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients. This final step yields the average inter-item correlation.
- Split-half reliability is another subtype of internal consistency reliability. The process of obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items. The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.
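Both sub-types of internal consistency reduce to ordinary Pearson correlations. The following sketch computes each, using small made-up score matrices (all data below are hypothetical):

```python
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# --- Average inter-item correlation ---------------------------------
# Four hypothetical reading-comprehension items; each row holds one
# item's scores across the same six respondents.
items = [
    [4, 3, 5, 2, 4, 1],
    [5, 3, 4, 2, 5, 2],
    [4, 2, 5, 3, 4, 1],
    [3, 3, 4, 1, 5, 2],
]
pairs = list(combinations(items, 2))
avg_inter_item = sum(pearson_r(a, b) for a, b in pairs) / len(pairs)

# --- Split-half reliability -----------------------------------------
# Five hypothetical respondents answering a six-item quiz (1 = correct).
# Odd- and even-numbered items form the two "sets"; each person's total
# within each set is then correlated across people.
responses = [
    [1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 1],
]
set1 = [sum(row[0::2]) for row in responses]  # items 1, 3, 5
set2 = [sum(row[1::2]) for row in responses]  # items 2, 4, 6
split_half = pearson_r(set1, set2)

print(f"Average inter-item correlation: {avg_inter_item:.2f}")
print(f"Split-half reliability: {split_half:.2f}")
```

In practice the split-half figure is usually adjusted upwards with the Spearman-Brown formula, since each half is only half the length of the full test.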
Validity
Validity refers to the extent to which an indicator (or set of indicators) really measures the concept under investigation. There are five main ways by which sociologists and psychologists might determine the validity of their indicators:
- Face validity,
- Concurrent validity,
- Convergent validity,
- Construct validity, and
- Predictive validity.
How can sociologists assess the validity of measures and indicators?
Five ways of assessing measurement validity in social research:
- Face validity – Face validity is simply achieved by asking others with experience in the field whether they think the measure seems to be measuring the concept. This is essentially an intuitive process.
- Concurrent validity – Researchers simply compare the results of one measure to another which is known to be valid (known as a ‘criterion measure’). For example, with gamblers, betting accounts give us a valid indication of how much they actually win or lose, but the wording of questions designed to measure ‘how much they win or lose in a given period’ can yield vastly different results. Some questions provide results which are closer to the hard financial statistics, and these can be said to have the highest degree of concurrent validity.
- Predictive validity – here a researcher uses a future criterion measure to assess the validity of existing measures. For example, we might assess the validity of one type of school qualification as a measure of academic intelligence by looking at how well its students go on to do at college compared to A-level students with equivalent grades.
- Construct validity – here the researcher is encouraged to deduce hypotheses from a theory that is relevant to the concept. However, there are problems with this approach as the theory and the process of deduction might be misguided!
- Convergent validity – here the researcher compares her measures to measures of the same concept developed through other methods. Probably the most obvious example of this is the British Crime Survey as a test of the ‘validity’ of Police Crime Statistics. The BCS shows us that different crimes, as recorded in police crime statistics, have different levels of convergent validity – vehicle theft is relatively high, vandalism relatively low, for example.
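Both concurrent and convergent validity ultimately come down to correlating one measure against another measure of the same concept. A minimal sketch using the gambling example above, with entirely made-up figures:

```python
# Hypothetical data for eight gamblers: actual monthly losses taken
# from betting-account records (the criterion measure, in pounds),
# and the losses they reported in answer to a survey question.
account_records = [120, 45, 300, 80, 10, 150, 60, 220]
self_reported   = [100, 50, 250, 90, 20, 120, 55, 200]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

r = pearson_r(account_records, self_reported)
print(f"Agreement with criterion measure: r = {r:.2f}")
```

A survey question whose answers correlate strongly with the criterion measure can be said to have high concurrent validity; a weak correlation would suggest the question wording is producing invalid answers.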