Interobserver variation in recording behavior: Random or systematic error?
Abstract
The number of investigations using direct behavioral data has increased during the last ten years. The interest in behavioral data has focused attention on observation methodology and emphasized such variables as objectifying data language, training of observers, and using more refined methods of calculating interobserver reliability. The emphasis on these variables assumes observer homogeneity and does not acknowledge systematic differences among observers. Systematic observer bias, when acknowledged, is given only cursory attention or subsumed under recognized forms of observer bias such as knowledge of the experimental hypothesis. This study questioned the assumption of observer homogeneity and hypothesized that observers do show systematic differences in recording behavioral events, differences related to how an observer indicates he would respond to (evaluate) such behavior. Twenty-six Ss were categorized on the basis of their preferred mode of response (positively consequate, negatively consequate, or extinguish) to seven classes of behavior. Subsequently the Ss received familiarization training on coding procedures and then coded the seven behavior classes for four kindergarten-age males in four separate coding sessions. The training and coding sessions utilized video tapes with an audio tone signaling alternate 10-sec observation periods. Each observer's coding record was compared with a coding standard (5 professionals' independent agreements on behavioral occurrence-nonoccurrence) to obtain a deviancy score which indexed the extent and direction of disagreement with the coding standard. Interobserver reliability was calculated in two ways: gross reliability (a frequency ratio between two observers for an observation session) and agreement (paired observers' agreements divided by agreements plus disagreements for the 10-sec observation periods within an observation session). An accuracy score (an agreement index comparing an observer's coding record with the coding standard for agreements-disagreements) was also obtained to determine the relative validity of the observers' coding records and the reliability indices. When deviancy scores were compared for behavior classes the observers would positively consequate, negatively consequate, and extinguish, there were bias trends. However, wide variation in behavior rate for the separate codes tended to mask systematic bias by the observers. The coding standard indicated that the three ways of consequating behavior were comparable only for low behavior rates. When behavior rate was held constant for low rates, there were significant bias effects as well as strong evidence for response style (systematic bias across data sources). In the absence of ongoing behavior (according to the coding standard), behaviors positively consequated by the observers were significantly overrecorded, behaviors extinguished by the observers were intermediate in overrecording bias, and behaviors negatively consequated showed minimal overrecording bias. When behavior rate increased to one ongoing behavior per observation period, overrecording bias was reduced for behaviors the observers positively consequated or extinguished, although the positively consequated behaviors were still overrecorded.
The decrement in observer bias as behavior rate increased was consistent with other data indicating that behavior rate and bias are inversely related for behaviors which are positively consequated or extinguished, while behaviors which are negatively consequated show minimal bias across behavior rates. These results suggest that evaluative differences among observers are an important variable in behavior coding variability, and thus the assumption of observer homogeneity can be rejected. Of the methods of calculating interobserver reliability, the agreement index proved to be more valid than gross reliability. Gross reliability provided an inflated index on all behavior classes for most of the observation sessions. The validity of the gross reliability index is therefore seriously questioned, and it probably should be discontinued as an index of interobserver reliability. The agreement index was most valid for codes which were negatively consequated, was marginal for extinction codes, and proved to be a conservative estimate of data validity for positively consequated codes. These differences in relative validity were directly related to the magnitude and direction of recording bias for the observers. The agreement index is, however, a useful estimate of interobserver reliability and data validity since any error is in a conservative direction.
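The following is a minimal sketch of the two reliability indices described above, assuming each observer's record is a list of occurrence (1) / nonoccurrence (0) marks, one per 10-sec observation period. The function names and data are illustrative only and are not taken from the study; in particular, every period where the two records match is counted here as an agreement, which is one common reading of the agreements-plus-disagreements definition given in the abstract.

def gross_reliability(record_a, record_b):
    # Session-level frequency ratio: the smaller total count divided by the
    # larger, ignoring whether the observers scored the same 10-sec periods.
    total_a, total_b = sum(record_a), sum(record_b)
    if max(total_a, total_b) == 0:
        return 1.0
    return min(total_a, total_b) / max(total_a, total_b)

def agreement_index(record_a, record_b):
    # Period-by-period agreement: agreements / (agreements + disagreements).
    agreements = sum(1 for a, b in zip(record_a, record_b) if a == b)
    return agreements / len(record_a)

# Two observers record the same total frequency but never mark the same
# 10-sec period: the frequency ratio looks perfect while interval-by-interval
# agreement is zero.
observer_1 = [1, 1, 0, 0, 1, 0]
observer_2 = [0, 0, 1, 1, 0, 1]
print(gross_reliability(observer_1, observer_2))  # 1.0 (inflated)
print(agreement_index(observer_1, observer_2))    # 0.0

Under these assumptions, the example makes concrete why a session-level frequency ratio can overstate reliability relative to the period-by-period agreement index.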