This study investigated the test-retest reliability of a behaviorally recorded work sample test. Although previous research has demonstrated the internal consistency and interrater reliability of work sample scores, only limited attention has been paid to the test-retest reliability of subjects' performance. The subjects were forty-two students enrolled in technical courses that taught operation of the engine lathe. The subjects were administered a work sample test, a related job knowledge test, and a mechanical aptitude test in a counterbalanced design during a regularly scheduled lab period. The same tests were readministered one week later. The work sample test was developed as part of a machinist training program at a local refinery and employed both checklist recording of process behaviors and evaluation of final product characteristics. The work sample required the use of a lathe to turn round metal stock to a specified diameter and to cut threads on a section of that diameter to fit a nut. A modification of the checklist used to record process behavior on the work sample served as the job knowledge test. The subjects were given a statement of a similar problem and asked to check those behaviors on the checklist which they felt represented the correct procedures to follow. The mechanical aptitude measure was the Bennett Mechanical Comprehension Test. Significant test-retest reliabilities were found for all tests. The work sample reliability (r = .62) was similar to that of the job knowledge (r = .73) and mechanical aptitude (r = .77) tests. However, work sample scores improved significantly on retest. It was speculated that this improvement was due to subject inexperience, and replication of the study with experienced subjects was recommended. Implications of the observed practice effect for work sample applications were discussed. The three tests were found to be relatively independent.
The mechanical aptitude measure was related to work sample performance only within the initial trial; none of the other correlations were significant, although they were generally positive. Surprisingly, job knowledge test performance was not clearly related to work sample performance. It appeared that work sampling measured aspects of task performance not evaluated by paper-and-pencil tests. The implications for the use of similar written tests in selection testing were discussed. Both the recording of process behavior and the evaluation of final product characteristics were found to exhibit test-retest and interrater reliabilities that compared favorably with those reported in previous research. The relative advantages and disadvantages of each strategy were discussed, and it was recommended that both be used whenever possible. Subscores based on three previously identified critical performance dimensions were found to be reliable (test-retest and some interrater). Two of the dimensions, Use of Tools and Follows Job Procedure, were related to process behavior, and the third was the Accuracy and Appearance of the final product. These critical dimensions were viewed as possible building blocks for an improved system for classifying and describing complex job behavior.