Part I - The Science of Comparing Learning Protocols: Blog Post II on the Cormack & Grossman Article by Herbert L. Roitblat, Ph.D. Member, Board of Governors, OLP

28 May 2014 8:46 AM | Chere Estrin (Administrator)

This is my second blog post about Cormack and Grossman’s latest article, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.” In their article they compared several training protocols, one of which used simple random sampling to select the training set. In my previous blog, I argued that they are CAR vendors, that they have a vested interest in making their algorithm appear better than others, and that the system they tested using random sampling bears almost no resemblance to any system that actually uses random sampling to do predictive coding.

In this post, I want to focus more on the science in the Cormack and Grossman article. It seems that several flaws in their methodology render their conclusions not just externally invalid (they do not apply to systems that they did not study) but internally invalid as well (they do not even apply to the systems they did study).

Their paper has been accepted for presentation at the annual conference of SIGIR (The Special Interest Group on Information Retrieval) of the ACM (The Association for Computing Machinery). I am also a member of this organization.

External Validity

Poor performance on a caricature of a random sampling system or one using simple active learning does not indict all systems that use random sampling or simple active learning. As a demonstration that this inference is not warranted, I showed results from some OrcaTec projects that used random sampling. Rather than returning poor results, these projects returned results that were better than Cormack and Grossman’s favored algorithm. My point was not to show that their CAL algorithm is not good, but rather, in contrast to their claims, to show that random sampling can lead to high levels of predictive accuracy.

Some of the differences between the system that they tested and the OrcaTec system include:

  • Solving for a different task
  • Using a different underlying learning model
  • Using a different underlying document representation

What they describe as the “TAR task” is different from the task that most predictive coding systems seem to use, including OrcaTec’s. Cormack and Grossman’s task is to continuously optimize the documents presented for review, whereas it is more common to divide the predictive coding process into two parts, one where the system is trained, and a second where the system’s predictions are reviewed. The continuous task is well suited to their CAL model, but not to the other two systems that they tested.

Cormack and Grossman’s SPL system uses a support vector machine as its underlying machine learning model. The support vector machine is common among several predictive coding systems, but it is not used in the OrcaTec system. Because the two systems use different underlying models with different learning properties, it is not valid to extend the results obtained with one to the other.

The Cormack and Grossman system used a peculiar way of representing the documents in the collection, whereas the OrcaTec system uses words. I discuss this issue in more detail below, but this difference means that the results of their SPL method cannot be validly applied to other forms of random-sampling training.

These differences limit the validity of any comparisons between the results that Cormack and Grossman obtained and the results of competing predictive coding providers, including OrcaTec.

All other things being equal, one might be willing to say that in their specific tasks, with their specific collections, and with their specific models, their Continuous Active Learning (CAL) system performed better than the Simple Active Learning (SAL) or Simple Passive Learning (SPL) systems, but even this comparison turns out to be invalid.

Internal Validity

As mentioned, the Cormack and Grossman task is well suited to CAL, but because neither of the other two systems is designed to present the most relevant documents for review at each of the training stages, their task does not apply to them. Presenting random documents for review (SPL) or the most uncertain documents for review (SAL) is simply not designed to produce highly responsive documents during the initial stages of training. These systems are designed for a two-stage process of learning followed by prediction. Whatever the technical merits of a system, pitting one designed to produce responsive documents at the beginning of the task against two systems that are designed to gather information about documents at the beginning of the task is not a valid comparison.

Their choice of task strongly biases their results in favor of the CAL system, meaning that their internal comparisons, within their study, are invalid. It is metaphorically like comparing a dolphin and a horse and saying, oh look, that dolphin does not run so fast in the desert.

The bigger issue is with how they constructed the set of documents that they used to measure the performance of the system. Their so-called “gold standard,” unless I am seriously mistaken, guarantees that their CAL system and not any of the others will be successful.

The measurement of Recall (the completeness of the process identifying responsive documents) and Precision (the selectivity of the process identifying responsive documents), requires that we know the true status (responsive or non-responsive) of the documents being assessed by the TAR/CAR system. How that standard set is constructed is critically important. As they note: “To evaluate the result, we require a ‘gold standard’ indicating the true responsiveness of all, or a statistical sample, of the documents in the collection.”
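Both measures depend entirely on which documents the gold standard labels as responsive, which is why the construction of that standard matters so much. A minimal sketch (the document IDs and sets are hypothetical):

```python
def recall_precision(predicted, gold_responsive):
    """Recall and Precision of a predicted-responsive set, measured
    against a gold-standard set of truly responsive documents."""
    true_pos = len(predicted & gold_responsive)
    recall = true_pos / len(gold_responsive) if gold_responsive else 0.0
    precision = true_pos / len(predicted) if predicted else 0.0
    return recall, precision

# Hypothetical document IDs: the gold standard fixes which hits count.
gold = {1, 2, 3, 4, 5}
predicted = {2, 3, 4, 6}
r, p = recall_precision(predicted, gold)
print(r, p)  # 0.6 0.75
```

Change the gold set and both numbers change, even though the system's predictions are identical; that is the leverage an invalid standard set exerts.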

Here I want to focus on how the true set, the so-called “gold standard” was derived for the four matters they present. They say that for the “true” responsiveness values “for the legal-matter-derived tasks, we used the coding rendered by the first-pass reviewer in the course of the review. Documents that were never seen by the first-pass reviewer (because they were never identified as potentially responsive) were deemed to be coded as non-responsive.”

If I am reading this correctly, it means that the CAL algorithm and (perhaps) the keyword search were used to identify the set of responsive documents. If true, this approach is a very serious problem. They used the tool that they want to measure to derive the standard against which they were planning to measure it. They used the prediction of the CAR model as the standard against which to measure the prediction of the CAR model. The reviewers could decrease the number of documents that were deemed responsive, but apparently they could not increase it.

It is not at all surprising that the documents predicted to be responsive during the standard setting were the same documents identified by the same system during the assessment. They were both predicted by the same system.

The double use of the same measurement means that only the documents that were identified as responsive by the CAL algorithm or by the keyword search used to create the seed set could be considered responsive. If either of the other two algorithms identified any other documents as responsive, they would have matched an unseen document, and so by Cormack and Grossman’s definition, they would have been counted as an error. Again, if true, it means that only the CAL algorithm could consistently produce documents that would be called responsive. The other algorithms could produce correctly classified responsive documents only if they happened to agree with the CAL algorithm or the keyword search. Only those documents would have been reviewed and so only those could be candidates for being responsive.
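The circularity can be made concrete with set arithmetic. In this hypothetical sketch (all document IDs invented), the gold standard is built only from the keyword hits and the CAL predictions, so a rival system's genuinely responsive but novel finds are scored as errors:

```python
# Hypothetical document IDs throughout.
truly_responsive = {1, 2, 3, 4, 5, 6, 7, 8}   # unknowable ground truth
keyword_hits     = {1, 2, 3}
cal_predictions  = {2, 3, 4, 5}

# Gold standard as described in the critique: only documents surfaced
# by the keyword search or by CAL were ever reviewed; everything else
# is deemed non-responsive by construction.
gold = keyword_hits | cal_predictions          # {1, 2, 3, 4, 5}

rival_predictions = {4, 5, 6, 7}               # finds novel docs 6, 7

# Against the true set, every rival prediction is responsive...
assert rival_predictions <= truly_responsive

# ...but against the constructed gold standard, the novel finds
# (6 and 7) are counted as false positives.
false_positives = rival_predictions - gold
print(sorted(false_positives))  # [6, 7]
```

By construction, CAL can never be penalized this way for the documents it surfaced, while any competitor that looks beyond the CAL-plus-keyword pool is.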

Cormack and Grossman report the relative contributions of the keyword search and the CAL algorithms. Recall and Precision for the keyword searches used to generate the seed sets are shown in their Table 3. For two of the matters (B and D), the keyword searches yielded high Recall (presumably when judged against the merged standard of keyword and CAL), but low to moderate Precision. In fact, in these two matters, the keyword search resulted in more responsive documents than the CAL algorithm did. For the other two matters (A and C), the CAL algorithm identified more responsive documents than the keyword search did.

Their keyword seed set sometimes showed better Recall than their CAL results.

Table 3: Keyword seed queries
Columns: Recall | Precision | Max CAL Recall | Recall 90 | Precision at Recall 90

In two of the four matters, the CAL method ended with poorer Recall than the keyword search used to generate the seed set had achieved on its own.

  • Keyword Recall and Precision are from their Table 3.
  • Max CAL Recall is the maximum Recall that CAL achieved in the study.
  • Recall 90 is their first Recall above 0.90, or the maximum Recall if it never reached 0.90.
  • Precision at Recall 90 is the last column.

The CAL algorithm selects for review the top 1,000 documents that it predicts to be responsive. So this algorithm does not present any documents for review that are not predicted to be responsive. If a document is responsive, but does not resemble any of the documents already predicted to be responsive, then it will not be found, except, perhaps at the very end of training when the number of remaining similar documents is exhausted. Documents that are similar to the seed set are considered, documents that are not similar to the seed set are unlikely to ever be considered, let alone predicted to be responsive. This limitation is hidden by the apparent fact that only similar documents were contained in the validation set (the “gold standard”). The system can appear to be performing well, even when it is not. An invalid standard set means that any measures of accuracy compared to this standard are also, necessarily, invalid.
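This dynamic can be illustrated with a toy relevance-feedback loop. The sketch below is not Cormack and Grossman's implementation; the documents, vocabulary, and word-overlap scoring are all hypothetical, but the behavior is the one described above: documents resembling the seed set are reviewed first, and a responsive document on a disjoint topic is never presented within the review budget.

```python
# Minimal sketch of a continuous-active-learning-style loop.
# Hypothetical documents represented as bags of words; ranking is
# by overlap with words seen in responsive documents so far.

def score(doc_words, responsive_words):
    return len(doc_words & responsive_words)

collection = {
    "d1": {"fraud", "invoice"},   # similar to the seed set
    "d2": {"fraud", "payment"},   # similar to the seed set
    "d3": {"safety", "defect"},   # responsive, but a disjoint topic
    "d4": {"lunch", "menu"},      # non-responsive
}
truly_responsive = {"d1", "d2", "d3"}
responsive_words = {"fraud"}      # learned from the seed set

unreviewed = ["d1", "d2", "d3", "d4"]
reviewed = []
for _ in range(2):                # a small review budget
    # Present the top-ranked unreviewed document for review.
    ranked = sorted(unreviewed,
                    key=lambda d: -score(collection[d], responsive_words))
    doc = ranked[0]
    unreviewed.remove(doc)
    reviewed.append(doc)
    if doc in truly_responsive:   # reviewer feedback updates the model
        responsive_words |= collection[doc]

print(reviewed)            # the seed-similar d1 and d2 surface first
print("d3" in reviewed)    # the disjoint responsive d3 is never seen
```

If the validation set is itself built from what this loop surfaced, d3's absence costs the system nothing in measured Recall.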

Cormack and Grossman do not provide, therefore, even a valid measure of what their CAL system has accomplished. The CAL system may be quite good, but we cannot tell that from this study without validation on an independent set of documents that were not generated by the CAL system itself.

In text classification, the more disjoint the relevant topics are (where disjoint topics share few target words or other features), the more difficult it will be for a method like Cormack and Grossman’s to find them all. Their method depends on the starting condition (the seed set) because it ranks documents by their resemblance to the seed set. It is very difficult to add to the seed set in their situation, because documents that are not similar to the seed set are not encountered until the similar ones have been exhausted. And in their study, those documents would automatically have been counted as non-responsive.

Random sampling is known to be slower than certain other training algorithms, under some conditions, but more resistant to noise.  As mentioned, it is also more resistant to training bias because the items selected for review are not determined by the success or failure of previous predictions. Whether it will, in fact, be poorer depends on the complexity of the space. Even if it is somewhat slower, though, it still has the advantage of being open to identifying disjoint responsive documents.
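That openness can be quantified. Because a simple random sample gives every document the same chance of selection regardless of prior predictions, the probability of hitting at least one member of a disjoint responsive cluster follows directly from the hypergeometric distribution. The collection and cluster sizes below are invented for illustration:

```python
from math import comb

def p_cluster_sampled(N, k, n):
    """Probability that a simple random sample of n documents from a
    collection of N hits at least one member of a disjoint responsive
    cluster of size k."""
    return 1 - comb(N - k, n) / comb(N, n)

# E.g., a 50-document disjoint cluster in a 100,000-document
# collection, with 2,000 randomly sampled training documents:
print(round(p_cluster_sampled(100_000, 50, 2_000), 3))
```

Even a small disjoint cluster has a substantial chance of contributing at least one training example, whereas a similarity-driven ranker may never reach it at all.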

