Part II - The Science of Comparing Learning Protocols: Blog Post II on the Cormack & Grossman Article by Herbert L. Roitblat, Ph.D. Member, Board of Governors, OLP

28 May 2014 2:50 PM | Chere Estrin (Administrator)

Other Concerns

Another possible limitation of Cormack and Grossman’s study is the document representation that they used. Their CAL algorithm could be run with any kind of document representation, but they chose one that may be particularly difficult for other approaches, such as random sampling. 

Cormack and Grossman’s method destroys any semantics (meaning) in the documents.  By breaking the text into 4-character shingles (including spaces and punctuation, I think), it destroys the morphemes.  Morphemes are the basic units of meaning in a language. Words consist of one or more morphemes. Cormack and Grossman’s method treats documents as sequences of characters, not words.

They do two other things that make training more difficult.  The number of distinct words in a document collection is typically on the order of around 100,000 for small document sets, growing by roughly one new word per document beyond this.  About half of those words occur exactly once in the corpus.  The number of four-character shingles  is much larger.  Most, if not all machine learning algorithms are sensitive to the number of distinct “dimensions” on which they are trained. A dimension is simply a variable on which the items may differ, for example the presence or absence of a word, o the presence/absence of a letter shingle.

Typically information retrieval systems work to reduce the number of distinct features to classify, but Cormack and Grossman appear to have chosen a representation that increases them.  Instead of limiting the dimensions to a few hundred thousand words, Cormack and Grossman use shingles. They note that there are 4,294,967,296 potential shingle representations , which they reduce through hashing to 1,000,081 distinct codes. Each code (each of the 1 million final values) represents about four thousand different four-character shingles (4 billion potential shingles to 1 million codes).

If I understand correctly, they encode each shingle originally as a 32-bit binary string (whether a character, including some punctuation, is present or not in the shingle), throwing away repeat characters in the shingle.  They then encode the shingles into a million-bit binary string, throwing away the number of times each shingle occurred in a document.  That’s a lot of information to throw away. This representation may work well for the CAL algorithm, but it is not optimal for other machine learning algorithms.


In conclusion, Cormack and Grossman claim that:

•Random sampling yields inferior results
•Reliance on keywords and random sampling is a questionable practice, certainly not necessary to train predictive coding
•The CAL process is not subject to starting bias

None of these claims is supported by the evidence that they present. Their use of an inadequate and biased evaluation standard, a so-called “gold standard,” makes it very difficult to draw any real conclusions from this study. Other design decisions in the study also strongly bias the results to favor their CAL system.

There is no reason to think that their results imply that random sampling per se is the issue. The best that they could reasonably claim is that it did not work well in the situation that they arbitrarily constructed, with the algorithm that they chose for it, and measured against a task that was also biased in favor of their CAL algorithm. The fact that random sampling does work well in other situations belies their conclusion. Random sampling can, in fact be very effective at identifying responsive documents and it can be verified by an independent standard of judgment.

Cormack and Grossman raise a straw-man argument that others claim that only random sampling is valid for training predictive coding. I don’t know who actually makes this claim. I believe that random sampling is sufficient to produce accurate results, but I do not argue that random sampling is necessary.

At OrcaTec, we frequently use combinations of pre-categorized documents and random sampling, but random sampling alone is efficient and sufficient for most of our matters. Further, there are other reasons to want to use random sampling for training predictive coding.
As Cormack and Grossman note, random sampling can give more representative training examples, their trumped up example of a system learning that spreadsheets are not responsive aside. They are correct in noting that random sampling is more likely to capture common categories of documents than rare ones, but unless there is some method for identifying rare documents, no statistical learning system is going to be very good at identifying them. If the users know of examples of rare responsive documents, then these can be included in a training set, followed by random sampling. As Cormack and Grossman note, however, there is no scientifically useful protocol available for identifying these rare instances.

Random sampling gives direct, unbiased, and transparent feedback as to how well the system is predicting responsive documents.

I am truly puzzled that Cormack and Grossman argue that their CAL system is not affected by starting bias. They argue that a new classifier is generated after every 1,000 documents. This classifier derives its positive examples from the documents that have previously been identified as responsive and then presents more like those for further review. I see no opportunity for a rare unanticipated document type to be selected.

Random sampling, on the other hand, presents for review a representative sample of documents without regard for the system’s predictions about which are responsive. New and rare topics, thus, have the opportunity to be included in the training set, but they are not guaranteed to be included.

Finally, to paraphrase Mark Twain, any reports concerning the death of random sampling for training predictive coding are grossly exaggerated. Far from being a questionable practice, if your goal is to minimize the number of unnecessary non-responsive documents to review while maximizing the accuracy of your selection, then random sampling can provide a powerful, defensible, means to achieve that goal.


Powered by Wild Apricot Membership Software