  • 28 Oct 2014 9:52 AM | Chere Estrin (Administrator)
    In every investigation, whether it is a criminal or legal investigation, there are five golden Ws that the investigator must answer in order to be successful. These are:

    • WHO is it about?
    • WHAT happened?
    • WHEN did it take place?
    • WHERE did it take place?
    • WHY did it happen?
    Some might even consider two other pertinent questions:
    • HOW did it happen?
    How do you cull through the vast volume of data to find these answers? Manually analyzing the data is very time consuming and requires lots of resources, because you do not know exactly what you are looking for. What words should you search for in order to find the “smoking gun”? How do you find patterns among the words? Do you highlight the words as you manually review each of the documents, then go back and see how they are connected? It’s not that simple. Criminals use aliases; transfers may be done by unknown off-shore companies or via unknown bank accounts, etc. All of this complicates and slows down the investigation.

    In addition, the size of electronic data that needs to be investigated continues to grow with increasing complexity, exacerbating the problem. Of course, there are technologies that can help expedite the investigation. Computer technology can help analyze large data sets at tremendous speed for specific patterns. In combination with other technological advances including text mining, computational linguistics, statistics, machine learning and even artificial intelligence, it is much easier to analyze the data specifically focused on finding the five Golden Ws.

    Modern text mining and content analytics can search on a higher level than just key words. For example, with text mining, linguistic patterns like ‘someone pays someone else’ or ‘someone meets someone else at a certain location and at a certain time’ can be identified without using the exact names or amounts. By extracting such patterns and combining them with simple statistics, one can easily identify unknown persons, companies, bank account numbers, and also spot code names and aliases.
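
    As a rough illustration of the ‘someone pays someone else’ idea, the sketch below uses a regular expression to pull payer, amount and payee out of sentences without knowing any of the names in advance, then counts how often each party appears. The pattern, the verb list, the sample sentences and the function names are all hypothetical; real text mining tools rely on much richer linguistic analysis than a single regular expression.

```python
import re
from collections import Counter

# Hypothetical pattern: a capitalized name, a payment verb, a dollar amount,
# the word "to", and another capitalized name. No specific names are listed.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
PAYMENT = re.compile(
    rf"(?P<payer>{NAME}) (?:paid|pays|transferred|wired) "
    rf"(?P<amount>\$[\d,]+(?:\.\d+)?) to (?P<payee>{NAME})"
)

def extract_payments(text):
    """Return (payer, amount, payee) tuples for every match in the text."""
    return [(m["payer"], m["amount"], m["payee"]) for m in PAYMENT.finditer(text)]

def count_parties(payments):
    """Simple statistics: how often each party appears as payer or payee."""
    counts = Counter()
    for payer, _amount, payee in payments:
        counts[payer] += 1
        counts[payee] += 1
    return counts

# Made-up sample sentences, for illustration only.
docs = [
    "John Smith paid $5,000 to Blue Harbor Ltd.",
    "Blue Harbor Ltd wired $4,500 to Falcon.",
    "Falcon pays $4,000 to Maria Lopez.",
]
payments = [p for doc in docs for p in extract_payments(doc)]
counts = count_parties(payments)
for party, n in counts.most_common():
    print(party, n)
```

    A party such as the made-up “Falcon”, which shows up repeatedly in payment patterns but in no known list of people or companies, is the kind of candidate code name or alias an investigator would then check by hand.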

    Criminals will try to cover up illegal activities by hiding information in non-searchable file formats or by embedding different types of electronic objects within complex compound files where the most relevant information is often hidden in the deepest layers. Your solution needs to identify information even when it is hidden in the deepest layers and be able to search those seemingly unsearchable formats such as bitmaps, images, non-searchable PDFs, audio files or even video. By combining text mining with advanced analytics, relevant information can be identified many times faster and more efficiently than humans could ever manage. The investigators can easily validate the relevant information to prevent so-called tunnel vision and identify invalid evidence or investigation directions.

    Over the years, I have seen many real-life cases where this hybrid man-machine approach has identified twice the amount of relevant information with half the resources in half the time! This is a great example where Big Data analytics can lead to Big Savings!

  • 15 Oct 2014 2:07 PM | Chere Estrin (Administrator)

    In The T-Shaped Lawyer, Nelson and Simek make a persuasive argument that in the case of lawyers, the “T” should stand for technology.1 Other visionaries within the legal profession indicate that one of the areas within law that is likely to have future career opportunities, especially in an increasingly competitive job market, is law + tech - in other words, someone with both the legal knowledge and the technology skills to render legal services more creatively, efficiently and cost-effectively. Authors have suggested a variety of options for providing these technology skills. A number of organizations offer seminars, webinars and certification examinations that address the needs of the T-shaped lawyer, including the Organization of Legal Professionals (OLP) with its focus on electronic discovery, legal project management and litigation support, and others have proposed and developed new courses within the law school curriculum.2-3 As indicated in Comment 8 to Rule 1.1 in the ABA Model Rules of Professional Conduct, “a lawyer should keep abreast of changes in the law and its practice, including the benefits and risks associated with relevant technology…” 4 Perhaps it would be even better to think in terms of the T-Shaped Law Firm, so that the potential to use robust technology to deliver legal services is emphasized as a core competency at all levels within the organization.

    In anticipation of the need for lawyers and others working in the legal arena to have more background in technology, the School of Informatics and Computing at IUPUI began introducing courses in the emerging field of legal informatics back in 2005.5 The genesis of the program was a white paper written by Professor Sara Anne Hook, the program director, in 2004 on her vision for legal informatics. Legal informatics has been described as the study of the application of information technologies to the field of law and the use of these technologies by legal professionals. Therefore, the focus of the legal informatics program is on the effective use of cutting-edge technology in the study and practice of law as well as related issues, such as security and privacy. By 2010, five online courses were in place and the legal informatics certificate was officially approved in 2012. The certificate can be earned as either part of a student’s degree or as a free-standing certificate and can be completed in less than a year. Although the courses can be taken for graduate credit, offering the certificate at the undergraduate level not only means lower tuition, but also that courses are open to everyone irrespective of whether they already have another degree. The director of the legal informatics program is a lawyer and all of the courses are taught by faculty with law degrees based on their areas of expertise. An advisory committee composed of lawyers and other professionals with a wide variety of skills and interests in law and technology is available to provide guidance to the legal informatics program, especially in the development of new courses to meet the changing needs of the legal world.

    Online courses in the legal informatics program currently include:

    • Electronic Discovery
    • Foundations in Legal Informatics
    • Litigation Support Systems and Courtroom Presentations
    • Legal and Social Informatics of Security
    • Technology and the Law

    A course that is tentatively titled “Advanced Legal Informatics” is under development.  Other related online courses offered by the School of Informatics and Computing, IUPUI, include a course on competitive intelligence, a course on computer and information ethics and a course on the legal and business issues in starting a new company.   

    As a way to provide career opportunities and flexibility for students, the legal informatics certificate is envisioned as one of several joint program opportunities with the Robert H. McKinney School of Law’s newly approved Master of Jurisprudence (MJ) degree.6-7 This will be one of many potential degree collaborations between the School of Informatics and Computing, IUPUI, and the McKinney School of Law as they begin to combine their faculty expertise and curricular strengths into programs that will meet the needs of lawyers and other legal personnel in the 21st century.


    1. Sharon D. Nelson and John W. Simek.  The T-Shaped Lawyer:  Does the “T” Stand for Technology? Sensei Enterprises, Inc., June 16, 2014, http://www.senseient.com/news-press-articles?category=Articles, accessed 9/25/14.
    2. Organization of Legal Professionals, http://www.theolp.org/, accessed 9/25/14.
    3. Stephanie Kimbro.  Course Correction:  Teaching Tomorrow’s Lawyers Legal Technology Skills.  Peer to Peer, Vol. 30, Issue 2, Summer 2014, p. 56-59.
    4. Comment on Rule 1.1, ABA Model Rules of Professional Conduct, http://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/rule_1_1_competence/comment_on_rule_1_1.html, accessed 9/26/14.
    5. Legal Informatics Certificate, School of Informatics and Computing, IUPUI, http://soic.iupui.edu/undergraduate/minors-certificates/legal-informatics/, accessed 9/25/14.
    6. Master of Jurisprudence, Robert H. McKinney School of Law, Indiana University, http://mckinneylaw.iu.edu/degrees/mj/index.html, accessed 9/25/14.
    7. For information about the MJ program, please contact Professor Deborah B. McGregor, http://mckinneylaw.iu.edu/faculty-staff/profile.cfm?Id=20, accessed 9/26/14. 
  • 18 Sep 2014 12:33 AM | Chere Estrin (Administrator)

    We are finally seeing the emergence of industry standards and professional certifications as the e-discovery industry continues to mature. The recent California Bar advisory opinion on e-discovery competency drew much attention and is the most recent example of the trend.

    A corporation or law firm selecting an e-Discovery or Discovery Management partner should determine if the partner has enough qualified, certified personnel and implements industry standards into their practice.

    However, it is surprising how even e-discovery RFPs from the largest corporations and most sophisticated buyers rarely address certification and standards requirements. If the federal rules amendments regarding e-discovery that came into effect in January 2007 were a watershed moment, then certifications and standards have perhaps been slow to emerge.

    Part one will survey certifications and evidence of competency for individuals; in part two I will review emerging standards at the organizational level, including the elusive matter of "best practices".

    Individual certifications for e-discovery

    Individual certifications of competency in e-discovery are on the rise; however, there may only be a few hundred professionals certified thus far.

    It is important to distinguish between true certifications from industry bodies independent of specific e-discovery vendors, and certificates issued by e-discovery service or software providers. Obviously the former are more valuable than the latter.

    Any would-be client should require independently certified individuals across the staff of the prospective partner and in the service delivery teams assigned to their work.

    • The Organization of Legal Professionals (OLP) offers a widely respected certification called CeDP (Certified E-Discovery Professional) and the CLSS (Certified Legal Support Specialist) certificate. OLP also now offers an E-Discovery Project Management certificate. The Association of Litigation Support Professionals (ALSP), in lieu of developing its own certification, recommends the OLP certification to its members.

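    • Over three hundred people hold the CEDS (Certified E-Discovery Specialist) qualification from the Association of Certified E-Discovery Specialists (ACEDS).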

    Both OLP and ALSP in their mission statements emphasize standards and certifications, but ACEDS only secondarily. Of course establishing standards and certifications is challenging for even much larger professional organizations.

    OLP and ACEDS reportedly utilized teams of experts to develop bar-like multiple choice exams that try to cover the complete Electronic Discovery Reference Model (EDRM), from information identification through document production. As such, OLP and ACEDS are the leading providers of e-discovery certifications, but expect other non-profit organizations to enter the certification arena soon.

    That being said, there is a general opinion that the certifications available are perhaps not yet of the quality or relevance seen in the IT or forensics field (see below). It is still relatively early days for e-discovery certifications.

    There are a host of forensic qualifications that address the “left side” of the EDRM, specifically data preservation, collection and forensic analysis. The most popular and highly respected certifications are the Certified Computer Forensic Examiner (CCFE), Certified Computer Examiner (CCE) and Certified Fraud Examiner (CFE), which are offered by independent industry groups.

    Two recognized expert organizations in the field also provide in-depth respected training courses: Arkfeld’s eDiscovery Education Center and Ralph Losey’s E-Discovery Team. Also well regarded are courses provided by the E-Discovery Training Academy at Georgetown Law Center.

    Vendor provided certifications

    E-discovery software providers have been providing courses and certifications in the use of their software tools for many years.

    While many are undoubtedly thinly disguised marketing devices, a few may have reached the status of genuine skill certification. Certifications tend to focus on a single software suite for a specific phase of the EDRM such as collection, processing, predictive coding or document review. A couple of e-discovery vendors offer a general competency training program with one going so far as to establish a subsidiary company to issue certifications.

    Although vendor created certifications vary in content and quality, it is valuable if the service provider has personnel certified in the software used in day-to-day e-discovery work. There are too many software providers offering courses to summarize here, but following are the most respected.

    • Relativity by kCura is probably the most widely used document review software platform. kCura offers five different certifications. Although designed to show competency in specific review and analytics components, the certifications have become credentials of broader e-discovery knowledge and practical skills. The Relativity Certified Administrator (RCA) is an important industry certification, along with the RCA – Analytics, which is particularly rigorous. The Relativity Certified Sales Professional (RCSP) certification is a foundational level qualification. In April, kCura introduced the Relativity reviewer certification.

    • Guidance and AccessData are companies that provide exams focused on data forensics and collections but which also require a broad e-discovery understanding. The Guidance EnCEP exam claims to test proficiency in their collection software as well as in “e-discovery planning, project management, and best practices, spanning legal hold to load file creation.” AccessData offers the AccessData Certified Examiner (ACE) and the mobile forensic qualification AccessData Mobile Examiner (AME) certifications.

    Applying general project management principles can be vital for effective e-discovery and discovery management.

    The Project Management Institute (PMI) offers eight different project management certifications. Most widespread is the Project Management Professional (PMP) qualification and the entry level Certified Associate in Project Management (CAPM). A separate organization (IPMA) offers four progressively harder IPMA series certifications.

    Certifications in related disciplines

    Certifications that are not specifically for e-discovery can also demonstrate relevant skill sets to any potential client.

    The Association for Information and Image Management (AIIM) and Association of Records Managers and Administrators (ARMA) work in the overlapping disciplines of enterprise records management and information governance.

    • AIIM offers a Certified Information Professional (CIP) program which is relevant to e-discovery competency.

    • ARMA offers the Information Governance Professional (IGP) certification and recently introduced the Certified Information Governance (CIG) qualification.

    • There is also the Certified Records Manager (CRM) qualification from the Institute of Certified Records Managers (ICRM).

    As time moves on, qualifications in the other related areas such as data security and privacy will likely be increasingly relevant to e-discovery, but are beyond the scope here.

    However, given recent high-profile data security incidents, anyone qualifying an e-discovery vendor would want to see staff with CISSP, CISA, CISM and similar certifications.

    Attorney e-discovery certification and competency

    There are still no certified e-discovery specialist qualifications for attorneys similar to the twenty plus law practice areas recognized by the American Bar Association (ABA). This is a gap in law practice specialization which will likely be addressed by the ABA in the near future.

    Certain e-discovery Continuing Legal Education (CLE) programs authorized by state bar organizations can indicate expertise, but the content and quality of CLE seminars varies widely. There are about a dozen law schools offering e-discovery courses for credit mainly focused on legal rules. But Bryan University and Hamline University, for example, offer graduate certificate programs in e-discovery with a technical and legal focus.


    However, the recent California Bar proposed advisory opinion concerning e-discovery may be the start of a change for attorney certification. It describes a broad duty of competency in e-discovery that extends to the ability to "analyze and understand a client’s ESI systems and storage". This is a demanding requirement and if it cannot be met the attorney must use counsel who is competent or decline the representation.

    An increased demand for e-discovery certifications for attorneys is sure to follow in California and nationally.

    As for law schools, in a recent survey 124 of 193 schools offered no e-discovery curricula at all.

    Finally, let’s not forget that your e-discovery services provider should have a healthy proportion of attorneys and JDs rather than just software, project management or technical personnel. It is also useful to probe professional credentials to see if team members have ever practiced law, worked in litigation support in law firms or at least participated in major document reviews.

    Industry representation

    One should also assess evidence of substantive contributions to various industry bodies when selecting an e-discovery or Discovery Management partner.

    Finding active members of EDRM participating in one of the many EDRM working groups, or perhaps a co-author of a publication can be a useful sign. Similarly, active participation in the Sedona Conference working groups and committees can be a valuable indicator of expertise.

    A cheat sheet summary of certifications to look for

    Inventus takes individual and industry qualification very seriously, has an unusually high proportion of certified professionals and actively promotes continuous professional development.

    There is, however, a confusing array of individual qualifications and acronyms in the e-discovery field, making evaluation of e-discovery providers by potential clients challenging.

    Accordingly, here is a cheat sheet of key certifications for which to look:

    • Independent e-discovery associations: CEDS, CeDP, CLSS

    • Independent organization forensic qualifications: CFE, CCFE, CCE

    • Credible e-discovery software providers: RCA, RCA Analytics, RCSP, ACE, AME, EnCEP

    • Respected experts: Arkfeld, Losey courses

    • Industry bodies: active participation in working groups and/or authorship of EDRM or Sedona Conference publications

    • Project Management: PMP, CAPM, IPMA

    • Independent associations in records/information governance: CIP, CIG, IGP, CRM

    • Cyber security: CISSP, CISA, CISM

    • Vendor issued certifications in software used by the e-discovery service provider

    Part 2 will examine standards and certifications bearing on e-discovery competency at the organization and service delivery level.

  • 17 Sep 2014 5:15 PM | Chere Estrin (Administrator)

    The 2014 Legal Hold and Data Preservation Benchmark Survey, the largest survey conducted in the area of electronic discovery, identifies several key findings that show a significant shift toward more automation in litigation holds resulting in higher confidence levels in the data preservation process. Based on survey findings, Zapproved predicts that the majority of all legal data preservation will be automated by 2015.

    Results demonstrate how the move to automating data preservation yields significant benefits, including confidence in their processes. The evolving trend is that legal hold processes are becoming more mature. From last year to this, those using automated systems jumped from 34 percent to 44 percent, reporting benefits not only in efficiency and process maturity, but also in their confidence to defend their preservation practices, which is required more often in today’s legal climate.

    The extensive survey provides insight on the current attitudes, risks and pain points associated with the obligation to preserve data. Building on the inaugural 2013 study, 536 professionals dealing directly with litigation hold management participated.

    Key Findings

    One key dimension of the study was comparing those respondents that currently use manual legal hold processes with those who have automated systems in place.

    • From last year to this, those using automated systems jumped from 34 percent to 44 percent.
    • Those using manual processes are more than three times as likely to be ‘very unsatisfied’, and more than five times as likely to lack confidence in their process.
    • Eighty percent of automated users consider their processes better than most, compared with less than half of the manual users.

    Another aspect of the study solicited input on best practices and preservation effectiveness.

    • While two-thirds train their employees on legal holds, only 46 percent felt that employees understand their preservation obligation, suggesting that more training is necessary.
    • Three out of four preservation managers believe that employees will follow through on their preservation obligations.
    • From 2013 to 2014, the number of participants that have defended practices moved from 21 percent to 31 percent.
    • Respondents indicating that preservation was an undue burden dropped from 15 percent to 12.9 percent.
    • Respondents said automating hold processes improves performance by 22 percent in the nine key categories considered integral to demonstrating an adequate hold process.

    The survey also showed a significant jump in the number of hours spent per month.

    • This year 56 percent spend more than 5 hours per month on legal holds, up from 52 percent.
    • Those users deemed “power-preservers” (issuing six or more holds a month) increased from less than 15 percent to nearly 20 percent with this year’s survey.
    • A majority of the organizations now issue legal holds in most of their matters.

    An interesting trend emerged this year regarding outside counsel.

    • The 2014 response shows only 3 percent rely on outside counsel, which is half the portion indicated in 2013, suggesting that preservation is handled almost entirely in-house now.

    Regarding collections of electronically stored information (ESI), the survey queried organizational preferences in the various collections methods including: “collect to preserve”, “preserve in place”, or both methods.

    • Nearly half of organizations (49 percent) use both methods, but “preserve in place” is preferred by a 3-to-1 margin when a single method is employed.
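
    When reading year-over-year findings like these, it helps to separate the change in percentage points from the relative change. The short sketch below does that arithmetic for four of the figures reported above; the function name and labels are made up for illustration, and the inputs are only the rounded percentages quoted in this summary, not the underlying survey data.

```python
def change(before, after):
    """Return (percentage-point change, relative change in percent)."""
    points = after - before
    relative = points / before * 100
    return round(points, 1), round(relative, 1)

# Rounded percentages quoted in the findings above (2013 value, 2014 value).
findings = {
    "using automated hold systems": (34.0, 44.0),
    "have defended preservation practices": (21.0, 31.0),
    "see preservation as an undue burden": (15.0, 12.9),
    "spend more than 5 hours per month on holds": (52.0, 56.0),
}

for label, (before, after) in findings.items():
    points, relative = change(before, after)
    print(f"{label}: {points:+.1f} points ({relative:+.1f}% relative)")
```

    For example, the move from 34 percent to 44 percent is a gain of 10 percentage points, which is roughly a 29 percent relative increase in the share of respondents using automated systems.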

    What Next?

    This survey offers a benchmark of practices by industry peers and topics to consider when evaluating an organization’s legal hold and data preservation processes.  

    • Conduct an audit of current preservation processes and measure effectiveness to ensure processes meet current legal standards.
    • Prepare to defend preservation practices and demonstrate that the process is consistent with a detailed audit trail.
    • Focus on employee training and continually developing a culture of compliance in order to improve preservation practices and confidently demonstrate good faith in meeting requirements.
    • Continuous education of best practices from leading experts and peers is necessary for keeping up with standards in this fast-moving area of the law.


    The survey, conducted from May 5, 2014 to June 30, 2014, used an online questionnaire administered by The Steinberg Group LLC. The survey sample of 536 included only individuals who affirmatively acknowledged that they are responsible for managing litigation hold processes.

    When looking at titles, the sample was distributed as follows:

    • 37.0 percent of participants were attorneys, with 17.2 percent self-identifying as GC/AGC
    • 42.6 percent were litigation support or paralegals
    • Remaining 20.3 percent were non-legal staff responsible for administering legal holds, such as records managers and IT professionals

    Download the full survey here: http://www.legalholdpro.com/survey2014.

    About the Author

    Brad Harris is the Vice President of Legal Products at Zapproved, Inc. He has more than 30 years of experience in the high technology and enterprise software sectors, including helping Fortune 1000 companies enhance their e-discovery preparedness through technology and process improvement.

  • 18 Aug 2014 8:58 PM | Chere Estrin (Administrator)

    When it’s time to hire, we legal technologists usually go for two skill sets: technical and business.

    Of course, that’s not surprising; our work is often complex and risky.  However, when evaluating skills or even culture fit, we often overlook the fact that the same personality traits that make us great at the technical aspects of our job can lead to challenges when dealing with the more humanistic elements of the business.  Although IQ plays an important role in personal success, especially in complex disciplines such as the sciences, studies have demonstrated that emotional intelligence or EQ can play an even more important role. In fact, experts in this field argue that IQ contributes only 20% to life success.

    Which means that the majority of your achievements come from emotional intelligence.

    So it stands to reason that mastering emotional intelligence and understanding professional interpersonal relationships in today’s workplace should be considered as much a core skill or competency as technical ability or general business acumen.

    Do as I do

    An emotionally intelligent leader is adept at recognizing and examining his or her emotional responses in an honest and introspective fashion. When we are better able to understand the strengths and weaknesses of our own personalities, we can avoid many of the triggers that often result in unnecessary workplace conflict or tension.

    This can be particularly useful for those of us who spend significant time supporting or servicing the litigation industry, where our internal/external clients are trained to engage and argue for a living.  In these situations, failing to hire or to develop emotionally intelligent staff can result in unproductive debates, or worse, escalating conflicts that might otherwise have been de-escalated (or never have occurred in the first place).

    When we fail to understand emotional intelligence, we risk allowing our emotions to mimic or further influence negative emotions around us. This can create a feedback loop in which negativity and conflict build to a point where even the most trivial disagreement can derail a project, or (in an extreme example) irreparably damage business relationships.  We’ve all experienced situations where tensions were high, voices were raised, and both verbal and non-verbal communication led to what can best be described as non-constructive dialogue.  Recognizing and managing our emotional responses as they are occurring allows us to more easily avoid or defuse many of these situations. 

    So what are some of the markers of Emotional Intelligence?  Goleman argues that first and foremost is:

    Ability to empathize with others.  Whether it’s a client, a member of your staff, or that angry person standing in front of you at the grocery checkout, what’s paramount is not just listening to what a person is saying, but truly empathizing with his position.  However, it’s important not to confuse empathy with agreement. When you empathize, it doesn’t necessarily mean that you agree with another person’s position, merely that you understand her perspective. Remember, the more self-aware you are, the more skilled you will become at reading other people’s feelings.

    Understand and recognize your personal biases.  We all view the world through our own individual lens that was shaped and honed by our cultural, educational, and experiential influences.  That lens shapes our perspective and how we view the world and others.  As a result, two individuals looking at the same set of circumstances may come to two completely different conclusions.  Are you looking at the problem from the lens of the technologist? If so, is that influencing your position in a way that someone without a technical background might not understand?

    Focus on relationship management, not just people management.  The ability to express feelings is a key social skill, and emotions are often contagious.  As managers, it’s important to realize that we send emotional signals during each and every encounter, and our staff, clients, and others around us may actually mimic our emotions.  During social interactions, people tend to mirror the body language of those around them, leading to what’s known as “mood coordination.” The better you are at reading others, the more effectively you can control the signals you send. This awareness will help you to manage the effect you have on others.


    Goleman, D. (2005). Emotional Intelligence. New York, NY: Bantam Books.

    Goleman, D., & Boyatzis, R. (2008). Social intelligence and the biology of leadership. Harvard Business Review, 86(9), 74–81.

    About the Author: Donald Billings has 20 years of experience in leadership, entrepreneurial, and consultative roles serving global law firms, Fortune 100 companies, and non-profits. He holds a B.Sc in computer science/software engineering, master certificates in business administration, legal studies, and information security management, and an M.Sc in leadership with a focus in innovation and technology. He is currently pursuing a doctorate in business administration with a focus in technology entrepreneurship.

  • 10 Jul 2014 6:51 PM | Chere Estrin (Administrator)

    Reality Check!!

    I was at a meeting with a potential client recently, and was presented with a set of their predominant case teams' dissatisfactions with both their external and internal e-discovery and litigation support teams and providers.  One of their most prominent complaints was the time it took to load data received on any of their matters.

    They operated on a 5-day rule, that is to say 5 days following receipt of any data, it was supposed to be loaded ready for review. That sounded like a fairly generous time-frame to me, as I've rarely worked for patient litigation teams, and that sounded like a rather reasonable amount of time to load data into any database.

    I've frequently been involved in situations where immediate attention to document loading was required, such as when, during a deposition, documents were produced and needed to be marked, numbered and loaded so they could be instantly used and recorded.  Quite different than any 5-day protocol. 

    It might also take as much time to load a single document as 1,000 because the data load set-up time was more significant than the time it might take to load any reasonable bunch of documents.

    So what was the problem?  Where was the reality check disconnect??

    Turns out that documents could be received in any of their several offices.  The attorneys receiving them would need to cross-ship and deliver the pages to be scanned or discs to be loaded to the main support office for processing and ingestion.  That could easily take a day or two, and likewise, the attorney also might not get to this re-delivery task for a day or two.

    With a 5-day load standard, subtracting 2 or 4 days from the task made everything into a [un]fairly critical crisis situation! 

    Simple remedy -

    When analyzing the 5-day-from-receipt standard, I pointed out that they needed to start the clock when the processing folks got the data in their own hands.

    If the case-team wanted their data loaded sooner, all they needed to do was turn around the data to the support folks sooner, whether by FTP or internal network upload, expedite the cross-shipping, or eliminate introducing any delays of their own doing.
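    The arithmetic behind the two clock definitions can be sketched as follows. This is only an illustration: the function names and dates are hypothetical, and it counts calendar days, which is an assumption (the firm's rule may well have counted business days).

```python
from datetime import date, timedelta

STANDARD_DAYS = 5  # the "5-day rule" described above


def days_left_from_receipt(received: date, handed_off: date) -> int:
    """Days the support team has if the clock starts when the attorney receives the data."""
    transit_days = (handed_off - received).days
    return STANDARD_DAYS - transit_days


def deadline_from_handoff(handed_off: date) -> date:
    """Deadline if the clock starts when the processing team has the data in hand."""
    return handed_off + timedelta(days=STANDARD_DAYS)


# Hypothetical matter: data arrives July 1 and reaches the support office July 5.
received = date(2014, 7, 1)
handed_off = date(2014, 7, 5)
print(days_left_from_receipt(received, handed_off))  # 1
print(deadline_from_handoff(handed_off))             # 2014-07-10
```

    Under the first definition a four-day hand-off leaves one day to load; under the second, the support team always gets the full five.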

    Actually easy to fix once this was pointed out to them. It was unfortunately easier for them to assign blame than admit they were really part of the problem.  Upon review, however, while their standard protocol was arguably generous, their own behavior was not.  They had no one to blame but themselves for delay. 

    Reality Check!!

  • 28 May 2014 2:50 PM | Chere Estrin (Administrator)

    Other Concerns

    Another possible limitation of Cormack and Grossman’s study is the document representation that they used. Their CAL algorithm could be run with any kind of document representation, but they chose one that may be particularly difficult for other approaches, such as random sampling. 

    Cormack and Grossman’s method destroys any semantics (meaning) in the documents.  By breaking the text into 4-character shingles (including spaces and punctuation, I think), it destroys the morphemes.  Morphemes are the basic units of meaning in a language. Words consist of one or more morphemes. Cormack and Grossman’s method treats documents as sequences of characters, not words.
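    To make the difference concrete, here is a minimal sketch of character shingling next to a word-based split. The function names are illustrative, and the exact tokenization rules of either system are not described here, so treat the details (lowercasing, whitespace splitting) as assumptions.

```python
def char_shingles(text: str, k: int = 4) -> list[str]:
    """Every overlapping run of k characters, spaces and punctuation included."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def word_tokens(text: str) -> list[str]:
    """A simple word-based representation: lowercase and split on whitespace."""
    return text.lower().split()


print(char_shingles("paid the"))
# ['paid', 'aid ', 'id t', 'd th', ' the']
print(word_tokens("paid the"))
# ['paid', 'the']
```

    Most of the shingles straddle a word boundary and correspond to no morpheme, while the word split keeps each unit of meaning intact.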

    They do two other things that make training more difficult.  The number of distinct words in a document collection is typically on the order of around 100,000 for small document sets, growing by roughly one new word per document beyond this.  About half of those words occur exactly once in the corpus.  The number of four-character shingles is much larger.  Most, if not all, machine learning algorithms are sensitive to the number of distinct “dimensions” on which they are trained. A dimension is simply a variable on which the items may differ, for example the presence or absence of a word, or the presence/absence of a letter shingle.

    Typically information retrieval systems work to reduce the number of distinct features to classify, but Cormack and Grossman appear to have chosen a representation that increases them.  Instead of limiting the dimensions to a few hundred thousand words, Cormack and Grossman use shingles. They note that there are 4,294,967,296 potential shingle representations, which they reduce through hashing to 1,000,081 distinct codes. Each code (each of the 1 million final values) represents about four thousand different four-character shingles (4 billion potential shingles to 1 million codes).
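    The "about four thousand" figure follows directly from the two numbers quoted above, as this short check shows (the variable names are mine):

```python
POTENTIAL_SHINGLES = 4_294_967_296  # figure quoted from the paper; equals 2**32
HASH_CODES = 1_000_081              # number of distinct codes after hashing

shingles_per_code = POTENTIAL_SHINGLES / HASH_CODES
print(POTENTIAL_SHINGLES == 2**32)  # True
print(round(shingles_per_code))     # 4295
```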

    If I understand correctly, they encode each shingle originally as a 32-bit binary string (whether a character, including some punctuation, is present or not in the shingle), throwing away repeat characters in the shingle.  They then encode the shingles into a million-bit binary string, throwing away the number of times each shingle occurred in a document.  That’s a lot of information to throw away. This representation may work well for the CAL algorithm, but it is not optimal for other machine learning algorithms.
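    A rough sketch of what such a hashed, presence-only representation looks like is below. It is not their implementation: the hash function (`zlib.crc32`) is a stand-in I chose for illustration, and the set of active codes plays the role of the million-bit binary string.

```python
import zlib

NUM_CODES = 1_000_081  # number of hash codes quoted above


def shingle_code(shingle: str) -> int:
    """Map a shingle to one of NUM_CODES buckets (crc32 is an assumed stand-in hash)."""
    return zlib.crc32(shingle.encode("utf-8")) % NUM_CODES


def binary_features(text: str, k: int = 4) -> set[int]:
    """Codes that are 'on' for this document; how often each shingle occurred is discarded."""
    return {shingle_code(text[i:i + k]) for i in range(len(text) - k + 1)}


# A document that repeats a shingle many times looks identical to one that has it once.
print(binary_features("aaaaaaaa") == binary_features("aaaa"))  # True
```

    Because only presence is recorded, term frequency, which many other learners rely on, is no longer available to the classifier.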


    In conclusion, Cormack and Grossman claim that:

    • Random sampling yields inferior results
    • Reliance on keywords and random sampling is a questionable practice, certainly not necessary to train predictive coding
    • The CAL process is not subject to starting bias

    None of these claims is supported by the evidence that they present. Their use of an inadequate and biased evaluation standard, a so-called “gold standard,” makes it very difficult to draw any real conclusions from this study. Other design decisions in the study also strongly bias the results to favor their CAL system.

    There is no reason to think that their results imply that random sampling per se is the issue. The best that they could reasonably claim is that it did not work well in the situation that they arbitrarily constructed, with the algorithm that they chose for it, and measured against a task that was also biased in favor of their CAL algorithm. The fact that random sampling does work well in other situations belies their conclusion. Random sampling can, in fact, be very effective at identifying responsive documents and it can be verified by an independent standard of judgment.

    Cormack and Grossman raise a straw-man argument that others claim that only random sampling is valid for training predictive coding. I don’t know who actually makes this claim. I believe that random sampling is sufficient to produce accurate results, but I do not argue that random sampling is necessary.

    At OrcaTec, we frequently use combinations of pre-categorized documents and random sampling, but random sampling alone is efficient and sufficient for most of our matters. Further, there are other reasons to want to use random sampling for training predictive coding.

    As Cormack and Grossman note, random sampling can give more representative training examples, their trumped-up example of a system learning that spreadsheets are not responsive aside. They are correct in noting that random sampling is more likely to capture common categories of documents than rare ones, but unless there is some method for identifying rare documents, no statistical learning system is going to be very good at identifying them. If the users know of examples of rare responsive documents, then these can be included in a training set, followed by random sampling. As Cormack and Grossman note, however, there is no scientifically useful protocol available for identifying these rare instances.

    Random sampling gives direct, unbiased, and transparent feedback as to how well the system is predicting responsive documents.
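    One way to see why the feedback is direct: a proportion measured on a random sample comes with a margin of error that anyone can recompute. The sketch below uses the textbook normal approximation, which is my assumption for illustration and not a description of any particular product's method; the sample counts are hypothetical.

```python
import math


def estimate_proportion(labels: list[bool], z: float = 1.96) -> tuple[float, float, float]:
    """Point estimate and approximate 95% interval for the share of True labels
    in a simple random sample (normal approximation)."""
    n = len(labels)
    p = sum(labels) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical sample: 400 randomly drawn documents, 80 judged responsive.
sample = [True] * 80 + [False] * 320
p, low, high = estimate_proportion(sample)
print(f"{p:.3f} ({low:.3f} to {high:.3f})")  # 0.200 (0.161 to 0.239)
```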

    I am truly puzzled that Cormack and Grossman argue that their CAL system is not affected by starting bias. They argue that a new classifier is generated after every 1,000 documents. This classifier derives its positive examples from the documents that have previously been identified as responsive and then presents more like those for further review. I see no opportunity for a rare unanticipated document type to be selected.

    Random sampling, on the other hand, presents for review a representative sample of documents without regard for the system’s predictions about which are responsive. New and rare topics, thus, have the opportunity to be included in the training set, but they are not guaranteed to be included.
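    "Have the opportunity, but are not guaranteed" can be quantified. As a sketch, assume each randomly drawn document independently belongs to a rare topic with probability equal to that topic's prevalence (reasonable when the collection is much larger than the sample); the prevalence and sample sizes below are hypothetical.

```python
def chance_of_at_least_one(prevalence: float, sample_size: int) -> float:
    """Probability that a random sample contains at least one document from a
    topic with the given prevalence, assuming independent draws."""
    return 1.0 - (1.0 - prevalence) ** sample_size


# A topic covering 1 in 1,000 documents:
for n in (100, 1000, 5000):
    print(n, round(chance_of_at_least_one(0.001, n), 3))
```

    With 1,000 random documents the chance of seeing such a topic at least once is a little under two in three; it rises with sample size but never reaches certainty.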

    Finally, to paraphrase Mark Twain, any reports concerning the death of random sampling for training predictive coding are grossly exaggerated. Far from being a questionable practice, if your goal is to minimize the number of unnecessary non-responsive documents to review while maximizing the accuracy of your selection, then random sampling can provide a powerful, defensible, means to achieve that goal.


  • 28 May 2014 8:46 AM | Chere Estrin (Administrator)

    This is my second blog post about Cormack and Grossman’s latest article, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.” In their article they compared several versions of simple random sampling as the training set. In my previous blog, I argued that they are CAR vendors, they have a vested interest in making their algorithm appear better than others, and that the system that they tested using random sampling bears almost no resemblance to any system using it to actually do predictive coding.

    In this post, I want to focus more on the science in the Cormack and Grossman article. It seems that several flaws in their methodology render their conclusions not just externally invalid — they don’t apply to systems that they did not study, but internally invalid as well — they don’t apply to the systems they did study.

    Their paper has been accepted for presentation at the annual conference of SIGIR (The Special Interest Group on Information Retrieval) of the ACM (The Association for Computing Machinery). I am also a member of this organization.

    External Validity

    Poor performance on a caricature of a random sampling system or one using simple active learning does not indict all systems that use random sampling or simple active learning. As a demonstration that this inference is not warranted, I showed results from some OrcaTec projects that used random sampling. Rather than returning poor results, these projects returned results that were better than Cormack and Grossman’s favored algorithm. My point was not to show that their CAL algorithm is not good, but rather, in contrast to their claims, to show that random sampling can lead to high levels of predictive accuracy.

    Some of the differences between the system that they tested and the OrcaTec system include:

    • Solving for a different task
    • Using a different underlying learning model
    • Using a different underlying document representation

    What they describe as the “TAR task” is different from the task that most predictive coding systems seem to use, including OrcaTec’s. Cormack and Grossman’s task is to continuously optimize the documents presented for review, whereas it is more common to divide the predictive coding process into two parts, one where the system is trained, and a second where the system’s predictions are reviewed. The continuous task is well suited to their CAL model, but not to the other two systems that they tested.

    Cormack and Grossman’s SPL system uses a support vector machine as its underlying machine learning model. The support vector machine is common among several predictive coding systems, but it is not used in the OrcaTec system. The different underlying models in the two systems means that it is not valid to extend the results obtained with one to the other. They have different learning properties.

    The Cormack and Grossman system used a peculiar way of representing the documents in the collection, whereas the OrcaTec system uses words. I discuss this issue in more detail below, but this difference means that the results of their SPL method cannot be validly applied to other forms of random-sampling training.

    These differences limit the validity of any comparisons between the results that Cormack and Grossman obtained and the results of competing predictive coding providers, including OrcaTec.

    All other things being equal, one might be willing to say that in their specific tasks, with their specific collections, and their specific models, that their Continuous Active Learning (CAL) system performed better than a Simple Active Learning (SAL) or Simple Passive Learning (SPL), but even this comparison turns out to be invalid.

    Internal Validity

    As mentioned, the Cormack and Grossman task is well suited to CAL, but because neither of the other two systems is designed to present the most relevant documents for review at each of the training stages, their task does not apply to them. Presenting random documents for review (SPL) or the most uncertain documents for review (SAL) is simply not designed to produce highly responsive documents during the initial stages of training. These systems are designed for a two-stage process of learning followed by prediction. Whatever the technical merits of a system, pitting one designed to produce responsive documents at the beginning of the task against two systems that are designed to get information about documents at the beginning of the task is not a valid comparison.

    Their choice of task strongly biases their results in favor of the CAL task, meaning that their internal comparisons, within their study, are invalid. It is metaphorically like comparing a dolphin and a horse and saying, oh look, that dolphin does not run so fast in the desert.

    The bigger issue is with how they constructed the set of documents that they used to measure the performance of the system. Their so-called “gold standard,” unless I am seriously mistaken, guarantees that their CAL system and not any of the others will be successful.

    The measurement of Recall (the completeness of the process identifying responsive documents) and Precision (the selectivity of the process identifying responsive documents), requires that we know the true status (responsive or non-responsive) of the documents being assessed by the TAR/CAR system. How that standard set is constructed is critically important. As they note: “To evaluate the result, we require a ‘gold standard’ indicating the true responsiveness of all, or a statistical sample, of the documents in the collection.”
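    For readers who want the two measures spelled out, here is a minimal sketch. The document IDs are made up; the point is that both numbers are computed against whatever set is taken to be the truth.

```python
def precision_recall(predicted: set, truly_responsive: set) -> tuple[float, float]:
    """Precision: share of predicted documents that are truly responsive.
    Recall: share of truly responsive documents that were predicted."""
    true_positives = len(predicted & truly_responsive)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truly_responsive) if truly_responsive else 0.0
    return precision, recall


# Hypothetical example: the system flags documents 1-4; documents 2, 3 and 5 are responsive.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 5})
print(precision, round(recall, 3))  # 0.5 0.667
```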

    Here I want to focus on how the true set, the so-called “gold standard” was derived for the four matters they present. They say that for the “true” responsiveness values “for the legal-matter-derived tasks, we used the coding rendered by the first-pass reviewer in the course of the review. Documents that were never seen by the first-pass reviewer (because they were never identified as potentially responsive) were deemed to be coded as non-responsive.”

    If I am reading this correctly, it means that the CAL algorithm and (perhaps) the keyword search were used to identify the set of responsive documents. If true, this approach is a very serious problem. They used the tool that they want to measure to derive the standard against which they were planning to measure it. They used the prediction of the CAR model as the standard against which to measure the prediction of the CAR model. The reviewers could decrease the number of documents that were deemed responsive, but they could not apparently increase the number of documents that were deemed responsive.

    It is not at all surprising that the documents predicted to be responsive during the standard setting were the same documents identified by the same system during the assessment. They were both predicted by the same system.

    The double use of the same measurement means that only the documents that were identified as responsive by the CAL algorithm or by the keyword search used to create the seed set could be considered responsive. If either of the other two algorithms identified any other documents as responsive, they would have matched an unseen document, and so by Cormack and Grossman’s definition, they would have been counted as an error. Again, if true, it means that only the CAL algorithm could consistently produce documents that would be called responsive. The other algorithms could produce correctly classified responsive documents only if they happened to agree with the CAL algorithm or the keyword search. Only those documents would have been reviewed and so only those could be candidates for being responsive.
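    A toy illustration of this effect, under my reading of their procedure: the document IDs and sets below are entirely hypothetical, and "unseen means non-responsive" is applied exactly as quoted above.

```python
def precision_recall(predicted: set, standard: set) -> tuple[float, float]:
    """Precision and recall of `predicted` measured against `standard`."""
    hits = len(predicted & standard)
    return hits / len(predicted), hits / len(standard)


truly_responsive = set(range(1, 11))   # documents 1-10 are actually responsive
seen_by_reviewer = {1, 2, 3, 4, 5, 6, 20, 21}   # only what system A surfaced for review
# Unseen documents are deemed non-responsive, so the "gold standard" shrinks:
gold_standard = truly_responsive & seen_by_reviewer   # {1, ..., 6}

system_a = {1, 2, 3, 4, 5, 6}      # the system that generated the standard
system_b = {5, 6, 7, 8, 9, 10}     # a different system, equally correct in reality

for name, predicted in (("A", system_a), ("B", system_b)):
    against_gold = precision_recall(predicted, gold_standard)
    against_truth = precision_recall(predicted, truly_responsive)
    print(name, against_gold, against_truth)
```

    Both systems find six truly responsive documents (precision 1.0, recall 0.6 against the truth), yet against the derived standard system A scores perfectly while system B is charged with four errors for finding documents that no reviewer ever saw.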

    Cormack and Grossman report the relative contributions of the keyword search and the CAL algorithms. Recall and Precision for the keyword searches used to generate the seed sets are shown in their Table 3. For two of the matters (B and D), the keyword searches yielded high Recall (presumably when judged against the merged standard of keyword and CAL), but low to moderate Precision. In fact, in these two matters, the keyword search resulted in more responsive documents than the CAL algorithm did. For the other two matters (A and C), the CAL algorithm identified more responsive documents than the keyword search did.

    Their keyword seed set sometimes showed better Recall than their CAL results.

    Table 3: Keyword seed queries
    Recall    Precision    Max CAL Recall    Recall 90    Precision

    Two matters out of four found poorer Recall after using the CAL method than the keyword search used to generate the seed set achieved.

    Keyword Recall and Precision are from their Table 3.

    Max CAL Recall is the maximum that CAL achieved in the study.

    Recall 90 is their first Recall above 0.90, or max Recall if it never reached 0.90.

    Precision at Recall 90 is the last column.
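    The "Recall 90" rule described above can be written as a small function. This is a sketch of the rule as stated in these notes; the recall sequences are invented for illustration and are not values from the paper.

```python
def recall_90(recalls: list[float], threshold: float = 0.90) -> float:
    """First recall value above the threshold as review proceeds,
    or the maximum recall if the threshold is never exceeded."""
    for value in recalls:
        if value > threshold:
            return value
    return max(recalls)


print(recall_90([0.50, 0.88, 0.93, 0.97]))  # 0.93
print(recall_90([0.40, 0.70, 0.85]))        # 0.85
```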

    The CAL algorithm selects for review the top 1,000 documents that it predicts to be responsive. So this algorithm does not present any documents for review that are not predicted to be responsive. If a document is responsive, but does not resemble any of the documents already predicted to be responsive, then it will not be found, except, perhaps, at the very end of training when the number of remaining similar documents is exhausted. Documents that are similar to the seed set are considered; documents that are not similar to the seed set are unlikely to ever be considered, let alone predicted to be responsive. This limitation is hidden by the apparent fact that only similar documents were contained in the validation set (the “gold standard”). The system can appear to be performing well, even when it is not. An invalid standard set means that any measures of accuracy compared to this standard are also, necessarily, invalid.
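    The selection dynamic described above can be sketched with a toy loop. This is not Cormack and Grossman's classifier: as a stand-in I rank each unreviewed document by its best word overlap (Jaccard similarity) with any document already judged responsive, and the batch size is 1 rather than 1,000. The documents and labels are hypothetical.

```python
def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard overlap between two bags of words (a stand-in for a learned model)."""
    return len(a & b) / len(a | b)


def review_order(docs: dict, responsive: set, seed: set, batch_size: int = 1) -> list:
    """Repeatedly present the unreviewed documents most similar to known positives."""
    positives, unreviewed, order = set(seed), set(docs) - set(seed), []
    while unreviewed:
        def score(doc_id):
            return max(similarity(docs[doc_id], docs[p]) for p in positives)
        # Highest score first; ties broken by document id so the result is deterministic.
        batch = sorted(unreviewed, key=lambda d: (-score(d), d))[:batch_size]
        for doc_id in batch:
            order.append(doc_id)
            unreviewed.discard(doc_id)
            if doc_id in responsive:      # reviewer's judgment feeds the next round
                positives.add(doc_id)
    return order


docs = {
    "seed": frozenset({"price", "fixing", "meeting"}),
    "a": frozenset({"price", "fixing", "email"}),
    "b": frozenset({"price", "meeting", "lunch"}),
    "c": frozenset({"football", "lunch", "meeting"}),
    "d": frozenset({"lunch", "meeting", "friday"}),
    "rare": frozenset({"shred", "documents", "tonight"}),
}
print(review_order(docs, responsive={"seed", "a", "rare"}, seed={"seed"}))
# ['a', 'b', 'c', 'd', 'rare']
```

    The responsive document that shares no vocabulary with the seed is presented only after everything resembling the seed has been exhausted.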

    Cormack and Grossman do not provide, therefore, even a valid measure of what their CAL system has accomplished. The CAL system may be quite good, but we cannot tell that from this study without validation on an independent set of documents that were not generated by the CAL system itself.

    In text classification, the more disjoint the relevant topics are (where disjoint topics share few target words or other features), the more difficult it will be for a method like Cormack and Grossman’s to find them all.  Their method depends on the starting condition (the seed set) because it ranks documents by their resemblance to the seed set.  It is very difficult to add to the seed set in their situation, because documents that are not similar to the seed set are not encountered until the similar ones have been exhausted. And in their study, those documents would automatically have been counted as non-responsive.

    Random sampling is known to be slower than certain other training algorithms, under some conditions, but more resistant to noise.  As mentioned, it is also more resistant to training bias because the items selected for review are not determined by the success or failure of previous predictions. Whether it will, in fact, be poorer depends on the complexity of the space. Even if it is somewhat slower, though, it still has the advantage of being open to identifying disjoint responsive documents.


  • 01 May 2014 1:49 PM | Chere Estrin (Administrator)
    We live in a world where most of our actions are recorded in some way, shape or form. Most individuals today, from kids through seniors, have access to email, smart-phones, tablets, social media, and other cloud-based apps that are used for communicating and ultimately can be used in civil and criminal proceedings. Some of us refer to this as the post-Snowden era, and there's some credence to the idea that privacy, as we knew it, no longer exists. We're not going backwards.

    Whether it's Sterling and the NBA or this week's alleged Benghazi smoking-gun email that came to light, electronic discovery is going to be a part of the majority, if not all, of civil and criminal litigation going forward. We're at a time where certification and education are extremely important for those who have spent time in the industry as well as for those who are in college and seeking jobs in the IT, security, and legal fields. It doesn't just touch lawyers anymore; it's wide-ranging, from lawyers to technologists, paralegals and more.

    Authored by Jeff Fehrman, OLP Board of Governors; CSO, Mindseye Solutions
  • 30 Apr 2014 12:07 PM | Chere Estrin (Administrator)
    News surrounding the NBA’s sanctions against Donald Sterling is flooding the internet with buzz ranging from Clippers’ fans ready to restore focus on their play-off team, to ethical divides about the events that resulted in the now infamous recorded private conversation. While the circumstances have been handled out of court thus far, little doubt remains that somewhere down the line, a case will be filed. What considerations will be admissible? How will eDiscovery occur? And, what are the methods that could be employed to make or defend a case of this nature?

    Herb Roitblat, Chief Scientist and Chief Technology Officer at OrcaTec and OLP Board of Governors member, offered his comments on the matter, “The old adage is, don’t say or write anything down that you don’t want to appear on the front pages of the New York Times, or in this case, the Los Angeles Times. No matter who is involved, there’s bound to be a case at some point from this. And, central to that, there’s bound to be questions concerning who knew what and when?”

    “If Sterling’s remarks on the recording released last week are any indication, there’s likely to be other similar remarks lurking in other electronic repositories of data, including mobile phones, tablets, and VoIP,” said Fernando Pinguelo, Esq., Partner/Chair, Cyber Security & Data Protection and Crisis Management groups at Scarinci Hollenbeck, and OLP Board of Governors member. He added, “With the Internet of Things comes untold ways in which one’s deepest thoughts and opinions can be recorded and exposed for all to see. Any pending or future lawsuits implicating Sterling will likely involve electronic discovery, including the duty to preserve relevant data in any form.”

    As technology evolves so too does the user’s intent and possible ramifications. For legal professionals following this headlining story, considering their own knowledge and ability to assist with or lead in litigation support and eDiscovery should be top of mind. Do they have the proper training and are they up to date with current standards on finding and investigating expert witnesses, document review, and eDiscovery legal ethics and best practices? Having a working knowledge of the latest technologies and legal principles could be the difference-maker between a win or a loss.

    For a decade now, eDiscovery has been a crucial area of the American legal system. OLP has developed extensive programs through webinars and live, interactive online courses designed to educate legal professionals and their teams in eDiscovery to become skilled practitioners in this high-demand specialty. For more information, visit the OLP upcoming webinars and courses.