Developments in Legal Information Retrieval
Added value from content integration and knowledge based systems
Digital legal sources
In recent years, ever more legal information has become available digitally. With that, the importance of these sources that can be consulted online has increased tremendously. For practicing lawyers as well as for law school students, ‘legal databanks’ often have become the primary source of information. The information concerned is no longer limited to just legislation and case law, as the major part of legal literature and practically all legal journals can now be consulted in digital format as well.
In fact, many professional resources are now available only in digital form, with no printed version being published (at least not officially), and their number will probably keep increasing. An example, indispensable for practically every British lawyer, is the extensive collection of case reports and legislation available from the British and Irish Legal Information Institute (BAILII), which currently contains more than 400,000 documents and grows by around 16,000 case reports every year. Furthermore, there is a huge and still increasing number of web sites that publish ‘legal news’ online, such as Lexology and International Law Office. Examples of digital-only journals are the European Journal of Law and Technology (EJLT), which has already been published by UK law schools for over two decades, and several law reviews compiled and published by universities. Finally, apart from these ‘external sources’, many law firms these days have extensive collections of ‘internal’ documents, for instance containing know how, that can be consulted via the firm’s internal network (intranet).
As an internet connection is available at practically every legal workplace these days, documentation in digital format can be consulted right at the lawyer’s desk, whereas the paper equivalent could only be obtained from the library. Another advantage is that searching in digital collections can be faster and more efficient. A case report or an article from a journal can be retrieved, even if the exact location is not known, by querying the database using (combinations of) relevant terms.
A major disadvantage of the multitude of available digital sources is that these sources cannot be used and searched in a uniform way. The publishers of publicly available information as well as commercial databases each apply their own search mechanism, with a proprietary user interface. The same goes for collections of documents in an organization’s know how system. Because of that, queries to retrieve information often have to be formulated and executed multiple times, in all these separate databases. Results from each separate query have to be gathered and combined, in order to obtain a list of all relevant sources eventually. That is one of the main reasons why ‘content integration’, which enables professionals to consult all relevant sources at the same time, currently receives a lot of attention, in legal practice and other sectors.
When the issue is not only to find and consult the correct sources, but also to transfer their contents efficiently to clients or colleagues, so-called ‘knowledge based systems’ can be a useful addition as well.
The term (Legal) Knowledge Based System refers to computer software capable of making very specific pieces of knowledge available to users, tailored to their needs – for instance, a case they are dealing with – and adjusted to what they already know. Such knowledge based systems usually operate within a previously determined domain (an area of law, or even a single problem within that area). For that reason, their application needs to be carefully considered, but under the right conditions they can bring unique advantages.
Content integration and content aggregation
Content integration (CI) systems are capable of retrieving data from multiple (digital) sources, external as well as internal ones and publicly available as well as commercial ones. This technology in itself is not new. Several publishers already apply certain forms of content integration to bundle their own databases and to make it possible to query these in one go. Examples of this are ‘Westlaw UK / Next’ and the combined sources offered in LexisNexis. Although such ‘information portals’ definitely are useful and have contributed considerably to the simplification of searching the included databases, also for inexperienced users, they are seldom complete in the sense that they contain all legal sources a user needs.
Being publishers’ products themselves, they usually do not offer access to sources of other (competing) publishers. They do contain publicly available materials (for instance legislation), but often only a limited selection of these. Therefore, it is not possible to search all required resources by means of a single query that yields a single list of results, with each ‘hit’ ranked optimally in that list. The same of course goes for simultaneously searching external sources and internal ‘know how’ documents.
The fact that combining sources from different publishers in one retrieval system was not possible was considered an important drawback by many Dutch lawyers. As no ready-made products to solve this issue were available initially – say, at the beginning of the new millennium – several major law firms developed (partial) solutions themselves. They obtained licenses to store publishers’ content themselves, and to run their own retrieval systems on the integrated set. Some of these systems have been in use for more than a decade, although most have now been replaced with ‘outsourced’ (usually commercial) solutions.
Since then, several alternatives have become available. Companies have systems on offer that make it possible to retrieve data from different origins (be it commercial, publicly available or internally owned) by means of one single query, resulting in one single, ordered list of results. In the Netherlands, two companies are active on this market, named ‘Legal Intelligence’ and ‘Rechtsorde’. The solutions these companies offer have many similarities. They enable a law firm to retrieve documents and other data from all licensed resources in one go, through a specific legal information ‘portal’ customized for their needs. This portal offers the possibility to search in the usual ways, for instance by means of keywords, and to refine search results where needed. Search and ‘drill down’ options have been specifically tuned to the legal content involved. Searching can for instance take place based on articles from legislation, court names, verdict dates, case numbers or identifiers of parliamentary documents.
An important added advantage of these content integration systems, apart from the integration of sources, is that they have been equipped with improved possibilities for searching and selecting documents. These improvements are in fact a real necessity in this case, given the huge number of documents – often at least 4 or 5 million – that are available from the joined databases.
Furthermore, options have been added to store search results in a structured way, for later re-use, and to notify users of newly added documents on particular subjects. With all that, these systems can have considerable added value compared to the searching in separate databases. This will be illustrated in the next section using one of the available content integration products, namely that of Rechtsorde.
Before that, however, I would like to explain that Content Integration (CI), as described here, has to be distinguished from Content Aggregation (CA). The latter term is used for services that do not actually integrate document collections, but are capable of ‘commanding’ separate searches in multiple existing document collections, from one central interface. Different from CI systems, CA usually implies that the actual searching is performed by the original database search engines and results are combined afterwards. For browsing purposes, aggregator sites often download brief descriptions (for instance: titles and abstracts) from the separate document collections. When a user then selects one of these, or clicks on a ‘hit’ presented by the search function, the corresponding document is retrieved from the database where it resides, and is shown from there. Aggregation systems are relatively easy to implement, as the majority of professional databases not only provide user interfaces that give us the possibility to search and browse their contents, but also so-called web services that can be consulted by automatic processes (such as the search algorithm of a content aggregator’s retrieval system). That means that no special software needs to be developed to perform these ‘distributed search operations’. The results of CI are often better than those of CA, however, as only a CI system can truly integrate sources, for instance by creating new crosslinks (from one source to another) based on the document content.
The CI system Rechtsorde.nl is produced by a company with the same name, today a wholly owned subsidiary of Sdu Publishers in The Netherlands, and through that of the French ELS publishing corporation. Rechtsorde has existed since 2005 and initially focused on integrating internet sources that are publicly accessible, such as legislation and official publications of the Dutch government, as well as EU legislation and case law from, for instance, the EUR-Lex web site.
An obvious next step was the extension of the information on offer with important commercial sources from legal publishers, such as annotated case law collections, professional journals, law reviews and reference works. This of course was only possible in close cooperation with the respective publishers. At the same time, the system was adapted for the addition of a law firm’s own ‘internal sources’, such as know how collections. Rechtsorde is currently used by almost 30,000 lawyers and law students, from hundreds of law firms, companies, universities, libraries and governmental organizations. The system can retrieve information from around 1800 different publications (web sites, journals, reference books, literature, etc.) where each publication can consist of hundreds or even thousands of different issues or parts. The total number of available legal documents that can be consulted through Rechtsorde currently exceeds 8 million.
Practicing lawyers do not only use external (public and/or commercial) sources. Every law firm accumulates considerable knowledge and expertise, a lot of which is stored in the form of documents. These documents are practically always stored digitally these days. Some larger firms use special Document Management Systems (DMS) for this, which make it possible to store each document that is created, from simple one-line e-mail messages to 200-page agreements, in a central database automatically. Often, these document collections also contain subsets, such as ‘knowledge documents’, ‘model agreements’, ‘model letters’, etc. Given this, it makes sense to include such more or less structured internal document collections – which are, by the way, increasingly stored ‘cloud based’ – in a CI system. A lawyer can then access all digital legal information the firm has at its disposal from one single interface and by means of one single query.
Although many users are enthusiastic about the prospects of all this, there are certainly a few areas of concern. The most important of these is the security of the data involved. Law firms are usually very cautious about their internal data, which can include data about clients and accumulated knowledge from many employees, collected over a long period of time. This means that integration and use of internal information without proper security precautions is unthinkable. For this problem, several solutions have been developed over time.
A common element is that the documents involved are not allowed to leave the firm’s safe environment where they are stored. In order to make them available for retrieval in a CI system, sometimes a local indexing service is applied, which forwards indexes of the internal documents – properly encrypted, if necessary – to the central server where they are integrated in the complete collection. Another possibility, which is sometimes preferred, is to install a local CI server within the organization, which indexes and enables searching of internal documents, but also passes on each query to the external server where the rest of the content resides. The results of the internal and the external query are then combined before they are shown to the user. Rechtsorde so far in most cases has used the latter method, also known as ‘federated search’.
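The federated approach described above can be illustrated with a minimal Python sketch. It is not Rechtsorde's implementation: the names Hit, federated_search, internal_index and external_index are hypothetical, and the sketch assumes that relevance scores from the internal and the external server are directly comparable, which in practice usually requires some form of normalization.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Hit:
    doc_id: str
    title: str
    score: float   # relevance score, assumed comparable across backends
    origin: str    # "internal" or "external"


# A backend is any function that takes a query and yields hits.
SearchBackend = Callable[[str], Iterable[Hit]]


def federated_search(query: str, backends: List[SearchBackend], limit: int = 10) -> List[Hit]:
    """Send the same query to every backend and merge the hits into one ranked list."""
    best = {}
    for backend in backends:
        for hit in backend(query):
            current = best.get(hit.doc_id)
            # Keep only the best-scoring copy if a document is returned twice.
            if current is None or hit.score > current.score:
                best[hit.doc_id] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:limit]


# Stand-ins for the local (in-firm) index and the external CI server.
def internal_index(query: str) -> List[Hit]:
    return [Hit("kh-17", "Model dismissal letter", 0.82, "internal")]


def external_index(query: str) -> List[Hit]:
    return [
        Hit("case-1", "Ruling on unfair dismissal", 0.91, "external"),
        Hit("jrn-5", "Journal article on dismissal law", 0.64, "external"),
    ]


results = federated_search("dismissal", [internal_index, external_index])
for hit in results:
    print(f"{hit.score:.2f}  [{hit.origin}]  {hit.title}")
```

In this arrangement the internal documents themselves never leave the firm; only the query is passed on, and the merge takes place on the local server before anything is shown to the user.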
As will probably be clear by now, the use of internal content with a CI solution can put some demands on the IT infrastructure of a law firm. Furthermore, it really helps if the internal document collection is properly structured, for instance by document type and/or area of law. Also, it should be ascertained that, for instance, obsolete or personal data are not included in the materials that are made accessible to every firm employee. But if such demands are met – larger firms have often already taken care of that – the integration of all this content can result in a very powerful, fast and at the same time user-friendly instrument for consulting and processing legal information.
A new way to search information
As shown in the previous paragraphs, content integration causes large amounts of information to be joined together, originating from different sources. In general, one would expect that a user would find more useful materials from such a larger collection. But that is not always the case. The area to be searched is more extended, therefore the requirements for a method to distinguish relevant from irrelevant information are also higher. Just as is the case for a search engine aimed at world wide web pages, such as Google, which usually also generates thousands or even millions of hits, it is important that the list of results is ‘filtered’ as much as possible and that the most relevant hits are positioned at the top of the list.
This means that for a legal content integration system, a powerful search function adapted to the requirements of its users is absolutely necessary. It should enable users to select exactly the correct information. Otherwise, chances are that the user will content himself with only a limited selection from the available materials, even if this selection is a relatively random one (for instance, based on the presence of a single keyword). Such a selection would probably already contain a few dozen documents, of which several might be useful. The fact that, in the meantime, more than 90% of all available information might be missing from this initial set – because these documents do not contain the keyword that was searched for – is something most users are unaware of.
Searching by means of keywords is, meanwhile, still considered to be a practical and reasonably effective method by most users. It is, for that reason, still the basis for the majority of all search queries in most retrieval systems. The quality of results of these queries, however, can often be improved considerably by the addition of selection mechanisms that can be applied before or after the query is executed. An example of the first would be an option to select particular subsets of data in which the searching will take place. An example of the second would be a mechanism to refine search results afterwards, for instance by filtering those using criteria from ‘metadata’ or using additional keywords. For that reason, legal content integration systems usually contain mechanisms to preselect sources in which the searching takes place and to refine (‘drill down’) search results based on for instance the type or the publication date of documents. When for instance particular case reports are searched, the case number(s) can be used as additional search criteria, while the name of the author or the volume number can be used when searching journal content.
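The two selection mechanisms mentioned here, preselecting sources before the query and refining results afterwards, can be sketched as follows. This is an illustration only: the Document fields, the source names and the functions search and drill_down are assumptions made for the example and do not describe any particular product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    source: str
    doc_type: str
    published: date
    text: str


def search(docs, keyword, sources=None):
    """Keyword search, optionally restricted beforehand to selected sources."""
    keyword = keyword.lower()
    return [
        d for d in docs
        if (sources is None or d.source in sources) and keyword in d.text.lower()
    ]


def drill_down(hits, doc_type=None, published_after=None):
    """Refine an existing result list afterwards, using metadata criteria."""
    return [
        d for d in hits
        if (doc_type is None or d.doc_type == doc_type)
        and (published_after is None or d.published > published_after)
    ]


collection = [
    Document("case-law", "case report", date(2015, 3, 1), "Dismissal of an employee"),
    Document("case-law", "case report", date(2009, 6, 12), "Dismissal after reorganisation"),
    Document("journals", "article", date(2016, 1, 20), "Trends in dismissal law"),
    Document("news", "news item", date(2016, 5, 2), "New dismissal rules announced"),
]

# Step 1: preselect the sources in which the searching takes place.
hits = search(collection, "dismissal", sources={"case-law", "journals"})
# Step 2: refine the results by document type and publication date.
recent_cases = drill_down(hits, doc_type="case report", published_after=date(2010, 1, 1))
print(len(hits), len(recent_cases))
```

Further criteria such as case number, author or volume number would simply be additional optional parameters of the same kind.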
To summarize: content integration is not only a matter of combining as many sources as possible. Precisely as a result of that process of combining, the use of a reinforced search mechanism becomes necessary, to ensure that the user, notwithstanding the huge amount of available information, remains capable of selecting all relevant information as quickly and efficiently as possible.
Ranking search results
Apart from methods to select documents efficiently, retrieval systems have another essential function. To make sure that the most relevant documents can be found quickly, given the method of selection applied by the user, the list of ‘hits’ (showing the user’s selection) should be sorted in the best possible way with respect to the (expected) relevancy of the documents.
To achieve that, the ranking of documents could for instance be based on the degree to which they correspond to the search query (documents with the largest number of important query terms on top), on the source they belong to (documents from the most authoritative source on top), or on their topicality (recent documents first). Such ranking criteria could be used separately or (more commonly) in combination, to achieve an overall ranking with the most relevant documents at the top of the list. Apart from that, users can often also switch to alternative ranking methods, such as a purely chronological ranking, if that is more appropriate for them.
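A combined ranking of this kind can be sketched as a weighted sum of the three criteria. Everything specific in the example below is an assumption for illustration: the weights, the authority values per source and the ten-year recency horizon are invented, and real systems use considerably more refined scoring.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    title: str
    text: str
    source: str
    published: date


# Hypothetical authority values per source, between 0 and 1.
AUTHORITY = {"supreme-court": 1.0, "journal": 0.6, "news": 0.3}


def term_match(doc, query_terms):
    """Fraction of the query terms that occur in the document text."""
    words = doc.text.lower().split()
    return sum(1 for term in query_terms if term in words) / len(query_terms)


def recency(doc, today, horizon_days=3650):
    """1.0 for a document published today, falling linearly to 0.0 at the horizon."""
    age = (today - doc.published).days
    return max(0.0, 1.0 - age / horizon_days)


def relevance(doc, query_terms, today, weights=(0.6, 0.25, 0.15)):
    w_match, w_authority, w_recency = weights
    return (w_match * term_match(doc, query_terms)
            + w_authority * AUTHORITY.get(doc.source, 0.0)
            + w_recency * recency(doc, today))


def rank(docs, query_terms, today, chronological=False):
    """Rank by combined relevance, or purely by date if the user prefers that."""
    if chronological:
        return sorted(docs, key=lambda d: d.published, reverse=True)
    return sorted(docs, key=lambda d: relevance(d, query_terms, today), reverse=True)


docs = [
    Doc("Ruling", "unfair dismissal of an employee", "supreme-court", date(2012, 1, 1)),
    Doc("News item", "dismissal rules change", "news", date(2016, 12, 1)),
    Doc("Article", "unfair competition overview", "journal", date(2015, 6, 1)),
]
today = date(2017, 1, 1)
by_relevance = [d.title for d in rank(docs, ["unfair", "dismissal"], today)]
by_date = [d.title for d in rank(docs, ["unfair", "dismissal"], today, chronological=True)]
print(by_relevance, by_date)
```

With these weights the older ruling still comes first, because it matches both query terms and comes from the most authoritative source; the chronological ranking reverses that order.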
Storing search results and notification
When performing research using sources, correct storage and processing of search results requires specific attention. Common commercial and public databases usually do not provide much more for this than a function to print (parts of) documents that were retrieved. Content integration systems, on the other hand, usually contain a much more extensive ‘dossier’ function. Users can save documents or parts of documents and can arrange these in digital filing systems. Usually, these filing systems also provide the possibility to add personal notes, hyperlinks and sometimes even extra files, which can be uploaded and put into the dossier. Elements from dossiers can be printed, sent via e-mail or exported, the latter for instance in the form of a word processing file.
A very convenient option in content integration systems is also the so-called notification function. This function entails that the system will monitor its sources for new information. When anything is added to, say, a journal (new edition) or a news source (new message), this addition is compared to criteria specified by the user, and if it complies with these, a notification is issued. This could take the form of an e-mail message, or a message in a special area of the system itself. Users can indicate very precisely what they want to be notified of. This could be a new (digital) issue of a magazine being published, but also a document being added to the total content, no matter from what source, which conforms to a certain search query. As the general idea is that all information that might be relevant to a (legal) user is made available in one single system, the notification process can be efficient, with for instance a single e-mail message each day or each week, in which message all notifications are bundled.
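The notification mechanism can be sketched as matching each newly added document against the users' stored criteria and bundling the matches into one digest per user. The Subscription class, the dictionary layout of a document and the function names are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subscription:
    user: str
    keywords: frozenset = frozenset()   # all keywords must occur in the text
    source: Optional[str] = None        # None means: any source


def matches(sub, doc):
    """Check one new document against one user's criteria."""
    if sub.source is not None and doc["source"] != sub.source:
        return False
    text = doc["text"].lower()
    return all(keyword in text for keyword in sub.keywords)


def build_digests(subscriptions, new_docs):
    """Bundle all matching new documents into a single digest per user."""
    digests = defaultdict(list)
    for doc in new_docs:
        for sub in subscriptions:
            # A document is listed once, even if several criteria match it.
            if matches(sub, doc) and doc["title"] not in digests[sub.user]:
                digests[sub.user].append(doc["title"])
    return dict(digests)


subscriptions = [
    Subscription("anna", keywords=frozenset({"dismissal"})),      # a stored search query
    Subscription("anna", source="Labour Law Journal"),             # any new issue of a journal
    Subscription("bob", keywords=frozenset({"privacy"})),
]
new_docs = [
    {"title": "Issue 2017/1", "source": "Labour Law Journal", "text": "Dismissal and severance pay"},
    {"title": "Court rules on data retention", "source": "news", "text": "Privacy concerns prevail"},
    {"title": "Tax update", "source": "news", "text": "New VAT rates"},
]

digests = build_digests(subscriptions, new_docs)
print(digests)
```

Running build_digests once a day or once a week over the documents added in that period yields the single bundled message described above.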
Knowledge Based Systems
A knowledge based system is generally defined as a system capable of certain forms of reasoning, applying knowledge in solving problems, offering advice, and undertaking a variety of other tasks. This reasoning – for instance evaluating conditions and drawing conclusions from that – takes place with respect to the ‘knowledge’ the system has access to. But what exactly does this knowledge entail? In most cases, it consists of coherent sets of data and/or information, for instance concerning a particular legal field, such as labor law: when is it necessary to add a particular clause to an employment contract? Here, the system is completely dependent on the knowledge that has been stored in it by its author (or programmer). By casting this knowledge in a particular format, for instance that of a series of so-called ‘production rules’ of the type ‘IF [condition] THEN [conclusion]’, the system can reason by evaluating the conditions. If a condition is fulfilled, the corresponding conclusion can be drawn. It might seem as if, while doing this, a computer operates independently and reaches conclusions autonomously. But we must not forget that all possible (combinations of) conditions and conclusions have to be anticipated and programmed in advance by the system builder.
If, at any time, a situation arises in which none of the available conditions can be fulfilled, the system is not able to draw conclusions and therefore cannot reason any further. That would in fact be an error in the program (a ‘bug’). Despite the presence of possibly very advanced and complex knowledge, a knowledge based system therefore in itself is not any smarter than other computer software. It is not capable of analyzing the substance of its knowledge, only of reasoning with it following predefined procedures.
All in all, a knowledge based system is a computer program that comprises knowledge in a specific format (for instance: a series of production rules). With that knowledge, the program can reason, to eventually reach a conclusion. Commonly, during the reasoning process it will become clear that certain data, used for instance in a condition that is being evaluated, are missing. Such data can then be requested from the user. By answering the questions the system poses, the user in fact enters all relevant data with respect to a certain case. The conclusion that will eventually be reached will then be relevant for that particular case. In the process, knowledge based systems are often capable of putting together certain documents, such as letters, conclusions or even full contracts or sets of general conditions. This could even be their main purpose for a practicing lawyer.
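The reasoning process described above, evaluating production rules and asking the user for missing data, can be sketched in a few lines of Python. The rule contents are invented placeholders and not statements of law, and the names Rule, infer and ask are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    conditions: tuple   # names of facts that must all be true (the IF part)
    conclusion: str     # name of the fact that is then established (the THEN part)


def infer(rules, ask):
    """Apply the rules repeatedly; call ask(name) for input data not yet known."""
    facts = {}
    derivable = {rule.conclusion for rule in rules}

    def holds(name):
        if name not in facts:
            if name in derivable:
                return False          # a conclusion that has not been reached (yet)
            facts[name] = ask(name)   # missing data: request it from the user
        return facts[name]

    changed = True
    while changed:                    # repeat until no rule adds anything new
        changed = False
        for rule in rules:
            if not facts.get(rule.conclusion) and all(holds(c) for c in rule.conditions):
                facts[rule.conclusion] = True
                changed = True
    return facts


# Invented example rules, for illustration only.
rules = [
    Rule(("add_confidentiality_clause", "contract_is_temporary"), "review_by_senior_lawyer"),
    Rule(("employee_sees_trade_secrets",), "add_confidentiality_clause"),
]

# In a real system these answers would come from a dialogue with the user.
answers = {"employee_sees_trade_secrets": True, "contract_is_temporary": True}
asked = []


def ask(name):
    asked.append(name)
    return answers[name]


facts = infer(rules, ask)
print(asked)
print(facts)
```

Note that the program only ever reaches conclusions its author put into the rules; a question is asked only when a rule actually needs the answer.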
The knowledge in a legal knowledge based system (LKBS) will usually have to be provided by a human expert. Such knowledge will probably have to be processed in certain ways in order to be useable, however. Let’s assume, for instance, the legal expert mentions a series of situations that have specific legal consequences. To enter this into an LKBS, we have to make sure that:
each of these situations is described in the form of one or more rules, in such a way that it can be tested if the situation occurs by checking the conditions of the rules (this is called ‘formalization’);
the legal consequences (if applicable) will be administered in the system, to make sure that further reasoning taking that into account is possible;
even if none of the situations proves to be at hand, the system will still be able to draw a conclusion (for instance that the legal consequences will not be applicable in this case) and will be able to continue reasoning.
Especially the latter possibility, in which none of the situations mentioned by the expert are appropriate, often causes problems. It is of course conceivable that the list of situations (entered by the expert) is not restricted and that there could be more situations with the same legal consequences. For that reason, we have to be careful when drawing negative conclusions. A final check by the same or another expert might be desirable.
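The three requirements listed above, and the caution about negative conclusions, can be sketched as follows. The situations and their thresholds are invented for the example and do not reflect actual legislation; the function assess and the field names are likewise hypothetical.

```python
def assess(case, situations, consequence):
    """Test the formalized situations and always return a conclusion."""
    # Requirement 1: each situation is a testable condition on the case data.
    matched = [name for name, applies in situations.items() if applies(case)]
    if matched:
        # Requirement 2: the consequence is recorded so further reasoning can use it.
        return {"conclusion": consequence, "applies": True,
                "because": matched, "needs_expert_check": False}
    # Requirement 3: a conclusion is drawn even if no situation applies. As the
    # expert's list may be incomplete, the negative result is flagged for review.
    return {"conclusion": consequence, "applies": False,
            "because": [], "needs_expert_check": True}


# Invented situations with invented thresholds, for illustration only.
situations = {
    "more_than_three_contracts": lambda c: c["contracts"] > 3,
    "chain_longer_than_24_months": lambda c: c["months"] > 24,
}

positive = assess({"contracts": 4, "months": 20}, situations, "permanent_contract")
negative = assess({"contracts": 2, "months": 12}, situations, "permanent_contract")
print(positive)
print(negative)
```

The flag on the negative outcome is one simple way to route such cases to the final check by an expert that the text recommends.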
Use in legal practice
Although crafting a knowledge based system can be quite labor-intensive, as follows from the previous section, and requires detailed knowledge about the subject at hand, popularity of these systems has increased quite strongly in the last few years. One reason for that is that software to develop a moderately complex system, the so-called ‘shell’ or ‘development studio’ software, is often quite user friendly these days. For that reason, legal experts without detailed knowledge of computers can make use of them, too. Furthermore, many legal organizations, such as law firms, already possess an IT infrastructure that would enable them to make knowledge based systems, once developed, available to employees and (certain) clients very easily. The latter application would for instance be possible through so-called ‘client portals’. These are in fact web sites law firms use to make selected information available exclusively to (specific) clients. This allows for interesting options, for instance to offer ‘automated model documents’. Such model documents enable clients of the firm to put together tailor-made legal documents, such as labor contracts, non-disclosure agreements or uncomplicated service agreements, using the specialized knowledge of their law firm, but without the need for any of the firm’s lawyers to be directly involved. The access to such automated model documents could be granted on a ‘fixed fee’ basis. Provided an agreement to this extent has been properly established by both parties, this can be a possibility that is economically viable for both the firm and the client.
The main reason to include knowledge based systems in this contribution about integrated information retrieval systems is that interesting perspectives arise when these two technologies are combined. One way to achieve that would be to add a knowledge based system to a content integration system as an ‘intelligent frontend’ for entering search queries. The LKBS could be set up to ask the user a few specific questions, suitable for defining a preselection of sources. After that, the actual searching could take place much more efficiently, which is increasingly important given the speed at which digital sources grow in size and complexity.
The other way around, content integration systems could constitute an important source for the building, use and maintenance of knowledge based systems. To start with the first, it will be obvious that having all relevant sources at hand is a huge advantage for authors of knowledge based systems, as it provides them with the possibility to obtain all information necessary to formulate a correct and complete set of rules and information to accompany these. The second option would occur when a content integration system and knowledge based system would be actively linked in such a way that they can be used together simultaneously. This is useful, as knowledge based systems usually contain large amounts of (background) information, next to the rules used to draw conclusions. This information can be consulted by the user when the system asks for input. For instance, think about a system capable of assembling an employment contract, valid for a restricted period of time. Such a system is likely to contain questions about possible previous employment contracts between the same parties.
Questions like that should be accompanied by information about the number of times temporary employment contracts can be renewed legally, before (as is the case in some European countries) a contract for unlimited time is prescribed. This information, which is based on current legislation, can be obtained ‘live’ from a connected content integration system, which if so desired can also deliver relevant legal comments and applicable case law with it. Even more interesting might be the possibility to facilitate maintenance for knowledge based systems. For this maintenance is definitely a point of concern within the legal domain. Changes in legislation or new case law can make changes in certain rules necessary.
The problem is, however, that the number of rules can be huge, causing the author to lose the overview. That might lead to maintenance not being performed timely and properly. Such problems could hamper the deployment of the LKBS and in fact have been the cause of systems being put out of service in the past.
An effective solution for this particular problem of maintenance can be the entering of so-called ‘metadata’ into knowledge based systems. The author of the system can enter data – invisible for the end user – indicating the specific articles from legislation that are decisive for the formulation of a particular rule and also what case law is of importance for it.
When these metadata are saved and output, they can be read by a connected content integration system which can then transform these data to ‘notification requests’ automatically. If at any time changes occur in the respective pieces of legislation or when new case law on the subject is published, the need to make changes to the corresponding parts of the knowledge based system will be indicated automatically. This facilitates the proper functioning of the knowledge based system, not only at present but also in the future, much more effectively than before. By that, the practical applicability of these systems can be expected to increase further.
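The link between rule metadata and notification requests can be sketched as a simple index from legal sources to the rules that depend on them. The identifiers (such as "statute:art-10") and the names RuleMetadata, notification_requests and rules_to_review are hypothetical; how a real content integration system reports changes is not specified in this text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleMetadata:
    rule_id: str
    legislation: tuple = ()   # identifiers of the articles decisive for the rule
    case_law: tuple = ()      # identifiers of case law of importance for the rule


def notification_requests(metadata):
    """Map every legal source identifier to the rules that depend on it."""
    watch = {}
    for meta in metadata:
        for ref in meta.legislation + meta.case_law:
            watch.setdefault(ref, set()).add(meta.rule_id)
    return watch


def rules_to_review(watch, changed_refs):
    """Given the sources reported as changed, list the rules that need attention."""
    affected = set()
    for ref in changed_refs:
        affected |= watch.get(ref, set())
    return sorted(affected)


# Hypothetical metadata as entered by the author of the knowledge based system.
metadata = [
    RuleMetadata("R1", legislation=("statute:art-10",), case_law=("case:2015-001",)),
    RuleMetadata("R2", legislation=("statute:art-10", "statute:art-12")),
    RuleMetadata("R3", case_law=("case:2016-044",)),
]

watch = notification_requests(metadata)
after_amendment = rules_to_review(watch, ["statute:art-10"])
after_new_ruling = rules_to_review(watch, ["case:2016-044", "statute:art-99"])
print(after_amendment, after_new_ruling)
```

Instead of rereading the entire rule base after every legal change, the author only has to inspect the rules returned for that change.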
Summary and conclusions
The role of digital information for legal practice has become very prominent in the past decade. The huge and still growing number of available sources – external, but often also within legal organizations – leads to an increasing demand for technology to select exactly the right documents from these enormous collections. Integrating sources as much as possible is a vital precondition to perform this retrieval process efficiently. In addition to that, ever more advanced search technology is necessary, which not only supports users optimally, but also connects as closely as possible to the professional information being accessed. At the same time, operating the system should remain a straightforward and uncomplicated process.
Content integration systems can fulfill these needs by adapting the information retrieval process to the workflow and needs of practicing lawyers. In doing so, searching and retrieving information becomes more efficient while the number of sources found increases. The added value created from that can be of decisive importance for the quality of service towards clients.
In the past, knowledge based systems were seen as a phenomenon with no connection to primary legal information sources, only to be used within carefully determined domains for very specific information needs. They were notorious for being hard to maintain, which made it difficult to use them in areas of law where changes occur often. Now that these systems can be linked to content integration systems, providing them with the option to connect to all relevant information sources, new possibilities to apply them occur. It allows information to be made available in an intelligent way, while at the same time offering solutions to common maintenance problems.
The latter becomes possible by adding metadata to knowledge based systems and at the same time adding an option to automatically generate alerts for legal changes relevant to these metadata. Developments with respect to content integration have been very swift in the legal area the last decade. Finally being able to use the huge collections of digital sources effectively and efficiently was an attractive perspective for many lawyers and this will probably be even more so in the future. The option to also use knowledge based technology, directly connected to the available legal sources, shows what is currently possible with respect to intelligent, integrated information retrieval and application and will probably boost that technology even more in the next couple of years.
First published Jan 2017 LegalBusinessWorld
Anat Hovav and Paul Gray, ‘Future penetration of Academic Electronic Journals: Four Scenarios’, in: Information Systems Frontiers, Volume 4 (2002), p. 229-244.
Erasmus School of Law, for instance, publishes the digital Erasmus Law Review, available on www.erasmuslawreview.nl.
Metadata are descriptive data often added to content, such as the name of the source the data come from, the date on which they were last updated, their author, etc.
Richard Susskind, The Future of Law, New York: Oxford University Press 1998, p. 120-121.
Kees van Noortwijk is professor of Law and Technology at Erasmus School of Law, Rotterdam, The Netherlands, and also works as a consultant for the Dutch company Rechtsorde BV.