
The role of natural language in driving innovation in the legal sector

A focus on multilingual legal data, legal tech and global legal education

While the debate on whether data is the new oil is ongoing (compare Wired Magazine’s No, Data Is Not The New Oil, February 2019, with The Economist’s The World’s Most Valuable Resource Is No Longer Oil, But Data, May 2017), the rising importance of data is becoming obvious to an increasing number of people. This includes the importance of data access and data ownership, the vast digitization of data, and finally the use of big data to train algorithms and create intelligent machines with ever-increasing applications in the business sector and beyond. Along with its many promises (efficiency, predictive analytics, artificial intelligence), the rise of big data has also brought a number of threats, such as the rise of data monopolies [1], serious challenges for data protection and privacy [2], and the much-discussed phenomenon of algorithmic bias, which arises when biased data are fed into algorithms [3].

Algorithmic uses of big data have already transformed a number of sectors. Take, for example, the advertising sector, where big consumer data is used to predict consumer preferences and thereby facilitate targeted advertising. Big data is also increasingly used in the healthcare sector, leading to critical changes in predictive medicine and also in the insurance sector.

Another example is the use of big data and predictive analytics in policing. This article focuses on the use of big data in the legal sector, and specifically on the development of legal tech tools: how legal resources (legislation, judicial decisions, court submissions and other resources, such as legal treatises and scholarship) are used to train algorithms that will provide important tools for lawyers and judges.

Data is relevant for law, and the data revolution is relevant to the future of the legal profession and, inevitably, of legal education. Big data matters for the future of the law in many ways: data analytics can help us predict how judges decide, and can thus become a powerful tool in the hands of both lawyers and judges. And while we might not yet be ready to accept a robot judge, AI treaty drafting or a robot negotiator, the transformation of legal practice through legal tech innovations is already underway. Legal education has also been affected: law school curricula are adapting quickly to provide students with the technical understanding and skills necessary to enter a legal world where the management of data is of critical importance, where legal tech startups are proliferating, and where machine learning and artificial intelligence are of the utmost relevance for the future of the profession.

A relatively understudied element of computerized data, especially data relevant to legal tech (mostly digitized or born-digital legal resources), is the natural language in which the information exists and, in particular, the multilingual nature of legal resources across jurisdictions. English legal resources can help build algorithms that could be useful in litigation by, for example, predicting or helping to predict legal outcomes in English courts. Such data include pleadings, other submissions to courts, legal decisions, legislation, regulation, treatises, etc. While the legal world is becoming increasingly global, linguistic diversity among different jurisdictions and legal cultures persists. Is this diversity reflected in the development of legal tech tools?

The challenges of working with different languages in creating, interpreting and applying the law are substantial, even in the context of experienced multilingual supranational structures. Case in point: the EU institutions, which are among the most important multilingual legislative bodies in the world, and their grappling with the 24 official languages of the Member States. Suffice it to think of the role of language in the European Parliament's deliberations, or of the role of the legal translation service that officially translates new directives and regulations as they come out of the Council of the European Union [4]. Finally, think of the Luxembourg courts hearing pleadings in different languages and publishing decisions in all official languages. While the legal value of each language within the EU is the same, the languages are not functionally equal (or at least equivalent).

Indeed, there is a de facto dominance of certain languages, as legal (and other) professionals gravitate towards one or a few common languages to read, write, and communicate legal texts. Thus, the development of legal tech tools is bound to be faster for the de facto dominant or widely spoken languages. This conclusion also applies in broader contexts outside the EU. In the global context, especially in certain legal domains such as international contracts, English has been established as the lingua franca. It is thus to be expected that in this domain the development of legal tech tools will also predominantly use data written in English and subsequently produce results in the same language.

Furthermore, the ethnography of the legally relevant data that are actually gathered and then fed into legal tech algorithms in order to create algorithmic tools is not necessarily representative of the real ethnography of legally relevant data. Dominant languages, such as English, Spanish and French, seem to have the biggest potential for experimentation and growth in the legal tech space. The volume of data for these languages is large, and thus the quality of the algorithmic results rises. At the same time, jurisdictions whose official languages are spoken by fewer people offer a smaller volume of data, which might affect the quality or effectiveness of such algorithms. On the flip side, less widespread languages offer niche markets for legal tech tools, that is, algorithms that can read and process legal resources in those languages, for example in Italian, Czech or Greek. Finally, experts from those jurisdictions must be involved in order to determine which resources are relevant for building such tools and which are not. This creates space for another niche market of legal experts, including legal translators. Indeed, a national expert can point to the relevant legal databases that include legislation and case law, and can also identify good and bad law as well as possible gaps and how to fill them.

Given the language silos between legal jurisdictions, and also the differences in legal cultures in the broadest meaning of the term (the most notable example being judges’ different writing styles in different jurisdictions), there is ample room for a diverse market of legal tech tools, just as there is room for a diversity of languages across legal jurisdictions.

The evolution of the legal market drives changes in legal education. What are the implications for legal education at a global level? As legal education becomes increasingly globalized, should future lawyers and judges also be trained to build and use legal tech tools? And if so, in which language(s)? Arguably, besides the tech skills that are increasingly becoming a learning priority in legal education, the lawyer of the next generation can benefit from a systematically comparative reading and understanding of the law, from familiarity with multiple jurisdictions and, ideally, also from multilingual skills. These are skills and assets that would allow future lawyers to participate actively in the creation of legal tools in their respective jurisdictions and in niche areas of practice. One could predict that fluency in more than one of the most widely spoken natural languages will remain a valuable asset in facing the new challenges and opportunities that the recent transformations of the legal profession bring about. And this, I argue, now extends also to the transformations in the domain of legal tech.



[1] See, for example, Tim Wu, The Curse of Bigness: Antitrust in the New Gilded Age (2018).

[2] The Medium, The Privacy Paradox: Is the End of Privacy Inevitable? The Price of Convenience, August 2017, at

[3] See Alex Campolo, Madelyn Sanfilippo, Meredith Whittaker, & Kate Crawford, AI Now 2017 Report, available at

[4] General Secretariat of the Council Directorate Report, The language service of the General Secretariat of the Council of the European Union - making multilingualism work (2012), available at


About the Author

Argyri Panezi (LL.M., Ph.D.) is an expert in law and technology and intellectual property. She is a professor of law and technology at IE Law School.

Argyri received her LL.B. from the University of Athens, her LL.M. from Harvard Law School and her Ph.D. from the European University Institute in Florence. Her thesis examined the legal challenges for the creation of digital libraries and explored normative directions for copyright rules and exceptions currently applicable to libraries. Since 2018 she has worked as a research fellow at the Digital Civil Society Lab at the Stanford University Center on Philanthropy and Civil Society, where she is exploring digital civil society interactions with libraries and other cultural heritage institutions and the nexus of AI and digitization. Prior to her doctoral studies she practiced competition law in a private law firm in Brussels.
