Digital Humanities and Corpus Linguistics: JCU Welcomes Fabio Ciambella

On February 21, 2023, the JCU Department of English Language and Literature and the Department of Mathematics, Natural, and Applied Sciences welcomed Dr. Fabio Ciambella for a talk on “Corpus Linguistics.” The talk, part of the “DH@JCU2023 – Digital Humanities in Practice” lecture series, was moderated by JCU English Professor Alessandra Grego.

Digital Humanities

Ciambella holds a Dottorato di Ricerca from the University of Rome “Tor Vergata,” and is currently a full-time Research Fellow of English Language and Translation at Sapienza University of Rome. His research interests include the relationship between dance and early modern and Victorian literature and language, historical pragmatics, corpus linguistics, and Second Language Acquisition, on which he has published extensively.

Ciambella began by defining “corpus” and “linguistics.” He explained that a corpus is “a machine-readable collection of spoken or written texts that were produced in a natural communicative setting, and that is compiled with the intention to be representative and balanced with respect to a particular linguistic variety of register or genre, and to be analyzed linguistically.” Linguistics is the “scientific study of language, and its focus is the systematic investigation of the properties of particular languages as well as the characteristics of language in general.”

Corpus linguistics, as Ciambella explained, is a methodology for computer-based empirical analysis of language use performed in large, electronically available collections of naturally occurring spoken and written texts. The aim of corpus linguistics is not only to test existing theories but also to help formulate new ones, by observing emerging patterns that would be invisible without the aid of digital research tools.

Various corpora are accessible online. Ciambella mentioned the British National Corpus, created by Oxford University Press in the early 1990s, the Corpus of Contemporary American English (COCA), and the Corpus of Historical American English (COHA). He explained that corpora offer a more accurate description of language, larger language samples, contextualized real usage of a word, and examples of specific registers.
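To make “contextualized real usage of a word” concrete, the following is a minimal sketch of a keyword-in-context (KWIC) concordance, the kind of view that corpus tools typically provide. It is an illustration only and was not shown in the talk: the function name `kwic`, the `window` parameter, and the sample sentence are all assumptions made for this example, and the tokenizer is deliberately simplistic.

```python
import re

def kwic(text, keyword, window=3):
    """Return keyword-in-context lines: up to `window` tokens on each side of every hit."""
    # Naive tokenizer (an assumption for this sketch): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Invented sample text, used only to illustrate the idea.
sample = (
    "The dance began at dusk. She would dance until morning, "
    "and no dance was ever the same."
)

for line in kwic(sample, "dance"):
    print(line)
```

Each printed line shows one occurrence of the search word in square brackets with its neighboring words, so a reader can compare how the word behaves in different contexts. A real corpus study would apply the same idea to millions of words rather than to a single sentence.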

The use of computers is fundamental in corpus linguistics, which is one of the first fields of intersection between digital approaches and humanistic areas of inquiry. Digital collection and storage of large amounts of language data, rapid automated processes, and easy repeatability of research all allow for the replicability of studies and the checking of the statistical reliability of the results. Corpus linguistics, which is closely related to computational linguistics, Ciambella concluded, is a perfect example of how humanities research can be empowered and broadened by the introduction of digital tools, allowing researchers to study language use and language change over time.

He then mentioned three tools that can serve as an introduction to corpus linguistics. The first is #LancsBox, free software developed at Lancaster University, which is among the best-known universities for corpus linguistics. The second is AntConc, also free, which takes its name from its creator, Laurence Anthony. The third is Sketch Engine, software developed in the Czech Republic that has a moderate cost for individual academic users but allows for more advanced research. Finally, Ciambella gave a practical demonstration of how easy it is to use a further tool, Voyant, which can be downloaded and put to use very quickly to provide simple but useful answers.

Dr. Ciambella concluded his talk with a Q&A session. One interesting question was “Why not use Google for corpus linguistics?” He explained that treating the web as a corpus is risky because the analysis might include random texts that are not scientifically relevant, or texts in which words are misused. The web is therefore not reliable for corpus linguistics, and the use of dedicated tools is recommended.