Language Corpora

The term language corpus is used to mean a number of rather different things. It may refer simply to any collection of linguistic data (for example, written, spoken, signed, or multimodal), although many practitioners prefer to reserve it for collections which have been organized or collected with a particular end in view, generally to characterize a particular state or variety of one or more languages. A text corpus is a large and unstructured set of texts (nowadays usually electronically stored and processed) used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.
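As a rough illustration of "checking occurrences" in a text corpus, the sketch below counts word frequencies in a tiny invented corpus. The sample sentences, the `tokenize` and `word_frequencies` helper names, and the simple letters-only tokenization are all assumptions made for this example; real corpus tools typically use more careful tokenization and much larger collections.

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real corpus would be far larger and loaded from files.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A dog barked.",
]

def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(texts):
    """Count how often each token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

freqs = word_frequencies(corpus)
total = sum(freqs.values())

# Report raw and relative frequency for a few words of interest.
for word in ("the", "cat", "dog"):
    print(word, freqs[word], round(freqs[word] / total, 3))
```

Relative frequency (count divided by total tokens) is shown alongside the raw count because raw counts are not comparable across corpora of different sizes.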

Acquis Communautaire (AC)
The Acquis Communautaire (AC) is the total body of European Union (EU) law applicable in the EU Member States, and currently comprises selected texts written between the 1950s and now. The corpus is a collection of parallel texts in the following 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish.
Australian National Corpus
The Australian National Corpus is a discovery service that collates and provides access to assorted examples of Australian English text, transcriptions, audio and audio-visual materials.
BYU Law & Corpus Linguistics
Designed specifically for lawyers and scholars, the new Law and Corpus Linguistics Technology Platform for linguistic analysis includes: The Corpus of Founding Era American English; The Corpus of Early Modern English; The Corpus of Supreme Court of the United States.
Chinese corpora
A collection of Chinese corpora and frequency lists provided by Leeds University. Please note that access may not be available outside Leeds University. The University is working to solve this issue.
Chinese-English Parallel Corpora
Aligned sentence pairs from quality bilingual texts, covering the financial and legal domains in Hong Kong. The sources include government legislation and regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, corporate constitutional documents and others.
corpus.byu.edu
The most widely used online corpora, with more than 130,000 distinct researchers, teachers, and students each month.
Demo corpora for teaching
Small to moderate-sized text collections for teaching, text-analysis workshops, etc.
European language corpora
Collated by Humboldt University Berlin, Faculty of Language, Literature and Humanities
Japanese corpora
Corpora built by the National Institute for Japanese Language and Linguistics.
Open Korean Corpora
Open Korean Corpora provides a list of freely accessible and downloadable datasets.
Parallel corpora
Scroll down past navigation. This page is your 'shopping list' for parallel texts.
Research Centre for Professional Communication in English - Corpora resources
Department of English, The Hong Kong Polytechnic University
SEAlang Library
SEAlang Library resources include bilingual and monolingual dictionaries, monolingual text corpora, aligned bitext corpora, and a variety of tools for manipulating, searching, and displaying complex scripts.
Virtual Language Observatory
Search through hundreds of thousands of language resources, browse and use facets to narrow down to your language of interest and resource type (corpora).
Wikipedia - list of text corpora
A list of text corpora in various languages collated by Wikipedia.

Corpus linguistics: A guide to the methodology by Anatol Stefanowitsch
Publication Date: 2020
Corpora are widely used in linguistics, but not always wisely. This book attempts to frame corpus linguistics systematically as a variant of the observational method. The first part introduces the reader to the general methodological discussions surrounding corpus data as well as the practice of doing corpus linguistics, including issues such as the scientific research cycle, research design, extraction of corpus data and statistical evaluation. The second part consists of a number of case studies from the main areas of corpus linguistics (lexical associations, morphology, grammar, text and metaphor), surveying the range of issues studied in corpus linguistics while at the same time showing how they fit into the methodology outlined in the first part.
Corpus Linguistics for Education by Pascual Pérez-Paredes
Publication Date: 2020
Taking a hands-on approach to showcase the applications of corpora in the exploration of educationally relevant topics, this book:
- covers 18 key skills including corpus building, the role of frequency, different corpus methods, transcription and annotation;
- demonstrates the use of available corpora and desktop and online corpus analysis tools to conduct original analyses;
- features case studies and step-by-step guides within each chapter;
- emphasises the use of interview data in research projects.
Understanding Corpus Linguistics by Danielle Barth; Stefan Schnell
Publication Date: 2022
This textbook introduces the fundamental concepts and methods of corpus linguistics for students approaching this topic for the first time, putting specific emphasis on the enormous linguistic diversity represented by approximately 7,000 human languages and broadening the scope of current concerns in general corpus linguistics.
