CLARIN Portal INT

Welcome to the CLARIN portal of the Dutch Language Institute (INT). The INT is one of the three CLARIN B Centres in the Netherlands and it serves as an exclusive CLARIN B Centre for Flanders (Belgium). In fulfilling this role, the INT provides researchers, (assistant) professors and students with data and tools for linguistic research, and with advice about them.

The INT also offers assistance and an infrastructure to researchers or institutions that want to share data or tools that were developed in research projects in the social sciences and humanities. We archive the materials and make them available to other researchers and we also ensure that they can be found within the CLARIN infrastructure through the CLARIN search engine (Virtual Language Observatory). More information on how to deposit materials with the INT can be found here.

More data and tools for Dutch can be found at K-Dutch, the CLARIN Knowledge Centre for Dutch (CLARIN K Centre); at CLAPOP, the portal of the Dutch CLARIN community; and through the CLARIN Virtual Language Observatory, a metadata-based portal for all CLARIN language resources and tools.

If you don't have an academic account that already provides you with free access to CLARIN resources, you can register for a CLARIN account at the CLARIN account registration page.

Quick start

Resources and tools

AutoSearch: This demonstrator allows users to define one or more corpora and upload data for the corpora, after which the corpora will be made automatically searchable in a private workspace.
Users can upload text data annotated with lemma and part-of-speech tags in TEI or FoLiA format, either as a single XML file or as an archive (zip or tar.gz) containing several XML files. Corpus size is limited to begin with (25 MB limit per uploaded file; 500,000 token limit for an entire corpus), but these limits may be increased at a later point in time. The search application is powered by the INT BlackLab corpus search engine. The search interface is the same as the one used in, for example, the Corpus of Contemporary Dutch / Corpus Hedendaags Nederlands.
CGN: The Corpus Gesproken Nederlands (Spoken Dutch Corpus) is a collection of 900 hours (almost 9 million words) of contemporary spoken Dutch from native speakers in Flanders and the Netherlands.
The speech recordings are aligned with several transcriptions (e.g. orthographic, phonetic) and annotations (syntax, POS-tags). Metadata, lexica, frequency lists and the Corex tool, which can be used to explore the data, are included.
Cornetto-LMF (Lexicon Markup Framework): Cornetto is a lexical resource for the Dutch language which combines two resources with different semantic structures.
It includes the Dutch Wordnet which organizes words in sets of synonyms (synsets) and records semantic relations between them. It also includes the Dutch Reference Lexicon which organizes words in form-meaning units (lexical entries) and describes them with short definitions, usage constraints, selection restrictions, syntactic behaviours, combinatorial information and illustrative contexts. Cornetto can be considered as the combination of a thesaurus and a dictionary. It is accessible for human use via a web browser and it is also available in XML for computational use (opensourcewordnet). Cornetto has circa 177,000 lexical entries and 70,000 synsets.
Corpus of Contemporary Dutch (Corpus Hedendaags Nederlands): A collection of more than 800,000 texts taken from newspapers, magazines, news broadcasts and legal writings (1814-2013).
The corpus is a combination of the 5, 27 and 38 Million Words Corpora and the PAROLE Corpus, supplemented with newspaper texts from NRC and De Standaard (until 2013).
Corpus Gysseling: The Corpus Gysseling made available here consists of the collection of all thirteenth-century texts that have served as source material for the Early Middle Dutch Dictionary.
It is the digital edition, enriched with part of speech and lemma, of the thirteenth-century material from the Corpus of Middle Dutch texts (until the year 1300), issued in the period from 1977 to 1987 by the Ghent linguist Maurits Gysseling.
Corpus VU-DNC (VU University Diachronic News text Corpus): The VU-DNC Corpus is a diachronic Dutch newspaper corpus (VU Free University Dutch Newspaper Corpus).
The corpus consists of data from five newspapers: Algemeen Dagblad, NRC (Handelsblad), De Telegraaf, Trouw and De Volkskrant. For each of the newspapers, data from two periods (1950/1951 and 2002) are available. The articles were selected by topic (e.g. headline news, foreign news and sports). A special feature of the corpus is that both the presence of subjective elements in the articles and the presence of direct speech have been annotated. The subjective elements are annotated based on a set of lexical elements (subjectivity lexicon). As a result, the corpus is very useful to linguistically oriented researchers who are interested in diachrony and/or subjectivity and to communication scientists and media scholars who are interested in changing practices regarding the framing of coverage.
Dictionary of the Frisian Language (Woordenboek der Friese Taal): The "Wurdboek fan de Fryske taal" is a scientific, descriptive dictionary containing about 120,000 entries.
The dictionary articles provide information on the spelling, part of speech, pronunciation, inflection, etymology, meaning (illustrated with quotes), compositions and derivations of each keyword, along with idiomatic information (collocations, proverbs and figurative meanings).
DuELME-LMF (Lexicon Markup Framework): Please note that the online application to search DuELME was deactivated in October 2023. For more information about the download version of the lexicon, see Taalmaterialen.
DuELME is a lexicon of more than 5,000 Dutch multiword expressions. Expressions with the same syntactic pattern are grouped into so-called Equivalence Classes, which makes it possible to integrate the lexicon into an NLP system with minimal manual effort. The lexicon was developed within the framework of the IRME project. (Documentation)
The Dutch Parallel Corpus: The Dutch Parallel Corpus (DPC) is a 10-million-word, sentence-aligned parallel corpus for the language pairs Dutch-English and Dutch-French, with Dutch as the central language.
The corpus contains five different text types and is balanced with respect to text type and translation direction. The entire corpus has been aligned at sentence level and further enriched with linguistic information (lemmas and PoS-tags). A small subset of the Dutch-English part has also been manually aligned at the sub-sentential level.
GaLAHaD: GaLAHaD (Generating Linguistic Annotations for Historical Dutch) provides a flexible environment for automatic enrichment and the evaluation of enrichment tools.
Users can upload their data, automatically add part-of-speech tags and lemmas, inspect the results, and analyze the performance of various tools using a given gold standard. The annotated material can be uploaded to the AutoSearch corpus exploration environment and to the LAnCeLoT tool for manual correction of linguistic annotation.
GrETEL 3: GrETEL is a query engine in which linguists can use a natural language example as a starting point for searching a treebank with limited knowledge about tree representations and formal query languages.
By allowing users to search for constructions which are similar to the example they provide, it aims to bridge the gap between traditional and computational linguistics. GrETEL was developed by the Centre for Computational Linguistics at the University of Leuven.
GrETEL 4 federated: GrETEL stands for Greedy Extraction of Trees for Empirical Linguistics. It is a user-friendly search engine for the exploitation of syntactically annotated corpora or treebanks.
This is an expanded version of GrETEL. It has been updated with federated search capabilities, allowing the querying of multiple corpora simultaneously. Further, it includes several additional treebanks made available in cooperation with the University of Groningen, graciously provided and hosted by them through their PaQu platform.
LAnCeLoT: LAnCeLoT (Linguistic Annotation Corpus Laundry Tool) enables researchers to manually correct and refine enrichments, such as those from GaLAHaD, which is essential for high-quality corpus analysis.
LAnCeLoT simplifies this revision process by providing an intuitive, interactive environment for querying, inspecting, and correcting linguistic annotations. Researchers can review and refine corpus data, improving annotation consistency without requiring technical expertise. Currently LAnCeLoT supports the TEI P5 format.
Language Portal (Taalportaal): Taalportaal is a large project aiming at a comprehensive and authoritative scientific grammar, originally only for Dutch and Frisian.
Since 2015, Afrikaans has also been added to the original project. Taalportaal is an interactive knowledge base about these three languages, covering syntax, morphology and phonology.
LASSY LARGE: The Lassy Large Corpus is a collection of written texts consisting of approximately 700 million words with automatically generated annotations.
The lemmas and POS-tags were generated with Tadpole (now Frog) and the syntactical dependency structures were generated with Alpino.
LASSY SMALL: The Lassy Small Corpus is a corpus of approximately 1 million words with manually verified syntactical annotations.
The lemmas and POS-tags were generated with Tadpole (now Frog) and the syntactical dependency structures were generated with Alpino. The lemmas, POS-tags and syntactic tree structures were manually verified and corrected.
MATEO: MAchine Translation Evaluation Online
MAchine Translation Evaluation Online (MATEO) makes automatic machine translation evaluation widely available through an accessible user interface. It was developed at Ghent University, in the Language and Translation Technology Team (LT3), in 2022-2023.
NameScape: Recent research has conclusively proven that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names either in the same text or in related material (the onymic landscape or “namescape”).
Research on large corpora is needed to gain a better understanding of, for example, what is characteristic for a certain period, genre, author or cultural region. The data necessary for research on this scale simply does not exist yet. The project aims to fill the need by annotating a substantial amount of literary works with a rich tag set, thereby enabling the participating parties to perform their research in more depth than previously possible. Several exploratory visualization tools will help the scholar to answer old questions and uncover many more new ones, which can be addressed using the demonstrator.
Please note that NameScape was deactivated in 2023.
OpenConvert: The OpenConvert tools convert to TEI from a number of input formats (ALTO, text, Word, HTML).
The tools were available as a Java command line tool, a web service and a web application. These tools were discontinued in May 2023 because of aging software and dependencies.
OpenSoNaR: OpenSoNaR is an online system that allows for analyzing and searching the over 500 million word Dutch reference corpus SoNaR developed within the STEVIN programme under the aegis of the Dutch Language Union.
It is the result of cooperation between INT, TiCC - Tilburg University and the company De Taalmonsters, in the CLARIN-NL Call 4 project OpenSoNaR. The system incorporates the texts and metadata of the SoNaR-500 and SoNaR New Media corpora. The project's main aim was to facilitate the use of the SoNaR corpus by providing a user-friendly online interface, regardless of the user's personal computer expertise. User groups representing Linguistics, Media and Communication Studies, as well as Literary and Cultural Sciences have provided practical use cases on the basis of which the interface has been developed. The system is available here for use in research and educational settings.
PICCL: PICCL (Philosophical Integrator of Computational and Corpus Libraries) offers a workflow for corpus building and builds on a variety of tools.
The primary component of PICCL is TICCL, a Text-induced Corpus Clean-up system, which performs spelling correction and OCR post-correction (normalisation of spelling variants etc).
Stylene: Stylene is a robust, modular system for stylometry and readability research, built on existing techniques for automatic text analysis and machine learning, together with a web service that allows researchers in the humanities and social sciences to analyze texts with this system.
In this way, the project will make available to researchers recent advances in research on the computational modeling of style and readability. The system was developed in a cooperation between the CLiPS (University of Antwerp) and LT³ (Ghent University) research groups.
Text2Picto: Text2Picto is a translation tool aimed at enhancing communication for people with reading disabilities.
Text2Picto translates Dutch, English, Spanish or French sentences into pictographs, that is, graphic symbols that serve as stand-ins for verbal communication. The Text2Picto demo has been developed by the Centre for Computational Linguistics at the University of Leuven.
Textlens: Textlens is an online text processing dashboard, which provides state-of-the-art linguistic processing tools such as spaCy and Stanza for tasks such as automatic tokenization, lemmatization, part-of-speech tagging, named entity recognition and dependency analysis for Dutch, English, French and German.
No programming knowledge or installation is required; instead, a user can execute, monitor and control linguistic processing tasks from a user-oriented graphical web interface. The only requirement is access to a web browser, which makes automatic linguistic processing readily available across platforms and devices. In addition, a fully documented API is available for programmatic access.
WebCelex: WebCelex is a web-based interface to the CELEX lexical databases of English, Dutch and German.
CELEX was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. For each language, the database contains detailed information on: orthography (variations in spelling, hyphenation), phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress), morphology (derivational and compositional structure, inflectional paradigms), syntax (word class, word class-specific subcategorizations, argument structures) and word frequency (summed word and lemma counts, based on recent and representative text corpora).
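
The AutoSearch upload limits mentioned above (25 MB per uploaded file, 500,000 tokens per corpus) can be checked locally before uploading. The following Python sketch is only an illustration and is not part of AutoSearch: the function names are hypothetical, it assumes that each token is encoded as a <w> element (as is common in TEI and FoLiA), that 1 MB means 1024 * 1024 bytes, and it handles single XML files only, not zip or tar.gz archives.

```python
import io
import xml.etree.ElementTree as ET

# Limits quoted in the AutoSearch description; the byte interpretation
# of "25 MB" is an assumption.
MAX_FILE_BYTES = 25 * 1024 * 1024
MAX_CORPUS_TOKENS = 500_000


def count_tokens(xml_bytes: bytes) -> int:
    """Count elements whose local name is 'w', ignoring namespaces.

    Assumption: every token in the document is encoded as one <w> element.
    """
    count = 0
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        # Namespaced tags look like '{namespace-uri}w'; keep the local part.
        if elem.tag.rsplit("}", 1)[-1] == "w":
            count += 1
        elem.clear()  # free memory for large documents
    return count


def check_corpus(files, max_file_bytes=MAX_FILE_BYTES,
                 max_tokens=MAX_CORPUS_TOKENS):
    """Return a list of human-readable problems; an empty list means OK.

    `files` maps a file name to the raw bytes of a single XML document.
    """
    problems = []
    total_tokens = 0
    for name, data in files.items():
        if len(data) > max_file_bytes:
            problems.append(f"{name}: file exceeds the per-file size limit")
        total_tokens += count_tokens(data)
    if total_tokens > max_tokens:
        problems.append(
            f"corpus: {total_tokens} tokens exceeds the corpus token limit"
        )
    return problems


if __name__ == "__main__":
    sample = (
        b'<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
        b'<w lemma="de" pos="DET">De</w><w lemma="kat" pos="N">kat</w>'
        b"</p></body></text></TEI>"
    )
    print(check_corpus({"sample.xml": sample}))
```

The limits are passed as parameters with defaults, so the check can be adjusted if the limits are increased later.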

Notice

Because of copyright law, some products or tools are only accessible with a user ID and password. Are you employed by a university or scientific institute? Then you can log in with the user ID and password of your own organization. Is your organization not in the list or do you not have an account at an academic institution? Then you can open an account with CLARIN.EU.

About this portal

The Repository "CLARIN Centre INT" gives access to language resources and tools from the INT and other CLARIN Members. The INT has obtained the Data Seal of Approval.

Information about deposition
Preservation Plan
End User License Agreement (Dutch with English translation)
Privacy Policy of the INT Research Environments
Privacy statement
Terms and conditions

About CLARIN

CLARIN wants to achieve an integrated, interoperable research infrastructure of language resources and language technology. This infrastructure must be stable, permanent, accessible and expandable; it should put an end to the current fragmentation, and promote the use of computational techniques in the humanities (eHumanities).

About the Dutch Language Institute

More information about the Dutch Language Institute (INT) can be found on our website. General information is also available in English.