Department of Language Modeling

Dataset December 2023

Corpus of Contemporary Polish (KWJP)

A balanced and representative corpus of written Polish covering texts from 2011 to 2020, divided into fiction, non-fiction, and journalism.

Software November 2021

COMBO

A language-independent NLP system for dependency parsing, part-of-speech tagging, lemmatisation, morphological analysis, and more, built on top of PyTorch and AllenNLP.

Dataset May 2014

Polish Summaries Corpus

A corpus of Polish news summaries based on texts extracted from the Rzeczpospolita corpus, released under the CC BY 3.0 licence. Co-funded by the ATLAS project and by the European Union.

Dataset January 2012

National Corpus of Polish (NKJP)

A reference corpus of the Polish language with over 1.5 billion words from classic literature, newspapers, specialist periodicals, conversation transcripts, and internet texts, including a manually annotated 1-million-word balanced subcorpus.
