Dataset December 2023
Corpus of Contemporary Polish (KWJP)
A balanced and representative corpus of written Polish covering texts from 2011-2020, divided into fiction, non-fiction, and journalism.
A language-independent NLP system for dependency parsing, part-of-speech tagging, lemmatisation, morphological analysis, and more, built on top of PyTorch and AllenNLP.
A corpus of Polish news summaries based on texts extracted from the Rzeczpospolita corpus, released under the CC BY 3.0 licence. Co-funded by the ATLAS project and by the European Union.
A reference corpus of the Polish language with over 1.5 billion words from classic literature, newspapers, specialist periodicals, conversation transcripts, and internet texts, including a manually annotated 1-million-word balanced subcorpus.