Data & Software
Data
PolyNews
PolyNews is a multilingual dataset containing news titles in 77 languages and 19 scripts.
PolyNews aims to provide an easily accessible, unified, and de-duplicated dataset that combines five disparate data sources. It can be used for domain adaptation of language models, language modeling, or text generation in both high-resource and low-resource languages.
Access the dataset on HuggingFace Datasets
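As an illustration of what "unified and de-duplicated" can mean when several sources are combined, here is a minimal sketch. It is not the pipeline used to build PolyNews: the normalization rule (Unicode NFKC, case folding, whitespace collapsing), the function names, and the record fields `lang`, `text`, and `source` are all assumptions made for this example.

```python
import unicodedata
from typing import Dict, Iterable, List, Tuple


def normalize(title: str) -> str:
    """Build a comparison key: NFKC-normalize, casefold, collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKC", title).casefold().split())


def merge_sources(
    sources: Dict[str, Iterable[Tuple[str, str]]]
) -> List[Dict[str, str]]:
    """Merge (lang, title) rows from several sources into one list.

    A title is kept only the first time its (lang, normalized title) key
    is seen, so duplicates within and across sources are dropped.
    """
    seen = set()
    merged = []
    for source_name, rows in sources.items():
        for lang, title in rows:
            key = (lang, normalize(title))
            if key in seen:
                continue
            seen.add(key)
            merged.append(
                {"lang": lang, "text": title.strip(), "source": source_name}
            )
    return merged


# Hypothetical toy input: two sources with one overlapping English title.
toy_sources = {
    "source_a": [("en", "Hello  World"), ("de", "Hallo Welt")],
    "source_b": [("en", "hello world"), ("fr", "hello world")],
}
unified = merge_sources(toy_sources)
```

De-duplicating per language keeps identical strings that appear in different languages, which matters for short titles such as proper names.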
PolyNewsParallel
PolyNewsParallel is a multilingual parallel dataset containing news titles in 833 language pairs, spanning 64 languages and 17 scripts.
PolyNewsParallel aims to provide an easily accessible, unified, and de-duplicated dataset that combines three disparate data sources. It can be used for machine translation or text retrieval in both high-resource and low-resource languages.
Access the dataset on HuggingFace Datasets
xMIND (A Multilingual Dataset for Cross-lingual News Recommendation)
xMIND is a large-scale multilingual news dataset for multi- and cross-lingual news recommendation. xMIND is derived from the English MIND dataset using open-source neural machine translation (i.e., NLLB 3.3B).
xMIND contains 130K news articles translated into 14 linguistically and geographically diverse languages, with digital footprints of varying sizes. The goal of xMIND is to serve as a benchmark dataset for news recommendation, and to foster broader research into multilingual and cross-lingual news recommendation for speakers of both high- and low-resource languages.
Access the dataset on GitHub
Access xMINDlarge and xMINDsmall on HuggingFace Datasets
Paper: MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation
NeMig - A Bilingual News Collection and Knowledge Graph about Migration
NeMig is a bilingual news collection with accompanying knowledge graphs on the topic of migration. The news corpora in German and English were collected from online media outlets in Germany and the US, respectively. NeMig contains rich textual and metadata information, sentiment and political orientation annotations, as well as named entities extracted from the articles’ content and metadata and linked to Wikidata. The corresponding knowledge graphs (NeMigKG), one built from each corpus, are expanded with the Wikidata neighbors, up to two hops away, of the initial set of linked entities.
Access the dataset on Zenodo
Paper: NeMig - A Bilingual News Collection and Knowledge Graph about Migration
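The two-hop expansion mentioned in the NeMig description can be pictured as a bounded breadth-first search from the initially linked entities. The sketch below is illustrative only and is not the NeMigKG construction code: the adjacency dictionary, the function name `expand_k_hops`, and the single-letter node names are placeholders, and a real expansion would query Wikidata rather than an in-memory graph.

```python
from collections import deque
from typing import Dict, Iterable, List


def expand_k_hops(
    neighbors: Dict[str, List[str]],
    seeds: Iterable[str],
    max_hops: int = 2,
) -> Dict[str, int]:
    """Return every node within max_hops of any seed, mapped to its hop distance."""
    depth = {seed: 0 for seed in seeds}
    queue = deque(depth)
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth


# Toy chain A -> B -> C -> D; only nodes up to two hops from A are kept.
toy_graph = {"A": ["B"], "B": ["C"], "C": ["D"]}
expanded = expand_k_hops(toy_graph, ["A"], max_hops=2)
print(expanded)  # {'A': 0, 'B': 1, 'C': 2}
```

Capping the depth keeps the graph focused on entities close to those mentioned in the articles, since each additional hop can grow the neighbor set considerably.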
Models
NaSE (News-adapted Sentence Encoder)
NaSE is a news-adapted sentence encoder, obtained by domain-specializing a pretrained massively multilingual sentence encoder. It leverages the PolyNews and PolyNewsParallel corpora and was trained with two objectives: denoising auto-encoding and sequence-to-sequence machine translation.
Access the model directly from HuggingFace Models
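To make the denoising auto-encoding objective concrete: a model receives a corrupted sentence and is trained to reconstruct the original. The sketch below only builds such (noisy input, clean target) pairs. The noise type (random token deletion), the deletion probability, whitespace tokenization, and the names `add_noise` and `make_dae_pair` are assumptions for illustration; this page does not state which noise function NaSE uses.

```python
import random
from typing import List, Optional, Tuple


def add_noise(
    tokens: List[str],
    delete_prob: float = 0.3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Drop each token independently with probability delete_prob.

    At least one token is always kept so the noisy input is never empty
    (unless the input itself was empty).
    """
    if rng is None:
        rng = random.Random()
    kept = [tok for tok in tokens if rng.random() >= delete_prob]
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return kept


def make_dae_pair(
    sentence: str, delete_prob: float = 0.3, seed: int = 0
) -> Tuple[str, str]:
    """Return (noisy_input, clean_target) for one training example."""
    tokens = sentence.split()  # whitespace tokenization, an assumption
    noisy = add_noise(tokens, delete_prob, random.Random(seed))
    return " ".join(noisy), sentence


# Hypothetical news title used only as an example input.
noisy_input, target = make_dae_pair("Parliament approves new budget for 2025")
```

For the machine translation objective, the pair would instead consist of a title in one language as input and its parallel title in another language as target, as provided by a parallel corpus such as PolyNewsParallel.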
Software
NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation
- Code: GitHub
- Documentation: Read the Docs
- Paper: NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation