NYTAC-CC is a climate change-specific subcorpus extracted from the New York Times Annotated Corpus (NYTAC). It consists of 3,630 XML files, each representing a full-length news article related to climate change, published between 1987 and 2007. The corpus was created to facilitate research on climate discourse in traditional news media and supports both quantitative and qualitative NLP analyses. Due to licensing restrictions, the corpus itself is not included in this repository. However, we provide a list of filenames corresponding to the extracted articles.
- Source: Extracted from the New York Times Annotated Corpus (LDC release), which includes 1.8 million articles.
- Time Span: 1987-2007
- Number of Articles: 3,630
- Format: XML files, each corresponding to a single article.
- Metadata: Articles include metadata such as publication date, newsroom desk, and manually annotated topics.
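As a rough illustration of how the per-article metadata might be read, the sketch below collects `name`/`content` pairs from `meta` elements in one XML document. It assumes the metadata is stored in such elements; the sample document, its field names (`publication_year`, `dsk`), and the helper `read_meta` are made up for illustration and should be checked against the actual corpus files and the LDC documentation.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; real corpus files may use different
# element names and metadata fields (assumption).
SAMPLE = """<nitf><head><title>Example</title>
<meta name="publication_year" content="1999"/>
<meta name="dsk" content="Science Desk"/>
</head><body/></nitf>"""


def read_meta(xml_text):
    """Return a dict of name -> content for every <meta> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    meta = {}
    for el in root.iter("meta"):
        name = el.get("name")
        if name is not None:
            meta[name] = el.get("content")
    return meta


if __name__ == "__main__":
    print(read_meta(SAMPLE))
```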
The NYTAC-CC subcorpus was created using a hybrid retrieval approach that combines:
- Keyword-based filtering: Identifying climate change-related terms using curated lists of bigrams and keywords.
- Supervised classification: Using an XGBoost classifier trained on manually annotated samples to refine the selection and remove false positives.
- Validation with ClimateBERT: Assessing corpus relevance with a climate-specific BERT model as a check on the precision and recall of the selection.
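The first stage above, keyword-based filtering, could look roughly like the following. This is only a minimal sketch: the term lists, the tokenizer, and the threshold are placeholders and not the curated lists or settings used to build NYTAC-CC. The classifier and ClimateBERT stages are not shown.

```python
import re

# Placeholder term lists for illustration only (assumption);
# the curated lists used for the corpus are described in the paper.
KEYWORDS = {"climate", "warming", "emissions"}
BIGRAMS = {("climate", "change"), ("global", "warming"), ("greenhouse", "gas")}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_matches(text):
    """Return (keyword_hits, bigram_hits) for one article text."""
    tokens = tokenize(text)
    keyword_hits = sum(1 for t in tokens if t in KEYWORDS)
    bigram_hits = sum(1 for pair in zip(tokens, tokens[1:]) if pair in BIGRAMS)
    return keyword_hits, bigram_hits


def is_candidate(text, min_bigrams=1):
    """Keep an article as a candidate if it has enough bigram matches."""
    _, bigram_hits = count_matches(text)
    return bigram_hits >= min_bigrams
```

A filter of this kind is deliberately permissive; in the approach described above, false positives are then removed by the supervised classifier.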
The full corpus is not publicly available due to licensing constraints. However, you can:
- Find the list of extracted article filenames in file_list.txt.
- Use the list to retrieve the corresponding articles if you have access to the NYT Annotated Corpus via the Linguistic Data Consortium (LDC).
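If you have a licensed copy of the corpus, the retrieval step can be sketched as below. The sketch assumes that file_list.txt contains one filename per line and that the LDC archives have already been unpacked into a directory tree of .xml files. The function names and the matching by base filename are illustrative assumptions, not part of this repository.

```python
from pathlib import Path
import shutil


def read_file_list(list_path):
    """Read non-empty lines from the filename list (assumed one name per line)."""
    with open(list_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def collect_articles(file_list, corpus_root, out_dir):
    """Copy listed articles from corpus_root into out_dir.

    Files are matched by base filename anywhere under corpus_root.
    Returns the sorted list of filenames that were not found.
    """
    wanted = {Path(name).name for name in file_list}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    found = set()
    for path in Path(corpus_root).rglob("*.xml"):
        if path.name in wanted and path.name not in found:
            shutil.copy2(path, out / path.name)
            found.add(path.name)
    return sorted(wanted - found)
```

For example, `collect_articles(read_file_list("file_list.txt"), "path/to/nyt_corpus", "nytac_cc")` would copy the matching files into a new folder and return any names it could not locate. Both paths here are placeholders.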
NYTAC-CC can be used for:
- Climate change discourse analysis in news media
- Topic modeling and temporal trend analysis (e.g., using LDA)
- Sentiment analysis and stance detection
- Supervised learning tasks for NLP research
You can find the full paper in this repository under NYTAC-CC_paper.pdf.
If you use NYTAC-CC in your research, please cite:
Grasso, F., Patz, R., & Stede, M. (2024). NYTAC-CC: A Climate Change Subcorpus based on New York Times Articles. CLiC-it 2024: Tenth Italian Conference on Computational Linguistics.
This dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.
For questions or collaborations, please contact Manfred Stede (stede@uni-potsdam.de) or Francesca Grasso (fr.grasso@unito.it).