NYTAC-CC: A Climate Change Subcorpus from the New York Times Annotated Corpus

Overview

NYTAC-CC is a climate change-specific subcorpus extracted from the New York Times Annotated Corpus (NYTAC). It consists of 3,630 XML files, each representing a full-length news article related to climate change, published between 1987 and 2007. The corpus was created to facilitate research on climate discourse in traditional news media and supports both quantitative and qualitative NLP analyses. Due to licensing restrictions, the corpus itself is not included in this repository. However, we provide a list of filenames corresponding to the extracted articles.

Corpus Description

Source: Extracted from the New York Times Annotated Corpus (LDC release), which includes 1.8 million articles.
Time Span: 1987-2007
Number of Articles: 3,630
Format: XML files, each corresponding to a single article.
Metadata: Each article includes metadata such as publication date, newsroom desk, and manually annotated topics.
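
The metadata fields listed above can be read with a standard XML parser. The following is a minimal sketch, not the authors' tooling. It assumes NITF-style files in which metadata sits in `<meta name="..." content="..."/>` elements and the article body in a `<block class="full_text">` element. The sample document and the field names `publication_year` and `dsk` are assumptions to check against your own copy of the corpus.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in document. Real NYTAC files are much longer, and the element
# and attribute names used here are assumptions about their NITF-style layout.
SAMPLE = """<nitf>
  <head>
    <title>Example Headline</title>
    <meta name="publication_year" content="1997"/>
    <meta name="dsk" content="Science Desk"/>
  </head>
  <body>
    <body.content>
      <block class="full_text">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>"""


def parse_article(xml_text):
    """Return the title, the <meta> name/content pairs, and the full text."""
    root = ET.fromstring(xml_text)
    # Collect every <meta> element generically instead of hard-coding field names.
    meta = {m.get("name"): m.get("content") for m in root.iter("meta")}
    # Keep only paragraphs inside the block marked as full text.
    paragraphs = [
        "".join(p.itertext()).strip()
        for block in root.iter("block")
        if block.get("class") == "full_text"
        for p in block.iter("p")
    ]
    return {
        "title": root.findtext("head/title"),
        "meta": meta,
        "text": "\n".join(paragraphs),
    }


article = parse_article(SAMPLE)
print(article["title"], article["meta"].get("publication_year"))
```

Collecting all `<meta>` pairs into a dictionary keeps the sketch independent of the exact set of fields present in a given file.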

Corpus Construction

The NYTAC-CC subcorpus was created using a hybrid retrieval approach that combines:

Keyword-based filtering: Identifying climate change-related terms using curated lists of bigrams and keywords.
Supervised classification: Using an XGBoost classifier trained on manually annotated samples to refine the selection and remove false positives.
Validation with ClimateBERT: Assessing the relevance of the resulting corpus with a climate-specific BERT model as a check on precision and recall.
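
The keyword-based filtering step can be illustrated with a short sketch. The term lists below are placeholders chosen for illustration, not the curated lists used to build NYTAC-CC (those are described in the paper), and the function names are hypothetical. The classifier and ClimateBERT stages are not shown.

```python
import re

# Illustrative terms only; these are NOT the curated lists used for NYTAC-CC.
BIGRAMS = ("climate change", "global warming", "greenhouse gas")
KEYWORDS = ("kyoto", "ipcc")


def keyword_hits(text):
    """Return the listed terms that occur in `text`, ignoring case."""
    lowered = text.lower()
    tokens = set(re.findall(r"[a-z]+", lowered))
    # Bigrams are matched at a word start, so plural forms such as
    # "greenhouse gases" also count as a hit for "greenhouse gas".
    hits = [b for b in BIGRAMS if re.search(r"\b" + re.escape(b), lowered)]
    # Single keywords must match a whole token.
    hits += [k for k in KEYWORDS if k in tokens]
    return hits


def is_candidate(text, min_hits=1):
    """Flag a text as a retrieval candidate if it contains enough listed terms."""
    return len(keyword_hits(text)) >= min_hits


print(keyword_hits("Greenhouse gases drive global warming, the IPCC says."))
```

A filter like this favors recall and will admit false positives, which is why the pipeline described above follows it with a supervised classifier.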

Data Access

The full corpus is not publicly available due to licensing constraints. However, you can:

Find the list of extracted article filenames in file_list.txt.
Use the list to retrieve the corresponding articles if you have access to the NYT Annotated Corpus via the Linguistic Data Consortium (LDC).
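
For readers who hold a licensed copy of the NYT Annotated Corpus, the two steps above could be scripted roughly as follows. This is a sketch under assumptions: it assumes `file_list.txt` holds one filename per line and that the corpus has been unpacked into a directory tree of `.xml` files. The function names and the example paths are hypothetical.

```python
import shutil
from pathlib import Path


def read_file_list(list_path):
    """Read one entry per line, skip blank lines, and keep only the base name."""
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    return {Path(line.strip()).name for line in lines if line.strip()}


def collect_articles(corpus_root, list_path, out_dir):
    """Copy every listed XML file found under corpus_root into out_dir.

    Returns the set of listed names that were not found, so gaps are visible.
    """
    wanted = read_file_list(list_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    found = set()
    # Search recursively, since the unpacked corpus may be nested by date.
    for path in Path(corpus_root).rglob("*.xml"):
        if path.name in wanted:
            shutil.copy2(path, out / path.name)
            found.add(path.name)
    return wanted - found


# Example call (hypothetical paths):
# missing = collect_articles("nyt_corpus/data", "file_list.txt", "nytac_cc")
```

Keep the output directory outside the corpus directory so that copied files are not picked up again by the recursive search.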

Applications

NYTAC-CC can be used for:

Climate change discourse analysis in news media
Topic modeling and temporal trend analysis (e.g., using LDA)
Sentiment analysis and stance detection
Supervised learning tasks for NLP research
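
As a starting point for temporal trend analysis, articles can be counted per publication year. The sketch below assumes the publication year of each article has already been read from its metadata. The function name is hypothetical, and the default range follows the 1987-2007 time span of the corpus.

```python
from collections import Counter


def articles_per_year(years, first=1987, last=2007):
    """Count articles per year, reporting 0 for years with no articles.

    Years outside the [first, last] range are silently dropped.
    """
    counts = Counter(int(y) for y in years)
    return {year: counts.get(year, 0) for year in range(first, last + 1)}


# Toy input; real values would come from the article metadata.
trend = articles_per_year(["1987", "1987", "2007"])
print(trend[1987], trend[1990], trend[2007])
```

Filling empty years with zero keeps the series evenly spaced, which simplifies plotting and comparison across years.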

Citation

The full paper is available in this repository as CLiC_it_2024_NYTAC_CC.pdf.

If you use NYTAC-CC in your research, please cite:

Grasso, F., Patz, R., & Stede, M. (2024). NYTAC-CC: A Climate Change Subcorpus based on New York Times Articles. CLiC-it 2024: Tenth Italian Conference on Computational Linguistics.

License

This dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.

Contact

For questions or collaborations, please contact Manfred Stede (stede@uni-potsdam.de) or Francesca Grasso (fr.grasso@unito.it).
