RedwoodsTop
The LinGO Redwoods Treebank is a collection of hand-annotated corpora analysed with the LinGO ERG. For each utterance from a corpus, the treebank records (in principle) all analyses hypothesized by the grammar, together with an annotator decision as to which reading is preferred in context.
The key innovative aspect of the Redwoods approach to treebanking is the anchoring of all linguistic data captured in the treebank to the HPSG framework and a generally-available broad-coverage grammar of English, viz. the LinGO English Resource Grammar. Unlike existing treebanks, there is no need to define a (new) form of grammatical representation specific to the treebank (and, consequently, less dissemination effort in establishing this representation). Instead, the treebank records complete syntacto-semantic analyses as defined by the LinGO ERG; tools are provided to extract many different types of linguistic information at varying granularity.
Other relevant aspects of the Redwoods Treebank include the integration of alternate, though dispreferred, analyses for each utterance and the dynamic nature of the annotations: as the underlying grammar evolves and improves its analyses, there is a provision for a (nearly) fully automated update of the treebank against a version of the original corpus analysed with the revised grammar. As a methodological result, part of the Redwoods data are now regularly maintained as part of the grammar regression cycle with each new release of the ERG.
As of October 2011, we are in the process of releasing the 45,000-sentence Seventh Growth, a substantially enlarged new revision of the Redwoods treebank, consisting of data sets from several distinct domains, including not only the Verbmobil and ecommerce corpora from earlier releases, but now also data from the LOGON Norwegian-English MT corpus, the WeScience 100-article portion of the English Wikipedia and a portion of the semantically tagged subset of the Brown corpus (SemCor). The version of the grammar used in parsing this data is "ERG (1111)".
The following table summarizes the Seventh Growth in terms of the total number of utterances, average string length, and average ambiguity rates for three sub-divisions, viz. rejected items (t-active = 0), fully disambiguated items (t-active = 1), and a small number of items for which annotators considered more than one analysis active (t-active > 1), typically where the ambiguity resides in tokenization alternatives. The profile name abbreviations are as follows: CB = Cathedral and Bazaar essay, CSLI = syntactic test suite, EC* = ecommerce corpus, FRACAS = semantic test suite, HIKE and JH* and PS* and RONDANE and TG* = LOGON corpus, MRS = semantic test suite, RTC* = Tanaka corpus, SC* = SemCor corpus, TREC = TREC 9 corpus, VM* = Verbmobil corpus, WS* = WeScience corpus.
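| Profile | Items | Parsed | t-active = 0: items | avg. length | avg. ambiguity | t-active = 1: items | avg. length | avg. ambiguity | t-active > 1: items | avg. length | avg. ambiguity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CB | 769 | 677 | 96 | 28.58 | 440 | 581 | 17.82 | 312 | 0 | 0.00 | 0 |
| CSLI | 1348 | 917 | 0 | 0.00 | 0 | 917 | 6.44 | 8 | 0 | 0.00 | 0 |
| ECOC | 1254 | 1216 | 34 | 11.47 | 254 | 1181 | 7.57 | 88 | 1 | 11.00 | 222 |
| ECOS | 1678 | 1596 | 88 | 10.87 | 173 | 1505 | 8.43 | 102 | 3 | 5.00 | 29 |
| ECPA | 1654 | 1580 | 92 | 11.35 | 170 | 1486 | 8.22 | 79 | 2 | 7.50 | 20 |
| ECPR | 1207 | 1168 | 66 | 12.14 | 177 | 1102 | 9.32 | 110 | 0 | 0.00 | 0 |
| FRACAS | 640 | 636 | 5 | 12.60 | 196 | 631 | 7.60 | 50 | 0 | 0.00 | 0 |
| HIKE | 330 | 329 | 2 | 13.50 | 327 | 327 | 12.85 | 192 | 0 | 0.00 | 0 |
| JH0 | 261 | 247 | 8 | 31.00 | 473 | 239 | 18.85 | 358 | 0 | 0.00 | 0 |
| JH1 | 1353 | 1319 | 29 | 21.86 | 341 | 1287 | 13.17 | 217 | 3 | 20.67 | 500 |
| JH2 | 1307 | 1240 | 84 | 20.64 | 409 | 1153 | 13.68 | 236 | 3 | 12.67 | 500 |
| JH3 | 1443 | 1401 | 70 | 23.70 | 434 | 1329 | 12.86 | 211 | 2 | 21.50 | 278 |
| JH4 | 1603 | 1540 | 58 | 21.57 | 374 | 1479 | 12.83 | 216 | 3 | 31.33 | 500 |
| JH5 | 464 | 437 | 21 | 21.57 | 281 | 416 | 12.09 | 203 | 0 | 0.00 | 0 |
| JHK | 250 | 245 | 11 | 19.27 | 407 | 234 | 12.49 | 190 | 0 | 0.00 | 0 |
| JHU | 294 | 286 | 8 | 25.75 | 438 | 278 | 12.87 | 200 | 0 | 0.00 | 0 |
| MRS | 107 | 107 | 0 | 0.00 | 0 | 107 | 4.47 | 3 | 0 | 0.00 | 0 |
| PSK | 45 | 42 | 2 | 4.00 | 22 | 40 | 9.37 | 91 | 0 | 0.00 | 0 |
| PS | 965 | 932 | 26 | 20.08 | 382 | 903 | 13.64 | 229 | 3 | 20.67 | 392 |
| PSU | 45 | 42 | 5 | 6.20 | 6 | 37 | 12.59 | 214 | 0 | 0.00 | 0 |
| RONDANE | 1402 | 1271 | 97 | 22.10 | 405 | 1170 | 14.36 | 250 | 4 | 20.00 | 403 |
| RTC000 | 1500 | 1442 | 32 | 16.34 | 155 | 1410 | 11.46 | 42 | 0 | 0.00 | 0 |
| RTC001 | 1500 | 1440 | 47 | 15.45 | 114 | 1392 | 11.50 | 46 | 1 | 18.00 | 500 |
| SC01 | 1000 | 918 | 73 | 27.21 | 380 | 845 | 15.34 | 236 | 0 | 0.00 | 0 |
| SC02 | 1103 | 1006 | 95 | 26.46 | 395 | 906 | 15.07 | 238 | 5 | 30.20 | 500 |
| SC03 | 1000 | 922 | 77 | 26.84 | 387 | 843 | 14.75 | 234 | 2 | 11.00 | 251 |
| TG1 | 1013 | 970 | 62 | 20.47 | 384 | 907 | 13.51 | 235 | 1 | 24.00 | 500 |
| TG2 | 1001 | 958 | 57 | 20.95 | 343 | 900 | 14.14 | 248 | 1 | 8.00 | 8 |
| TGK | 90 | 84 | 4 | 16.50 | 381 | 78 | 14.68 | 249 | 2 | 18.00 | 333 |
| TGU | 90 | 88 | 4 | 27.75 | 365 | 83 | 14.08 | 296 | 1 | 13.00 | 417 |
| TREC | 693 | 685 | 5 | 11.60 | 132 | 680 | 6.91 | 34 | 0 | 0.00 | 0 |
| VM6 | 4037 | 3883 | 201 | 11.41 | 248 | 3668 | 7.53 | 150 | 13 | 11.46 | 212 |
| VM13 | 3408 | 3256 | 174 | 13.52 | 279 | 3075 | 8.08 | 153 | 4 | 12.00 | 447 |
| VM31 | 3914 | 3763 | 164 | 10.68 | 205 | 3595 | 5.90 | 91 | 2 | 9.00 | 257 |
| VM32 | 1034 | 1013 | 21 | 11.48 | 172 | 992 | 7.46 | 127 | 0 | 0.00 | 0 |
| WS01 | 805 | 707 | 90 | 28.46 | 425 | 615 | 15.76 | 251 | 2 | 21.50 | 262 |
| WS02 | 946 | 880 | 66 | 27.55 | 433 | 810 | 15.15 | 264 | 4 | 27.25 | 500 |
| WS03 | 920 | 821 | 78 | 24.62 | 403 | 740 | 14.76 | 255 | 3 | 29.33 | 500 |
| WS04 | 988 | 884 | 103 | 27.79 | 444 | 775 | 14.93 | 247 | 6 | 22.17 | 421 |
| WS05 | 911 | 774 | 106 | 27.02 | 431 | 660 | 15.48 | 265 | 8 | 11.87 | 236 |
| WS06 | 890 | 791 | 73 | 23.56 | 389 | 713 | 15.28 | 272 | 5 | 19.80 | 405 |
| WS07 | 807 | 723 | 63 | 25.98 | 442 | 649 | 14.68 | 250 | 11 | 19.36 | 361 |
| WS08 | 904 | 791 | 69 | 29.30 | 439 | 709 | 17.26 | 266 | 13 | 20.69 | 342 |
| WS09 | 940 | 861 | 48 | 24.79 | 407 | 812 | 14.58 | 255 | 1 | 23.00 | 500 |
| WS10 | 914 | 815 | 93 | 25.98 | 387 | 710 | 15.34 | 248 | 12 | 27.67 | 460 |
| WS11 | 746 | 660 | 60 | 28.87 | 449 | 598 | 14.83 | 266 | 2 | 11.00 | 269 |
| WS12 | 786 | 682 | 53 | 25.13 | 402 | 627 | 16.37 | 295 | 2 | 22.50 | 268 |
| WS13 | 1001 | 888 | 74 | 26.22 | 425 | 800 | 14.78 | 255 | 14 | 25.64 | 464 |
| Totals | 51360 | 47933 | 2794 | | | 44994 | | | 139 | | |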
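As a small worked example of reading these figures, the following Python sketch derives the parsing coverage and the share of fully disambiguated items from the Items, Parsed, and t-active = 1 counts. The `ProfileCounts` class and its field names are illustrative assumptions, not part of [incr tsdb()]; the numbers are copied from the CB and Totals rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileCounts:
    """Counts for one profile (hypothetical container, names are assumptions)."""
    name: str
    items: int          # all utterances in the profile
    parsed: int         # utterances that received at least one analysis
    disambiguated: int  # utterances with t-active = 1

    def coverage(self) -> float:
        """Fraction of items for which the grammar produced an analysis."""
        return self.parsed / self.items

    def disambiguated_share(self) -> float:
        """Fraction of parsed items that were fully disambiguated."""
        return self.disambiguated / self.parsed


# Values taken from the table above.
ROWS = [
    ProfileCounts("CB", 769, 677, 581),
    ProfileCounts("Totals", 51360, 47933, 44994),
]

if __name__ == "__main__":
    for row in ROWS:
        print(f"{row.name}: coverage {row.coverage():.1%}, "
              f"fully disambiguated {row.disambiguated_share():.1%}")
```

Both ratios are plain quotients of the table's counts, so the same computation applies to any other row.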
Earlier relevant Redwoods revisions include the Second Growth, Third Growth, and Fifth Growth.
Like the previous Redwoods Fifth Growth revision, the Seventh Growth is distributed in [incr tsdb()] profile form exclusively (see below for instructions on how to expand the data into a textual export format), but we have limited the number of dispreferred analyses per item to a maximum of the 500 best analyses according to our MaxEnt model trained on an interim version of this treebank. In principle, Redwoods users could use the LKB or PET parsers to obtain the complete set of analyses and then use the [incr tsdb()] update facility to automatically produce a version of the treebank against the unrestricted profile. However, we expect that the reduced distribution provides a sufficiently large portion of the dispreferred analyses for high-quality stochastic modelling and that the substantial reduction in overall size will actually benefit experimentation.
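The effect of this restriction can be pictured with a short Python sketch. It is an illustration under assumptions only: the function `prune_analyses`, the integer analysis identifiers, and the decision to count annotator-preferred analyses toward the limit are all hypothetical, and the actual [incr tsdb()] procedure may differ.

```python
from typing import Dict, List, Set


def prune_analyses(scores: Dict[int, float],
                   active: Set[int],
                   limit: int = 500) -> List[int]:
    """Return the analysis ids retained for one item.

    scores: model score per analysis id (higher is better; assumed layout).
    active: ids the annotator marked as preferred; these are always kept.
    limit:  maximum number of analyses retained per item (assumption:
            preferred analyses count toward this limit).
    """
    kept = sorted(active)
    # Rank the dispreferred analyses by descending model score.
    dispreferred = sorted(
        (rid for rid in scores if rid not in active),
        key=lambda rid: scores[rid],
        reverse=True,
    )
    room = max(limit - len(kept), 0)
    return kept + dispreferred[:room]


if __name__ == "__main__":
    # Toy item with 1000 analyses; analysis 900 is the preferred reading.
    toy_scores = {rid: -float(rid) for rid in range(1000)}
    retained = prune_analyses(toy_scores, active={900})
    print(len(retained), retained[:4])
```

Keeping the preferred analyses unconditionally means that a low-ranked gold reading is never lost to the cut-off, which is what a parse selection experiment on the reduced profiles would need.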
(fix: update instructions for LOGON 'redwoods' script)
Assuming a functional installation of the LKB, ERG, and [incr tsdb()] (see the LogonInstallation page for details), the process of exporting all or parts of the Redwoods Treebank into a collection of plain text files can be fully automated by means of a shell script. For (somewhat incomplete, sadly) instructions on exporting various views on the Redwoods data, please see the section Exporting Various Plain-Text Formats on the WeScience page.
Following is an incomplete selection of publications on the creation and use of the Redwoods treebank.
- Oepen, Stephan, Kristina Toutanova, Stuart Shieber, Christopher Manning, Dan Flickinger, and Thorsten Brants (2002). The LinGO Redwoods Treebank: Motivation and Preliminary Applications. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan (pages 1253-1257).
- Oepen, Stephan, Dan Flickinger, Kristina Toutanova, and Christopher D. Manning (2002). LinGO Redwoods. A Rich and Dynamic Treebank for HPSG. In Proceedings of The First Workshop on Treebanks and Linguistic Theories (TLT 2002), Sozopol, Bulgaria.
- Toutanova, Kristina, Christopher D. Manning, and Stephan Oepen (2002). Parse Ranking for a Rich HPSG Grammar. In Proceedings of The First Workshop on Treebanks and Linguistic Theories (TLT 2002), Sozopol, Bulgaria.
- Toutanova, Kristina and Christopher D. Manning (2002). Feature Selection for a Rich HPSG Grammar Using Decision Trees. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL 2002), Taipei, Taiwan.
- Velldal, Erik, Stephan Oepen, and Dan Flickinger (2004). Paraphrasing Treebanks for Stochastic Realization Ranking. In Proceedings of The Third Workshop on Treebanks and Linguistic Theories (TLT 2004), Tuebingen, Germany.
- Oepen, Stephan, Dan Flickinger, Kristina Toutanova, and Christopher D. Manning (2004). LinGO Redwoods: A Rich and Dynamic Treebank for HPSG. Research on Language and Computation 2(4):575-596.
An overview presentation on many of the methodological aspects of the Redwoods initiative is available from an invited presentation at the 2003 Treebanks and Linguistic Theories workshop.
The Redwoods treebank has been under active development at the CSLI LinGO Laboratory since early 2001. The annotation environment was built from the combination of the LKB tree comparison window (originally developed by Rob Malouf) and the [incr tsdb()] profiling tools; Stephan Oepen did the bulk of the Redwoods software development. Dan Flickinger, as the main developer of the ERG, has been an invaluable source of inspiration on the treebank design and has also been the main treebanker since Redwoods Second Growth. Chris Manning, Kristina Toutanova, and Stuart Shieber, as early adopters and consultants on the overall design of the resource and representations, have greatly influenced the evolution of the treebank and pioneered its use for stochastic parse selection. Ezra Callahan was the first annotator, constructing what has been released as the First Growth during a ten-week summer internship (it appears Ezra then went on to become employee #6 at Facebook and retired at age 31). John Beavers did the annotations of the new ecommerce sections (and later became a professor of linguistics in Texas). Francis Bond and his colleagues at the NTT Research Laboratory have been vigorous supporters, adapted the Redwoods approach for Japanese (dubbing their treebank Hinoki), and thus helped a lot in scaling up the technology. Marty Mayberry, Jason Baldridge, Alex Lascarides, and Miles Osborne, as active users of the ERG and Redwoods data, have provided crucial feedback on the representations and software and positively contributed to recent developments. Tim Baldwin, Emily M. Bender, Kathryn Campbell-Kibler, Ann Copestake, Andreas Eisele, Rob Malouf, Rebecca Neil, Ivan Sag, Erik Velldal, and Tom Wasow have all helped through advice and productive critique in various stages of the project.
The development of the Redwoods treebank was financed opportunistically from numerous sources, including multiple donations to CSLI from YY Technologies (Mountain View, CA), a CSLI Seeding Grant, the Stanford Symbolic Systems Program (through multiple sponsored summer internships), the Commission of the European Community (through the Deep-Thought project), Scottish Enterprise (through the ROSIE project), Nippon Telegraph and Telephone Corporation (NTT) (through a sponsored research contract to the LinGO Laboratory), and the Norwegian LOGON Initiative (through financial support to Dan Flickinger and Stephan Oepen).