RedwoodsTop

Overview

The LinGO Redwoods Treebank is a collection of hand-annotated corpora analysed with the LinGO ERG. For each utterance from a corpus, the treebank records (in principle) all analyses hypothesized by the grammar, together with an annotator decision as to which reading is preferred in context.

The key innovative aspect of the Redwoods approach to treebanking is the anchoring of all linguistic data captured in the treebank to the HPSG framework and a generally-available broad-coverage grammar of English, viz. the LinGO English Resource Grammar. Unlike existing treebanks, there is no need to define a (new) form of grammatical representation specific to the treebank (and, consequently, less dissemination effort in establishing this representation). Instead, the treebank records complete syntacto-semantic analyses as defined by the LinGO ERG; tools are provided to extract many different types of linguistic information at varying granularity.

Other relevant aspects of the Redwoods Treebank include the integration of alternate, though dispreferred, analyses for each utterance and the dynamic nature of the annotations: as the underlying grammar evolves and improves its analyses, there is a provision for a (nearly) fully automated update of the treebank against a version of the original corpus analysed with the revised grammar. As a methodological result, parts of the Redwoods data are now regularly maintained as part of the grammar regression cycle with each new release of the ERG.

Current Development Status

As of October 2011, we are in the process of releasing the 45,000-sentence Seventh Growth, a substantially enlarged new revision of the Redwoods treebank, consisting of data sets from several distinct domains, including not only the Verbmobil and ecommerce corpora from earlier releases, but now also data from the LOGON Norwegian-English MT corpus, the WeScience 100-article portion of the English Wikipedia and a portion of the semantically tagged subset of the Brown corpus (SemCor). The version of the grammar used in parsing this data is "ERG (1111)".

The following table summarizes the Seventh Growth in terms of the total number of utterances, average string length, and average ambiguity rates for three sub-divisions, viz. rejected items (t-active = 0), fully disambiguated items (t-active = 1), and a small number of items for which annotators considered more than one analysis active (t-active > 1), typically where the ambiguity resides in tokenization alternatives. The profile name abbreviations are as follows: CB = Cathedral and Bazaar essay, CSLI = syntactic test suite, EC* = ecommerce corpus, FRACAS = semantic test suite, HIKE, JH*, PS*, RONDANE, and TG* = LOGON corpus, MRS = semantic test suite, RTC* = Tanaka corpus, SC* = SemCor corpus, TREC = TREC 9 corpus, VM* = Verbmobil corpus, WS* = WeScience corpus.


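Profile	Items	Parsed	t-active = 0: items	avg. length	avg. ambiguity	t-active = 1: items	avg. length	avg. ambiguity	t-active > 1: items	avg. length	avg. ambiguity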
CB	769	677	96	28.58	440	581	17.82	312	0	0.00	0
CSLI	1348	917	0	0.00	0	917	6.44	8	0	0.00	0
ECOC	1254	1216	34	11.47	254	1181	7.57	88	1	11.00	222
ECOS	1678	1596	88	10.87	173	1505	8.43	102	3	5.00	29
ECPA	1654	1580	92	11.35	170	1486	8.22	79	2	7.50	20
ECPR	1207	1168	66	12.14	177	1102	9.32	110	0	0.00	0
FRACAS	640	636	5	12.60	196	631	7.60	50	0	0.00	0
HIKE	330	329	2	13.50	327	327	12.85	192	0	0.00	0
JH0	261	247	8	31.00	473	239	18.85	358	0	0.00	0
JH1	1353	1319	29	21.86	341	1287	13.17	217	3	20.67	500
JH2	1307	1240	84	20.64	409	1153	13.68	236	3	12.67	500
JH3	1443	1401	70	23.70	434	1329	12.86	211	2	21.50	278
JH4	1603	1540	58	21.57	374	1479	12.83	216	3	31.33	500
JH5	464	437	21	21.57	281	416	12.09	203	0	0.00	0
JHK	250	245	11	19.27	407	234	12.49	190	0	0.00	0
JHU	294	286	8	25.75	438	278	12.87	200	0	0.00	0
MRS	107	107	0	0.00	0	107	4.47	3	0	0.00	0
PSK	45	42	2	4.00	22	40	9.37	91	0	0.00	0
PS	965	932	26	20.08	382	903	13.64	229	3	20.67	392
PSU	45	42	5	6.20	6	37	12.59	214	0	0.00	0
RONDANE	1402	1271	97	22.10	405	1170	14.36	250	4	20.00	403
RTC000	1500	1442	32	16.34	155	1410	11.46	42	0	0.00	0
RTC001	1500	1440	47	15.45	114	1392	11.50	46	1	18.00	500
SC01	1000	918	73	27.21	380	845	15.34	236	0	0.00	0
SC02	1103	1006	95	26.46	395	906	15.07	238	5	30.20	500
SC03	1000	922	77	26.84	387	843	14.75	234	2	11.00	251
TG1	1013	970	62	20.47	384	907	13.51	235	1	24.00	500
TG2	1001	958	57	20.95	343	900	14.14	248	1	8.00	8
TGK	90	84	4	16.50	381	78	14.68	249	2	18.00	333
TGU	90	88	4	27.75	365	83	14.08	296	1	13.00	417
TREC	693	685	5	11.60	132	680	6.91	34	0	0.00	0
VM6	4037	3883	201	11.41	248	3668	7.53	150	13	11.46	212
VM13	3408	3256	174	13.52	279	3075	8.08	153	4	12.00	447
VM31	3914	3763	164	10.68	205	3595	5.90	91	2	9.00	257
VM32	1034	1013	21	11.48	172	992	7.46	127	0	0.00	0
WS01	805	707	90	28.46	425	615	15.76	251	2	21.50	262
WS02	946	880	66	27.55	433	810	15.15	264	4	27.25	500
WS03	920	821	78	24.62	403	740	14.76	255	3	29.33	500
WS04	988	884	103	27.79	444	775	14.93	247	6	22.17	421
WS05	911	774	106	27.02	431	660	15.48	265	8	11.87	236
WS06	890	791	73	23.56	389	713	15.28	272	5	19.80	405
WS07	807	723	63	25.98	442	649	14.68	250	11	19.36	361
WS08	904	791	69	29.30	439	709	17.26	266	13	20.69	342
WS09	940	861	48	24.79	407	812	14.58	255	1	23.00	500
WS10	914	815	93	25.98	387	710	15.34	248	12	27.67	460
WS11	746	660	60	28.87	449	598	14.83	266	2	11.00	269
WS12	786	682	53	25.13	402	627	16.37	295	2	22.50	268
WS13	1001	888	74	26.22	425	800	14.78	255	14	25.64	464
Totals	51360	47933	2794			44994			139
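The rows of this table can be checked mechanically. The following is a minimal sketch, not part of the Redwoods tooling: it assumes each row is tab-separated with the profile name, the item and parsed counts, and then three groups of (items, average length, average ambiguity) for t-active = 0, t-active = 1, and t-active > 1. The names ProfileRow, parse_row, coverage, and disambiguated_share are hypothetical and introduced only for illustration. The three sample rows are copied from the table above.

```python
from dataclasses import dataclass


@dataclass
class ProfileRow:
    name: str
    items: int          # total utterances in the profile
    parsed: int         # utterances that received at least one analysis
    rejected: int       # t-active = 0
    disambiguated: int  # t-active = 1
    multiple: int       # t-active > 1


def parse_row(line: str) -> ProfileRow:
    """Parse one tab-separated table row (assumed 12-column layout)."""
    f = line.split("\t")
    # f[3], f[6], f[9] are the item counts of the three sub-divisions;
    # the columns in between hold average length and average ambiguity.
    return ProfileRow(f[0], int(f[1]), int(f[2]), int(f[3]), int(f[6]), int(f[9]))


def coverage(p: ProfileRow) -> float:
    """Fraction of items that were parsed at all."""
    return p.parsed / p.items


def disambiguated_share(p: ProfileRow) -> float:
    """Fraction of parsed items with exactly one active analysis."""
    return p.disambiguated / p.parsed


ROWS = [
    "CB\t769\t677\t96\t28.58\t440\t581\t17.82\t312\t0\t0.00\t0",
    "CSLI\t1348\t917\t0\t0.00\t0\t917\t6.44\t8\t0\t0.00\t0",
    "HIKE\t330\t329\t2\t13.50\t327\t327\t12.85\t192\t0\t0.00\t0",
]

profiles = [parse_row(r) for r in ROWS]
for p in profiles:
    # The three sub-divisions should partition the parsed items.
    assert p.rejected + p.disambiguated + p.multiple == p.parsed
    print(f"{p.name}: coverage {coverage(p):.1%}, "
          f"fully disambiguated {disambiguated_share(p):.1%}")
```

For the sample rows, the three sub-division counts add up to the Parsed column, which is a useful sanity check when reading the table into other tools.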

Earlier relevant Redwoods revisions include the Second Growth, Third Growth, and Fifth Growth.

Data Format

Like the previous Redwoods Fifth Growth revision, the Seventh Growth is distributed in [incr tsdb()] profile form exclusively (see below for instructions on how to expand the data into a textual export format), but we have limited the number of dispreferred analyses per item to a maximum of the 500 best analyses according to our MaxEnt model trained on an interim version of this treebank. In principle, Redwoods users could use the LKB or PET parsers to obtain the complete set of analyses and then use the [incr tsdb()] update facility to automatically produce a version of the treebank against the unrestricted profile. However, we expect that the reduced distribution provides a sufficiently large portion of the dispreferred analyses for high-quality stochastic modelling and that the substantial reduction in overall size will actually benefit experimentation.
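To illustrate the effect of the 500-best restriction, here is a small sketch of pruning one item's analyses by model score. It is not the [incr tsdb()] implementation. The Analysis record, the truncate function, and the decision to always retain analyses marked as preferred and to count them against the limit are assumptions made for the example.

```python
import heapq
from typing import List, NamedTuple


class Analysis(NamedTuple):
    result_id: int
    score: float      # hypothetical ranking score; higher is better
    preferred: bool   # True if the annotator marked this analysis active


def truncate(analyses: List[Analysis], limit: int = 500) -> List[Analysis]:
    """Keep at most `limit` analyses, never discarding a preferred one."""
    preferred = [a for a in analyses if a.preferred]
    others = [a for a in analyses if not a.preferred]
    # Fill the remaining slots with the best-scoring dispreferred analyses.
    room = max(limit - len(preferred), 0)
    kept = heapq.nlargest(room, others, key=lambda a: a.score)
    return sorted(preferred + kept, key=lambda a: a.score, reverse=True)


# Toy item with 1000 analyses; the annotator's choice is ranked 743rd.
item = [Analysis(i, -float(i), i == 742) for i in range(1000)]
pruned = truncate(item)
print(len(pruned), any(a.preferred for a in pruned))
```

Under these assumptions the pruned item keeps the annotated reading even when the model ranks it below the cut-off, so the data remain usable for training a parse selection model.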

(fix: update instructions for LOGON 'redwoods' script)

Expanding and Exporting

Assuming a functional installation of the LKB, ERG, and [incr tsdb()] (see the LogonInstallation page for details), the process of exporting all or parts of the Redwoods Treebank into a collection of plain text files can be fully automated by means of a shell script. For (sadly, somewhat incomplete) instructions on exporting various views of the Redwoods data, please see the section Exporting Various Plain-Text Formats on the WeScience page.

History

http://www.delph-in.net/redwoods/ezra.jpg

Bibliography

Following is an incomplete selection of publications on the creation and use of the Redwoods treebank.

Oepen, Stephan, Kristina Toutanova, Stuart Shieber, Christopher Manning, Dan Flickinger, and Thorsten Brants (2002). The LinGO Redwoods Treebank: Motivation and Preliminary Applications. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan (pages 1253-1257).
Oepen, Stephan, Dan Flickinger, Kristina Toutanova, and Christopher D. Manning (2002). LinGO Redwoods. A Rich and Dynamic Treebank for HPSG. In Proceedings of The First Workshop on Treebanks and Linguistic Theories (TLT 2002), Sozopol, Bulgaria.
Toutanova, Kristina, Christopher D. Manning, and Stephan Oepen (2002). Parse Ranking for a Rich HPSG Grammar. In Proceedings of The First Workshop on Treebanks and Linguistic Theories (TLT 2002), Sozopol, Bulgaria.
Toutanova, Kristina and Christopher D. Manning (2002). Feature Selection for a Rich HPSG Grammar Using Decision Trees. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL 2002), Taipei, Taiwan.
Velldal, Erik, Stephan Oepen, and Dan Flickinger (2004). Paraphrasing Treebanks for Stochastic Realization Ranking. In Proceedings of The Third Workshop on Treebanks and Linguistic Theories (TLT 2004), Tuebingen, Germany.
Oepen, Stephan, Dan Flickinger, Kristina Toutanova, and Christopher D. Manning (2004). LinGO Redwoods: A Rich and Dynamic Treebank for HPSG. Research on Language and Computation 2(4):575-596.

An overview of many of the methodological aspects of the Redwoods initiative is available from an invited presentation at the 2003 Treebanks and Linguistic Theories workshop.

Acknowledgements

The Redwoods treebank has been under active development at the CSLI LinGO Laboratory since early 2001. The annotation environment was built from the combination of the LKB tree comparison window (originally developed by Rob Malouf) and the [incr tsdb()] profiling tools; Stephan Oepen did the bulk of the Redwoods software development. Dan Flickinger, as the main developer of the ERG, has been an invaluable source of inspiration on the treebank design and has also been the main treebanker since Redwoods Second Growth. Chris Manning, Kristina Toutanova, and Stuart Shieber, as early adopters and consultants on the overall design of the resource and representations, have greatly influenced the evolution of the treebank and pioneered its use for stochastic parse selection. Ezra Callahan was the first annotator, constructing what has been released as the First Growth during a ten-week summer internship (it appears Ezra then went on to become employee #6 at Facebook and retired at age 31). John Beavers did the annotations of the new ecommerce sections (and later became a professor of linguistics in Texas). Francis Bond and his colleagues at the NTT Research Laboratory have been vigorous supporters, adapted the Redwoods approach for Japanese (dubbing their treebank Hinoki), and thus helped a lot in scaling up the technology. Marty Mayberry, Jason Baldridge, Alex Lascarides, and Miles Osborne, as active users of the ERG and Redwoods data, have provided crucial feedback on the representations and software and positively contributed to recent developments. Tim Baldwin, Emily M. Bender, Kathryn Campbell-Kibler, Ann Copestake, Andreas Eisele, Rob Malouf, Rebecca Neil, Ivan Sag, Erik Velldal, and Tom Wasow have all helped through advice and productive critique in various stages of the project.

The development of the Redwoods treebank was financed opportunistically from numerous sources, including multiple donations to CSLI from YY Technologies (Mountain View, CA), a CSLI Seeding Grant, the Stanford Symbolic Systems Program (through multiple sponsored summer internships), the Commission of the European Community (through the Deep-Thought project), Scottish Enterprise (through the ROSIE project), Nippon Telegraph and Telephone Corporation (NTT) (through a sponsored research contract to the LinGO Laboratory), and the Norwegian LOGON Initiative (through financial support to Dan Flickinger and Stephan Oepen).
