Commit b53a290

Author: Elia Robyn Speer
Address some edge cases of validity (fixes #47)
1 parent 4ab0e17 commit b53a290

2 files changed: +48 -4 lines changed

README.md (+8 -2)
@@ -635,8 +635,14 @@ date.
   valid, such as 'aaj' or 'en-Latnx', because the regex could match a prefix of
   a subtag. The validity regex is now required to match completely.

-- Bug fix: a language tag that is entirely private use, like 'x-private', is
-  valid.
+- Bug fixes that address some edge cases of validity:
+
+  - A language tag that is entirely private use, like 'x-private', is valid
+  - A language tag that uses the same extension twice, like 'en-a-bbb-a-ccc',
+    is invalid
+  - A language tag that uses the same variant twice, like 'de-1901-1901', is
+    invalid
+  - A language tag with two extlangs, like 'sgn-ase-bfi', is invalid


 ## Version 3.1 (February 2021)

langcodes/__init__.py (+40 -2)
@@ -754,6 +754,26 @@ def is_valid(self) -> bool:
         >>> Language.get('x-heptapod').is_valid()
         True

+        A language tag with multiple extlangs will parse, but is not valid.
+        The only allowed example is 'zh-min-nan', which normalizes to the
+        language 'nan'.
+
+        >>> Language.get('zh-min-nan').is_valid()
+        True
+        >>> Language.get('sgn-ase-bfi').is_valid()
+        False
+
+        These examples check that duplicate tags are not valid:
+
+        >>> Language.get('de-1901').is_valid()
+        True
+        >>> Language.get('de-1901-1901').is_valid()
+        False
+        >>> Language.get('en-a-bbb-c-ddd').is_valid()
+        True
+        >>> Language.get('en-a-bbb-a-ddd').is_valid()
+        False
+
         Of course, you should be prepared to catch a failure to parse the
         language code at all:

@@ -762,13 +782,31 @@ def is_valid(self) -> bool:
             ...
         langcodes.tag_parser.LanguageTagError: Expected a language code, got 'c'
         """
+        if self.extlangs is not None:
+            # An erratum to BCP 47 says that tags with more than one extlang are
+            # invalid.
+            if len(self.extlangs) > 1:
+                return False
+
         subtags = [self.language, self.script, self.territory]
+        checked_subtags = []
         if self.variants is not None:
             subtags.extend(self.variants)
         for subtag in subtags:
             if subtag is not None:
+                checked_subtags.append(subtag)
                 if not subtag.startswith('x-') and not VALIDITY.match(subtag):
                     return False
+
+        # We check extensions for validity by ensuring that there aren't
+        # two extensions introduced by the same letter. For example, you can't
+        # have two 'u-' extensions.
+        if self.extensions:
+            checked_subtags.extend(
+                [extension[:2] for extension in self.extensions]
+            )
+        if len(set(checked_subtags)) != len(checked_subtags):
+            return False
         return True

     def has_name_data(self) -> bool:
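
The duplicate check added to is_valid() in this hunk can be tried in isolation. The following is a minimal sketch, not the library's code: has_duplicate_subtags is a hypothetical helper name, and it assumes that each extension is stored as a string beginning with its introducing letter and a hyphen (such as 'a-bbb'), which is what the extension[:2] slice in the diff suggests.

```python
from typing import List, Optional


def has_duplicate_subtags(
    subtags: List[Optional[str]],
    extensions: Optional[List[str]] = None,
) -> bool:
    """Return True if a subtag or an extension prefix occurs more than once.

    Hypothetical helper: it mirrors the checked_subtags logic in the diff,
    but it is not part of langcodes.
    """
    # Skip unset slots (None), as the loop in is_valid() does.
    checked = [subtag for subtag in subtags if subtag is not None]
    if extensions:
        # Assumption: an extension looks like 'a-bbb', so its first two
        # characters ('a-') identify the letter that introduces it.
        checked.extend(extension[:2] for extension in extensions)
    # A set drops repeats, so a shorter set means something occurred twice.
    return len(set(checked)) != len(checked)


# language, script, territory, then variants, as in the subtags list above
print(has_duplicate_subtags(["de", None, None, "1901"]))          # False
print(has_duplicate_subtags(["de", None, None, "1901", "1901"]))  # True
print(has_duplicate_subtags(["en", None, None], ["a-bbb", "c-ddd"]))  # False
print(has_duplicate_subtags(["en", None, None], ["a-bbb", "a-ddd"]))  # True
```

Comparing the length of the list with the length of its set detects any repeat in one pass, at the cost of not reporting which subtag was repeated. That is enough here because is_valid() only returns a boolean.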
@@ -1555,8 +1593,8 @@ def standardize_tag(tag: Union[str, Language], macro: bool = False) -> str:
 def tag_is_valid(tag: Union[str, Language]) -> bool:
     """
     Determines whether a string is a valid language tag. This is similar to
-    Language.get(tag).is_valid(), but can return False in the case where the
-    tag doesn't parse.
+    Language.get(tag).is_valid(), but can return False in the case where
+    the tag doesn't parse.

     >>> tag_is_valid('ja')
     True
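
The docstring describes a common pattern: wrap a call that may raise on unparseable input and turn the exception into False. The sketch below illustrates only that pattern and makes no claim about how tag_is_valid is implemented. TagParseError, parse_and_validate, and tag_is_valid_sketch are hypothetical names, and the parsing rule is a toy stand-in for the real parser.

```python
class TagParseError(ValueError):
    """Hypothetical stand-in for langcodes.tag_parser.LanguageTagError."""


def parse_and_validate(tag: str) -> bool:
    """Toy stand-in for Language.get(tag).is_valid().

    Assumption for illustration only: the first subtag must be 'x' or
    2 to 8 ASCII letters, otherwise parsing fails with an exception.
    """
    first = tag.lower().split("-")[0]
    is_language = first.isascii() and first.isalpha() and 2 <= len(first) <= 8
    if first != "x" and not is_language:
        raise TagParseError(f"Expected a language code, got {first!r}")
    return True


def tag_is_valid_sketch(tag: str) -> bool:
    """Return False instead of raising when the tag does not parse."""
    try:
        return parse_and_validate(tag)
    except TagParseError:
        return False


print(tag_is_valid_sketch("ja"))  # True
print(tag_is_valid_sketch("c"))   # False: the toy parser rejects 'c'
```

The benefit for callers is that a single boolean covers both outcomes, a tag that parses but is invalid and a tag that does not parse at all, so no try/except is needed at each call site.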
