Unexpected number of tokens from host part #12

Open
miguelspinn3r opened this issue Aug 9, 2016 · 3 comments

@miguelspinn3r

I followed the README examples, but when testing locally, the last one does not return the same results as shown in the README:

curl -XPUT 'http://localhost:9200/twitter21/' -d '{
    "settings": {
        "analysis": {
            "filter": {
                "url_host": {
                    "type": "url",
                    "part": ["host"],
                    "url_decode": false
                }
            },
            "analyzer": {
                "url_host": {
                    "filter": ["url_host"],
                    "tokenizer": "whitespace"
                }
            }
        }
    },
    "mappings": {
        "example_type": {
            "properties": {
                "url": {
                    "type": "multi_field",
                    "fields": {
                        "url": {"type": "string"},
                        "host": {"type": "string", "analyzer": "url_host"}
                    }
                }
            }
        }
    }
}'

{"acknowledged":true}




curl 'http://localhost:9200/twitter21/_analyze?analyzer=url_host&pretty' -d 'https://foo.bar.com/baz.html'

{
  "tokens" : [ {
    "token" : "foo.bar.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "bar.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 2
  } ]
}

I was expecting just one token, foo.bar.com, instead of three. I also believe the start_offset and end_offset values are wrong, since they are all 0.

These are the Elasticsearch and plugin versions:

./elasticsearch -version
Version: 2.3.4, Build: e455fd0/2016-06-30T11:24:31Z, JVM: 1.8.0_101

 bin/plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.3.4.2/elasticsearch-analysis-url-2.3.4.2.zip

Please correct me if I'm wrong. I hope I'm not missing something.

Thanks in advance

@jlinn
Owner

jlinn commented Aug 9, 2016

Ah. Looks like I neglected to update the readme for the filter a while back. Internally, the token filter delegates to the tokenizer, and the same default configurations apply, meaning tokenize_host is true by default. Three tokens is the correct result given your configuration. The offsets are pretty obviously wrong, though. Thanks for letting me know. I'll update the readme and take a look at the offset issue.

@miguelspinn3r
Author

Thanks for the info. It generates one token with the right config.
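
For anyone landing here later, the adjusted config is not spelled out in the thread. A minimal sketch of what it presumably looks like, assuming the filter accepts the same tokenize_host setting as the tokenizer (as the owner's comment suggests) and using a hypothetical index name, twitter22:

```shell
# Sketch only: assumes Elasticsearch 2.3.x on localhost:9200 with the
# analysis-url plugin installed. The index name "twitter22" is hypothetical.
ES_URL="http://localhost:9200"
INDEX="twitter22"

# Same filter as in the original report, plus tokenize_host set to false
# (assumption: the filter honors this option) so the host is not split
# into sub-domain tokens.
SETTINGS='{
    "settings": {
        "analysis": {
            "filter": {
                "url_host": {
                    "type": "url",
                    "part": ["host"],
                    "url_decode": false,
                    "tokenize_host": false
                }
            },
            "analyzer": {
                "url_host": {
                    "filter": ["url_host"],
                    "tokenizer": "whitespace"
                }
            }
        }
    }
}'

# Create the index with the adjusted filter.
create_index() {
    curl -XPUT "$ES_URL/$INDEX/" -d "$SETTINGS"
}

# Run the sample URL from the report through the analyzer.
analyze_sample() {
    curl "$ES_URL/$INDEX/_analyze?analyzer=url_host&pretty" -d 'https://foo.bar.com/baz.html'
}
```

With a local node running, calling create_index and then analyze_sample should show whether only foo.bar.com comes back. The block itself only defines the settings and two helper functions, so nothing is sent until they are called.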

@jlinn
Owner

jlinn commented Aug 15, 2016

The token offsets and types should be correct in versions 2.3.4.3 and 2.3.5.0.
