Demo Data no longer accessible #415

Closed · fdion opened this issue Apr 27, 2023 · 11 comments
Labels: question (Further information is requested)

Comments

fdion commented Apr 27, 2023

  • Orion version: 0.4.1
  • Python version: 3.8 in conda environment
  • Operating System: Windows 10

Description

Trying to load a signal from any of the example notebooks fails.

What I Did

from orion.data import load_signal
train_data = load_signal('S-1-train')


---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 train_data = load_signal('S-1-train')

File ~\Anaconda3\envs\orion\lib\site-packages\orion\data.py:142, in load_signal(signal, test_size, timestamp_column, value_column)
    140     data = load_csv(signal, timestamp_column, value_column)
    141 else:
--> 142     data = download(signal)
    144 data = format_csv(data)
    146 if test_size is None:

File ~\Anaconda3\envs\orion\lib\site-packages\orion\data.py:89, in download(name, test_size, data_path)
     87     LOGGER.info('Downloading CSV %s from %s', name, url)
     88     os.makedirs(data_path, exist_ok=True)
---> 89     data = pd.read_csv(url)
     90     data.to_csv(filename, index=False)
     92 return data

File ~\Anaconda3\envs\orion\lib\site-packages\pandas\util\_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File ~\Anaconda3\envs\orion\lib\site-packages\pandas\io\parsers\readers.py:678, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    663 kwds_defaults = _refine_defaults_read(
    664     dialect,
    665     delimiter,
   (...)
    674     defaults={"delimiter": ","},
    675 )
    676 kwds.update(kwds_defaults)
--> 678 return _read(filepath_or_buffer, kwds)

File ~\Anaconda3\envs\orion\lib\site-packages\pandas\io\parsers\readers.py:575, in _read(filepath_or_buffer, kwds)
    572 _validate_names(kwds.get("names", None))
    574 # Create the parser.
--> 575 parser = TextFileReader(filepath_or_buffer, **kwds)
    577 if chunksize or iterator:
    578     return parser

File ~\Anaconda3\envs\orion\lib\site-packages\pandas\io\parsers\readers.py:932, in TextFileReader.__init__(self, f, engine, **kwds)
    929     self.options["has_index_names"] = kwds["has_index_names"]
    931 self.handles: IOHandles | None = None
--> 932 self._engine = self._make_engine(f, self.engine)

File ~\Anaconda3\envs\orion\lib\site-packages\pandas\io\parsers\readers.py:1216, in TextFileReader._make_engine(self, f, engine)
   1212     mode = "rb"
   1213 # error: No overload variant of "get_handle" matches argument types
   1214 # "Union[str, PathLike[str], ReadCsvBuffer[bytes], ReadCsvBuffer[str]]"
   1215 # , "str", "bool", "Any", "Any", "Any", "Any", "Any"
-> 1216 self.handles = get_handle(  # type: ignore[call-overload]
   1217     f,
   1218     mode,
   1219     encoding=self.options.get("encoding", None),
   1220     compression=self.options.get("compression", None),
   1221     memory_map=self.options.get("memory_map", False),
   1222     is_text=is_text,
   1223     errors=self.options.get("encoding_errors", "strict"),
   1224     storage_options=self.options.get("storage_options", None),
   1225 )
   1226 assert self.handles is not None
   1227 f = self.handles.handle

File ~\Anaconda3\envs\orion\lib\site-packages\pandas\io\common.py:667, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    664     codecs.lookup_error(errors)
    666 # open URLs
--> 667 ioargs = _get_filepath_or_buffer(
    668     path_or_buf,
    669     encoding=encoding,
    670     compression=compression,
    671     mode=mode,
    672     storage_options=storage_options,
    673 )
    675 handle = ioargs.filepath_or_buffer
    676 handles: list[BaseBuffer]

File ~\Anaconda3\envs\orion\lib\site-packages\pandas\io\common.py:336, in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    334 # assuming storage_options is to be interpreted as headers
    335 req_info = urllib.request.Request(filepath_or_buffer, headers=storage_options)
--> 336 with urlopen(req_info) as req:
    337     content_encoding = req.headers.get("Content-Encoding", None)
    338     if content_encoding == "gzip":
    339         # Override compression based on Content-Encoding header

File ~\Anaconda3\envs\orion\lib\site-packages\pandas\io\common.py:236, in urlopen(*args, **kwargs)
    230 """
    231 Lazy-import wrapper for stdlib urlopen, as that imports a big chunk of
    232 the stdlib.
    233 """
    234 import urllib.request
--> 236 return urllib.request.urlopen(*args, **kwargs)

File ~\Anaconda3\envs\orion\lib\urllib\request.py:222, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220 else:
    221     opener = _opener
--> 222 return opener.open(url, data, timeout)

File ~\Anaconda3\envs\orion\lib\urllib\request.py:531, in OpenerDirector.open(self, fullurl, data, timeout)
    529 for processor in self.process_response.get(protocol, []):
    530     meth = getattr(processor, meth_name)
--> 531     response = meth(req, response)
    533 return response

File ~\Anaconda3\envs\orion\lib\urllib\request.py:640, in HTTPErrorProcessor.http_response(self, request, response)
    637 # According to RFC 2616, "2xx" code indicates that the client's
    638 # request was successfully received, understood, and accepted.
    639 if not (200 <= code < 300):
--> 640     response = self.parent.error(
    641         'http', request, response, code, msg, hdrs)
    643 return response

File ~\Anaconda3\envs\orion\lib\urllib\request.py:569, in OpenerDirector.error(self, proto, *args)
    567 if http_err:
    568     args = (dict, 'default', 'http_error_default') + orig_args
--> 569     return self._call_chain(*args)

File ~\Anaconda3\envs\orion\lib\urllib\request.py:502, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
    500 for handler in handlers:
    501     func = getattr(handler, meth_name)
--> 502     result = func(*args)
    503     if result is not None:
    504         return result

File ~\Anaconda3\envs\orion\lib\urllib\request.py:649, in HTTPDefaultErrorHandler.http_error_default(self, req, fp, code, msg, hdrs)
    648 def http_error_default(self, req, fp, code, msg, hdrs):
--> 649     raise HTTPError(req.full_url, code, msg, hdrs, fp)

HTTPError: HTTP Error 403: Forbidden

SebiChesh commented Apr 27, 2023

Same for me.

sarahmish (Collaborator) commented

Thank you for reporting this issue.

We're working on getting things back up and running again. Thank you for being patient.

ushasai commented May 6, 2023

Can you please let us know if there is an alternative, in case it is not fixed?

sarahmish (Collaborator) commented

@ushasai a fix is proposed in PR #418

Please use the following URL to download the data: https://sintel-orion.s3.amazonaws.com/
For example, for S-1 use https://sintel-orion.s3.amazonaws.com/S-1.csv
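
As a quick sketch (untested, but pandas can read a CSV straight from a URL):

import pandas as pd

# download the S-1 signal directly from the new bucket
url = 'https://sintel-orion.s3.amazonaws.com/S-1.csv'
data = pd.read_csv(url)
print(data.head())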

sarahmish added the "question" label on May 18, 2023
nunobv commented May 24, 2023

@sarahmish is there a way to access the full list of available datasets (like an index)?
Or do we have to check the Benchmark Results CSV and manually change the URL?

In the past we could access it via https://d3-ai-orion.s3.amazonaws.com/index.html, but as of today the following message is shown:

This XML file does not appear to have any style information associated with it. The document tree is shown below.

<Error>
    <Code>AllAccessDisabled</Code>
    <Message>All access to this object has been disabled</Message>
    <RequestId>THFF58AQA71JW1TD</RequestId>
    <HostId>tAVu+rIPTPNWVYTk8lBSYEwXF6AvP0hGaiyRaufVLOhOEpOjjvTHBLXq/TpSL5Aev86hiScXDcg=</HostId>
</Error>

On top of this, all of the datasets (either complete, train or test splits) come unlabelled. Does load_anomalies work for every available dataset? If not, where do you grab the ground truth from?

sarahmish (Collaborator) commented May 24, 2023

I just added an index to the S3 bucket.

Yes, load_anomalies works for every signal we have in the bucket.
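
A minimal sketch of pairing a signal with its ground truth (untested; double-check the exact signature in orion.data):

from orion.data import load_signal, load_anomalies

data = load_signal('S-1')          # the full signal
anomalies = load_anomalies('S-1')  # ground-truth anomaly intervals
print(anomalies)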

nunobv commented May 26, 2023

Thank you @sarahmish.

One thing that is not clear to me: when do you use the train/test datasets versus the complete one? In the benchmarking results there's a field named "split", which I assume is related to this.

Do the results shown refer to models fitted on the train dataset and later detecting anomalies on the test dataset? If so, how do you adjust this experimental setup for the cases where there is no train/test split like in the Yahoo datasets?

Sorry if this goes outside of the scope of the original question.

sarahmish (Collaborator) commented

@nunobv not at all!

In terms of "split", sometimes data are divided into training/testing in advance (for example, signals in MSL and SMAP have a prior split). In order to have results comparable to other models, we use the same training/testing split.

The Yahoo datasets are not split, so the pipelines are applied to the entire signal.
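
To make this concrete, a sketch (note: load_signal also has a test_size argument, visible in the traceback above; that it takes a fraction and returns a (train, test) pair is an assumption to verify against orion/data.py):

from orion.data import load_signal

# NASA signals (SMAP & MSL) ship with a prior split
train = load_signal('S-1-train')
test = load_signal('S-1-test')

# unsplit signals are loaded whole, or split on the fly via test_size
# (assumed to return a (train, test) pair when given)
train, test = load_signal('S-1', test_size=0.2)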

I hope this answers your question!

nunobv commented May 30, 2023

Thanks @sarahmish.
That's what I assumed was happening.

I've only looked at the first 7/8 signal splits in the SMAP dataset, but as far as I can tell the train.csv has no anomalies whatsoever, meaning we are training on the normal class only. I guess this may be questionable, as we're probably training the models in a somewhat "semi-supervised" setting rather than a purely "unsupervised" one, which, I'd say, may artificially increase the performance of the models (especially generative ones).

As the same training strategy is used for every single pipeline, I guess the effect is transversal, and the current benchmark can still be used, even if only in a "relative" fashion.

What's your take on this?

(in the meantime, I'll check whether the training splits contain any anomalies for every other signal; a rough sketch of the check is below)
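
Something along these lines is what I have in mind (a rough sketch; it assumes load_signal returns a 'timestamp' column and load_anomalies returns 'start'/'end' interval columns):

from orion.data import load_signal, load_anomalies

train = load_signal('S-1-train')
anomalies = load_anomalies('S-1')

# count labelled intervals that overlap the training split's time range
overlap = anomalies[(anomalies['end'] >= train['timestamp'].min())
                    & (anomalies['start'] <= train['timestamp'].max())]
print(len(overlap), 'labelled interval(s) overlap the training split')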

sarahmish (Collaborator) commented

@nunobv yes, in the SMAP and MSL datasets the anomalies are only present in the test split. I agree that there is a level of supervision, since we have prior knowledge that the training split contains no anomalies, only "normal" observations.

I want to emphasize a couple of points about the benchmark:

  • Only the NASA signals (SMAP & MSL) are split into training and testing; the other datasets are not split.
  • We apply the same split to be consistent with the original paper for the lstm_dynamic_threshold pipeline. For it to be a fair benchmark, all pipelines must be presented with the same data.

When investigating the benchmark results, you'll notice that pipelines achieve high F1 scores on the Yahoo and NAB datasets too, indicating that even without a split, the pipelines are able to find anomalies.

Let me know if you have any further questions!

nunobv commented May 31, 2023

100% clear.
That's what I meant by my observation that the "effect is transversal", as the same split is applied to every pipeline.

Thank you for your (usual) diligence!
