[C++] compute::LocalTimestamp() Performs incorrect conversion #45751

Open
gowerc opened this issue Mar 11, 2025 · 5 comments

Comments

gowerc commented Mar 11, 2025

Describe the bug, including details regarding any error messages, version, and platform.

Apologies in advance if I've made a mistake here; I am relatively new to the Arrow C++ API and also to handling timestamps. That said, I think there might be a bug in the compute::LocalTimestamp() function (at least, it appears to be producing results I wouldn't have expected).

For example, take a timestamp (in seconds) of:

2222997212 = Monday, June 11, 2040  3:13:32 UTC
           = Sunday, June 10, 2040 23:13:32 America/New_York (EDT)

Assuming that the value is stored in a Timestamp array with a timezone of America/New_York (EDT on this date), I would have expected compute::LocalTimestamp() to produce a value of:

2222982812 = Sunday, June 10, 2040 23:13:32 UTC

However in practice when doing this I am observing an actual value of:

2222979212 = Sunday, June 10, 2040 22:13:32 UTC
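
The size of the discrepancy can be checked with plain arithmetic. The following is a minimal sketch that uses only the three values quoted above; the constant and function names are illustrative and not part of Arrow. It shows that the expected value corresponds to a UTC-4 (EDT) offset, while the observed value corresponds to a UTC-5 (EST) offset.

```cpp
#include <cstdint>

// Values quoted in this report (seconds since the Unix epoch).
constexpr std::int64_t kInputUtc = 2222997212;  // 2040-06-11 03:13:32 UTC
constexpr std::int64_t kExpected = 2222982812;  // expected timezone-naive value
constexpr std::int64_t kObserved = 2222979212;  // value observed from LocalTimestamp()

// Offset (in seconds) implied by a local value for a given UTC value.
// Illustrative helper, not an Arrow function.
constexpr std::int64_t implied_offset(std::int64_t local, std::int64_t utc) {
    return local - utc;
}

// Expected result: input shifted by -4 hours (EDT).
static_assert(implied_offset(kExpected, kInputUtc) == -4 * 3600, "expected value implies UTC-4");
// Observed result: input shifted by -5 hours (EST).
static_assert(implied_offset(kObserved, kInputUtc) == -5 * 3600, "observed value implies UTC-5");
// The two results differ by exactly one hour.
static_assert(kExpected - kObserved == 3600, "results differ by one hour");
```

The checks run at compile time, so the snippet compiles only if the arithmetic holds.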

I tried searching but I couldn't see any other issues (open or closed) related to this.


I am running on Fedora 41 using libarrow-16.1.0-12.fc41.x86_64 (latest available from the fedora package manager)

--- EDIT - Just tested against arrow-19.0.1 and am still getting the same behavior ---

Code I am running to reproduce this:

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/compute/api.h>
#include <iostream>
#include <memory>


arrow::Status RunMain() {
    // Create timestamp array with the target value
    arrow::TimestampBuilder builder(
        arrow::timestamp(arrow::TimeUnit::SECOND, "America/New_York"),
        arrow::default_memory_pool()
    );
    ARROW_RETURN_NOT_OK(builder.Append(2222997212));
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> array_raw, builder.Finish());
    auto array = std::static_pointer_cast<arrow::TimestampArray>(array_raw);


    // Display what the current value is
    std::cout << "Value = " << array->Value(0) << std::endl; // 2222997212

    // Convert to timezone-naive local time and then display the value again
    ARROW_ASSIGN_OR_RAISE(
        auto array_converted_raw,
        arrow::compute::LocalTimestamp(array)
    );
    auto array_converted = std::static_pointer_cast<arrow::TimestampArray>(array_converted_raw.make_array());
    std::cout << "Value = " << array_converted->Value(0) << std::endl; // 2222979212
    
    return arrow::Status::OK();
}


int main (int argc, char** argv) {
    arrow::Status st = RunMain();
    if (!st.ok()) {
        std::cerr << st << std::endl;
        return 1;
    }
    return 0;
}

Component(s)

C++

@gowerc gowerc changed the title compute::LocalTimestamp() Resulting in incorrect conversion compute::LocalTimestamp() Performs incorrect conversion Mar 11, 2025
gowerc (Author) commented Mar 11, 2025

Just to add, experimenting with a different timezone library (date/tz.h) gets the expected 2222982812 value:

#include <iostream>
#include <chrono>
#include <date/date.h>
#include <date/tz.h>


int main() {
    date::sys_seconds utc_time{std::chrono::seconds(2222997212)};
    date::zoned_time ny_time{"America/New_York", utc_time};
    std::cout << "Epoch seconds:  " << ny_time.get_sys_time().time_since_epoch().count() << std::endl;
    std::cout << "UTC time:       " << date::format("%F %T %Z", utc_time) << '\n';
    std::cout << "NY time:        " << date::format("%F %T %Z", ny_time) << '\n';
    date::local_seconds naive_local = ny_time.get_local_time();
    std::cout << "NY time naive:  " << naive_local.time_since_epoch().count() << "\n";
}

Output:

Epoch seconds:  2222997212
UTC time:       2040-06-11 03:13:32 UTC
NY time:        2040-06-10 23:13:32 EDT
NY time naive:  2222982812

@kou kou changed the title compute::LocalTimestamp() Performs incorrect conversion [C++] compute::LocalTimestamp() Performs incorrect conversion Mar 11, 2025
kou (Member) commented Mar 12, 2025

TimestampArray values don't depend on the timezone; TimestampArray::type() holds the timezone information instead. If you want to get seconds offset by the timezone, you need to convert it by yourself or we may want to add a new compute kernel for it.

BTW, why do you want to get offset-ed seconds?

FYI: The document of local_timestamp(): https://arrow.apache.org/docs/cpp/compute.html#timezone-handling

gowerc (Author) commented Mar 12, 2025

Hi @kou thank you for your time and reply,

> You need to convert it by yourself or we may want to add a new compute kernel for it.

Apologies, I am confused: from the documentation I thought that was the exact purpose of the local_timestamp() computation? In particular, from the docs:

> local_timestamp function converts UTC-relative timestamps to local “timezone-naive” timestamps. The timezone is taken from the timezone metadata of the input timestamps.

At least, the implication from the way it's written is that it performs the following calculation (which is what I am looking for):

$$ time_{local} = time_{utc} + offset(timezone) $$

I should also note that for ~99% of the cases I've tested so far, the local_timestamp function works as I was hoping and expecting; I have just found this one example where it does not.

> BTW, why do you want to get offset-ed seconds?

I am trying to write a small CLI tool that converts Parquet data to XPT format. XPT, however, has no support for timezones, so how to correctly store timestamp data depends on the use case: some users prefer to store the data as timezone-naive, whilst others (myself included) prefer to just store the UTC-relative timestamps. To this end, I am just providing an option for the user to choose.

EDIT -- After spending more time than I care to admit reading about timestamps, I think this might be a bug in how the tzdata information is consumed. The issue only seems to occur after 2038 and mostly during daylight saving time, which suggests that the timezone rules are not being applied correctly. 2038 is a common failure point because it is where signed 32-bit second counts overflow. I haven't looked at the underlying code, but it seems suspicious that this error starts at this specific year and that the value Arrow produces is exactly 1 hour off the expected value.
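
To make the 2038 observation concrete, the sketch below compares the timestamp from this report with the largest value a signed 32-bit integer can hold. It only illustrates why values after January 2038 are a plausible trigger; that any 32-bit arithmetic is actually involved in Arrow or its timezone handling is an assumption, not a finding. The helper name fits_in_int32 is made up for this example.

```cpp
#include <cstdint>
#include <limits>

// Largest second count a signed 32-bit integer can hold:
// 2147483647 s after the epoch = 2038-01-19 03:14:07 UTC.
constexpr std::int64_t kInt32Max = std::numeric_limits<std::int32_t>::max();
constexpr std::int64_t kInt32Min = std::numeric_limits<std::int32_t>::min();

// Hypothetical helper: true if the epoch value survives storage in 32 bits.
constexpr bool fits_in_int32(std::int64_t epoch_seconds) {
    return epoch_seconds >= kInt32Min && epoch_seconds <= kInt32Max;
}

static_assert(kInt32Max == 2147483647, "32-bit limit");
// The timestamp from this report (June 2040) is past the 32-bit limit.
static_assert(!fits_in_int32(2222997212), "2040 timestamp needs more than 32 bits");
// The limit itself still fits; one second later does not.
static_assert(fits_in_int32(kInt32Max), "limit fits");
static_assert(!fits_in_int32(kInt32Max + 1), "limit + 1 overflows");
```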

kou (Member) commented Mar 13, 2025

Ah, sorry. I misunderstood this. I thought that local_timestamp() converts nothing, but local_timestamp() does convert and a wrong offset is used, right?

Is this a duplicate of #36110?

gowerc (Author) commented Mar 13, 2025

Hmm, I'm not sure, to be honest. On the surface it definitely appears to be related, but I'm not sure it's exactly the same. In that ticket the issue seems to be a mismatch in how Arrow and Python interpolate missing rules when going into the future. Here, however, I can clearly see that the rules in my local tzdata database extend up until 2499:

> zdump -v America/New_York
<snip>
America/New_York  Sun Mar  9 06:59:59 2498 UT = Sun Mar  9 01:59:59 2498 EST isdst=0 gmtoff=-18000
America/New_York  Sun Mar  9 07:00:00 2498 UT = Sun Mar  9 03:00:00 2498 EDT isdst=1 gmtoff=-14400
America/New_York  Sun Nov  2 05:59:59 2498 UT = Sun Nov  2 01:59:59 2498 EDT isdst=1 gmtoff=-14400
America/New_York  Sun Nov  2 06:00:00 2498 UT = Sun Nov  2 01:00:00 2498 EST isdst=0 gmtoff=-18000
America/New_York  Sun Mar  8 06:59:59 2499 UT = Sun Mar  8 01:59:59 2499 EST isdst=0 gmtoff=-18000
America/New_York  Sun Mar  8 07:00:00 2499 UT = Sun Mar  8 03:00:00 2499 EDT isdst=1 gmtoff=-14400
America/New_York  Sun Nov  1 05:59:59 2499 UT = Sun Nov  1 01:59:59 2499 EDT isdst=1 gmtoff=-14400
America/New_York  Sun Nov  1 06:00:00 2499 UT = Sun Nov  1 01:00:00 2499 EST isdst=0 gmtoff=-18000
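
The transitions in this zdump output are consistent with the current US rule: daylight saving time starts on the second Sunday of March and ends on the first Sunday of November, at 02:00 local time in both cases. As a cross-check that needs no tzdata at all, here is a minimal sketch that hard-codes that rule for America/New_York. It assumes the rule holds for the year in question (so it does not cover years before 2007 or pre-1970 timestamps), and all function names are hypothetical helpers, not Arrow or standard library APIs. The days_from_civil function follows the widely used algorithm by Howard Hinnant.

```cpp
#include <cstdint>

// Days since 1970-01-01 for a proleptic Gregorian date.
constexpr std::int64_t days_from_civil(std::int64_t y, unsigned m, unsigned d) {
    y -= m <= 2;  // treat January and February as part of the previous year
    const std::int64_t era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + static_cast<std::int64_t>(doe) - 719468;
}

// Year containing the given day count (days >= 0 only).
constexpr std::int64_t year_of(std::int64_t days) {
    std::int64_t y = 1970 + days / 366;  // never overshoots the real year
    while (days_from_civil(y + 1, 1, 1) <= days) ++y;
    return y;
}

// Day count of the n-th Sunday (n = 1, 2, ...) of a month.
constexpr std::int64_t nth_sunday(std::int64_t year, unsigned month, std::int64_t n) {
    const std::int64_t first = days_from_civil(year, month, 1);
    const std::int64_t weekday = (first + 4) % 7;  // 0 = Sunday; 1970-01-01 was a Thursday
    return first + (7 - weekday) % 7 + 7 * (n - 1);
}

// UTC offset in seconds for America/New_York under the assumed post-2007 US rule.
constexpr std::int64_t new_york_offset(std::int64_t utc_seconds) {
    const std::int64_t year = year_of(utc_seconds / 86400);
    // 02:00 EST on the second Sunday of March = 07:00 UTC.
    const std::int64_t dst_start = nth_sunday(year, 3, 2) * 86400 + 7 * 3600;
    // 02:00 EDT on the first Sunday of November = 06:00 UTC.
    const std::int64_t dst_end = nth_sunday(year, 11, 1) * 86400 + 6 * 3600;
    return (utc_seconds >= dst_start && utc_seconds < dst_end) ? -4 * 3600 : -5 * 3600;
}

// The timestamp from the original report falls in EDT under this rule.
static_assert(new_york_offset(2222997212) == -4 * 3600, "June 2040 is EDT");
static_assert(2222997212 + new_york_offset(2222997212) == 2222982812, "expected local value");
```

Under this rule, 2222997212 + new_york_offset(2222997212) gives 2222982812, the value expected in the original report, and the 2498 transitions computed by the sketch match the instants listed by zdump.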

I also get the expected behaviour from R, Python and C++, which as far as I can tell all use the system tzdata source as well, so they should be consistent.

import zoneinfo
import datetime

def printtime(time: int):
    ny_time =  datetime.datetime.fromtimestamp(time, zoneinfo.ZoneInfo("America/New_York"))
    print(f"Time: {ny_time} ({ny_time.tzname()})")

print(zoneinfo.TZPATH)  # ('/usr/share/zoneinfo', '/usr/lib/zoneinfo', '/usr/share/lib/zoneinfo', '/etc/zoneinfo')

printtime(2095940701)  # Time: 2036-06-01 09:45:01-04:00 (EDT)
printtime(2127476701)  # Time: 2037-06-01 09:45:01-04:00 (EDT)
printtime(2159012701)  # Time: 2038-06-01 09:45:01-04:00 (EDT)
printtime(2190548701)  # Time: 2039-06-01 09:45:01-04:00 (EDT)

as.POSIXct(2095940701, tz = "America/New_York")      # "2036-06-01 09:45:01 EDT"
as.POSIXct(2127476701, tz = "America/New_York")      # "2037-06-01 09:45:01 EDT"
as.POSIXct(2159012701, tz = "America/New_York")      # "2038-06-01 09:45:01 EDT"
as.POSIXct(2190548701, tz = "America/New_York")      # "2039-06-01 09:45:01 EDT"

#include <iostream>
#include <chrono>
#include <format>

void printme(long long x) {
    std::chrono::sys_seconds utc_time{std::chrono::seconds(x)};
    std::chrono::zoned_time ny_time{"America/New_York", utc_time};
    std::cout << "Local time: " << std::format("{:%F %T %Z}", ny_time) << '\n';
}

int main() {
    std::cout << "C++ Standard Version: " << __cplusplus << std::endl;   // 2020
    printme(2095940701);    // Local time: 2036-06-01 09:45:01 EDT
    printme(2127476701);    // Local time: 2037-06-01 09:45:01 EDT
    printme(2159012701);    // Local time: 2038-06-01 09:45:01 EDT
    printme(2190548701);    // Local time: 2039-06-01 09:45:01 EDT
}

--- EDIT ---

Apologies in case that wasn't clear: the issue with the Arrow implementation is that it does not apply the correct daylight saving adjustment after 2038. For example, a UTC value of 2159012701, which in "America/New_York" falls in EDT, is adjusted as if it were in EST instead. The examples above show that R, C++ and Python all correctly recognise that the 2159012701 value should be EDT.
