disagg: too many request make tiflash compute node crash #9334

Closed
Lily2025 opened this issue Aug 20, 2024 · 4 comments · Fixed by #9335
Labels: affects-7.5, affects-8.1, component/storage, type/enhancement

Comments


Lily2025 commented Aug 20, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

1. Run the ch workload.
2. Inject a network partition on one of the compute nodes (CN).

2. What did you expect to see? (Required)

The compute node should not crash.

3. What did you see instead (Required)

The TiFlash compute node (CN) crashes after the network partition recovers.

{"stream":"stdout","container":"errorlog","pod":"secondary-tc-tiflash-0","namespace":"ha-test-disagg-tiflash-tps-7552417-1-58","time":"2024-08-19T17:34:59.20448412Z","log":"[2024/08/20 01:34:58.361 +08:00] [ERROR] [BaseDaemon.cpp:560] ["\n 0x55a4c9778b9e\tfaultSignalHandler(int, siginfo_t*, void*) [tiflash+124169118]\n \tlibs/libdaemon/src/BaseDaemon.cpp:211\n 0x7fb214a5e6f0\t [libc.so.6+255728]\n 0x55a4c95a7d9a\tDB::DM::SegmentReadTask::SegmentReadTask(std::__1::shared_ptrDB::Logger const&, DB::Context const&, std::__1::shared_ptrDB::DM::ScanContext const&, DB::DM::RemotePb::RemoteSegment const&, DB::DM::DisaggTaskId const&, unsigned long, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator> const&, unsigned int, long) [tiflash+122264986]\n \t/usr/local/bin/../include/c++/v1/__memory/shared_ptr.h:884\n 0x55a4cad9eb63\tstd::__1::__function::__func<DB::StorageDisaggregated::buildReadTaskForWriteNodeTable(DB::Context const&, std::__1::shared_ptrDB::DM::ScanContext const&, DB::DM::DisaggTaskId const&, unsigned long, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator> const&, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator> const&, std::__1::mutex&, std::__1::list<std::__1::shared_ptrDB::DM::SegmentReadTask, std::__1::allocator<std::__1::shared_ptrDB::DM::SegmentReadTask>>&)::$_0, std::__1::allocator<DB::StorageDisaggregated::buildReadTaskForWriteNodeTable(DB::Context const&, std::__1::shared_ptrDB::DM::ScanContext const&, DB::DM::DisaggTaskId const&, unsigned long, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator> const&, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator> const&, std::__1::mutex&, std::__1::list<std::__1::shared_ptrDB::DM::SegmentReadTask, std::__1::allocator<std::__1::shared_ptrDB::DM::SegmentReadTask>>&)::$_0>, void ()>::operator()() (.139ff689715caee4ff84ce0b2eee41ae) [tiflash+147393379]\n 
\t/usr/local/bin/../include/c++/v1/__memory/construct_at.h:41\n 0x55a4c9a903b5\tauto DB::wrapInvocable<std::__1::function<void ()>>(bool, std::__1::function<void ()>&&)::'lambda'()::operator()() [tiflash+127411125]\n \t/usr/local/bin/../include/c++/v1/__functional/function.h:517\n 0x55a4c41e60c5\tstd::__1::packaged_task<void ()>::operator()() [tiflash+34439365]\n \t/usr/local/bin/../include/c++/v1/future:1891\n 0x55a4c419e4d6\tDB::DynamicThreadPool::executeTask(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_deleteDB::IExecutableTask>&) [tiflash+34145494]\n \tdbms/src/Common/DynamicThreadPool.cpp:124\n 0x55a4c419e973\tDB::DynamicThreadPool::dynamicWork(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_deleteDB::IExecutableTask>) [tiflash+34146675]\n \tdbms/src/Common/DynamicThreadPool.cpp:148\n 0x55a4c419f3df\tvoid* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_deletestd::__1::__thread_struct>, std::__1::thread DB::ThreadFactory::newThread<void (DB::DynamicThreadPool::)(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_deleteDB::IExecutableTask>), DB::DynamicThreadPool, std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_deleteDB::IExecutableTask>>(bool, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator>, void (DB::DynamicThreadPool::&&)(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_deleteDB::IExecutableTask>), DB::DynamicThreadPool&&, std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_deleteDB::IExecutableTask>&&)::'lambda'(auto&&...), DB::DynamicThreadPool*, std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_deleteDB::IExecutableTask>>>(void*) [tiflash+34149343]\n \t/usr/local/bin/../include/c++/v1/__type_traits/invoke.h:308\n 0x7fb214aa9c02\tstart_thread [libc.so.6+564226]"] [source=BaseDaemon] [thread_id=30184]\n"}

4. What is your TiFlash version? (Required)

/tiflash/tiflash version
TiFlash
Release Version: v8.3.0-alpha
Edition: Community
Git Commit Hash: 14ed7c0
Git Branch: heads/refs/tags/v8.3.0-alpha
UTC Build Time: 2024-08-15 11:39:16
Enable Features: jemalloc sm4(GmSSL) mem-profiling avx2 avx512 unwind thinlto
Profile: RELWITHDEBINFO
Compiler: clang++ 17.0.6

Raft Proxy
Git Commit Hash: 4ebe44d321d4c738d89bc145d418b1d6f3464862
Git Commit Branch: HEAD
UTC Build Time: ""
Rust Version: rustc 1.77.0-nightly (89e2160c4 2023-12-27)
Storage Engine: tiflash
Prometheus Prefix: tiflash_proxy_
Profile: release
Enable Features: external-je

@Lily2025 Lily2025 added the type/bug The issue is confirmed as a bug. label Aug 20, 2024
@Lily2025
Author

/assign CalvinNeo

@Lily2025
Author

/severity critical

@JinheLin
Contributor


The reason is that too many threads were created in StorageDisaggregated, resulting in thread creation failure.

std::__1::system_error, e.what() = thread constructor failed: Resource temporarily unavailable
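
For context, "Resource temporarily unavailable" is the message for EAGAIN, and the C++ standard specifies `resource_unavailable_try_again` as the error condition when `std::thread` cannot start a new thread. Below is a minimal sketch of how that failure can be detected rather than left to propagate. The helpers `isThreadLimitError` and `trySpawn` are hypothetical names used for illustration and are not part of TiFlash.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <system_error>
#include <thread>
#include <utility>

// Hypothetical helper: true when the error corresponds to EAGAIN, i.e. the
// system refused to create another thread ("Resource temporarily unavailable").
inline bool isThreadLimitError(const std::system_error & e)
{
    return e.code() == std::errc::resource_unavailable_try_again;
}

// Hypothetical helper: try to start a thread for `job`.
// Returns std::nullopt when thread creation fails because of a resource
// limit, so the caller can queue, retry, or reject the work instead of
// letting the exception escape.
inline std::optional<std::thread> trySpawn(std::function<void()> job)
{
    assert(job); // an empty std::function would throw when invoked
    try
    {
        return std::thread(std::move(job));
    }
    catch (const std::system_error & e)
    {
        if (isThreadLimitError(e))
            return std::nullopt;
        throw; // unrelated failure: keep the original exception
    }
}
```

A caller would check the returned optional and `join()` the thread when one was created. Handling the error this way only softens the symptom; bounding the number of threads created per request addresses the cause.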

@JinheLin JinheLin added affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. and removed may-affects-7.5 may-affects-8.1 labels Aug 20, 2024
@JaySon-Huang
Contributor

Changing this to an enhancement, because the crash is caused by a large number of requests creating too many threads. We will try to reduce the number of threads created for handling disaggregated requests.
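
One common way to cap thread creation is to run request work on a fixed-size pool, so a burst of requests only lengthens a queue instead of spawning more threads. The sketch below illustrates that idea under stated assumptions: `BoundedPool` is a hypothetical class written for this example and is not TiFlash's IOThreadPool or its actual fix.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical illustration: a pool with a fixed number of worker threads.
// Submitting more jobs never creates more threads; jobs wait in a queue.
class BoundedPool
{
public:
    explicit BoundedPool(std::size_t worker_count)
    {
        assert(worker_count > 0);
        workers.reserve(worker_count);
        for (std::size_t i = 0; i < worker_count; ++i)
            workers.emplace_back([this] { workerLoop(); });
    }

    // Drains the remaining jobs, then joins all workers.
    ~BoundedPool()
    {
        {
            std::lock_guard<std::mutex> lock(mu);
            stopping = true;
        }
        cv.notify_all();
        for (auto & w : workers)
            w.join();
    }

    BoundedPool(const BoundedPool &) = delete;
    BoundedPool & operator=(const BoundedPool &) = delete;

    // Enqueue a job; it runs on one of the existing workers.
    void submit(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(mu);
            jobs.push(std::move(job));
        }
        cv.notify_one();
    }

    std::size_t workerCount() const { return workers.size(); }

private:
    void workerLoop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mu);
                cv.wait(lock, [this] { return stopping || !jobs.empty(); });
                if (jobs.empty())
                    return; // stopping was requested and the queue is drained
                job = std::move(jobs.front());
                jobs.pop();
            }
            job(); // run outside the lock so other workers can proceed
        }
    }

    std::mutex mu;
    std::condition_variable cv;
    std::queue<std::function<void()>> jobs;
    bool stopping = false;
    std::vector<std::thread> workers;
};
```

For example, constructing `BoundedPool pool(4)` and calling `pool.submit(...)` a hundred times runs all hundred jobs on the same four threads. The trade-off is latency under load: work queues up rather than failing at thread creation.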

@JaySon-Huang JaySon-Huang added type/enhancement The issue or PR belongs to an enhancement. and removed type/bug The issue is confirmed as a bug. severity/critical labels Aug 20, 2024
@JaySon-Huang JaySon-Huang changed the title from "tiflash cn crash occurs after the network isolation recovery" to "disagg: too many request make tiflash compute node crash" Aug 20, 2024
@ti-chi-bot ti-chi-bot bot closed this as completed in fe07443 Aug 22, 2024
JinheLin added a commit that referenced this issue Sep 3, 2024
…ed requests (#9335)

close #9334

Use IOThreadPool to replace DynamicThreadManager.

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot added a commit that referenced this issue Sep 4, 2024
…ed requests (release-8.1) (#9402)

close #9334

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>