Bug #12162
pg_interval_t::check_new_interval - for ec pool, should not rely on min_size to determine if the PG was active at the interval
Description
One PG on our cluster was stuck at peering+down forever; the log shows that peering was blocked by an out/down OSD:
2015-06-23 21:27:59.948809 7f26d2bd9700 10 osd.52 pg_epoch: 27468 pg[3.1e3fs0( v 25576'94308 (15251'84307,25576'94308] local-les=25568 n=73284 ec=1152 les/c 25568/18381 27457/27457/27457) [52,23,10,456,433,200,388,330,493,104,426] r=0 lpr=27457 pi=18380-27456/559 crt=25298'94302 lcod 0'0 mlcod 0'0 peering] PriorSet: build_prior final: probe 10(2),22(1),23(1),24(2),46(0),52(0),104(9),200(5),249(3),254(10),265(6),330(7),388(6),426(10),433(4),450(7),456(3),493(8) down 243 blocked_by {243=0} pg_down
The actual problem was that, when building the PriorSet, the code blindly used the pool's min_size to check whether the PG was r/w during the interval:
2015-06-23 21:28:00.357787 7f26d13d6700 10 osd.52 pg_epoch: 27471 pg[3.1e3fs0( v 25576'94308 (15251'84307,25576'94308] local-les=25568 n=73284 ec=1152 les/c 25568/18381 27471/27471/27471) [52,23,10,456,433,200,388,330,493,104,426] r=0 lpr=27471 pi=18380-27470/561 crt=25298'94302 lcod 0'0 mlcod 0'0 peering] PriorSet: build_prior interval(25614-25615 up [2147483647,2147483647,2147483647,2147483647,2147483647,200,2147483647,2147483647,243,104,426](200) acting [2147483647,2147483647,2147483647,2147483647,2147483647,200,2147483647,2147483647,243,104,426](200) maybe_went_rw)
Ceph version: v0.80.4
Credit goes to Sam for the analysis, thanks Sam!
Associated revisions
osd: pg_interval_t::check_new_interval should not rely on pool.min_size to determine if the PG was active
If the pool's min_size is set improperly, during peering pg_interval_t::check_new_interval
might wrongly determine the PG's state and cause the PG to get stuck at down+peering forever.
Fixes: #12162
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
osd: pg_interval_t::check_new_interval should not rely on pool.min_size to determine if the PG was active
If the pool's min_size is set improperly, during peering pg_interval_t::check_new_interval
might wrongly determine the PG's state and cause the PG to get stuck at down+peering forever.
Fixes: #12162
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
(cherry picked from commit 684927442d81ea08f95878a8af69d08d3a14d973)
Conflicts:
src/osd/PG.cc
because PG::start_peering_interval has an assert
that is not found in hammer in the context
src/test/osd/types.cc
because include/stringify.h is not included by
types.cc in hammer
History
#1 Updated by Guang Yang almost 9 years ago
- Subject changed from PG is stuck at down+peering forever to pg_interval_t::check_new_interval - for ec pool, should not relay on min_size to determine if the PG was active at the interval
#2 Updated by Guang Yang almost 9 years ago
- Subject changed from pg_interval_t::check_new_interval - for ec pool, should not relay on min_size to determine if the PG was active at the interval to pg_interval_t::check_new_interval - for ec pool, should not rely on min_size to determine if the PG was active at the interval
#3 Updated by Guang Yang almost 9 years ago
#4 Updated by Samuel Just over 8 years ago
- Priority changed from Normal to Urgent
#5 Updated by Samuel Just over 8 years ago
- Status changed from New to 7
#6 Updated by Samuel Just over 8 years ago
- Status changed from 7 to Pending Backport
- Backport set to hammer, firefly
#7 Updated by Loïc Dachary about 8 years ago
- Status changed from Pending Backport to Resolved