Resolving a Ceph Cluster That Has Run Out of Disk Space
Problem Description
While an OpenStack + Ceph cluster was in use, a large amount of new data was copied into the virtual machines, rapidly exhausting the cluster's disk space. With no free space left, the virtual machines could no longer be operated and every operation on the Ceph cluster failed.
Symptoms
- Restarting the virtual machines through OpenStack had no effect
- Deleting the block device directly with the rbd command also failed:
- [root@controller ~]# rbd -p volumes rm volume-c55fd052-212d-4107-a2ac-cf53bfc049be
- 2015-04-29 05:31:31.719478 7f5fb82f7760 0 client.4781741.objecter FULL, paused modify 0xe9a9e0 tid 6
- Checking the Ceph cluster's health (the output below is from ceph -s followed by ceph health detail):
- cluster 059f27e8-a23f-4587-9033-3e3679d03b31
- health HEALTH_ERR 20 pgs backfill_toofull; 20 pgs degraded; 20 pgs stuck unclean; recovery 7482/129081 objects degraded (5.796%); 2 full osd(s); 1 near full osd(s)
- monmap e6: 4 mons at {node-5e40.cloud.com=10.10.20.40:6789/0,node-6670.cloud.com=10.10.20.31:6789/0,node-66c4.cloud.com=10.10.20.36:6789/0,node-fb27.cloud.com=10.10.20.41:6789/0}, election epoch 886, quorum 0,1,2,3 node-6670.cloud.com,node-66c4.cloud.com,node-5e40.cloud.com,node-fb27.cloud.com
- osdmap e2743: 3 osds: 3 up, 3 in
- flags full
- pgmap v6564199: 320 pgs, 4 pools, 262 GB data, 43027 objects
- 786 GB used, 47785 MB / 833 GB avail
- 7482/129081 objects degraded (5.796%)
- 300 active+clean
- 20 active+degraded+remapped+backfill_toofull
- HEALTH_ERR 20 pgs backfill_toofull; 20 pgs degraded; 20 pgs stuck unclean; recovery 7482/129081 objects degraded (5.796%); 2 full osd(s); 1 near full osd(s)
- pg 3.8 is stuck unclean for 7067109.597691, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
- pg 3.7d is stuck unclean for 1852078.505139, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
- pg 3.21 is stuck unclean for 7072842.637848, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
- pg 3.22 is stuck unclean for 7070880.213397, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
- pg 3.a is stuck unclean for 7067057.863562, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
- pg 3.7f is stuck unclean for 7067122.493746, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
- pg 3.5 is stuck unclean for 7067088.369629, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
- pg 3.1e is stuck unclean for 7073386.246281, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
- pg 3.19 is stuck unclean for 7068035.310269, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
- pg 3.5d is stuck unclean for 1852078.505949, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
- pg 3.1a is stuck unclean for 7067088.429544, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
- pg 3.1b is stuck unclean for 7072773.771385, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
- pg 3.3 is stuck unclean for 7067057.864514, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
- pg 3.15 is stuck unclean for 7067088.825483, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
- pg 3.11 is stuck unclean for 7067057.862408, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
- pg 3.6d is stuck unclean for 7067083.634454, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
- pg 3.6e is stuck unclean for 7067098.452576, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
- pg 3.c is stuck unclean for 5658116.678331, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
- pg 3.e is stuck unclean for 7067078.646953, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
- pg 3.20 is stuck unclean for 7067140.530849, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
- pg 3.7d is active+degraded+remapped+backfill_toofull, acting [2,0]
- pg 3.7f is active+degraded+remapped+backfill_toofull, acting [0,2]
- pg 3.6d is active+degraded+remapped+backfill_toofull, acting [2,0]
- pg 3.6e is active+degraded+remapped+backfill_toofull, acting [2,0]
- pg 3.5d is active+degraded+remapped+backfill_toofull, acting [2,0]
- pg 3.20 is active+degraded+remapped+backfill_toofull, acting [0,2]
- pg 3.21 is active+degraded+remapped+backfill_toofull, acting [0,2]
- pg 3.22 is active+degraded+remapped+backfill_toofull, acting [0,2]
- pg 3.1e is active+degraded+remapped+backfill_toofull, acting [0,2]
- pg 3.19 is active+degraded+remapped+backfill_toofull, acting [0,2]
- pg 3.1a is active+degraded+remapped+backfill_toofull, acting [2,0]
- pg 3.1b is active+degraded+remapped+backfill_toofull, acting [0,2]
- pg 3.15 is active+degraded+remapped+backfill_toofull, acting [2,0]
- pg 3.11 is active+degraded+remapped+backfill_toofull, acting [2,0]
- pg 3.c is active+degraded+remapped+backfill_toofull, acting [2,0]
- pg 3.e is active+degraded+remapped+backfill_toofull, acting [2,0]
- pg 3.8 is active+degraded+remapped+backfill_toofull, acting [2,0]
- pg 3.a is active+degraded+remapped+backfill_toofull, acting [2,0]
- pg 3.5 is active+degraded+remapped+backfill_toofull, acting [2,0]
- pg 3.3 is active+degraded+remapped+backfill_toofull, acting [2,0]
- recovery 7482/129081 objects degraded (5.796%)
- osd.0 is full at 95%
- osd.2 is full at 95%
- osd.1 is near full at 93%
Solution 1 (verified)
Add OSD nodes. This is also the approach recommended in the official documentation. Once the new node was added, Ceph began rebalancing the data and the space used on the existing OSDs started to drop; a sketch of the commands involved follows the log excerpt below.
- 2015-04-29 06:51:58.623262 osd.1 [WRN] OSD near full (91%)
- 2015-04-29 06:52:01.500813 osd.2 [WRN] OSD near full (92%)
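A minimal sketch of what adding the OSD might look like with ceph-deploy; the host name and disk below are placeholders rather than the actual nodes from this cluster, and the exact procedure depends on how the cluster was deployed:
- ceph-deploy osd create node-new.cloud.com:sdb    # prepare and activate an OSD on the spare disk
- ceph -w    # watch the backfill progress until the full flag and the near-full warnings clear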
Solution 2 (theoretical, not verified)
If no new disks are available, a different approach is the only option. In its current state Ceph refuses all reads and writes, so none of the usual Ceph operations work. The workaround is to relax Ceph's definition of "full": the log above shows the full ratio is 95%, so we temporarily raise that threshold and then delete data as quickly as possible to bring utilization back below it.
- Tried setting the ratio directly on the running cluster, but it failed: the cluster did not start resynchronizing data, so presumably the monitor service itself still needs to be restarted
- ceph mon tell \* injectargs '--mon-osd-full-ratio 0.98'
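In pre-Luminous releases the enforced thresholds are kept in the PG map, so an alternative worth trying (an assumption on my part; it was not tested during this incident) is to set them there directly:
- ceph pg set_full_ratio 0.98
- ceph pg set_nearfull_ratio 0.95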
Modify the configuration file and then restart the monitor service. Worried about making things worse, I did not dare try this at the time; it was later confirmed on the mailing list that the method should not affect data, provided that none of the virtual machines writes anything more to Ceph during the recovery.
By default the full ratio is 95% and the near-full ratio is 85%, so these values should be adjusted to fit the actual situation.
- [global]
- mon osd full ratio = .98
- mon osd nearfull ratio = .80
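After editing ceph.conf, the monitors have to be restarted for the new values to take effect; a hedged sketch for a sysvinit-style deployment (the service invocation and monitor ID are assumptions and will differ by init system), followed by a check through the admin socket:
- service ceph restart mon    # or restart the ceph-mon daemons however the deployment manages them
- ceph daemon mon.node-6670.cloud.com config show | grep full_ratio    # replace the mon ID with your own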
Analysis and Summary
Cause
According to the official Ceph documentation, once an OSD reaches the 95% full ratio the cluster stops accepting any read or write requests from Ceph clients. That is why the virtual machines could not start when they were rebooted.
Fix
Judging from the official recommendation, adding new OSDs is the preferred approach. Temporarily raising the full ratio is a workaround, but not a recommended one: it depends on manually deleting data, and as soon as another node fails the cluster can fill up again. The best fix is to expand capacity.
Reflections
Two points from this incident are worth thinking about:
- Monitoring: a DNS misconfiguration on the server meant the monitoring e-mails could not be sent, so the Ceph WARN notifications were never received
- The cloud platform itself: because of how Ceph works, storage in an OpenStack platform is usually overcommitted. From the user's point of view, copying in a large amount of data is perfectly reasonable, but the platform had no corresponding early-warning mechanism, which is what let this problem happen (a minimal alerting sketch follows)
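As a minimal mitigation for the alerting gap, a cron job on a monitor node can watch the health string for capacity warnings; the schedule, match pattern, and recipient address below are illustrative placeholders, not part of the original setup:
- */10 * * * * root ceph health | grep -qE 'near full|full osd' && ceph health | mail -s 'Ceph capacity warning' ops@example.com    # /etc/crontab format, runs every 10 minutes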
References
http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity