Reroute Unassigned Shards: when primary shards are unassigned, the fix is to reroute
Published: 2019-06-20

Red Cluster!

Source: http://blog.kiyanpro.com/2016/03/06/elasticsearch/reroute-unassigned-shards/

There are 3 cluster states:

  1. green: All primary and replica shards are active
  2. yellow: All primary shards are active, but not all replica shards are active
  3. red: Not all primary shards are active
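
The three states above follow mechanically from which shards are active. As a minimal sketch of that rule (the function name `cluster_status` and the `(is_primary, is_active)` tuple layout are illustrative assumptions, not part of Elasticsearch):

```python
def cluster_status(shards):
    """Return 'green', 'yellow', or 'red' for a list of (is_primary, is_active) pairs."""
    primaries_ok = all(active for primary, active in shards if primary)
    replicas_ok = all(active for primary, active in shards if not primary)
    if not primaries_ok:
        return "red"      # at least one primary shard is not active
    if not replicas_ok:
        return "yellow"   # primaries are fine, but some replica is not active
    return "green"        # every primary and replica is active
```

For example, one active primary with one inactive replica is classified as yellow, while any inactive primary makes the result red regardless of the replicas.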

When cluster health is red, at least one primary shard is not active, so the indices that own those shards cannot serve all of their data until they recover, which is very bad indeed. I will share how to deal with one common situation: a cluster that is red because of unassigned shards.

Steps

The general idea is pretty simple: find the shards that are unassigned, then manually assign them to a node with the reroute API. Let's see how to do that step by step, and then combine the steps into a simple, configurable script.

Step 1: Check Unassigned Shards

To get cluster information, we usually use cat APIs. There is a GET /_cat/shards endpoint to show a detailed view of what nodes contain which shards.

Cat shards

# cat shards verbose
curl "http://your.elasticsearch.host.com:9200/_cat/shards?v"

# cat shards index
curl "http://your.elasticsearch.host.com:9200/_cat/shards/wiki2"
# example return
# wiki2 0 p STARTED 197 3.2mb 192.168.56.10 Stiletto
# wiki2 1 p STARTED 205 5.9mb 192.168.56.30 Frankie Raye
# wiki2 2 p STARTED 275 7.8mb 192.168.56.20 Commander Kraken

By piping cat shards to fgrep, we can get all unassigned shards.

Get unassigned shards

# cat shards with fgrep
curl "http://your.elasticsearch.host.com:9200/_cat/shards" | fgrep UNASSIGNED
# example return
# wiki1 0 r UNASSIGNED ALLOCATION_FAILED
# wiki1 1 r UNASSIGNED ALLOCATION_FAILED
# wiki1 2 r UNASSIGNED ALLOCATION_FAILED
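
The same filtering can be done on the response text without a shell pipeline. The following is a minimal sketch, assuming the whitespace-separated column order shown in the example output (index, shard, primary/replica flag, state); the function name `find_unassigned` and the dictionary keys are illustrative.

```python
def find_unassigned(cat_shards_text):
    """Parse the text returned by GET /_cat/shards and return the unassigned shards."""
    unassigned = []
    for line in cat_shards_text.splitlines():
        fields = line.split()
        # Assumed columns: index, shard, prirep (p or r), state, then optional extras.
        if len(fields) >= 4 and fields[3] == "UNASSIGNED":
            unassigned.append({
                "index": fields[0],
                "shard": int(fields[1]),
                "primary": fields[2] == "p",
            })
    return unassigned


SAMPLE = """\
wiki1 0 r UNASSIGNED ALLOCATION_FAILED
wiki2 0 p STARTED 197 3.2mb 192.168.56.10 Stiletto
wiki1 1 r UNASSIGNED ALLOCATION_FAILED
"""

if __name__ == "__main__":
    for shard in find_unassigned(SAMPLE):
        print(shard)
```

Because the check looks at the state column only, a header row produced by the `?v` option is skipped as well.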


If you would rather not write shell script, you can also find these unassigned shards with another endpoint, POST /_flush/synced. This endpoint does more than report information: it lets an administrator initiate a synced flush manually. That is particularly useful for a planned (rolling) cluster restart, where you can stop indexing and don't want to wait the default 5 minutes for idle indices to be sync-flushed automatically. It returns a JSON response. Note that synced flush was deprecated in Elasticsearch 7.6 and removed in 8.0, so this approach applies only to older clusters.

_flush/synced

curl -XPOST "http://your.elasticsearch.host.com:9200/twitter/_flush/synced"

If there are failed shards in the response, we can iterate through a failures array to get all unassigned ones.

Example response with failed shards

{
  "_shards": {
    "total": 4,
    "successful": 1,
    "failed": 1
  },
  "twitter": {
    "total": 4,
    "successful": 3,
    "failed": 1,
    "failures": [
      {
        "shard": 1,
        "reason": "unexpected error",
        "routing": {
          "state": "STARTED",
          "primary": false,
          "node": "SZNr2J_ORxKTLUCydGX4zA",
          "relocating_node": null,
          "shard": 1,
          "index": "twitter"
        }
      }
    ]
  }
}
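
Iterating through the failures array might look like the sketch below. It assumes the response shape shown above, where every top-level key other than `_shards` is an index name; the function name `failed_shards` is illustrative.

```python
import json


def failed_shards(response):
    """Collect (index, shard, reason) for every entry in each index's failures array."""
    results = []
    for index_name, stats in response.items():
        if index_name == "_shards":
            continue  # overall summary, not an index
        for failure in stats.get("failures", []):
            results.append((index_name, failure["shard"], failure["reason"]))
    return results


# The example response from the article, parsed from JSON text.
RESPONSE = json.loads("""
{
  "_shards": {"total": 4, "successful": 1, "failed": 1},
  "twitter": {
    "total": 4, "successful": 3, "failed": 1,
    "failures": [
      {"shard": 1, "reason": "unexpected error",
       "routing": {"state": "STARTED", "primary": false,
                   "node": "SZNr2J_ORxKTLUCydGX4zA", "relocating_node": null,
                   "shard": 1, "index": "twitter"}}
    ]
  }
}
""")

if __name__ == "__main__":
    print(failed_shards(RESPONSE))
```

Indices without a failures key contribute nothing, so a fully successful response yields an empty list.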


Step 2: Reroute

The reroute API lets you explicitly execute cluster reroute allocation commands. For example, an unassigned shard can be explicitly allocated to a specific node.

Reroute example

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands" : [
    {
      "move" : {
        "index" : "test", "shard" : 0,
        "from_node" : "node1", "to_node" : "node2"
      }
    },
    {
      "allocate" : {
        "index" : "test", "shard" : 1, "node" : "node3"
      }
    }
  ]
}'

There are 3 kinds of commands you can use:

move: Move a started shard from one node to another node. Accepts index and shard for index name and shard number, from_node for the node to move the shard from, and to_node for the node to move the shard to.

cancel: Cancel allocation of a shard (or recovery). Accepts index and shard for index name and shard number, and node for the node to cancel the shard allocation on. It also accepts allow_primary flag to explicitly specify that it is allowed to cancel allocation for a primary shard. This can be used to force resynchronization of existing replicas from the primary shard by cancelling them and allowing them to be reinitialized through the standard reallocation process.

allocate: Allocate an unassigned shard to a node. Accepts index and shard for index name and shard number, and node for the node to allocate the shard to. It also accepts an allow_primary flag to explicitly permit allocating a primary shard (this might result in data loss). In Elasticsearch 5.0 and later, allocate was replaced by allocate_replica, allocate_stale_primary, and allocate_empty_primary.
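
To make the parameters of the three commands concrete, here is a minimal sketch that builds the JSON body for POST /_cluster/reroute. The helper names (`move`, `cancel`, `allocate`, `reroute_body`) are illustrative, and the command names follow this article, so they assume an older Elasticsearch release that still accepts `allocate`.

```python
import json


def move(index, shard, from_node, to_node):
    """Build a move command for a started shard."""
    return {"move": {"index": index, "shard": shard,
                     "from_node": from_node, "to_node": to_node}}


def cancel(index, shard, node, allow_primary=False):
    """Build a cancel command for a shard allocation (or recovery) on a node."""
    return {"cancel": {"index": index, "shard": shard,
                       "node": node, "allow_primary": allow_primary}}


def allocate(index, shard, node, allow_primary=False):
    """Build an allocate command; allow_primary=True might result in data loss."""
    return {"allocate": {"index": index, "shard": shard,
                         "node": node, "allow_primary": allow_primary}}


def reroute_body(*commands):
    """Serialize commands into the request body for POST /_cluster/reroute."""
    return json.dumps({"commands": list(commands)})


if __name__ == "__main__":
    print(reroute_body(move("test", 0, "node1", "node2"),
                       allocate("test", 1, "node3")))
```

Keeping `allow_primary` off by default means a primary shard is only force-allocated when the caller asks for it explicitly.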

Combining Step 2 with the unassigned shards found in Step 1, we can reroute all unassigned shards one by one and recover the cluster from the red state faster.

Example Solutions

Python

Below is a Python script I wrote using POST /_flush/synced and POST /_cluster/reroute
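
The author's script is not included in this copy, and the following is not a reconstruction of it. It is a minimal sketch of the same idea under different assumptions: it lists shards with GET /_cat/shards rather than POST /_flush/synced, sends one `allocate` command per unassigned shard, and targets a single node chosen by the caller. The names `http`, `reroute_unassigned`, and the `send` parameter are hypothetical, and the `allocate` command assumes an older Elasticsearch release.

```python
import json
import urllib.request


def http(method, url, body=None):
    """Send one HTTP request and return the response body as text."""
    data = body.encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def reroute_unassigned(base_url, node, allow_primary=False, send=http):
    """Allocate every UNASSIGNED shard listed by _cat/shards to `node`.

    `send` is injectable so the loop can be exercised without a live cluster.
    """
    rerouted = []
    listing = send("GET", base_url + "/_cat/shards")
    for line in listing.splitlines():
        fields = line.split()
        # Assumed columns: index, shard, prirep, state, ...
        if len(fields) < 4 or fields[3] != "UNASSIGNED":
            continue
        command = {"allocate": {"index": fields[0], "shard": int(fields[1]),
                                "node": node, "allow_primary": allow_primary}}
        send("POST", base_url + "/_cluster/reroute",
             json.dumps({"commands": [command]}))
        rerouted.append((fields[0], int(fields[1])))
    return rerouted

# Example call against a hypothetical local cluster and node name:
# reroute_unassigned("http://localhost:9200", "node3")
```

Unlike the shell version, the index name is read from each line, so shards from several indices can be handled in one pass.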

Shell Script

Below is a shell script I found elsewhere in a blog post

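# Assumes every unassigned shard belongs to index "t37" (index name)
# and should be allocated to node "datanode15" (node name).
# allow_primary may cause data loss for primary shards.
for shard in $(curl -s -XGET "http://localhost:9200/_cat/shards" | grep UNASSIGNED | awk '{print $2}'); do
  # Double quotes let the shell expand $shard; JSON does not allow # comments.
  curl -XPOST "http://localhost:9200/_cluster/reroute" -d "{
    \"commands\" : [ {
      \"allocate\" : {
        \"index\" : \"t37\",
        \"shard\" : $shard,
        \"node\" : \"datanode15\",
        \"allow_primary\" : true
      }
    } ]
  }"
  sleep 5
done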

EDIT: Based on Vincent’s comment I updated the shell script:

Possible Unassigned Shard Reasons

FYI, these are the possible reasons for a shard to be in an unassigned state:

| Name | Comment |
| --- | --- |
| INDEX_CREATED | Unassigned as a result of an API creation of an index |
| CLUSTER_RECOVERED | Unassigned as a result of a full cluster recovery |
| INDEX_REOPENED | Unassigned as a result of opening a closed index |
| DANGLING_INDEX_IMPORTED | Unassigned as a result of importing a dangling index |
| NEW_INDEX_RESTORED | Unassigned as a result of restoring into a new index |
| EXISTING_INDEX_RESTORED | Unassigned as a result of restoring into a closed index |
| REPLICA_ADDED | Unassigned as a result of explicit addition of a replica |
| ALLOCATION_FAILED | Unassigned as a result of a failed allocation of the shard |
| NODE_LEFT | Unassigned as a result of the node hosting it leaving the cluster |
| REROUTE_CANCELLED | Unassigned as a result of explicit cancel reroute command |
| REINITIALIZED | When a shard moves from started back to initializing, for example, with shadow replicas |
| REALLOCATED_REPLICA | A better replica location is identified and causes the existing replica allocation to be cancelled |
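
To see which of these reasons dominate on a given cluster, the reason column can be tallied. This is a minimal sketch, assuming the listing was requested with the reason as the fifth column (for example `GET /_cat/shards?h=index,shard,prirep,state,unassigned.reason`); the function name `count_unassigned_reasons` is illustrative.

```python
from collections import Counter


def count_unassigned_reasons(cat_text):
    """Count UNASSIGNED shards per reason.

    Expects whitespace-separated columns: index, shard, prirep, state, reason.
    """
    reasons = Counter()
    for line in cat_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[3] == "UNASSIGNED":
            reasons[fields[4]] += 1
    return reasons


SAMPLE = """\
wiki1 0 r UNASSIGNED ALLOCATION_FAILED
wiki1 1 r UNASSIGNED ALLOCATION_FAILED
wiki3 0 r UNASSIGNED NODE_LEFT
wiki2 0 p STARTED
"""

if __name__ == "__main__":
    for reason, count in count_unassigned_reasons(SAMPLE).most_common():
        print(reason, count)
```

A cluster dominated by NODE_LEFT usually points at a missing node, while ALLOCATION_FAILED points at the shards themselves, so the tally helps decide whether rerouting is the right fix.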

