Hacker News

Where does it keep its data? Docker volumes?


Like Jodok says, we don't recommend this, but you could do it. What this would mean is that if you destroy a container, you also destroy its data. This would force your cluster to rebalance itself, adding IOPS overhead.

When you map a volume into the container as suggested, the data persists through a container restart or replacement. When the container is instantiated, the volume is read and the node checksums the shards it finds to make sure they're not stale; if they are, they're brought up to date. By tuning the recovery settings you can avoid extraneous shard movement and therefore leverage containers as you would expect.


we recommend exposing a host directory to crate ('docker run -d -p 4200:4200 -p 4300:4300 -v <data-dir>:/data crate') and configuring replicas. if one of the crate containers disappears, its replicas will be promoted to primary shards and new replicas created on the fly. it's also possible to expose multiple directories (e.g. on multiple disks for more performance); you can configure crate to use them in parallel.
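As a sketch of the multiple-directories setup described above (the host paths are made up for illustration, and the ES-style `-Des.path.data` setting is an assumption; check the crate docs for the exact setting name on your version):

```
# mount two host disks into the container (illustrative paths)
docker run -d \
  -p 4200:4200 -p 4300:4300 \
  -v /mnt/disk1/crate:/data1 \
  -v /mnt/disk2/crate:/data2 \
  crate \
  crate -Des.path.data=/data1,/data2   # assumed: ES-style setting to use both dirs in parallel
```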


I guess this is using ES multiple datapath support?

Are you also planning to move to a single shard per datapath like ES? If that is the case, what are your thoughts on increasing the shard count once you're on a single shard per datapath?


I guess you are talking about the plan to have the data of one shard only on one disk (see https://github.com/elastic/elasticsearch/issues/9498)? This does not necessarily mean that you will end up having only one shard per datapath - only if you have just one shard per node. But you are right, the change might lead to unbalanced disk usage in some scenarios, where increasing the number of shards would solve the problem.

There are two options:

1. (Recommended for now) Export the table with COPY TO ( https://crate.io/docs/stable/sql/reference/copy_to.html). Drop the table and then import it again using COPY FROM (https://crate.io/docs/stable/sql/reference/copy_from.html ).

2. Use insert by query (see https://crate.io/docs/stable/sql/dml.html#inserting-data-by-... ) if it is OK for you to copy all the data to another table (with more shards).

1) is recommended, since it allows for throttling at import time (see https://crate.io/docs/en/latest/best_practice/data_import.ht...) and also does not require renaming a table, which is currently not implemented but is on our backlog. However, I think once ES 2.0 is out we will have table renames and also throttling in insert by query, so option 2) will be recommended then.
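A hedged sketch of the two options (table and column names are illustrative, and the exact COPY syntax may differ between versions):

```
-- Option 1: export, recreate with more shards, re-import
COPY mytable TO DIRECTORY '/tmp/export/';
DROP TABLE mytable;
CREATE TABLE mytable (id int, body string) CLUSTERED INTO 12 SHARDS;
COPY mytable FROM '/tmp/export/*';

-- Option 2: insert by query into a new table with more shards
CREATE TABLE mytable_new (id int, body string) CLUSTERED INTO 12 SHARDS;
INSERT INTO mytable_new (id, body) (SELECT id, body FROM mytable);
```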

Our general recommendation regarding the fixed-number-of-shards limitation is to choose a higher number of shards upfront (the number of expected cores covers most use cases) or to use partitioned tables (https://crate.io/docs/en/latest/sql/partitioned_tables.html) where possible, since those allow changing the number of shards for future partitions.
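A minimal partitioned-table sketch (table and column names are made up for illustration); the assumption here is that altering number_of_shards on a partitioned table only applies to partitions created afterwards:

```
CREATE TABLE metrics (
  ts timestamp,
  day timestamp,
  value double
) CLUSTERED INTO 6 SHARDS
  PARTITIONED BY (day);

-- future partitions would be created with the new shard count;
-- existing partitions keep their original 6 shards
ALTER TABLE metrics SET (number_of_shards = 12);
```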




