database design - Cassandra - one big table vs many tables

I'm currently trying out the Cassandra database, using DataStax DevCenter and the DataStax C# driver.

My current model is quite simple and consists of only:

  • ParameterId (int) - serves as the table's id
  • Value (bigint)
  • MeasureTime (timestamp)

I will have exactly 1000 parameters (no more, no less), numbered 1 to 1000. I will get an entry for each parameter once per second, and this will run for years.

My question is regarding whether it is better practice to create a table as:

CREATE TABLE keyspace.measurement (
    parameterId int,
    value bigint,
    measureTime timestamp,
    PRIMARY KEY(parameterId, measureTime)
) WITH CLUSTERING ORDER BY (measureTime DESC);

Or would it be better to create 1000 tables, each consisting only of value and measureTime? And if so, would I still be able to range-query on measureTime?


1 Reply


You are going to hit very wide rows with this. I would advise against your table format, and I'd go with something that lets you control how wide the rows get.

Depending on your query requirements, here is a more suitable schema (IMHO):

CREATE TABLE keyspace.measurement (
    parameterId int,
    granularity timestamp,
    value bigint,
    measureTime timestamp,
    PRIMARY KEY((parameterId, granularity), measureTime)
) WITH CLUSTERING ORDER BY (measureTime DESC);

This is very similar to yours, but it has a major advantage: you can configure how wide your rows get, and you don't create any hotspots. The idea is simple: the parameterId and granularity fields together form the partition key, so they determine where your data goes, while measureTime keeps your data ordered within the partition. Supposing you want to query on a day-by-day basis, you'd store in granularity the yyyy-mm-dd part of measureTime, grouping together all the measurements of the same day.
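As a rough sketch of what a write would then look like (the parameter id, value and timestamps below are made-up illustrations, and keyspace stands for your actual keyspace name):

-- day-by-day bucketing: granularity holds the yyyy-mm-dd part of measureTime
INSERT INTO keyspace.measurement (parameterId, granularity, value, measureTime)
VALUES (42, '2016-01-19', 1337, '2016-01-19 18:30:00+0000');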

This allows you to retrieve all the values lying in the same partition (that is, for a given parameterId/granularity pair) with an efficient range query. In a day-by-day configuration you'd end up with 86400 records per partition. That number could still be high (the suggested limit is 10k per partition, IIRC), and you can lower it by switching to hour-by-hour grouping, storing a yyyy-mm-dd HH:00 value instead.
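A range query within one such partition could look like this sketch (again, the parameter id and time window are made-up illustrations):

-- all samples of parameter 42 between 06:00 and 12:00 UTC on 2016-01-19
SELECT measureTime, value
FROM keyspace.measurement
WHERE parameterId = 42
  AND granularity = '2016-01-19'
  AND measureTime >= '2016-01-19 06:00:00+0000'
  AND measureTime <  '2016-01-19 12:00:00+0000';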

The drawback of this approach is that when you need data spanning multiple partitions (e.g. you group on a day-by-day basis but need data for two consecutive days, say the last 6 hours of Jan 19th and the first 6 hours of Jan 20th), you'll need to perform multiple queries.
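In that case you would issue one query per partition and merge the results on the client side, roughly like this (same made-up parameter id and dates as above):

-- last 6 hours of Jan 19th (partition (42, '2016-01-19'))
SELECT measureTime, value FROM keyspace.measurement
WHERE parameterId = 42 AND granularity = '2016-01-19'
  AND measureTime >= '2016-01-19 18:00:00+0000';

-- first 6 hours of Jan 20th (partition (42, '2016-01-20'))
SELECT measureTime, value FROM keyspace.measurement
WHERE parameterId = 42 AND granularity = '2016-01-20'
  AND measureTime < '2016-01-20 06:00:00+0000';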

