Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
352 views
in Technique[技术] by (71.8m points)

mysql - get statistics information by SQL query efficiently for table with 3 columns and 800 million rows

I have a data table with 3 columns and 800 million rows in SQL server database.

  locationID     cardID    value
  NY_xxxxx     xxxxxx    xxxx.xxxx   // x : integer digits from 0 to 9
  NY_xxxxx     xxxxxx    xxxx.xxxx  
  NY_xxxxx     xxxxxx    xxxx.xxxx  
  IA_xxxxx     xxxxxx    xxxx.xxxx  
  IA_xxxxx     xxxxxx    xxxx.xxxx  
  IA_xxxxx     xxxxxx    xxxx.xxxx  
  ...

I need to count the distinct numbers of cardID for the same location.

Also, I need to count location numbers in the same state. For example, for NY in the above table, we have 3 locations.

I also need to know how many locations in each state, and how many cardID in each location and how many cardID in each state.

How to get these statistics information by SQL query efficiently? The data table size is large.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The OP probably already knows this, but here is how to get the answer, disregarding efficiency. First, cards per location, as described in the comments:

SELECT locationid, COUNT(DISTINCT cardID)
FROM table 
GROUP BY locationid 

Next the same thing per state.

SELECT substring(locationid, 1, 2) state, COUNT(DISTINCT cardID)
FROM table 
GROUP BY substring(locationid, 1, 2)

For a single state,

select COUNT(DISTINCT cardID)
from table 
where substring(locationid, 1, 2) = 'NY'

The problem with the first query is that it will return locationID's like NY_1234. If you have not memorized what NY_1234 is in layman's terms, Rochester for example, you have to go outside your database to see what it is.

The second query will be inefficient because it has to apply the substring function on a large number of records. The third will be inefficient because it will have to scan the entire table.

If you had a relational model you could do stuff like this.

select municipality, count(distinct cardID)
from table join location on table.locationid = location.locationID
group by municipality

select stateCode, count(distinct cardID)
from table join location on table.locationid = location.locationID
group by stateCode

State codes would be NY, NJ, etc If you wanted New York, New Jersey, etc, that would simply be another field.

Finally,

select count(distinct cardID)
from table join location on table.locationid = location.locationID
where stateCode = 'NY'

The second two relational queries would be faster than the single table queries because they wouldn't have to use functions. You could speed it up even more with an index on stateCode, but even if you didn't, a much smaller table would be scanned.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...