Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
163 views
in Technique[技术] by (71.8m points)

c# - Create a summary description of a schedule given a list of shifts

Assuming I have a list of shifts for an event (in the format start date/time, end date/time) - is there some sort of algorithm I could use to create a generalized summary of the schedule? It is quite common for most of the shifts to fall into some sort of common recurrence pattern (ie. Mondays from 9:00 am to 1:00 pm, Tuesdays from 10:00 am to 3:00 pm, etc). However, there can (and will be) exceptions to this rule (eg. one of the shifts fell on a holiday and was rescheduled for the next day). It would be fine to exclude those from my "summary", as I'm looking to provide a more general answer of when does this event usually occur.

I guess I'm looking for some sort of statistical method to determine the day and time occurences and create a description based on the most frequent occurences found in the list. Is there some sort of general algorithm for something like this? Has anyone created something similar?

Ideally I'm looking for a solution in C# or VB.NET, but don't mind porting from any other language.

Thanks in advance!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You may use Cluster Analysis.

Clustering is a way to segregate a set of data into similar components (subsets). The "similarity" concept involves some definition of "distance" between points. Many usual formulas for the distance exists, among others the usual Euclidean distance.

Practical Case

Before pointing you to the quirks of the trade, let's show a practical case for your problem, so you may get involved in the algorithms and packages, or discard them upfront.

For easiness, I modelled the problem in Mathematica, because Cluster Analysis is included in the software and very straightforward to set up.

First, generate the data. The format is { DAY, START TIME, END TIME }.
The start and end times have a random variable added (+half hour, zero, -half hour} to show the capability of the algorithm to cope with "noise".

There are three days, three shifts per day and one extra (the last one) "anomalous" shift, which starts at 7 AM and ends at 9 AM (poor guys!).

There are 150 events in each "normal" shift and only two in the exceptional one.

As you can see, some shifts are not very far apart from each other.

I include the code in Mathematica, in case you have access to the software. I'm trying to avoid using the functional syntax, to make the code easier to read for "foreigners".

Here is the data generation code:

Rn[] := 0.5 * RandomInteger[{-1, 1}];

monshft1 = Table[{ 1 , 10 + Rn[] , 15 + Rn[] }, {150}];  // 1
monshft2 = Table[{ 1 , 12 + Rn[] , 17 + Rn[] }, {150}];  // 2
wedshft1 = Table[{ 3 , 10 + Rn[] , 15 + Rn[] }, {150}];  // 3
wedshft2 = Table[{ 3 , 14 + Rn[] , 17 + Rn[] }, {150}];  // 4
frishft1 = Table[{ 5 , 10 + Rn[] , 15 + Rn[] }, {150}];  // 5
frishft2 = Table[{ 5 , 11 + Rn[] , 15 + Rn[] }, {150}];  // 6
monexcp  = Table[{ 1 , 7  + Rn[] , 9  + Rn[] }, {2}];    // 7

Now we join the data, obtaining one big dataset:

data = Join[monshft1, monshft2, wedshft1, wedshft2, frishft1, frishft2, monexcp];

Let's run a cluster analysis for the data:

clusters = FindClusters[data, 7, Method->{"Agglomerate","Linkage"->"Complete"}]

"Agglomerate" and "Linkage" -> "Complete" are two fine tuning options of the clustering methods implemented in Mathematica. They just specify we are trying to find very compact clusters.

I specified to try to detect 7 clusters. If the right number of shifts is unknown, you can try several reasonable values and see the results, or let the algorithm select the more proper value.

We can get a chart with the results, each cluster in a different color (don't mind the code)

ListPointPlot3D[ clusters, 
           PlotStyle->{{PointSize[Large], Pink},    {PointSize[Large], Green},   
                       {PointSize[Large], Yellow},  {PointSize[Large], Red},  
                       {PointSize[Large], Black},   {PointSize[Large], Blue},   
                       {PointSize[Large], Purple},  {PointSize[Large], Brown}},  
                       AxesLabel -> {"DAY", "START TIME", "END TIME"}]  

And the result is:

alt text

Where you can see our seven clusters clearly apart.

That solves part of your problem: identifying the data. Now you also want to be able to label it.

So, we'll get each cluster and take means (rounded):

Table[Round[Mean[clusters[[i]]]], {i, 7}]  

The result is:

Day   Start  End
{"1", "10", "15"},
{"1", "12", "17"},
{"3", "10", "15"},
{"3", "14", "17"},
{"5", "10", "15"},
{"5", "11", "15"},
{"1",  "7",  "9"}

And with that you get again your seven classes.

Now, perhaps you want to classify the shifts, no matter the day. If the same people make the same task at the same time everyday, so it's no useful to call it "Monday shift from 10 to 15", because it happens also on Weds and Fridays (as in our example).

Let's analyze the data disregarding the first column:

clusters=
 FindClusters[Take[data, All, -2],Method->{"Agglomerate","Linkage"->"Complete"}];

In this case, we are not selecting the number of clusters to retrieve, leaving the decision to the package.

The result is

image

You can see that five clusters have been identified.

Let's try to "label" them as before:

Grid[Table[Round[Mean[clusters[[i]]]], {i, 5}]]

The result is:

 START  END
{"10", "15"},
{"12", "17"},
{"14", "17"},
{"11", "15"},
{ "7",  "9"}

Which is exactly what we "suspected": there are repeated events each day at the same time that could be grouped together.

Edit: Overnight Shifts and Normalization

If you have (or plan to have) shifts that start one day and end on the following, it's better to model

{Start-Day Start-Hour Length}  // Correct!

than

{Start-Day Start-Hour End-Day End-Hour}  // Incorrect!  

That's because as with any statistical method, the correlation between the variables must be made explicit, or the method fails miserably. The principle could run something like "keep your candidate data normalized". Both concepts are almost the same (the attributes should be independent).

--- Edit end ---

By now I guess you understand pretty well what kind of things you can do with this kind if analysis.

Some references

  1. Of course, Wikipedia, its "references" and "further reading" are good guide.
  2. A nice video here showing the capabilities of Statsoft, but you can get there many ideas about other things you can do with the algorithm.
  3. Here is a basic explanation of the algorithms involved
  4. Here you can find the impressive functionality of R for Cluster Analysis (R is a VERY good option)
  5. Finally, here you can find a long list of free and commercial software for statistics in general, including clustering.

HTH!


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...