DCG [Discounted Cumulative Gain] and nDCG [normalized DCG] are usually good measures for ranked lists.
A relevant document contributes its full gain when it is ranked first, and the gain is discounted as the rank increases.
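As a minimal sketch, the DCG used throughout this answer discounts the relevance of the document at rank i by log2(i + 1), so the top-ranked document keeps its full gain:

```python
import math

def dcg(relevances):
    # relevances[k] is the graded relevance of the document at rank k + 1.
    # DCG = sum over ranks i of rel_i / log2(i + 1); log2(2) = 1, so the
    # first document contributes its full relevance, later ones less.
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))
```

A single fully relevant document at rank 1 gives a DCG of exactly 1.0; pushing it to rank 2 halves nothing outright but discounts it by 1/log2(3).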
Using DCG/nDCG to evaluate your system against the state-of-the-art baseline:
Note: if you mark every result returned by the state-of-the-art system as relevant, then by DCG/nDCG your system is identical to the state of the art whenever both rank those documents the same way.
Thus, a possible evaluation could be: DCG(your_system)/DCG(state_of_the_art_system)
To refine this, you can assign a graded relevance [so relevance is no longer binary], determined by how each document was ranked in the state of the art: for example, rel_i = 1/log2(1+i) for the document at rank i in the state-of-the-art system.
If the value of this evaluation function is close to 1, your system is very similar to the baseline.
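The relevance grading described above can be sketched as follows, with documents the baseline did not return getting relevance 0:

```python
import math

def baseline_relevance(baseline_ranking):
    # Graded relevance derived from the baseline's own ordering:
    # the document at rank i gets rel_i = 1 / log2(1 + i).
    # Documents absent from the baseline implicitly have relevance 0.
    return {doc: 1.0 / math.log2(1 + i)
            for i, doc in enumerate(baseline_ranking, start=1)}
```

For the baseline `[1, 2, 4, 5, 6, 9]` this reproduces the per-document scores listed in the example: doc 1 gets 1.0, doc 2 gets about 0.631, and so on.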
Example:
mySystem = [1,2,5,4,6,7]
stateOfTheArt = [1,2,4,5,6,9]
First you assign a score to each document according to the state-of-the-art system [using the formula above]:
doc1 = 1.0
doc2 = 0.6309297535714574
doc3 = 0.0
doc4 = 0.5
doc5 = 0.43067655807339306
doc6 = 0.38685280723454163
doc7 = 0.0
doc8 = 0.0
doc9 = 0.3562071871080222
Now you calculate DCG(stateOfTheArt) using the relevance grades above [note relevance is not binary here] and get DCG(stateOfTheArt) = 2.1100933062283396.
Next, calculate it for your system with the same relevance weights and get DCG(mySystem) = 1.9784040064803783.
Thus, the evaluation is DCG(mySystem)/DCG(stateOfTheArt) = 1.9784040064803783 / 2.1100933062283396 = 0.9375907693942939
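The whole worked example can be reproduced in a few lines (a sketch; the variable names are mine):

```python
import math

def dcg_at_ranking(ranking, relevance):
    # Sum each returned document's graded relevance,
    # discounted by log2(rank + 1).
    return sum(relevance.get(doc, 0.0) / math.log2(i + 1)
               for i, doc in enumerate(ranking, start=1))

my_system = [1, 2, 5, 4, 6, 7]
state_of_the_art = [1, 2, 4, 5, 6, 9]

# Relevance grades come from the baseline's own ranks: rel_i = 1/log2(1+i).
relevance = {doc: 1.0 / math.log2(1 + i)
             for i, doc in enumerate(state_of_the_art, start=1)}

dcg_base = dcg_at_ranking(state_of_the_art, relevance)  # ~2.1101
dcg_mine = dcg_at_ranking(my_system, relevance)         # ~1.9784
print(dcg_mine / dcg_base)                              # ~0.9376
```

Note that the baseline scores ~1 against itself by construction, so the ratio directly measures how closely your ranking tracks it.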