Here is a graphic that illustrates the threshold approach to note onset detection:
This image shows a typical WAV file with three discrete notes played in succession. The red line represents a chosen signal threshold, and the blue lines represent note start positions returned by a simple algorithm that marks a start when the signal level crosses the threshold.
As the image shows, selecting a proper absolute threshold is difficult. In this case, the first note is picked up fine, the second note is missed completely, and the third note (barely) is started very late. In general, a low threshold causes you to pick up phantom notes, while raising it causes you to miss notes. One solution to this problem is to use a relative threshold that triggers a start if the signal increases by a certain percentage over a certain time, but this has problems of its own.
A simpler solution is to use the somewhat-counterintuitively named compression (not MP3 compression - that's something else entirely) on your wave file first. Compression essentially flattens the spikes in your audio data and then amplifies everything so that more of the audio is near the maximum values. The effect on the above sample would look like this (which shows why the name "compression" appears to make no sense - on audio equipment it's usually labelled "loudness"):
After compression, the absolute threshold approach will work much better (although it's easy to over-compress and start picking up fictional note starts, the same effect as lowering the threshold). There are a lot of wave editors out there that do a good job of compression, and it's better to let them handle this task - you'll probably need to do a fair amount of work "cleaning up" your wave files before detecting notes in them anyway.
In coding terms, a WAV file loaded into memory is essentially just an array of two-byte integers, where 0 represents no signal and 32,767 and -32,768 represent the peaks. In its simplest form, a threshold detection algorithm would just start at the first sample and read through the array until it finds a value greater than the threshold.
short threshold = 10000;
for (int i = 0; i < samples.Length; i++)
{
if ((short)Math.Abs(samples[i]) > threshold)
{
// here is one note onset point
}
}
In practice this works horribly, since normal audio has all sorts of transient spikes above a given threshold. One solution is to use a running average signal strength (i.e. don't mark a start until the average of the last n samples is above the threshold).
short threshold = 10000;
int window_length = 100;
int running_total = 0;
// tally up the first window_length samples
for (int i = 0; i < window_length; i++)
{
running_total += samples[i];
}
// calculate moving average
for (int i = window_length; i < samples.Length; i++)
{
// remove oldest sample and add current
running_total -= samples[i - window_length];
running_total += samples[i];
short moving_average = running_total / window_length;
if (moving_average > threshold)
{
// here is one note onset point
int onset_point = i - (window_length / 2);
}
}
All of this requires much tweaking and playing around with settings to get it to find the start positions of a WAV file accurately, and usually what works for one file will not work very well on another. This is a very difficult and not-perfectly-solved problem domain you've chosen, but I think it's cool that you're tackling it.
Update: this graphic shows a detail of note detection I left out, namely detecting when the note ends:
The yellow line represents the off-threshold. Once the algorithm has detected a note start, it assumes the note continues until the running average signal strength drops below this value (shown here by the purple lines). This is, of course, another source of difficulties, as is the case where two or more notes overlap (polyphony).
Once you've detected the start and stop points of each note, you can now analyze each slice of WAV file data to determine the pitches.
Update 2: I just read your updated question. Pitch-detection through auto-correlation is much easier to implement than FFT if you're writing your own from scratch, but if you've already checked out and used a pre-built FFT library, you're better off using it for sure. Once you've identified the start and stop positions of each note (and included some padding at the beginning and end for the missed attack and release portions), you can now pull out each slice of audio data and pass it to an FFT function to determine the pitch.
One important point here is not to use a slice of the compressed audio data, but rather to use a slice of the original, unmodified data. The compression process distorts the audio and may produce an inaccurate pitch reading.
One last point about note attack times is that it may be less of a problem than you think. Often in music an instrument with a slow attack (like a soft synth) will begin a note earlier than a sharp attack instrument (like a piano) and both notes will sound as if they're starting at the same time. If you're playing instruments in this manner, the algorithm with pick up the same start time for both kinds of instruments, which is good from a WAV-to-MIDI perspective.
Last update (I hope): Forget what I said about including some paddings samples from the early attack part of each note - I forgot this is actually a bad idea for pitch detection. The attack portions of many instruments (especially piano and other percussive-type instruments) contain transients that aren't multiples of the fundamental pitch, and will tend to screw up pitch detection. You actually want to start each slice a little after the attack for this reason.
Oh, and kind of important: the term "compression" here does not refer to MP3-style compression.
Update again: here is a simple function that does non-dynamic compression:
public void StaticCompress(short[] samples, float param)
{
for (int i = 0; i < samples.Length; i++)
{
int sign = (samples[i] < 0) ? -1 : 1;
float norm = ABS(samples[i] / 32768); // NOT short.MaxValue
norm = 1.0 - POW(1.0 - norm, param);
samples[i] = 32768 * norm * sign;
}
}
When param = 1.0, this function will have no effect on the audio. Larger param values (2.0 is good, which will square the normalized difference between each sample and the max peak value) will produce more compression and a louder overall (but crappy) sound. Values under 1.0 will produce an expansion effect.
One other probably obvious point: you should record the music in a small, non-echoic room since echoes are often picked up by this algorithm as phantom notes.
Update: here is a version of StaticCompress that will compile in C# and explicity casts everything. This returns the expected result:
public void StaticCompress(short[] samples, double param)
{
for (int i = 0; i < samples.Length; i++)
{
Compress(ref samples[i], param);
}
}
public void Compress(ref short orig, double param)
{
double sign = 1;
if (orig < 0)
{
sign = -1;
}
// 32768 is max abs value of a short. best practice is to pre-
// normalize data or use peak value in place of 32768
double norm = Math.Abs((double)orig / 32768.0);
norm = 1.0 - Math.Pow(1.0 - norm, param);
orig = (short)(32768.0 * norm * sign); // should round before cast,
// but won't affect note onset detection
}
Sorry, my knowledge score on Matlab is 0. If you posted another question on why your Matlab function doesn't work as expected it would get answered (just not by me).