Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
246 views
in Technique[技术] by (71.8m points)

c# - How to detect if a file is PDF or TIFF?

Please bear with me as I've been thrown into the middle of this project without knowing all the background. If you've got WTF questions, trust me, I have them too.

Here is the scenario: I've got a bunch of files residing on an IIS server. They have no file extension on them. Just naked files with names like "asda-2342-sd3rs-asd24-ut57" and so on. Nothing intuitive.

The problem is I need to serve up files on an ASP.NET (2.0) page and display the tiff files as tiff and the PDF files as PDF. Unfortunately I don't know which is which and I need to be able to display them appropriately in their respective formats.

For example, lets say that there are 2 files I need to display, one is tiff and one is PDF. The page should show up with a tiff image, and perhaps a link that would open up the PDF in a new tab/window.

The problem:

As these files are all extension-less I had to force IIS to just serve everything up as TIFF. But if I do this, the PDF files won't display. I could change IIS to force the MIME type to be PDF for unknown file extensions but I'd have the reverse problem.

http://support.microsoft.com/kb/326965

Is this problem easier than I think or is it as nasty as I am expecting?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

OK, enough people are getting this wrong that I'm going to post some code I have to identify TIFFs:

private const int kTiffTagLength = 12;
private const int kHeaderSize = 2;
private const int kMinimumTiffSize = 8;
private const byte kIntelMark = 0x49;
private const byte kMotorolaMark = 0x4d;
private const ushort kTiffMagicNumber = 42;


private bool IsTiff(Stream stm)
{
    stm.Seek(0);
    if (stm.Length < kMinimumTiffSize)
        return false;
    byte[] header = new byte[kHeaderSize];

    stm.Read(header, 0, header.Length);

    if (header[0] != header[1] || (header[0] != kIntelMark && header[0] != kMotorolaMark))
        return false;
    bool isIntel = header[0] == kIntelMark;

    ushort magicNumber = ReadShort(stm, isIntel);
    if (magicNumber != kTiffMagicNumber)
        return false;
    return true;
}

private ushort ReadShort(Stream stm, bool isIntel)
{
    byte[] b = new byte[2];
    _stm.Read(b, 0, b.Length);
    return ToShort(_isIntel, b[0], b[1]);
}

private static ushort ToShort(bool isIntel, byte b0, byte b1)
{
    if (isIntel)
    {
        return (ushort)(((int)b1 << 8) | (int)b0);
    }
    else
    {
        return (ushort)(((int)b0 << 8) | (int)b1);
    }
}

I hacked apart some much more general code to get this.

For PDF, I have code that looks like this:

public bool IsPdf(Stream stm)
{
    stm.Seek(0, SeekOrigin.Begin);
    PdfToken token;
    while ((token = GetToken(stm)) != null) 
    {
        if (token.TokenType == MLPdfTokenType.Comment) 
        {
            if (token.Text.StartsWith("%PDF-1.")) 
                return true;
        }
        if (stm.Position > 1024)
            break;
    }
    return false;
}

Now, GetToken() is a call into a scanner that tokenizes a Stream into PDF tokens. This is non-trivial, so I'm not going to paste it here. I'm using the tokenizer instead of looking at substring to avoid a problem like this:

% the following is a PostScript file, NOT a PDF file
% you'll note that in our previous version, it started with %PDF-1.3,
% incorrectly marking it as a PDF
%
clippath stroke showpage

this code is marked as NOT a PDF by the above code snippet, whereas a more simplistic chunk of code will incorrectly mark it as a PDF.

I should also point out that the current ISO spec is devoid of the implementation notes that were in the previous Adobe-owned specification. Most importantly from the PDF Reference, version 1.6:

Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...