Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
541 views
in Technique[技术] by (71.8m points)

node.js - convert streamed buffers to utf8-string

I want to make a HTTP-request using node.js to load some text from a webserver. Since the response can contain much text (some Megabytes) I want to process each text chunk separately. I can achieve this using the following code:

var req = http.request(reqOptions, function(res) {
    ...
    res.setEncoding('utf8');
    res.on('data', function(textChunk) {
        // process utf8 text chunk
    });
});

This seems to work without problems. However I want to support HTTP-compression, so I use zlib:

var zip = zlib.createUnzip();

// NO res.setEncoding('utf8') here since we need the raw bytes for zlib
res.on('data', function(chunk) {
    // do something like checking the number of bytes downloaded
    zip.write(chunk); // give the raw bytes to zlib, s.b.
});

zip.on('data', function(chunk) {
    // convert chunk to utf8 text:
    var textChunk = chunk.toString('utf8');

    // process utf8 text chunk
});

This can be a problem for multi-byte characters like 'u00c4' which consists of two bytes: 0xC3 and 0x84. If the first byte is covered by the first chunk (Buffer) and the second byte by the second chunk then chunk.toString('utf8') will produce incorrect characters at the end/beginning of the text chunk. How can I avoid this?

Hint: I still need the buffer (more specifically the number of bytes in the buffer) to limit the number of downloaded bytes. So using res.setEncoding('utf8') like in the first example code above for non-compressed data does not suit my needs.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Single Buffer

If you have a single Buffer you can use its toString method that will convert all or part of the binary contents to a string using a specific encoding. It defaults to utf8 if you don't provide a parameter, but I've explicitly set the encoding in this example.

var req = http.request(reqOptions, function(res) {
    ...

    res.on('data', function(chunk) {
        var textChunk = chunk.toString('utf8');
        // process utf8 text chunk
    });
});

Streamed Buffers

If you have streamed buffers like in the question above where the first byte of a multi-byte UTF8-character may be contained in the first Buffer (chunk) and the second byte in the second Buffer then you should use a StringDecoder. :

var StringDecoder = require('string_decoder').StringDecoder;

var req = http.request(reqOptions, function(res) {
    ...
    var decoder = new StringDecoder('utf8');

    res.on('data', function(chunk) {
        var textChunk = decoder.write(chunk);
        // process utf8 text chunk
    });
});

This way bytes of incomplete characters are buffered by the StringDecoder until all required bytes were written to the decoder.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...