memory - C++: Fast way to read mapped file into a matrix

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm trying to read a mapped file into a matrix. The file is something like this:

name;phone;city

Luigi Rossi;02341567;Milan

Mario Bianchi;06567890;Rome

....

and it's quite big. The code I've written works properly, but it's not very fast:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <boost/iostreams/device/mapped_file.hpp>

using namespace std;
using boost::iostreams::mapped_file_source;

int main() {

    size_t j=0;
    size_t k=0;

    // One row per line, three string fields per row
    vector< vector<string> > M(10000000, vector<string>(3));

    mapped_file_source file("file.csv");

    // Check if file was successfully opened
    if(file.is_open()) {

      // Get pointer to the data
      const char * c = file.data();

      size_t size=file.size();

      for(size_t i = 0; i < size; i++){

       if(c[i]=='\n'){
        // End of line: move to the next row
        j=j+1;
        k=0;
       }else if(c[i]==';'){
        // Field separator: move to the next column
        k=k+1;
       }else{
        // Append the character to the current field (one copy per character)
        M[j][k]+=c[i];
       }
     }//end for

   }//end if

 return 0;

}

Is there a faster way? I've read something about memcpy but I don't know how to use it to speed up my code.

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:40:14+0000

I have numerous examples doing this/similar written up on SO.

Let me list the most relevant:

- I've done quite a few of these benchmarks. Yes, for sequential reading, read/scanf have a tiny edge (see e.g. scanf/iostreams and files vs. mappings, and parsing floats, or read being slightly faster for 1-pass sequential read).
- An interesting approach is to do parsing lazily (why copy the whole input into memory? What's the point of memory mapping then?). The answer here shows this approach (emulating a multimap there): Using boost::iostreams::mapped_file_source with std::multimap (approach #2)
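As a rough illustration of the lazy idea (not the code from the linked answer), the sketch below records only where each line starts and splits a line into fields when a field is actually requested. It uses C++17 std::string_view rather than Boost, and the class name LazyTable and its members are hypothetical. It assumes '\n' line endings and ';' separators with no quoting or escaping.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: indexes line starts up front, splits fields on demand.
// The buffer behind `data` (e.g. the memory map) must outlive this object.
class LazyTable {
public:
    explicit LazyTable(std::string_view data) : data_(data) {
        std::size_t pos = 0;
        while (pos < data_.size()) {
            starts_.push_back(pos);                    // remember where this line begins
            std::size_t nl = data_.find('\n', pos);
            if (nl == std::string_view::npos) break;   // last line has no newline
            pos = nl + 1;
        }
    }

    std::size_t rows() const { return starts_.size(); }

    // Returns line `row` without its trailing newline.
    std::string_view line(std::size_t row) const {
        assert(row < starts_.size());
        std::size_t begin = starts_[row];
        std::size_t end = data_.find('\n', begin);
        if (end == std::string_view::npos) end = data_.size();
        return data_.substr(begin, end - begin);
    }

    // Returns field `col` of line `row`, or an empty view if the line has
    // fewer fields. Nothing is copied; the view points into the buffer.
    std::string_view field(std::size_t row, std::size_t col) const {
        std::string_view rest = line(row);
        for (;;) {
            std::size_t sep = rest.find(';');
            if (col == 0) return rest.substr(0, sep);
            if (sep == std::string_view::npos) return {};
            rest.remove_prefix(sep + 1);
            --col;
        }
    }

private:
    std::string_view data_;
    std::vector<std::size_t> starts_;
};
```

With a mapped file, the table could be built from std::string_view(file.data(), file.size()); after that, field(1, 2) would scan only the second line. The trade-off is one size_t per line instead of one string per field, at the cost of re-scanning a line each time one of its fields is read.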

In all other cases, consider slamming a Spirit Qi job on it, potentially using boost::string_ref instead of vector<char> (unless the mapped file is not "const", of course).
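To make the "views instead of copies" point concrete without Spirit, here is a minimal hand-rolled sketch that splits the buffer into rows of three views. It swaps in C++17 std::string_view for boost::string_ref, and the names Row and split_rows are hypothetical. It assumes '\n' line endings and ';' separators with no quoting, and it does not skip the header line.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

using Row = std::array<std::string_view, 3>;  // name, phone, city

// Hypothetical helper: splits "a;b;c\n" lines into views pointing into `data`.
// No characters are copied, so `data` must outlive the returned rows.
// Missing fields stay empty; fields beyond the third are ignored.
std::vector<Row> split_rows(std::string_view data, std::size_t expected_rows = 0) {
    std::vector<Row> rows;
    rows.reserve(expected_rows);  // a good estimate avoids reallocations
    while (!data.empty()) {
        // Cut the next line off the front of the buffer
        std::size_t nl = data.find('\n');
        std::string_view line = data.substr(0, nl);
        data.remove_prefix(nl == std::string_view::npos ? data.size() : nl + 1);

        // Split the line at ';' into at most three views
        Row row{};
        for (std::size_t k = 0; k < row.size(); ++k) {
            std::size_t sep = line.find(';');
            row[k] = line.substr(0, sep);
            if (sep == std::string_view::npos) break;
            line.remove_prefix(sep + 1);
        }
        rows.push_back(row);
    }
    return rows;
}
```

For a mapped file this could be called as split_rows(std::string_view(file.data(), file.size()), estimated_line_count). Compared with appending characters to std::string one at a time, each field costs only a pointer and a length, and the only allocation is the vector itself.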

The string_ref is also shown in the last answer linked before. Another interesting example of this (with lazy conversions to un-escaped string values) is here: How to parse mustache with Boost.Xpressive correctly?

DEMO

Here's that Qi job slammed on it:

- It parses a 994 MiB file of ~32 million lines in 2.9s into a vector of
```
struct Line {
    boost::string_ref name, city;
    long id;
};
```
  Note that we parse the number, and store the strings by referring to their location in the memory map + length (string_ref).
- It pretty-prints the data from 10 random lines.
- It can run as fast as 2.5s if you reserve 32m elements in the vector at once; the program does only a single memory allocation in that case.
- NOTE: on a 64-bit system, the memory representation grows larger than the input size if the average line length is less than 40 bytes. This is because a string_ref is 16 bytes, so each Line takes 2 × 16 + 8 = 40 bytes.
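The 40-byte break-even can be checked with a few lines of arithmetic. The sketch below uses a stand-in struct, LineView, with std::string_view in place of boost::string_ref (an assumption: both are a pointer plus a length); the helper names are hypothetical, and the actual sizes depend on the platform.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Stand-in for the answer's Line, assuming string_view has the same
// pointer + length layout as boost::string_ref.
struct LineView {
    std::string_view name, city;
    long id;
};

// Bytes needed to hold `lines` parsed records (vector overhead ignored).
constexpr std::size_t index_bytes(std::size_t lines) {
    return lines * sizeof(LineView);
}

// Average input bytes per line, newline included.
inline double average_line_bytes(std::size_t input_bytes, std::size_t lines) {
    assert(lines > 0);
    return static_cast<double>(input_bytes) / static_cast<double>(lines);
}

// True when the parsed records occupy more memory than the input text.
inline bool index_exceeds_input(std::size_t input_bytes, std::size_t lines) {
    return index_bytes(lines) > input_bytes;
}
```

Plugging in the figures quoted here (994 MiB of input, 31609499 parsed lines) gives an average of roughly 33 bytes per line, so on a typical 64-bit build, where sizeof(LineView) is 40, the vector ends up larger than the file it indexes.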

Live On Coliru

#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/utility/string_ref.hpp>
#include <cstdlib>
#include <iostream>
#include <vector>

namespace qi = boost::spirit::qi;
using sref   = boost::string_ref;

// Let Qi's raw[] assign an iterator range straight to a string_ref
namespace boost { namespace spirit { namespace traits {
    template <typename It>
    struct assign_to_attribute_from_iterators<sref, It, void> {
        static void call(It f, It l, sref& attr) { attr = { f, size_t(std::distance(f,l)) }; }
    };
} } }

struct Line {
    sref name, city;
    long id;
};

// Adapt the members in file order: name;id;city
BOOST_FUSION_ADAPT_STRUCT(Line, (sref,name)(long,id)(sref,city))

int main() {
    boost::iostreams::mapped_file_source mmap("input.txt");

    using namespace qi;

    std::vector<Line> parsed;
    parsed.reserve(32000000);
    // Skip the header line, then parse "name;id;city" records separated by eol
    if (phrase_parse(mmap.begin(), mmap.end(), 
                omit[+graph] >> eol >>
                (raw[*~char_(";\n")] >> ';' >> long_ >> ';' >> raw[*~char_(";\n")]) % eol,
                qi::blank, parsed))
    {
        std::cout << "Parsed " << parsed.size() << " lines\n";
    } else {
        std::cout << "Failed after " << parsed.size() << " lines\n";
    }

    if (parsed.empty()) return 1; // nothing to sample from

    std::cout << "Printing 10 random items:\n";
    for(int i=0; i<10; ++i) {
        auto& line = parsed[rand() % parsed.size()];
        std::cout << "city: '" << line.city << "', id: " << line.id << ", name: '" << line.name << "'\n";
    }
}

With input generated like

grep -v "'" /etc/dictionaries-common/words | sort -R | xargs -d'\n' -n 3 | while read a b c; do echo "$a $b;$RANDOM;$c"; done

The output is e.g.

Parsed 31609499 lines
Printing 10 random items:
city: 'opted', id: 14614, name: 'baronets theosophy'
city: 'denominated', id: 24260, name: 'insignia ophthalmic'
city: 'mademoiselles', id: 10791, name: 'smelter orienting'
city: 'ducked', id: 32155, name: 'encircled flippantly'
city: 'garotte', id: 3080, name: 'keeling South'
city: 'emirs', id: 14511, name: 'Aztecs vindicators'
city: 'characteristically', id: 5473, name: 'constancy Troy'
city: 'savvy', id: 3921, name: 'deafer terrifically'
city: 'misfitted', id: 14617, name: 'Eliot chambray'
city: 'faceless', id: 24481, name: 'shade forwent'
