C++的 set 爆内存，求助

V2EX = way to explore

V2EX 是一个关于分享和探索的地方

现在注册

已注册用户请登录

这是一个创建于 2678 天前的主题，其中的信息可能已经有所发展或是发生改变。

如题，本来想写个小程序，对比一下 c++,python,golang 处理日志文件的效率。找了个 600MB 的日志文件试水，结果因为本人水平有限，导致 c++的实现出现了大量的内存消耗，运行不完就被 kill 了。

程序的思路是逐行读取日志文件，用空格切分，第一个字段作为 key，剩下的字段去重后作为 value。

先贴一下 python 的实现

big_dict = {}
with open("data/test_log.txt") as f_in:
    for line in f_in:
        items = line.strip().split(' ')
        key = items[0]
        if key not in big_dict:
            big_dict[key] = set([])
        for i in items[1:]:
            big_dict[key].add(i)

print "total keys:", len(big_dict)

再贴一下 golang 的：

package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
    "strings"
)


func process(fname string, big_dict map[string]map[string]int) {
    fin, err := os.Open(fname)
    defer fin.Close()
    if err != nil {
        fmt.Println("Open file error ", err)
        return
    }

    buf := bufio.NewReader(fin)
    line_count := 0
    for ; ; line_count++ {
        line, err := buf.ReadString('\n')
        if err != nil {
            if io.EOF == err {
                break
            }
            fmt.Println("Error in ReadBytes: ", err)
            return
        }
        items := strings.Split(line, " ")
        key := items[0]

        elem, ok := big_dict[key]
        if false == ok {
            big_dict[key] = make(map[string]int)
        }
        elem = big_dict[key]
        for i := 1; i < len(items); i++ {
            elem[items[i]] = 1
        }
    }
    fmt.Println("Total Line Count: ", line_count)
}

func main() {
    const fname = "data/test_log.txt"
    big_dict := make(map[string]map[string]int)
    process(fname, big_dict)
    fmt.Println("Total Key Count: ", len(big_dict))
}

最后贴一下 c++的。

#include <iostream>
#include <fstream>
#include <string>
#include<unordered_map>
#include<unordered_set>

using namespace std;

// data/test_log.txt 文件是一个 600MB 的日志文件
const string IN_FNAME = "data/test_log.txt";
unordered_map<string, unordered_set<string *>* > big_dict;

void process_file(const string fname, unordered_map<string,unordered_set<string*> *> & big_dict) {
    ifstream f_in;
    f_in.open(fname, ios::in);
    string line = "";
    int total_line = 0;
    size_t start =0, end = 0;
    while(getline(f_in, line)) {
        ++total_line;
        start =0, end = 0;// c++没有内建 string 的分割，自己 diy 个
        end = line.find_first_of(' ',start);
        string key = line.substr(start,end);
        // 寻找是否存在 key
        if(big_dict.find(key) == big_dict.end()) {
            unordered_set<string*> *p = new unordered_set<string*>;
            big_dict[key] = p;
        }

        start = end+1;
        while(start<line.size()) {
            end = line.find_first_of(' ',start);
            if(string::npos == end) end = line.size() - 1;
            string value = line.substr(start,end);
            big_dict[key]->insert(&value);//大部分的时间都在这个 insert 上了
            //这里我存的是 string 的指针，实际上无法得到去重的效果
            //但是我如果直接存 string 对象，会直接爆内存
            start = end + 1;
        }
    }
    f_in.close();

}

int main() {

    process_file(IN_FNAME, big_dict);

    cout<<"Total Keys:"<<big_dict.size()<<endl;

    return 0;
}

c++的实现中，big_dict[key]->insert(&value);大部分的时间都在这个 insert 上了，这里我存的是 string 的指针，实际上无法得到去重的效果。但是我如果直接存 string 对象，会直接爆内存。我存指针可以解决内存的问题，但是执行速度上依然是 go 快过 python，最慢的是 c++。

希望有大神能指点下，看我的 c++代码哪里出了问题。

第 1 条附言 · 2017-10-25 13:26:43 +08:00

爆内存的问题的确是我错误的使用了 substr，修改为 line.substr(start,end - start)后，的确内存占用很少了，大概只有 200M。但是执行的速度还是比不上 python 和 go，不管是我原来的代码，还是 @williamscorn 提供的代码，大概都要慢一倍的样子。希望大家集思广益，看看为啥c++会慢些。

48 条回复 • 2017-10-28 17:03:20 +08:00

lcdtyph

2017-10-25 09:53:28 +08:00 via iPhone

临时变量取指针存储就不提了，两个相等它们的地址就像等吗？用地址做 key 你起码要重新定义 comp 函数啊。

scenix

2017-10-25 10:06:36 +08:00

@lcdtyph 首先感谢你的回复，我自然知道这段代码是实际上无法得到去重的效果的，我文中和代码注释中已经做了说明。我现在想解决的是运行的慢的问题，我重新定义 comp 函数能不能把速度提上来？请赐教。

2017-10-25 10:06:47 +08:00

string value = line.substr(start,end); 这里

substr 定义是 basic_string substr( size_type pos = 0,
size_type count = npos ) const;

另外你这 C++到处指针看着挺不舒服的，临时变量存指针的问题楼上已经提过就不多说了

scenix

2017-10-25 10:12:50 +08:00

@03 感谢你的回复，我自然知道这段代码是实际上无法得到去重的效果的，我文中和代码注释中已经做了说明。原来 key 是直接存储的 string 对象，之所以存成指针是想着试试看光存指针会不会速度上来，结果这样改了下是不爆内存，但还是慢，就没心思管什么 const 不 const, 存成指针后里面东西还能不能用的问题了。

jmc891205

2017-10-25 10:18:02 +08:00

template < class Key, // unordered_set::key_type/value_type
_________ class Hash = hash<Key>, // unordered_set::hasher
_________ class Pred = equal_to<Key>, // unordered_set::key_equal
_________ class Alloc = allocator<Key> // unordered_set::allocator_type
________> class unordered_set;

第二个或第三个模版参数自己传一个进去

sagaxu

2017-10-25 10:21:56 +08:00

@scenix 你这不是不能去重的问题，是指针已经飞了，完全错误了

jmc891205

2017-10-25 10:22:49 +08:00

@scenix 如果你的 key 重复的很多的话那去重之后速度应该会快很多。因为现在你是每一次 insert 都要在所有的 key 里做一遍 search。

scenix

2017-10-25 10:28:09 +08:00

@jmc891205 @sagaxu 感谢二位回复，所以现在看来我存指针是完全错误的方向了。那么我直接存 string 对象作为 key，内存飞涨的问题怎么解决呢？只有 600M 的文件读入，python 大概需要 800M 内存的样子，go 使用的更少。
目前我定义为
```c++
unordered_map<string, unordered_set<string>* > big_dict;`
```

存的时候
```c++
big_dict[key]->insert(line.substr(start, end)); //这样内存吃不消
```

acros

2017-10-25 10:28:45 +08:00

楼上的意思是，你存错了 string 指针，value 是 substr 得到的栈内存地址····

lcdtyph

2017-10-25 10:38:44 +08:00 via iPhone

@scenix 你肯定是这种写法也取了临时 set 的指针，导致下次插入时候那块内存已经失效了。不是内存不够爆了。

acros

2017-10-25 10:38:52 +08:00

600m 文件似乎还不至爆内存，不是上面内存访问错误吧。

lcdtyph

2017-10-25 10:48:57 +08:00 via iPhone

@lcdtyph 看错了代码，不是这个原因。我还是等有电脑了再看吧…

arakashic

2017-10-25 10:52:22 +08:00

```
big_dict = {}
with open("data/test_log.txt") as f_in:
for line in f_in:
items = line.strip().split(' ')
key = items[0]
if key not in big_dict:
big_dict[key] = set([])
#下面这两行有何意义？又不影响 len(big_dict)
for i in items[1:]:
big_dict[key].add(i)

print "total keys:", len(big_dict)
```

scenix

2017-10-25 10:53:12 +08:00

@lcdtyph @acros 我运行的时候一直用 top 命令看内存，眼看着内存使用一路飙到 3G，然后被 kill，没有异常，没有 core，直接 kill 的。
我的 set 存的时候是这么写的：

auto *p = new unordered_set<string>;
big_dict[key] = p;

可以看到只 new 了没有 delete。但是我如果不在这个 new 的 set 里面存东西的话，也不会爆内存，大量的时间和内存都是在下面这句消耗的：

big_dict[key]->insert(line.substr(start, end));

scenix

2017-10-25 11:00:06 +08:00

@arakashic 感谢回复，程序的思路是逐行读取日志文件，用空格切分，第一个字段作为 key，剩下的字段去重后作为 value。只是为了看下处理效率，当然接着写的话，可以把 big_dict 里面的东西每个(key,value)作为一行输出到一个文件里吧。

hncqp

2017-10-25 11:01:29 +08:00

@scenix
line.substr(start,end); 修改为 line.substr(start,end - start);
set 存 string，不要存临时地址
big_dict[key]->insert(&value); 修改为迭代器访问，不要每次都去 find

williamscorn

2017-10-25 11:03:59 +08:00

#include <iostream>
#include <fstream>
#include <sstream>
#include <map>
#include <set>
using namespace std;

const string IN_FNAME = "data/test_log.txt";
map<string,set<string> >dict;

int main() {
ios::sync_with_stdio(false);
fstream rin(IN_FNAME);
rin.tie(0);

string buf,key,val;
stringstream ss;
while(getline(rin,buf)){
set<string>tmp;
ss.str(buf);//format:key val_1 val_2 ... val_n
ss>>key;
while(!ss.eof()){
ss>>val;
tmp.insert(val);
}
dict.insert(make_pair(key,tmp));
}
cout<<"Total Keys:"<<dict.size()<<'\n';
}

Gingko

2017-10-25 11:12:04 +08:00

@williamscorn 正解

williamscorn

2017-10-25 11:13:37 +08:00

@Gingko
少打了两个头文件...
bits/stdc++.h 用多了...
clang 居然还能正确编译...
#include <utility>
#include <string>

wevsty

2017-10-25 11:24:56 +08:00

这个 cpp 写的。。。
line.substr(start,end)这里的问题前面已经有人指出来了，end 不应该是结束为止的标号，而是复制的长度。
同理 end = line.size() - 1;一样是有问题的。
unordered_map<string, unordered_set<string>* > big_dict;
这个定义是一个 key 指向一个不会重复的 string 指针，big_dict[key]->insert(&value);实际是插入了 value 这个 string 的指针，然而 value 在循环结束的时候就因为生存周期结束而销毁了，所以你才觉得这样不会爆内存。

不要用那么多指针来掩饰，你程序里想表达的数据结构实际上就是：
unordered_map<string,unordered_set<string> >
在 map 里套 set，那么 key 至少存在 2 次，存在重复查找，重复存放一堆问题，效率能高才怪了
比如日志中某一行是
“ key value1 value2 ”那么运行完成以后数据结构实际上是
{'key'：{'key':'value2'}}
而你的 python 代码对应的结果应该是
{'key':['value1','value2']}
从结果上看，不要谈效率，代码的实现完全都是不对的。

wevsty

2017-10-25 11:33:17 +08:00

@williamscorn 按照我的理解，从楼主的意图来说，也许不应该用 std::set，而是用 std::list 或者 std::vector 更合适一些。

williamscorn

2017-10-25 11:35:05 +08:00

@wevsty
是的，vector 存好，sort 之后 unique+erase
我只是写了最直接的版本，给楼主留了很多的优化空间

scenix

2017-10-25 12:16:42 +08:00

@wevsty @williamscorn @hncqp 非常感谢提点。爆内存的问题的确是我错误的使用了 substr，修改为 line.substr(start,end - start)后，的确内存占用很少了，大概只有 200M。但是执行的速度还是比不上 python 和 go，不管是我原来的代码，还是 @williamscorn 提供的代码，大概都要慢一倍的样子。

P.S. @williamscorn 提供的代码中我每个循环加了一句 ss.clear(); 其余没有动。

wevsty

2017-10-25 12:29:23 +08:00

@scenix 速度的问题，上测试用例。对于测试用例 @williamscorn 的代码也未必就是速度最优的选择，剩下的具体问题具体分析。

scenix

2017-10-25 12:33:09 +08:00

@wevsty 大概就是这样一个日志文件：

$ head data/test_log.txt
03:07:02 giantbee systemd: Removed slice user-0.slice. Sep 25 03:07:02 giantbee systemd: Removed slice user-0.slice.
03:07:02 giantbee systemd: Stopping user-0.slice. Sep 25 03:07:02 giantbee systemd: Stopping user-0.slice.
03:07:02 giantbee kibana: "type":"log" "@timestamp":"2017-09-24T19:07:02Z" "tags":["warning" "elasticsearch" "admin"] "pid":22721 "message":"Unable to revive connection: http://localhost:9200/" Sep 25 03:07:02 giantbee kibana: {"type":"log","@timestamp":"2017-09-24T19:07:02Z","tags":["warning","elasticsearch","admin"],"pid":22721,"message":"Unable to revive connection: http://localhost:9200/"}
03:07:02 giantbee kibana: "type":"log" "@timestamp":"2017-09-24T19:07:02Z" "tags":["warning" "elasticsearch" "admin"] "pid":22721 "message":"No living connections" Sep 25 03:07:02 giantbee kibana: {"type":"log","@timestamp":"2017-09-24T19:07:02Z","tags":["warning","elasticsearch","admin"],"pid":22721,"message":"No living connections"}
03:07:05 giantbee kibana: "type":"log" "@timestamp":"2017-09-24T19:07:05Z" "tags":["warning" "elasticsearch" "admin"] "pid":22721 "message":"Unable to revive connection: http://localhost:9200/" Sep 25 03:07:05 giantbee kibana: {"type":"log","@timestamp":"2017-09-24T19:07:05Z","tags":["warning","elasticsearch","admin"],"pid":22721,"message":"Unable to revive connection: http://localhost:9200/"}
03:07:05 giantbee kibana: "type":"log" "@timestamp":"2017-09-24T19:07:05Z" "tags":["warning" "elasticsearch" "admin"] "pid":22721 "message":"No living connections" Sep 25 03:07:05 giantbee kibana: {"type":"log","@timestamp":"2017-09-24T19:07:05Z","tags":["warning","elasticsearch","admin"],"pid":22721,"message":"No living connections"}
03:07:07 giantbee kibana: "type":"log" "@timestamp":"2017-09-24T19:07:07Z" "tags":["warning" "elasticsearch" "admin"] "pid":22721 "message":"Unable to revive connection: http://localhost:9200/" Sep 25 03:07:07 giantbee kibana: {"type":"log","@timestamp":"2017-09-24T19:07:07Z","tags":["warning","elasticsearch","admin"],"pid":22721,"message":"Unable to revive connection: http://localhost:9200/"}
03:07:07 giantbee kibana: "type":"log" "@timestamp":"2017-09-24T19:07:07Z" "tags":["warning" "elasticsearch" "admin"] "pid":22721 "message":"No living connections" Sep 25 03:07:07 giantbee kibana: {"type":"log","@timestamp":"2017-09-24T19:07:07Z","tags":["warning","elasticsearch","admin"],"pid":22721,"message":"No living connections"}
03:07:10 giantbee kibana: "type":"log" "@timestamp":"2017-09-24T19:07:10Z" "tags":["warning" "elasticsearch" "admin"] "pid":22721 "message":"Unable to revive connection: http://localhost:9200/" Sep 25 03:07:10 giantbee kibana: {"type":"log","@timestamp":"2017-09-24T19:07:10Z","tags":["warning","elasticsearch","admin"],"pid":22721,"message":"Unable to revive connection: http://localhost:9200/"}
03:07:10 giantbee kibana: "type":"log" "@timestamp":"2017-09-24T19:07:10Z" "tags":["warning" "elasticsearch" "admin"] "pid":22721 "message":"No living connections" Sep 25 03:07:10 giantbee kibana: {"type":"log","@timestamp":"2017-09-24T19:07:10Z","tags":["warning","elasticsearch","admin"],"pid":22721,"message":"No living connections"}

另外 @williamscorn 的代码中 sstream 的>>操作是不是不等价于 python 和 go 的 split，这次是空格，下次我要用\t 或者是|来分割字符串的时候是不是 sstream 就不好用了啊？

pezy

2017-10-25 13:58:51 +08:00

插一句，你的 cpp 编译命令是怎样的？运行环境是什么？

whatTheGhost

2017-10-25 14:26:30 +08:00

@williamscorn set 和 string 都被深拷贝了吧。

scenix

2017-10-25 14:36:57 +08:00

@pezy centos7 用的 cmake 开了-O3 优化

wevsty

2017-10-25 15:50:52 +08:00

stringstream 的实现好像确实不是很快，所以我改了一下。
修改以后在 xubuntu 下面跑了一下，测试的数据是用图里 python 里写的 write_test_file 这个函数生成的。
[img]https://i.loli.net/2017/10/25/59f03fd257b2c.png[/img]
在 O2 优化下程序的时间是 0.267 秒，O3 优化的话和 O2 差不太多，python 下面则是 0.477 秒。
下一楼贴代码

wevsty

2017-10-25 15:51:16 +08:00

#include <iostream>
#include <fstream>
#include <sstream>
#include <map>
#include <set>
#include <vector>
#include <string>
#include <time.h>
#include <unordered_map>
#include <unordered_set>
using namespace std;

const string IN_FNAME = "/home/xubuntu/v2ex/test/test_py.txt";
unordered_map<string, unordered_set<string> > dict;

//字符串分割
std::vector<string> string_split(const string &str_data,
const string &str_separator
, string::size_type n_max_split_count = (string::size_type) - 1)
{
std::vector<string> vec_split_str;
string str_subtmp;
string::size_type n_datalen = str_data.length();
string::size_type n_separator = str_separator.length();
string::size_type n_start = 0;
string::size_type n_index = 0;
if (n_max_split_count == 0)
{
return vec_split_str;
}
do
{
if (n_max_split_count == 0)
{
n_max_split_count--;
break;
}
n_index = str_data.find(str_separator, n_start);
if (n_index != string::npos)
{
str_subtmp = str_data.substr(n_start, n_index - n_start);
if (str_subtmp.length() != 0)
{
vec_split_str.push_back(str_subtmp);
}
n_start = n_index + n_separator;
if (n_start >= n_datalen)
{
break;
}
}
else
{
str_subtmp = str_data.substr(n_start);
if (str_subtmp.length() != 0)
{
vec_split_str.push_back(str_subtmp);
}
break;
}
} while (n_index != string::npos);
return std::move(vec_split_str);
}

void read_message(const string& filepath)
{
fstream rin(filepath);
string buf;
string sep(" ");
std::vector<string> vec_list;
vec_list.reserve(20);
while (getline(rin, buf)) {
vec_list = string_split(buf, sep);
for (size_t i = 1; i < vec_list.size(); i++)
{
dict[vec_list[0]].insert(vec_list[i]);
}
vec_list.clear();
}
}

int main()
{
clock_t start = clock();
read_message(IN_FNAME);
clock_t end = clock();
cout << "time: " << ((double)end - (double)start) / CLOCKS_PER_SEC << " sec" << '\n';
cout << "Total Keys:" << dict.size() << '\n';
return 0;
}

arzterk

2017-10-25 16:01:04 +08:00

可以用 C++14 的右值进一步提速。
或者直接用 multi_hashmap 做一次全插入，然后在去重。

pezy

2017-10-25 16:01:54 +08:00

加上 std::ios::sync_with_stdio(false); 这句话，试试呢

acgnsstech

2017-10-25 17:24:57 +08:00

把 map 改成 vector 效率还能提升一倍！

mainjzb

2017-10-25 17:31:56 +08:00

pezy 说的对加上 std::ios::sync_with_stdio(false); cout 的输出效率是有问题的。或者换成 printf 试一下

wevsty

2017-10-25 17:33:29 +08:00

@wevsty
更正一下，字符串分割函数的 max_split 功能一不小心写错了，不过不影响这个例子中的运行结果。
//字符串分割
std::vector<string> string_split(const string &str_data,
const string &str_separator
, string::size_type n_max_split_count = (string::size_type) - 1)
{
std::vector<string> vec_split_str;
string str_subtmp;
string::size_type n_datalen = str_data.length();
string::size_type n_separator = str_separator.length();
string::size_type n_start = 0;
string::size_type n_index = 0;
do
{
if (n_max_split_count == 0)
{
break;
}
n_max_split_count--;
n_index = str_data.find(str_separator, n_start);
if (n_index != string::npos)
{
str_subtmp = str_data.substr(n_start, n_index - n_start);
if (str_subtmp.length() != 0)
{
vec_split_str.push_back(str_subtmp);
}
n_start = n_index + n_separator;
if (n_start >= n_datalen)
{
break;
}
}
else
{
str_subtmp = str_data.substr(n_start);
if (str_subtmp.length() != 0)
{
vec_split_str.push_back(str_subtmp);
}
break;
}
} while (n_index != string::npos);
return vec_split_str;
}

billwsy

2017-10-25 23:28:13 +08:00 via iPhone

试试用 string_view，老版本的话用 https://github.com/abseil/abseil-cpp/blob/master/absl/strings/string_view.h

xziar

2017-10-26 00:42:27 +08:00

big_dict[key]->insert(value);
就这个，每次都要在 big_dict 里查找一边 key。虽然是 O(1)但用多了也会影响速度(并不确定有多大影响)。
这个写法就非常不 C++。之前 find 的时候就该缓存一下查找结果。
具体的你可以贴出最终代码，vs 或者 vtune 什么的都可以跑 profile 找程序热点。
字符串分割的话 boost 之类的库都有，我也自己写过一个简单的（导出 string_view 避免拷贝的）

/**
** @brief split source using judger, putting slice into container
** @param src source
** @param judger a function that accepts one element and return (bool) whether it is delim
** @param container a container that allows push_back to put a slice of std::basic_string_view<CharT>
** @param keepblank whether should keep blank splice
**/
template<class T, class CharT = T::value_type, class Judger, class Container>
inline void split(const T& src, const Judger judger, Container& container, const bool keepblank = true)
{
using namespace std;
size_t cur = 0, last = 0, len = src.length();
for (; cur < len; cur++)
{
if (judger(src[cur]))
{
if (keepblank || cur != last)
container.push_back(basic_string_view<CharT>(&src[last], cur - last));
last = cur + 1;
}
}
if (keepblank || cur != last)
container.push_back(basic_string_view<CharT>(&src[last], cur - last));
}

gnaggnoyil

2017-10-26 10:27:06 +08:00

用 string_view 减少不必要的字符串复制,当然,你得自己保证生命周期不出岔子.split 也返回 string_view,不然就等于把原来的一行数据又给复制至少一遍.

scenix

2017-10-26 11:39:19 +08:00

@wevsty 感谢你的回复，我重新检查了一下 cmake，虽然我开了 O3 优化，但是不小心打开了 gdb 的选项，去掉后速度明显上来了。
我编译了你的代码和我的代码，速度上有些不一样。能否在你的机器上编译运行一下我优化过的代码？看看速度会不会快些。
我没有用 string_view，stringstream，只是优化了插入部分的写法。
盼复。

#include<unordered_set>

using namespace std;

// data/test_log.txt 文件是一个 600MB 的日志文件
const string IN_FNAME = "data/test_log.txt";
unordered_map<string, unordered_set<string>* > big_dict;

void process_file(const string fname, unordered_map<string,unordered_set<string> *> & big_dict) {
ifstream f_in;
f_in.open(fname, ios::in);
string line = "";
int total_line = 0;
size_t start =0, end = 0;
while(getline(f_in, line)) {
++total_line;
start =0, end = 0;// c++没有内建 string 的分割，自己 diy 个
end = line.find_first_of(' ',start);
const string key = line.substr(start, end - start);
// 寻找是否存在 key
auto iter = big_dict.find(key);
unordered_set<string> *p = NULL;
if(iter == big_dict.end()) {
p = new unordered_set<string>;
// 原来的写法是 big_dict[key]=p，需要查找一次，优化之
big_dict.insert(make_pair(key,p));
} else {
// 如果已经有结果了，直接暂存结果到指针 p 中
p = iter->second;
}

start = end+1;
while(start<line.size()) {
end = line.find_first_of(' ',start);
if(string::npos == end) end = line.size() - 1;
// 原来的写法是 big_dict[key]->insert，需要查找一次，优化之
p->insert(line.substr(start, end - start));
start = end + 1;
}
}
f_in.close();

}
int main() {

process_file(IN_FNAME, big_dict);

cout<<"Total Keys:"<<big_dict.size()<<endl;

return 0;
}

scenix

2017-10-26 11:40:03 +08:00

@wevsty 少帖了几个 include 补上

#include <iostream>
#include <fstream>
#include <string>
#include<unordered_map>
#include<unordered_set>

wevsty

2017-10-26 12:49:46 +08:00

@scenix 跑了一下 O2 优化，0.18 秒。
然而写成这样子，真的很难看，基本丧失了复用性。

xziar

2017-10-26 13:27:32 +08:00

@scenix 不明白你为什么还是要用 set 的指针做 value。
有 new 没 delete，这就不是好的 C++代码，跟别提代码耦合在一起了。

我随便找了自己的文档测了一下，程序热点在 hash，其次是 getline。
vs2017.4 x64 默认 o2

scenix

2017-10-26 14:13:47 +08:00

@wevsty
@xziar

scenix

2017-10-26 14:17:12 +08:00

@xziar
@wevsty 感谢二位，我这边只是随手写的测试代码，想看下不同语言在面对类似情景下的性能表现，没打算复用，东西存进去就没打算放出来，也就没写 delete。
不过我修改了一下 python 的代码，性能也有不少提高，二位能否看下这段 python 在你们机器上的表现？
我本机如果用 pypy 去跑，性能和 c++几乎一样，真是意外。

big_dict = {}
with open("data/test_log.txt") as f_in:
for line in f_in:
items = line.strip().split(' ')
key = items[0]
if key not in big_dict:
big_dict[key] = set(items[1:])
else:
big_dict[key].update(items[1:]) // 原来的 for 循环去掉了，直接用 set 内建的 update 方法
print "total keys:", len(big_dict)

yokuhama

2017-10-26 16:20:47 +08:00

粗看下来，楼主的那个 map 申请的是栈内存，而非堆内存，栈是有大小限制的。。。。然而效率问题只能实测，不能说谁快谁慢。

ryd994

2017-10-26 20:45:48 +08:00

与其让网友帮你隔空 debug，为什么不把自己的数据 print 出来呢？
printf debug 不丢人，不是什么事情都要找 gdb 的
valgrind is your friend. 我个人的习惯是提交前一定要 valgrind 查一下

xziar

2017-10-28 02:21:19 +08:00

@yokuhama 无论什么都会用到栈内存，但 unordered_map 的数据肯定是在堆上的。

@scenix 一段代码的效率和测试数据也有关的，可能我的数据倾向于往 set 里塞重复东西，你的数据却重复比较少。不同数据没什么好比的。
C++的话，自带的 STL 的 unordered 并不见得就是效率最高的，换用其他第三方库的 hashmap 可能效率能更高。能预估数据大小的话也可以提前预留些空间。
至于 python 和 C++速度相等也不是没可能啊。

scenix

2017-10-28 17:03:20 +08:00

@xziar 就是心里不平衡，老子费劲巴拉的写了一两天，还让各路神仙隔空 debug，结果跟 10 行 python 差不多。