V2EX = way to explore

Comparing Node.js, Python, PHP, Ruby, Go, and Perl on processing a single 400 MB CSV file

  zhouyin · 15 days ago · 1233 views

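    ### Elapsed time

    Perl: the slowest. I couldn't wait for it to finish and killed the process.

    Node.js: a little over 1 minute

    PHP: 30-odd seconds

    Ruby: 30-odd seconds

    Python: about 11 seconds

    Go: about 4 seconds

    ### On speed, Go and Python win

    ### On functionality: this CSV file is non-standard, one field contains a lone double quote

    Go, Node.js, and Ruby all raised errors and could not finish. Their timings above were measured on a copy of the CSV with that stray quote removed.

    PHP raised no error, but it dropped many rows because of the lone double quote: it treated those quotes as field enclosures.

    On functionality Python wins. It handled the non-standard CSV completely and produced a correct CSV at the end, in just a few lines of code.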
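The difference in behavior described above comes down to what a parser does with a double quote that is not part of a properly quoted field. The OP's scripts are not shown in the thread, so the following is only a minimal sketch of one lenient policy: a quote opens a quoted field only when it is the first character of a field, and a quote in the middle of an unquoted field is kept as an ordinary character. As far as I know this is also how Python's `csv` reader treats such a quote in its default non-strict mode. The function name `splitLenient` is hypothetical, and the sketch assumes no field contains an embedded newline.

```javascript
// Hypothetical helper: split one CSV line into fields, leniently.
// Assumption: fields never contain embedded newlines.
function splitLenient(line, delimiter = ",", quote = '"') {
  const fields = [];
  let field = "";
  let inQuotes = false;
  let atFieldStart = true;

  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === quote && line[i + 1] === quote) {
        field += quote; // doubled quote inside a quoted field
        i++;
      } else if (ch === quote) {
        inQuotes = false; // closing quote
      } else {
        field += ch;
      }
    } else if (ch === quote && atFieldStart) {
      inQuotes = true; // a quote opens a field only at the field start
      atFieldStart = false;
    } else if (ch === delimiter) {
      fields.push(field);
      field = "";
      atFieldStart = true;
    } else {
      field += ch; // includes a stray quote in mid-field, kept literally
      atFieldStart = false;
    }
  }
  fields.push(field);
  return fields;
}

console.log(splitLenient('a,b"c,d')); // [ 'a', 'b"c', 'd' ]
console.log(splitLenient('"x,y",z')); // [ 'x,y', 'z' ]
```

A stricter parser would reject the first line, and a parser that lets any quote open a quoted field would keep reading past the end of the line until it meets the next quote, which matches the skipped rows reported for PHP.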

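    ### On writing the code, Node.js is the most disgusting

    What does Node.js have to be so cocky about? It reminds me of what the Ghostscript author said about Perl: Perl looks like something that came out of a dog's anus.

    After writing this little project, I feel Node.js is the one that looks like something that came out of a dog's anus.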

    23 replies · 2025-02-10 19:31:20 +08:00
    #1 · ysc3839 · 15 days ago via Android
    So where's the code?
    #2 · zhouyin (OP) · 15 days ago
    I couldn't upload the code here.

    See this link:

    https://cowtransfer.com/s/f0a48d2009fd4f
    #3 · zhouyin (OP) · 15 days ago
    #4 · hefish · 15 days ago
    Haha, that's a very classy way of putting it.
    #5 · gainsurier · 15 days ago via iPhone
    I wonder, would a version written in C take about one second?
    #6 · zhouyin (OP) · 15 days ago
    @gainsurier
    Aren't Python, PHP, and Ruby all implemented in C? Python's implementation is just better.
    #7 · chenqh · 15 days ago
    Why is Python that fast? Is it using a C library?
    #8 · chenqh · 15 days ago
    Wait, why is Node.js this slow? What about its JIT? It's slower than PHP and Ruby, which have no JIT?
    #9 · zhouyin (OP) · 15 days ago
    @gainsurier
    Also, Node.js is implemented in C++, and it isn't done as well as Python.
    #10 · henbf · 14 days ago · ❤️ 1
    Before bashing Node.js, ask yourself whether you should first get the basic concepts of I/O and streams straight.
    #11 · zhouyin (OP) · 14 days ago
    @henbf
    I'm no Node.js expert. I updated a.js to use an output stream, but now it fails with a heap out-of-memory error:

    ```bash
    -bash-4.2# node a.js
    (node:17974) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 drain listeners added to [WriteStream]. Use emitter.setMaxListeners() to increase limit
    (Use `node --trace-warnings ...` to show where the warning was created)
    (node:17974) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 drain listeners added to [WriteStream]. Use emitter.setMaxListeners() to increase limit
    (node:17974) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 drain listeners added to [WriteStream]. Use emitter.setMaxListeners() to increase limit

    <--- Last few GCs --->

    [17974:0x1c3dbf0] 40306 ms: Scavenge (reduce) 2046.8 (2082.1) -> 2046.5 (2082.6) MB, 44.4 / 0.0 ms (average mu = 0.342, current mu = 0.316) allocation failure
    [17974:0x1c3dbf0] 40396 ms: Scavenge (reduce) 2047.2 (2082.6) -> 2046.8 (2082.8) MB, 31.1 / 0.0 ms (average mu = 0.342, current mu = 0.316) allocation failure


    <--- JS stacktrace --->

    FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
    1: 0x7fcfb6136908 node::Abort() [/lib64/libnode.so.93]
    2: 0x7fcfb6024451 [/lib64/libnode.so.93]
    3: 0x7fcfb732a552 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/lib64/libnode.so.93]
    4: 0x7fcfb732a8e7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/lib64/libnode.so.93]
    5: 0x7fcfb74ea305 [/lib64/libnode.so.93]
    6: 0x7fcfb74ea3e5 [/lib64/libnode.so.93]
    7: 0x7fcfb74fe77c v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/lib64/libnode.so.93]
    8: 0x7fcfb74ff0a1 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/lib64/libnode.so.93]
    9: 0x7fcfb7502269 v8::internal::Heap::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/lib64/libnode.so.93]
    10: 0x7fcfb75022f7 v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/lib64/libnode.so.93]
    11: 0x7fcfb74c27d0 v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [/lib64/libnode.so.93]
    12: 0x7fcfb74badb4 v8::internal::FactoryBase<v8::internal::Factory>::AllocateRawWithImmortalMap(int, v8::internal::AllocationType, v8::internal::Map, v8::internal::AllocationAlignment) [/lib64/libnode.so.93]
    13: 0x7fcfb74bcbdf v8::internal::FactoryBase<v8::internal::Factory>::NewRawOneByteString(int, v8::internal::AllocationType) [/lib64/libnode.so.93]
    14: 0x7fcfb74c4d5d v8::internal::Factory::NewStringFromUtf8(v8::base::Vector<char const> const&, v8::internal::AllocationType) [/lib64/libnode.so.93]
    15: 0x7fcfb733d59d v8::String::NewFromUtf8(v8::Isolate*, char const*, v8::NewStringType, int) [/lib64/libnode.so.93]
    16: 0x7fcfb6215390 node::StringBytes::Encode(v8::Isolate*, char const*, unsigned long, node::encoding, v8::Local<v8::Value>*) [/lib64/libnode.so.93]
    17: 0x7fcfb6123ef3 [/lib64/libnode.so.93]
    18: 0x7fcfb71ba3cc [/lib64/libnode.so.93]
    Aborted
    ```
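The warning in the log above reports that `drain` listeners keep piling up on the `WriteStream`, and the crash that follows is a heap exhaustion. The updated a.js is not shown in the thread, so this is a guess at the cause rather than a diagnosis: one common way to get both symptoms is to register a persistent `drain` listener on every write that hits backpressure while continuing to write. The sketch below shows the usual alternative, which is to stop writing when `write()` returns `false` and wait for a single `drain` event. The helper name `writeAll` is hypothetical.

```javascript
const { once } = require("node:events");
const { finished } = require("node:stream/promises");

// Hypothetical helper: write every chunk to `writable`, pausing whenever
// write() reports that the stream's internal buffer is full.
async function writeAll(writable, chunks) {
  // `for await` accepts arrays as well as readable streams.
  for await (const chunk of chunks) {
    if (!writable.write(chunk)) {
      // once() registers a one-shot listener, so listeners never accumulate.
      await once(writable, "drain");
    }
  }
  writable.end();
  await finished(writable); // resolves once all buffered data is flushed
}
```

Under these assumptions, a call such as `await writeAll(fs.createWriteStream(outputPath), lines)` keeps memory bounded by the stream's `highWaterMark` instead of queuing the whole file in the heap.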
    #12 · henbf · 14 days ago
    @zhouyin Your code is wrong. Try this:

    ```javascript
    const { createReadStream, createWriteStream } = require("fs");
    const { parse } = require("csv-parse");

    const inputPath = "../outpy.csv";
    const outputPath = "./test.txt";

    const readStream = createReadStream(inputPath);
    const writeStream = createWriteStream(outputPath, { flags: "a" }); // append mode

    // from_line: 2 skips the header row
    const parser = parse({ delimiter: ",", from_line: 2 });

    readStream.pipe(parser);

    // Each parsed row arrives as an array of fields; join it back into a line.
    parser.on("data", (row) => {
      writeStream.write(row.join(",") + "\n");
    });

    parser.on("end", () => {
      console.log("finished");
      writeStream.end();
    });

    parser.on("error", (error) => {
      console.error("CSV Parsing Error:", error);
    });
    ```
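    #13 · zhouyin (OP) · 14 days ago
    At first I wrote it almost exactly the way you did, but it was no faster, so I changed it to that other version, assuming write had a buffer behind it.

    I ran your code without changing a single character. It took a little over a minute, nowhere near Python.

    ```bash
    -bash-4.2# time node a.js
    finished

    real 1m3.579s
    user 1m4.103s
    sys 0m2.478s
    ```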
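    #14 · henbf · 14 days ago
    @zhouyin It also depends on what you do with each CSV row in between. Your Python version only reads and writes with no extra processing, which amounts to a copy. In Node.js you convert every row into an array and then convert the array back into a string when writing, so of course it is slow.

    ```javascript
    const { createReadStream, createWriteStream } = require("fs");

    const inputPath = "../outpy.csv";
    const outputPath = "./test.txt";

    // Read in 256 KiB chunks
    const readStream = createReadStream(inputPath, { highWaterMark: 256 * 1024 });
    const writeStream = createWriteStream(outputPath, { flags: "a" });

    // Plain byte copy: no parsing, and pipe() handles backpressure
    readStream.pipe(writeStream);

    readStream.on("end", () => {
      console.log("finished");
      // pipe() already ends writeStream by default, so this call is redundant
      writeStream.end();
    });

    readStream.on("error", (err) => {
      console.error("Error reading file:", err);
    });

    writeStream.on("error", (err) => {
      console.error("Error writing file:", err);
    });
    ```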
    #15 · zhouyin (OP) · 14 days ago via Android
    @henbf
    Python returns arrays too. It's just that what gets written is also an array.
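    #16 · zhouyin (OP) · 14 days ago
    @henbf

    I tried yet another library, csvwriter, and it was unbelievably slow.

    Python's libraries are simply well designed. You have to give it to them.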
    #17 · zhouyin (OP) · 14 days ago
    @zhouyin
    With csvwriter it took over 3 minutes:

    ```bash
    -bash-4.2# time node a.js
    finished

    real 3m45.028s
    user 4m12.751s
    sys 2m59.847s
    ```
    #18 · henbf · 14 days ago
    @zhouyin ✅✅✅ Node.js is not suited to parsing CSV, Python rocks.
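    #19 · stabc · 14 days ago
    1. Parsing CSV means splitting and joining the data character by character. Low-level languages have a decisive advantage here because they can use the bytes in place by position, whereas Node creates a new string object every time.

    2. Python's standard library ships a csv module, so that work also runs at a low level. That it is still so much slower than Go suggests it is not written very well.

    3. I just ran a quick test: if you optimize the parsing step in Node and cut down on string concatenation, parsing a 400 MB CSV file can be brought down to under 5 seconds in total.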
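stabc's test code is not included in the thread, so the following only illustrates the general idea behind points 1 and 3: scan the raw bytes of a `Buffer` by index and avoid creating a string per character or per field. This sketch just counts rows and fields. It assumes `\n` row terminators, `,` delimiters, and no quoted fields, and the function name `countRowsAndFields` is hypothetical.

```javascript
const COMMA = 0x2c; // ","
const NEWLINE = 0x0a; // "\n"

// Hypothetical helper: count rows and fields by scanning bytes in place.
// Assumptions: "\n" terminators, "," delimiters, no quoted fields.
function countRowsAndFields(buf) {
  let rows = 0;
  let fields = 0;
  let rowHasData = false;

  for (let i = 0; i < buf.length; i++) {
    const byte = buf[i];
    if (byte === COMMA) {
      fields++; // a comma closes one field
      rowHasData = true;
    } else if (byte === NEWLINE) {
      fields++; // a newline closes the last field of the row
      rows++;
      rowHasData = false;
    } else {
      rowHasData = true;
    }
  }
  if (rowHasData) {
    // final row without a trailing newline
    fields++;
    rows++;
  }
  return { rows, fields };
}

console.log(countRowsAndFields(Buffer.from("a,b,c\n1,2,3\n"))); // { rows: 2, fields: 6 }
```

When a field's text is actually needed, a single `buf.toString("utf8", start, end)` over the recorded byte offsets produces it in one step instead of building it up by concatenation.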
    #20 · gesse · 14 days ago
    @henbf Hahaha
    #21 · julyclyde · 13 days ago
    @stabc Why? Just because it is "in the standard library", it must be "low-level"?
    https://github.com/python/cpython/blob/main/Lib/csv.py
    Python's csv module is pure Python, not C.
    #22 · stabc · 13 days ago · ❤️ 2
    @julyclyde What you linked is the interface layer. The underlying implementation is here: https://github.com/python/cpython/blob/main/Modules/_csv.c
    #23 · julyclyde · 12 days ago
    @stabc Thanks for the correction. I'll go read up on it.