pytorch πŸš€ - Performance issue with torch.jit.trace(), slow prediction in C++ (CPU)

Could you try running it once, and then timing the subsequent run? We compile the graph for each set of different tensor dimensions that are run and then cache it, so it's likely the first run will be slower.

# Trace the model and convert the functionality to script
traced_model = torch.jit.trace(model, input)
# compile graph for input dimensions
traced_model(input)
# time
start_time2 = time.time()
output2 = traced_model(input)
print('Time for total prediction 2 = {0}'.format(time.time()-start_time2))

eellison on 10 Apr 2019

Thank you. It works, so there is no issue on the Python side. But the main issue remains: calling
output = module->forward(inputs).toTensor();
reduces the speed from 120 Hz to around 28 Hz, so the prediction is around 4x as slow in C++ as in Python.

catrueeb on 11 Apr 2019

πŸ‘1

Could you post a repro? Hard to investigate otherwise.

eellison on 11 Apr 2019

@eellison sorry for the late response. Here you go:

#include <torch/script.h>

int main() {
    std::shared_ptr<torch::jit::script::Module> module = torch::jit::load("model.pt");
    assert(module != nullptr);
    std::cout << "Model loaded correctly\n";
    torch::set_num_threads(1);
    double delta;
    double delta_sec;
    for (int i = 0; i < 3; i++) {
        std::clock_t start = clock();
        std::vector<torch::jit::IValue> inputs;
        inputs.emplace_back(torch::ones({ 1, 1, 128, 128 }));
        module->forward(inputs);
        delta = (clock() - start);
        delta_sec = delta / (double)CLOCKS_PER_SEC;
        printf("Seconds for prediction: %f \n", delta_sec);
    }
    return 0;
}

Output on Windows 10, using VS2017 is:

Model loaded correctly
Seconds for prediction: 0.099000
Seconds for prediction: 0.039000
Seconds for prediction: 0.037000

The model used for this test: https://github.com/catrueeb/CNN-test-model
The corresponding Python code can be seen above; the model is loaded with
model = torch.jit.load('model.pt')
In C++, the .forward() method is 4x slower.

catrueeb on 15 Apr 2019

cc @wanchaol Performance, @yf225 C++

eellison on 15 Apr 2019

Is this a CPU model? If so, https://github.com/mingfeima/convnet-benchmarks/blob/master/pytorch/run.sh#L16-L25 can be useful for speeding it up.

yf225 on 15 Apr 2019

Is this a CPU model? If so, https://github.com/mingfeima/convnet-benchmarks/blob/master/pytorch/run.sh#L16-L25 can be useful for speeding it up.

@yf225 Is there a way to do this on Windows?

catrueeb on 16 Apr 2019

My test shows that the scripted model's inference performance is basically the same as the original one's. My model is a simple one-layer CNN + 2-layer FC. Here is how I test:

model = torch.load(model_conf['model_path'], map_location=lambda storage, loc: storage)
tmp = [1]*2000000
tmodel = torch.jit.trace(model, Variable(torch.Tensor([tmp]).long(), requires_grad=False))
model.eval()
tmodel.eval()
pred = tmodel(Variable(torch.Tensor([tmp]).long(), requires_grad=False))
test_num = 100
files = []
for i in range(test_num):
    files.append([1]*2000000)
start_time = time.time()
for f in files:
    tmp = Variable(torch.Tensor([f]).long(), requires_grad=False)
    pred = tmodel(tmp)
end_time = time.time()
print('script model time total: {0} for {2} samples, avg per one: {1}'
      .format((end_time-start_time), float(end_time-start_time)/test_num, test_num))
start_time = time.time()
for f in files:
    tmp = Variable(torch.Tensor([f]).long(), requires_grad=False)
    pred = model(tmp)
end_time = time.time()
print('normal model time total: {0} for {2} samples, avg per one: {1}'
      .format((end_time-start_time), float(end_time-start_time)/test_num, test_num))

The result is:
traced model time total: 20.64520502090454 for 100 samples, avg per one: 0.20645205020904542
normal model time total: 21.036952018737793 for 100 samples, avg per one: 0.21036952018737792

I ran the code on CPU with one thread. The results using two threads are also very similar.

wangxiaoying on 25 Apr 2019

@wangxiaoying Thank you for the test. I also got the same results for the scripted model after using consecutive runs, as written above. But the prediction is still 4x slower in C++ for me, and I could not figure out how to speed it up using VS2017 on Windows 10.

I also implemented my code in C++ and the time usage is basically the same as Python. (BTW, I compiled PyTorch with OpenBLAS on OSX.)

wangxiaoying on 26 Apr 2019

Having the same issue here predicting with the C++ API. The prediction is 7-8x slower.

goncamateus on 3 Jun 2019

πŸ‘16πŸ˜•1

Same issue here. Any suggestions to speed it up?

marchss on 6 Jul 2019

Is this a Windows issue for all of you? Could you please comment with more details? @yf225

eellison on 11 Jul 2019

I'm currently exporting a pix2pix model to C++.
I tested the model on CPU on different operating systems with the same C++ code (Linux and Windows on the same machine, and a Mac machine with roughly the same specs):
Linux: forward pass nn took 2.14318 seconds
Windows: forward pass nn took 8.54115 seconds
Mac: forward pass nn took 4.85829 seconds
The Python version on my Mac runs in ~2 seconds, as on Linux, so C++ is twice as slow as Python on the Mac.

I didn't configure parallelism; maybe that is the reason for such a big difference. Is there any guide for this?
Is this due to the underlying math libraries not being optimized?

YuriyPryyma on 17 Jul 2019

@YuriyPryyma Could you try out the parallelism settings in https://github.com/mingfeima/convnet-benchmarks/blob/e07c4814cc9ca1fdcbda1ff3ea4fcb386ed7691a/pytorch/run.sh#L16-L25? On Windows you can use set OMP_NUM_THREADS=... instead of export OMP_NUM_THREADS=... to set the environment variables.

yf225 on 17 Jul 2019
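If setting environment variables is awkward to deploy, libtorch also exposes thread controls in code. The following is a minimal sketch (not from this thread), written against the 1.2+ API where torch::jit::load returns a Module by value; the thread count is illustrative, and affinity settings such as KMP_AFFINITY still have to come from the environment:

#include <torch/script.h>
#include <iostream>
#include <vector>

int main() {
    // Rough in-code analogue of OMP_NUM_THREADS: controls intra-op parallelism.
    torch::set_num_threads(4); // illustrative value; tune for your machine
    std::cout << "intra-op threads: " << torch::get_num_threads() << "\n";

    torch::jit::script::Module module = torch::jit::load("model.pt");
    std::vector<torch::jit::IValue> inputs;
    inputs.emplace_back(torch::ones({ 1, 1, 128, 128 }));
    module.forward(inputs);
    return 0;
}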

@yf225 I tested this on Windows. OMP_NUM_THREADS did utilize my CPU more:
With OMP_NUM_THREADS=1 I got 33% processor load and a time of 8.4 seconds.
With OMP_NUM_THREADS=4 I got 66%-100% processor load and a time of 8.1 seconds.
So it doesn't help much compared to the results on Linux.

YuriyPryyma on 17 Jul 2019

@peterjc123 Do you know how to speed up CPU code on Windows?

yf225 on 19 Jul 2019

@yf225 Well, my first concern is whether some flags are missing in the compiler/linker options, because similar things happened before we made PyTorch and LibTorch build through the same code path. If this is not the issue, then we may need a profiler log to see exactly which op/function is causing the difference in performance.
@catrueeb You may need to change the C++ code a little bit because the comparison is currently unfair. In Python, you are only measuring the time of a forward pass, but in C++ you also include the time to generate and feed the input tensor.

peterjc123 on 19 Jul 2019
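To make the fairness point concrete: a harness along these lines builds the input once, warms up so one-time graph work is excluded, and times only the forward calls with a wall-clock timer. This is a sketch against the 1.2+ libtorch API, and the iteration counts are arbitrary:

#include <torch/script.h>
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    torch::set_num_threads(1);
    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.eval();
    torch::NoGradGuard no_grad;

    // Build the input once, outside the timed region.
    std::vector<torch::jit::IValue> inputs;
    inputs.emplace_back(torch::ones({ 1, 1, 128, 128 }));

    // Warm-up runs so graph compilation/caching is not measured.
    for (int i = 0; i < 10; ++i) module.forward(inputs);

    // Time only the forward passes; note std::clock() measures
    // CPU time rather than wall time on some platforms.
    const int iters = 100;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) module.forward(inputs);
    auto stop = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    std::cout << "Avg. forward time: " << ms / iters << " ms\n";
    return 0;
}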

I'm pretty sure this is the same thing as https://github.com/pytorch/pytorch/issues/20156

ezyang on 14 Aug 2019

πŸ‘1

@peterjc123 Do you know how to speed up CPU code on Windows?

Hi,
Scripted CNNs predict much slower, especially in C++.
On the Python side, this test shows the decrease in performance:

model = Model(128, 128, 19**2, 0.1)
input = torch.tensor(torch.ones(1,1,128,128))
torch.cuda.synchronize()
torch.set_num_threads(1)
start_time1 = time.time()
output1 = model(input)
print('Time for total prediction 1 = {0}'.format(time.time()-start_time1))
# Trace the model and convert the functionality to script
traced_model = torch.jit.trace(model, input)
start_time2 = time.time()
output2 = traced_model(input)
print('Time for total prediction 2 = {0}'.format(time.time()-start_time2))

The output is:

Time for total prediction 1 = 0.008975505828857422
Time for total prediction 2 = 0.015474081039428711

So after tracing the module, prediction is much slower. When I load the model in C++, the _forward_ method actually decreases the speed of the real-time prediction from 120 Hz to 30 Hz. Is there a way this performance could be improved?

I also have the same problem. Has it been resolved already?

LIANGXINKAI on 3 Sep 2019

This is because the CPU JIT fuser is disabled on Windows. I'll try to figure it out in my free time.

peterjc123 on 3 Sep 2019
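Later libtorch releases expose a switch that asks the JIT to allow CPU fusion. Treat this as a hedged sketch: the header path has moved across releases, and setting the flag does not guarantee fused kernels are actually generated on a given platform/build:

#include <torch/script.h>
// Header location varies across releases; in recent ones:
#include <torch/csrc/jit/codegen/fuser/interface.h>

int main() {
    // Request CPU fusion before running the model. Whether fusion
    // actually kicks in depends on the platform/build (see the
    // [WIP] PR referenced below in the thread).
    torch::jit::overrideCanFuseOnCPU(true);

    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.forward({ torch::ones({ 1, 1, 128, 128 }) });
    return 0;
}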

Looking forward to your work.

LIANGXINKAI on 3 Sep 2019

This is because the CPU JIT fuser is disabled on Windows. I'll try to figure it out in my free time.

How can I use "[WIP] Enable CPU fused kernel on Windows #25578"? Is there a fixed libtorch release for Windows?

LIANGXINKAI on 4 Sep 2019

My case (Windows 10, CPU):
C++ operation time: 8.87875 s
Python operation time: 2.6850814819335938 s

kaonick on 4 Sep 2019

@LIANGXINKAI It's not finished yet. This issue will be closed when it's ready.

peterjc123 on 4 Sep 2019

@peterjc123 I have the same issue. I'd appreciate it if you could fix it.

Immocat on 5 Sep 2019

Facing the same problem; Windows libtorch 1.3.0 downloaded from the official site is much slower.

myhub on 27 Oct 2019


I am using the recent libtorch 1.3.1 to run model inference in C++; both the debug and release versions compile successfully.
But for my super-resolution task, where the input image size is 720*1080 or even bigger, CPU inference is much slower than the PyTorch version and takes nearly all of my 16 GB of memory, forcing me to reboot my computer.
Besides, the libtorch docs/tutorials are not sufficient, and it takes a long time to debug a simple PyTorch issue in libtorch code.
I really hope the libtorch community can do something to solve these issues; it would be really meaningful for model deployment.

fedral on 19 Dec 2019

I am increasing the priority of this issue based on user activity.

ezyang on 2 Jan 2020

I'm pretty sure this is the same thing as #20156

So is the original issue solved? Is there a workaround? cc @ezyang

peterjc123 on 15 Jan 2020

In Python, when we initialize the torch module, we explicitly call at::init_num_threads() to initialize OMP/MKL. Could you add the same call at the beginning of your C++ binary and report the results?

_Originally posted by @ilia-cher in https://github.com/pytorch/pytorch/issues/20156#issuecomment-490227743_

Would you please try this one first?

peterjc123 on 15 Jan 2020

To my knowledge, neither of these issues has been fixed. The workaround is to manually enable OMP in your C++ program, similar to how we do it in Python.

ezyang on 15 Jan 2020
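Concretely, the suggested workaround amounts to a single call at the top of main(), before any tensor work; a minimal sketch:

#include <torch/script.h>
#include <ATen/Parallel.h>

int main() {
    // Initialize the OMP/MKL thread pools up front, mirroring what the
    // Python torch module does at import time (per the quote above).
    at::init_num_threads();

    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.forward({ torch::ones({ 1, 1, 128, 128 }) });
    return 0;
}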

Yesterday I converted a model using the torch.jit.trace function.
The CPU inference speed with Python on Linux is almost the same as the CPU inference speed with C++ on Windows.
My environment is pytorch==1.3.1 (py3.7_cuda10.1.243_cudnn7.6.3_0) on Linux and libtorch-win-shared-with-deps-1.3.1.zip on Windows.

zhonhel on 16 Jan 2020


I also experienced the same problem in libtorch 1.6. Is there any way to solve it? The difference is only 5-10 ms in my case.

Edit:
To reproduce my libtorch issue, I have made some simple inference code to check the bug.
My environment:

-> Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz (12 CPUs), ~2.2GHz
-> NVIDIA GeForce GTX 1050 2 GB, CUDA v10.2 + cuDNN v7.6.5
-> Python v3.8.3, PyTorch v1.6.0
-> libtorch v1.6.0
-> Visual Studio 2017 (v141) + Windows SDK v10.0.17763.0
-> C++ language standard = C++14

Code to convert the model to TorchScript:

import torch
import torchvision

# An instance of your model.
model = torchvision.models.resnet50(pretrained=True)
model.eval()

# An example input you would normally provide to your model's forward() method.
example = torch.rand(1, 3, 224, 224)

# Use torch.jit.trace to generate a torch.jit.ScriptModule via tracing.
traced_script_module = torch.jit.trace(model, example)
print(traced_script_module)
traced_script_module.save('resNet50.pt')

Python inference code (simple version):

import time

import cv2
import numpy as np
import torch
import torchvision

means = torch.tensor([0.485, 0.456, 0.406], device='cuda:0', dtype=torch.float32).view(1, 3, 1, 1)
stds = torch.tensor([0.229, 0.224, 0.225], device='cuda:0', dtype=torch.float32).view(1, 3, 1, 1)

# An instance of your model.
model = torchvision.models.resnet50(pretrained=True)
model.eval()
model.cuda()

inputSize = (224, 224)
images = []
img = cv2.imread('puppy.jpg')
img = cv2.resize(img, inputSize, interpolation=cv2.INTER_CUBIC)
images.append(img)
images = np.array(images, dtype=np.float32)
images = torch.from_numpy(images).cuda().float()
images = images.permute(0, 3, 1, 2).contiguous()
images = ((images / 255.0) - means) / stds

for i in range(100):
    results = model(images)

start = time.time_ns()
for i in range(1000):
    results = model(images)  # GPU computation
stop = time.time_ns()
time_milli = (stop - start) / 1000000
print('Avg. processing time: {} ms'.format(time_milli / 1000))

results = torch.argmax(results, dim=1)
results = results.cpu().detach().numpy()  # this variable contains the label index of the defined characters
results = results.astype(np.uint8)
print(results)

Python inference code (TorchScript version):

import time

import cv2
import numpy as np
import torch
import torchvision

means = torch.tensor([0.485, 0.456, 0.406], device='cuda:0', dtype=torch.float32).view(1, 3, 1, 1)
stds = torch.tensor([0.229, 0.224, 0.225], device='cuda:0', dtype=torch.float32).view(1, 3, 1, 1)

# An instance of your model.
model = torch.jit.load('resNet50.pt')
model.eval()
model.cuda()

inputSize = (224, 224)
images = []
img = cv2.imread('puppy.jpg')
img = cv2.resize(img, inputSize, interpolation=cv2.INTER_CUBIC)
images.append(img)
images = np.array(images, dtype=np.float32)
images = torch.from_numpy(images).cuda().float()
images = images.permute(0, 3, 1, 2).contiguous()
images = ((images / 255.0) - means) / stds

for i in range(100):
    results = model(images)

start = time.time_ns()
for i in range(1000):
    results = model(images)  # GPU computation
stop = time.time_ns()
time_milli = (stop - start) / 1000000
print('Avg. processing time: {} ms'.format(time_milli / 1000))

results = torch.argmax(results, dim=1)
results = results.cpu().detach().numpy()  # this variable contains the label index of the defined characters
results = results.astype(np.uint8)
print(results)

C++ inference code using libtorch (TorchScript version):

#include <iostream>
#include <string>
#include <memory>
#include <chrono>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <torch/torch.h>
#include <torch/script.h> // One-stop header.

int main() {
    std::string weightPath = "E:/toolsku/ReportLibtorchBug/resNet50.pt";
    torch::jit::script::Module model;
    torch::Device targetDevice = torch::kCPU;
    try {
        // Deserialize the ScriptModule from a file using torch::jit::load().
        model = torch::jit::load(weightPath);
        if (torch::cuda::is_available()) {
            std::cout << "GPU is available -> Switch to GPU mode" << std::endl;
            targetDevice = torch::kCUDA; // to GPU
        }
        model.eval();
        model.to(targetDevice);
    }
    catch (const c10::Error& e) {
        std::cerr << "Error in loading the model!\n";
        return -1;
    }
    torch::NoGradGuard no_grad;
    std::cout << "Success in loading the model!\n";

    std::vector<torch::Tensor> batch_data; // using a tensor list
    int netHeight = 224, netWidth = 224;
    cv::Size inpDimension(netWidth, netHeight);
    torch::TensorOptions options(torch::kFloat32);
    torch::Tensor means = torch::tensor({ 0.485, 0.456, 0.406 }, options).view({ 1, 3, 1, 1 }).to(targetDevice);
    torch::Tensor stds = torch::tensor({ 0.229, 0.224, 0.225 }, options).view({ 1, 3, 1, 1 }).to(targetDevice);

    std::string imgPath = "E:/toolsku/ReportLibtorchBug/puppy.jpg";
    cv::Mat img = cv::imread(imgPath, cv::IMREAD_COLOR);
    // Pass the interpolation flag in its proper (sixth) argument slot.
    cv::resize(img, img, inpDimension, 0, 0, cv::INTER_CUBIC);
    img.convertTo(img, CV_32FC3, 1.0f);
    torch::Tensor input1 = torch::from_blob(img.data, { 1, netHeight, netWidth, 3 }, options).clone().toType(torch::kFloat32);
    batch_data.push_back(input1);
    torch::Tensor input_tensor = torch::cat(batch_data, 0);
    input_tensor = input_tensor.to(targetDevice);
    input_tensor = input_tensor.permute({ 0, 3, 1, 2 });
    input_tensor = (input_tensor.div_(255.0) - means) / stds;

    std::chrono::time_point<std::chrono::high_resolution_clock> start;
    std::chrono::time_point<std::chrono::high_resolution_clock> stop;
    double processingTime;
    torch::Tensor out_tensor;
    for (int i = 0; i < 500; ++i) { // warmup
        out_tensor = model.forward({ input_tensor }).toTensor();
    }
    start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000; ++i) {
        out_tensor = model.forward({ input_tensor }).toTensor();
    }
    stop = std::chrono::high_resolution_clock::now();
    processingTime = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count() / 1000.0;
    std::cout << "Avg. processing time: " << (processingTime / 1000) << " ms\n";

    out_tensor = torch::argmax(out_tensor, 1);
    out_tensor = out_tensor.to(torch::kCPU); // make sure the variable is on CPU
    std::cout << out_tensor << std::endl;
    return 0;
}

The output of the programs:

(ioN3) E:\toolsku\ReportLibtorchBug>python resNet50_simple.py
Avg. processing time: 12.1563201 ms
[208]

(ioN3) E:\toolsku\ReportLibtorchBug>python resnet50_torchscript.py
Avg. processing time: 12.3262982 ms
[208]

(ioN3) E:\toolsku\ReportLibtorchBug>resNet50_libtorch.exe
GPU is available -> Switch to GPU mode
Success in loading the model!
Avg. processing time: 13.192 ms
208
[ CPULongType{1} ]

The time difference is not much (~1 ms), but in my project the time difference becomes bigger with a bigger batch size (5-10 ms). The problem is probably in the compilation, but I can't find any related documentation about it.

ReportLibtorchBug.zip

albertchristianto on 15 Sep 2020
