pytorch 🚀 - Performance issue with torch.jit.trace(), slow prediction in C++ (CPU)

Could you try running it once, and then timing the subsequent run? We compile the graph for each set of different tensor dimensions that are run and then cache it, so it's likely the first run will be slower.

# Trace the model and convert the functionality to script
traced_model = torch.jit.trace(model, input)
# compile graph for input dimensions
traced_model(input)
# time
start_time2 = time.time()
output2 = traced_model(input)
print('Time for total prediction 2 = {0}'.format(time.time()-start_time2))

eellison on 10 Apr 2019
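
The warm-up advice above can be captured in a small timing helper. What follows is only a minimal sketch using the standard library: `time_call` and `FakeTraced` are hypothetical names, and the fake callable merely imitates a one-time, per-shape compile cost. In a real test the traced model would be passed in instead.

```python
import time


def time_call(fn, *args, warmup=1, runs=10):
    """Return the mean wall-clock seconds per call, after untimed warm-up calls."""
    for _ in range(warmup):
        fn(*args)  # warm-up: lets any first-call compilation/caching happen
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) / runs


class FakeTraced:
    """Hypothetical stand-in for a traced model: slow once per input shape."""

    def __init__(self):
        self.compiled_shapes = set()

    def __call__(self, shape):
        if shape not in self.compiled_shapes:
            time.sleep(0.05)  # pretend to compile the graph for this shape
            self.compiled_shapes.add(shape)
        return shape


model = FakeTraced()
# First call for this shape: includes the simulated compile step.
cold = time_call(model, (1, 1, 128, 128), warmup=0, runs=1)
# Later calls for the same shape: the simulated compile is cached.
warm = time_call(model, (1, 1, 128, 128), warmup=0, runs=10)
print(f"cold: {cold:.4f} s, warm: {warm:.6f} s")
```

Reporting only the warm figure (or using `warmup=1` or more) keeps the one-time cost out of the comparison.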

Thank you. It works, so there is no issue on the Python side. But the main issue remains: calling
output = module->forward(inputs).toTensor();
reduces the speed from 120 Hz to around 28 Hz, so the prediction is more than 4x as slow in C++ as in Python.

catrueeb on 11 Apr 2019
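
To compare rates like the ones quoted above, it can help to convert them to per-prediction latency. This is a small arithmetic sketch; the helper names `latency_ms` and `slowdown` are made up for illustration, and the two rates are the ones reported in the comment.

```python
def latency_ms(rate_hz):
    """Per-prediction latency in milliseconds for a given rate in Hz."""
    return 1000.0 / rate_hz


def slowdown(fast_hz, slow_hz):
    """How many times longer one prediction takes at slow_hz than at fast_hz."""
    return fast_hz / slow_hz


python_hz, cpp_hz = 120.0, 28.0  # rates reported in the comment above
print(f"Python: {latency_ms(python_hz):.1f} ms per prediction")
print(f"C++:    {latency_ms(cpp_hz):.1f} ms per prediction")
print(f"Slowdown: {slowdown(python_hz, cpp_hz):.1f}x")
```

With these inputs the latencies come out at roughly 8.3 ms and 35.7 ms, a slowdown of about 4.3x.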

Could you post a repro? Hard to investigate otherwise.

eellison on 11 Apr 2019

@eellison sorry for the late response. Here you go:

#include <cassert>
#include <cstdio>
#include <ctime>
#include <iostream>
#include <memory>
#include <vector>

#include <torch/script.h>

int main() {
    std::shared_ptr<torch::jit::script::Module> module = torch::jit::load("model.pt");
    assert(module != nullptr);
    std::cout << "Model loaded correctly\n";
    torch::set_num_threads(1);
    double delta;
    double delta_sec;
    for (int i = 0; i < 3; i++) {
        std::clock_t start = clock();
        // Note: the input tensor is created inside the timed region.
        std::vector<torch::jit::IValue> inputs;
        inputs.emplace_back(torch::ones({ 1, 1, 128, 128 }));
        module->forward(inputs);
        delta = (clock() - start);
        delta_sec = delta / (double)CLOCKS_PER_SEC;
        printf("Seconds for prediction: %f \n", delta_sec);
    }
    return 0;
}

Output on Windows 10, using VS2017 is:

Model loaded correctly
Seconds for prediction: 0.099000
Seconds for prediction: 0.039000
Seconds for prediction: 0.037000

The model used for this test: https://github.com/catrueeb/CNN-test-model
The corresponding Python code can be seen above; the model is loaded with
model = torch.jit.load('model.pt')
In C++, the .forward() method is 4x slower.

catrueeb on 15 Apr 2019

cc @wanchaol Performance, @yf225 C++

eellison on 15 Apr 2019

Is this a CPU model? If so, https://github.com/mingfeima/convnet-benchmarks/blob/master/pytorch/run.sh#L16-L25 can be useful for speeding it up.

yf225 on 15 Apr 2019

@yf225 Is there a way to do this on Windows?

catrueeb on 16 Apr 2019

My test shows that the scripted model's inference performance is basically the same as the original model's. My model is a simple one-layer CNN + two-layer FC. Here is how I tested:

import time

import torch
from torch.autograd import Variable

# model_conf is the poster's own configuration dict (not shown in the thread).
model = torch.load(model_conf['model_path'], map_location=lambda storage, loc: storage)
tmp = [1]*2000000
tmodel = torch.jit.trace(model, Variable(torch.Tensor([tmp]).long(), requires_grad=False))
model.eval()
tmodel.eval()
# Run the traced model once before timing.
pred = tmodel(Variable(torch.Tensor([tmp]).long(), requires_grad=False))

test_num = 100
files = []
for i in range(test_num):
    files.append([1]*2000000)

# Time the traced model.
start_time = time.time()
for f in files:
    tmp = Variable(torch.Tensor([f]).long(), requires_grad=False)
    pred = tmodel(tmp)
end_time = time.time()
print('script model time total: {0} for {2} samples, avg per one: {1}'
      .format((end_time-start_time), float(end_time-start_time)/test_num, test_num))

# Time the original model.
start_time = time.time()
for f in files:
    tmp = Variable(torch.Tensor([f]).long(), requires_grad=False)
    pred = model(tmp)
end_time = time.time()
print('normal model time total: {0} for {2} samples, avg per one: {1}'
      .format((end_time-start_time), float(end_time-start_time)/test_num, test_num))

The result is:
traced model time total: 20.64520502090454 for 100 samples, avg per one: 0.20645205020904542
normal model time total: 21.036952018737793 for 100 samples, avg per one: 0.21036952018737792

I ran the code on CPU with one thread. The results using two threads are also very similar.

wangxiaoying on 25 Apr 2019

@wangxiaoying Thank you for the test. I also got the same results for the scripted model after using consecutive runs, as written above. But the prediction is still 4x slower in C++ for me, and I could not figure out how to speed it up using VS2017 on Windows 10.

catrueeb on 25 Apr 2019

I also implemented my code in C++, and the time usage is basically the same as in Python. (BTW, I compiled PyTorch with OpenBLAS on OSX.)

wangxiaoying on 26 Apr 2019

I'm having the same issue when predicting with the C++ API. Prediction is 7~8x slower.

goncamateus on 3 Jun 2019

Same issue here. Any suggestions to speed up?

marchss on 6 Jul 2019

Is this a Windows-only issue for all of you? Could you please comment with more details? @yf225

eellison on 11 Jul 2019

I'm currently exporting a pix2pix model to C++.
I tested the model on CPU on different operating systems with the same C++ code (Linux and Windows run on the same machine; the Mac has roughly the same specs):
Linux: forward pass nn took 2.14318 seconds
Windows: forward pass nn took 8.54115 seconds
Mac: forward pass nn took 4.85829 seconds
The Python version on my Mac runs in ~2 sec, the same as on Linux, so the C++ version on the Mac is about twice as slow as Python on the Mac.

I didn't configure parallelism; maybe that's the reason for such a big difference. Is there a guide for this?
Is this due to the underlying math libraries not being optimized?

YuriyPryyma on 17 Jul 2019

@YuriyPryyma Could you try out the parallelism settings in https://github.com/mingfeima/convnet-benchmarks/blob/e07c4814cc9ca1fdcbda1ff3ea4fcb386ed7691a/pytorch/run.sh#L16-L25 ? On Windows you can use set OMP_NUM_THREADS=... instead of export OMP_NUM_THREADS=... to set the environment variables.

yf225 on 17 Jul 2019
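
One way to apply such a setting without shell-specific syntax (`set` vs. `export`) is to launch the benchmark from a small script that sets the variable for the child process only. The following is a sketch under stated assumptions: only `OMP_NUM_THREADS` is taken from the comment above, `run_with_threads` is a hypothetical helper, and the command is a placeholder that simply echoes the variable back.

```python
import os
import subprocess
import sys


def run_with_threads(cmd, num_threads):
    """Run cmd with OMP_NUM_THREADS set for that child process only."""
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["OMP_NUM_THREADS"] = str(num_threads)
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Placeholder command: a child Python process that prints the variable it sees.
# Replace it with the path to your own benchmark binary or script.
probe = [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
for n in (1, 4):
    result = run_with_threads(probe, n)
    print(f"requested {n}, child saw {result.stdout.strip()}")
```

Setting the variable before the process starts matters because OpenMP runtimes typically read it once, at startup.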

@yf225 I tested this on Windows. OMP_NUM_THREADS did make the process use more of my CPU:
For OMP_NUM_THREADS=1 I got 33% process load and a result of 8.4 seconds
For OMP_NUM_THREADS=4 I got 66%-100% process load and a result of 8.1 seconds
So it doesn't help much compared to the results on Linux.

YuriyPryyma on 17 Jul 2019

@peterjc123 Do you know how to speed up CPU code on Windows?

yf225 on 19 Jul 2019

@yf225 Well, my first concern is whether some flags are missing in the compiler/linker options, because similar things happened before we made the PyTorch and LibTorch builds go through the same code path. If this is not the issue, then we may need a profiler log to see exactly which op/function is causing the difference in performance.
@catrueeb You may need to change the C++ code a little bit, because the comparison is currently unfair. For Python, you are only measuring the time of a forward pass. But in C++, you also included the time to generate and feed an input tensor.

peterjc123 on 19 Jul 2019
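
The fairness point can be illustrated with a toy benchmark. This is a sketch only: `make_input` and `forward` are hypothetical stand-ins for tensor creation and the model's forward pass, and the sleep durations are arbitrary.

```python
import time


def make_input():
    """Hypothetical stand-in for building the input tensor."""
    time.sleep(0.02)
    return [1.0] * 16


def forward(x):
    """Hypothetical stand-in for the model's forward pass."""
    time.sleep(0.01)
    return sum(x)


def timed_with_setup(runs):
    """Input creation is inside the timed region (as in the C++ repro)."""
    start = time.perf_counter()
    for _ in range(runs):
        forward(make_input())
    return (time.perf_counter() - start) / runs


def timed_forward_only(runs):
    """Input is created once, outside the timed region (as in the Python test)."""
    x = make_input()
    start = time.perf_counter()
    for _ in range(runs):
        forward(x)
    return (time.perf_counter() - start) / runs


with_setup = timed_with_setup(5)
forward_only = timed_forward_only(5)
print(f"with setup:   {with_setup * 1000:.1f} ms per call")
print(f"forward only: {forward_only * 1000:.1f} ms per call")
```

Moving input construction out of the loop on both sides makes the two numbers measure the same thing, whatever the remaining gap turns out to be.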

I'm pretty sure this is the same thing as https://github.com/pytorch/pytorch/issues/20156

ezyang on 14 Aug 2019

Hi,
Scripted CNNs are predicting much slower, especially on C++.
On the python side, this test shows the decrease in performance:
model = Model(128, 128, 19**2, 0.1)
input = torch.tensor(torch.ones(1,1,128,128))
torch.cuda.synchronize()
torch.set_num_threads(1)
start_time1 = time.time()
output1 = model(input)
print('Time for total prediction 1 = {0}'.format(time.time()-start_time1))
# Trace the model and convert the functionality to script
traced_model = torch.jit.trace(model, input)
start_time2 = time.time()
output2 = traced_model(input)
print('Time for total prediction 2 = {0}'.format(time.time()-start_time2))
The output is:
Time for total prediction 1 = 0.008975505828857422
Time for total prediction 2 = 0.015474081039428711
So after tracing the module, prediction is much slower. When I load the model in C++, the _forward_ method actually decreases the speed of the real-time prediction from 120Hz to 30Hz. Is there a way this performance could be improved?

I have the same problem. Have you solved it yet?

LIANGXINKAI on 3 Sep 2019

This is because the CPU JIT fuser is disabled on Windows. I'll try to figure it out in my free time.

peterjc123 on 3 Sep 2019

I look forward to your work.

LIANGXINKAI on 3 Sep 2019

How can I use “[WIP] Enable CPU fused kernel on Windows #25578”? Is there a fixed libtorch release for Windows?

LIANGXINKAI on 4 Sep 2019

My case:
C++ operation time: 8.87875 s
Python operation time: 2.6850814819335938 s
On Windows 10, CPU.

kaonick on 4 Sep 2019

@LIANGXINKAI It's not finished yet. This issue will be closed when it's ready.

peterjc123 on 4 Sep 2019

@peterjc123 I have the same issue. I'd appreciate it if you could fix it.

Immocat on 5 Sep 2019

I face the same problem: the Windows libtorch 1.3.0 downloaded from the official site is much slower.

myhub on 27 Oct 2019

I am using the recent libtorch 1.3.1 to run model inference in C++; both the debug and release builds compiled successfully.
But for my super-resolution task, where the input image size is 720*1080 or even bigger, model inference on CPU is much slower than in PyTorch and takes nearly all of my 16 GB of memory, so I have to reboot my computer.
Besides, the libtorch docs/tutorials are insufficient, and it takes a long time to debug a simple PyTorch issue in libtorch code.
I really hope the libtorch community can do something to solve these issues; it would be really meaningful for model deployment.

fedral on 19 Dec 2019

I am increasing the priority of this issue based on user activity

ezyang on 2 Jan 2020

I'm pretty sure this is the same thing as #20156

So is the original issue solved? Is there a workaround? cc @ezyang

peterjc123 on 15 Jan 2020

In Python, when we initialize torch module we explicitly call at::init_num_threads() to initialize OMP/MKL. Could you add the same call in your C++ binary at the beginning and report results?

_Originally posted by @ilia-cher in https://github.com/pytorch/pytorch/issues/20156#issuecomment-490227743_

Would you please try this one first?

peterjc123 on 15 Jan 2020

To my knowledge, neither of these issues has been fixed. The workaround is to manually enable OMP in your C++ program, similar to how we do it in Python.

ezyang on 15 Jan 2020

Yesterday I converted a model using the torch.jit.trace function.
The inference speed with CPU python on Linux is almost the same as the speed with CPU C++ on Windows.
My environment is pytorch==1.3.1 (py3.7_cuda10.1.243_cudnn7.6.3_0) on Linux, libtorch-win-shared-with-deps-1.3.1.zip on Windows.

zhonhel on 16 Jan 2020

I also experienced the same problem in libtorch 1.6. Is there any way to solve this problem? The difference is only 5-10 ms in my case.

Edit:
To reproduce my libtorch issue, I have written some simple inference code.
My environment:

-> Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz (12 CPUs), ~2.2GHz
-> NVIDIA GeForce GTX 1050 2 GB
-> CUDA v10.2 + cuDNN v7.6.5
-> python v3.8.3 - pytorch v1.6.0
-> libtorch v1.6.0
-> Visual Studio 2017 (v141) + Windows SDK v10.0.17763.0
-> C++ language standard = C++14

Convert to torchscript code.

import torch
import torchvision

# An instance of your model.
model = torchvision.models.resnet50(pretrained=True)
model.eval()
# An example input you would normally provide to your model's forward() method.
example = torch.rand(1, 3, 224, 224)
# Use torch.jit.trace to generate a torch.jit.ScriptModule via tracing.
traced_script_module = torch.jit.trace(model, example)
print(traced_script_module)
traced_script_module.save('resNet50.pt')

python inference code - simple version

import time
import cv2
import numpy as np
import torch
import torchvision

means = torch.tensor([0.485, 0.456, 0.406], device = 'cuda:0', dtype=torch.float32).view(1, 3, 1, 1)
stds = torch.tensor([0.229, 0.224, 0.225], device = 'cuda:0', dtype=torch.float32).view(1, 3, 1, 1)
# An instance of your model.
model = torchvision.models.resnet50(pretrained=True)
model.eval()
model.cuda()
inputSize = (224, 224)
images = []
img = cv2.imread('puppy.jpg')
img = cv2.resize(img, inputSize, interpolation=cv2.INTER_CUBIC)
images.append(img)
images = np.array(images, dtype=np.float32)
images = torch.from_numpy(images).cuda().float()
images = images.permute(0, 3, 1, 2).contiguous()
images = ((images / 255.0) - means) / stds
for i in range (100):
    results = model(images)
start = time.time_ns()
for i in range (1000):
    results = model(images)  # GPU computation
stop = time.time_ns()
time_milli = (stop-start)/1000000
print('Avg. processing time: {} ms'.format(time_milli/1000))
results = torch.argmax(results, dim=1)
# this variable contains label index of the defined characters
results = results.cpu().detach().numpy()
results = results.astype(np.uint8)
print(results)

python inference code - torchscript version

import time
import cv2
import numpy as np
import torch
import torchvision

means = torch.tensor([0.485, 0.456, 0.406], device = 'cuda:0', dtype=torch.float32).view(1, 3, 1, 1)
stds = torch.tensor([0.229, 0.224, 0.225], device = 'cuda:0', dtype=torch.float32).view(1, 3, 1, 1)
# An instance of your model.
model = torch.jit.load('resNet50.pt')
model.eval()
model.cuda()
inputSize = (224, 224)
images = []
img = cv2.imread('puppy.jpg')
img = cv2.resize(img, inputSize, interpolation=cv2.INTER_CUBIC)
images.append(img)
images = np.array(images, dtype=np.float32)
images = torch.from_numpy(images).cuda().float()
images = images.permute(0, 3, 1, 2).contiguous()
images = ((images / 255.0) - means) / stds
for i in range (100):
    results = model(images)
start = time.time_ns()
for i in range (1000):
    results = model(images)  # GPU computation
stop = time.time_ns()
time_milli = (stop-start)/1000000
print('Avg. processing time: {} ms'.format(time_milli/1000))
results = torch.argmax(results, dim=1)
# this variable contains label index of the defined characters
results = results.cpu().detach().numpy()
results = results.astype(np.uint8)
print(results)

c++ inference code using libtorch - torchscript version

#include <chrono>
#include <iostream>
#include <string>
#include <memory>
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <torch/torch.h>
#include <torch/script.h> // One-stop header.

int main() {
    std::string weightPath = "E:/toolsku/ReportLibtorchBug/resNet50.pt";
    torch::jit::script::Module model;
    torch::Device targetDevice = torch::kCPU;
    try {
        // Deserialize the ScriptModule from a file using torch::jit::load().
        model = torch::jit::load(weightPath);
        if (torch::cuda::is_available()) {
            std::cout << "GPU is available -> Switch to GPU mode" << std::endl;
            targetDevice = torch::kCUDA; // to GPU
        }
        model.eval();
        model.to(targetDevice);
    }
    catch (const c10::Error& e) {
        std::cerr << "Error in loading the model!\n";
        return -1;
    }
    torch::NoGradGuard no_grad;
    std::cout << "Success in loading the model!\n";
    std::vector<torch::Tensor> batch_data; // using a tensor list
    int netHeight = 224, netWidth = 224;
    cv::Size inpDimension(netWidth, netHeight);
    torch::TensorOptions options(torch::kFloat32);
    torch::Tensor means = torch::tensor({ 0.485, 0.456, 0.406 }, options).view({ 1, 3, 1, 1 }).to(targetDevice);
    torch::Tensor stds = torch::tensor({ 0.229, 0.224, 0.225 }, options).view({ 1, 3, 1, 1 }).to(targetDevice);
    std::string imgPath = "E:/toolsku/ReportLibtorchBug/puppy.jpg";
    cv::Mat img = cv::imread(imgPath, cv::IMREAD_COLOR);
    // The interpolation flag is the sixth argument of cv::resize; the fourth and
    // fifth are the scale factors fx and fy, which are unused when a size is given.
    cv::resize(img, img, inpDimension, 0, 0, cv::INTER_CUBIC);
    img.convertTo(img, CV_32FC3, 1.0f);
    torch::Tensor input1 = torch::from_blob(img.data, { 1, netHeight, netWidth, 3 }, options).clone().toType(torch::kFloat32);
    batch_data.push_back(input1);
    torch::Tensor input_tensor = torch::cat(batch_data, 0);
    input_tensor = input_tensor.to(targetDevice);
    input_tensor = input_tensor.permute({ 0, 3, 1, 2 });
    input_tensor = (input_tensor.div_(255.0) - means) / stds;
    std::chrono::time_point<std::chrono::steady_clock> start;
    std::chrono::time_point<std::chrono::steady_clock> stop;
    double processingTime;
    torch::Tensor out_tensor;
    for (int i = 0; i < 500; ++i) { // warmup
        out_tensor = model.forward({ input_tensor }).toTensor();
    }
    // steady_clock matches the declared time_point type on every platform.
    start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000; ++i) {
        out_tensor = model.forward({ input_tensor }).toTensor();
    }
    stop = std::chrono::steady_clock::now();
    processingTime = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count() / 1000.0;
    std::cout << "Avg. processing time: " << (processingTime / 1000) << " ms\n";
    out_tensor = torch::argmax(out_tensor, 1);
    out_tensor = out_tensor.to(torch::kCPU); // make sure the variable is on the CPU
    std::cout << out_tensor << std::endl;
    return 0;
}

The output of the program

(ioN3) E:\toolsku\ReportLibtorchBug>python resNet50_simple.py
Avg. processing time: 12.1563201 ms
[208]
(ioN3) E:\toolsku\ReportLibtorchBug>python resnet50_torchscript.py
Avg. processing time: 12.3262982 ms
[208]
(ioN3) E:\toolsku\ReportLibtorchBug>resNet50_libtorch.exe
GPU is available -> Switch to GPU mode
Success in loading the model!
Avg. processing time: 13.192 ms
 208
[ CPULongType{1} ]

The time difference is small (~1 ms), but in my project the difference becomes bigger with a bigger batch size (5-10 ms). The problem is probably in the compilation, but I can't find any related documentation about it.

ReportLibtorchBug.zip

albertchristianto on 15 Sep 2020
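
One caveat when reading GPU timings like these: CUDA work is launched asynchronously, so a host-side timer can stop before the queued work has finished unless the device is synchronized first. Neither the Python scripts nor the C++ program above synchronizes before stopping the timer, so this may slightly affect all three numbers. The sketch below shows the pattern with a generic `sync` callback. `time_async` and `FakeDevice` are hypothetical, and a background thread only imitates a device queue. With PyTorch, `torch.cuda.synchronize` would be the natural callback to pass.

```python
import threading
import time


class FakeDevice:
    """Hypothetical stand-in for a GPU: launch() returns before the work is done."""

    def __init__(self):
        self._last = None

    def launch(self, seconds):
        previous = self._last

        def work():
            if previous is not None:
                previous.join()  # run queued work in order, one item at a time
            time.sleep(seconds)  # the simulated kernel

        self._last = threading.Thread(target=work)
        self._last.start()  # returns immediately, like an asynchronous launch

    def synchronize(self):
        if self._last is not None:
            self._last.join()  # block until all queued work has finished


def time_async(launch, sync, runs):
    """Mean seconds per run, synchronizing before reading each timestamp."""
    sync()  # make sure earlier work is not counted
    start = time.perf_counter()
    for _ in range(runs):
        launch()
    sync()  # wait for the timed work itself
    return (time.perf_counter() - start) / runs


device = FakeDevice()

# Without synchronizing, only the launch overhead is measured.
start = time.perf_counter()
device.launch(0.05)
unsynced = time.perf_counter() - start
device.synchronize()

synced = time_async(lambda: device.launch(0.05), device.synchronize, runs=3)
print(f"unsynced: {unsynced * 1000:.2f} ms, synced: {synced * 1000:.2f} ms")
```

Applying the same synchronization on both the Python and the C++ side would keep the comparison symmetric.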
