Discrete Cosine Transform optimized by using NVIDIA Tensorcore
Results:
    Comparing DCT tensorcores , cublas and cufft
    dim_y 3000   dim_x 3000
    cublas_dct
    cublas took[ms]: 18.1094
    cublas_idct
    cublas took[ms]: 18.3419
    fftw_dct
    fftw took[ms]: 507
    cufft_float_fft
    cufft took[ms]: 1.89949
    cufft_double_fft
    cufft took[ms]: 3.19274
On Ubuntu, BLAS and LAPACK can be installed in one command:
	sudo apt-get install liblapack-dev -y ; sudo apt-get install liblapack3 -y ; sudo apt-get install libopenblas-base -y ; sudo apt-get install libopenblas-dev -y ;
Therefore, include library in the Makefile like:
	g++ ... -L/usr/lib -llapack -lblas