
Efficient allocation of image recognition and LLM tasks on multi-GPU system
ffs, why does their Docker image only support Navi 31 and not Navi 32?
https://hub.docker.com/r/rocm/pytorch
I just wish both #Nvidia and #AMD would stop with that whole licensing bullshit around #CUDA and #ROCm and just include that damn stuff in the default driver.
I just want to run #Codestral on my local machine so I can use it with non-public code. It'll be troublesome enough to cram it into 16 GB of VRAM.
#computer #Linux #AI
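A quick back-of-envelope check on that squeeze (a sketch; the 22B parameter count is Codestral's published size, but the quantization levels and the GiB math below are my own assumptions):

```python
# Rough VRAM estimate for fitting a ~22B-parameter model into 16 GB.
# Assumption: weights dominate; KV cache and activations need extra
# headroom on top of these figures.
PARAMS = 22e9  # Codestral 22B

def weight_gib(bits_per_param: float) -> float:
    """Size of the weights alone, in GiB, at a given quantization."""
    return PARAMS * bits_per_param / 8 / 2**30

for bits in (16, 8, 5, 4):
    verdict = "fits" if weight_gib(bits) < 16 else "does not fit"
    print(f"{bits}-bit: {weight_gib(bits):5.1f} GiB -> {verdict} in 16 GiB (before KV cache)")
```

So under these assumptions only ~4-5-bit quantization leaves room for context.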
Ported https://salykova.github.io/sgemm-gpu to Vulkan (nice article!)
It's 2x slower than CUDA. That one was tricky to port (e.g. the shared buffer needs aliasing to allow LDS/STS.128); I did half of it with AI, the second half going over it line by line.
Ported the Kernel 5 from https://seb-v.github.io/optimization/update/2025/01/20/Fast-GPU-Matrix-multiplication.html to Vulkan
It's only 15% slower than CUDA (and it works on AMD).
In both cases it's quite difficult to reason about the SPIR-V ISA; the AMD GPU Analyzer helps, though!
Digging deeper
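The core trick both of those kernels share is staging tiles of A and B in shared memory (LDS) before the inner product. A minimal CPU-side sketch of that tiling in NumPy (the tile size is an arbitrary stand-in for the workgroup's shared-memory tile, not a value from either article):

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 32) -> np.ndarray:
    """Block-tiled SGEMM: each (i, j) tile of C accumulates products of
    tile-sized slabs of A and B -- the CPU analogue of staging tiles in
    GPU shared memory (LDS) before the inner loop."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):  # staged "shared memory" slab
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C
```

On a GPU the k-loop body is where the barrier-synchronized LDS loads (and the vectorized 128-bit accesses mentioned above) live.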
wasmVision 0.3.0 is out! We have some exciting new features for you, such as MCP server support and experimental GPU acceleration for vision models. Performance and stability improvements too. Go get it right now!
#wasm #computervision #opencv #golang #tinygo #rust #clang #mcp #cuda
https://github.com/wasmvision/wasmvision/releases/tag/v0.3.0
One of the best interviews on AI and GPUs I've ever seen was posted earlier today. Jensen Huang is really super smart in my opinion, and I think this interview is definitely worth watching. I can understand how a person like him turned NVIDIA into a company with a market cap of more than 1 trillion USD.
Jensen Huang on GPUs - Computerphile
https://www.youtube.com/watch?v=G6R7UOFx1bw
NVIDIA GeForce RTX 5060 and 5060 Ti: launch imminent
#Aprile2025 #Componenti #CUDA #Gaming #GDDR7 #GeForce #GPU #Hardware #Leak #Notizie #Novità #NVIDIA #NVIDIARTX #PC #RTX5060 #RTX5060Ti #Rumors #SchedeGrafiche #SchedeVideo #TechNews #Tecnologia
https://www.ceotech.it/nvidia-geforce-rtx-5060-e-5060-ti-arrivo-imminente/
Is there any difference between computing AI workloads in Vulkan, OpenCL and CUDA?
I know some people say that NVIDIA doesn't support OpenCL or Vulkan particularly well, and that peak performance requires CUDA. But what's the story for the other vendors (Intel, AMD, Qualcomm, Apple)?
I'm wondering right now whether I feel like cramming #CUDA into my head. Well, actually... but...
Just got my RSS reader YOShInOn building with uv and running under WSL2 with the CUDA libraries, despite a slight version mismatch... All I gotta do is switch it from ArangoDB (terrible license) to Postgres, and it might have a future... With sentence_transformers running under WSL2 I might even be able to deduplicate the million images in my Fraxinus image sorter.
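For the dedup idea, a minimal sketch of how embedding-based duplicate detection could look, using plain NumPy on stand-in vectors (the `find_duplicates` helper, the threshold, and the pre-computed embeddings are all assumptions for illustration, not YOShInOn code):

```python
import numpy as np

def find_duplicates(emb: np.ndarray, threshold: float = 0.95) -> list[tuple[int, int]]:
    """Return index pairs whose embeddings exceed a cosine-similarity
    threshold -- candidates for deduplication. This is O(n^2); for a
    million images you'd want an ANN index (e.g. FAISS) instead."""
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = norm @ norm.T                       # pairwise cosine similarity
    dup = np.argwhere(np.triu(sim, k=1) > threshold)  # upper triangle only
    return [tuple(p) for p in dup]
```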
Even now, Thrust as a dependency is one of the main reasons why we have a #CUDA backend, a #HIP / #ROCm backend and a pure #CPU backend in #GPUSPH, but not a #SYCL or #OneAPI backend (which would allow us to extend hardware support to #Intel GPUs). <https://doi.org/10.1002/cpe.8313>
This is also one of the reasons why we implemented our own #BLAS routines when we introduced the semi-implicit integrator. A side effect of this choice is that it allowed us to develop the improved #BiCGSTAB that I've had the opportunity to mention before <https://doi.org/10.1016/j.jcp.2022.111413>. Sometimes I do wonder if it would be appropriate to “excorporate” it into its own library for general use, since it's something that would benefit others. OTOH, this one was developed specifically for GPUSPH and is tightly integrated with the rest of it (including its support for multi-GPU), and refactoring to turn it into a library like cuBLAS is
a. too much effort
b. probably not worth it.
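For context, the unpreconditioned textbook BiCGSTAB iteration (van der Vorst's scheme; a sketch, not the tuned multi-GPU GPUSPH implementation) is quite compact:

```python
import numpy as np

def bicgstab(A, b, tol=1e-8, maxiter=200):
    """Unpreconditioned BiCGSTAB for A @ x = b (van der Vorst, 1992).
    Converges for many non-symmetric systems where plain CG does not."""
    x = np.zeros_like(b)
    r = b - A @ x
    r_hat = r.copy()                 # shadow residual, kept fixed
    rho = alpha = omega = 1.0
    v = p = np.zeros_like(b)
    for _ in range(maxiter):
        rho_new = r_hat @ r
        beta = (rho_new / rho) * (alpha / omega)
        rho = rho_new
        p = r + beta * (p - omega * v)
        v = A @ p
        alpha = rho / (r_hat @ v)
        s = r - alpha * v            # intermediate residual
        t = A @ s
        omega = (t @ s) / (t @ t)    # stabilization step
        x = x + alpha * p + omega * s
        r = s - omega * t
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x
```

The GPU win comes from the fact that every step is a matvec, axpy, or dot product, which is exactly why custom BLAS-like routines pay off.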
Again, following @eniko's original thread, it's really not that hard to roll your own, and probably less time consuming than trying to wrangle your way through an API that may or may not fit your needs.
6/
WgPy: GPU-accelerated NumPy-like array library for web browsers
AMD YOLO: because why not base your entire #business #strategy on a meme? Thanks to AMD's cultural enlightenment, they're now #shipping #boxes faster than philosophical musings on singularity!
Who knew rewriting a stack could be as easy as beating #NVIDIA at their own game? Just don't tell CUDA—it might get jealous!
https://geohot.github.io//blog/jekyll/update/2025/03/08/AMD-YOLO.html #AMD #YOLO #meme #CUDA #competition #HackerNews #ngated
Hot Aisle's 8x AMD #MI300X server is the fastest computer I've ever tested in #FluidX3D #CFD, achieving a peak #LBM performance of 205 GLUPs/s, and a combined VRAM bandwidth of 23 TB/s.
The #RTX 5090 looks like a toy in comparison.
MI300X beats even Nvidia's GH200 94GB. This marks a very fascinating inflection point in #GPGPU: #CUDA is not the performance leader anymore.
You need a cross-vendor language like #OpenCL to leverage its power.
FluidX3D on #GitHub: https://github.com/ProjectPhysX/FluidX3D
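A rough consistency check of those numbers (a sketch; the D3Q19 lattice, FP16 memory compression, and the ~77 bytes per lattice update are my assumptions, not figures from the post):

```python
# Sanity-check 205 GLUPs/s against the 23 TB/s combined VRAM bandwidth.
# Assumed: D3Q19 LBM with FP16 memory compression, i.e. per cell update
# ~19 populations read + 19 written at 2 bytes each, plus a flag byte.
GLUPS = 205e9                   # lattice-cell updates per second
BYTES_PER_LUP = 19 * 2 * 2 + 1  # = 77 bytes, under the assumptions above
needed = GLUPS * BYTES_PER_LUP / 1e12  # TB/s of memory traffic implied
print(f"~{needed:.1f} TB/s of traffic vs 23 TB/s available "
      f"-> ~{100 * needed / 23:.0f}% of peak bandwidth")
```

Under those assumptions the figure is plausible for a bandwidth-bound code: well below the aggregate peak, as expected once inter-GPU communication and non-streaming accesses are accounted for.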
pyATF: Constraint-Based Auto-Tuning in Python
#OpenCL #CUDA #Performance #AutoTuning #Compilers #Python #Package
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
#CUDA #CodeGeneration #LLM #DeepLearning #DL #Python #Package