Working on a cgo-free CUDA binding in Go for ML stuff Week 3 - open source [P]
Our take
The recent exploration of a cgo-free CUDA binding in Go, as highlighted in the article by Eitamr, shines a light on an important intersection of programming languages and machine learning (ML) development. The ongoing evolution of tools and frameworks in this space is crucial for enhancing productivity and efficiency. As developers increasingly turn to languages like Go for their ability to create scalable applications, the limitations posed by cgo in CUDA projects can significantly hinder progress. This challenge is not unique to Go; it resonates with broader themes in software development, much like the discussions around embracing APIs in data science from articles such as Beyond the Model: Why Data Scientists Must Embrace APIs and API Documentation, where the focus is on creating efficient data-driven solutions.
Eitamr’s initiative to develop a proof of concept that loads libcuda.so at runtime using pure Go is a commendable step towards making CUDA more accessible within the Go ecosystem. By tackling the issues that arise from thread affinity and the complexities of multi-threading in CUDA, this project not only seeks to streamline the development process but also aligns with the growing demand for lightweight, efficient solutions in machine learning. As more developers leverage containerization for their applications, the size of Docker images becomes a critical factor, paralleling discussions on optimizing tools and processes, as seen in [PapersWithCode new features - week 1 [P]](/post/paperswithcode-new-features-week-1-p-cmpk32zrg0g8ds0glb7e2j6dd). This attention to detail can significantly enhance the usability of machine learning frameworks and tools.
The ongoing development in Eitamr's project highlights the importance of community-driven contributions to open-source software. By sharing his journey, including the challenges faced and the solutions devised, he encourages collaboration and knowledge sharing within the developer community. This human-centered approach fosters an environment where innovation can thrive, especially in a field as dynamic as machine learning. The insights gained from his experimentation not only benefit his own projects but also provide valuable lessons for others venturing into similar territories. As developers experiment and share their findings, they contribute to a culture of continuous learning and improvement that is essential for the future of technology.
Looking ahead, there are significant implications for the future of programming languages and frameworks in the context of machine learning. As Go continues to grow in popularity due to its simplicity and efficiency, projects like Eitamr's may pave the way for more robust and streamlined CUDA bindings without the overhead of cgo. This could potentially spark a wave of innovation where developers feel empowered to push the boundaries of what is possible in machine learning applications using Go. The question remains: will we see a broader adoption of such solutions that eliminate cgo dependencies, and how will this shift influence the development of machine learning tools in the coming years? The answers could shape the landscape of both programming practices and machine learning capabilities, making it an exciting time to watch this space.
At work we use CUDA from Rust, since the company switched to it recently. Rust has pretty good Driver API bindings, which made me wonder why the hell we can't have something decent in Go without cgo.
I've mostly been building ML tools over the last month, and Go is my main language for pretty much everything.
Problem is, most Go CUDA projects still need cgo and the full toolkit at build time. That breaks cross-compilation and makes Docker images huge, which sucks when working on machine learning projects.
So last month I started messing around with a proof of concept that loads libcuda.so at runtime using purego. No cgo at all.
Biggest pain was thread affinity. The CUDA driver tracks the current context per OS thread, and the Go scheduler can move goroutines between threads, so calls kept breaking. I built a simple executor that locks an OS thread with runtime.LockOSThread and funnels all calls through a channel.
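For anyone curious what that pattern looks like, here is a minimal sketch using only the standard library. It is not the code from the repo: the `Executor` type and its `Do` and `Close` methods are hypothetical names for illustration, and the print statement stands in for a real driver call.

```go
package main

import (
	"fmt"
	"runtime"
)

// Executor (hypothetical name) runs every submitted function on one
// goroutine that is pinned to a single OS thread.
type Executor struct {
	jobs chan func()
	done chan struct{}
}

// NewExecutor starts the worker goroutine and pins it to its OS thread.
func NewExecutor() *Executor {
	e := &Executor{jobs: make(chan func()), done: make(chan struct{})}
	go func() {
		// Pin this goroutine to its current OS thread. We never unlock,
		// so the Go runtime terminates the thread when the goroutine exits.
		runtime.LockOSThread()
		defer close(e.done)
		for job := range e.jobs {
			job()
		}
	}()
	return e
}

// Do runs fn on the pinned thread and blocks until it returns.
func (e *Executor) Do(fn func() error) error {
	errc := make(chan error, 1)
	e.jobs <- func() { errc <- fn() }
	return <-errc
}

// Close stops the worker and waits for it to exit.
func (e *Executor) Close() {
	close(e.jobs)
	<-e.done
}

func main() {
	exec := NewExecutor()
	defer exec.Close()

	// Stand-in for a thread-sensitive call, such as making a context current.
	err := exec.Do(func() error {
		fmt.Println("running on the locked thread")
		return nil
	})
	if err != nil {
		fmt.Println("call failed:", err)
	}
}
```

Because every call goes through the same channel, calls are also serialized, which keeps ordering simple at the cost of one channel round trip per call.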
Here's roughly what using it looks like right now:

    func run() error {
        cuda.Init()
        dev, _ := cuda.GetDevice(0)
        ctx, _ := dev.Primary()
        defer ctx.Close()

        // Device buffers: two inputs and one output.
        a, _ := cuda.Alloc[float32](ctx, 1024)
        b, _ := cuda.Alloc[float32](ctx, 1024)
        c, _ := cuda.Alloc[float32](ctx, 1024)

        stream, _ := ctx.NewStream()
        start, _ := ctx.NewEvent()
        stop, _ := ctx.NewEvent()

        // Bracket the launch with events to time it on the GPU.
        // fn, bg, and cfg are set up elsewhere (not shown here).
        start.Record(stream)
        fn.LaunchOn(bg, stream, cfg,
            cuda.Arg(a),
            cuda.Arg(b),
            cuda.Arg(c),
            cuda.ArgValue(int32(1024)),
        )
        stop.Record(stream)
        stop.Synchronize()

        duration, _ := start.Elapsed(stop)
        fmt.Printf("GPU time: %v\n", duration)
        return nil
    }

On my 4070 Ti, a 10M-element vector add showed about 160 µs on the CPU timer, but the actual GPU event timing was 434 µs. That difference surprised me.
The project is still super early and moves slowly because I only code on weekends and I'm a total noob with CUDA. I'm slowly adding Graphs and multi-GPU support.
This is SO early, so treat it more like a learning-CUDA repo, but I'm having fun learning CUDA. Thought some of you might find it interesting too.
The repo is github.com/eitamring/gocudrv if you wanna take a look.
Would be cool if anyone with 5xxx-series cards wants to try it, wink wink.