apl's comments | Hacker News

Neural architecture search (NAS) is a thing! But it's almost exclusively based on meta-gradients. Again, I wouldn't put my money on GAs ever outperforming gradient-based methods.


GAs follow gradients too. It's a different learning approach. All learning follows gradients in one form or another except brute-force search, which isn't feasible for anything with more than about 2^80 states. Evolution is not brute force.


Is the parameter space convex?


Can you combine NAS and GAs?


An AlexNet/ResNet-type moment may be in the cards for GAs, but I wouldn't put any money on it. They're typically only marginally better than brute force. This can be good enough (and is certainly easy to implement), but if you can get a gradient for your problem -- you should use that. And nowadays, you can typically get a gradient!

Most recent advances in the fields you mentioned were driven by gradient-based optimization (e.g., drug design, routing, or chip design: https://www.nature.com/articles/s41586-021-03544-w).

Nature can't run SGD through genomes but has a metric ton of time, so evolution might be near-ideal for sexually reproducing organisms. We typically don't have billions of generations, trillions of instantiations, and complex environments to play with when optimizing functions... It's telling that the fastest-evolving biological system (our brain!) certainly doesn't employ large-scale GA; if anything, it probably approximates gradients via funky distributed rules.

EDIT: The most modern application I can think of was some stuff from OpenAI (https://openai.com/blog/evolution-strategies/). But the point here is one of computational feasibility -- if they could backprop through the same workload, they would.
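
(The core trick in that post is estimating a gradient purely from function evaluations. A toy numpy sketch of the estimator, with made-up hyperparameters and a made-up objective:)

    import numpy as np

    def es_gradient(f, theta, sigma=0.1, n=200, rng=np.random.default_rng(0)):
        # Evolution-strategies-style estimate of grad E[f(theta + sigma*eps)],
        # built from black-box evaluations alone -- no backprop through f.
        eps = rng.standard_normal((n, theta.size))
        returns = np.array([f(theta + sigma * e) for e in eps])
        return (returns[:, None] * eps).mean(axis=0) / sigma

    f = lambda th: -np.sum(th ** 2)             # toy objective, maximum at zero
    theta = np.ones(5)
    for _ in range(300):
        theta += 0.05 * es_gradient(f, theta)   # noisy ascent toward the optimum
    print(np.round(theta, 2))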


If biological evolution wasn't much better than brute force we would not be here. There is no way you could randomly generate a functional genome for anything non-trivial in any reasonable multiple of the age of the universe even if you had a trillion Earths to work with.

But ... GAs are not biological evolution. I think the real issue is that present day GAs only approximate some aspects of biological evolution, but they're very "chunky" in the same way that primitive neural network models are. They get generation and selection but actual biological evolution involves much deeper processes than that. Evolutionary theory is rich and quite fascinating.


Several hints here are severely outdated.

For instance, never train a model in end-to-end FP16. Use mixed precision, either via native TF/PyTorch or as a freebie when using TF32 on A100s. This’ll ensure that only suitable ops are run with lower precision; no need to fiddle with anything. Also, PyTorch DDP in multi-node regimes hasn’t been slower or less efficient than Horovod in ages.
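
For anyone curious what that looks like in practice, here's a minimal PyTorch sketch (toy model and data; just to show where autocast and the GradScaler go):

    import torch, torch.nn as nn

    model = nn.Linear(128, 10).cuda()            # toy model, stand-in for yours
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler()

    for _ in range(10):
        x = torch.randn(64, 128, device="cuda")
        y = torch.randint(0, 10, (64,), device="cuda")
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():          # eligible ops run in FP16, the rest stay FP32
            loss = loss_fn(model(x), y)
        scaler.scale(loss).backward()            # loss scaling guards against FP16 underflow
        scaler.step(optimizer)                   # unscales grads, then steps
        scaler.update()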

Finally, buying a local cluster of TITAN Xs is an outright weird recommendation for massive models. VRAM limitations alone make this a losing proposition.


Hi there - OP here - thanks for reading!

This blog is more of an intro to a few high-level concepts (multi-GPU and multi-node training, FP32 vs FP16, buying hardware and dedicated machines vs AWS/GCP, etc.) for startups that are early in their deep learning journey and might need a nudge in the right direction.

If you're looking for a deep dive into the best GPUs to buy (cost/perf, etc.), the link in the comment below gives a pretty good overview.

PS - I can send you some benchmarks we did that show (at least for us) Horovod is ~10% faster than DDP for multi-node training FWIW. Email is in my profile!


> Finally, buying a local cluster of TITAN Xs is an outright weird recommendation for massive models. VRAM limitations alone make this a losing proposition.

Do you have an alternative recommendation?


You can check out some of the benchmarks here: https://lambdalabs.com/blog/nvidia-rtx-a6000-benchmarks/

It provides some modern, real-life deep learning benchmarks using the mixed precision (TF32) that the GP was referring to.


Hard disagree. V100s are a perfectly valid comparison point. They're usually what's available at scale (on AWS, in private clusters, etc.) because nobody's rolled out enough A100s at this point. If you look at any paper from OpenAI et al. (basically: not Google), you'll see performance numbers for large V100 clusters.


Yes, and you'll see parameters tuned for the V100, not parameters tuned for the M1 somehow limping along on a V100 in emulation mode.

I wouldn't complain about a benchmark running a real-world SOTA model on both the M1 and a V100, but those models will most likely not even run on the M1 due to memory constraints.

So this article is like using an iOS game to evaluate a Mac Pro. You can do it, but it's not really useful.


You can count on one hand the number of GPUs with more memory than the M1's 16 GB.


Isn't the M1's GPU memory shared with everything else? Can the GPU realistically use that much? Won't the OS and base apps use up at least 2-3 GB?


The M1 can only address 8 GB with its NPU/GPU.


You can almost 1:1 translate this by swapping "tf" and "torch". No need to use nn.Conv2d -- there's a functional API for all these layers:

https://pytorch.org/docs/master/nn.functional.html#conv2d

Torch doesn't have "same" padding, so you have to manually calculate the correct padding value for your input/output shapes.
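
Rough sketch of the manual calculation for the common case (stride 1, odd kernel sizes; the helper name is made up):

    import torch
    import torch.nn.functional as F

    def conv2d_same(x, weight, bias=None, dilation=1):
        # "same" output size for stride-1 convs with odd kernels:
        # pad by dilation * (k - 1) // 2 on each side.
        kh, kw = weight.shape[-2:]
        padding = (dilation * (kh - 1) // 2, dilation * (kw - 1) // 2)
        return F.conv2d(x, weight, bias, stride=1, padding=padding, dilation=dilation)

    x = torch.randn(1, 3, 32, 32)
    w = torch.randn(16, 3, 3, 3)
    print(conv2d_same(x, w).shape)  # torch.Size([1, 16, 32, 32])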


That looks great, thanks!


> gradient descent no longer has to be written by hand

Nobody's been writing derivatives by hand for 5+ years. All major frameworks (PyTorch, TensorFlow, MXNet, autograd, Chainer, Theano, etc.) have decent to great automatic differentiation.

The differences and improvements are more subtle (easy parallelization/vectorization, higher-order gradients, good XLA support).
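
i.e., the frameworks give you something like this for free -- compose primitives, call backward, done. A tiny PyTorch sketch:

    import torch

    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    y = (x ** 2).sin().sum()   # arbitrary composition of primitives
    y.backward()               # the framework derives the backward pass
    print(x.grad)              # matches 2 * x * cos(x ** 2), nothing written by hand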


For high-performance CUDA kernels, people still need to write derivatives by hand. I know this because, for my own research and for many production systems, I'd still need to write them myself. Many of my architectures wouldn't have been possible without writing the CUDA myself (Quasi-Recurrent Neural Network[1]) or using optimized, hand-written black boxes (cuDNN RNN). The lack of open, optimized, hand-written CUDA kernels has actually been an impediment to progress in the field.

Automatic differentiation allows for great flexibility and composability, but the performance is still far from good, even with the various JITs available. JAX seems to be one of the most flexible and optimized options for many use cases right now, however.

[1]: https://github.com/salesforce/pytorch-qrnn


Right, you still need to write derivative rules by hand for the primitive operations of an auto-diff system. Automatic differentiation provides composition, it doesn't solve the root mathematical problem of differentiating operations at the lowest level.

So yes, if you need a new primitive to add an efficient CUDA kernel, you will probably have to write its derivative manually too. JAX has a few shortcuts that occasionally make this easier, but fundamentally it has the same challenge as any auto-diff system.
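
Concretely, in PyTorch that looks something like a custom autograd.Function -- a toy sketch with a cube op standing in for whatever the real CUDA kernel would compute:

    import torch

    class Cube(torch.autograd.Function):
        # Stand-in primitive: in practice forward/backward would dispatch to
        # hand-written CUDA kernels, but the derivative rule is still supplied
        # manually either way.
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return x ** 3

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            return grad_out * 3 * x ** 2   # hand-written rule: d/dx x^3 = 3x^2

    x = torch.randn(4, requires_grad=True)
    Cube.apply(x).sum().backward()
    print(torch.allclose(x.grad, 3 * x.detach() ** 2))  # True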


I still strongly disagree. Few of these hand-written CUDA kernels outside of the frameworks are about implementing derivative rules; they're about eliminating CUDA call overheads or avoiding the layered computational/memory inefficiencies that existing ML compilers have trouble handling.

Next to none of the frameworks are yet able to JIT you a performant RNN, even though RNNs only use very standard components[1]. OpenAI got a massive speed and memory-usage boost for attention by implementing what amounts to a few standard primitives together[2].

There are massive gaps in the optimizations that existing ML compilers provide. The landscape is starting to get better, but it's still filled with many potholes.

[1]: https://twitter.com/stanfordnlp/status/1224106217192087552

[2]: https://openai.com/blog/sparse-transformer/
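
For reference, the kind of "standard components" recurrence in question is roughly an LSTM cell -- two matmuls plus a tail of element-wise ops. The matmuls map to cuBLAS; fusing the pointwise tail across timesteps is what compilers still struggle to do as well as cuDNN. A rough sketch, not a performant implementation:

    import torch

    def lstm_cell(x, h, c, w_ih, w_hh, b):
        gates = x @ w_ih.T + h @ w_hh.T + b          # two GEMMs
        i, f, g, o = gates.chunk(4, dim=-1)          # pointwise tail from here on
        c_new = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h_new = torch.sigmoid(o) * torch.tanh(c_new)
        return h_new, c_new

    B, I, H = 32, 64, 128
    x, h, c = torch.randn(B, I), torch.randn(B, H), torch.randn(B, H)
    w_ih, w_hh, b = torch.randn(4 * H, I), torch.randn(4 * H, H), torch.randn(4 * H)
    h, c = lstm_cell(x, h, c, w_ih, w_hh, b)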


It depends on what you define as a primitive. I've had plenty of compositions of existing primitives for which the auto-derived backprop was orders of magnitude slower than a hand-written one. I didn't need to write my own backprop, but I benefited tremendously from it. I don't think my experience is particularly rare.


But is autodiff combined with a black-box JIT a real solution? The JIT either works for your new model or it does not. If it does not, you can do pretty much nothing about it, other than ping the JAX authors or get your hands dirty with JAX internals. Why is no one working on a usable low-level framework where I can implement a QRNN or more complicated stuff without relying on a black-box JIT? JAX could have chosen to be this, but instead it's a fancy solution to a non-problem.


How has your experience with CUDA been? Is it as painful as it appears at first glance? I've done a ton of Python and C, and yet whenever I look at C++ code, it just screams "stay away."

But I have some almost-reasonably-performant PyTorch that I'd rather not just use as a cash-burning machine, so it looks like it might be time to dive into CUDA :-\


The CUDA I've written has never been joyous, but it also hasn't been as horrific as I'd expected. There's a period of hair-pulling, but persistence will get you through it. The majority of CUDA code is closer to C than C++, too, which is helpful. I'll be looking at diving back into CUDA in the near future given the exact speed issues we've been discussing, so feel free to get in touch.


Mainly because it is genuinely exhausting for any medical practitioner. That lots of patients "enjoy" googling symptoms and coming up with far-fetched self-diagnoses is a given. But couple that with the perceived intellectual superiority of (software) engineers and you get a recipe for disaster. It's the equivalent of a doctor leaning over your shoulder while you're coding and telling you to remove random keywords.


Yeah I don't buy this.

Like any field, I think there is a spectrum of quality: some really great doctors who know a lot, some really bad ones, and a lot of mediocre ones.

I've had a doctor (in the Bay Area) tell me that I should smoke a cigarette instead of having coffee if I'm having trouble sleeping but want to keep working on something. Another talked positively about the butter-coffee guy. I think the main reason they don't talk about a lot of options is probably the time constraint, plus the common case being right most of the time. This means that if you're actually not a common case, you're probably better off investing your own time to try to figure things out too.

I like this article, though; I think there's some similarity of style between troubleshooting software and diagnosing disease (just very different things to reason about).


I don't think it's as bad as removing random keywords, but given how limited resources are in medicine, I can see how a provider wouldn't want to explain why every individual's hypothesis is likely incorrect.

On the flip side, misdiagnoses are surprisingly common, and I think it's worthwhile for any provider to take a closer look if a patient has concerns.

If anything, I think this illustrates how much we need to reorganize and improve medicine. It's not like medicine is alone in this respect either (many sectors are inefficient), but when medicine is life-altering and can be life or death, it's pretty high on the list IMO.


I think it's because there's a lot of ambiguity and "best guesses" in medical science, and there's no equivalent of programming documentation or manpages for medical treatment. Writing code is, for all intents and purposes, a pretty objective and repeatable activity. I would liken medical treatment more to penetration testing, because you're trying to get an established logical system to accept new input/logic (medicine/procedures) rather than trying to build a logical system from scratch (or with building blocks).


Everybody feels this way - mechanics feel exactly the same, and so do IT tech support people.

And, you know, it seems like an increasing number of medical professionals just defer to whatever the patient wants anyway. What's the point of saying you're the expert if you won't be the expert?


For this particular problem, Mask R-CNN would have been the way to go -- it spits out instances as opposed to just deciding, for each pixel, to which class it belongs. Or an SSD (if we don't care about the mask at all).
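
e.g., with torchvision's pretrained Mask R-CNN you get per-instance boxes and masks rather than one per-pixel class map (random tensor standing in for a real image here):

    import torch, torchvision

    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True).eval()

    image = torch.rand(3, 480, 640)        # stand-in for an RGB image scaled to [0, 1]
    with torch.no_grad():
        (pred,) = model([image])

    keep = pred["scores"] > 0.5
    print(pred["boxes"][keep].shape)       # one box per detected instance
    print(pred["masks"][keep].shape)       # one soft mask per instance: (N, 1, H, W)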


There are many perspectives on everything. Deep ConvNets, for instance, can be expressed as a continuously evolving ODE. Here's a fantastic paper on this view:

https://papers.nips.cc/paper/7892-neural-ordinary-differenti...
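
The intuition in one line: a residual block x + f(x) is a single explicit Euler step of dx/dt = f(x), so stacking blocks looks like integrating an ODE. A toy, weight-tied sketch:

    import torch, torch.nn as nn

    f = nn.Sequential(nn.Linear(16, 16), nn.Tanh(), nn.Linear(16, 16))

    def resnet_forward(x, depth=10):
        for _ in range(depth):
            x = x + f(x)                   # residual block = Euler step with h = 1
        return x

    def ode_forward(x, t1=10.0, steps=1000):
        h = t1 / steps
        for _ in range(steps):
            x = x + h * f(x)               # same map, finer discretization
        return x

    x = torch.randn(4, 16)
    print(resnet_forward(x).shape, ode_forward(x).shape)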


Their applications are tabular data (for which MLPs have never been the method of choice) and MNIST (which I could classify at 85% with a rusty nail), so it's not super impressive.

NNs and the associated toolkit shine with structured high-dimensional data where CNNs, RNNs, or modern shenanigans like Transformer networks excel. I sincerely doubt that these networks turn out to be reducible to polynomial regression in any practically useful sense of the notion. But who knows.


Terminology note: data like images and voice, which have strong spatial or temporal patterns, are actually referred to as "unstructured" data, while data you get from running "SELECT * FROM some_table" or the carefully designed variables of a clinical trial are referred to as "structured" data.

If this seems backwards to you (as it did to me at first) note that unstructured data can be captured raw from instruments like cameras and microphones, while structured data usually involved a programmer coding exactly what ends up in each variable.

As you say, deep neural networks based on CNNs are SOTA on unstructured image data and RNNs are SOTA on unstructured voice and text data, while tree models like random forests and boosted trees are usually SOTA on problems involving structured data. The reason seems to be that the inductive biases inherent to CNNs and RNNs, such as translation invariance, are a good fit for the natural structure of such data, while the strong ability of trees to find rules is well suited to data where every variable is cleanly and unambiguously coded.


Yeah, that's right. Doing too little proofreading with HN comments...


NNs can approximate any computable function. Any function can also be approximated by a polynomial. This is all proven math and has been known for a long time. I think of these as different facets of the same thing and each has different tools for computing, analyzing and understanding.


A hash table can also approximate any function, so the "universal function approximation" thing is a bit oversold. It isn't really what matters. What matters is how well methods generalize beyond the training data, and how much data they need to do this well.
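
Toy version of the point -- a dict hits zero training error on any function and is still useless off the training set:

    import math, random

    xs = [random.uniform(0, 6) for _ in range(100)]
    table = {x: math.sin(x) for x in xs}           # "trains" to zero error instantly

    print(table[xs[0]] == math.sin(xs[0]))         # True on every training point
    print(table.get(1.2345, "no idea"))            # no generalization whatsoever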

